Elasticsearch Posts

Efficient Chinese Search with Elasticsearch


Mandarin is the official language of China and the most spoken language worldwide.

Chinese characters are logograms: each represents a word or a morpheme (the smallest meaningful unit of language). Put together, their meaning can change, forming a whole new word. Another difficulty is that there are no spaces between words or sentences, making it very hard for a computer to know where a word starts or ends.

Logograms are characters that represent words or morphemes (the smallest semantic units of a language). They are informally called ideograms or pictographs, but these are actually different things. Source: Wikipedia, 语素文字 (Logogram)

Let’s see an example: the word “volcano” (火山) is in fact the combination of:

  • 火: fire
  • 山: mountain

Our tokenizer must be clever enough to avoid separating those two logograms, because the meaning is changed when they are not together.

At the time of this writing, here are the solutions available with Elasticsearch:

  • the default Chinese analyzer, based on deprecated classes from Lucene 4;
  • the paoding plugin, sadly not maintained but based on very good dictionaries;
  • the cjk analyzer that makes bi-grams of your contents;
  • the smart chinese analyzer, distributed as an officially supported plugin;
  • and finally the ICU plugin and its tokenizer.

Bigrams (双字母组, also called digrams) are very widely used for statistical text analysis; a bigram consists of two letters, two syllables, or two words. They are used in one of the most successful language models for speech recognition, and are a special case of N-grams. Source: Wikipedia, 双字母组 (Bigram)

These analyzers are very different and we will compare how well they perform with a simple test word: 手机. It means "cell phone" and is composed of two logograms, which mean "hand" and "machine" respectively. The 机 logogram also appears in many other words:

  • 机票: plane ticket
  • 机器人: robot
  • 机枪: machine gun
  • 机遇: opportunity

Our tokenization must not split those logograms, because if I search for "cell phone", I do not want any documents about Rambo owning a machine gun and looking bad-ass.

We are going to test our solutions with the great _analyze API:

curl -XGET 'http://localhost:9200/chinese_test/_analyze?analyzer=paoding_analyzer1' -d '手机'

The default Chinese analyzer

Already available on your Elasticsearch instance, this analyzer uses the ChineseTokenizer class of Lucene, which only separates all logograms into tokens. So we are getting two tokens: 手 and 机.

The Elasticsearch standard analyzer produces the exact same output. For this reason, chinese is deprecated and will soon be replaced by standard, and you should avoid it.

The paoding plugin

Paoding is almost an industry standard and is known as an elegant solution. Sadly, the plugin for Elasticsearch is unmaintained and I only managed to make it work on version 1.0.1, after some modifications. Here is how to install it manually:

After this clumsy installation process (to be done on all your nodes), we now have a new paoding tokenizer and two collectors: max_word_len and most_word. No analyzer is exposed by default so we have to declare a new one:
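A declaration along these lines should work (a sketch: the paoding_analyzer1 name matches the _analyze call above, most_word is one of the two collectors mentioned, and the exact option names may vary with the plugin version):

        {
            "settings": {
                "analysis": {
                    "tokenizer": {
                        "paoding1": {
                            "type": "paoding",
                            "collector": "most_word"
                        }
                    },
                    "analyzer": {
                        "paoding_analyzer1": {
                            "type": "custom",
                            "tokenizer": "paoding1"
                        }
                    }
                }
            }
        }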

The cjk analyzer

A very straightforward analyzer, it only transforms any text into bi-grams: "Batman" becomes a list of meaningless tokens: Ba, at, tm, ma, an. For Asian languages, this tokenizer is a good and very simple solution, at the price of a bigger index and sometimes not perfectly relevant results.

In our case of a two-logogram word, only 手机 is indexed, which looks good; but with a longer word like 元宵节 (Lantern Festival), two tokens are generated: 元宵 and 宵节, meaning respectively Lantern and Xiao Festival.
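Since cjk ships with Elasticsearch, using it is just a matter of setting it on a field mapping (a minimal sketch; the article type and content field names are hypothetical):

        {
            "mappings": {
                "article": {
                    "properties": {
                        "content": {
                            "type": "string",
                            "analyzer": "cjk"
                        }
                    }
                }
            }
        }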

The smart chinese plugin

Very easy to install, thanks to the team at Elasticsearch maintaining it:

bin/plugin -install elasticsearch/elasticsearch-analysis-smartcn/2.3.0

It exposes a new smartcn analyzer, as well as the smartcn_tokenizer tokenizer, using the SmartChineseAnalyzer from Lucene.

It uses a probabilistic approach to find an optimal segmentation of words, based on a Hidden Markov model and a large number of training texts. A training dictionary is therefore already embedded, and it is quite good on common text – our example is properly tokenized.
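If you want to combine the tokenizer with extra token filters rather than use the ready-made analyzer, you can declare a custom analyzer on top of smartcn_tokenizer (a sketch; the chinese_smart name and the lowercase filter are my own choices):

        {
            "settings": {
                "analysis": {
                    "analyzer": {
                        "chinese_smart": {
                            "type": "custom",
                            "tokenizer": "smartcn_tokenizer",
                            "filter": ["lowercase"]
                        }
                    }
                }
            }
        }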

TODO Hidden Markov model

TODO For a specific domain, can we train our own model?

The ICU plugin

Another official plugin, adding support for the "International Components for Unicode" libraries to Elasticsearch.

bin/plugin -install elasticsearch/elasticsearch-analysis-icu/2.4.1

This plugin is also recommended if you deal with any language other than English; I use it all the time for French content!

It exposes an icu_tokenizer tokenizer that we will use, as well as a lot of great analysis tools like icu_normalizer, icu_folding, icu_collation, etc.
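For example, a custom analyzer can combine the icu_tokenizer with the icu_folding filter (a sketch; the chinese_icu name is hypothetical):

        {
            "settings": {
                "analysis": {
                    "analyzer": {
                        "chinese_icu": {
                            "type": "custom",
                            "tokenizer": "icu_tokenizer",
                            "filter": ["icu_folding"]
                        }
                    }
                }
            }
        }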

It works with a dictionary for Chinese and Japanese texts, containing information about word frequency to deduce logogram groups. On 手机, everything works as expected, but on 元宵节, two tokens are produced: 元宵 and 节 – that's because lantern and festival are more common than Lantern Festival.

Going further with Chinese?

There is no perfect one-size-fits-all solution for analysis with Elasticsearch, regardless of the content you deal with, and that's true for Chinese as well. You have to compose and build your own analyzer with the information you get. For example, I'm going with cjk and smartcn tokenization on my search fields, using multi-fields and the multi-match query.
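As a sketch of that setup (the article type and content field names are hypothetical), the main field is analyzed with cjk while a sub-field re-indexes the same content with smartcn:

        {
            "mappings": {
                "article": {
                    "properties": {
                        "content": {
                            "type": "string",
                            "analyzer": "cjk",
                            "fields": {
                                "smartcn": {
                                    "type": "string",
                                    "analyzer": "smartcn"
                                }
                            }
                        }
                    }
                }
            }
        }

A multi_match query can then search both analyses at once:

        {
            "query": {
                "multi_match": {
                    "query": "手机",
                    "fields": ["content", "content.smartcn"]
                }
            }
        }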

To learn more about Chinese, I recommend Chineasy, which is a great way to pick up some basic reading skills! Learning such a rich language is not easy, and you should also read this article before going for it, just so you know what you're getting into! 快乐编码! (Happy coding!)

From my point of view, paoding and smartcn get the best results. The chinese tokenizer is very bad and the icu_tokenizer is a bit disappointing on 元宵节, but handles traditional Chinese very well.

Using Elasticsearch to get the top common words

Multi-set analysis

A simpler way to perform analysis across multiple categories is to use a parent-level aggregation to segment the data ready for analysis.

        "aggregations": {
                "forces": {
                        "terms": {"field": "force"},
                        "aggregations": {
                                "significantCrimeTypes": {
                                        "significant_terms": {"field": "crime_type"}