Elasticsearch: The Definitive Guide

Getting Started

Of course, Elasticsearch is not just Lucene. Besides full-text search, it can also do the following:

  • Act as a distributed, real-time document store in which every field is indexed and searchable.
  • Serve as a distributed search engine with real-time analytics.
  • Scale out to hundreds of servers and handle petabytes of structured and unstructured data.

In Elasticsearch, a document belongs to a type, and types live inside an index. You can draw a rough analogy to a traditional relational database:

Relational DB  ⇒ Databases ⇒ Tables ⇒ Rows      ⇒ Columns
Elasticsearch  ⇒ Indices   ⇒ Types  ⇒ Documents ⇒ Fields

Searching with a query string:

GET /megacorp/employee/_search?q=last_name:Smith

Search

Search Lite

+ marks a condition that must be satisfied, - marks a condition that must not match, and a condition with neither + nor - is optional.
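
For example, a search-lite request combining these markers might look like the sketch below (the extra field is illustrative; in a real URL the +, space, and : characters need to be percent-encoded):

GET /megacorp/employee/_search?q=+last_name:Smith -age:32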

The _all Field

When we index a document, Elasticsearch concatenates the values of all of its fields into one big string and indexes it as the special _all field. When the _all field is of no use to you, it can be disabled.
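
A minimal mapping sketch that disables _all for a type (the index and type names are placeholders):

PUT /my_index/_mapping/my_type
{
  "my_type": {
    "_all": { "enabled": false }
  }
}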

Inside a Shard

https://www.elastic.co/guide/en/elasticsearch/guide/current/making-text-searchable.html

Making Text Searchable

The inverted index may hold a lot more information than the list of documents that contain a particular term. It may store a count of the number of documents that contain each term, the number of times a term appears in a particular document, the order of terms in each document, the length of each document, the average length of all documents, and more. These statistics allow Elasticsearch to determine which terms are more important than others, and which documents are more important than others.

Immutability

Writing a single large inverted index allows the data to be compressed, reducing costly disk I/O and the amount of RAM needed to cache the index. Why does writing one large file allow the data to be compressed? How is it read back after compression?

An immutable index has its downsides too, precisely because it is immutable. If you want to make a new document searchable, you have to rebuild the entire index. This clearly limits either how much data an index can hold or how frequently an index can be updated.

Dynamically Updatable Indices

How can an inverted index be made updatable without losing the benefits of immutability? The answer turned out to be: use more than one index. Instead of rewriting the whole inverted index, add new supplementary indices to reflect more-recent changes. Each inverted index can be queried in turn—starting with the oldest—and the results combined.

Lucene, the Java library on which Elasticsearch is based, introduced the concept of per-segment search. A segment is an inverted index in its own right, but now the word index in Lucene came to mean a collection of segments plus a commit point—a file that lists all known segments (together, these make up a single shard in Elasticsearch), as depicted in Figure 16, “A Lucene index with a commit point and three segments”. New documents are first added to an in-memory indexing buffer, as shown in Figure 17, “A Lucene index with new documents in the in-memory buffer, ready to commit”, before being written to an on-disk segment, as in Figure 18, “After a commit, a new segment is added to the commit point and the buffer is cleared”.

Deletes and Updates

Segments are immutable, so documents cannot be removed from older segments, nor can older segments be updated to reflect a newer version of a document. Instead, every commit point includes a .del file that lists which documents in which segments have been deleted.

A document that has been marked as deleted can still match a query, but it is removed from the results list before the final query results are returned.

Making Changes Persistent

Elasticsearch added a translog, or transaction log, which records every operation in Elasticsearch as it happens.

Every so often—such as when the translog is getting too big—the index is flushed; a new translog is created, and a full commit is performed

The translog provides a persistent record of all operations that have not yet been flushed to disk. When starting up, Elasticsearch will use the last commit point to recover known segments from disk, and will then replay all operations in the translog to add the changes that happened after the last commit.

The translog is also used to provide real-time CRUD. When you try to retrieve, update, or delete a document by ID, it first checks the translog for any recent changes before trying to retrieve the document from the relevant segment. This means that it always has access to the latest known version of the document, in real time. (Can a document already written to a segment in the filesystem cache serve real-time CRUD without going through the translog, so that only documents still in the in-memory buffer need it?)

That said, it is beneficial to flush your indices before restarting a node or closing an index. When Elasticsearch tries to recover or reopen an index, it has to replay all of the operations in the translog, so the shorter the log, the faster the recovery.
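
A flush can also be triggered manually with the flush API; for example (the index name is a placeholder):

POST /blogs/_flush

POST /_flush?wait_for_ongoing

The second form flushes all indices, waiting until any in-flight flushes have completed before returning.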

How Safe Is the Translog?

The purpose of the translog is to ensure that operations are not lost. This begs the question: how safe is the translog?

Writes to a file will not survive a reboot until the file has been fsync’ed to disk. By default, the translog is fsync’ed every 5 seconds. Potentially, we could lose 5 seconds worth of data—if the translog were the only mechanism that we had for dealing with failure.
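
The fsync interval and durability behaviour are configurable per index; a hedged sketch of the relevant settings (the values are illustrative, and the defaults differ between versions):

PUT /my_index/_settings
{
  "index.translog.durability": "async",
  "index.translog.sync_interval": "5s"
}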

Fortunately, the translog is only part of a much bigger system. Remember that an indexing request is considered successful only after it has completed on both the primary shard and all replica shards. Even if the node holding the primary shard were to suffer catastrophic failure, it would be unlikely to affect the nodes holding the replica shards at the same time.

Controlling Relevance

Full-text relevance formulae, or similarity algorithms, combine several factors to produce a single relevance _score for each document. In this chapter, we examine the various moving parts and discuss how they can be controlled.

Of course, relevance is not just about full-text queries; it may need to take structured data into account as well. Perhaps we are looking for a vacation home with particular features (air-conditioning, sea view, free WiFi). The more features that a property has, the more relevant it is. Or perhaps we want to factor in sliding scales like recency, price, popularity, or distance, while still taking the relevance of a full-text query into account.

We will start by looking at the theoretical side of how Lucene calculates relevance, and then move on to practical examples of how you can control the process.

Theory Behind Relevance Scoring

Lucene (and thus Elasticsearch) uses the Boolean model to find matching documents, and a formula called the practical scoring function to calculate relevance. This formula borrows concepts from term frequency/inverse document frequency and the vector space model, but adds more-modern features like a coordination factor, field-length normalization, and term or query clause boosting.

TODO: add links for the three concepts above.

Query-Time Boosting

You could use the boost parameter at search time to give one query clause more importance than another. For instance:

GET /_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "title": {
              "query": "quick brown fox",
              "boost": 2
            }
          }
        },
        {
          "match": {
            "content": "quick brown fox"
          }
        }
      ]
    }
  }
}

Boosting an Index

When searching across multiple indices, you can boost an entire index over the others with the indices_boost parameter. This could be used, as in the next example, to give more weight to documents from a more recent index:

GET /docs_2014_*/_search
{
  "indices_boost": {
    "docs_2014_10": 3,
    "docs_2014_09": 2
  },
  "query": {
    "match": {
      "text": "quick brown fox"
    }
  }
}

constant_score Query

Enter the constant_score query. This query can wrap either a query or a filter, and assigns a score of 1 to any documents that match, regardless of TF/IDF.

Perhaps not all features are equally important—some have more value to the user than others. If the most important feature is the pool, we could boost that clause to make it count for more.

GET /_search
{
  "query": {
    "bool": {
      "should": [
        { "constant_score": {
          "query": { "match": { "description": "wifi" }}
        }},
        { "constant_score": {
          "query": { "match": { "description": "garden" }}
        }},
        { "constant_score": {
          "boost":   2
          "query": { "match": { "description": "pool" }}
        }}
      ]
    }
  }
}

function_score Query

The function_score query is the ultimate tool for taking control of the scoring process. It allows you to apply a function to each document that matches the main query in order to alter or completely replace the original query _score.

In fact, you can apply different functions to subsets of the main result set by using filters, which gives you the best of both worlds: efficient scoring with cacheable filters.

It supports several predefined functions out of the box:

weight
Apply a simple boost to each document without the boost being normalized: a weight of 2 results in 2 * _score.
field_value_factor
Use the value of a field in the document to alter the _score, such as factoring in a popularity count or number of votes.
random_score
Use consistently random scoring to sort results differently for every user, while maintaining the same sort order for a single user.
Decay functions—linear, exp, gauss
Incorporate sliding-scale values like publish_date, geo_location, or price into the _score to prefer recently published documents, documents near a latitude/longitude (lat/lon) point, or documents near a specified price point.
script_score
Use a custom script to take complete control of the scoring logic. If your needs extend beyond those of the functions in this list, write a custom script to implement the logic that you need.

Without the function_score query, we would not be able to combine the score from a full-text query with a factor like recency. We would have to sort either by _score or by date; the effect of one would obliterate the effect of the other. This query allows you to blend the two together: to still sort by full-text relevance, but giving extra weight to recently published documents, or popular documents, or products that are near the user’s price point. As you can imagine, a query that supports all of this can look fairly complex. We’ll start with a simple use case and work our way up the complexity ladder.
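
Pulling the filter and weight pieces together, a hedged sketch that scores documents by filtered functions with weights (fields, values, and weights are illustrative, borrowing the vacation-home features from earlier):

GET /_search
{
  "query": {
    "function_score": {
      "functions": [
        { "filter": { "term": { "features": "wifi" }}, "weight": 1 },
        { "filter": { "term": { "features": "pool" }}, "weight": 2 }
      ],
      "score_mode": "sum"
    }
  }
}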

Combining script_score and field_value_factor

The default scripting language in Elasticsearch is Groovy. The functions below combine a field_value_factor on n_comments with a script_score that rewards documents whose n_useful count exceeds their n_useless count:

"functions": [
  {
    "field_value_factor": {
      "field": "n_comments",
      "modifier": "ln2p"
    }
  },
  {
    "script_score": {
      "script": "n_useful = doc['n_useful'].value; n_useless = doc['n_useless'].value;\nif(n_useful > n_useless) return log(E+n_useful-n_useless);\nreturn 1/log(E+n_useless-n_useful);"
    }
  }
]

Boosting by Popularity

We would like more-popular posts to appear higher in the results list, but still have the full-text score as the main relevance driver.

At search time, we can use the function_score query with the field_value_factor function to combine the number of votes with the full-text relevance score:

GET /blogposts/post/_search
{
  "query": {
    "function_score": {
      "query": {
        "multi_match": {
          "query":    "popularity",
          "fields": [ "title", "content" ]
        }
      },
      "field_value_factor": {
        "field": "votes"
      }
    }
  }
}
  1. The function_score query wraps the main query and the function we would like to apply.
  2. The main query is executed first.
  3. The field_value_factor function is applied to every document matching the main query.
  4. Every document must have a number in the votes field for the function_score to work.

In the preceding example, the final _score for each document has been altered as follows:

new_score = old_score * number_of_votes

This will not give us great results. The full-text _score range usually falls somewhere between 0 and 10. A blog post with 10 votes will completely swamp the effect of the full-text score, and a blog post with 0 votes will reset the score to zero.

modifier

A better way to incorporate popularity is to smooth out the votes value with some modifier. In other words, we want the first few votes to count a lot, but for each subsequent vote to count less. The difference between 0 votes and 1 vote should be much bigger than the difference between 10 votes and 11 votes.

A typical modifier for this use case is log1p, which changes the formula to the following:

new_score = old_score * log(1 + number_of_votes)

Set the modifier to log1p:

"field_value_factor": {
  "field":    "votes",
  "modifier": "log1p"
}

factor

The strength of the popularity effect can be increased or decreased by multiplying the value in the votes field by some number, called the factor.

Adding in a factor changes the formula to this:

new_score = old_score * log(1 + factor * number_of_votes)
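
For example, a field_value_factor clause with an explicit factor (the value 2 is illustrative):

"field_value_factor": {
  "field":    "votes",
  "modifier": "log1p",
  "factor":   2
}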

boost_mode

Perhaps multiplying the full-text score by the result of the field_value_factor function still has too large an effect. We can control how the result of a function is combined with the _score from the query by using the boost_mode parameter, which accepts the following values: multiply / sum / min / max / replace.

The default is multiply.

Weight

The weight score allows you to multiply the score by the provided weight. This can sometimes be desired, since the boost value set on specific queries gets normalized, while for this score function it does not. The number value is of type float.

"weight" : number

max_boost

Finally, we can cap the maximum effect that the function can have by using the max_boost parameter:

"function_score": {
  "query": {
    ...
  },
  "field_value_factor": {
    "field":    "votes",
    "modifier": "log1p",
    "factor":   0.1
  },
  "boost_mode": "sum",
  "max_boost":  1.5
}

Query DSL

Boosting Query

The boosting query can be used to effectively demote results that match a given query. Unlike the “NOT” clause in bool query, this still selects documents that contain undesirable terms, but reduces their overall score.
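
A minimal sketch of a boosting query (the field and terms are illustrative):

GET /_search
{
  "query": {
    "boosting": {
      "positive": { "match": { "text": "apple" }},
      "negative": { "match": { "text": "pie tart" }},
      "negative_boost": 0.5
    }
  }
}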

The negative clause here is applied when the queries are combined, not as a separate scoring step, so it apparently cannot satisfy the requirement of scaling the penalty by the value of n_useless.

A requirement like that can instead be handled in function_score by attaching a weight to the specific function, as sketched below.
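
A hedged sketch of that idea: assuming an n_useless field as in the notes above, a filtered function with a weight below 1 demotes documents where n_useless is large (the field, threshold, and weight are illustrative):

GET /_search
{
  "query": {
    "function_score": {
      "query": { "match": { "response": "some text" }},
      "functions": [
        {
          "filter": { "range": { "n_useless": { "gt": 10 }}},
          "weight": 0.5
        }
      ],
      "boost_mode": "multiply"
    }
  }
}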

Segment Merging

This is the moment when those old deleted documents are purged from the filesystem. Deleted documents (or old versions of updated documents) are not copied over to the new bigger segment.
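
Merging down to a single segment can also be forced manually; a sketch using the optimize API of this era of Elasticsearch (the index name is a placeholder, and this is expensive on an actively indexed index):

POST /logstash-2014-10/_optimize?max_num_segments=1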

Query Demo

cardinality

A single-value metrics aggregation that calculates an approximate count of distinct values.

{
  "query": {
    "filtered": {
      "query": {
        "query_string": {
          "query": "_type:chat_log",
          "analyze_wildcard": true
        }
      }
    }
  },
  "size": 0,
  "aggs": {
    "count_per_month": {
    "date_histogram": {
      "field": "timestamp",
        "interval": "day"
      },
      "aggs": {
        "uuid_count": {
            "cardinality": {
            "field": "uuid"
          }
        }
      }
    }
  }
}

Daily Conversation Depth - Average

{
  "query": {
    "filtered": {
      "query": {
        "query_string": {
          "query": "_type:chat_log",
          "analyze_wildcard": true
        }
      }
    }
  },
  "size": 0,
  "aggs": {
    "daily_stats": {
      "date_histogram": {
        "field": "timestamp",
        "interval": "day"
      },
      "aggs": {
        "uuid_terms": {
          "terms": {
            "field": "uuid",
            "size": 10000,
          }
        },
        "daily_depth": {
          "avg_bucket": {
            "buckets_path": "uuid_terms>_count"
          }
        }
      }
    }
  }
}

Alternatively, use percentiles_bucket to get a more accurate picture of the distribution, as sketched below.
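
A hedged sketch of that variant, replacing the avg_bucket sibling aggregation above (the percents chosen are illustrative):

"daily_depth_percentiles": {
  "percentiles_bucket": {
    "buckets_path": "uuid_terms>_count",
    "percents": [ 50, 90, 99 ]
  }
}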

Filtering the fields returned in the response:

http://search-es.svc:8080/dae_bender/chat_log/_search?filter_path=aggregations.daily_stats.buckets.daily_depth,aggregations.daily_stats.buckets.doc_count,aggregations.daily_stats.buckets.key_as_string

Delete by query

import time
import json
import requests
from datetime import date, timedelta

# Find up to 3000 documents of type movie_review whose timestamp is newer
# than 15 days ago; constant_score + filter skips scoring entirely.
query = '''
{
    "size": 3000,
    "query": {
        "constant_score": {
            "filter": {
                "bool": {
                    "must": {
                        "range": {"timestamp": {"gt": "%sT00:00:00+08"}}
                    },
                    "filter": {
                        "term": {
                            "_type": "movie_review"
                        }
                    }
                }
            }
        }
    }
}
''' % (date.today() - timedelta(days=15))

print(query)

while True:
    # Fetch the next batch of matching documents.
    result = requests.get('http://es-host/bender/_search', data=query).json()
    print(result['hits']['total'], 'remain')
    hits = result['hits']['hits']
    if not hits:
        break
    # Build a bulk request body: one newline-terminated delete action per hit.
    body = []
    for hit in hits:
        body.append(json.dumps({'delete': {'_index': 'bender', '_type': 'movie_review', '_id': hit['_id']}}))
        body.append('\n')
    requests.post('http://search-es.svc:8080/_bulk', data=''.join(body))
    time.sleep(0.5)

Put mapping

PUT http://es-host/bender/_mapping/chat_log

{
    "chat_log": {
        "properties": {
            "timestamp": {
                "type": "date",
                "format": "strict_date_optional_time||epoch_millis"
            },
            "uuid": {
                "type": "string",
                "index": "not_analyzed"
            },
            "request": {
                "type": "string"
            },
            "response": {
                "type": "string"
            }
        }
    }
}

Significant Terms Aggregation

{
  "size": 0,
  "query": {
    "match_all": {}
  },
  "aggregations": {
    "significantText": {
      "significant_terms": {
        "field": "text",
        "size": 100,
        "min_doc_count": 20
      }
    }
  }
}

Many of the returned terms are single characters that ought to be treated as stopwords.
TODO: Can a minimum token length be configured? One possible workaround is sketched below.
TOOD 是否可以设置 token 的最短长度?