Ignoring TF-IDF Scoring in Elasticsearch: Using constant_score

2017-11-15

Ignoring TF/IDF

Sometimes we just don’t care about TF/IDF. All we want to know is that a certain word appears in a field. Perhaps we are searching for a vacation home and we want to find houses that have as many of these features as possible:

WiFi
Garden
Pool

The vacation home documents look something like this:

{ "description": "A delightful four-bedroomed house with ... " }

We could use a simple match query:

GET /_search
{
  "query": {
    "match": {
      "description": "wifi garden pool"
    }
  }
}

However, this isn’t really full-text search. In this case, TF/IDF just gets in the way. We don’t care whether wifi is a common term, or how often it appears in the document. All we care about is that it does appear. In fact, we just want to rank houses by the number of features they have—the more, the better. If a feature is present, it should score 1, and if it isn’t, 0.
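The ranking rule described above can be sketched outside Elasticsearch in a few lines of Python. This only illustrates the desired scoring (1 per feature present, 0 otherwise). The `feature_score` helper, the sample descriptions, and the naive word split are assumptions made for illustration and do not reflect how Elasticsearch analyzes text.

```python
import re

FEATURES = ["wifi", "garden", "pool"]

def feature_score(description, features=FEATURES):
    """Return how many of the wanted features occur in the description.

    Each feature contributes 1 if present and 0 if absent, no matter how
    often the word appears or how common it is across documents.
    """
    # Naive tokenizer (assumption): lowercase, split on non-alphanumerics.
    tokens = set(re.findall(r"[a-z0-9]+", description.lower()))
    return sum(1 for feature in features if feature in tokens)

# Hypothetical sample documents, for illustration only.
homes = [
    {"description": "Cosy cottage with WiFi"},
    {"description": "Villa with pool, garden and WiFi. The pool is heated."},
    {"description": "Flat near the station"},
]

# More matching features first.
ranked = sorted(homes, key=lambda h: feature_score(h["description"]), reverse=True)
```

Note that the second sample mentions the pool twice but still gets only one point for it, which is exactly the behaviour TF/IDF would not give us.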

constant_score Query

Enter the constant_score query. This query can wrap either a query or a filter, and assigns a score of 1 to any documents that match, regardless of TF/IDF:

GET /_search
{
  "query": {
    "bool": {
      "should": [
        { "constant_score": {
          "query": { "match": { "description": "wifi" }}
        }},
        { "constant_score": {
          "query": { "match": { "description": "garden" }}
        }},
        { "constant_score": {
          "query": { "match": { "description": "pool" }}
        }}
      ]
    }
  }
}
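When the list of features comes from user input, the request body above could be generated rather than written by hand. The following is a minimal sketch that only builds the JSON body and does not talk to a cluster. The helper names `constant_score_clause` and `feature_query` are hypothetical. The `wrapper_key` argument defaults to `query` to mirror the request above; newer Elasticsearch versions document `filter` as the wrapped parameter, so check the reference for your version.

```python
import json

def constant_score_clause(field, term, wrapper_key="query"):
    """Wrap a match on `field` for `term` in a constant_score clause."""
    return {"constant_score": {wrapper_key: {"match": {field: term}}}}

def feature_query(field, terms, wrapper_key="query"):
    """Build a bool/should body with one constant_score clause per term."""
    clauses = [constant_score_clause(field, t, wrapper_key) for t in terms]
    return {"query": {"bool": {"should": clauses}}}

body = feature_query("description", ["wifi", "garden", "pool"])

# Serialized form, suitable as the body of GET /_search.
payload = json.dumps(body, indent=2)
```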

Perhaps not all features are equally important—some have more value to the user than others. If the most important feature is the pool, we could boost that clause to make it count for more:

GET /_search
{
  "query": {
    "bool": {
      "should": [
        { "constant_score": {
          "query": { "match": { "description": "wifi" }}
        }},
        { "constant_score": {
          "query": { "match": { "description": "garden" }}
        }},
        { "constant_score": {
          "boost": 2,
          "query": { "match": { "description": "pool" }}
        }}
      ]
    }
  }
}

A matching pool clause would add a score of 2, while the other clauses would add a score of only 1 each.

The final score for each result is not simply the sum of the scores of all matching clauses. The coordination factor and query normalization factor are still taken into account.
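To make the interaction between the boosts and the coordination factor concrete, here is a rough sketch of the arithmetic. It assumes the classic TF/IDF similarity, where the coordination factor is the fraction of query clauses that matched. The query normalization factor is left out because it is the same for every document in a given query and so does not change the ordering. The function names and the feature sets are hypothetical.

```python
CLAUSES = ["wifi", "garden", "pool"]
BOOSTS = {"pool": 2.0}  # clauses not listed here keep the default boost of 1

def raw_score(matched):
    """Sum of the constant scores (boosts) of the matching clauses."""
    return sum(BOOSTS.get(feature, 1.0) for feature in matched)

def coord(matched):
    """Coordination factor: matching clauses divided by total clauses."""
    return len(matched) / len(CLAUSES)

def sketch_score(doc_features):
    """Raw sum scaled by the coordination factor (query norm omitted)."""
    matched = [f for f in CLAUSES if f in doc_features]
    return raw_score(matched) * coord(matched)

only_pool = sketch_score({"pool"})
wifi_and_garden = sketch_score({"wifi", "garden"})
everything = sketch_score({"wifi", "garden", "pool"})
```

Under these assumptions a house with only a pool and a house with WiFi and a garden both have a raw sum of 2, but the second one ranks higher because two of the three clauses matched instead of one.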

We could improve our vacation home documents by adding a not_analyzed features field:

{ "features": [ "wifi", "pool", "garden" ] }

What is the benefit of rewriting the documents this way? Does it save index space?
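One way such a field could be queried, sketched under assumptions: because a not_analyzed field stores each value as a single exact term, every feature can be checked with a `term` clause wrapped in constant_score, using the same bool/should shape as before. The helper name `features_query` and its `boosts` argument are hypothetical, and the field values are assumed to be stored in lowercase as shown in the sample document.

```python
def features_query(wanted, boosts=None, field="features"):
    """Build a bool/should body of constant_score term filters.

    `boosts` optionally maps a feature name to its boost; features that
    are not listed get no explicit boost.
    """
    boosts = boosts or {}
    clauses = []
    for feature in wanted:
        clause = {"filter": {"term": {field: feature}}}
        if feature in boosts:
            clause["boost"] = boosts[feature]
        clauses.append({"constant_score": clause})
    return {"query": {"bool": {"should": clauses}}}

body = features_query(["wifi", "garden", "pool"], boosts={"pool": 2})
```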

Reference: https://www.elastic.co/guide/en/elasticsearch/guide/current/ignoring-tfidf.html#ignoring-tfidf

This article is reposted from 张昺华-sky's cnblogs blog. Original link: http://www.cnblogs.com/bonelee/p/6475950.html. To republish it, please contact the original author.
