ES: Ignoring TF-IDF Scoring with constant_score

Introduction:

Ignoring TF/IDF

Sometimes we just don’t care about TF/IDF. All we want to know is that a certain word appears in a field. Perhaps we are searching for a vacation home and we want to find houses that have as many of these features as possible:

  • WiFi
  • Garden
  • Pool

The vacation home documents look something like this:

{ "description": "A delightful four-bedroomed house with ... " }

We could use a simple match query:

GET /_search
{
  "query": {
    "match": {
      "description": "wifi garden pool"
    }
  }
}

However, this isn’t really full-text search. In this case, TF/IDF just gets in the way. We don’t care whether wifi is a common term, or how often it appears in the document. All we care about is that it does appear. In fact, we just want to rank houses by the number of features they have—the more, the better. If a feature is present, it should score 1, and if it isn’t, 0.
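The ranking described here can be sketched outside Elasticsearch. The following is a minimal Python illustration, not how Elasticsearch computes scores. The function name `feature_score`, the sample descriptions, and the naive lowercase whitespace tokenization are all assumptions made for the example.

```python
def feature_score(description, features):
    """Score 1 for each feature that appears in the description, 0 otherwise."""
    # Naive stand-in for text analysis: lowercase and split on whitespace.
    tokens = set(description.lower().split())
    return sum(1 for feature in features if feature in tokens)


features = ["wifi", "garden", "pool"]

# Hypothetical documents, keyed by a made-up identifier.
homes = {
    "a": "Cottage with wifi wifi wifi everywhere",
    "b": "House with garden and pool",
    "c": "Villa with pool garden and wifi",
}

# Rank homes by how many distinct features they mention, best first.
ranked = sorted(
    homes,
    key=lambda name: feature_score(homes[name], features),
    reverse=True,
)
```

Repeating a word, as home "a" does with wifi, earns nothing extra: only presence counts, which is exactly the behavior TF/IDF does not give us.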

constant_score Query

Enter the constant_score query. This query can wrap either a query or a filter, and assigns a score of 1 to any document that matches, regardless of TF/IDF:

GET /_search
{
  "query": {
    "bool": {
      "should": [
        { "constant_score": {
          "query": { "match": { "description": "wifi" }}
        }},
        { "constant_score": {
          "query": { "match": { "description": "garden" }}
        }},
        { "constant_score": {
          "query": { "match": { "description": "pool" }}
        }}
      ]
    }
  }
}

Perhaps not all features are equally important—some have more value to the user than others. If the most important feature is the pool, we could boost that clause to make it count for more:

GET /_search
{
  "query": {
    "bool": {
      "should": [
        { "constant_score": {
          "query": { "match": { "description": "wifi" }}
        }},
        { "constant_score": {
          "query": { "match": { "description": "garden" }}
        }},
        { "constant_score": {
          "boost": 2,
          "query": { "match": { "description": "pool" }}
        }}
      ]
    }
  }
}

A matching pool clause would add a score of 2, while the other clauses would add a score of only 1 each.
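When the feature list changes often, writing these clauses by hand gets tedious. Here is a small Python sketch that assembles the same request body as a plain dictionary. The helper name `build_feature_query` and its parameters are hypothetical, and it mirrors the `query` key used in this article; more recent Elasticsearch releases expect `filter` rather than `query` inside `constant_score`, so the inner key may need adjusting for your version.

```python
import json


def build_feature_query(field, features, boosts=None):
    """Build a bool/should body with one constant_score clause per feature."""
    boosts = boosts or {}
    should = []
    for feature in features:
        # Same clause shape as in the article: constant_score wrapping a match.
        clause = {"query": {"match": {field: feature}}}
        if feature in boosts:
            # Only add "boost" when one was requested for this feature.
            clause["boost"] = boosts[feature]
        should.append({"constant_score": clause})
    return {"query": {"bool": {"should": should}}}


body = build_feature_query(
    "description",
    ["wifi", "garden", "pool"],
    boosts={"pool": 2},
)
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of GET /_search. Serializing with `json.dumps` also avoids hand-editing mistakes such as a missing comma after `"boost": 2`.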

Note

The final score for each result is not simply the sum of the scores of all matching clauses. The coordination factor and query normalization factor are still taken into account.
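To see why the sum is not the whole story, here is a rough Python sketch of the coordination factor alone. It assumes Lucene's classic TF/IDF similarity, where the coordination factor is the number of matching clauses divided by the total number of clauses. Query normalization is deliberately left out: it applies the same factor to every document, so it does not change the ordering. The function name `approx_score` is hypothetical, and the values are relative, not the scores Elasticsearch would report.

```python
def approx_score(matched_boosts, total_clauses):
    """Sum of the matching clauses' scores, scaled by the coordination factor."""
    if total_clauses == 0:
        return 0.0
    # Coordination factor: share of the query's clauses that matched.
    coord = len(matched_boosts) / total_clauses
    return sum(matched_boosts) * coord


# Three should clauses: wifi (1), garden (1), pool (boosted to 2).
pool_only = approx_score([2], 3)
wifi_and_garden = approx_score([1, 1], 3)
all_three = approx_score([1, 1, 2], 3)
```

Under these assumptions, a home with only a pool and a home with WiFi and a garden both have clause scores that sum to 2, yet the second ranks higher because it matches more of the clauses.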

We could improve our vacation home documents by adding a not_analyzed features field:

{ "features": [ "wifi", "pool", "garden" ] }

What is the benefit of rewriting the documents this way? Does it save index space?

Reference: https://www.elastic.co/guide/en/elasticsearch/guide/current/ignoring-tfidf.html#ignoring-tfidf














This article is reposted from 张昺华-sky's blog on cnblogs (博客园). Original link: http://www.cnblogs.com/bonelee/p/6475950.html. To reprint it, please contact the original author.
