ES Search Ranking, an Introduction to Document Relevance Scoring: Vector Space Model

Introduction:

Vector Space Model

The vector space model provides a way of comparing a multiterm query against a document. The output is a single score that represents how well the document matches the query. In order to do this, the model represents both the document and the query as vectors.

A vector is really just a one-dimensional array containing numbers, for example:

[1,2,5,22,3,8]

In the vector space model, each number in the vector is the weight of a term, as calculated with term frequency/inverse document frequency.

Tip

While TF/IDF is the classic way of calculating term weights for the vector space model, it is not the only way. Other models like Okapi BM25 exist and are available in Elasticsearch. TF/IDF was the default in older Elasticsearch versions (BM25 became the default similarity in 5.0) because it is a simple, efficient algorithm that produces high-quality search results and has stood the test of time.

Imagine that we have a query for “happy hippopotamus.” A common word like happy will have a low weight, while an uncommon term like hippopotamus will have a high weight. Let’s assume that happy has a weight of 2 and hippopotamus has a weight of 5. We can plot this simple two-dimensional vector, [2,5], as a line on a graph starting at point (0,0) and ending at point (2,5), as shown in Figure 27, “A two-dimensional query vector for “happy hippopotamus” represented”.

Figure 27. A two-dimensional query vector for “happy hippopotamus” represented (image not included: the query vector plotted on a graph)

Now, imagine we have three documents:

  1. I am happy in summer.
  2. After Christmas I’m a hippopotamus.
  3. The happy hippopotamus helped Harry.

We can create a similar vector for each document, consisting of the weight of each query term—happy and hippopotamus—that appears in the document, and plot these vectors on the same graph, as shown in Figure 28, “Query and document vectors for “happy hippopotamus””:

  • Document 1: (happy, ____________) → [2,0]
  • Document 2: (_____, hippopotamus) → [0,5]
  • Document 3: (happy, hippopotamus) → [2,5]
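The three vectors above can be derived mechanically. Here is a minimal Python sketch, assuming the example weights (happy = 2, hippopotamus = 5) and a crude tokenizer that is only a stand-in for a real Elasticsearch analyzer. The names QUERY_WEIGHTS, tokenize, and document_vector are illustrative and not part of any library:

```python
import re

# Assumed weights from the example text: happy = 2, hippopotamus = 5.
QUERY_WEIGHTS = {"happy": 2, "hippopotamus": 5}

DOCUMENTS = {
    1: "I am happy in summer.",
    2: "After Christmas I'm a hippopotamus.",
    3: "The happy hippopotamus helped Harry.",
}


def tokenize(text):
    """Lowercase the text and split on anything that is not a letter."""
    return [token for token in re.split(r"[^a-z]+", text.lower()) if token]


def document_vector(text, weights):
    """One dimension per query term: the term's weight if present, else 0."""
    tokens = set(tokenize(text))
    return [weight if term in tokens else 0 for term, weight in weights.items()]


vectors = {
    doc_id: document_vector(text, QUERY_WEIGHTS)
    for doc_id, text in DOCUMENTS.items()
}

for doc_id, vector in vectors.items():
    print(f"Document {doc_id}: {vector}")
# Document 1: [2, 0]
# Document 2: [0, 5]
# Document 3: [2, 5]
```

Note that this sketch ignores how often a term occurs in a document; it only reproduces the simplified vectors used in the example.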

Figure 28. Query and document vectors for “happy hippopotamus” (image not included: the query and document vectors plotted on a graph)

The nice thing about vectors is that they can be compared. By measuring the angle between the query vector and the document vector, it is possible to assign a relevance score to each document. The angle between document 1 and the query is large, so it is of low relevance. Document 2 is closer to the query, meaning that it is reasonably relevant, and document 3 is a perfect match.
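To make the comparison concrete, the angles can be computed directly. The following Python sketch assumes the example vectors (query [2,5]; documents [2,0], [0,5], and [2,5]) and uses the standard dot-product formula for the angle between two vectors. The function name angle_degrees is illustrative, and this is not how Lucene computes its scores:

```python
import math


def angle_degrees(a, b):
    """Angle between two equal-length, non-zero vectors, in degrees."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    cosine = dot / (norm_a * norm_b)
    # Guard against floating-point drift slightly outside [-1, 1].
    cosine = max(-1.0, min(1.0, cosine))
    return math.degrees(math.acos(cosine))


query = [2, 5]
documents = {1: [2, 0], 2: [0, 5], 3: [2, 5]}

angles = {doc_id: angle_degrees(query, vec) for doc_id, vec in documents.items()}

# Smaller angle means more relevant, so sort ascending by angle.
ranking = sorted(angles, key=angles.get)

for doc_id in ranking:
    print(f"Document {doc_id}: {angles[doc_id]:.1f} degrees")
```

With these weights, document 3 is at roughly 0 degrees from the query, document 2 at roughly 21.8 degrees, and document 1 at roughly 68.2 degrees, which matches the ordering described above.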

Tip

In practice, only two-dimensional vectors (queries with two terms) can be plotted easily on a graph. Fortunately, linear algebra—the branch of mathematics that deals with vectors—provides tools to compare the angle between multidimensional vectors, which means that we can apply the same principles explained above to queries that consist of many terms.

You can read more about how to compare two vectors by using cosine similarity.
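As a rough illustration of how the same idea extends beyond two dimensions, the sketch below computes cosine similarity over sparse vectors stored as term-to-weight dictionaries. The four-term query, its weights, and the documents are hypothetical values chosen only for this example, and cosine_similarity is an illustrative helper rather than an Elasticsearch or Lucene API:

```python
import math


def cosine_similarity(query_weights, doc_weights):
    """Cosine of the angle between two sparse vectors ({term: weight})."""
    shared_terms = set(query_weights) & set(doc_weights)
    dot = sum(query_weights[t] * doc_weights[t] for t in shared_terms)
    norm_q = math.sqrt(sum(w * w for w in query_weights.values()))
    norm_d = math.sqrt(sum(w * w for w in doc_weights.values()))
    if norm_q == 0 or norm_d == 0:
        return 0.0  # An empty vector has no direction, so treat it as no match.
    return dot / (norm_q * norm_d)


# Hypothetical four-term query; the weights are made up for illustration.
query = {"happy": 2, "hippopotamus": 5, "helped": 3, "harry": 4}

doc_a = {"happy": 2, "hippopotamus": 5, "helped": 3, "harry": 4}  # all terms
doc_b = {"happy": 2}                                              # one common term
doc_c = {"hippopotamus": 5, "harry": 4}                           # two rarer terms

scores = {
    "a": cosine_similarity(query, doc_a),
    "b": cosine_similarity(query, doc_b),
    "c": cosine_similarity(query, doc_c),
}

for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"doc_{name}: {score:.3f}")
```

A cosine of 1 means the vectors point in the same direction (a perfect match), while a value near 0 means the document shares little weighted vocabulary with the query. The dictionary representation keeps the computation cheap even when the query has many terms, because only the terms present in both vectors contribute to the dot product.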

Now that we have talked about the theoretical basis of scoring, we can move on to see how scoring is implemented in Lucene.

This article is reposted from the cnblogs blog of 张昺华-sky. Original link: http://www.cnblogs.com/bonelee/p/6474138.html. To repost it, please contact the original author.

