wukong搜索引擎源码解读

简介:

转自:https://ayende.com/blog/171745/code-reading-wukong-full-text-search-engine

I like reading code, and recently I was mostly busy with moving our offices, worrying about insurance, lease contracts and all sort of other stuff that are required, but not much fun. So I decided to spend a few hours just going through random code bases and see what I’ll find.

I decided to go with Go projects, because that is a language that I can easily read, and I headed out to this page. Then I basically scanned the listing for any projects that took my fancy. The current one is Wukong, which is a full text search engine. I’m always interested in those, since Lucene is a core part of RavenDB, and knowing how others are implementing this gives you more options.

This isn’t going to be a full review, merely a record of my exploration into the code base. There is one problem with the codebase, as you can see here:

image

All the text is in Chinese, which I know absolutely nothing about, so comments aren’t going to help here. But so far, the code looks solid. I’ll start from the high level overview:

image

We can see that we have a searcher (of type Engine), and we add documents to it, then we flush the index, and then we can search it.

Inside the engine.Init, the first interesting thing is this:

image 

This is interesting because we see sharding from the get go. By default, there are two shards. I’m sure what the indexers are, or what they are for yet.

Note that there is an Indexer and a Ranker for each shard. Then we have a bunch of initialization routines that looks like so:

image

So we create two arrays of channels (the core Go synchronization primitive), and initialize them with a channel per shard. That is a pretty standard Go behavior, and usually there is a go routine that is spinning on each of those channels. That means that we have (by default) two “threads” for adding documents, and two for doing lookups. No idea what either one of those are yet.

There is a lot more like this, for ranking, removing ranks, search ranks, etc, but that is just more of the same, and I’ll ignore it for now. Finally, we see some actions when we start producing threads to do actual works:

image

As you can see, spin off go routines (which will communicate with the channels we already saw) to do the actual work. Note that per shard, we’ll have as many index lookup and rank workers as we have CPUs.

There is an option to persist to disk, using the kv package, this looks like this:

image

On the one hand, I really like the “let us just import a package from github” option that go has, on the other hand. Versioning control has got to be a major issue here for big projects.

Anyway, let us get back to the actual hot path I’m interested in, indexing documents. This is done by this function:

imagea

Note that we have to supply the docId externally, instead of creating it ourselves. Then we just hash the docId and the actual content and send the data to the segmenter to do work. Currently I think that the work of the segmenter is similar to the work of the analyzer in Lucene. To break apart the content to discrete terms. Let us check…

image

this certainly seems to be the case here, yep. This code run on as many cores as the machine has, each one handling a single segmentation request. Once that work is done, it is sent to another thread to actually do the indexing:

image

Note that the indexerRequest is basically an array of all the distinct tokens, their frequencies and the start positions in the original document. There is also handling for ranking, but for now I don’t care about this, so I’ll ignore this.

Sending to the indexing channel will end up invoking this code:

image

And the actual indexing work is done inside AddDocument. A simplified version of it is show here:

image

An indexer is protected by a read/write mutex (which explains why we want to have sharding, it gives us better concurrency, without having to go to concurrent data structures and it drastically simplify the code.

So, what is going on in here? For each of the keywords we found on the document, we create a new entry in the table’s dictionary. With a value that contains the matching document ids. If there are already documents for this keyword, we’ll search for the appropriate position in the index (using simple binary search), then place the document in the appropriate location. Basically, the KeywordIndices is (simplified) Dictionary<Term : string , SortedList<DocId : long>>.

So that pretty much explains how this works. Let us look at how searches are working now…

The first interesting thing that we do when we get a search request is tokenize it (segmentize it, in Wukong terminology):

image

Then we call this code:

image

This is a fairly typical Go code. We create a bounded channel (that has a capacity as the same number of the shards), and we send it to all the shards. We’ll get the reply from all of the shards, then do something with the results from all the shards.

I like this type of interaction because it is easy to model concurrent interactions with it, and it is pervasive in Go. Seems simpler than the comparable strategies in .NET, for example.

Here is a simple example of how this is used:

image

This is the variant without the timeout. And we are just getting the results from all the shards, note that we don’t have any ordering here, we just add all the documents into one big array. I’m not sure how/if Wukong support sorting, there was a lot of stuff about ranking earlier in the codebase, but that doesn’t seem to be apparent in what I saw so far, I’ll look at it later. For now, let us see how a single shard is handling a search lookup request.

What I find most interesting is that there is rank options there, and document ids, which I’m not sure what they are used for. We’ll look at that later, for now, the first part of looking up a search term is here:

image

We take a read lock, then look at the table. We need to find entries for all the keywords that we have in the query to get a result back.

This indicates that a query to Wukong has an implicit AND between all the terms in the query. The result of this is an array with all the indices for each keyword. It then continues to perform set intersection between all the matching keywords, to find all the documents that appear in all of them. Following that, it will compute the BM25 (a TF-IDF function that is used to compute ranking).  After looking at the code, I found where it is actual compute the ranking. It is doing that after getting the results from all the shards, and then it is going to sort them according to their overall scores.

So that was clear enough, and it makes a lot of sense. Now, the last thing that I want to figure out before we are done, is how does Wukong handles deletions?

It turns out that I actually missed part in the search process. The indexer will just find the matching documents, but their BM25 score. It is the ranker (which is sent from the indexer, and then replying to the engine) that will actually sort them. This gives the user the chance to add their own scoring mechanism. Deletion is actually handled as a case where you have nothing to score with, and it gets filtered along the way as an uninteresting value. That means that the memory cost of having a document index cannot be alleviated by deleting it. The actual document data is still there and is kept.

It also means that there is no real facility to update a document. For example, if we have a document whose content used to say Ayende and we want to change it to Oren. We have no way of going to the Ayende keyword and removing it from there. We need to delete the document and create a new one, with a new document id.

Another thing to note is that this has very little actual functionality. There is no possibility of making complex queries, or using multiple fields. Another thing that is very different from how Lucene works is that is runs entirely in memory. While it has a persistent option, that option is actually just a log of documents being added and removed. On startup, it will need to go through the log and actually index all of them again. That means that for large systems, it is going to be a prohibitly expensive startup cost.

All in all, that is a nice codebase, and it is certainly simple enough to grasp without too much complexity. But one need to be aware of the tradeoffs associated with actually using it. I expect it to be fast, although the numbers mentioned in the benchmark page (if I understand the translated Chinese correctly) are drastically below what I would expect to see. Just to give you some idea, 1,400 requests a second are a very small number for an in memory index. I would expect something like 50,000 or so, assuming that you can get all cores to participate. But maybe they are counting this through the network ?  













本文转自张昺华-sky博客园博客,原文链接:http://www.cnblogs.com/bonelee/p/6288946.html,如需转载请自行联系原作者


相关文章
|
3天前
|
监控 搜索推荐 安全
面经:Elasticsearch全文搜索引擎原理与实战
【4月更文挑战第10天】本文是关于Elasticsearch面试准备的博客,重点讨论了四个核心主题:Elasticsearch的分布式架构和数据模型、CRUD操作与查询DSL、集群管理与性能优化,以及安全与插件扩展。文中通过代码示例介绍了如何进行文档操作、查询以及集群管理,并强调理解Elasticsearch的底层原理和优化策略对面试和实际工作的重要性。
22 6
|
9月前
|
自然语言处理 搜索推荐 关系型数据库
Elasticsearch搜索引擎原理理解通俗易懂
记得小马最早期刚参加工作的时候全文索引用的是Sphinx。 当一个功能需要对表中的text varchar等文本进行like查询时,MySQL全表扫描很慢,需要Sphinx。Sphinx能解决性能和中文分词问题。
103 1
Elasticsearch搜索引擎原理理解通俗易懂
|
10月前
|
搜索推荐 程序员
谈一谈|搜索引擎的运用
谈一谈|搜索引擎的运用
53 0
|
数据采集 存储 搜索推荐
搜索引擎爬虫的工作原理是什么?底层原理是什么?
搜索引擎爬虫的工作原理是什么?底层原理是什么?
267 0
|
存储 算法 搜索推荐
【GoDance搜索引擎】搜索引擎集群模块实现笔记
【GoDance搜索引擎】搜索引擎集群模块实现笔记
【GoDance搜索引擎】搜索引擎集群模块实现笔记
|
存储 自然语言处理 运维
搜索lucene概念扫盲
假期重新把之前在新浪博客里面的文字梳理了下,搬到这里。本篇回归基础,从概念介绍起。
98 0
|
存储 自然语言处理 搜索推荐
|
搜索推荐
这就是搜索引擎读书笔记-day1-搜索引擎技术架构
这是我看搜索引擎的第一天,希望通过记录督促自己学习
这就是搜索引擎读书笔记-day1-搜索引擎技术架构
|
机器学习/深度学习 人工智能 自然语言处理
搜索引擎工作原理你是否了解?做SEO的有必要看看
从事SEO(搜索引擎优化)工作的人可以比喻成搜索引擎的贴身管家,作为一名合格称职的管家必须要了解所服务对象的习性,爱好,健康程度等。 SEO服务的对象是搜索引擎,必须对它的运行规律、工作原理、习性、优缺点等都铭记在心,多多实践操作,平时实践的越多,经验也就越丰富。 搜索引擎是由人创造出来的,所以也是有理可寻的。搜索引擎工作过程有主要的三段工作流程,爬行、预处理及服务输出。
155 0
|
设计模式 NoSQL Java