Elasticsearch: Five Things I was Doing Wrong

Update: Also check out my series on scaling Elasticsearch.

I’ve been working with Elasticsearch off and on for over a year, but recently I attended Elasticsearch.com’s training class (well worth the time and money) and discovered a few significant things that I was doing just plain wrong.

Before using Elasticsearch I used Lucene directly, and so a few of the errors I made were due to not understanding some of the things ES does for you behind the scenes.

As background, most of the data I’m indexing conforms to the WordPress database schema.

1. Use Arrays for Fields with Multiple Values

For some reason I had neglected to use arrays when creating fields such as a list of tags attached to a document. At some point I started concatenating the tags together into a long string separated by semicolons, and I used a custom analyzer to break them apart like this:

"analysis": {
  "tokenizer": {
    "semicolon_token": {
      "type": "pattern",
      "pattern": ";"
    }
  },
  "analyzer": {
    "wp_tag_analyzer": {
      "type": "custom",
      "tokenizer": "semicolon_token"
    }
  }
}

Or, for fields that were lists of URLs, I just separated them by spaces and used the whitespace analyzer. Both methods worked fine for the initial applications, but have some obvious drawbacks. Explicitly inserting a character sequence as a delimiter will almost always mean you hit an edge case somewhere down the road where it breaks.

Using an array of items is a much easier way, but somehow, after initially reading about the array mapping, I completely forgot that it existed. I think I was thinking of ES too much as a text searching engine and not enough as a general JSON data store.
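
To make the delimiter edge case concrete, here is a minimal sketch in Python. The helper name tags_from_legacy_string, the sample tags, and the post_id value are all made up for illustration; the point is only that a tag containing the delimiter no longer survives a round trip, while a JSON array keeps each value intact.

```python
import json


def tags_from_legacy_string(joined, delimiter=";"):
    """Split a delimiter-joined tag string, dropping empty pieces."""
    return [t.strip() for t in joined.split(delimiter) if t.strip()]


# Hypothetical tags; the second one contains the delimiter itself.
original_tags = ["elasticsearch", "tips; tricks", "lucene"]

# Old approach: join with semicolons and split again at analysis time.
joined = ";".join(original_tags)
round_tripped = tags_from_legacy_string(joined)
print(len(original_tags), len(round_tripped))  # three tags went in, four come out

# Array approach: the document body simply carries a JSON array.
doc = {"post_id": 42, "tags": original_tags}
body = json.dumps(doc)
print(body)
```

No custom tokenizer is needed for the array version, since each array element is handled as its own value of the field.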

2. Don’t Use store=true When Mapping Fields

If you are storing the full _source of the document, then there is very little reason to store individual fields separately. You just inflate your index size. I originally started storing the content and titles of documents because I thought it might speed up the highlighting. In practice, I don’t think it did anything for me, and many of our queries don’t do any highlighting at all.

In the end this was a case of premature optimization. Maybe at some point, if I find that 90% of the time we are just returning the post_id and using that to look up the original content in MySQL, we'll consider storing that field separately to reduce the network traffic and load caused by extracting the post_id field from _source, but that still feels premature at this point.

For debugging reasons I would never consider turning off storing _source. It is far too useful to know exactly what data was entered, and you never know when you might want to use a different field for a new application.
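
As a rough illustration, the sketch below walks a mapping and reports fields that are stored separately even though _source is enabled. The mapping itself is hypothetical (the "post" type and the field names are assumptions, written in the "string"-type style of ES versions from around the time of this post), and redundantly_stored_fields is a helper I made up, not part of any ES client.

```python
# Hypothetical mapping for a WordPress-like post; names are illustrative only.
post_mapping = {
    "post": {
        "_source": {"enabled": True},  # the full original document is kept
        "properties": {
            "title": {"type": "string", "store": True},    # duplicates _source
            "content": {"type": "string", "store": True},  # duplicates _source
            "post_id": {"type": "long"},                   # not stored separately
        },
    }
}


def redundantly_stored_fields(mapping):
    """Return names of fields stored separately while _source is enabled."""
    found = []
    for type_mapping in mapping.values():
        source_enabled = type_mapping.get("_source", {}).get("enabled", True)
        if not source_enabled:
            continue  # with _source off, stored fields are not redundant
        for name, spec in type_mapping.get("properties", {}).items():
            # Assumption: older mappings may spell the flag as "yes" or "true".
            if spec.get("store") in (True, "yes", "true"):
                found.append(name)
    return sorted(found)


print(redundantly_stored_fields(post_mapping))
```

Dropping the store flag from the reported fields leaves the mapping behaving the same for queries that read from _source, just with a smaller index.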

3. Don’t Manually Flush, Optimize, or Refresh

Elasticsearch takes care of these core Lucene operations for me. There was never any good reason for me to issue one of these commands when the default ES settings would accomplish the same thing within a few minutes.

The optimize command in particular is dangerous since it merges all segments in the Lucene index (a very time-consuming operation). I wrote code that at first issued innocuous optimize commands after I did some bulk indexing by hand, but it eventually started getting called repeatedly in automated jobs. Fortunately it never rose to the level of causing real problems, but it's easy for code you write to get called unintentionally.

Again, this was a case of premature optimization.

4. Set the Appropriate Production Flags

This is another case that didn't cause a real issue, but could have in the future. The default ES settings are chosen so that you can start developing quickly, which means a few of them are not what you want in production. In particular:

  • discovery.zen.minimum_master_nodes
    • Should be set to something like N/2 + 1 (rounding N/2 down), where N is the number of master-eligible nodes.
  • action.disable_delete_all_indices
    • Do you really want to allow a single command (that could be mistyped) to delete all of your indices? No, neither do I.
  • gateway.recover_after_nodes
    • How many nodes need to be up before the recovery process starts replicating data around the cluster.
  • index.cache.field.type: soft (in 0.90 this field name changed to index.fielddata.cache. Thanks Olivier for the heads up.)
    • I started setting my field cache to soft to ensure that it never created OutOfMemory errors. I think this was particularly helpful because we are doing a lot of faceting.
    • Update 2014-01-09: the indices.fielddata.cache.size setting introduced in 0.90 is a better way to prevent running into OutOfMemory exceptions due to the field cache getting too big. I am no longer using the soft field data cache.
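
The N/2 + 1 rule for discovery.zen.minimum_master_nodes is easy to get wrong by one, so here is a small sketch of the arithmetic. The function name is my own, and the division is integer division (rounding down), which is my reading of "something like N/2 + 1".

```python
def minimum_master_nodes(master_eligible_nodes):
    """Quorum size for a given number of master-eligible nodes: N // 2 + 1."""
    if master_eligible_nodes < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible_nodes // 2 + 1


# Print the quorum for a few cluster sizes.
for n in (1, 2, 3, 5):
    print(n, minimum_master_nodes(n))
```

Note that with two master-eligible nodes the quorum is also two, so losing either node leaves the cluster unable to elect a master; an odd number of master-eligible nodes avoids that.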

5. Do Not Use _type as Another Field

The _type field can entice you to use it as another field to indicate a category for your document. Don’t let it.

Here’s where I went wrong. WordPress posts can have different types (post_type), which allow the content of the post to be displayed in different ways (e.g. image posts, video posts, quotes, a status message), even though all post types use the same schema. This seemed to align pretty well with the _type field, so I used an ES dynamic mapping to have post_type == _type.

The biggest problem with this is determining a document’s _type after the post has been deleted from the database, when you also want to delete it from your index. A document is uniquely identified by both its _id and its _type.

  • If you delete from your RDBMS first (or NoSQL data store flavor of the month), then you may no longer have the _type available to delete the object.
  • If you delete from ES first, what happens if the RDBMS delete operation fails for some reason?

Making the _type independent of any data within the document ensures that all you will need is the document id. This was one of those “Oh, that was dumb of me” bugs that I completely missed when building my index.
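
A minimal sketch of the fixed-_type approach, assuming a single constant type name and keeping post_type as an ordinary field. DOC_TYPE, the "wp-posts" index name, and both helper functions are hypothetical; the row keys follow the WordPress wp_posts columns, and the path shape is the /index/type/id form used by the typed document API of that era.

```python
DOC_TYPE = "post"  # hypothetical constant: the same _type for every document


def to_es_document(row):
    """Build the indexed document, keeping post_type as a regular field."""
    return {
        "post_id": row["ID"],
        "post_type": row["post_type"],  # still searchable, but not the _type
        "title": row["post_title"],
    }


def delete_path(index, post_id):
    """Path for deleting a document; only the id is needed, not the row."""
    return "/{}/{}/{}".format(index, DOC_TYPE, post_id)


row = {"ID": 42, "post_type": "video", "post_title": "Hello"}
print(to_es_document(row))
# The row may already be gone from the database; the id alone is enough.
print(delete_path("wp-posts", 42))
```

Because the delete path depends only on the id, the order in which the database and the index are cleaned up matters much less.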

 

Source: https://greg.blog/2013/01/24/elasticsearch-five-things-i-was-doing-wrong/

This article was reposted from 张昺华-sky's cnblogs blog. Original link: http://www.cnblogs.com/bonelee/p/6432400.html. To repost it, please contact the original author.
