列存储压缩技巧，除公共除数或者同时减去最小数，字符串压缩的话，直接去重后用数字ID压缩-阿里云开发者社区

列存储压缩技巧，除公共除数或者同时减去最小数，字符串压缩的话，直接去重后用数字ID压缩

2017-11-16 1177

版权

本文内容由阿里云实名注册用户自发贡献，版权归原作者所有，阿里云开发者社区不拥有其著作权，亦不承担相应法律责任。具体规则请查看《阿里云开发者社区用户服务协议》和《阿里云开发者社区知识产权保护指引》。如果您发现本社区中有涉嫌抄袭的内容，填写侵权投诉表单进行举报，一经查实，本社区将立刻删除涉嫌侵权内容。

简介：

Column-store compression

At a high level, doc values are essentially a serialized column-store. As we discussed in the last section, column-stores excel at certain operations because the data is naturally laid out in a fashion that is amenable to those queries.

But they also excel at compressing data, particularly numbers. This is important for both saving space on disk and for faster access. Modern CPU’s are many orders of magnitude faster than disk drives (although the gap is narrowing quickly with upcoming NVMe drives). That means it is often advantageous to minimize the amount of data that must be read from disk, even if it requires extra CPU cycles to decompress.

To see how it can help compression, take this set of doc values for a numeric field:

Doc      Terms
-----------------------------------------------------------------
Doc_1 | 100
Doc_2 | 1000
Doc_3 | 1500
Doc_4 | 1200
Doc_5 | 300
Doc_6 | 1900
Doc_7 | 4200
-----------------------------------------------------------------

The column-stride layout means we have a contiguous block of numbers:[100,1000,1500,1200,300,1900,4200].

xxx

Doc values use several tricks like this. In order, the following compression schemes are checked:

If all values are identical (or missing), set a flag and record the value
If there are fewer than 256 values, a simple table encoding is used
If there are > 256 values, check to see if there is a common divisor
If there is no common divisor, encode everything as an offset from the smallest value

You’ll note that these compression schemes are not "traditional" general purpose compression like DEFLATE or LZ4. Because the structure of column-stores are rigid and well-defined, we can achieve higher compression by using specialized schemes rather than the more general compression algorithms like LZ4.

You may be thinking "Well that’s great for numbers, but what about strings?" Strings are encoded similarly, with the help of an ordinal table. The strings are de-duplicated and sorted into a table, assigned an ID, and then those ID’s are used as numeric doc values. Which means strings enjoy many of the same compression benefits that numerics do.

The ordinal table itself has some compression tricks, such as using fixed, variable or prefix-encoded strings.

摘自：https://www.elastic.co/guide/en/elasticsearch/guide/current/_deep_dive_on_doc_values.html

本文转自张昺华-sky博客园博客，原文链接：http://www.cnblogs.com/bonelee/p/6401472.html，如需转载请自行联系原作者

列存储压缩技巧，除公共除数或者同时减去最小数，字符串压缩的话，直接去重后用数字ID压缩

热门文章

最新文章

相关电子书