列存储压缩技巧,除公共除数或者同时减去最小数,字符串压缩的话,直接去重后用数字ID压缩

简介:

Column-store compression

At a high level, doc values are essentially a serialized column-store. As we discussed in the last section, column-stores excel at certain operations because the data is naturally laid out in a fashion that is amenable to those queries.

But they also excel at compressing data, particularly numbers. This is important for both saving space on disk and for faster access. Modern CPU’s are many orders of magnitude faster than disk drives (although the gap is narrowing quickly with upcoming NVMe drives). That means it is often advantageous to minimize the amount of data that must be read from disk, even if it requires extra CPU cycles to decompress.

To see how it can help compression, take this set of doc values for a numeric field:

Doc      Terms
-----------------------------------------------------------------
Doc_1 | 100
Doc_2 | 1000
Doc_3 | 1500
Doc_4 | 1200
Doc_5 | 300
Doc_6 | 1900
Doc_7 | 4200
-----------------------------------------------------------------

The column-stride layout means we have a contiguous block of numbers:[100,1000,1500,1200,300,1900,4200]

xxx

Doc values use several tricks like this. In order, the following compression schemes are checked:

  1. If all values are identical (or missing), set a flag and record the value
  2. If there are fewer than 256 values, a simple table encoding is used
  3. If there are > 256 values, check to see if there is a common divisor
  4. If there is no common divisor, encode everything as an offset from the smallest value

You’ll note that these compression schemes are not "traditional" general purpose compression like DEFLATE or LZ4. Because the structure of column-stores are rigid and well-defined, we can achieve higher compression by using specialized schemes rather than the more general compression algorithms like LZ4.

Note

You may be thinking "Well that’s great for numbers, but what about strings?" Strings are encoded similarly, with the help of an ordinal table. The strings are de-duplicated and sorted into a table, assigned an ID, and then those ID’s are used as numeric doc values. Which means strings enjoy many of the same compression benefits that numerics do.

The ordinal table itself has some compression tricks, such as using fixed, variable or prefix-encoded strings.

   

摘自:https://www.elastic.co/guide/en/elasticsearch/guide/current/_deep_dive_on_doc_values.html

















本文转自张昺华-sky博客园博客,原文链接:http://www.cnblogs.com/bonelee/p/6401472.html,如需转载请自行联系原作者


相关文章
|
1月前
|
算法
算法题 — 整数转二进制,查找其中1的数量
算法题 — 整数转二进制,查找其中1的数量
10 0
|
4月前
|
算法 测试技术 C#
|
8月前
|
存储 算法 JavaScript
设计并实现一个函数, 功能为给定一个存储为随机整数的数组,从中删除所有值为i的整数
设计并实现一个函数, 功能为给定一个存储为随机整数的数组,从中删除所有值为i的整数
|
9月前
|
C++
C++读取一行内个数不定的整数的方式
C++读取一行内个数不定的整数的方式
91 0
|
存储 数据库
长整数在插入较短的列时会被转换,但不会被截断为什么?公式是什么?
长整数在插入较短的列时会被转换,但不会被截断为什么?公式是什么?
|
PHP
PHP数组排序 解决数值型版本号排序错乱
PHP数组排序 解决数值型版本号排序错乱
102 0
|
存储
字符串按照固定长度分割并存储在数组
字符串按照固定长度分割并存储在数组
125 0
统计二进制中1的个数,,,写一个函数,返回参数二进制中1的个数 写一个代码判断一个数字是不是2的n次方
统计二进制中1的个数,,,写一个函数,返回参数二进制中1的个数 写一个代码判断一个数字是不是2的n次方
101 0
|
机器学习/深度学习 算法
算法 | 妙法统计二进制中1的个数
二进制(binary),发现者莱布尼茨,是在数学和数字电路中以2为基数的记数系统,是以2为基数代表系统的二进位制。这一系统中,通常用两个不同的符号0(代表零)和1(代表一)来表示。数字电子电路中,逻辑门的实现直接应用了二进制,现代的计算机和依赖计算机的设备里都使用二进制。每个数字称为一个比特(Bit,Binary digit的缩写)
96 0
算法 | 妙法统计二进制中1的个数
L1-4 字符串压缩 (10 分)
编写一个程序,输入一个字符串,然后采用如下的规则对该字符串当中的每一个字符进行压缩: (1) 如果该字符是空格,则保留该字符; (2) 如果该字符是第一次出现或第三次出现或第六次出现,则保留该字符; (3) 否则,删除该字符。 例如,若用户输入“occurrence”,经过压缩后,字符c的第二次出现被删除,第一和第三次出现仍保留;字符r和e的第二次出现均被删除,因此最后的结果为:“ocurenc”。
80 0