关于PostgreSQL的GiST索引之四-阿里云开发者社区

3. hstore的GiST索引

hstore的GiST索引实现和全文检索的tsvector的GiST索引的实现差不过，都是签名文件索引。

3.1 GiST索引项的存储

1)所有索引项都采用签名向量压缩
2)hstore中的所有键和所有值都进行哈希设置bit位得到一个签名向量
3)签名向量的大小为128bit

3.2 性能推论

1)索引尺寸比较小(因为签名向量的大小是固定的而且一般比原始数据小)
2)每次查询要扫描的索引页很多（20%~40% 的索引页）
3)每次键查询(?操作符)至少要recheck 1.6%(2/128)以上的数据行
每条记录包含的键值对越多,需要recheck的数据行越多
4)每次键值查询(@>操作符)至少要recheck 0.02%((2/128) * (2/128))以上的数据行
每条记录包含的键值对越多,需要recheck的数据行越多

假设每条记录平均包含10条键值对，需要recheck的数据行估算如下
键查询(?操作符)匹配：7.5%的数据行

点击(此处)折叠或打开

testdb=# with recursive tx1(n,sign_bits) as
testdb-# (select 1 n,1.0 sign_bits
testdb(# union all
testdb(# select n+1, sign_bits + 1 - sign_bits/128 from tx1 where n=10
testdb(# )
testdb-# select n,sign_bits,sign_bits/128 scan_probability from tx1 where n in(10);
n | sign_bits | scan_probability
----+------------------------+------------------------
10 | 9.65566251563533320389 | 0.07543486340340104066
(1 row)

由此可见对键查询(多个键的?&查询另当别论)，hstore的GiST索引的查询效率是比较低的。不过，从另外一个角度考虑，通常的应用场景下，不同的key值数本来就不会很多，就不是索引的用武之地。

键值查询(@>操作符)匹配：0.57%(0.57%=0.07543486340340104066*0.07543486340340104066)

但是这只是理论上的平均值，由于签名向量的冲突，个别查询的效率可能会很低。比如，如果大部分数据记录都包含某个key，而查询这个键值对的时候，碰巧它的值在签名向量中对应的bit位和这个很常用的key相同，那么大部分数据记录都要进行recheck。

3.3 相关代码

contrib/hstore/hstore_gist.c

点击(此处)折叠或打开

/* bigint defines */
#define BITBYTE 8
#define SIGLENINT 4 /* >122 => key will toast, so very */
#define SIGLEN ( sizeof(int)*SIGLENINT )
#define SIGLENBIT (SIGLEN*BITBYTE)
typedef char BITVEC[SIGLEN];
typedef char *BITVECP;
...
Datum
ghstore_compress(PG_FUNCTION_ARGS)
{
GISTENTRY *entry = (GISTENTRY *) PG_GETARG_POINTER(0);
GISTENTRY *retval = entry;
if (entry->leafkey)
{
GISTTYPE *res = (GISTTYPE *) palloc0(CALCGTSIZE(0));
HStore *val = DatumGetHStoreP(entry->key);
HEntry *hsent = ARRPTR(val);
char *ptr = STRPTR(val);
int count = HS_COUNT(val);
int i;
SET_VARSIZE(res, CALCGTSIZE(0));
//遍历所有KV对
for (i = 0; i count; ++i)
{
int h;
//对key做哈希
h = crc32_sz((char *) HS_KEY(hsent, ptr, i), HS_KEYLEN(hsent, i));
HASH(GETSIGN(res), h);
if (!HS_VALISNULL(hsent, i))
{
//对value做哈希
h = crc32_sz((char *) HS_VAL(hsent, ptr, i), HS_VALLEN(hsent, i));
HASH(GETSIGN(res), h);
}
}
retval = (GISTENTRY *) palloc(sizeof(GISTENTRY));
gistentryinit(*retval, PointerGetDatum(res),
entry->rel, entry->page,
entry->offset,
FALSE);
}
else if (!ISALLTRUE(DatumGetPointer(entry->key)))
{
int32 i;
GISTTYPE *res;
BITVECP sign = GETSIGN(DatumGetPointer(entry->key));
LOOPBYTE
{
if ((sign[i] & 0xff) != 0xff)
PG_RETURN_POINTER(retval);
}
res = (GISTTYPE *) palloc(CALCGTSIZE(ALLISTRUE));
SET_VARSIZE(res, CALCGTSIZE(ALLISTRUE));
res->flag = ALLISTRUE;
retval = (GISTENTRY *) palloc(sizeof(GISTENTRY));
gistentryinit(*retval, PointerGetDatum(res),
entry->rel, entry->page,
entry->offset,
FALSE);
}
PG_RETURN_POINTER(retval);
}

3.4 性能验证

1)环境
CentOS 6.5
PostgreSQL 9.4.0

2)测试数据准备
点击( 此处 )折叠或打开

testdb=# create extension hstore;
CREATE EXTENSION
Time: 235.391 ms
testdb=# create table tbhs(id serial,hs hstore);
CREATE TABLE
Time: 17.459 ms
testdb=# insert into tbhs select id,hstore(id::text, md5(id::text)) from generate_series(1,100000) id;
INSERT 0 100000
Time: 970.613 ms
testdb=# select pg_relation_size('tbhs');
pg_relation_size
------------------
8445952
(1 row)
Time: 0.643 ms
testdb=# select * from tbhs limit 5;
id | hs
----+-----------------------------------------
1 | "1"=>"c4ca4238a0b923820dcc509a6f75849b"
2 | "2"=>"c81e728d9d4c2f636f067f89cc14862c"
3 | "3"=>"eccbc87e4b5ce2fe28308fd9f2a7baf3"
4 | "4"=>"a87ff679a2f3e71d9181a67b7542122c"
5 | "5"=>"e4da3b7fbbce2345d7772b0674a318d5"
(5 rows)
Time: 1.147 ms

注）上面每个记录只有一个键值对，且每个key都是唯一的，这样的数据模型就不适合使用hstore存储，本文故意这么做只是为了验证性能。

2)无索引的查询
点击( 此处 )折叠或打开

testdb=# explain (analyze,buffers) select * from tbhs where hs ? '10';
QUERY PLAN
-----------------------------------------------------------------------------------------------------
Seq Scan on tbhs (cost=0.00..2281.00 rows=100 width=53) (actual time=0.025..31.587 rows=1 loops=1)
Filter: (hs ? '10'::text)
Rows Removed by Filter: 99999
Buffers: shared hit=1031
Planning time: 0.055 ms
Execution time: 31.619 ms
(6 rows)
Time: 32.150 ms

3) 建立GiST索引再查询

点击(此处)折叠或打开

testdb=# create index tbhs_idx_gist on tbhs using gist(hs);
CREATE INDEX
Time: 6243.003 ms
testdb=# analyze tbhs;
ANALYZE
Time: 77.956 ms
testdb=# select pg_relation_size('tbhs_idx_gist'),pg_relation_size('tbhs_idx_gist')/8192 index_pages;
pg_relation_size | index_pages
------------------+-------------
4825088 | 589
(1 row)
Time: 0.456 ms

？查询扫描了45%（265/589）的索引页，Recheck了1.6%的数据行。

点击(此处)折叠或打开

testdb=# explain (analyze,buffers) select * from tbhs where hs ? '10';
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on tbhs (cost=5.06..302.42 rows=100 width=53) (actual time=3.003..4.637 rows=1 loops=1)
Recheck Cond: (hs ? '10'::text)
Rows Removed by Index Recheck: 1558
Heap Blocks: exact=863
Buffers: shared hit=1128
-> Bitmap Index Scan on tbhs_idx_gist (cost=0.00..5.03 rows=100 width=0) (actual time=2.869..2.869 rows=1559 loops=1)
Index Cond: (hs ? '10'::text)
Buffers: shared hit=265
Planning time: 0.153 ms
Execution time: 5.073 ms
(10 rows)
Time: 6.209 ms

@>查询扫描了19%（114/589）的索引页，Recheck了0.012%的数据行。

点击(此处)折叠或打开

testdb=# explain (analyze,buffers) select * from tbhs where hs @> '1=>c4ca4238a0b923820dcc509a6f75849b';
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on tbhs (cost=5.06..302.42 rows=100 width=53) (actual time=2.123..2.205 rows=1 loops=1)
Recheck Cond: (hs @> '"1"=>"c4ca4238a0b923820dcc509a6f75849b"'::hstore)
Rows Removed by Index Recheck: 11
Heap Blocks: exact=12
Buffers: shared hit=126
-> Bitmap Index Scan on tbhs_idx_gist (cost=0.00..5.03 rows=100 width=0) (actual time=2.097..2.097 rows=12 loops=1)
Index Cond: (hs @> '"1"=>"c4ca4238a0b923820dcc509a6f75849b"'::hstore)
Buffers: shared hit=114
Planning time: 0.407 ms
Execution time: 2.294 ms
(10 rows)
Time: 3.332 ms

4)建立GIN索引再查询
建立GIN索引对比一下。

点击(此处)折叠或打开

testdb=# drop index tbhs_idx_gist;
DROP INDEX
Time: 7.036 ms
testdb=# create index tbhs_idx_gin on tbhs using gin(hs);
CREATE INDEX
Time: 2244.241 ms
testdb=# analyze tbhs;
ANALYZE
Time: 64.608 ms
testdb=# select pg_relation_size('tbhs_idx_gin'),pg_relation_size('tbhs_idx_gin')/8192 index_pages;
pg_relation_size | index_pages
------------------+-------------
17629184 | 2152
(1 row)
Time: 0.641 ms
testdb=# explain (analyze,buffers) select * from tbhs where hs ? '10';
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on tbhs (cost=16.77..314.14 rows=100 width=53) (actual time=0.096..0.097 rows=1 loops=1)
Recheck Cond: (hs ? '10'::text)
Heap Blocks: exact=1
Buffers: shared hit=5
-> Bitmap Index Scan on tbhs_idx_gin (cost=0.00..16.75 rows=100 width=0) (actual time=0.086..0.086 rows=1 loops=1)
Index Cond: (hs ? '10'::text)
Buffers: shared hit=4
Planning time: 0.349 ms
Execution time: 0.144 ms
(9 rows)
Time: 1.014 ms
testdb=# explain (analyze,buffers) select * from tbhs where hs @> '1=>c4ca4238a0b923820dcc509a6f75849b';
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on tbhs (cost=28.77..326.14 rows=100 width=53) (actual time=0.070..0.071 rows=1 loops=1)
Recheck Cond: (hs @> '"1"=>"c4ca4238a0b923820dcc509a6f75849b"'::hstore)
Heap Blocks: exact=1
Buffers: shared hit=8
-> Bitmap Index Scan on tbhs_idx_gin (cost=0.00..28.75 rows=100 width=0) (actual time=0.047..0.047 rows=1 loops=1)
Index Cond: (hs @> '"1"=>"c4ca4238a0b923820dcc509a6f75849b"'::hstore)
Buffers: shared hit=7
Planning time: 0.131 ms
Execution time: 0.105 ms
(9 rows)
Time: 0.661 ms

GIN索引的查询效率相当不错。GIN的缺点是：索引比较大，对更新速度有一定影响。

3.5 参考

http://blog.chinaunix.net/uid-20726500-id-4884626.html
http://blog.chinaunix.net/uid-20726500-id-4884681.html
http://blog.chinaunix.net/uid-20726500-id-4886571.html
http://blog.chinaunix.net/uid-259788-id-4279627.html

关于PostgreSQL的GiST索引之四

3. hstore的GiST索引

hstore的GiST索引实现和全文检索的tsvector的GiST索引的实现差不过，都是签名文件索引。

3.1 GiST索引项的存储

3.2 性能推论

3.3 相关代码

3.4 性能验证

3.5 参考

热门文章

最新文章

相关课程

相关电子书

相关实验场景

推荐镜像