PostgreSQL 全表全字段模糊查询的毫秒级高效实现 - 搜索引擎也颤抖了-阿里云开发者社区

postgres=# create table t(phonenum text, info text, c1 int, c2 text, c3 text, c4 timestamp);    
CREATE TABLE    
postgres=# insert into t values ('13888888888','i am digoal, a postgresqler',123,'china','中华人民共和国，阿里巴巴，阿',now());    
INSERT 0 1    
postgres=# select * from t;    
  phonenum   |            info             | c1  |  c2   |              c3              |             c4                 
-------------+-----------------------------+-----+-------+------------------------------+----------------------------    
 13888888888 | i am digoal, a postgresqler | 123 | china | 中华人民共和国，阿里巴巴，阿 | 2016-04-19 11:15:55.208658    
(1 row)

首先，被搜索的字段中如果有中文或者其他mutli-bytes字符，那么不能使用collate, ctype=c的数据库。

还好阿里云RDS PostgreSQL默认就不是C的，很棒。

如果不是，你可以这么指定collate 和 ctype  

postgres=# create database test with template template0 lc_collate 'zh_CN.utf8' lc_ctype 'zh_CN.utf8';

那么接下来，我们要做的是，建立支持模糊查询的函数索引，

create extension pg_trgm;  

create or replace function record_to_text(anyelement) returns text as $$  
  select $1::text;                        
$$ language sql strict immutable;  

test=# create index idx_t_1 on t using gin (record_to_text(t) gin_trgm_ops) ;    
CREATE INDEX  

当需要使用分页，或者结果集很大时，建议使用gist   
test=# create index idx_t_2 on t using gist (record_to_text(t) gist_trgm_ops) ;   
CREATE INDEX

查询测试

test=# explain select * from t where record_to_text(t) ~ 'digoal';  
                            QUERY PLAN                               
-------------------------------------------------------------------  
 Index Scan using idx_t_2 on t  (cost=0.38..8.39 rows=1 width=140)  
   Index Cond: (record_to_text(t.*) ~ 'digoal'::text)  
(2 rows)

查询性能测试

先插一堆数据进去  
postgres=# insert into t select * from t;  
INSERT 0 4194304  
test=# select count(*) from t;  
  count    
---------  
 4194304  
(1 row)  

然后插几条不一样的  

insert into t values ('13888889999','i am dege, a postgresqler',123,'china','德歌 德哥 刘德华 彭德怀',now());    
insert into t values ('13888889999','i am dege, a postgresqler',123,'china','德歌 德哥 刘德华 彭德怀',now());    

vacuum analyze t;

查询速度杠杠的

test=# explain (analyze,verbose,timing,costs,buffers) select * from t where record_to_text(t) ~ 'dege';  
                                                     QUERY PLAN                                                       
--------------------------------------------------------------------------------------------------------------------  
 Index Scan using idx_t_2 on public.t  (cost=0.41..2.43 rows=1 width=101) (actual time=0.236..0.254 rows=2 loops=1)  
   Output: phonenum, info, c1, c2, c3, c4  
   Index Cond: (record_to_text(t.*) ~ 'dege'::text)  
   Buffers: shared hit=5  
 Planning time: 0.349 ms  
 Execution time: 0.301 ms  
(6 rows)

测试查询包含刘德华的行(因为行很少，所以建议使用GIN索引)

test=# /*+ BitmapScan(t idx_t_1) */ explain (analyze,verbose,timing,costs,buffers) select * from t where record_to_text(t) ~ '刘德华' limit 10;  
LOG:  available indexes for BitmapScan(t): idx_t_1  
LOG:  pg_hint_plan:  
used hint:  
BitmapScan(t idx_t_1)  
not used hint:  
duplication hint:  
error hint:  

LOG:  pg_hint_plan:  
used hint:  
not used hint:  
BitmapScan(t idx_t_1)  
duplication hint:  
error hint:  

                                                       QUERY PLAN                                                         
------------------------------------------------------------------------------------------------------------------------  
 Limit  (cost=441.00..442.26 rows=1 width=101) (actual time=0.239..0.255 rows=2 loops=1)  
   Output: phonenum, info, c1, c2, c3, c4  
   Buffers: shared hit=4  
   ->  Bitmap Heap Scan on public.t  (cost=441.00..442.26 rows=1 width=101) (actual time=0.238..0.252 rows=2 loops=1)  
         Output: phonenum, info, c1, c2, c3, c4  
         Recheck Cond: (record_to_text(t.*) ~ '刘德华'::text)  
         Heap Blocks: exact=1  
         Buffers: shared hit=4  
         ->  Bitmap Index Scan on idx_t_1  (cost=0.00..441.00 rows=1 width=0) (actual time=0.086..0.086 rows=2 loops=1)  
               Index Cond: (record_to_text(t.*) ~ '刘德华'::text)  
               Buffers: shared hit=3  
 Planning time: 0.494 ms  
 Execution time: 0.313 ms  
(13 rows)  


test=# /*+ BitmapScan(t idx_t_1) */  select * from t where record_to_text(t) ~ '刘德华' limit 10;  
LOG:  available indexes for BitmapScan(t): idx_t_1  
LOG:  pg_hint_plan:  
used hint:  
BitmapScan(t idx_t_1)  
not used hint:  
duplication hint:  
error hint:  

LOG:  pg_hint_plan:  
used hint:  
not used hint:  
BitmapScan(t idx_t_1)  
duplication hint:  
error hint:  

  phonenum   |           info            | c1  |  c2   |           c3            |             c4               
-------------+---------------------------+-----+-------+-------------------------+----------------------------  
 13888889999 | i am dege, a postgresqler | 123 | china | 德歌 德哥 刘德华 彭德怀 | 2017-01-06 17:04:42.19215  
 13888889999 | i am dege, a postgresqler | 123 | china | 德歌 德哥 刘德华 彭德怀 | 2017-01-06 17:04:42.514895  
(2 rows)  

Time: 1.225 ms

语句超时

通常这种索引命中，根据返回的结果集数量，响应时间可能是 0.几毫秒到几十毫秒不等。

不过有些时候，可能因为用户输入的信息量太少，比如输入了2个字符，那么被匹配到的token信息会很多，导致变慢。

使用GiST可以缓解。

那么应用层可以做一个保护，比如超过1秒，则报语句超时。

postgres=# set statement_timeout = '1s';  
SET  

or  

test=# /*+ Set(statement_timeout 1s) */ select * from t where record_to_text(t) ~ 'd' limit 10;  
LOG:  pg_hint_plan:  
used hint:  
Set(statement_timeout 1s)  
not used hint:  
duplication hint:  
error hint:  

LOG:  pg_hint_plan:  
used hint:  
Set(statement_timeout 1s)  
not used hint:  
duplication hint:  
error hint:  

  phonenum   |            info             | c1  |  c2   |              c3              |             c4               
-------------+-----------------------------+-----+-------+------------------------------+----------------------------  
 13888888888 | i am digoal, a postgresqler | 123 | china | 中华人民共和国，阿里巴巴，阿 | 2017-01-06 16:51:16.840941  
 13888888888 | i am digoal, a postgresqler | 123 | china | 中华人民共和国，阿里巴巴，阿 | 2017-01-06 16:51:16.840941  
 13888888888 | i am digoal, a postgresqler | 123 | china | 中华人民共和国，阿里巴巴，阿 | 2017-01-06 16:51:16.840941  
 13888888888 | i am digoal, a postgresqler | 123 | china | 中华人民共和国，阿里巴巴，阿 | 2017-01-06 16:51:16.840941  
 13888888888 | i am digoal, a postgresqler | 123 | china | 中华人民共和国，阿里巴巴，阿 | 2017-01-06 16:51:16.840941  
 13888888888 | i am digoal, a postgresqler | 123 | china | 中华人民共和国，阿里巴巴，阿 | 2017-01-06 16:51:16.840941  
 13888888888 | i am digoal, a postgresqler | 123 | china | 中华人民共和国，阿里巴巴，阿 | 2017-01-06 16:51:16.840941  
 13888888888 | i am digoal, a postgresqler | 123 | china | 中华人民共和国，阿里巴巴，阿 | 2017-01-06 16:51:16.840941  
 13888888888 | i am digoal, a postgresqler | 123 | china | 中华人民共和国，阿里巴巴，阿 | 2017-01-06 16:51:16.840941  
 13888888888 | i am digoal, a postgresqler | 123 | china | 中华人民共和国，阿里巴巴，阿 | 2017-01-06 16:51:16.840941  
(10 rows)