How to Speed Up Loading of the jieba Word Segmentation Plugin in PostgreSQL

2016-07-25

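Background

PostgreSQL exposes its full-text search interface as an open API, so there are many Chinese word segmentation plugins for it, such as the widely used scws plugin and the jieba plugin.

If you use the jieba plugin, you may have run into the following problem.

In every session, the first query is slow, and subsequent queries are fast.

For example: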

psql (9.5.3)
Type "help" for help.

postgres=# \timing
Timing is on.
postgres=# select * from ts_debug('jiebacfg', '子远e5a1cbb8');
 alias | description |  token   | dictionaries | dictionary |  lexemes   
-------+-------------+----------+--------------+------------+------------
 n     | noun        | 子远     | {jieba_stem} | jieba_stem | {子远}
 n     | noun        | e5a1cbb8 | {jieba_stem} | jieba_stem | {e5a1cbb8}
(2 rows)

Time: 863.777 ms
postgres=# select * from ts_debug('jiebacfg', '子远e5a1cbb8');
 alias | description |  token   | dictionaries | dictionary |  lexemes   
-------+-------------+----------+--------------+------------+------------
 n     | noun        | 子远     | {jieba_stem} | jieba_stem | {子远}
 n     | noun        | e5a1cbb8 | {jieba_stem} | jieba_stem | {e5a1cbb8}
(2 rows)

Time: 1.342 ms

Root cause analysis

The first time the pg_jieba module is loaded, it has to load its dictionaries.

/*
 * Module load callback
 */
void
_PG_init(void)
{
        if (jieba_ctx)
                return;

        {
                const char* dict_path = jieba_get_tsearch_config_filename(DICT_PATH, EXT);
                const char* hmm_path = jieba_get_tsearch_config_filename(HMM_PATH, EXT);
                const char* user_dict_path = jieba_get_tsearch_config_filename(USER_DICT, EXT);

                /*
                 * init will take a few seconds to load dicts.
                 */
                jieba_ctx = Jieba_New(dict_path, hmm_path, user_dict_path);
        }
}

If pg_jieba.so is not listed in shared_preload_libraries or session_preload_libraries, every session has to load pg_jieba.so itself the first time it uses jieba, and that is why the first query is so slow.
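The difference between loading in every session and loading once at server start can be illustrated with a small simulation. This is only a sketch, not pg_jieba code: `load_dictionary`, `backend_query`, and `run_backends` are hypothetical names, the dictionary is a one-entry placeholder, and it assumes a Unix-like system where `os.fork` is available (PostgreSQL backends are forked from the postmaster on such systems).

```python
import os

_dictionary = None  # stands in for pg_jieba's jieba_ctx (hypothetical model)


def load_dictionary():
    """Simulate the expensive one-time dictionary load."""
    global _dictionary
    _dictionary = {"子远": "n"}  # placeholder content, not the real dictionary


def backend_query(word):
    """Look up a word, loading the dictionary first if it was not inherited."""
    loaded_here = False
    if _dictionary is None:
        load_dictionary()
        loaded_here = True
    return _dictionary.get(word), loaded_here


def run_backends(preload, n=3):
    """Fork n child "backends"; return 1 per child that had to load, else 0."""
    global _dictionary
    _dictionary = None
    if preload:
        load_dictionary()  # once, in the parent, before any fork
    loads = []
    for _ in range(n):
        read_fd, write_fd = os.pipe()
        pid = os.fork()
        if pid == 0:  # child process plays the role of a backend
            os.close(read_fd)
            _, loaded_here = backend_query("子远")
            os.write(write_fd, b"1" if loaded_here else b"0")
            os._exit(0)
        os.close(write_fd)
        loads.append(int(os.read(read_fd, 1)))
        os.close(read_fd)
        os.waitpid(pid, 0)
    return loads


print("lazy load per backend:", run_backends(preload=False))
print("preloaded in parent:  ", run_backends(preload=True))
```

Without preloading, every child pays for the load itself; with preloading, the children find the dictionary already in memory because they inherit the parent's address space.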

Example

psql (9.5.3)
Type "help" for help.

postgres=# \timing
Timing is on.
postgres=# load 'pg_jieba';
LOAD
Time: 857.098 ms
postgres=# select * from ts_debug('jiebacfg', '子远e5a1cbb8');
 alias | description |  token   | dictionaries | dictionary |  lexemes   
-------+-------------+----------+--------------+------------+------------
 n     | noun        | 子远     | {jieba_stem} | jieba_stem | {子远}
 n     | noun        | e5a1cbb8 | {jieba_stem} | jieba_stem | {e5a1cbb8}
(2 rows)
Time: 4.952 ms

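How to fix it

Once the cause is known, the fix is straightforward.
Add pg_jieba.so to shared_preload_libraries or session_preload_libraries. With shared_preload_libraries the library, and with it the dictionaries, is loaded once at server start and inherited by every backend. With session_preload_libraries it is still loaded once per session, but at connection start rather than during the first query.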

vi postgresql.conf
 
shared_preload_libraries = 'pg_jieba.so'
or
session_preload_libraries = 'pg_jieba.so'
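Before restarting, it can help to confirm that the setting is really active in the file and not commented out. The helper below is a minimal sketch under simplifying assumptions: `preload_setting` is a hypothetical name, it only understands single-quoted values on one line, and it ignores include directives and settings changed by other means.

```python
import re


def preload_setting(conf_text, name):
    """Return the library list for `name`, or [] if unset or commented out."""
    # The "=" between name and value is optional in postgresql.conf.
    pattern = re.compile(r"^\s*" + re.escape(name) + r"\s*=?\s*'([^']*)'")
    value = ""
    for line in conf_text.splitlines():
        match = pattern.match(line)
        if match:
            value = match.group(1)  # a later assignment overrides an earlier one
    return [lib.strip() for lib in value.split(",") if lib.strip()]


conf = """
#shared_preload_libraries = ''
shared_preload_libraries = 'pg_jieba.so'   # loaded once at server start
"""

libs = preload_setting(conf, "shared_preload_libraries")
print(libs)
```

To check a real installation, read the text from the server's postgresql.conf instead of the inline sample.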

Restart the database:

pg_ctl restart -m fast

Memory overhead comparison

1. Without either of the following settings:

shared_preload_libraries = 'pg_jieba.so'
or
session_preload_libraries = 'pg_jieba.so'

session A :

psql (9.5.3)
Type "help" for help.

postgres=# select pg_backend_pid();
 pg_backend_pid 
----------------
          12254
(1 row)

session B :

psql (9.5.3)
Type "help" for help.

postgres=# select pg_backend_pid();
 pg_backend_pid 
----------------
          12261
(1 row)

Memory usage of the backend processes:

# smem|grep 12261
  PID User     Command                         Swap      USS      PSS      RSS
12261 digoal   postgres: postgres postgres        0      812     1677     3780 

# smem|grep 12254
  PID User     Command                         Swap      USS      PSS      RSS
12254 digoal   postgres: postgres postgres        0      812     1682     3788

While pg_jieba has not been used, /proc/12261/smaps also shows that pg_jieba.so is not loaded.

Now load the pg_jieba module, or run a pg_jieba parse, in each session:

postgres=# load 'pg_jieba';
LOAD
Time: 872.095 ms

Memory usage soars:

# smem|grep 12254
  PID User     Command                         Swap      USS      PSS      RSS
12254 digoal   postgres: postgres postgres        0   114404   116326   120272 

# smem|grep 12261
  PID User     Command                         Swap      USS      PSS      RSS
12261 digoal   postgres: postgres postgres        0   114404   116321   120260
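To put a number on the jump, the smem rows for PID 12254 before and after the load can be compared. The snippet below is a rough sketch: `parse_smem_line` is a hypothetical helper, and it assumes smem's default output, where the last four columns are Swap, USS, PSS, and RSS in KB.

```python
def parse_smem_line(line):
    """Split one smem row into its PID and the four trailing size columns."""
    fields = line.split()
    swap, uss, pss, rss = (int(value) for value in fields[-4:])
    return {"pid": int(fields[0]), "swap": swap, "uss": uss, "pss": pss, "rss": rss}


# Rows copied from the smem output for backend 12254.
before = parse_smem_line(
    "12254 digoal   postgres: postgres postgres        0      812     1682     3788"
)
after = parse_smem_line(
    "12254 digoal   postgres: postgres postgres        0   114404   116326   120272"
)

uss_growth_kb = after["uss"] - before["uss"]
print(f"USS grew by {uss_growth_kb} KB (about {uss_growth_kb / 1024:.1f} MiB)")
```

USS counts only memory private to the process, so this growth is paid again by every session that loads the dictionaries on its own.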

2. With either of the following settings:

shared_preload_libraries = 'pg_jieba.so'
or
session_preload_libraries = 'pg_jieba.so'

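After running a query in each session, the backend processes no longer hold a private copy of the memory loaded by pg_jieba.so; it is accounted as shared memory.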

[root@iZ28tqoemgtZ ~]# smem|grep 12410
  PID User     Command                         Swap      USS      PSS      RSS
12410 digoal   postgres: postgres postgres        0     3696    17754   118988 

[root@iZ28tqoemgtZ ~]# smem|grep 12412
  PID User     Command                         Swap      USS      PSS      RSS
12412 digoal   postgres: postgres postgres        0     3124    17115   118296

/proc/12410/smaps likewise shows that the pg_jieba.so mappings contribute only a small amount of Pss.

7fb68fe40000-7fb68fe55000 r-xp 00000000 fd:01 1052111                    /home/digoal/pgsql9.5/lib/pg_jieba.so
Size:                 84 kB
Rss:                  48 kB
Pss:                  16 kB
Shared_Clean:         48 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:         0 kB
Referenced:           48 kB
Anonymous:             0 kB
AnonHugePages:         0 kB
Swap:                  0 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Locked:                0 kB
VmFlags: rd ex mr mw me 
7fb68fe55000-7fb690054000 ---p 00015000 fd:01 1052111                    /home/digoal/pgsql9.5/lib/pg_jieba.so
Size:               2044 kB
Rss:                   0 kB
Pss:                   0 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:         0 kB
Referenced:            0 kB
Anonymous:             0 kB
AnonHugePages:         0 kB
Swap:                  0 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Locked:                0 kB
VmFlags: mr mw me 
7fb690054000-7fb690055000 r--p 00014000 fd:01 1052111                    /home/digoal/pgsql9.5/lib/pg_jieba.so
Size:                  4 kB
Rss:                   4 kB
Pss:                   0 kB
Shared_Clean:          0 kB
Shared_Dirty:          4 kB
Private_Clean:         0 kB
Private_Dirty:         0 kB
Referenced:            4 kB
Anonymous:             4 kB
AnonHugePages:         0 kB
Swap:                  0 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Locked:                0 kB
VmFlags: rd mr mw me ac 
7fb690055000-7fb690056000 rw-p 00015000 fd:01 1052111                    /home/digoal/pgsql9.5/lib/pg_jieba.so
...
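Adding up the per-mapping figures by hand is tedious, so a short script can total them. This is a sketch, not a complete smaps parser: `sum_smaps_fields` is a hypothetical name, and the inline sample keeps only the Size, Rss, and Pss lines of the three complete mappings shown above.

```python
import re

# A mapping header looks like:
# "7fb68fe40000-7fb68fe55000 r-xp 00000000 fd:01 1052111   /path/to/lib.so"
HEADER = re.compile(r"^[0-9a-f]+-[0-9a-f]+\s+\S+\s+\S+\s+\S+\s+\d+\s*(.*)$")


def sum_smaps_fields(smaps_text, path_suffix, fields=("Rss", "Pss")):
    """Sum the given kB fields over all mappings whose path ends with path_suffix."""
    totals = {name: 0 for name in fields}
    in_target = False
    for line in smaps_text.splitlines():
        header = HEADER.match(line)
        if header:
            in_target = header.group(1).endswith(path_suffix)
            continue
        if in_target:
            key, _, rest = line.partition(":")
            if key in totals:
                totals[key] += int(rest.split()[0])
    return totals


sample = """\
7fb68fe40000-7fb68fe55000 r-xp 00000000 fd:01 1052111                    /home/digoal/pgsql9.5/lib/pg_jieba.so
Size:                 84 kB
Rss:                  48 kB
Pss:                  16 kB
7fb68fe55000-7fb690054000 ---p 00015000 fd:01 1052111                    /home/digoal/pgsql9.5/lib/pg_jieba.so
Size:               2044 kB
Rss:                   0 kB
Pss:                   0 kB
7fb690054000-7fb690055000 r--p 00014000 fd:01 1052111                    /home/digoal/pgsql9.5/lib/pg_jieba.so
Size:                  4 kB
Rss:                   4 kB
Pss:                   0 kB
"""

totals = sum_smaps_fields(sample, "pg_jieba.so")
print(totals)
```

For a live backend, pass the contents of /proc/<pid>/smaps instead of the sample; reading another user's smaps file usually requires root.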

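References

https://github.com/jaiminpan/pg_jieba
Also worth noting: jieba segmentation does not have the comma problem:
https://yq.aliyun.com/articles/58007
Throughput: each CPU core processes roughly 564,000 words per second (8 lexemes per row, 1,000,000 rows, about 14.18 s).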

postgres=# alter function to_tsvector(regconfig,text) volatile;
ALTER FUNCTION
postgres=# explain (buffers,timing,costs,verbose,analyze) select to_tsvector('jiebacfg','中华人民共和国万岁，如何加快PostgreSQL结巴分词加载速度') from generate_series(1,1000000);
                                                                QUERY PLAN                                                                
-----------------------------------------------------------------------------------------------------------------------------------
 Function Scan on pg_catalog.generate_series  (cost=0.00..260.00 rows=1000 width=0) (actual time=100.054..13943.166 rows=1000000 loops=1)
   Output: to_tsvector('jiebacfg'::regconfig, '中华人民共和国万岁，如何加快PostgreSQL结巴分词加载速度'::text)
   Function Call: generate_series(1, 1000000)
   Buffers: temp read=1710 written=1709
 Planning time: 0.040 ms
 Execution time: 14175.527 ms
(6 rows)
Time: 14176.044 ms
postgres=# select to_tsvector('jiebacfg','中华人民共和国万岁，如何加快PostgreSQL结巴分词加载速度');
                                       to_tsvector                                        
------------------------------------------------------------------------------------------
 'postgresql':6 '万岁':2 '中华人民共和国':1 '分词':8 '加快':5 '加载':9 '结巴':7 '速度':10
(1 row)
Time: 0.522 ms
postgres=# select 8*1000000/14.175527;
      ?column?       
---------------------
 564352.916120860974
(1 row)
Time: 0.743 ms
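The throughput figure can be reproduced outside psql. The sketch below repeats the article's calculation, which counts the 8 lexemes in the to_tsvector result, and as an added comparison (not in the original) also counts the raw characters of the input string.

```python
text = "中华人民共和国万岁，如何加快PostgreSQL结巴分词加载速度"
rows = 1_000_000
elapsed_s = 14.175527   # 14175.527 ms, the execution time reported by EXPLAIN ANALYZE
lexemes_per_row = 8     # entries in the to_tsvector result for this string

lexemes_per_second = lexemes_per_row * rows / elapsed_s
chars_per_second = len(text) * rows / elapsed_s  # added comparison, not in the article

print(f"{lexemes_per_second:,.0f} lexemes/s")
print(f"{chars_per_second:,.0f} characters/s")
```

Both numbers describe a single backend, so they are per CPU core, and they include the overhead of generate_series and of building the tsvector.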

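Summary

To speed up loading of the jieba plugin, configure its .so file to be loaded automatically when the database starts.
Loading it at database startup has a second benefit: memory usage drops considerably.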
