PostgreSQL 任意列组合条件行数估算实践 - 采样估算-阿里云开发者社区

postgres=# create table tbl1(c1 int, c2 int, c3 int, c4 int, c5 int, c6 timestamp);  
CREATE TABLE  
postgres=# insert into tbl1 select random()*100, random()*10, random()*50, random()*20, random()*5000, clock_timestamp()  from generate_series(1,50000000);  
INSERT 0 50000000  
postgres=# analyze tbl1;  
ANALYZE

2、观察多列、单列条件的行评估是否准确

两列AND条件，非常准确（评估53562，实际50354 ）

postgres=# select count(*) from tbl1 where c1=1 and c2=1;  
 count   
-------  
 50354  
(1 row)  
  
postgres=# explain select * from tbl1 where c1=1 and c2=1;  
                                 QUERY PLAN                                    
-----------------------------------------------------------------------------  
 Gather  (cost=1000.00..686505.00 rows=53562 width=28)  
   Workers Planned: 2  
   ->  Parallel Seq Scan on tbl1  (cost=0.00..680148.80 rows=22318 width=28)  
         Filter: ((c1 = 1) AND (c2 = 1))  
(4 rows)

两列OR条件，非常准确（评估5558119，实际5451751 ）

postgres=# explain select * from tbl1 where c1=1 or c2=1;  
                           QUERY PLAN                              
-----------------------------------------------------------------  
 Seq Scan on tbl1  (cost=0.00..1117649.92 rows=5558119 width=28)  
   Filter: ((c1 = 1) OR (c2 = 1))  
(2 rows)  
  
postgres=# select count(*) from tbl1 where c1=1 or c2=1;  
  count    
---------  
 5451751  
(1 row)

三列AND条件，非常准确（评估10，实际12 ）

postgres=# explain analyze select * from tbl1 where c1=1 and c2=1 and c5=99;  
                                                       QUERY PLAN                                                         
------------------------------------------------------------------------------------------------------------------------  
 Gather  (cost=1000.00..733233.27 rows=10 width=28) (actual time=41.413..1422.724 rows=12 loops=1)  
   Workers Planned: 2  
   Workers Launched: 2  
   ->  Parallel Seq Scan on tbl1  (cost=0.00..732232.27 rows=4 width=28) (actual time=273.134..1420.060 rows=4 loops=3)  
         Filter: ((c1 = 1) AND (c2 = 1) AND (c5 = 99))  
         Rows Removed by Filter: 16666663  
 Planning time: 0.126 ms  
 Execution time: 1424.440 ms  
(8 rows)

但是，所有情况都能准确吗？

当数据出现倾斜时，可能就没那么准确了。

3、再次写入5000万数据，此次c2,c3,c4,c5使用固定值，使其倾斜。

postgres=# insert into tbl1 select random()*100, 10, 50, 20, 5000, clock_timestamp()  from generate_series(1,50000000);  
INSERT 0 50000000  
  
postgres=# analyze tbl1;  
ANALYZE

4、非倾斜值的评估，依旧准确

postgres=# explain analyze select * from tbl1 where c1=1 and c2=1 and c5=99;  
                                                       QUERY PLAN                                                          
-------------------------------------------------------------------------------------------------------------------------  
 Gather  (cost=1000.00..1465464.64 rows=21 width=28) (actual time=362.359..3084.858 rows=12 loops=1)  
   Workers Planned: 2  
   Workers Launched: 2  
   ->  Parallel Seq Scan on tbl1  (cost=0.00..1464462.54 rows=9 width=28) (actual time=154.661..3082.340 rows=4 loops=3)  
         Filter: ((c1 = 1) AND (c2 = 1) AND (c5 = 99))  
         Rows Removed by Filter: 33333329  
 Planning time: 0.099 ms  
 Execution time: 3086.687 ms  
(8 rows)

5、对于倾斜值，评估偏差很大（评估12702，实际0 ）

postgres=# explain analyze select * from tbl1 where c1=1 and c2=1 and c5=5000 and c4=20;  
                                                         QUERY PLAN                                                            
-----------------------------------------------------------------------------------------------------------------------------  
 Gather  (cost=1000.00..1570899.53 rows=12702 width=28) (actual time=2834.241..2834.241 rows=0 loops=1)  
   Workers Planned: 2  
   Workers Launched: 2  
   ->  Parallel Seq Scan on tbl1  (cost=0.00..1568629.33 rows=5292 width=28) (actual time=2831.777..2831.777 rows=0 loops=3)  
         Filter: ((c1 = 1) AND (c2 = 1) AND (c5 = 5000) AND (c4 = 20))  
         Rows Removed by Filter: 33333333  
 Planning time: 0.144 ms  
 Execution time: 2836.049 ms  
(8 rows)

6、调整目标列的统计信息柱状图，扩大到1000。

postgres=# alter table tbl1 alter COLUMN c5 set statistics 1000;  
ALTER TABLE  
postgres=# alter table tbl1 alter COLUMN c4 set statistics 1000;  
ALTER TABLE  
postgres=# alter table tbl1 alter COLUMN c1 set statistics 1000;  
ALTER TABLE  
postgres=# alter table tbl1 alter COLUMN c2 set statistics 1000;  
ALTER TABLE  
postgres=# analyze tbl1;  
ANALYZE

7、行评估依旧不准确（评估13034，实际0 ）

postgres=# explain analyze select * from tbl1 where c1=1 and c2=1 and c5=5000 and c4=20;  
                                                         QUERY PLAN                                                            
-----------------------------------------------------------------------------------------------------------------------------  
 Gather  (cost=1000.00..1570932.73 rows=13034 width=28) (actual time=2852.150..2852.150 rows=0 loops=1)  
   Workers Planned: 2  
   Workers Launched: 2  
   ->  Parallel Seq Scan on tbl1  (cost=0.00..1568629.33 rows=5431 width=28) (actual time=2849.510..2849.510 rows=0 loops=3)  
         Filter: ((c1 = 1) AND (c2 = 1) AND (c5 = 5000) AND (c4 = 20))  
         Rows Removed by Filter: 33333333  
 Planning time: 0.181 ms  
 Execution time: 2853.844 ms  
(8 rows)

那么这个问题无解了吗？实际上PG提供了采样查询的接口，我们通过采样查询能够实现高效、准确的评估。

《秒级任意维度分析1TB级大表 - 通过采样估值满足高效TOP N等统计分析需求》

内置了4种采样接口，我们这里用到的是基于BLOCK级别的百分比采样。

采样百分之一，COUNT值乘以100就是评估行数。如果采样百分之5的话，COUNT值乘以20就是评估行数。

实际上采样百分之一，评估已经非常准确了。

postgres=# select count(*) from ( SELECT * FROM tbl1 TABLESAMPLE SYSTEM(1)) t where c1=1 and c2=1 and c5=5000 and c4=20;  
 count   
-------  
     0  
(1 row)  
Time: 104.212 ms  
  
postgres=# select count(*) from ( SELECT * FROM tbl1 TABLESAMPLE SYSTEM(1)) t where c1=1 and c2=1 and c5=5000 ;  
 count   
-------  
     0  
(1 row)  
Time: 107.088 ms  
  
postgres=# select count(*) from tbl1 where c1=1 and c2=1 and c5=5000;  
 count   
-------  
     2  
(1 row)  
Time: 2838.372 ms (00:02.838)

postgres=# select count(*) from ( SELECT * FROM tbl1 TABLESAMPLE SYSTEM(1)) t where c1=1 and c2=1 ;  
 count   
-------  
   552  
(1 row)  
  
Time: 100.066 ms  
  
postgres=# select count(*) from tbl1 where c1=1 and c2=1 ;  
 count   
-------  
 50354  
(1 row)  
Time: 2546.602 ms (00:02.547)

1亿条记录，采样百分之一，大概耗时100毫秒。相比直接全表扫描还是节约了不少成本的。

小结

通过采样查询，我们实现了任意维度条件过滤的行评估，相比传统的行评估算法，精确度提高了不少，在1亿数据量的情况下，采样百分之一，耗时约100毫秒。

参考

《PostgreSQL Oracle 兼容性之 - 数据采样与脱敏》

《PostgreSQL pg_stats used to estimate top N freps values and explain rows》

《PostgreSQL 优化器行评估算法》

《妙用explain Plan Rows快速估算行 - 分页数估算》

《秒级任意维度分析1TB级大表 - 通过采样估值满足高效TOP N等统计分析需求》

《PostgreSQL 10 黑科技 - 自定义统计信息》

修改单列统计信息柱状图精度，默认是100全局default_statistics_target参数控制。

postgres=# alter table tbl1 alter COLUMN c5 set statistics 1000;  
ALTER TABLE  
Time: 0.786 ms  
  
postgres=# analyze tbl1;  
ANALYZE  
Time: 2157.194 ms (00:02.157)

《Greenplum 最佳实践 - 估值插件hll的使用(以及hll分式聚合函数优化)》

《PostgreSQL hll (HyperLogLog) extension for "State of The Art Cardinality Estimation Algorithm" - 3》

《PostgreSQL hll (HyperLogLog) extension for "State of The Art Cardinality Estimation Algorithm" - 2》

《PostgreSQL hll (HyperLogLog) extension for "State of The Art Cardinality Estimation Algorithm" - 1》

《PostgreSQL count-min sketch top-n 概率计算插件 cms_topn (结合窗口实现同比、环比、滑窗分析等) - 流计算核心功能之一》

PostgreSQL 任意列组合条件行数估算实践 - 采样估算

标签

背景

例子

小结

参考

关系型数据库

热门文章

最新文章

相关产品

相关课程

相关电子书

相关实验场景

推荐镜像

PostgreSQL 任意列组合条件 行数估算 实践 - 采样估算

标签

背景

例子

小结

参考