1. 聚能聊>
  2. 话题详情

来测一下你有没有SQL优化基因!

<震惊!慢SQL居然能优化到这种速度,我不服!>后,又一轮SQL挑战赛来了。这次是《决战紫禁之巅》,不服来战。

image

场景表述

一张小表A,里面存储了一些ID,大约几百个。

(比如说巡逻车辆ID,环卫车辆的ID,公交车,微公交的ID)。

另外有一张日志表B,每条记录中的ID是来自前面那张小表的,但不是每个ID都出现在这张日志表中,比如说一天可能只有几十个ID会出现在这个日志表的当天的数据中。

(比如车辆的行车轨迹数据,每秒上报轨迹,数据量就非常庞大)。

那么我怎么快速的找出今天没有出现的ID呢。

(哪些巡逻车辆没有出现在这个片区,是不是偷懒了?哪些环卫车辆没有出行,哪些公交或微公交没有出行)?

测试模型和数据

建表

create table a(id int primary key, info text);

create table b(id int primary key, aid int, crt_time timestamp);
create index b_aid on b(aid);

插入测试数据

-- a表插入1000条
insert into a select generate_series(0,1000), md5(random()::text);

-- b表插入1000万条,只包含aid的901个id。
insert into b select generate_series(1,10000000), random()*900, clock_timestamp();

参考SQL和查询性能

下面两条SQL都是满足查询条件的SQL,由PostgreSQL 10给出。

代表了没有做任何优化的情况下的查询性能。

postgres=# explain (analyze,timing) select * from a where id not in (select aid from b); 
                                                       QUERY PLAN                                                       
------------------------------------------------------------------------------------------------------------------------
 Seq Scan on a  (cost=179053.25..179074.76 rows=500 width=37) (actual time=4369.478..4369.525 rows=100 loops=1)
   Filter: (NOT (hashed SubPlan 1))
   Rows Removed by Filter: 901
   SubPlan 1
     ->  Seq Scan on b  (cost=0.00..154053.60 rows=9999860 width=4) (actual time=0.322..1829.342 rows=10000000 loops=1)
 Planning time: 0.094 ms
 Execution time: 4423.364 ms
(7 rows)

postgres=# explain (analyze,timing) select a.* from a left join b on (a.id=b.aid) where b.* is null;
                                                      QUERY PLAN                                                       
-----------------------------------------------------------------------------------------------------------------------
 Hash Right Join  (cost=31.52..280244.69 rows=49999 width=37) (actual time=4316.767..4316.790 rows=100 loops=1)
   Hash Cond: (b.aid = a.id)
   Filter: (b.* IS NULL)
   Rows Removed by Filter: 10000000
   ->  Seq Scan on b  (cost=0.00..154053.60 rows=9999860 width=44) (actual time=0.013..2544.321 rows=10000000 loops=1)
   ->  Hash  (cost=19.01..19.01 rows=1001 width=37) (actual time=0.342..0.342 rows=1001 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 76kB
         ->  Seq Scan on a  (cost=0.00..19.01 rows=1001 width=37) (actual time=0.009..0.137 rows=1001 loops=1)
 Planning time: 0.173 ms
 Execution time: 4316.828 ms
(10 rows)

等你来战

胜利提示语

请使用 PostgreSQL 黑科技。

越老越有SQL优化基因

《带你走进179个SQL优化场景》

参与话题

奖品区域 活动规则 已 结束

  • 奖品一

    淘公仔 x 1

  • 奖品二

    虾米VIP季卡 x 1

  • 奖品三

    优酷VIP月卡 x 1

25个回答

4

阿弟@pgsql 已获得淘公仔 复制链接去分享

postgres=# explain analyze select from (select a.,(select aid from b where b.aid=a.id limit 1) as aid from a ) as t where t.aid is null;

                                                            QUERY PLAN                                                                 

Seq Scan on a (cost=0.00..4137.84 rows=5 width=41) (actual time=18.232..18.904 rows=100 loops=1)
Filter: ((SubPlan 2) IS NULL)
Rows Removed by Filter: 901
SubPlan 1

 ->  Limit  (cost=0.43..4.09 rows=1 width=4) (actual time=0.003..0.003 rows=0 loops=100)
       ->  Index Only Scan using b_aid on b  (cost=0.43..40614.63 rows=11099 width=4) (actual time=0.002..0.002 rows=0 loops=100)
             Index Cond: (aid = a.id)
             Heap Fetches: 0

SubPlan 2

 ->  Limit  (cost=0.43..4.09 rows=1 width=4) (actual time=0.017..0.017 rows=1 loops=1001)
       ->  Index Only Scan using b_aid on b b_1  (cost=0.43..40614.63 rows=11099 width=4) (actual time=0.017..0.017 rows=1 loops=1001)
             Index Cond: (aid = a.id)
             Heap Fetches: 901

Planning time: 0.297 ms
Execution time: 18.980 ms
(15 rows)

德哥 回复

阿弟老司机. A表数据量小,查询A表的QUERY中使用SUB QUERY使得SUB QUERY的扫描次数下降到与A行数一致,SUB QUERY中采用LIMIT 1限定返回数,is null限定得出B表中未出现的aid。妙!!!

阿弟@pgsql 回复

再变种一下还可以写成这样,pg的sql语法确实强大

postgres=# explain analyze select * from a where (select aid from b where b.aid=a.id limit 1) is null;

                                                          QUERY PLAN                                                               

Seq Scan on a (cost=0.00..4117.37 rows=5 width=37) (actual time=18.346..18.699 rows=100 loops=1)
Filter: ((SubPlan 1) IS NULL)
Rows Removed by Filter: 901
SubPlan 1

 ->  Limit  (cost=0.43..4.09 rows=1 width=4) (actual time=0.017..0.018 rows=1 loops=1001)
       ->  Index Only Scan using b_aid on b  (cost=0.43..40614.63 rows=11099 width=4) (actual time=0.017..0.017 rows=1 loops=1001)
             Index Cond: (aid = a.id)
             Heap Fetches: 901

Planning time: 0.181 ms
Execution time: 18.755 ms
(10 rows)

德哥 回复

赞赞赞!!!PG的语法真强大。

teewon 回复

感觉把问题复杂化了,其实PG使用NOT EXISTS就实现你需要的,b表直接走Index Only Scan,b表数据再大妈妈也不用担心了

评论
1

feicai 已获得优酷VIP月卡 复制链接去分享

sysbench=# explain (analyze,verbose,timing,costs,buffers)
sysbench-# select * from a where id not in
sysbench-# (
sysbench(# with recursive c as (
sysbench(# (
sysbench(# select min(aid) aid from b where aid is not null
sysbench(# )
sysbench(# union all
sysbench(# (
sysbench(# select (select min(aid) aid from b where b.aid > c.aid and b.aid is not null)
sysbench(# from c where c.aid is not null
sysbench(# )
sysbench(# )
sysbench(# select aid from c where aid is not null
sysbench(# );

                                                                              QUERY PLAN                                                                                  

Seq Scan on public.a (cost=61.17..82.68 rows=500 width=37) (actual time=17.876..17.905 rows=100 loops=1)
Output: a.id, a.info
Filter: (NOT (hashed SubPlan 5))
Rows Removed by Filter: 901
Buffers: shared hit=3616
SubPlan 5

 ->  CTE Scan on c c_1  (cost=58.90..60.92 rows=100 width=4) (actual time=0.045..17.191 rows=901 loops=1)
       Output: c_1.aid
       Filter: (c_1.aid IS NOT NULL)
       Rows Removed by Filter: 1
       Buffers: shared hit=3607
       CTE c
         ->  Recursive Union  (cost=0.49..58.90 rows=101 width=4) (actual time=0.041..16.807 rows=902 loops=1)
               Buffers: shared hit=3607
               ->  Result  (cost=0.49..0.50 rows=1 width=4) (actual time=0.040..0.041 rows=1 loops=1)
                     Output: $1
                     Buffers: shared hit=4
                     InitPlan 3 (returns $1)
                       ->  Limit  (cost=0.43..0.49 rows=1 width=4) (actual time=0.037..0.037 rows=1 loops=1)
                             Output: b_1.aid
                             Buffers: shared hit=4
                             ->  Index Only Scan using b_aid on public.b b_1  (cost=0.43..530481.88 rows=9999860 width=4) (actual time=0.036..0.036 rows=1 loops=1)
                                   Output: b_1.aid
                                   Index Cond: (b_1.aid IS NOT NULL)
                                   Heap Fetches: 1
                                   Buffers: shared hit=4
               ->  WorkTable Scan on c  (cost=0.00..5.64 rows=10 width=4) (actual time=0.018..0.018 rows=1 loops=902)
                     Output: (SubPlan 2)
                     Filter: (c.aid IS NOT NULL)
                     Rows Removed by Filter: 0
                     Buffers: shared hit=3603
                     SubPlan 2
                       ->  Result  (cost=0.53..0.54 rows=1 width=4) (actual time=0.017..0.017 rows=1 loops=901)
                             Output: $3
                             Buffers: shared hit=3603
                             InitPlan 1 (returns $3)
                               ->  Limit  (cost=0.43..0.53 rows=1 width=4) (actual time=0.017..0.017 rows=1 loops=901)
                                     Output: b.aid
                                     Buffers: shared hit=3603
                                     ->  Index Only Scan using b_aid on public.b  (cost=0.43..329310.05 rows=3333287 width=4) (actual time=0.016..0.016 rows=1 loops=901)
                                           Output: b.aid
                                           Index Cond: ((b.aid > c.aid) AND (b.aid IS NOT NULL))
                                           Heap Fetches: 900
                                           Buffers: shared hit=3603

Planning time: 0.268 ms
Execution time: 17.982 ms
(46 rows)

被人吐槽这么老的数据库居然有这表现,我很吃惊

德哥 回复

递归查询,2009年PostgreSQL 8.4引入的特性,赞!!!!!

评论
1

1108094853475098 已获得虾米VIP季卡 复制链接去分享

explain (analyze,timing)with recursive temp as

(select min(aid)as aid from b where aid is not null)
union all
(select
(select min(aid)from b where b.aid > s.aid and b.aid is not null)from temp s where s.aid is not null

)
Select from a left join (select distinct aid as aid from temp) b on (a.id = b.aid) where b. is null

打字打到手残……

德哥 回复

递归查询,2009年PostgreSQL 8.4引入的特性,赞!!!!!

1、递归查询优化,A表全扫,B表索引扫描了若干次(若干 = 唯一AID在B中出现的次数)。

2、SUB QUERY优化,A表全扫,B表索引扫描了若干次(若干 = A表记录数)。

由于B表都是索引扫,两种方法差别不大(递归扫描的次数更少一丝丝)。

评论
1

1175800094894991 复制链接去分享

explain analyze select * from a where not exists (select aid from b where b.aid = a.id)

阿弟@pgsql 回复

exists 特性很赞,比较聪明

橘子小卜 回复

看不懂

郝瑞 回复

在查询的时候,exits 存在即break, 所以速度要比in快一个数量级。

评论
0

1175800094894991 复制链接去分享

select * from a where id not in (select aid from b where b.aid=a.id)

阿弟@pgsql 回复

这个需要变种一下,加个limit 1

explain analyze select * from a where id not in (select aid from b where b.aid=a.id limit 1);

郝瑞 回复

老司机就是不一样!

评论
1

1176300248998312 复制链接去分享

postgres=# explain analyze select from (select a.,(select aid from b where b.aid=a.id limit 1) as aid from a ) as t where t.aid is null;
QUERY PLAN
Seq Scan on a (cost=0.00..4137.84 rows=5 width=41) (actual time=18.232..18.904 rows=100 loops=1)
Filter: ((SubPlan 2) IS NULL)
Rows Removed by Filter: 901
SubPlan 1
-> Limit (cost=0.43..4.09 rows=1 width=4) (actual time=0.003..0.003 rows=0 loops=100)
-> Index Only Scan using b_aid on b (cost=0.43..40614.63 rows=11099 width=4) (actual time=0.002..0.002 rows=0 loops=100)
Index Cond: (aid = a.id)
Heap Fetches: 0
SubPlan 2
-> Limit (cost=0.43..4.09 rows=1 width=4) (actual time=0.017..0.017 rows=1 lps=1001)
-> Index Only Scan using b_aid on b b_1 (cost=0.43..40614.63 rows=11099 width=4) (actual time=0.017..0.017 rows=1 loops=1001)
Index Cond: (aid = a.id)
Heap Fetches: 901
Planning time: 0.297 ms
Execution time: 18.980 ms
(15 rows)

0

1799193019143462 复制链接去分享

厉害,大佬

0

1357186084506931 复制链接去分享

不知道该从何学起

0

宇长龙 复制链接去分享

无话可说😶!

0

黑导游 复制链接去分享

Mark

0

中秋节到了 复制链接去分享

五颜六色,看着很让人激动。啊

0

定海神棍 复制链接去分享

上一篇sql赛,我学了mysql,还没玩溜,这又来postgre。。好,我认了,我抽空去学!(我想问,有没有Oracle的优化啊)

0

teewon 复制链接去分享

explain (analyze,timing) select id from a where not exists(select 1 from b where a.id = b.aid);

Nested Loop Anti Join (cost=0.43..884.94 rows=100 width=4) (actual time=16.771..17.044 rows=100 loops=1)
-> Seq Scan on a (cost=0.00..19.01 rows=1001 width=4) (actual time=0.007..0.231 rows=1001 loops=1)
-> Index Only Scan using b_aid on b (cost=0.43..352.62 rows=11099 width=4) (actual time=0.016..0.016 rows=1 loops=1001)

    Index Cond: (aid = a.id)
    Heap Fetches: 901

Planning time: 0.246 ms
Execution time: 17.084 ms

0

过目就忘 复制链接去分享

只能看懂aoe

0

古极 复制链接去分享

我就想问话题怎么收藏?

0

鹏ge 复制链接去分享

赞一个

0

鹏ge 复制链接去分享

点赞👍

0

1349398120564520 复制链接去分享

打赏,打赏

0

1176300248998312 复制链接去分享

0

1993467976560875 复制链接去分享

为什么有强迫症的我,看着有点乱,特别想替他整理一下?

2