继<震惊!慢SQL居然能优化到这种速度,我不服!>后,又一轮SQL挑战赛来了。这次是《决战紫禁之巅》,不服来战。
一张小表A,里面存储了一些ID,大约几百个。
(比如说巡逻车辆ID,环卫车辆的ID,公交车,微公交的ID)。
另外有一张日志表B,每条记录中的ID是来自前面那张小表的,但不是每个ID都出现在这张日志表中,比如说一天可能只有几十个ID会出现在这个日志表的当天的数据中。
(比如车辆的行车轨迹数据,每秒上报轨迹,数据量就非常庞大)。
那么我怎么快速的找出今天没有出现的ID呢。
(哪些巡逻车辆没有出现在这个片区,是不是偷懒了?哪些环卫车辆没有出行,哪些公交或微公交没有出行)?
建表
create table a(id int primary key, info text);
create table b(id int primary key, aid int, crt_time timestamp);
create index b_aid on b(aid);
插入测试数据
-- a表插入1000条
insert into a select generate_series(0,1000), md5(random()::text);
-- b表插入1000万条,只包含aid的901个id。
insert into b select generate_series(1,10000000), random()*900, clock_timestamp();
下面两条SQL都是满足查询条件的SQL,由PostgreSQL 10给出。
代表了没有做任何优化的情况下的查询性能。
postgres=# explain (analyze,timing) select * from a where id not in (select aid from b);
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
Seq Scan on a (cost=179053.25..179074.76 rows=500 width=37) (actual time=4369.478..4369.525 rows=100 loops=1)
Filter: (NOT (hashed SubPlan 1))
Rows Removed by Filter: 901
SubPlan 1
-> Seq Scan on b (cost=0.00..154053.60 rows=9999860 width=4) (actual time=0.322..1829.342 rows=10000000 loops=1)
Planning time: 0.094 ms
Execution time: 4423.364 ms
(7 rows)
postgres=# explain (analyze,timing) select a.* from a left join b on (a.id=b.aid) where b.* is null;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
Hash Right Join (cost=31.52..280244.69 rows=49999 width=37) (actual time=4316.767..4316.790 rows=100 loops=1)
Hash Cond: (b.aid = a.id)
Filter: (b.* IS NULL)
Rows Removed by Filter: 10000000
-> Seq Scan on b (cost=0.00..154053.60 rows=9999860 width=44) (actual time=0.013..2544.321 rows=10000000 loops=1)
-> Hash (cost=19.01..19.01 rows=1001 width=37) (actual time=0.342..0.342 rows=1001 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 76kB
-> Seq Scan on a (cost=0.00..19.01 rows=1001 width=37) (actual time=0.009..0.137 rows=1001 loops=1)
Planning time: 0.173 ms
Execution time: 4316.828 ms
(10 rows)
请使用 PostgreSQL 黑科技。
淘公仔 x 1
虾米VIP季卡 x 1
优酷VIP月卡 x 1
阿弟@pgsql
已获得淘公仔
复制链接去分享
postgres=# explain analyze select from (select a.,(select aid from b where b.aid=a.id limit 1) as aid from a ) as t where t.aid is null;
QUERY PLAN
Seq Scan on a (cost=0.00..4137.84 rows=5 width=41) (actual time=18.232..18.904 rows=100 loops=1)
Filter: ((SubPlan 2) IS NULL)
Rows Removed by Filter: 901
SubPlan 1
-> Limit (cost=0.43..4.09 rows=1 width=4) (actual time=0.003..0.003 rows=0 loops=100)
-> Index Only Scan using b_aid on b (cost=0.43..40614.63 rows=11099 width=4) (actual time=0.002..0.002 rows=0 loops=100)
Index Cond: (aid = a.id)
Heap Fetches: 0
SubPlan 2
-> Limit (cost=0.43..4.09 rows=1 width=4) (actual time=0.017..0.017 rows=1 loops=1001)
-> Index Only Scan using b_aid on b b_1 (cost=0.43..40614.63 rows=11099 width=4) (actual time=0.017..0.017 rows=1 loops=1001)
Index Cond: (aid = a.id)
Heap Fetches: 901
Planning time: 0.297 ms
Execution time: 18.980 ms
(15 rows)
feicai
已获得优酷VIP月卡
复制链接去分享
sysbench=# explain (analyze,verbose,timing,costs,buffers)
sysbench-# select * from a where id not in
sysbench-# (
sysbench(# with recursive c as (
sysbench(# (
sysbench(# select min(aid) aid from b where aid is not null
sysbench(# )
sysbench(# union all
sysbench(# (
sysbench(# select (select min(aid) aid from b where b.aid > c.aid and b.aid is not null)
sysbench(# from c where c.aid is not null
sysbench(# )
sysbench(# )
sysbench(# select aid from c where aid is not null
sysbench(# );
QUERY PLAN
Seq Scan on public.a (cost=61.17..82.68 rows=500 width=37) (actual time=17.876..17.905 rows=100 loops=1)
Output: a.id, a.info
Filter: (NOT (hashed SubPlan 5))
Rows Removed by Filter: 901
Buffers: shared hit=3616
SubPlan 5
-> CTE Scan on c c_1 (cost=58.90..60.92 rows=100 width=4) (actual time=0.045..17.191 rows=901 loops=1)
Output: c_1.aid
Filter: (c_1.aid IS NOT NULL)
Rows Removed by Filter: 1
Buffers: shared hit=3607
CTE c
-> Recursive Union (cost=0.49..58.90 rows=101 width=4) (actual time=0.041..16.807 rows=902 loops=1)
Buffers: shared hit=3607
-> Result (cost=0.49..0.50 rows=1 width=4) (actual time=0.040..0.041 rows=1 loops=1)
Output: $1
Buffers: shared hit=4
InitPlan 3 (returns $1)
-> Limit (cost=0.43..0.49 rows=1 width=4) (actual time=0.037..0.037 rows=1 loops=1)
Output: b_1.aid
Buffers: shared hit=4
-> Index Only Scan using b_aid on public.b b_1 (cost=0.43..530481.88 rows=9999860 width=4) (actual time=0.036..0.036 rows=1 loops=1)
Output: b_1.aid
Index Cond: (b_1.aid IS NOT NULL)
Heap Fetches: 1
Buffers: shared hit=4
-> WorkTable Scan on c (cost=0.00..5.64 rows=10 width=4) (actual time=0.018..0.018 rows=1 loops=902)
Output: (SubPlan 2)
Filter: (c.aid IS NOT NULL)
Rows Removed by Filter: 0
Buffers: shared hit=3603
SubPlan 2
-> Result (cost=0.53..0.54 rows=1 width=4) (actual time=0.017..0.017 rows=1 loops=901)
Output: $3
Buffers: shared hit=3603
InitPlan 1 (returns $3)
-> Limit (cost=0.43..0.53 rows=1 width=4) (actual time=0.017..0.017 rows=1 loops=901)
Output: b.aid
Buffers: shared hit=3603
-> Index Only Scan using b_aid on public.b (cost=0.43..329310.05 rows=3333287 width=4) (actual time=0.016..0.016 rows=1 loops=901)
Output: b.aid
Index Cond: ((b.aid > c.aid) AND (b.aid IS NOT NULL))
Heap Fetches: 900
Buffers: shared hit=3603
Planning time: 0.268 ms
Execution time: 17.982 ms
(46 rows)
被人吐槽这么老的数据库居然有这表现,我很吃惊
1108094853475098
已获得虾米VIP季卡
复制链接去分享
explain (analyze,timing)with recursive temp as
(
(select min(aid)as aid from b where aid is not null)
union all
(select
(select min(aid)from b where b.aid > s.aid and b.aid is not null)from temp s where s.aid is not null
)
)
Select from a left join (select distinct aid as aid from temp) b on (a.id = b.aid) where b. is null
打字打到手残……
select * from a where id not in (select aid from b where b.aid=a.id)
老司机就是不一样!
postgres=# explain analyze select from (select a.,(select aid from b where b.aid=a.id limit 1) as aid from a ) as t where t.aid is null;
QUERY PLAN
Seq Scan on a (cost=0.00..4137.84 rows=5 width=41) (actual time=18.232..18.904 rows=100 loops=1)
Filter: ((SubPlan 2) IS NULL)
Rows Removed by Filter: 901
SubPlan 1
-> Limit (cost=0.43..4.09 rows=1 width=4) (actual time=0.003..0.003 rows=0 loops=100)
-> Index Only Scan using b_aid on b (cost=0.43..40614.63 rows=11099 width=4) (actual time=0.002..0.002 rows=0 loops=100)
Index Cond: (aid = a.id)
Heap Fetches: 0
SubPlan 2
-> Limit (cost=0.43..4.09 rows=1 width=4) (actual time=0.017..0.017 rows=1 lps=1001)
-> Index Only Scan using b_aid on b b_1 (cost=0.43..40614.63 rows=11099 width=4) (actual time=0.017..0.017 rows=1 loops=1001)
Index Cond: (aid = a.id)
Heap Fetches: 901
Planning time: 0.297 ms
Execution time: 18.980 ms
(15 rows)
explain (analyze,timing) select id from a where not exists(select 1 from b where a.id = b.aid);
Nested Loop Anti Join (cost=0.43..884.94 rows=100 width=4) (actual time=16.771..17.044 rows=100 loops=1)
-> Seq Scan on a (cost=0.00..19.01 rows=1001 width=4) (actual time=0.007..0.231 rows=1001 loops=1)
-> Index Only Scan using b_aid on b (cost=0.43..352.62 rows=11099 width=4) (actual time=0.016..0.016 rows=1 loops=1001)
Index Cond: (aid = a.id)
Heap Fetches: 901
Planning time: 0.246 ms
Execution time: 17.084 ms
阿弟老司机. A表数据量小,查询A表的QUERY中使用SUB QUERY使得SUB QUERY的扫描次数下降到与A行数一致,SUB QUERY中采用LIMIT 1限定返回数,is null限定得出B表中未出现的aid。妙!!!
再变种一下还可以写成这样,pg的sql语法确实强大
postgres=# explain analyze select * from a where (select aid from b where b.aid=a.id limit 1) is null;
Seq Scan on a (cost=0.00..4117.37 rows=5 width=37) (actual time=18.346..18.699 rows=100 loops=1)
Filter: ((SubPlan 1) IS NULL)
Rows Removed by Filter: 901
SubPlan 1
Planning time: 0.181 ms
Execution time: 18.755 ms
(10 rows)
赞赞赞!!!PG的语法真强大。
感觉把问题复杂化了,其实PG使用NOT EXISTS就实现你需要的,b表直接走Index Only Scan,b表数据再大妈妈也不用担心了