聊聊between and的坑和神奇的解法-阿里云开发者社区

背景

曾经一位社区的兄弟跟我抱怨MySQL里面查IP地址库并发几千每秒的查询数据库就抗不住了。

于是问他要来了他们的IP地址库数据和查询用的SQL以及MySQL里面的表结构。

我也想把数据转到PostgreSQL里面做一下相对应的压力测试，看看PostgreSQL的表现。

在其他的业务中，这样的需求也是屡见不鲜，比如年龄范围，收入范围，频繁活动的范围，地理位置区块，几何区块，线段等。都是用范围来描述的，随着物联网的发展，这类查询需求会越来越旺盛。

如果没有好的索引机制，查询需要消耗大量的CPU，很容易出现性能瓶颈。

本文要给大家介绍的是PostgreSQL 9.2引入的范围类型，以及针对范围类型的索引，大幅提升范围查询的性能。

该业务的数据模型介绍

MySQL里面的表结构如下 :

CREATE TABLE ip_address_pool (  
  id int(10) NOT NULL AUTO_INCREMENT COMMENT '自增主键',  
  start_ip varchar(20) NOT NULL COMMENT '起始ip',  
  end_ip varchar(20) NOT NULL COMMENT '截止ip',  
  province varchar(128) NOT NULL COMMENT '省名',  
  city varchar(128) NOT NULL COMMENT '城市',  
  region_name varchar(128) NOT NULL COMMENT '地区名',  
  company_name varchar(128) NOT NULL COMMENT '公司名',  
  start_ip_decimal bigint(10) DEFAULT NULL,  
  end_ip_decimal bigint(10) DEFAULT NULL,  
  PRIMARY KEY (id),  
  KEY idx_start_ip_Decimal (start_ip_decimal),  
  KEY idx_end_ip_Decimal (end_ip_decimal)  
) ENGINE=InnoDB AUTO_INCREMENT=436820 DEFAULT CHARSET=utf8 COMMENT='ip地址对应表';

MySQL里面的查询SQL如下 :

select   
  province,  
  start_ip_Decimal as startIpDecimal,  
  end_ip_Decimal as endIpDecimal  
  from ip_address_pool  
  where  
  #{ip}>=start_ip_Decimal and  
  #{ip}<=end_ip_Decimal;

数据量大概40W.

由于MySQL里面没有IP地址类型, 所以需要把IP地址转换为数值类型来存储并用来做IP地址范围的匹配.

IP地址的转换算法是32位的二进制IP地址转10进制数字。

PostgreSQL的神奇类型range

在PostgreSQL9.2里面新增了一种数据类型，range类型, 例如可以用来存储ip地址范围, int值的范围.

range类型具体的应用可以参见

《PostgreSQL 9.2 NEW Type, range》

这个应用场景, PostgreSQL有三种选择来实现它

1. 和MySQL里面一样, 使用两个字段分别存储起始的IP数字

2. 使用iprange, 直接存储IP地址范围

3. 使用int8range, 存储转换后的数字范围

这个CASE PostgreSQL 的三种解法

方法1. 和MySQL里面一样, 使用两个字段分别存储起始的IP数字

CREATE TABLE ip_address_pool (  
  id serial8 primary key,  
  start_ip inet NOT NULL ,  
  end_ip inet NOT NULL ,  
  province varchar(128) NOT NULL ,  
  city varchar(128) NOT NULL ,  
  region_name varchar(128) NOT NULL ,  
  company_name varchar(128) NOT NULL ,  
  start_ip_decimal bigint ,  
  end_ip_decimal bigint   
) ;

以下索引其实只需要建一个就够了, PostgreSQL btree索引支持>=,>,<=,<,=等几种操作符, 同时IP地址段也不存在交叠的情况.

create index idx_ip_address_pool_sip on ip_address_pool (start_ip_decimal);  
create index idx_ip_address_pool_eip on ip_address_pool (end_ip_decimal);

思考题, 如何避免表内数据交错(PostgreSQL 可以使用排他约束)

如何避免交叠呢, 在并发的情况下插入和更新ip_address_pool表, 是没有办法避免交叠情况的发生的, 例如

1. 最low的方法，先查询需要插入的IP地址段是否已经存在表里面

2. 不存在则插入. 但是这里存在一个问题, 并发的情况下, 多个进程都可能认为插入的数据不存在, 都插入了, 但是并发插入的数据中可能有交叠的.

3. 在MYSQL中也许只能使用全表堵塞式锁来避免这个问题

4. 在PostgreSQL中则不需要全表锁, 因为可以使用range类型, 建立range类型的 exclude 约束, 例如：

digoal=# alter table ip_info add constraint ck_exclude_iprange exclude using gist(location with =, iprange with &&);  
NOTICE:  ALTER TABLE / ADD EXCLUDE will create implicit index "ck_exclude_iprange" for table "ip_info"  
ALTER TABLE

这个约束还可以用于地理位置数据，比如在数据库中存储的版图有相交时，违反约束报错。

方法2. 使用iprange, 直接存储IP地址范围, 并使用GiST索引

create type iprange as range (subtype=inet);  
  
CREATE TABLE ip_address_pool_2 (  
  id serial8 primary key,  
  ip_segment iprange NOT NULL ,  
  province varchar(128) NOT NULL ,  
  city varchar(128) NOT NULL ,  
  region_name varchar(128) NOT NULL ,  
  company_name varchar(128) NOT NULL  
) ;  
  
CREATE INDEX ip_address_pool_2_range ON ip_address_pool_2 USING gist (ip_segment);

方法3. 使用int8range, 存储转换后的数字范围

CREATE TABLE ip_address_pool_3 (  
  id serial8 primary key,  
  start_ip inet NOT NULL ,  
  end_ip inet NOT NULL ,  
  province varchar(128) NOT NULL ,  
  city varchar(128) NOT NULL ,  
  region_name varchar(128) NOT NULL ,  
  company_name varchar(128) NOT NULL ,  
  ip_decimal_segment int8range  
) ;

从第一个表把数据转换成range类型并存储到这个表，并使用GiST索引

insert into ip_address_pool_3 (id,start_ip,end_ip,province,city,region_name,company_name,ip_decimal_segment) select id,start_ip,end_ip,province,city,region_name,company_name,int8range(start_ip_decimal,end_ip_decimal+1) from ip_address_pool;  
  
CREATE INDEX ip_address_pool_3_range ON ip_address_pool_3 USING gist (ip_decimal_segment);

PostgreSQL三个解法的性能对比

方法1

测试脚本:

\setrandom ip 0 2094967294  
select province, start_ip_Decimal as startIpDecimal, end_ip_Decimal as endIpDecimal from ip_address_pool where :ip>=start_ip_Decimal and :ip<=end_ip_Decimal;

测试结果:

pg92@db-172-16-3-33-> pgbench -M prepared -c 8 -j 8 -f ./ip_test.sql -n -T 60 -h 127.0.0.1 -U postgres postgres  
transaction type: Custom query  
scaling factor: 1  
query mode: simple  
number of clients: 8  
number of threads: 8  
duration: 60 s  
number of transactions actually processed: 20389  
tps = 339.576580 (including connections establishing)  
tps = 339.618604 (excluding connections establishing)

为什么只有300多呢？原因是建立的不是复合索引, 注意因为这里使用的是范围检索, 不是= , 所以检索速度和取值范围关系很大, 分别取三个值, 从小到大. 来看看查询耗时.

postgres=# explain analyze select province, start_ip_Decimal as startIpDecimal, end_ip_Decimal as endIpDecimal from ip_address_pool where 1>=start_ip_Decimal and 1<=end_ip_Decimal;  
                                                                          QUERY PLAN                                                  
                            
------------------------------------------------------------------------------------------------------------------------------------  
 Index Scan using idx_ip_address_pool_sip on ip_address_pool  (cost=10000000000.00..10000000004.51 rows=1 width=22) (actual time=0.004..0.004 rows=1 loops=1)  
   Index Cond: (1 >= start_ip_decimal)  
   Filter: (1 <= end_ip_decimal)  
 Total runtime: 0.014 ms  
(4 rows)  
  
postgres=# explain analyze select province, start_ip_Decimal as startIpDecimal, end_ip_Decimal as endIpDecimal from ip_address_pool where 1123371940>=start_ip_Decimal and 1123371940<=end_ip_Decimal;  
                                                                    QUERY PLAN                                                        
                 
------------------------------------------------------------------------------------------------------------------------------------  
 Index Scan using idx_ip_address_pool_sip on ip_address_pool  (cost=0.00..3899.49 rows=75277 width=22) (actual time=37.572..37.573 rows=1 loops=1)  
   Index Cond: (1123371940 >= start_ip_decimal)  
   Filter: (1123371940 <= end_ip_decimal)  
   Rows Removed by Filter: 96523  
 Total runtime: 37.604 ms  
(5 rows)  
  
postgres=# explain analyze select province, start_ip_Decimal as startIpDecimal, end_ip_Decimal as endIpDecimal from ip_address_pool where 4123371940>=start_ip_Decimal and 4123371940<=end_ip_Decimal;  
                                                                     QUERY PLAN                                                       
                   
------------------------------------------------------------------------------------------------------------------------------------  
 Index Scan using idx_ip_address_pool_sip on ip_address_pool  (cost=0.00..17557.23 rows=1251 width=22) (actual time=168.138..168.139 rows=1 loops=1)  
   Index Cond: (4123371940::bigint >= start_ip_decimal)  
   Filter: (4123371940::bigint <= end_ip_decimal)  
   Rows Removed by Filter: 436810  
 Total runtime: 168.165 ms  
(5 rows)

建立复合索引,

create index idx_ip_address_pool_ip on ip_address_pool (start_ip_decimal,end_ip_decimal);

建完后还是分三个值来测试一下响应时间 :

postgres=# explain analyze select province, start_ip_Decimal as startIpDecimal, end_ip_Decimal as endIpDecimal from ip_address_pool where 1>=start_ip_Decimal and 1<=end_ip_Decimal;  
                                                               QUERY PLAN                                                             
       
------------------------------------------------------------------------------------------------------------------------------------  
 Index Scan using idx_ip_address_pool_ip on ip_address_pool  (cost=0.00..4.61 rows=1 width=22) (actual time=0.004..0.005 rows=1 loops=1)  
   Index Cond: ((1 >= start_ip_decimal) AND (1 <= end_ip_decimal))  
 Total runtime: 0.014 ms  
(3 rows)  
  
postgres=# explain analyze select province, start_ip_Decimal as startIpDecimal, end_ip_Decimal as endIpDecimal from ip_address_pool where 1123371940>=start_ip_Decimal and 1123371940<=end_ip_Decimal;  
                                                                   QUERY PLAN                                                         
              
------------------------------------------------------------------------------------------------------------------------------------  
 Index Scan using idx_ip_address_pool_ip on ip_address_pool  (cost=0.00..8754.53 rows=75277 width=22) (actual time=5.995..5.996 rows=1 loops=1)  
   Index Cond: ((1123371940 >= start_ip_decimal) AND (1123371940 <= end_ip_decimal))  
 Total runtime: 6.017 ms  
(3 rows)  
  
postgres=# explain analyze select province, start_ip_Decimal as startIpDecimal, end_ip_Decimal as endIpDecimal from ip_address_pool where 4123371940>=start_ip_Decimal and 4123371940<=end_ip_Decimal;  
                                                                   QUERY PLAN                                                         
               
------------------------------------------------------------------------------------------------------------------------------------  
 Index Scan using idx_ip_address_pool_ip on ip_address_pool  (cost=0.00..8737.49 rows=1251 width=22) (actual time=27.042..27.044 rows=1 loops=1)  
   Index Cond: ((4123371940::bigint >= start_ip_decimal) AND (4123371940::bigint <= end_ip_decimal))  
 Total runtime: 27.079 ms  
(3 rows)

那么它的TPS能达到多少呢?

pg92@db-172-16-3-33-> pgbench -M prepared -c 8 -j 8 -f ./ip_test.sql -n -T 60 -h 127.0.0.1 -U postgres postgres  
transaction type: Custom query  
scaling factor: 1  
query mode: prepared  
number of clients: 8  
number of threads: 8  
duration: 60 s  
number of transactions actually processed: 216400  
tps = 3606.368660 (including connections establishing)  
tps = 3606.821632 (excluding connections establishing)

有10倍提高, 但是还远远不够.

方法3

测试脚本:

\setrandom ip 0 2094967294  
select province,ip_decimal_segment  from ip_address_pool_3 where ip_decimal_segment @> :ip::int8;

测试结果:

pg92@db-172-16-3-33-> pgbench -M simple -c 8 -j 8 -f ./ip_test.sql -n -T 60 -h 127.0.0.1 -U postgres postgres  
transaction type: Custom query  
scaling factor: 1  
query mode: simple  
number of clients: 8  
number of threads: 8  
duration: 60 s  
number of transactions actually processed: 3498195  
tps = 58301.468890 (including connections establishing)  
tps = 58307.865068 (excluding connections establishing)

使用prepared即绑定变量还能有提升(这是2010年左右购买的HP DL360 8核机器上的测试表现), 如下

pg92@db-172-16-3-33-> pgbench -M prepared -c 8 -j 8 -f ./ip_test.sql -n -T 60 -h 127.0.0.1 -U postgres postgres  
transaction type: Custom query  
scaling factor: 1  
query mode: prepared  
number of clients: 8  
number of threads: 8  
duration: 60 s  
number of transactions actually processed: 4810415  
tps = 80171.925111 (including connections establishing)  
tps = 80180.458975 (excluding connections establishing)

使用range类型还是测试一下那三个值的耗时, 分布就比较均匀了.

postgres=# explain analyze select province,ip_decimal_segment  from ip_address_pool_3 where ip_decimal_segment @> int8 '1';  
                                                                   QUERY PLAN                                                         
              
------------------------------------------------------------------------------------------------------------------------------------  
 Index Scan using ip_address_pool_3_range on ip_address_pool_3  (cost=0.00..862.55 rows=437 width=38) (actual time=0.034..0.035 rows=1 loops=1)  
   Index Cond: (ip_decimal_segment @> 1::bigint)  
 Total runtime: 0.045 ms  
(3 rows)  
  
postgres=# explain analyze select province,ip_decimal_segment  from ip_address_pool_3 where ip_decimal_segment @> int8 '1123371940';  
                                                                   QUERY PLAN                                                         
              
------------------------------------------------------------------------------------------------------------------------------------  
 Index Scan using ip_address_pool_3_range on ip_address_pool_3  (cost=0.00..862.55 rows=437 width=38) (actual time=0.036..0.036 rows=1 loops=1)  
   Index Cond: (ip_decimal_segment @> 1123371940::bigint)  
 Total runtime: 0.052 ms  
(3 rows)  
  
postgres=# explain analyze select province,ip_decimal_segment  from ip_address_pool_3 where ip_decimal_segment @> int8 '4123371940';  
                                                                   QUERY PLAN                                                         
              
------------------------------------------------------------------------------------------------------------------------------------  
 Index Scan using ip_address_pool_3_range on ip_address_pool_3  (cost=0.00..862.55 rows=437 width=38) (actual time=0.058..0.059 rows=1 loops=1)  
   Index Cond: (ip_decimal_segment @> 4123371940::bigint)  
 Total runtime: 0.069 ms  
(3 rows)

PostgreSQL第四种解法透明优化, 函数索引, 无需变更表结构

1. PostgreSQL支持函数索引，所以我们不需要改表结构就可以使用函数索引来达到加速的目的。

例如 :

CREATE TABLE ip_address_pool (  
  id serial8 primary key,  
  start_ip inet NOT NULL ,  
  end_ip inet NOT NULL ,  
  province varchar(128) NOT NULL ,  
  city varchar(128) NOT NULL ,  
  region_name varchar(128) NOT NULL ,  
  company_name varchar(128) NOT NULL ,  
  start_ip_decimal bigint ,  
  end_ip_decimal bigint   
) ;  
  
create index idx_ip_address_1 on ip_address_pool using index gist (int8range(start_ip_decimal, end_ip_decimal+1::int8));  
select * from ip_address_pool where int8range(start_ip_decimal, end_ip_decimal+1::int8) @> ?;

小结

1. 在PostgreSQL中，使用range类型后，我们对它建立了GiST的索引，这个索引可以快速的根据用户提供的IP地址定位到包含它的行。效率直接提示了20多倍，QPS从几千达到了接近10万。

2. PostgreSQL的range类型除了可以很好的利用它的gist索引作为检索之外, 还可以使用它来做排他约束, 也就是防止数据交叠.

如果没有这种约束的话，需要锁全表来搞定.

3. 使用PostgreSQL存储IP数据的话, 还可以使用掩码, 这样的话就不需要存储两个字段了, 直接存在一个字段就可以.

当然也可以加一个存储比特位的字段, 使用bit函数来处理包含关系.

另一种用法是把这个比特运算放到内存中执行, 内存中存储IP比特位以及对应到数据库的记录的ID信息, 获取ID后去数据库查询, 也就是把数据库的范围查询变成主键查询. 也可以提高效率.

4. 关于GiST索引的原理，可以参考

《从难缠的模糊查询聊开 - PostgreSQL独门绝招之一 GIN , GiST , SP-GiST , RUM 索引原理与技术背景》

聊聊between and的坑和神奇的解法

标签

背景

该业务的数据模型介绍

PostgreSQL的神奇类型range

这个应用场景, PostgreSQL有三种选择来实现它

这个CASE PostgreSQL 的三种解法

方法1. 和MySQL里面一样, 使用两个字段分别存储起始的IP数字

思考题, 如何避免表内数据交错(PostgreSQL 可以使用排他约束)

方法2. 使用iprange, 直接存储IP地址范围, 并使用GiST索引

方法3. 使用int8range, 存储转换后的数字范围

PostgreSQL三个解法的性能对比

方法1

方法3

PostgreSQL第四种解法透明优化, 函数索引, 无需变更表结构

小结

数据库

热门文章

最新文章

相关电子书

聊聊between and的坑 和 神奇的解法

标签

背景

该业务的数据模型介绍

PostgreSQL的神奇类型range

这个应用场景, PostgreSQL有三种选择来实现它

这个CASE PostgreSQL 的三种解法

方法1. 和MySQL里面一样, 使用两个字段分别存储起始的IP数字

思考题, 如何避免表内数据交错(PostgreSQL 可以使用排他约束)

方法2. 使用iprange, 直接存储IP地址范围, 并使用GiST索引

方法3. 使用int8range, 存储转换后的数字范围

PostgreSQL三个解法的性能对比

方法1

方法3

PostgreSQL第四种解法 透明优化, 函数索引, 无需变更表结构

小结

数据库

热门文章

最新文章

相关电子书

聊聊between and的坑和神奇的解法

PostgreSQL第四种解法透明优化, 函数索引, 无需变更表结构