PostreSQL崩溃试验全记录-阿里云开发者社区

PostreSQL崩溃试验全记录

2017-12-18 2037

版权

本文内容由阿里云实名注册用户自发贡献，版权归原作者所有，阿里云开发者社区不拥有其著作权，亦不承担相应法律责任。具体规则请查看《阿里云开发者社区用户服务协议》和《阿里云开发者社区知识产权保护指引》。如果您发现本社区中有涉嫌抄袭的内容，填写侵权投诉表单进行举报，一经查实，本社区将立刻删除涉嫌侵权内容。

本文涉及的产品

云原生数据库 PolarDB MySQL 版，Serverless 5000PCU 100GB

简介：

由于客户提到，运行某些大运算量批处理时，PostgreSQL突发性使用大量资源的问题，

进行了如下的调查，发现确实发生了崩溃。PostgreSQL 需要资源控制方案啊。

现在正在考虑是否必须要用 os 级别的限制方案：

过程如下：

测试环境：

内存：大约1024MB

postgresql.conf 设定：

使用缺省值：checkpoint_segments =3 shard_buffers=32MB

这些是有意为之，就是想看看数据量大、shared_buffer小的状况：

建立表(一条记录大约1024字节)：

postgres=# create table test01(id integer, val char(1024));

向表中插入大量数据(总共插入2400MB)

postgres=# insert into test01 values(generate_series(1,2457600),repeat( chr(int4(random()*26)+65),1024));

插入操作会花费一些时间，此时用ps 命令看PostgreSQL个进程对内存使用状况：‘

[root@server ~]# ps aux | grep post

root      3180  0.0  0.0 105296   712 pts/1    S    16:31   0:00 su - postgres

postgres  3181  0.0  0.0  70304   676 pts/1    S+   16:31   0:00 -bash

postgres  3219  0.0  0.2 113644  2864 pts/1    S    16:32   0:00 /usr/local/pgsql/bin/postgres -D /gao/data

postgres  3221  0.4  3.0 113724 35252 ?        Ss   16:32   0:01 postgres: writer process                  

postgres  3222  0.2  0.1 113644  1616 ?        Ds   16:32   0:00 postgres: wal writer process              

postgres  3223  0.0  0.0 114380  1148 ?        Ss   16:32   0:00 postgres: autovacuum launcher process     

postgres  3224  0.0  0.0  73332   472 ?        Ss   16:32   0:00 postgres: stats collector process         

root      3252  0.0  0.0 105296   712 pts/2    S    16:32   0:00 su - postgres

postgres  3253  0.0  0.0  70304   676 pts/2    S    16:32   0:00 -bash

postgres  3285  0.0  0.0  83488   740 pts/2    S+   16:32   0:00 ./psql

postgres  3286 14.8 80.2 2598332 924308 ?      Ds   16:32   0:35 postgres: postgres postgres [local] INSERT

root      3333  0.0  0.0  65424   812 pts/3    S+   16:36   0:00 grep post

可以发现，INSERT操作正在消耗80%以上的内存。

再等片刻，发现psql端传来反馈：

WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and repeat your command.
The connection to the server was lost. Attempting reset: Failed.
!>

此时看看Log，可以看到：background writer (3321) 已经被干掉了，所有连接被重置。

LOG: autovacuum launcher started
LOG: database system is ready to accept connections
LOG: checkpoints are occurring too frequently (2 seconds apart)
HINT: Consider increasing the configuration parameter "checkpoint_segments".
LOG: checkpoints are occurring too frequently (1 second apart)
HINT: Consider increasing the configuration parameter "checkpoint_segments".
LOG: checkpoints are occurring too frequently (2 seconds apart)
HINT: Consider increasing the configuration parameter "checkpoint_segments".
LOG: checkpoints are occurring too frequently (1 second apart)
HINT: Consider increasing the configuration parameter "checkpoint_segments".
LOG: checkpoints are occurring too frequently (2 seconds apart)
HINT: Consider increasing the configuration parameter "checkpoint_segments".
LOG: checkpoints are occurring too frequently (1 second apart)
HINT: Consider increasing the configuration parameter "checkpoint_segments".
LOG: checkpoints are occurring too frequently (2 seconds apart)
HINT: Consider increasing the configuration parameter "checkpoint_segments".
LOG: checkpoints are occurring too frequently (4 seconds apart)
HINT: Consider increasing the configuration parameter "checkpoint_segments".
LOG: checkpoints are occurring too frequently (10 seconds apart)
HINT: Consider increasing the configuration parameter "checkpoint_segments".
LOG: checkpoints are occurring too frequently (9 seconds apart)
HINT: Consider increasing the configuration parameter "checkpoint_segments".
LOG: checkpoints are occurring too frequently (7 seconds apart)
HINT: Consider increasing the configuration parameter "checkpoint_segments".
LOG: checkpoints are occurring too frequently (9 seconds apart)
HINT: Consider increasing the configuration parameter "checkpoint_segments".
LOG: checkpoints are occurring too frequently (7 seconds apart)
HINT: Consider increasing the configuration parameter "checkpoint_segments".
LOG: checkpoints are occurring too frequently (5 seconds apart)
HINT: Consider increasing the configuration parameter "checkpoint_segments".
LOG: checkpoints are occurring too frequently (5 seconds apart)
HINT: Consider increasing the configuration parameter "checkpoint_segments".
LOG: checkpoints are occurring too frequently (10 seconds apart)
HINT: Consider increasing the configuration parameter "checkpoint_segments".
LOG: checkpoints are occurring too frequently (11 seconds apart)
HINT: Consider increasing the configuration parameter "checkpoint_segments".
LOG: checkpoints are occurring too frequently (15 seconds apart)
HINT: Consider increasing the configuration parameter "checkpoint_segments".
LOG: checkpoints are occurring too frequently (23 seconds apart)
HINT: Consider increasing the configuration parameter "checkpoint_segments".
LOG: checkpoints are occurring too frequently (7 seconds apart)
HINT: Consider increasing the configuration parameter "checkpoint_segments".
LOG: checkpoints are occurring too frequently (8 seconds apart)
HINT: Consider increasing the configuration parameter "checkpoint_segments".
LOG: checkpoints are occurring too frequently (13 seconds apart)
HINT: Consider increasing the configuration parameter "checkpoint_segments".
LOG: checkpoints are occurring too frequently (7 seconds apart)
HINT: Consider increasing the configuration parameter "checkpoint_segments".
LOG: checkpoints are occurring too frequently (8 seconds apart)
HINT: Consider increasing the configuration parameter "checkpoint_segments".
LOG: checkpoints are occurring too frequently (8 seconds apart)
HINT: Consider increasing the configuration parameter "checkpoint_segments".
LOG: checkpoints are occurring too frequently (8 seconds apart)
HINT: Consider increasing the configuration parameter "checkpoint_segments".
LOG: checkpoints are occurring too frequently (4 seconds apart)
HINT: Consider increasing the configuration parameter "checkpoint_segments".
LOG: checkpoints are occurring too frequently (5 seconds apart)
HINT: Consider increasing the configuration parameter "checkpoint_segments".
LOG: checkpoints are occurring too frequently (9 seconds apart)
HINT: Consider increasing the configuration parameter "checkpoint_segments".
LOG: background writer process (PID 3221) was terminated by signal 9: Killed
LOG: terminating any other active server processes
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and repeat your command.
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and repeat your command.
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and repeat your command.
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and repeat your command.
LOG: all server processes terminated; reinitializing
FATAL: the database system is in recovery mode
LOG: database system was interrupted; last known up at 2013-08-30 16:36:42 CST
LOG: database system was not properly shut down; automatic recovery in progress
LOG: consistent recovery state reached at 0/B7657BD0
LOG: redo starts at 0/B60FE2B8
LOG: unexpected pageaddr 0/B044C000 in log file 0, segment 184, offset 4505600
LOG: redo done at 0/B844B940
LOG: autovacuum launcher started
LOG: database system is ready to accept connections

各个进程都重新生成了：

[root@server ~]# ps aux | grep post
root      3180  0.0  0.0 105296   660 pts/1    S    16:31   0:00 su - postgres
postgres  3181  0.0  0.0  70304   628 pts/1    S+   16:31   0:00 -bash
postgres  3219  0.0  0.5 113644  6276 pts/1    S    16:32   0:00 /usr/local/pgsql/bin/postgres -D /gao/data
root      3252  0.0  0.0 105296   660 pts/2    S    16:32   0:00 su - postgres
postgres  3253  0.0  0.0  70304   628 pts/2    S    16:32   0:00 -bash
postgres  3285  0.0  0.0  83488  1144 pts/2    S+   16:32   0:00 ./psql
postgres  3348  0.0  0.0 113644   984 ?        Ss   16:40   0:00 postgres: writer process                  
postgres  3349  0.0  0.0 113644   732 ?        Ss   16:40   0:00 postgres: wal writer process              
postgres  3350  0.0  0.1 114336  1756 ?        Ss   16:40   0:00 postgres: autovacuum launcher process     
postgres  3351  0.0  0.0  73332   580 ?        Ss   16:40   0:00 postgres: stats collector process         
root      3361  0.0  0.0  65420   796 pts/3    R+   16:44   0:00 grep post
[root@server ~]#

回到psql端再看，发现连接已经失效了：

!> \
Invalid command \. Try \? for help.
!> \dt;
You are currently not connected to a database.
!>

根据向社区提问，据说是因为OS级别的OOM错误发生,所以杀死了Postmaster进程。

总之，此种情况表明，如果没有对资源消费总量的控制，那么突发性的用户访问会导致崩溃的。

本文转自健哥的数据花园博客园博客，原文链接：http://www.cnblogs.com/gaojian/p/3291938.html，如需转载请自行联系原作者

PostreSQL崩溃试验全记录

热门文章

最新文章

相关课程

相关电子书

相关实验场景