PostreSQL崩溃试验全记录

本文涉及的产品
云原生数据库 PolarDB MySQL 版,Serverless 5000PCU 100GB
简介:

由于客户提到,运行某些大运算量批处理时,PostgreSQL突发性使用大量资源的问题,

进行了如下的调查,发现确实发生了崩溃。PostgreSQL 需要资源控制方案啊。

现在正在考虑是否必须要用 os 级别的限制方案:

过程如下:

测试环境:

内存:大约1024MB

postgresql.conf 设定:

    使用缺省值:checkpoint_segments =3  shard_buffers=32MB

    这些是有意为之,就是想看看数据量大、shared_buffer小的状况:

建立表(一条记录大约1024字节):

    postgres=# create table test01(id integer, val char(1024)); 

向表中插入大量数据(总共插入2400MB)

   postgres=# insert into test01 values(generate_series(1,2457600),repeat( chr(int4(random()*26)+65),1024));

插入操作会花费一些时间,此时用ps 命令看PostgreSQL个进程对内存使用状况:‘

 

复制代码
[root@server ~]# ps aux | grep post

root      3180  0.0  0.0 105296   712 pts/1    S    16:31   0:00 su - postgres

postgres  3181  0.0  0.0  70304   676 pts/1    S+   16:31   0:00 -bash

postgres  3219  0.0  0.2 113644  2864 pts/1    S    16:32   0:00 /usr/local/pgsql/bin/postgres -D /gao/data

postgres  3221  0.4  3.0 113724 35252 ?        Ss   16:32   0:01 postgres: writer process                  

postgres  3222  0.2  0.1 113644  1616 ?        Ds   16:32   0:00 postgres: wal writer process              

postgres  3223  0.0  0.0 114380  1148 ?        Ss   16:32   0:00 postgres: autovacuum launcher process     

postgres  3224  0.0  0.0  73332   472 ?        Ss   16:32   0:00 postgres: stats collector process         

root      3252  0.0  0.0 105296   712 pts/2    S    16:32   0:00 su - postgres

postgres  3253  0.0  0.0  70304   676 pts/2    S    16:32   0:00 -bash

postgres  3285  0.0  0.0  83488   740 pts/2    S+   16:32   0:00 ./psql

postgres  3286 14.8 80.2 2598332 924308 ?      Ds   16:32   0:35 postgres: postgres postgres [local] INSERT

root      3333  0.0  0.0  65424   812 pts/3    S+   16:36   0:00 grep post
复制代码

可以发现,INSERT操作正在消耗80%以上的内存。

再等片刻,发现psql端传来反馈:

WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and repeat your command.
The connection to the server was lost. Attempting reset: Failed.
!> 

此时看看Log,可以看到:background writer (3321) 已经被干掉了,所有连接被重置。

复制代码
LOG:  autovacuum launcher started
LOG:  database system is ready to accept connections
LOG:  checkpoints are occurring too frequently (2 seconds apart)
HINT:  Consider increasing the configuration parameter "checkpoint_segments".
LOG:  checkpoints are occurring too frequently (1 second apart)
HINT:  Consider increasing the configuration parameter "checkpoint_segments".
LOG:  checkpoints are occurring too frequently (2 seconds apart)
HINT:  Consider increasing the configuration parameter "checkpoint_segments".
LOG:  checkpoints are occurring too frequently (1 second apart)
HINT:  Consider increasing the configuration parameter "checkpoint_segments".
LOG:  checkpoints are occurring too frequently (2 seconds apart)
HINT:  Consider increasing the configuration parameter "checkpoint_segments".
LOG:  checkpoints are occurring too frequently (1 second apart)
HINT:  Consider increasing the configuration parameter "checkpoint_segments".
LOG:  checkpoints are occurring too frequently (2 seconds apart)
HINT:  Consider increasing the configuration parameter "checkpoint_segments".
LOG:  checkpoints are occurring too frequently (4 seconds apart)
HINT:  Consider increasing the configuration parameter "checkpoint_segments".
LOG:  checkpoints are occurring too frequently (10 seconds apart)
HINT:  Consider increasing the configuration parameter "checkpoint_segments".
LOG:  checkpoints are occurring too frequently (9 seconds apart)
HINT:  Consider increasing the configuration parameter "checkpoint_segments".
LOG:  checkpoints are occurring too frequently (7 seconds apart)
HINT:  Consider increasing the configuration parameter "checkpoint_segments".
LOG:  checkpoints are occurring too frequently (9 seconds apart)
HINT:  Consider increasing the configuration parameter "checkpoint_segments".
LOG:  checkpoints are occurring too frequently (7 seconds apart)
HINT:  Consider increasing the configuration parameter "checkpoint_segments".
LOG:  checkpoints are occurring too frequently (5 seconds apart)
HINT:  Consider increasing the configuration parameter "checkpoint_segments".
LOG:  checkpoints are occurring too frequently (5 seconds apart)
HINT:  Consider increasing the configuration parameter "checkpoint_segments".
LOG:  checkpoints are occurring too frequently (10 seconds apart)
HINT:  Consider increasing the configuration parameter "checkpoint_segments".
LOG:  checkpoints are occurring too frequently (11 seconds apart)
HINT:  Consider increasing the configuration parameter "checkpoint_segments".
LOG:  checkpoints are occurring too frequently (15 seconds apart)
HINT:  Consider increasing the configuration parameter "checkpoint_segments".
LOG:  checkpoints are occurring too frequently (23 seconds apart)
HINT:  Consider increasing the configuration parameter "checkpoint_segments".
LOG:  checkpoints are occurring too frequently (7 seconds apart)
HINT:  Consider increasing the configuration parameter "checkpoint_segments".
LOG:  checkpoints are occurring too frequently (8 seconds apart)
HINT:  Consider increasing the configuration parameter "checkpoint_segments".
LOG:  checkpoints are occurring too frequently (13 seconds apart)
HINT:  Consider increasing the configuration parameter "checkpoint_segments".
LOG:  checkpoints are occurring too frequently (7 seconds apart)
HINT:  Consider increasing the configuration parameter "checkpoint_segments".
LOG:  checkpoints are occurring too frequently (8 seconds apart)
HINT:  Consider increasing the configuration parameter "checkpoint_segments".
LOG:  checkpoints are occurring too frequently (8 seconds apart)
HINT:  Consider increasing the configuration parameter "checkpoint_segments".
LOG:  checkpoints are occurring too frequently (8 seconds apart)
HINT:  Consider increasing the configuration parameter "checkpoint_segments".
LOG:  checkpoints are occurring too frequently (4 seconds apart)
HINT:  Consider increasing the configuration parameter "checkpoint_segments".
LOG:  checkpoints are occurring too frequently (5 seconds apart)
HINT:  Consider increasing the configuration parameter "checkpoint_segments".
LOG:  checkpoints are occurring too frequently (9 seconds apart)
HINT:  Consider increasing the configuration parameter "checkpoint_segments".
LOG:  background writer process (PID 3221) was terminated by signal 9: Killed
LOG:  terminating any other active server processes
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and repeat your command.
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and repeat your command.
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and repeat your command.
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and repeat your command.
LOG:  all server processes terminated; reinitializing
FATAL:  the database system is in recovery mode
LOG:  database system was interrupted; last known up at 2013-08-30 16:36:42 CST
LOG:  database system was not properly shut down; automatic recovery in progress
LOG:  consistent recovery state reached at 0/B7657BD0
LOG:  redo starts at 0/B60FE2B8
LOG:  unexpected pageaddr 0/B044C000 in log file 0, segment 184, offset 4505600
LOG:  redo done at 0/B844B940
LOG:  autovacuum launcher started
LOG:  database system is ready to accept connections
复制代码

各个进程都重新生成了:

复制代码
[root@server ~]# ps aux | grep post
root      3180  0.0  0.0 105296   660 pts/1    S    16:31   0:00 su - postgres
postgres  3181  0.0  0.0  70304   628 pts/1    S+   16:31   0:00 -bash
postgres  3219  0.0  0.5 113644  6276 pts/1    S    16:32   0:00 /usr/local/pgsql/bin/postgres -D /gao/data
root      3252  0.0  0.0 105296   660 pts/2    S    16:32   0:00 su - postgres
postgres  3253  0.0  0.0  70304   628 pts/2    S    16:32   0:00 -bash
postgres  3285  0.0  0.0  83488  1144 pts/2    S+   16:32   0:00 ./psql
postgres  3348  0.0  0.0 113644   984 ?        Ss   16:40   0:00 postgres: writer process                  
postgres  3349  0.0  0.0 113644   732 ?        Ss   16:40   0:00 postgres: wal writer process              
postgres  3350  0.0  0.1 114336  1756 ?        Ss   16:40   0:00 postgres: autovacuum launcher process     
postgres  3351  0.0  0.0  73332   580 ?        Ss   16:40   0:00 postgres: stats collector process         
root      3361  0.0  0.0  65420   796 pts/3    R+   16:44   0:00 grep post
[root@server ~]# 
复制代码

回到psql端再看,发现连接已经失效了:

!> \
Invalid command \. Try \? for help.
!> \dt;
You are currently not connected to a database.
!> 

根据向社区提问,据说是因为OS级别的OOM错误发生,所以杀死了Postmaster进程。

总之,此种情况表明,如果没有对资源消费总量的控制,那么突发性的用户访问会导致崩溃的。









本文转自健哥的数据花园博客园博客,原文链接:http://www.cnblogs.com/gaojian/p/3291938.html,如需转载请自行联系原作者

相关实践学习
使用PolarDB和ECS搭建门户网站
本场景主要介绍基于PolarDB和ECS实现搭建门户网站。
阿里云数据库产品家族及特性
阿里云智能数据库产品团队一直致力于不断健全产品体系,提升产品性能,打磨产品功能,从而帮助客户实现更加极致的弹性能力、具备更强的扩展能力、并利用云设施进一步降低企业成本。以云原生+分布式为核心技术抓手,打造以自研的在线事务型(OLTP)数据库Polar DB和在线分析型(OLAP)数据库Analytic DB为代表的新一代企业级云原生数据库产品体系, 结合NoSQL数据库、数据库生态工具、云原生智能化数据库管控平台,为阿里巴巴经济体以及各个行业的企业客户和开发者提供从公共云到混合云再到私有云的完整解决方案,提供基于云基础设施进行数据从处理、到存储、再到计算与分析的一体化解决方案。本节课带你了解阿里云数据库产品家族及特性。
目录
相关文章
J3D在UOS+KIRIN崩溃2:深层原因分析
J3D在UOS+KIRIN崩溃2:深层原因分析
51 0
|
Java Linux
理论:第十四章:生产环境服务器变慢如何诊断,性能评估
理论:第十四章:生产环境服务器变慢如何诊断,性能评估
245 0
理论:第十四章:生产环境服务器变慢如何诊断,性能评估
|
传感器 机器学习/深度学习 算法
康奈尔大学推出终结者式机器人!被刺伤后能检测到损伤,还可以当场自我修复
《终结者》已经成为不少科幻电影迷心中的经典。 电影中机器人T-800从一个没有感情的机器人最终成长为理解人性与生命的生命体。 深入人心的除了T-800的形象外,反派液态金属机器人变形模仿和自我修复的能力也同样深入人心。
182 0
康奈尔大学推出终结者式机器人!被刺伤后能检测到损伤,还可以当场自我修复
|
程序员 测试技术 BI
谈谈企业管理软件领域内那些很难稳定重现故障的处理技巧(二)
企业管理软件面向的是企业级用户,如果软件出现故障(bug),在某些极端情况下,可能会让企业蒙受巨大的经济损失,故而对软件开发人员在编程规范,软件测试和软件交付之前的验证等各方面都提出了更高的要求。
127 0
谈谈企业管理软件领域内那些很难稳定重现故障的处理技巧(二)
|
运维 数据挖掘 测试技术
谈谈企业管理软件领域内那些很难稳定重现故障的处理技巧(一)
企业管理软件面向的是企业级用户,如果软件出现故障(bug),在某些极端情况下,可能会让企业蒙受巨大的经济损失,故而对软件开发人员在编程规范,软件测试和软件交付之前的验证等各方面都提出了更高的要求。
118 0
谈谈企业管理软件领域内那些很难稳定重现故障的处理技巧(一)
|
JSON 运维 网络协议
平稳运行半年的系统宕机了,记录一次排错调优的全过程!
最近发生了一件很让人头疼的事情,已经上线半年且平稳运行半年系统在年后早高峰的使用时发生了濒临宕机的情况。访问速度特别慢,后台查到大量time_wait的连接,从代码层面到架构层面到网络层面排查了几天几夜,总算是有了结果。
|
Java Shell
进程无故消失的破案历程
进程无故消失的破案历程
进程无故消失的破案历程
国外隔离迷惑行为:这个网站帮你复现办公室白噪音,还能自行决定同事数量
国外隔离迷惑行为:这个网站帮你复现办公室白噪音,还能自行决定同事数量
118 0
|
域名解析 缓存 运维
一行小错为何产生巨大破坏-Facebook史诗级故障大反思
弱小从来不是生存的障碍,傲慢才是。10月4日FaceBook发生了一次史诗级中断事故,故障期间FaceBook所有旗下APP全面对外服务中断,而且故障的时间长达7个小时之久。根据Facebook最新的声明来看,故障的原因是由于工程师错误地发出了一条指令,切断了Facebook的数据中心“在全球范围内的所有网络连接”。
一行小错为何产生巨大破坏-Facebook史诗级故障大反思
|
存储 运维 监控
十条运维经验,帮你远离故障
1. 确保变更可以回滚 佛说:“每次创伤都是一次成熟”。这是运维人员的真实写照。从某种意义上讲,运维是一份不断犯错、不断积累经验的工作。以前没有经历的东西,总是不定期的给你痛击。所以请保护好变更的现场,使得变更有回头的机会。
1065 0