[Troubleshooting] ORA-2730*, status 12: Fault Analysis and Resolution

Today a friend running a 10g database on AIX hit a series of ORA-2730* errors. After the system had been in use for a while, the database could no longer be reached: clients could not log in, and no commands could be issued on the server side either.
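In the end the problem was worked around for the time being by simply restarting the database (a practice that should be firmly stamped out!). Let us take a brief look at this fault.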

1. Error messages in the alert log
Errors in file /home/oracle/admin/ora10g/bdump/ora10g_psp0_147610.trc:
ORA-27300: OS system dependent operation:fork failed with status: 12
ORA-27301: OS failure message: Not enough space
ORA-27302: failure occurred at: skgpspawn3
Fri Apr 16 06:14:42 2010
Process m000 died, see its trace file
Fri Apr 16 06:14:42 2010
ksvcreate: Process(m000) creation failed
Fri Apr 16 06:15:19 2010
Process startup failed, error stack:
Fri Apr 16 06:15:19 2010
Errors in file /home/oracle/admin/ora10g/bdump/ora10g_psp0_147610.trc:
ORA-27300: OS system dependent operation:fork failed with status: 12
ORA-27301: OS failure message: Not enough space
ORA-27302: failure occurred at: skgpspawn3
Fri Apr 16 06:15:19 2010
Process m000 died, see its trace file
Fri Apr 16 06:15:19 2010
ksvcreate: Process(m000) creation failed
Fri Apr 16 06:15:44 2010
Process startup failed, error stack:
Fri Apr 16 06:15:44 2010
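
When the alert log is long, it can help to count how often the OS-level failure recurs and with which status code. The following is a minimal sketch, not part of the original analysis: the function name `count_os_failures` and the `sample` text are illustrative, and the regular expression only assumes the ORA-27300 line format shown in the excerpt above.

```python
import re
from collections import Counter

# Matches the ORA-27300 line format seen in the alert log excerpt.
ORA_27300 = re.compile(
    r"ORA-27300: OS system dependent operation:(\w+) failed with status: (\d+)"
)


def count_os_failures(alert_text):
    """Count ORA-27300 occurrences per (operation, status) pair."""
    counts = Counter()
    for line in alert_text.splitlines():
        match = ORA_27300.search(line)
        if match:
            counts[(match.group(1), int(match.group(2)))] += 1
    return counts


# Illustrative sample assembled from lines of the excerpt above.
sample = """\
Errors in file /home/oracle/admin/ora10g/bdump/ora10g_psp0_147610.trc:
ORA-27300: OS system dependent operation:fork failed with status: 12
ORA-27301: OS failure message: Not enough space
ORA-27302: failure occurred at: skgpspawn3
Fri Apr 16 06:15:19 2010
Process m000 died, see its trace file
ORA-27300: OS system dependent operation:fork failed with status: 12
"""

for (operation, status), n in count_os_failures(sample).items():
    print(f"{operation} failed with status {status}: {n} time(s)")
```

A steadily growing count for `fork` with status 12 suggests that the shortage persists rather than being a one-off spike.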

2. Records in the trace file
/home/oracle/admin/ora10g/bdump/ora10g_cjq0_516352.trc
Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 - 64bit Production
With the Partitioning, OLAP and Data Mining options
ORACLE_HOME = /home/oracle/product/10g
System name: AIX
Node name: smartcard1
Release: 3
Version: 5
Machine: 0006A849D600
Instance name: ora10g
Redo thread mounted by this instance: 1
Oracle process number: 10
Unix process pid: 516352, image: oracle@smartcard1 (CJQ0)
*** 2010-04-28 23:04:44.012
*** SERVICE NAME:(SYS$BACKGROUND) 2010-04-28 23:04:43.776
*** SESSION ID:(162.1) 2010-04-28 23:04:43.776
Waited for process J000 to initialize for 60 seconds
*** 2010-04-28 23:04:44.012
Dumping diagnostic information for J000:
*** 2010-04-28 23:06:08.301
OS pid = 425984
loadavg : 0.62 0.46 0.23
swap info: free_mem = 20.43M rsv = 32.00M
           alloc = 4999.81M avail = 8192.00M swap_free = 3192.19M
skgpgpstack: fgets() timed out after 60 seconds
ERROR: process 425984 is not alive
*** 2010-04-28 23:06:08.332
*** 2010-04-29 23:12:22.828
Waited for process J000 to initialize for 60 seconds
*** 2010-04-29 23:12:24.493
Dumping diagnostic information for J000:
OS pid = 635374
loadavg : 1.23 1.04 0.62
swap info: free_mem = 11.09M rsv = 32.00M
           alloc = 3162.42M avail = 8192.00M swap_free = 5029.58M
skgpgpstack: fgets() timed out after 60 seconds
open: Permission denied
procstack: open(/proc/635374/ctl): Permission denied
*** 2010-04-29 23:13:58.926
*** 2010-05-02 23:35:36.230
Waited for process J000 to initialize for 60 seconds
*** 2010-05-02 23:35:38.121
Dumping diagnostic information for J000:
OS pid = 553290
loadavg : 0.34 0.26 0.45
swap info: free_mem = 11.81M rsv = 32.00M
           alloc = 6194.64M avail = 8192.00M swap_free = 1997.36M
       F S      UID    PID   PPID   C PRI NI ADDR    SZ    WCHAN    STIME    TTY  TIME CMD
  240001 A   loohcs 553290      1   0 255 20 6064f510 87244          23:34:36      -  0:00 [oracle]
ERROR: process 553290 is not alive
*** 2010-05-02 23:37:02.247
*** 2010-05-02 23:37:02.247
Process J000 is dead (pid=553290, state=5):

3. Problem analysis
The following error messages deserve the most attention:
1) Errors in the alert log
Errors in file /home/oracle/admin/ora10g/bdump/ora10g_psp0_147610.trc:
ORA-27300: OS system dependent operation:fork failed with status: 12
ORA-27301: OS failure message: Not enough space
ORA-27302: failure occurred at: skgpspawn3

2) Key information in the trace file
swap info: free_mem = 11.81M rsv = 32.00M
           alloc = 6194.64M avail = 8192.00M swap_free = 1997.36M

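Status 12 is the operating system's ENOMEM error, which AIX reports as "Not enough space". The trace shows why: free memory is down to about 12 MB, and about 6 GB of the 8 GB of paging space is already allocated. Clearly, the problem is caused by swap space running out.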
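
The figures in the "swap info" lines can be turned into a utilization percentage with a few lines of code. This is a rough sketch under one assumption: that `alloc` is the paging space already allocated and `avail` is the total, which is consistent with `alloc + swap_free = avail` in every excerpt above. The names `parse_swap_info` and `swap_used_pct` are hypothetical helpers, not Oracle tooling.

```python
import re

# Each field in the trace has the form "name = <number>M" (megabytes).
FIELD = re.compile(r"(\w+)\s*=\s*([\d.]+)M")


def parse_swap_info(text):
    """Return the 'name = valueM' pairs as a dict of floats (MB)."""
    return {name: float(value) for name, value in FIELD.findall(text)}


def swap_used_pct(info):
    """Share of paging space already allocated, in percent.

    Assumes 'alloc' is allocated paging space and 'avail' is the total.
    """
    return 100.0 * info["alloc"] / info["avail"]


# The two lines quoted from the trace file.
sample = """\
swap info: free_mem = 11.81M rsv = 32.00M
           alloc = 6194.64M avail = 8192.00M swap_free = 1997.36M
"""

info = parse_swap_info(sample)
print(f"free memory: {info['free_mem']} MB")
print(f"paging space used: {swap_used_pct(info):.1f}%")
```

Running the same calculation on the earlier trace entries gives a quick view of how paging space usage moved between incidents.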

4. MOS also describes this problem
Database Cannot Start Due to Lack of Memory [ID 560309.1]

Applies to:
Oracle Server - Enterprise Edition - Version: 10.2.0.1 to 10.2.0.4
This problem can occur on any platform.
Symptoms

The database can not start up due to the following errors:

*** SERVICE NAME:(SYS$BACKGROUND) 2008-03-24 17:02:34.855
*** SESSION ID:(1104.1) 2008-03-24 17:02:34.855
*** 2008-03-24 17:02:34.855
Process startup failed, error stack:
ORA-27300: OS system dependent operation:fork failed with status: 12
ORA-27301: OS failure message: Not enough space
ORA-27302: failure occurred at: skgpspawn3
*** 2008-03-24 17:02:38.158
Process startup failed, error stack:
ORA-27300: OS system dependent operation:fork failed with status: 12
ORA-27301: OS failure message: Not enough space
ORA-27302: failure occurred at: skgpspawn3

Cause

This issue is mainly caused by lack of memory / swap. Checking the memory configuration on the server, we have found the following:

Total Physical Memory 38912 MB
Swap: Max Size 17664 MiB
So, RAM is 38 GB, SWAP space is only 17 GB
Solution

1- We should increase the server swap space (paging space). The general rule of thumb is that swap space should be:
RAM                          SWAP
1GB to 2GB                   1.5 times RAM
> 2GB and < 8GB              equal to RAM
> 8GB                        .75 times RAM

So in our case, the recommended swap space should be 28 GB instead of 17 GB.

2- We can also try to increase physical memory, if possible.

3- On Unix platforms, the user limits for user oracle should be checked using the "ulimit -a" command.

4- We should also check memory parameters in the pfile/spfile that may add more load to the memory consumption on the server. For example, setting the following parameters can add more overhead to memory consumption:

- lock_sga=true
- db_keep_cache_size=
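
The swap sizing rule of thumb in step 1 can be written as a small function to check a server's configuration. This is a sketch only. The middle tier (swap equal to RAM for more than 2 GB and up to 8 GB) is an assumption taken from the Oracle 10g installation guidelines, and applying the 1.5 factor to machines with less than 1 GB of RAM is also an assumption. The function name `recommended_swap_mb` is hypothetical.

```python
def recommended_swap_mb(ram_mb):
    """Rule-of-thumb swap size in MB for a given amount of RAM in MB."""
    if ram_mb <= 2048:
        # 1 GB to 2 GB tier (also applied below 1 GB here, an assumption).
        return 1.5 * ram_mb
    if ram_mb <= 8192:
        # Assumed middle tier: swap equal to RAM.
        return 1.0 * ram_mb
    # More than 8 GB of RAM.
    return 0.75 * ram_mb


# Figures quoted in the MOS note.
ram_mb = 38912
configured_swap_mb = 17664

needed_mb = recommended_swap_mb(ram_mb)
print(f"recommended swap: {needed_mb:.0f} MB")
print(f"configured swap:  {configured_swap_mb} MB")
print(f"shortfall:        {needed_mb - configured_swap_mb:.0f} MB")
```

For the server in the note, 0.75 times 38912 MB is 29184 MB, which matches the roughly 28 GB recommended there.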

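5. Summary
Once the root cause is found, the fix is straightforward: increase the swap space appropriately, or tune the database so that it uses less memory.
What lessons should we take from this case?
1) When a system goes live, fully consider the type of application, whether it is CPU-intensive, memory-intensive, or disk I/O-intensive, and tune the system accordingly;
2) When a problem occurs, do not simply "restart the database"; dig out the real cause behind it so that it does not happen again;
3) "Dissolve risk before it takes shape": strengthen routine system monitoring so that problems are found and tuned away early, instead of being left until the system is beyond help.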
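
As one possible form of the routine monitoring mentioned in point 3), paging space usage on AIX can be checked against a threshold. The sketch below parses text in the summary layout printed by `lsps -s` (a total in MB and a percentage used). The sample text is illustrative rather than captured from the affected server, the 70% threshold is an arbitrary assumption, and the function names are hypothetical.

```python
import re


def parse_lsps_summary(text):
    """Parse `lsps -s` style output into (total_mb, percent_used)."""
    match = re.search(r"(\d+)MB\s+(\d+)%", text)
    if match is None:
        raise ValueError("unrecognized lsps -s output")
    return int(match.group(1)), int(match.group(2))


def paging_alert(text, threshold_pct=70):
    """Return True when paging space usage reaches the threshold."""
    _total_mb, used_pct = parse_lsps_summary(text)
    return used_pct >= threshold_pct


# Illustrative sample; the values are made up to resemble the trace above.
sample = """\
Total Paging Space   Percent Used
      8192MB              76%
"""

total_mb, used_pct = parse_lsps_summary(sample)
status = "ALERT" if paging_alert(sample) else "ok"
print(f"paging space {total_mb} MB, {used_pct}% used: {status}")
```

In practice the text would come from running the command on a schedule, for example through `subprocess.run(["lsps", "-s"], capture_output=True, text=True)`, with the alert routed to whatever notification channel the site already uses.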

Good luck.

secooler
10.05.11

-- The End --
