ges resources暴长导致 ORA"/>
ges resources暴长导致 ORA
1.概述2015年5月6日09:16 两实例均分别出现ORA-04031应用无法连接,其中信用3A继ORA-04031之后触发了ORA-00600错误,最终由ORA-481中断该实例;信用4A为了及时恢复业务及时对实例进行了重启;后经分析此次故障是人工执行预付费的账号的一个过程所致,该过程使用游标的方式对表进行大量的UPDATE操作加上未使用绑定变量致使用数据库出现ORA-04031;最终触发了BUG;
2.故障现象
数据库出ORA-04031、ORA-00600 [kksscl-inf-inl-loop]最后由ORA-00481中断实例,信用4a应用报ORA-04031数据库日志未出现告警;
--日志报错信息
Wed May 6 09:16:34 2015
Errors in file /oracle/app/admin/oraxy3/udump/oxy3a_ora_1483130.trc:
ORA-04031: unable to allocate 4216 bytes of shared memory ("shared pool","insert into CREDIT_EVENT_ERR...","sga heap(3,0)","library cache")
ORA-04031: unable to allocate 4216 bytes of shared memory ("shared pool","insert into CREDIT_EVENT_ERR...","sga heap(3,0)","library cache")
Errors in file /oracle/app/admin/oraxy3/udump/oxy3a_ora_492244.trc:
ORA-00600: internal error code, arguments: [kksscl-inf-inl-loop], [2500], [3], [323], [652], [652], [], []
Wed May 6 09:17:00 2015
Errors in file /oracle/app/admin/oraxy3/udump/oxy3a_ora_635918.trc:
ORA-00600: internal error code, arguments: [kksscl-inf-inl-loop], [2500], [3], [323], [652], [652]
Errors in file /oracle/app/admin/oraxy3/bdump/oxy3a_lmon_1422378.trc:
ORA-00600: internal error code, arguments: [kjrfr:!owner], [0x7000001E9DA5890], [], [], [], [], [], []
Wed May 6 09:23:06 2015
Errors in file /oracle/app/admin/oraxy3/bdump/oxy3a_lmon_1422378.trc:
ORA-00603: ORACLE server session terminated by fatal error
ORA-00600: internal error code, arguments: [kjrfr:!owner], [0x7000001E9DA5890], [], [
Errors in file /oracle/app/admin/oraxy3/bdump/oxy3a_pmon_1503680.trc:
ORA-00481: LMON process terminated with error
Wed May 6 09:25:32 2015
PMON: terminating instance due to error 481
Wed May 6 09:25:32 2015
Errors in file /oracle/app/admin/oraxy3/bdump/oxy3a_lms4_1343952.trc:
ORA-00481: LMON process terminated with error
3. 故障分析
3.1 数据库负载分析
08:30-09:00数据库每秒执行的事物高达7439.33次相当于平时的165倍,从这里可以看出数据库当时业务量非常高;
DB Name DB Id Instance Inst Num Release Cluster Host
------------ ----------- ------------ -------- ----------- ------- ------------
ORAXY4 3696212656 oxy4a 1 9.2.0.6.0 YES sxy4a
Snap Id Snap Time Sessions Curs/Sess Comment
--------- ------------------ -------- --------- -------------------
Begin Snap: 186816 06-May-15 08:30:01 681 10.2
End Snap: 186833 06-May-15 09:00:01 686 10.2
Elapsed: 30.00 (mins)
Load Profile
~~~~~~~~~~~~ Per Second Per Transaction
--------------- ---------------
Redo size: 3,460,589.76 465.17
Logical reads: 151,302.03 20.34
Block changes: 36,679.13 4.93
Physical reads: 7,794.99 1.05
Physical writes: 716.58 0.10
User calls: 2,380.18 0.32
Parses: 664.15 0.09
Hard parses: 3.03 0.00
Sorts: 205.90 0.03
Logons: 3.17 0.00
Executes: 24,984.70 3.36
Transactions: 7,439.33 ===========》每秒执行的事物高达7439.33次,说明数据库当时事物量非常高
从Shared Pool分析发现,其中GES RESOURCES组件分别在5月5日18:30-19:00之间由45M增长到80M,5月6日08:30-09:30之间由80M增长到1.7G;
SGA breakdown difference for DB: ORAXY4 Instance: oxy4a Snaps: 186816 -186833
Pool Name Begin value End value % Diff
------ ------------------------------ ---------------- ---------------- -------
java free memory 218,103,808 218,103,808 0.00
large free memory 218,103,808 218,103,808 0.00
shared 1M buffer 528,384 528,384 0.00
shared Checkpoint queue 271,427,840 271,427,840 0.00
shared DML lock 1,885,064 1,885,064 0.00
shared FileIdentificatonBlock 2,247,328 2,247,328 0.00
shared FileOpenBlock 20,380,456 20,380,456 0.00
shared KCL extra lock elements 1,728,000 1,728,000 0.00
shared KCL name table 1,483,512 1,483,512 0.00
shared KGK heap 64,600 64,600 0.00
shared KGLS heap 2,152,904 1,950,760 -9.39
shared KQR L PO 2,505,432 2,070,176 -17.37
shared KQR M PO 29,642,864 19,847,288 -33.05
shared KQR S SO 5,136 6,928 34.89
shared KQR X PO 115,944 115,944 0.00
shared KSXR pending messages que 853,952 853,952 0.00
shared MTTR advisory 352,576 352,576 0.00
shared PL/SQL DIANA 1,573,960 1,260,880 -19.89
shared PL/SQL MPCODE 89,231,648 2,981,424 -96.66
shared PL/SQL PPCODE 24,664 12,344 -49.95
shared PLS non-lib hp 3,632 3,632 0.00
shared branch 788,560 788,560 0.00
shared db_block_hash_buckets 2,851,520 2,851,520 0.00
shared db_files 371,128 371,128 0.00
shared db_handles 2,320,000 2,320,000 0.00
shared dictionary cache 3,229,952 3,229,952 0.00
shared enqueue 3,877,288 3,877,288 0.00
shared enqueue resources 1,525,720 1,525,720 0.00
shared errors 56,584 41,408 -26.82
shared event statistics per sess 24,942,960 24,942,960 0.00
shared fixed allocation callback 3,672 3,672 0.00
shared free memory 217,359,432 799,474,760 267.81
shared gcs resources 39,716,184 39,716,184 0.00
shared gcs shadows 26,385,752 26,385,752 0.00
shared ges big msg buffers 8,866,088 8,866,088 0.00
shared ges enqueues 71,480,032 71,480,032 0.00
shared ges process array 2,561,280 2,561,280 0.00
shared ges regular msg buffers 2,064,248 2,064,248 0.00
shared ges reserved msg buffers 2,096,008 2,096,008 0.00
shared ges resource hash table 4,325,376 4,325,376 0.00
shared ges resources 80,605,512 1,716,473,160 ####### ====》 从这里可以看出shared_pool大分部被ges resources占用; shared pool共2G,该组件就占用了1.7G
shared joxs heap init 488 488 0.00
shared library cache 957,054,656 379,577,984 -60.34
shared miscellaneous 206,828,912 206,838,528 0.00
shared parameters 10,241,824 237,752 -97.68
shared partitioning d 2,880,912 1,696,736 -41.10
shared processes 2,608,000 2,608,000 0.00
shared sessions 5,962,320 5,962,320 0.00
shared set_descriptor_array 685,184 685,184 0.00
shared sim memory hea 690,232 690,232 0.00
shared sim trace entries 5,210,112 5,210,112 0.00
shared sql area 920,218,568 34,797,528 -96.22
shared table definiti 10,928 13,544 23.94
shared transaction 2,619,192 2,619,192 0.00
ges resource解释:
Depending on application work loads, some components in a RAC environment can temporarily consume a lot of SGA memory for "ges enqueues" and this can fragment the shared pool sufficiently to cause ORA-4031 type errors
通过ges resources可以判断当时数据库正在做大量的word loads,从这里也正好验证了执行的事物高达7439.33次的事实
SQL ordered by Gets for DB: ORAXY4 Instance: oxy4a Snaps: 186816 -186833
-> End Buffer Gets Threshold: 10000
-> Note that resources reported for PL/SQL includes the resources used by
all SQL statements called within the PL/SQL code. As individual SQL
statements are also reported, it is possible and valid for the summed
total % to exceed 100
CPU Elapsd
Buffer Gets Executions Gets per Exec %Total Time (s) Time (s) Hash Value
--------------- ------------ -------------- ------ -------- --------- ----------
53,270,084 13,304,986 4.0 19.6 934.22 3161.89 1945535319
Module: PL/SQL Developer
SELECT VALUE FROM ACCOUNT_ATTR B WHERE B.KEY_ID = :B1 AND B.ATTR
_ID = 7
15,017,455 6,793,977 2.2 5.5 1207.08 1120.64 2588995081
Module: PL/SQL Developer
UPDATE ACCOUNT_ATTR633 C SET C.FLAG = 1 WHERE C.ROWID = :B1
14,589,255 6,488,773 2.2 5.4 1025.90 916.91 646745035
Module: PL/SQL Developer
UPDATE ACCOUNT_ATTR543 C SET C.FLAG = 1 WHERE C.ROWID = :B
13,386,305 13,326,843 1.0 4.9 407.27 728.89 1053795750
Module: SQL*Plus
COMMIT
10,007,055 1,883 5,314.4 3.7 37.08 61.03 2162479642
Module: exec_proc@c4cbeg02 (TNS V1-V3)
SELECT ROWID NROWID, ACCTID ACCOUNTID, PROCESSTIME INTIME, DECOD
E(OLDSTATUS,'FC17','FC121','IFC12','FC01','IFC121','FC01','IFC13
1','FC01','IFC132','FC01','IFC133','FC01','IFC30','FC30','IFC20'
,'FC20','IFC00','FC00','IFC01','FC01',OLDSTATUS) OLDSTATUS, DECO
DE(STATUS,'FC17','FC121','IFC12','FC12','IFC121','FC121','IFC131
9,337,971 1 9,337,971.0 3.4 243.89 364.25 1852514272
Module: PL/SQL Developer
declare v_value varchar2(10); begin for i in (select a.*, ro
wid rid from account_attr633 a where mod(a.key_id,10)=9 and a.fl
ag is null) loop begin select value into v_val
ue from account_attr b where b.key_id = i.key_id
and b.attr_id = 7; if v_value = '0' then
9,068,352 1 9,068,352.0 3.3 243.69 367.38 1116525411
Module: PL/SQL Developer
declare v_value varchar2(10); begin for i in (select a.*, ro
wid rid from account_attr633 a where mod(a.key_id,10)=8 and a.fl
ag is null) loop begin select value into v_val
ue from account_attr b where b.key_id = i.key_id
and b.attr_id = 7; if v_value = '0' then
8,572,295 1 8,572,295.0 3.1 240.85 370.84 3382822563
Module: PL/SQL Developer
declare v_value varchar2(10); begin for i in (select a.*, ro
wid rid from account_attr633 a where mod(a.key_id,10)=7 and a.fl
ag is null) loop begin select value into v_val
ue from account_attr b where b.key_id = i.key_id
and b.attr_id = 7; if v_value = '0' then
8,121,479 1 8,121,479.0 3.0 242.36 387.30 3652706703
从上述报告可以看出,有人在使用PL/SQL Developer来执行大批量的update语句,通过该语句我们可以定位是一个存储过程;
从上述分析导致该问题的原因是由于人工执行一过程所致,那么如何知道谁执行此操作,我们可以通过监听日志得到一些相关的信息;
ORT=60899)) * establish * oraxy3 * 0
05-MAY-2015 17:37:20 * (CONNECT_DATA=(SERVICE_NAME=oraxy3)(CID=(PROGRAM=D:\oracle\PLSQL Developer\plsqldev.exe)(HOST=SD4ACITRIX13)(USER=4A_yuanjuju))) * (ADDRESS=(PROTOCOL=tcp)(HOST=10.19.194.25)(PORT=60904)) * establish * oraxy3 * 0
05-MAY-2015 17:38:17 * (CONNECT_DATA=(SERVICE_NAME=oraxy3)(CID=(PROGRAM=D:\oracle\PLSQL Developer\plsqldev.exe)(HOST=SD4ACITRIX13)(USER=4A_yuanjuju))) * (ADDRESS=(PROTOCOL=tcp)(HOST=10.19.194.25)(PORT=60995)) * establish * orax
y3 * 0
05-MAY-2015 17:39:28 * (CONNECT_DATA=(SERVICE_NAME=oraxy3)(CID=(PROGRAM=D:\oracle\PLSQL Developer\plsqldev.exe)(HOST=SD4ACITRIX13)(USER=4A_wangzhen2))) * (ADDRESS=(PROTOCOL=tcp)(HOST=10.19.194.25)(PORT=61100)) * establish * oraxy3 * 0
05-MAY-2015 17:39:28 * (CONNECT_DATA=(SERVICE_NAME=oraxy3)(CID=(PROGRAM=D:\oracle\PLSQL Developer\plsqldev.exe)(HOST=SD4ACITRIX13)(USER=4A_wangzhen2))) * (ADDRESS=(PROTOCOL=tcp)(HOST=10.19.194.25)(PORT=61101)) * establish * oraxy3 * 0
从监听日志里我们可以看出用户USER=4A_yuanjuju通过plsqldev.exe执行了该过程;
3.5 数据宕机原因定位 -- Trace File 分析
oxy3a_ora_1483130.trc
通过对TRACE文件分析发现56%的资源被GES RESOURCES组件占用,通过Trace file内容可以判断该次故障命中Bug 3046725
参考文档:
Bug 3046725 - ORA-4031 due to shared_pool fragmented with high ges resources & enqueues ( Doc ID 3046725.8 )
Bug 5722472 - ORA-600[KKSSCL-INF-INL-LOOP] OCCURS IN HIGH LOAD TEST.
FileComment
-----------------
======================================
Allocation Request Summary Information
======================================
Current information setting: 00654fff
Dump Interval=300 seconds SGA Heap Dump Interval=3600 seconds
Last Dump Time=05/06/2015 09:16:34
Allocation request for: library cache
Heap: 700000000041e00, size: 4216
******************************************************
HEAP DUMP heap name="sga heap(3,0)" desc=700000000041e00
extent sz=0xfe0 alt=200 het=32767 rec=9 flg=-126 opc=0
parent=0 owner=0 nex=0 xsz=0x1
Subpool 1: total 989855744, top 5 components:
Allocation Name Size Ratio(%)
___________________________ ____________ ________
"ges resources " 552448736 55.81
"free memory " 182546912 18.44
"miscellaneous " 88791648 8.97
"Checkpoint queue " 64656160 6.53
"ges enqueues " 27310904 2.76
Subpool 2: total 989855744, top 5 components:
Allocation Name Size Ratio(%)
___________________________ ____________ ________
"ges resources " 552475400 55.81
"free memory " 189370904 19.13
"Checkpoint queue " 64976240 6.56
"miscellaneous " 63793920 6.44
"FileOpenBlock " 29265560 2.96
Subpool 3: total 973078528, top 5 components: <<<<<<<
Allocation Name Size Ratio(%)
___________________________ ____________ ________
"ges resources " 552211440 56.75
"free memory " 183526328 18.86
"miscellaneous " 80548568 8.28
"Checkpoint queue " 64976240 6.68
"ges enqueues " 27266248 2.80
Shared Pool: total 2952790016, top 5 components:
Allocation Name Size Ratio(%)
___________________________ ____________ ________
"ges resources " 1657135576 56.12
"free memory " 555444144 18.81
"miscellaneous " 233134136 7.90
"Checkpoint queue " 194608640 6.59
"ges enqueues " 82139704 2.78
---BUG定位
Bug 3046725 - ORA-4031 due to shared_pool fragmented with high ges resources & enqueues ( Doc ID 3046725.8 )
Versions confirmed as being affected
9.2.0.2
9.2.0.7
10.1.0.4
This issue is fixed in
10.2.0.1 (Base Release)
Description
Depending on application work loads, some components in
a RAC environment can temporarily consume a lot of SGA
memory for "ges enqueues" and this can fragment the
shared pool sufficiently to cause ORA-4031 type errors.
Workaround:
Pre-reserve plenty of space for GES resources in
the shared pools to prevent the fragmentation effect
of a sudden jump in requirements.
This can be achieved by setting the hidden parameters
_lm_locks / _lm_ress .
该此故障发生是由人为执行的存储过程里对一表执行大批量的UPDATE操作,且未合理使用绑定变量给数据库带来严重负载压力,触使GES RESOURCE资源由45M暴长到1.7G,占用了85% SHARED POOL大小。致使用应该连接报ORA-04031错误,触发了BUG;
5. 解决方案5.1 临时方案
(1) 部署监控脚本,及时发现问题;
(2) 重启数据库释放内容、目前3、4区已在问题发生时进行了重启,2区未重启;需要安排进行进行重启;
(3) 修改存储过程,合理使用绑定变量
--原过程
declare
v_value varchar2(10);
begin
for i in (select a.*, rowid rid
from account_attr531 a
where mod(a.key_id,10)=0 and a.flag is null) loop
begin
select value
into v_value
from account_attr b
where b.key_id = i.key_id
and b.attr_id = 7;
if v_value = '0' then
update account_attr531 c set c.flag = 0 where c.rowid = i.rid;
else
update account_attr531 c set c.flag = 1 where c.rowid = i.rid;
end if;
exception
when others then
update account_attr531 c set c.flag = 3 where c.rowid = i.rid;
end;
end loop;
commit;
end;
--整改后:
declare
v_value varchar2(10);
v_num number(5);
begin
v_num:=0;
for i in (select a.*, rowid rid
from account_attr531 a
where mod(a.key_id,10)=0 and a.flag is null) loop
v_num:=v_num+1;
begin
select value
into v_value
from account_attr b
where b.key_id = i.key_id
and b.attr_id = 7;
if v_value = '0
' then
--update account_attr531 c set c.flag = 0 where c.rowid = i.rid;
execute immediate 'update account_attr531 c set c.flag = 0 where c.rowid =:r' using i.rid;
else
--update account_attr531 c set c.flag = 1 where c.rowid = i.rid;
execute immediate 'update account_attr531 c set c.flag = 1 where c.rowid =:r' using i.rid;
end if;
if v_num >=500 then
v_num:=0;
commit;
end if;
exception
when others then
--update account_attr531 c set c.flag = 3 where c.rowid = i.rid;
execute immediate 'update account_attr531 c set c.flag = 1 where c.rowid =:r' using i.rid;
end;
end loop;
commit;
end;
5.2 长期方案
(1)权限控制,禁止在生产系统上操作
(2)升级数据库到10G以上
来自 “ ITPUB博客 ” ,链接:/,如需转载,请注明出处,否则将追究法律责任。
转载于:/
更多推荐
ges resources暴长导致 ORA
发布评论