一、環(huán)境:
操作系統(tǒng)為:AIX 5.3
數(shù)據(jù)庫版本:Oracle 10.2.0.5
二、第一現(xiàn)場:
1.主機(jī)能夠登入,系統(tǒng)負(fù)載很低
2.數(shù)據(jù)庫實例2登錄不進(jìn)去,實例1能正常提供服務(wù)
3.業(yè)務(wù)不受到影響
三、故障處理:
1、基本狀態(tài)查看
首先查看下主機(jī)情況:
$ uptime
06:06PM up 232 days, 19:31, 6 users, load average: 6.15, 6.71, 6.69
節(jié)點2已經(jīng)運行232天,平均負(fù)載6點多,相對比較輕,最近時刻的負(fù)載也沒有超過7,因此主機(jī)狀態(tài)基本正常。接下來查看是否有大量換頁出現(xiàn):
$ vmstat 3 10
System configuration: lcpu=16mem=79360MB
kthr memory page faults cpu
----- ----------------------------------- ------------ -----------
r b avm fre re pi po fr sr cy in sy cs us sy id wa
6 04932557 13865294 0 0 0 0 0 0 246 273723 268060 17 1172 0
5 04932540 13865311 0 0 0 0 0 0 279 265072 263893 17 1172 0
5 04932503 13865349 0 0 0 0 0 0 311 265326 264567 17 1172 0
5 04932511 13865340 0 0 0 0 0 0 597 340525 264203 18 1270 0
8 04932523 13865328 0 0 0 0 0 0 327 265519 260997 17 1172 0
…….
我們同樣發(fā)現(xiàn),CPU內(nèi)存都相對空閑,沒有換頁發(fā)生。
$ ps -ef|grep ora|wc -l
367
數(shù)據(jù)庫進(jìn)程300多,對于16CPU,80G內(nèi)存的主機(jī)系統(tǒng)來說,一切似乎都很正常。
$ sqlplus -prelim / as sysdba
SQL*Plus: Release 10.2.0.5.0 -Production on Thu Jun 14 18:01:13 2012
Copyright (c) 1982, 2010, Oracle. All Rights Reserved.
SQL>
SQL> oradebug hanganalyze 3
Hang Analysis in /oracle/admin/bxxx/udump/bxxx2_ora_1011948.trc
SQL> oradebug setmypid
Statement processed.
SQL> oradebug dump systemstate 266
Statement processed.
SQL> set time on
18:02:57 SQL> set timing on
18:03:01 SQL> oradebug dump systemstate 266
Statement processed.
18:04:13 SQL> oradebug hanganalyze5
Hang Analysis in /oracle/admin/bxxx/udump/bxxx2_ora_1011948.trc
18:04:24 SQL>
18:04:51 SQL> oradebug dump systemstate 266
Statement processed.
18:07:12 SQL> exit
Disconnected from ORACLE
由于其余數(shù)據(jù)庫實例正常,因此信息搜集我們僅限于本實例,以免影響到其他實例的正常運行。
先強(qiáng)制關(guān)閉:
$ sqlplus -prelim / as sysdba
SQL*Plus: Release 10.2.0.5.0 - Productionon Thu Jun 14 18:10:11 2012
Copyright (c) 1982, 2010, Oracle. All Rights Reserved.
SQL> shutdown abort
ORACLE instance shut down.
SQL> exit
Disconnected from ORACLE
再正常啟動:
$ sqlplus / as sysdba
SQL*Plus: Release 10.2.0.5.0 - Productionon Thu Jun 14 18:10:34 2012
Copyright (c) 1982, 2010, Oracle. All Rights Reserved.
Connected to an idle instance.
SQL> startup
ORACLE instance started.
Total System Global Area 1610612736bytes
Fixed Size 2096736 bytes
Variable Size 1308623264 bytes
Database Buffers 285212672 bytes
Redo Buffers 14680064 bytes
Database mounted.
Database opened.
實例重啟后,所有應(yīng)用能正常連接到實例2。
接下來進(jìn)行原因的深入分析,以防同樣的故障再次出現(xiàn)。
首先我們來看hanganalyze的信息輸出:
*** 2012-06-14 18:01:25.716
==============
HANG ANALYSIS:
==============
Found 177 objects waitingfor
<1/1014/60449/0x6fbfedb8/925832/SGA:allocation forcing componen>
Found 33 objects waiting for
<1/1093/1/0x6fbe9868/295728/NoWait>
Open chains found:
Chain 1 :
<1/1093/1/0x6fbe9868/295728/NoWait>
-- <1/1014/60449/0x6fbfedb8/925832/SGA:allocation forcing componen>
-- <1/674/1309/0x6fc258d8/689108/librarycache load lock>
本次的故障原因從這個輸出就一目了然,Oracle進(jìn)行內(nèi)部內(nèi)存自動分配時,遲遲沒有結(jié)束,導(dǎo)致177個對象等待著新的內(nèi)存空間。
進(jìn)一步的從systemdump輸出:
…….
waiting for SGA: allocation forcing component growthwait_time=0, seconds since wait started=3
….
SO: 700000050f2a528, type: 50, owner:70000006d500910, flag: INIT/-/-/0x00
row cache enqueue: count=1 session=70000006d45d6e0object=700000058c461b8, mode=S
savepoint=0x7a5dd06
row cache parent object: address=700000058c461b8cid=8(dc_objects)
hash=875b7c55 typ=11 transaction=0 flags=00008000
own=700000058c46288[700000050f2a558,700000050f2a558]
……
可以看出,由于Oracle在進(jìn)行內(nèi)存分配時,將一些對象從內(nèi)存中踢出去了,新的SQL解析時,需要裝載新的對象,然而沒有可用內(nèi)存空間,因此數(shù)據(jù)庫實例hang住。
根據(jù)以往經(jīng)驗,發(fā)生這種等待,通常是采用了Oracle的ASMM(自動共享內(nèi)存管理),即SGA自動管理引起。
從實例2的告警日志alert_bxxx2.log,我們發(fā)現(xiàn):
Thu Jun 14 18:10:38 BEIST 2012
Starting ORACLE instance (normal)
sskgpgetexecname failed to get name
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Interface type 1 en1 172.16.1.0configured from OCR for use as a cluster interconnect
Interface type 1 en8 10.153.246.128configured from OCR for use as a publicinterface
Picked latch-free SCN scheme 3
Using LOG_ARCHIVE_DEST_1 parameterdefault value as /oracle/product/10.2.0/db/dbs/arch
LICENSE_MAX_USERS = 0
SYS auditing is disabled
ksdpec: called for event 13740 priorto event group initialization
Starting up ORACLE RDBMS Version:10.2.0.5.0.
System parameters with non-defaultvalues:
processes = 1000
…….
nls_territory = CHINA
sga_target = 1610612736
control_files =/oradata1/control01.ctl, /oradata2/control02.ctl, /oradata3/control03.ctl
db_block_size = 8192
__db_cache_size =285212672
compatible =10.2.0.5.0
db_files = 2000
db_file_multiblock_read_count= 16
cluster_database = TRUE
cluster_database_instances= 2
thread = 2
…….
db_name = bxxx
open_cursors = 300
pga_aggregate_target =8311013376
確實采用了SGA自動管理(sga_target不為0)。
且總共80G的物理內(nèi)存,只分配給Oracle SGA 2G不到,而本機(jī)唯一的任務(wù)就是確保Oracle系統(tǒng)正常。
結(jié)合本次故障發(fā)生的原因及新炬在移動行業(yè)維護(hù)經(jīng)驗,我們給出如下三點配置修改建議:
1.將SGA管理改為手動。
2.數(shù)據(jù)庫會話在400個左右事,將SGA從2G增加到40G,其中:
shared_pool_size=1G
db_cache_size=35G
SGA_MAX_SIZE=40G
SGA_TARGET=0
3.關(guān)閉DRM特性:
_gc_affinity_time=0 # Only if DBversion is 10.1 or 10.2
_gc_undo_affinity=FALSE # Only if Db version is10.2
注:上述建議都需要停止數(shù)據(jù)庫,因此需安排計劃性停機(jī)。
同時,我們開發(fā)了數(shù)據(jù)庫實例hang住時的自動采集腳本,供數(shù)據(jù)庫發(fā)生hang住的情況下,快速搜集相關(guān)信息(存成文件放到Oracle用戶下的目錄即可):
#
#auto_hang_analyze.sh
#created by shsnc @20120614
#
#any question please sendemailto:master@shsnc.com
#
#!/usr/bin/ksh
. ~/.profile
#自動終止上次運行的進(jìn)程
for line in `ps -ef|grepauto_hang_analyze.sh|grep -v grep|awk {print $2}`
do
for line1 in `ps -ef|grep $line|grep sqlplus|awk {print $2}`
do
for line2 in `ps -ef|grep $line1|grepLOCAL=YES|awk {print $2}`
do
echo $line2
kill -9 $line2
done
done
done
count=`ps -ef |grep $0 |grep -v grep|wc -l`
if [[ count -gt 2 ]]; then
echo $0 already running!
exit
fi
echo `date`
# 判斷是否需要進(jìn)行自動執(zhí)行hang analyze
sqlplus -prelim / as sysdba <
set feedback off
set termout off;
ttitle off;
btitle off;
set heading off
set timing off;
set verify off;
set echo off;
spool session_event_cnt.out
select get_event_cnt from dual;
spool off;
exit
EOF
event_cnt=`grep -v SQLsession_event_cnt.out|awk {print $1}`
if [[ event_cnt -lt 30 ]]; then
echo no need hang analyze!
exit
fi
#進(jìn)行hang analyze
sqlplus -prelim / as sysdba<
set feedback off
set termout off;
ttitle off;
btitle off;
set heading off
set timing off;
set verify off;
set echo off;
spool hang_analyze.out
oradebug setmypid
oradebug unlimit;
oradebug hanganalyze 3;
oradebug dump systemstate 1;
spool off;
exit
EOF
#取hang analyze 文件名
hanganalyze_file=`grep -i -E HangAnalysis hang_analyze.out|awk {print $4}`
if test -z "$hanganalyze_file";then
exit
fi
echo $hanganalyze_file
#生成自動kill腳本
awk{if(index($0,"Found")>0) {printf"%s ",$0} else {print$0}} $hanganalyze_file|grep Found|awk {if($2>20) {{split($7,A,"/")}{ print "ps -ef|grep "A[5]"|grepLOCAL=NO|awk 47{print "
-9 "$2}47|xargs kill"}}}>kill_hang_process.sh
#執(zhí)行自動終止腳本并備份腳本
if [ `cat kill_hang_process.sh|wc -l`-gt 0 ]
then
sh kill_hang_process.sh
echo $hanganalyze_file>>kill_hang_process.sh
cp kill_hang_process.sh ./ak_log/kill_hang_process.sh.`date+%b_%d_%H_%M_%S`
fi
echo `date`
文章版權(quán)歸作者所有,未經(jīng)允許請勿轉(zhuǎn)載,若此文章存在違規(guī)行為,您可以聯(lián)系管理員刪除。
轉(zhuǎn)載請注明本文地址:http://specialneedsforspecialkids.com/yun/130246.html
閱讀 1345·2023-01-11 13:20
閱讀 1683·2023-01-11 13:20
閱讀 1132·2023-01-11 13:20
閱讀 1858·2023-01-11 13:20
閱讀 4099·2023-01-11 13:20
閱讀 2704·2023-01-11 13:20
閱讀 1385·2023-01-11 13:20
閱讀 3594·2023-01-11 13:20