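A two-node database at a customer site raised a crash alert. Operations staff brought the database back up as an emergency measure and the application recovered. After startup, however, the alert log reported ORA-16191 and ORA-01031, caused by inconsistent password files between the Data Guard primary and standby databases. Rebuilding the password files resolved the problem.
Analysis of the alert log showed that at 16:32 node 1 found a corrupt block while reading the control file, at 16:33 the instance crashed because it could not read the control file, and instance 2 then shut down at 16:35. A check of the control files found no corrupt blocks, so the preliminary conclusion was a bug triggered by a transient control file read failure.
An SR was opened. SSC staff and the SR back-end experts jointly confirmed bug 11698676, which is a duplicate of bug 9549042 and is fixed in patch 9549042.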
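2. Fault Analysis and Handling
2.1 Fault Handling
At 16:34 on April 5, both nodes of the ssyy database crashed one after the other. After connecting urgently, we confirmed that both instances were completely shut down while the listeners were still running. We issued startup to bring both instances up, and the application reconnected to the production database.
After the instances were restarted, the node 1 alert log showed: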
Check that the primary and standby are using a password file
and remote_login_passwordfile is set to SHARED or EXCLUSIVE,
and that the SYS password is same in the password files.
returning error ORA-16191
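The message indicates that the SYS password files on the primary and standby are inconsistent. We therefore decided to rebuild the password file on the primary and copy the newly generated files to the standby nodes (backing up the original password files first, and changing the SYS password on the primary).
Run the password file creation command on each of the two primary-rac nodes: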
orapwd file=/oracle/db/oracle/product/11.1.0/db/dbs/ssyydb1 entries=5 force=y password=*********
orapwd file=/oracle/db/oracle/product/11.1.0/db/dbs/ssyydb2 entries=5 force=y password=*********
Copy ssyydb1 and ssyydb2 to standby-rac node 1 and node 2 respectively.
The alert log on primary-rac node 1 still kept reporting errors:
Errors in file /oracle/db/diag/rdbms/ssyy/ssyy1/trace/ssyy1_arc2_4134.trc:
ORA-01031: insufficient privileges
PING[ARC2]: Heartbeat failed to connect to standby drdb. Error is 1031.
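At this point, primary node 1 could not ship archive logs to standby node 1. According to MOS, ORA-01031 is also caused by inconsistent password files between primary and standby. We suspected that the archiver processes on the primary were still using a cached copy of the old password file. Archiver processes are non-critical and restart automatically after kill -9, so killing them does not affect the running database.
We killed all archiver processes on primary node 1 and node 2 in turn, but node 1 continued to report ORA-01031.
Connecting with sqlplus confirmed that the SYS password had been changed on both primary and standby.
Check whether the newly generated password files are actually in use:
-- primary node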
SQL> select * from v$pwfile_users;
USERNAME SYSDB SYSOP SYSAS
------------------------------ ----- ----- -----
SYS TRUE TRUE FALSE
-- standby node
SQL> select * from v$pwfile_users;
no rows selected
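Clearly, the password file is in use on the primary but not on the standby.
Closer inspection of the standby password files showed that their names did not follow the orapw<$ORACLE_SID> naming convention: the files had kept the names used on the primary, but the standby instance names differ from the primary instance names.
Rename the standby password files: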
mv $ORACLE_HOME/dbs/ssyydb1 $ORACLE_HOME/dbs/orapwdrdb1
mv $ORACLE_HOME/dbs/ssyydb2 $ORACLE_HOME/dbs/orapwdrdb2
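The naming rule behind the rename above can be sketched in a few lines of Python. This is only an illustration: the helper names are hypothetical, the ORACLE_HOME value is taken from the orapwd commands earlier, and the standby SIDs drdb1 and drdb2 are inferred from the mv targets.

```python
import posixpath


def expected_password_file(oracle_home: str, oracle_sid: str) -> str:
    """Return the path the instance looks for: $ORACLE_HOME/dbs/orapw<SID>."""
    return posixpath.join(oracle_home, "dbs", "orapw" + oracle_sid)


def is_correctly_named(path: str, oracle_sid: str) -> bool:
    """True if the file name part of `path` follows the orapw<SID> convention."""
    return posixpath.basename(path) == "orapw" + oracle_sid


# Values assumed from the commands shown in this article.
ORACLE_HOME = "/oracle/db/oracle/product/11.1.0/db"

if __name__ == "__main__":
    for sid, copied_name in (("drdb1", "ssyydb1"), ("drdb2", "ssyydb2")):
        copied = posixpath.join(ORACLE_HOME, "dbs", copied_name)
        print(sid, is_correctly_named(copied, sid),
              expected_password_file(ORACLE_HOME, sid))
```

A check like this, run right after copying the files, would have flagged the misnamed files before the standby was queried through v$pwfile_users.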
After several minutes of observation, the ORA-01031 error was still not resolved.
We searched MOS and followed the solution in ORA-1031 for Remote Archive Destination on Primary (Doc ID 733793.1):
1. Make sure parameter REMOTE_LOGIN_PASSWORDFILE is set to EXCLUSIVE or SHARED in both databases.
2. Copy the password file again from primary :
a. Defer the log_archive_dest_2 on primary:
SQL> ALTER SYSTEM SET LOG_ARCHIVE_DEST_STATE_2 = DEFER;
b. Copy/ftp the password file from primary to standby and rename it accordingly on the standby database. Creating the password file on standby with orapwd-utility is not supported for 11g anymore.
Make sure that name of password file on both primary and standby is : orapw
c. Enable the log_archive_dest_2 on primary:
SQL> ALTER SYSTEM SET LOG_ARCHIVE_DEST_STATE_2 = ENABLE;
d. Switch 2-3 log files on primary :
SQL> ALTER SYSTEM SWITCH LOGFILE;
e. Check the status of log_archive_dest_2 on primary.
SQL> SELECT STATUS,ERROR FROM V$ARCHIVE_DEST WHERE DEST_ID =2;
STATUS ERROR
--------- -----------------------------------------------------------------
VALID
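To make the order of the MOS steps explicit, the following Python sketch only generates the SQL text for the primary, split into what runs before and after the password file copy. It does not connect to any database. The function name and its parameters are hypothetical; the statements are the ones quoted in the steps above.

```python
def password_resync_sql(dest_id: int = 2, log_switches: int = 3):
    """Return (before_copy, after_copy) lists of SQL statements for the primary.

    before_copy: run before copying the password file to the standby.
    after_copy:  run after the file has been copied and renamed on the standby.
    """
    before_copy = [
        f"ALTER SYSTEM SET LOG_ARCHIVE_DEST_STATE_{dest_id} = DEFER;",
    ]
    after_copy = [
        f"ALTER SYSTEM SET LOG_ARCHIVE_DEST_STATE_{dest_id} = ENABLE;",
    ]
    # Step d: switch a few log files so the archiver contacts the standby.
    after_copy += ["ALTER SYSTEM SWITCH LOGFILE;"] * log_switches
    # Step e: check the destination status.
    after_copy.append(
        f"SELECT STATUS, ERROR FROM V$ARCHIVE_DEST WHERE DEST_ID = {dest_id};"
    )
    return before_copy, after_copy


if __name__ == "__main__":
    before, after = password_resync_sql()
    print("\n".join(before))
    print("-- copy and rename the password file on the standby here --")
    print("\n".join(after))
```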
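We kept monitoring the primary alert logs. After ORA-01031 had continued for another 3-5 minutes, both primary nodes began shipping archive logs to the standby nodes normally and the standby instances applied them normally. ORA-01031 and ORA-16191 did not reappear in the alert logs of primary node 1 or node 2.
At this point, the fault was fully resolved.
2.2 Crash Analysis
First, we checked syslog on both nodes and found nothing abnormal, which rules out host-level causes.
Instance 1 alert log: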
Fri Apr 05 15:58:52 2013
Archived Log entry 34220 added for thread 1 sequence 12072 ID 0x9441c6d1 dest 1:
Fri Apr 05 16:32:39 2013
Read from controlfile member /dev/oravg/rlv_cntl1 has found a corrupted block (blk# 4, cf seq# 0)
Hex dump of (file 0, block 4) in trace file /oracle/db/diag/rdbms/ssyy/ssyy1/trace/ssyy1_lmon_22418.trc
Corrupt block relative dba: 0x00000004 (file 0, block 4)
Bad check value found during control file block read
Data in bad block:
type: 21 format: 2 rdba: 0x00000004
last change scn: 0x0000.00000000 seq: 0x1 flg: 0x04
spare1: 0x0 spare2: 0x0 spare3: 0x0
consistency value in tail: 0x00001501
check value in block header: 0x8f5d
computed block checksum: 0x2
Re-read from controlfile member /dev/oravg/rlv_cntl1 returned valid block 4
Hex dump of (file 0, block 4) in trace file /oracle/db/diag/rdbms/ssyy/ssyy1/trace/ssyy1_lmon_22418.trc
Errors in file /oracle/db/diag/rdbms/ssyy/ssyy1/trace/ssyy1_lmon_22418.trc:
ORA-00202: control file: /dev/oravg/rlv_cntl1
Errors in file /oracle/db/diag/rdbms/ssyy/ssyy1/trace/ssyy1_lmon_22418.trc (incident=888259):
ORA-00227: corrupt block detected in control file: (block 4, # blocks 1)
ORA-00202: control file: /dev/oravg/rlv_cntl1
Incident details in: /oracle/db/diag/rdbms/ssyy/ssyy1/incident/incdir_888259/ssyy1_lmon_22418_i888259.trc
Fri Apr 05 16:33:24 2013
Errors in file /oracle/db/diag/rdbms/ssyy/ssyy1/trace/ssyy1_lmon_22418.trc:
ORA-00227: corrupt block detected in control file: (block 4, # blocks 1)
ORA-00202: control file: /dev/oravg/rlv_cntl1
LMON (ospid: 22418): terminating the instance due to error 227
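16:32:39: instance 1 hit an error while reading control file /dev/oravg/rlv_cntl1 and reported a corrupt block.
16:33:24: instance 1 crashed because it could not read the control file.
A check of the three control files with dbv found no corrupt blocks: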
ssyy1: dbv file=/dev/datavg02/rlv_cntl1 blocksize=16384
ssyy1: dbv file=/dev/datavg02/rlv_cntl2 blocksize=16384
ssyy1: dbv file=/dev/datavg02/rlv_cntl3 blocksize=16384
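The alert log excerpt already hints that the corruption was transient: the same block that failed its checksum was reported valid on re-read. A minimal Python sketch of that check is shown below. The regular expressions are written against the two message formats in the excerpt above and are an assumption about alert logs in general; the function name and result labels are hypothetical.

```python
import re

# Patterns modelled on the two alert log messages quoted in this article.
CORRUPT = re.compile(
    r"Read from controlfile member (\S+) has found a corrupted block \(blk# (\d+)"
)
REREAD_OK = re.compile(
    r"Re-read from controlfile member (\S+) returned valid block (\d+)"
)


def classify_controlfile_reads(alert_log_text: str) -> dict:
    """Map (member, block) to 'reread_valid' or 'no_valid_reread'."""
    result = {}
    for line in alert_log_text.splitlines():
        found = CORRUPT.search(line)
        if found:
            result[(found.group(1), int(found.group(2)))] = "no_valid_reread"
            continue
        reread = REREAD_OK.search(line)
        if reread:
            key = (reread.group(1), int(reread.group(2)))
            if key in result:
                result[key] = "reread_valid"
    return result


SAMPLE = """\
Read from controlfile member /dev/oravg/rlv_cntl1 has found a corrupted block (blk# 4, cf seq# 0)
Bad check value found during control file block read
Re-read from controlfile member /dev/oravg/rlv_cntl1 returned valid block 4
"""

if __name__ == "__main__":
    print(classify_controlfile_reads(SAMPLE))
```

A 'reread_valid' result is consistent with the clean dbv checks, but it is only a hint from the log text and does not replace verifying the control files themselves.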
The node 2 crsd.log shows that at 16:35:23 CRS stopped instance 2 because the database had gone offline abnormally:
2013-04-05 16:32:42.179: [ CRSRES][6345673] Resource recovery not purged:ora.ssyy.ssyy2.inst
2013-04-05 16:32:42.205: [ CRSRES][6345673] ora.ssyy.ssyy2.inst target set to OFFLINE before stop action
2013-04-05 16:32:42.206: [ CRSRES][6345673] StopResource: setting CLI values
2013-04-05 16:32:42.252: [ CRSRES][6345673] Attempting to stop `ora.ssyy.ssyy2.inst` on member `ssyy2`
2013-04-05 16:33:40.826: [ CRSD][54] SM: rE2Ec: 4
2013-04-05 16:33:40.896: [ CRSRES][6345681] ora.ssyy.db target set to OFFLINE before stop action
2013-04-05 16:33:40.896: [ CRSRES][6345681] StopResource: setting CLI values
2013-04-05 16:33:42.288: [ CRSD][6345681] SM:dE2Ec: all E2E cmds done. 0
2013-04-05 16:35:23.123: [ CRSRES][6345695] Resource recovery not purged:ora.ssyy.db
2013-04-05 16:35:23.124: [ CRSRES][6345695] `ora.ssyy.db` is already OFFLINE.
2013-04-05 16:35:23.173: [ CRSRES][6345673] Stop of `ora.ssyy.ssyy2.inst` on member `ssyy2` succeeded.
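When lining up a timeline like this, it can help to compute intervals from the crsd.log timestamps rather than reading them off by eye. The Python sketch below measures the time between the "Attempting to stop" and "Stop of ... succeeded" entries for one resource. The function names are hypothetical and the parsing assumes the line layout shown in the excerpt above.

```python
import re
from datetime import datetime

# Timestamp prefix as it appears in the crsd.log excerpt above.
TS = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}):")


def parse_timestamp(line: str) -> datetime:
    """Extract the leading timestamp of a crsd.log line."""
    match = TS.match(line)
    if match is None:
        raise ValueError(f"no timestamp in line: {line!r}")
    return datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f")


def stop_duration_seconds(lines, resource: str) -> float:
    """Seconds between the stop attempt and the successful stop of `resource`."""
    start = end = None
    for line in lines:
        if f"Attempting to stop `{resource}`" in line:
            start = parse_timestamp(line)
        elif f"Stop of `{resource}`" in line and "succeeded" in line:
            end = parse_timestamp(line)
    if start is None or end is None:
        raise ValueError(f"stop attempt or completion not found for {resource}")
    return (end - start).total_seconds()


SAMPLE = [
    "2013-04-05 16:32:42.252: [ CRSRES][6345673] Attempting to stop `ora.ssyy.ssyy2.inst` on member `ssyy2`",
    "2013-04-05 16:35:23.173: [ CRSRES][6345673] Stop of `ora.ssyy.ssyy2.inst` on member `ssyy2` succeeded.",
]

if __name__ == "__main__":
    print(stop_duration_seconds(SAMPLE, "ora.ssyy.ssyy2.inst"))
```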
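Our initial suspicion was a bug. We opened an SR, and SSC staff and the SR back-end experts jointly confirmed that the crash matched bug 11698676.
This bug is a duplicate of bug 9549042, and patch 9549042 is already available for the current platform, HP-UX Itanium 64-bit.
2.3 Solution
Oracle's recommendation is to apply patch 9549042 as soon as possible to prevent this crash from recurring.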