Database Abnormal Crash Analysis Report

Published by IT那活兒 on 2023-01-11 13:20

Problem description

The database crashed abnormally on several consecutive days starting June 29, 2021. The alert log reported errors such as: Instance Critical Process (pid: 21, ospid: 105559, DBW2) died unexpectedly

The system messages log contained errors such as: Jun 29 06:14:25 HOSTNAME kernel: Out of memory: Kill process 105559 (ora_dbw2_XXXX) score 186 or sacrifice child

Problem analysis

1. Errors in the database alert log:

-- June 29

2021-06-29T06:14:27.599722+08:00

Instance Critical Process (pid: 21, ospid: 105559, DBW2) died unexpectedly

PMON (ospid: 105510): terminating the instance due to error 471

2021-06-29T06:14:27.929867+08:00

System state dump requested by (instance=1, osid=105510 (PMON)), summary=[abnormal instance termination].

System State dumped to trace file /u01/app/oracle/diag/rdbms/********_diag_105538_20210629061427.trc

-- June 30

2021-06-30T07:00:57.368545+08:00

Instance Critical Process (pid: 19, ospid: 66571, DBW0) died unexpectedly

PMON (ospid: 66527): terminating the instance due to error 471

2021-06-30T07:00:57.544376+08:00

System state dump requested by (instance=1, osid=66527 (PMON)), summary=[abnormal instance termination].

System State dumped to trace file /u01/app/oracle/diag/rdbms/********_diag_66555_20210630070057.trc

-- July 1

2021-07-01T09:07:04.699459+08:00

Instance Critical Process (pid: 20, ospid: 125148, DBW1) died unexpectedly

PMON (ospid: 125085): terminating the instance due to error 471

2021-07-01T09:07:05.005935+08:00

System state dump requested by (instance=1, osid=125085 (PMON)), summary=[abnormal instance termination].

System State dumped to trace file /u01/app/oracle/diag/rdbms/********_diag_125126_20210701090705.trc

-- July 2

2021-07-02T07:36:36.660203+08:00

Instance Critical Process (pid: 19, ospid: 25474, DBW0) died unexpectedly

PMON (ospid: 25437): terminating the instance due to error 471

2021-07-02T07:36:37.001449+08:00

System state dump requested by (instance=1, osid=25437 (PMON)), summary=[abnormal instance termination].

System State dumped to trace file /u01/app/oracle/diag/rdbms/********_diag_25457_20210702073637.trc

-- July 3

2021-07-03T07:52:35.531254+08:00

Instance Critical Process (pid: 7, ospid: 91834, MMAN) died unexpectedly

2021-07-03T07:52:55.666657+08:00

PMON (ospid: 91819): terminating the instance due to error 822

2021-07-03T07:52:55.679848+08:00

System state dump requested by (instance=1, osid=91819 (PMON)), summary=[abnormal instance termination].

System State dumped to trace file /u01/app/oracle/diag/rdbms/********_diag_91842_20210703075255.trc

2. Errors in the system messages log:

-- June 29

Jun 29 06:14:25 HOSTNAME kernel: Out of memory: Kill process 105559 (ora_dbw2_XXXX) score 186 or sacrifice child

Jun 29 06:14:25 HOSTNAME kernel: Killed process 105559 (ora_dbw2_XXXX) total-vm:13041272kB, anon-rss:17376kB, file-rss:80kB, shmem-rss:9180780kB

-- June 30

Jun 30 07:00:57 HOSTNAME kernel: Out of memory: Kill process 66571 (ora_dbw0_XXXX) score 130 or sacrifice child

Jun 30 07:00:57 HOSTNAME kernel: Killed process 66571 (ora_dbw0_XXXX) total-vm:13041260kB, anon-rss:18256kB, file-rss:0kB, shmem-rss:6426356kB

Jun 30 07:00:57 HOSTNAME kernel: ora_dbw0_XXXX: page allocation failure: order:0, mode:0x2015a

Jun 30 07:00:57 HOSTNAME kernel: CPU: 26 PID: 66571 Comm: ora_dbw0_XXXX Not tainted 3.10.0-693.el7.x86_64 #1

-- July 1

Jul 1 09:07:01 HOSTNAME kernel: Out of memory: Kill process 125152 (ora_dbw3_XXXX) score 140 or sacrifice child

Jul 1 09:07:01 HOSTNAME kernel: Killed process 125152 (ora_dbw3_XXXX) total-vm:13041280kB, anon-rss:17888kB, file-rss:568kB, shmem-rss:6901124kB

-- July 2

Jul 2 07:36:24 HOSTNAME kernel: Out of memory: Kill process 31865 (oracle_31865_ns) score 142 or sacrifice child

Jul 2 07:36:24 HOSTNAME kernel: Killed process 31865 (oracle_31865_ns) total-vm:13034256kB, anon-rss:8496kB, file-rss:512kB, shmem-rss:7017736kB

-- July 3

Jul 3 07:52:28 HOSTNAME kernel: Out of memory: Kill process 91834 (ora_mman_XXXX) score 32 or sacrifice child

Jul 3 07:52:28 HOSTNAME kernel: Killed process 91834 (ora_mman_XXXX) total-vm:13022640kB, anon-rss:2892kB, file-rss:80kB, shmem-rss:1615108kB

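3. System memory usage:

Taken together, the database alert log and the system messages log show that the system ran out of memory and the kernel OOM killer killed critical database background processes (DBWn on most days, MMAN on July 3). PMON then terminated the instance with error 471 (ORA-00471, DBWR process terminated with error) or error 822 (ORA-00822, MMAN process terminated with error). The system memory figures showed that the 16 GB swap partition was completely used up.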
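To cross-check the two logs, the OOM-killer victims can be pulled out of the messages log in one pass. The following is a minimal sketch, not the method used in this incident. The function name list_oom_kills is hypothetical, and the pattern assumes the kernel message format shown in the excerpts above ("Out of memory: Kill process ... score ...").

```shell
# list_oom_kills [LOG_FILE]
# Print one line per OOM-killer victim: timestamp, pid, process name, score.
# Assumption: syslog lines look like the 3.10-kernel excerpts in this report.
list_oom_kills() {
  local log_file="${1:-/var/log/messages}"
  sed -n -E \
    's/^([A-Z][a-z]{2} +[0-9]+ [0-9:]{8}) .*Out of memory: Kill process ([0-9]+) \(([^)]+)\) score ([0-9]+).*/\1 pid=\2 name=\3 score=\4/p' \
    "$log_file"
}
```

Calling list_oom_kills /var/log/messages on the June 29 excerpt would print "Jun 29 06:14:25 pid=105559 name=ora_dbw2_XXXX score=186". The pid can then be matched against the ospid reported in the alert log.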

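Fault handling

1. Shut down the ASM instance and HAS (nobody uses the database on weekends)

-- Inspection showed that processes related to the ASM instance were occupying a large share of the swap partition

-- Stop the ASM instance and HAS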

srvctl stop asm

crsctl stop has

2. Clear the swap partition

-- Drop the page cache so that free memory exceeds the amount of swap in use (swapoff has to move swapped-out pages back into RAM)

sync

echo "3" > /proc/sys/vm/drop_caches
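The precondition stated above (free memory larger than used swap) can be checked before swap is turned off. This is a sketch under assumptions: the function name swap_fits_in_ram is hypothetical, and it relies on the MemAvailable, SwapTotal and SwapFree fields being present in /proc/meminfo.

```shell
# swap_fits_in_ram [MEMINFO_FILE]
# Exit 0 when MemAvailable exceeds used swap (SwapTotal - SwapFree), else 1.
# Assumption: the kernel exposes MemAvailable in /proc/meminfo.
swap_fits_in_ram() {
  local meminfo="${1:-/proc/meminfo}"
  awk '
    $1 == "MemAvailable:" { avail = $2 }
    $1 == "SwapTotal:"    { swap_total = $2 }
    $1 == "SwapFree:"     { swap_free = $2 }
    END {
      used = swap_total - swap_free
      printf "available=%d kB, swap used=%d kB\n", avail, used
      exit ((avail > used) ? 0 : 1)
    }
  ' "$meminfo"
}
```

A cautious way to use it is swap_fits_in_ram && swapoff -a, so that swap is only disabled when the swapped-out pages are expected to fit back into memory.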

-- Turn off swap

swapoff -a

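-- Kill the processes that hold the most swap (waiting for the system to release it on its own is very slow). The loop below lists the top 20 processes by swap usage, read from /proc/<pid>/smaps

[root@HOSTNAME ~]# for i in `cd /proc; ls | grep "^[0-9]" | awk '$0 > 100'`; do awk -v pid="$i" '/^Swap:/{a=a+$2} END{print pid, a/1024 "M"}' /proc/$i/smaps; done 2>/dev/null | sort -k2nr | head -20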

66285 101.973M

124601 6.6875M

88435 6.53906M

75924 6.46875M

88033 6.42188M

15747 6.41406M

71480 6.38672M

112856 6.32422M

32315 6.24609M

30195 6.24219M

118924 6.23047M

112052 6.12891M

43413 6.10156M

123682 6.0625M

62669 6.05859M

89471 5.99219M

38687 5.96094M

23452 5.95703M

30953 5.95703M

13602 5.95312M
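The listing above shows only pids, so each one still has to be looked up before deciding whether it is safe to kill. A possible variant, sketched here as an assumption rather than what was run during the incident, reads the VmSwap and Name fields from /proc/<pid>/status so that the process name appears next to the figure. The function name top_swap_users is hypothetical. VmSwap does not count swapped-out shared memory, so the numbers can differ from the smaps totals.

```shell
# top_swap_users [PROC_ROOT] [LIMIT]
# Print "pid name swapM" for the LIMIT processes with the largest VmSwap.
# Assumption: /proc/<pid>/status carries Name: and VmSwap: (kernel threads
# have no VmSwap line and are skipped automatically).
top_swap_users() {
  local proc_root="${1:-/proc}" limit="${2:-20}" status pid
  for status in "$proc_root"/[0-9]*/status; do
    [ -r "$status" ] || continue
    pid="${status%/status}"   # strip the trailing /status
    pid="${pid##*/}"          # keep only the pid directory name
    awk -v pid="$pid" '
      $1 == "Name:"   { name = $2 }
      $1 == "VmSwap:" { printf "%s %s %.1fM\n", pid, name, $2 / 1024 }
    ' "$status" 2>/dev/null
  done | sort -k3,3nr | head -n "$limit"
}
```

Running top_swap_users with no arguments inspects the live /proc tree. The optional arguments exist mainly so the function can be pointed at a copied directory for testing.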

3. Adjust the swappiness parameter and re-enable swap

-- Adjust the swappiness parameter (the setting goes in /etc/sysctl.conf and is loaded with sysctl -p)

cat /etc/sysctl.conf

vm.swappiness=0

sysctl -p

swapon -a

-- Start HAS and the ASM instance

crsctl start has

srvctl start asm
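After step 3 it is worth confirming that the running kernel actually picked up the swappiness value from the configuration file. The sketch below is one way to do that. Both function names are hypothetical, and it only inspects the single file it is given, so settings under /etc/sysctl.d are not considered.

```shell
# configured_swappiness [CONF_FILE]
# Print the last uncommented vm.swappiness value found in the file.
configured_swappiness() {
  local conf="${1:-/etc/sysctl.conf}"
  awk -F= '
    /^[ \t]*vm\.swappiness[ \t]*=/ { gsub(/[ \t]/, "", $2); value = $2 }
    END { if (value != "") print value }
  ' "$conf"
}

# swappiness_in_sync [CONF_FILE] [RUNTIME_FILE]
# Exit 0 when the configured value equals the value the kernel is using.
swappiness_in_sync() {
  local conf="${1:-/etc/sysctl.conf}" runtime_file="${2:-/proc/sys/vm/swappiness}"
  local wanted actual
  wanted=$(configured_swappiness "$conf")
  actual=$(cat "$runtime_file")
  echo "configured=${wanted:-unset} running=$actual"
  [ -n "$wanted" ] && [ "$wanted" = "$actual" ]
}
```

With the setting shown in step 3 applied, swappiness_in_sync should report configured=0 running=0 and succeed. A mismatch usually means sysctl -p has not been run yet or another file overrides the value.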

4. Start the database and resize the SGA and PGA (at the customer's request)

-- Start the database

-- Adjust the SGA and PGA

alter system set sga_max_size=8G scope=spfile;
alter system set sga_target=8G scope=spfile;
alter system set pga_aggregate_target=2G scope=spfile;

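-- Restart the database

Summary

Processes related to the ASM instance were occupying a large share of the swap partition. After the ASM instance and HAS were shut down over the weekend, a large number of processes of the following two kinds were found still holding swap:

/u01/app/oracle/tfa/HOSTNAME/tfa_home/perl/bin/perl /u01/app/oracle/tfa/HOSTNAME/tfa_home/bin/tfactl.pl rediscover -mode full -auto

sh -c rpm -qa --queryformat "(%{INSTALLTIME:date}):%{NAME}|%{VERSION}|%{RELEASE}|%{ARCH} " > rpms.out 2>&1

Killing these processes in bulk released the swap partition quickly. While the database is running normally, only non-critical processes can be killed; for critical processes you have to wait for the system to release the swap on its own.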
