12c RAC: cluster services fail to start after a disaster recovery drill

Environment

Solaris 11 SPARC (LDOM)
Oracle 12.2.0.1 EE RAC 

Symptom

After the OS was rebooted, GI (Grid Infrastructure) would not start, and osysmond.bin was not running either.

Details

EMRRAC1

root@EMRRAC1:~# /oracle/app/grid/bin/crsctl start cluster
CRS-2672: Attempting to start 'ora.crf' on 'emrrac1'
CRS-2672: Attempting to start 'ora.cssd' on 'emrrac1'
CRS-2672: Attempting to start 'ora.diskmon' on 'emrrac1'
CRS-2676: Start of 'ora.diskmon' on 'emrrac1' succeeded
CRS-2676: Start of 'ora.crf' on 'emrrac1' succeeded
CRS-2674: Start of 'ora.cssd' on 'emrrac1' failed
CRS-2679: Attempting to clean 'ora.cssd' on 'emrrac1'
CRS-2681: Clean of 'ora.cssd' on 'emrrac1' succeeded
CRS-2673: Attempting to stop 'ora.crf' on 'emrrac1'
CRS-2677: Stop of 'ora.crf' on 'emrrac1' succeeded
CRS-4000: Command Start failed, or completed with errors.

EMRRAC2

root@EMRRAC2:~# /oracle/app/grid/bin/crsctl start cluster
CRS-2672: Attempting to start 'ora.crf' on 'emrrac2'
CRS-2672: Attempting to start 'ora.cssd' on 'emrrac2'
CRS-2672: Attempting to start 'ora.diskmon' on 'emrrac2'
CRS-2676: Start of 'ora.diskmon' on 'emrrac2' succeeded
CRS-2676: Start of 'ora.crf' on 'emrrac2' succeeded
CRS-2674: Start of 'ora.cssd' on 'emrrac2' failed
CRS-2679: Attempting to clean 'ora.cssd' on 'emrrac2'
CRS-2681: Clean of 'ora.cssd' on 'emrrac2' succeeded
CRS-2673: Attempting to stop 'ora.crf' on 'emrrac2'
CRS-2677: Stop of 'ora.crf' on 'emrrac2' succeeded
CRS-4000: Command Start failed, or completed with errors.

HISRAC2

root@HISRAC2:~# /oracle/app/grid/bin/crsctl start cluster
CRS-2672: Attempting to start 'ora.crf' on 'hisrac2'
CRS-2672: Attempting to start 'ora.cssd' on 'hisrac2'
CRS-2672: Attempting to start 'ora.diskmon' on 'hisrac2'
CRS-2676: Start of 'ora.diskmon' on 'hisrac2' succeeded
CRS-2676: Start of 'ora.crf' on 'hisrac2' succeeded
CRS-2674: Start of 'ora.cssd' on 'hisrac2' failed
CRS-2679: Attempting to clean 'ora.cssd' on 'hisrac2'
CRS-2681: Clean of 'ora.cssd' on 'hisrac2' succeeded
CRS-2673: Attempting to stop 'ora.crf' on 'hisrac2'
CRS-2677: Stop of 'ora.crf' on 'hisrac2' succeeded
CRS-4000: Command Start failed, or completed with errors.
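
The three transcripts above fail at the same resource. As a minimal sketch (the function name failed_resources is hypothetical, not an Oracle tool), the failing resource can be pulled out of saved crsctl output by filtering on the CRS-2674 message code:

```shell
# Print "<resource> on <node>" for every CRS-2674 (start failed) line on stdin.
# The field separator is the single quote, so $2 is the resource name
# and $4 is the node name.
failed_resources() {
    awk -F"'" '/^CRS-2674:/ { print $2 " on " $4 }'
}

# Example (reads a saved transcript; the file name is an assumption):
#   failed_resources < crsctl_start.log
```

On all three nodes this reports ora.cssd, which points at CSS and therefore at access to the voting disks.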

The OS logs from the three nodes are shown below. (After the "vdisk offline" messages, no "vdisk online" messages follow.)

EMRRAC1
May  4 11:39:17 EMRRAC1 genunix: [ID 390243 kern.info] Creating /etc/devices/devid_cache
May  4 11:39:18 EMRRAC1 hwmgmtd[1051]: [ID 702911 daemon.notice] hwmgmtd version 2.4.2.2 r20727 started.
May  4 11:39:20 EMRRAC1 vdc: [ID 990228 kern.info] vdisk@1 is offline
May  4 11:39:20 EMRRAC1 vdc: [ID 990228 kern.info] vdisk@2 is offline
May  4 11:39:20 EMRRAC1 vdc: [ID 990228 kern.info] vdisk@3 is offline
May  4 11:39:20 EMRRAC1 vdc: [ID 990228 kern.info] vdisk@4 is offline
May  4 11:39:20 EMRRAC1 vdc: [ID 990228 kern.info] vdisk@5 is offline
May  4 11:39:20 EMRRAC1 vdc: [ID 990228 kern.info] vdisk@6 is offline
May  4 11:39:24 EMRRAC1 oracleoks: [ID 123267 kern.notice] NOTICE: OKSK-00028: In memory kernel log buffer address: 0x304004ee90, size: 10485760

EMRRAC2
May  4 11:39:34 EMRRAC2 root: [ID 702911 user.error] Starting execution of Oracle Clusterware init.ohasd
May  4 11:39:36 EMRRAC2 hwmgmtd[856]: [ID 702911 daemon.notice] hwmgmtd version 2.4.2.2 r20727 started.
May  4 11:39:39 EMRRAC2 vdc: [ID 990228 kern.info] vdisk@1 is offline
May  4 11:39:39 EMRRAC2 vdc: [ID 990228 kern.info] vdisk@2 is offline
May  4 11:39:39 EMRRAC2 vdc: [ID 990228 kern.info] vdisk@3 is offline
May  4 11:39:39 EMRRAC2 vdc: [ID 990228 kern.info] vdisk@4 is offline
May  4 11:39:39 EMRRAC2 vdc: [ID 990228 kern.info] vdisk@5 is offline
May  4 11:39:39 EMRRAC2 vdc: [ID 990228 kern.info] vdisk@6 is offline
May  4 11:39:42 EMRRAC2 oracleoks: [ID 123267 kern.notice] NOTICE: OKSK-00028: In memory kernel log buffer address: 0x304005d3fd9f0, size: 10485760
May  4 11:39:42 EMRRAC2 oracleoks: [ID 863671 kern.notice] NOTICE: OKSK-00027: Oracle kernel distributed lock manager hash size is 31251

HISRAC2
May  4 11:52:42 HISRAC2 root: [ID 702911 user.error] Starting execution of Oracle Clusterware init.ohasd
May  4 11:52:47 HISRAC2 vdc: [ID 990228 kern.info] vdisk@1 is offline
May  4 11:52:47 HISRAC2 hwmgmtd[1117]: [ID 702911 daemon.notice] hwmgmtd version 2.4.2.2 r20727 started.
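
Checking by eye whether each "offline" message is followed by an "online" message gets tedious with many vdisks. The sketch below (the function name vdisks_left_offline is hypothetical) reads syslog lines and prints every vdisk whose most recent vdc message is "offline". It assumes the message format shown in the logs above; depending on the platform's awk, nawk may need to be substituted.

```shell
# Read syslog lines on stdin and print each vdisk whose latest vdc
# message says "offline" (no later "online" message was logged for it).
vdisks_left_offline() {
    awk '
        / vdc: / && / is (offline|online)/ {
            # Find the "vdisk@N" token; the state is two fields after it
            # ("vdisk@N is offline" / "vdisk@N is online using ...").
            for (i = 1; i <= NF; i++)
                if ($i ~ /^vdisk@[0-9]+$/) state[$i] = $(i + 2)
        }
        END { for (d in state) if (state[d] == "offline") print d }
    ' | sort
}

# Example on Solaris, where syslog writes to /var/adm/messages:
#   vdisks_left_offline < /var/adm/messages
```

Because only the last state per vdisk is kept, messages from earlier boots in the same file do not cause false reports.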


Workaround

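On EMRRAC1, checking /dev/rdsk (whose entries are symlinks to the device nodes under /devices) showed that three of the disks (the voting disks) had lost their ownership and permissions.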

root@EMRRAC1# ls -l /devices/virtual-devices\@100/channel-devices\@200 |grep a,raw
crw-------   1 root     sys      279,  0 May  4 11:39 disk@0:a,raw
crw-rw----   1 grid     asmadmin 279,  8 Apr 27 18:09 disk@1:a,raw
crw-rw----   1 grid     asmadmin 279, 16 Apr 27 18:09 disk@2:a,raw
crw-rw----   1 grid     asmadmin 279, 24 Apr 27 18:09 disk@3:a,raw
crw-------   1 root     sys      279, 32 May  4 11:39 disk@4:a,raw
crw-------   1 root     sys      279, 40 May  4 11:39 disk@5:a,raw
crw-------   1 root     sys      279, 48 May  4 11:39 disk@6:a,raw

root@EMRRAC1# chown grid:asmadmin /dev/rdsk/c1d4* /dev/rdsk/c1d5* /dev/rdsk/c1d6*
root@EMRRAC1# chmod 0660 /dev/rdsk/c1d4* /dev/rdsk/c1d5* /dev/rdsk/c1d6*

root@EMRRAC1# ls -l /devices/virtual-devices\@100/channel-devices\@200 |grep a,raw
crw-------   1 root     sys      279,  0 May  4 11:39 disk@0:a,raw
crw-rw----   1 grid     asmadmin 279,  8 Apr 27 18:09 disk@1:a,raw
crw-rw----   1 grid     asmadmin 279, 16 Apr 27 18:09 disk@2:a,raw
crw-rw----   1 grid     asmadmin 279, 24 Apr 27 18:09 disk@3:a,raw
crw-rw----   1 grid     asmadmin 279, 32 May  4 13:30 disk@4:a,raw
crw-rw----   1 grid     asmadmin 279, 40 May  4 13:30 disk@5:a,raw
crw-rw----   1 grid     asmadmin 279, 48 May  4 13:30 disk@6:a,raw

root@EMRRAC1# /oracle/app/grid/bin/crsctl start cluster
CRS-2672: Attempting to start 'ora.crf' on 'emrrac1'
CRS-2672: Attempting to start 'ora.cssd' on 'emrrac1'
CRS-2672: Attempting to start 'ora.diskmon' on 'emrrac1'
CRS-2676: Start of 'ora.diskmon' on 'emrrac1' succeeded
CRS-2676: Start of 'ora.crf' on 'emrrac1' succeeded
CRS-2676: Start of 'ora.cssd' on 'emrrac1' succeeded
CRS-2672: Attempting to start 'ora.ctssd' on 'emrrac1'
CRS-2672: Attempting to start 'ora.cluster_interconnect.haip' on 'emrrac1'
CRS-2676: Start of 'ora.ctssd' on 'emrrac1' succeeded
CRS-2676: Start of 'ora.cluster_interconnect.haip' on 'emrrac1' succeeded
CRS-2672: Attempting to start 'ora.asm' on 'emrrac1'
CRS-2676: Start of 'ora.asm' on 'emrrac1' succeeded
CRS-2672: Attempting to start 'ora.storage' on 'emrrac1'
CRS-2676: Start of 'ora.storage' on 'emrrac1' succeeded
CRS-2672: Attempting to start 'ora.crsd' on 'emrrac1'
CRS-2676: Start of 'ora.crsd' on 'emrrac1' succeeded
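
The manual comparison of the device listing can also be scripted. This is a minimal sketch (the function name wrong_asm_perms is hypothetical) that reads ls -l output and prints the raw device nodes that are not grid:asmadmin with mode crw-rw----. It skips disk@0 on the assumption that it is the OS disk, since it stays root:sys in the listings above even after the fix.

```shell
# Read `ls -l` output for the raw device nodes on stdin and print the
# devices that are NOT owned by grid:asmadmin with mode crw-rw----.
# Assumption: disk@0 is the OS disk, not an ASM disk, so it is skipped.
wrong_asm_perms() {
    awk '
        $NF ~ /^disk@[0-9]+:a,raw$/ && $NF !~ /^disk@0:/ {
            if ($1 != "crw-rw----" || $3 != "grid" || $4 != "asmadmin")
                print $NF
        }
    '
}

# Example, using the same directory as in the transcript:
#   ls -l /devices/virtual-devices@100/channel-devices@200 | wrong_asm_perms
```

An empty result means every shared disk already has the expected owner, group, and mode.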

On EMRRAC2, checking /dev/rdsk showed an even worse state: the ownership and permissions of every disk had been reset.

root@EMRRAC2:/devices/virtual-devices@100/channel-devices@200#  ls -l *:a,raw
crw-------   1 root     sys      279,  0 May  4 11:39 disk@0:a,raw
crw-------   1 root     sys      279,  8 May  4 11:39 disk@1:a,raw
crw-------   1 root     sys      279, 16 May  4 11:39 disk@2:a,raw
crw-------   1 root     sys      279, 24 May  4 11:39 disk@3:a,raw
crw-------   1 root     sys      279, 32 May  4 11:39 disk@4:a,raw
crw-------   1 root     sys      279, 40 May  4 11:39 disk@5:a,raw
crw-------   1 root     sys      279, 48 May  4 11:39 disk@6:a,raw

On HISRAC2, checking /dev/rdsk: this node has only one shared disk.

root@HISRAC2:~# ls -l /devices/virtual-devices\@100/channel-devices\@200 |grep a,raw
crw-------   1 root     sys      279,  0 May  4 11:52 disk@0:a,raw
crw-------   1 root     sys      279,  8 May  4 11:52 disk@1:a,raw
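
The fix on these two nodes is the same chown/chmod pair used on EMRRAC1. As a hedged sketch, the helper below (print_asm_perm_fix is a hypothetical name) only prints the commands for a list of disk numbers so they can be reviewed before being run as root. It assumes the c1dN device naming seen on EMRRAC1 also applies on the other nodes, which should be verified first.

```shell
# Print (do not execute) the chown/chmod commands for the given disk numbers.
# Assumption: devices follow the /dev/rdsk/c1dN* pattern seen on EMRRAC1.
print_asm_perm_fix() {
    for n in "$@"; do
        printf 'chown grid:asmadmin /dev/rdsk/c1d%s*\n' "$n"
        printf 'chmod 0660 /dev/rdsk/c1d%s*\n' "$n"
    done
}

# Example: EMRRAC2 lost permissions on disks 1 to 6, HISRAC2 on disk 1.
#   print_asm_perm_fix 1 2 3 4 5 6
#   print_asm_perm_fix 1
```

Printing instead of executing keeps the boot disk (disk@0) out of harm's way if a wrong number is passed.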

After the permissions were corrected on all three nodes, they were rebooted with reboot -- -r, and the permission problem did not recur.

root@EMRRAC1:~# dmesg |grep vdisk
May  4 13:39:03 EMRRAC1 vdc: [ID 625787 kern.info] vdisk@3 is online using ldc@9,0
May  4 13:39:04 EMRRAC1 vdc: [ID 625787 kern.info] vdisk@4 is online using ldc@12,0
May  4 13:39:04 EMRRAC1 vdc: [ID 625787 kern.info] vdisk@5 is online using ldc@13,0
May  4 13:39:04 EMRRAC1 vdc: [ID 625787 kern.info] vdisk@6 is online using ldc@14,0
May  4 13:39:17 EMRRAC1 vdc: [ID 990228 kern.info] vdisk@1 is offline
May  4 13:39:17 EMRRAC1 vdc: [ID 990228 kern.info] vdisk@2 is offline
May  4 13:39:17 EMRRAC1 vdc: [ID 990228 kern.info] vdisk@3 is offline
May  4 13:39:17 EMRRAC1 vdc: [ID 990228 kern.info] vdisk@4 is offline
May  4 13:39:17 EMRRAC1 vdc: [ID 990228 kern.info] vdisk@5 is offline
May  4 13:39:17 EMRRAC1 vdc: [ID 990228 kern.info] vdisk@6 is offline
May  4 13:39:21 EMRRAC1 vdc: [ID 625787 kern.info] vdisk@1 is online using ldc@7,0
May  4 13:39:21 EMRRAC1 vdc: [ID 625787 kern.info] vdisk@2 is online using ldc@8,0
May  4 13:39:21 EMRRAC1 vdc: [ID 625787 kern.info] vdisk@3 is online using ldc@9,0
May  4 13:39:21 EMRRAC1 vdc: [ID 625787 kern.info] vdisk@4 is online using ldc@12,0
May  4 13:39:21 EMRRAC1 vdc: [ID 625787 kern.info] vdisk@5 is online using ldc@13,0
May  4 13:39:21 EMRRAC1 vdc: [ID 625787 kern.info] vdisk@6 is online using ldc@14,0
