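When performing a health check on an Exadata system, we usually collect the following information:
1. exachk
2. sundiag
3. diagcollect (starting with GI 11.2.0.4.x, TFA Collector can be used instead)
4. AWR
5. alert logs from the DB nodes and cell nodes
6. OSW (OSWatcher)
Depending on whether these checks reveal anomalies, further tools such as CheckHWnFWProfile may also be needed.
This article focuses on how to identify evidence of disk failure.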
++++++++++++++++++++++++++ Review the cell alerts for messages indicating that a disk needs to be replaced:
Check the alert history on the cells:
dcli -g cell_group -l root "cellcli -e list alerthistory"
Look for the key messages:
grep "PREDICTIVE FAILURE" cell-alerthistory.txt
grep "Logical drive status changed" cell-alerthistory.txt
grep "NOT PRESENT" cell-alerthistory*.txt
grep "POOR PERFORMANCE" cell-alerthistory*.txt
grep "critical" cell-alerthistory.txt
For example:
[root@lunar tmp]# grep "NOT PRESENT" cell-alerthistory*.txt
cell-alerthistory1.txt:dm01cel05: 44_1 2013-08-13T16:13:07+08:00 critical "Hard disk was removed. Status : NOT PRESENT Manufacturer : HITACHI Model Number : HUS1560SCSUN600G Size : 600G Serial Number : 1216KLMTLN Firmware : A700 Slot Number : 0 Cell Disk : CD_00_dm01cel05 Grid Disk : RECO_DM01_CD_00_dm01cel05, DATA_DM01_CD_00_dm01cel05"
cell-alerthistory1.txt:dm01cel05: 47_1 2013-09-24T00:16:46+08:00 critical "Hard disk was removed. Status : NOT PRESENT Manufacturer : HITACHI Model Number : HUS1560SCSUN600G Size : 600G Serial Number : 1216KLMTLN Firmware : A700 Slot Number : 0 Cell Disk : CD_00_dm01cel05 Grid Disk : RECO_DM01_CD_00_dm01cel05, DATA_DM01_CD_00_dm01cel05"
cell-alerthistory1.txt:dm01cel05: 48_1 2013-09-25T04:45:31+08:00 critical "Hard disk was removed. Status : NOT PRESENT Manufacturer : HITACHI Model Number : HUS1560SCSUN600G Size : 600G Serial Number : 1216KLMTLN Firmware : A700 Slot Number : 0 Cell Disk : CD_00_dm01cel05 Grid Disk : RECO_DM01_CD_00_dm01cel05, DATA_DM01_CD_00_dm01cel05"
cell-alerthistory1.txt:dm01cel05: 56_1 2013-10-25T16:29:21+08:00 critical "Hard disk was removed. Status : NOT PRESENT Manufacturer : HITACHI Model Number : HUS1560SCSUN600G Size : 600G Serial Number : 1216KLMTLN Firmware : A700 Slot Number : 0 Cell Disk : CD_00_dm01cel05 Grid Disk : RECO_DM01_CD_00_dm01cel05, DATA_DM01_CD_00_dm01cel05"
cell-alerthistory1.txt:dm01cel05: 57_1 2013-10-28T01:26:58+08:00 critical "Hard disk was removed. Status : NOT PRESENT Manufacturer : HITACHI Model Number : HUS1560SCSUN600G Size : 600G Serial Number : 1216KLMTLN Firmware : A700 Slot Number : 0 Cell Disk : CD_00_dm01cel05 Grid Disk : RECO_DM01_CD_00_dm01cel05, DATA_DM01_CD_00_dm01cel05"
cell-alerthistory1.txt:dm01cel05: 59_1 2013-10-30T10:24:27+08:00 critical "Hard disk was removed. Status : NOT PRESENT Manufacturer : HITACHI Model Number : HUS1560SCSUN600G Size : 600G Serial Number : 1216KLMTLN Firmware : A700 Slot Number : 0 Cell Disk : CD_00_dm01cel05 Grid Disk : RECO_DM01_CD_00_dm01cel05, DATA_DM01_CD_00_dm01cel05"
[root@lunar tmp]#
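In the output above, all six alerts refer to the same disk (slot 0, serial number 1216KLMTLN, on dm01cel05). When the alert history is long, it can help to collapse repeated alerts into one line per disk. The following is only a minimal sketch: the function name summarize_removed_disks is hypothetical, and it assumes the input files hold the raw dcli output, one alert per line, with each line starting with the cell name as in the example.

```shell
#!/bin/bash
# Hypothetical helper (not part of any Exadata tooling):
# count "NOT PRESENT" alerts per cell, slot and serial number.
# Assumes one alert per line, each line starting with "<cellname>:".
summarize_removed_disks() {
    grep -h "NOT PRESENT" "$@" |
    # Keep the cell name, serial number and slot number; drop the rest.
    sed -n 's/^\([^:]*\):.*Serial Number : \([^ ]*\).*Slot Number : \([^ ]*\).*/\1 slot=\3 serial=\2/p' |
    # Collapse identical disks and strip the padding uniq adds.
    sort | uniq -c | sed 's/^ *//'
}

# Usage (file names as in the example above):
#   summarize_removed_disks cell-alerthistory*.txt
```

Each output line has the form "count cell slot=N serial=S", so a disk that keeps dropping out shows up as a single line with a high count.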
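+++++++++++++++++++++++++++ Review the sundiag output:
After collecting sundiag, you will find that each DB node and cell node produces a large number of files, covering RAID, HCA, InfiniBand, and so on.
For example: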
[root@lunar sundiag_2013_11_21_13_33]# ls
alert.log                                          dm01cel05_megacli64-AdpAllInfo_2013_11_21_13_33.out
CmdTool.log                                        dm01cel05_megacli64-BbuCmd_2013_11_21_13_33.out
dm01cel05_alerthistory_2013_11_21_13_33.out        dm01cel05_megacli64-CfgDsply_2013_11_21_13_33.out
dm01cel05_aurasmart_2013_11_21_13_33.out           dm01cel05_megacli64-FwTermLog_2013_11_21_13_33.out
dm01cel05_cell-detail_2013_11_21_13_33.out         dm01cel05_megacli64-GetEvents-all_2013_11_21_13_33.out
dm01cel05_celldisk-detail_2013_11_21_13_33.out     dm01cel05_megacli64-LdInfo_2013_11_21_13_33.out
dm01cel05_disk_devices_2013_11_21_13_33.out        dm01cel05_megacli64-LdPdInfo_2013_11_21_13_33.out
dm01cel05_dmesg_2013_11_21_13_33.out               dm01cel05_megacli64-PdList_long_2013_11_21_13_33.out
dm01cel05_fdisk-l_2013_11_21_13_33.out             dm01cel05_megacli64-PdList_short_2013_11_21_13_33.out
dm01cel05_fdom-l_2013_11_21_13_33.out              dm01cel05_megacli64-status_2013_11_21_13_33.out
dm01cel05_flashcache-detail_2013_11_21_13_33.out   dm01cel05_physicaldisk-detail_2013_11_21_13_33.out
dm01cel05_griddisk-detail_2013_11_21_13_33.out     dm01cel05_physicaldisk-fail_2013_11_21_13_33.out
dm01cel05_imageinfo-all_2013_11_21_13_33.out       dm01cel05_scripts-aura_2013_11_21_13_33.out
dm01cel05_lspci_2013_11_21_13_33.out               dm01cel05_sel-list_2013_11_21_13_33.out
dm01cel05_lspci-xxxx_2013_11_21_13_33.out          MegaSAS.log
dm01cel05_lsscsi_2013_11_21_13_33.out              messages
dm01cel05_lun-detail_2013_11_21_13_33.out          ms-odl.trc
[root@lunar sundiag_2013_11_21_13_33]#
To find evidence of disk failure, the main things to check are:
grep "Failed Disks" *
grep "Predictive Failure Count" *
grep "warning" *
grep "Error Count" *|grep -v "Error Count: 0"
grep "I/O error" *
grep "Unconfigured" *
---------- Check for failed disks:
[root@lunar sundiag_2013_11_21_13_33]# grep "Failed Disks" *
dm01cel05_megacli64-AdpAllInfo_2013_11_21_13_33.out: Failed Disks : 0
dm01cel05_megacli64-status_2013_11_21_13_33.out:Failed Disks : 0
MegaSAS.log: Failed Disks : 0
MegaSAS.log: Failed Disks : 0
MegaSAS.log: Failed Disks : 0
MegaSAS.log: Failed Disks : 0
[root@lunar sundiag_2013_11_21_13_33]#
---------- Check for disks that have reported a predictive failure:
[root@lunar sundiag_2013_11_21_13_33]# grep "Predictive Failure Count" *
dm01cel05_megacli64-CfgDsply_2013_11_21_13_33.out:Predictive Failure Count: 0
dm01cel05_megacli64-CfgDsply_2013_11_21_13_33.out:Predictive Failure Count: 0
.......................
dm01cel05_megacli64-LdPdInfo_2013_11_21_13_33.out:Predictive Failure Count: 0
dm01cel05_megacli64-LdPdInfo_2013_11_21_13_33.out:Predictive Failure Count: 0
dm01cel05_megacli64-LdPdInfo_2013_11_21_13_33.out:Predictive Failure Count: 0
dm01cel05_megacli64-PdList_long_2013_11_21_13_33.out:Predictive Failure Count: 0
dm01cel05_megacli64-PdList_long_2013_11_21_13_33.out:Predictive Failure Count: 0
.......................
dm01cel05_megacli64-PdList_long_2013_11_21_13_33.out:Predictive Failure Count: 0
dm01cel05_megacli64-PdList_long_2013_11_21_13_33.out:Predictive Failure Count: 0
MegaSAS.log:Predictive Failure Count: 0
MegaSAS.log:Predictive Failure Count: 0
.......................
MegaSAS.log:Predictive Failure Count: 0
MegaSAS.log:Predictive Failure Count: 0
[root@lunar sundiag_2013_11_21_13_33]#
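Because almost every line in this output reports a count of 0, the interesting lines are easy to miss. A minimal sketch of one way to keep only the non-zero counts follows. The function name nonzero_predictive_failures is hypothetical, and the sketch assumes the counter value is the last colon-separated field on the line, as in the MegaCli output shown above.

```shell
#!/bin/bash
# Hypothetical helper: print only "Predictive Failure Count" lines
# whose value is greater than zero, prefixed with the file name.
nonzero_predictive_failures() {
    grep -H "Predictive Failure Count" "$@" |
    # Split on ":" plus optional spaces; the last field is the count.
    awk -F': *' '$NF + 0 > 0'
}

# Usage, from inside the sundiag directory:
#   nonzero_predictive_failures *
```

Comparing the value numerically avoids relying on the exact spacing around the colon, which differs between the sundiag files.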
---------- Check for disk warning messages, non-zero error counters, I/O errors, and unconfigured disks:
[root@lunar sundiag_2013_11_21_13_33]# grep "warning" *
dm01cel05_alerthistory_2013_11_21_13_33.out: 61_1 2013-11-08T02:44:22+08:00 warning "Hard disk entered confinement offline status. The LUN 0_8 changed status to warning - confinedOffline. CellDisk changed status to normal - confinedOffline. All subsequent I/Os on this disk are failed immediately. Confinement tests will be run on the disk to determine if the disk should be dropped. Status : WARNING - CONFINEDOFFLINE Manufacturer : HITACHI Model Number : HUS1560SCSUN600G Size : 600G Serial Number : 1216KLN0HN Firmware : A700 Slot Number : 8 Cell Disk : CD_08_dm01cel05 Grid Disk : RECO_DM01_CD_08_dm01cel05, DBFS_DG_CD_08_dm01cel05, DATA_DM01_CD_08_dm01cel05 Reason for confinement : threshold for service time exceeded"
dm01cel05_cell-detail_2013_11_21_13_33.out: notificationPolicy: critical,warning,clear
dm01cel05_dmesg_2013_11_21_13_33.out:warning: `ntpdate' uses 32-bit capabilities (legacy support in use)
messages:Jun 22 16:58:56 dm01cel05 kernel: warning: `dbus-daemon' uses 32-bit capabilities (legacy support in use)
messages:Jun 22 17:02:03 dm01cel05 setfiles: labeling /usr/share/snmp/mib2c-data/m2c-internal-warning.m2i to system_u:object_r:usr_t:s0
messages:Jun 22 17:02:05 dm01cel05 setfiles: labeling /usr/share/man/man3/warnings.3pm.gz to system_u:object_r:man_t:s0
messages:Jun 22 17:02:05 dm01cel05 setfiles: labeling /usr/share/man/man3/warnings::register.3pm.gz to system_u:object_r:man_t:s0
messages:Jun 22 17:02:17 dm01cel05 setfiles: relabeling /usr/share/swig/1.3.29/swigwarnings.swg from root:object_r:file_t:s0 to system_u:object_r:usr_t:s0
messages:Jun 22 17:02:19 dm01cel05 setfiles: relabeling /usr/lib/perl5/5.8.8/warnings from root:object_r:file_t:s0 to system_u:object_r:lib_t:s0
messages:Jun 22 17:02:19 dm01cel05 setfiles: labeling /usr/lib/perl5/5.8.8/warnings/register.pm to system_u:object_r:lib_t:s0
messages:Jun 22 17:02:19 dm01cel05 setfiles: labeling /usr/lib/perl5/5.8.8/warnings.pm to system_u:object_r:lib_t:s0
messages:Jun 22 17:02:21 dm01cel05 setfiles: labeling /usr/lib64/perl5/5.8.8/x86_64-linux-thread-multi/CORE/warnings.h to system_u:object_r:lib_t:s0
messages:Jun 22 17:02:22 dm01cel05 setfiles: labeling /usr/lib64/python2.4/warnings.pyo to system_u:object_r:lib_t:s0
messages:Jun 22 17:02:22 dm01cel05 setfiles: labeling /usr/lib64/python2.4/warnings.pyc to system_u:object_r:lib_t:s0
messages:Jun 22 17:02:23 dm01cel05 setfiles: labeling /usr/lib64/python2.4/warnings.py to system_u:object_r:lib_t:s0
messages:Jun 22 17:13:44 dm01cel05 kernel: warning: `ntpdate' uses 32-bit capabilities (legacy support in use)
.......................
messages:Nov 1 19:57:36 dm01cel05 kernel: warning: `ntpdate' uses 32-bit capabilities (legacy support in use)
ms-odl.trc:[2013-11-08T02:44:22.980+08:00] [ossmgmt] [NOTIFICATION] [] [ms.core.MSAlertHistory] [tid: 17] [ecid: 10.48.27.14:63987:1383849862863:10,0] AlertHistory 61_1 created. Severity: warning. Message: Hard disk entered confinement offline status. The LUN 0_8 changed status to warning - confinedOffline. CellDisk changed status to normal - confinedOffline. All subsequent I/Os on this disk are failed immediately. Confinement tests will be run on the disk to determine if the disk should be dropped.<exadata:br/>Status : WARNING - CONFINEDOFFLINE <exadata:br/>Manufacturer : HITACHI <exadata:br/>Model Number : HUS1560SCSUN600G <exadata:br/>Size : 600G <exadata:br/>Serial Number : 1216KLN0HN <exadata:br/>Firmware : A700 <exadata:br/>Slot Number : 8 <exadata:br/>Cell Disk : CD_08_dm01cel05 <exadata:br/>Grid Disk : RECO_DM01_CD_08_dm01cel05, DBFS_DG_CD_08_dm01cel05, DATA_DM01_CD_08_dm01cel05 <exadata:br/>Reason for confinement : threshold for service time exceeded
ms-odl.trc:[2013-11-08T02:48:03.652+08:00] [ossmgmt] [NOTIFICATION] [] [ms.hwadapter.diskadp.MSLUNImpl] [tid: 19] [ecid: 10.48.27.14:63987:1383849872786:12,0] Called reenableLun for: 0 LUN: 0_8 lunOSName: /dev/sdi phys: 20:8 slotNumber: 8 lunStatus: warning reenableForce: true
ms-odl.trc:[2013-11-08T02:48:03.859+08:00] [ossmgmt] [NOTIFICATION] [] [ms.hwadapter.diskadp.MSLUNImpl] [tid: 19] [ecid: 10.48.27.14:63987:1383849872786:12,0] LUN 0_8 is in state warning and is not a system disk. No further action is required at this time.
[root@lunar sundiag_2013_11_21_13_33]#

[root@lunar sundiag_2013_11_21_13_33]# grep "Error Count" *|grep -v "Error Count: 0"
dm01cel05_megacli64-AdpAllInfo_2013_11_21_13_33.out: Error Counters
dm01cel05_megacli64-CfgDsply_2013_11_21_13_33.out:Media Error Count: 51
dm01cel05_megacli64-LdPdInfo_2013_11_21_13_33.out:Media Error Count: 51
dm01cel05_megacli64-PdList_long_2013_11_21_13_33.out:Media Error Count: 51
MegaSAS.log: Error Counters
MegaSAS.log:Media Error Count: 51
MegaSAS.log:Media Error Count: 51
MegaSAS.log:Media Error Count: 51
MegaSAS.log:Media Error Count: 51
MegaSAS.log: Error Counters
MegaSAS.log:Media Error Count: 51
MegaSAS.log: Error Counters
MegaSAS.log: Error Counters
[root@lunar sundiag_2013_11_21_13_33]#

[root@lunar sundiag_2013_11_21_13_33]# grep "I/O error" *
alert.log:Redo log write error 201 (Generic I/O error) on griddisk DATA_DM01_CD_00_dm01cel05: use of Flash Log for this device has been disabled
alert.log:Redo log write error 201 (Generic I/O error) on griddisk DATA_DM01_CD_00_dm01cel05: use of Flash Log for this device has been disabled
dm01cel05_dmesg_2013_11_21_13_33.out:end_request: I/O error, dev sdi, sector 15595520
dm01cel05_dmesg_2013_11_21_13_33.out:end_request: I/O error, dev sdi, sector 15595520
dm01cel05_dmesg_2013_11_21_13_33.out:end_request: I/O error, dev sdi, sector 15595520
dm01cel05_dmesg_2013_11_21_13_33.out:end_request: I/O error, dev sdi, sector 349304832
messages:Jun 29 21:44:18 dm01cel05 kernel: end_request: I/O error, dev sda, sector 66141200
messages:Jun 29 21:44:18 dm01cel05 kernel: end_request: I/O error, dev sda, sector 66139152
messages:Jun 29 21:44:18 dm01cel05 kernel: end_request: I/O error, dev sda, sector 66137104
messages:Jun 29 21:44:19 dm01cel05 kernel: end_request: I/O error, dev sda, sector 66143248
messages:Jun 29 21:44:19 dm01cel05 kernel: end_request: I/O error, dev sda, sector 1130092206
messages:Jun 29 21:44:19 dm01cel05 kernel: end_request: I/O error, dev sda, sector 1130092206
messages:Jun 29 21:44:19 dm01cel05 kernel: end_request: I/O error, dev sda, sector 1338480
messages:Jun 29 21:44:19 dm01cel05 kernel: end_request: I/O error, dev sda, sector 134319216
[root@lunar sundiag_2013_11_21_13_33]# grep "Unconfigured" *
dm01cel05_megacli64-AdpAllInfo_2013_11_21_13_33.out:PR Correct Unconfigured Areas : Yes
dm01cel05_megacli64-FwTermLog_2013_11_21_13_33.out:T5: prCorrectUnconfiguredAreas=1, useFdeOnly=1
dm01cel05_megacli64-FwTermLog_2013_11_21_13_33.out:T5: enableSpinDownUnconfigured=1, disableSpinDownHS=0, spinDownTime=1e, autoEnhancedImport=0, enableSecretKeyControl=0, disableOnlineCtrlReset=0
dm01cel05_megacli64-FwTermLog_2013_11_21_13_33.out:09/29/13 23:40:27: prCorrectUnconfiguredAreas=1, useFdeOnly=1
dm01cel05_megacli64-FwTermLog_2013_11_21_13_33.out:09/29/13 23:40:27: enableSpinDownUnconfigured=1, disableSpinDownHS=0, spinDownTime=1e, autoEnhancedImport=0, enableSecretKeyControl=0, disableOnlineCtrlReset=0
dm01cel05_megacli64-FwTermLog_2013_11_21_13_33.out:T5: prCorrectUnconfiguredAreas=1, useFdeOnly=1
dm01cel05_megacli64-FwTermLog_2013_11_21_13_33.out:T5: enableSpinDownUnconfigured=1, disableSpinDownHS=0, spinDownTime=1e, autoEnhancedImport=0, enableSecretKeyControl=0, disableOnlineCtrlReset=0
dm01cel05_megacli64-FwTermLog_2013_11_21_13_33.out:10/11/13 22:58:47: prCorrectUnconfiguredAreas=1, useFdeOnly=1
dm01cel05_megacli64-FwTermLog_2013_11_21_13_33.out:10/11/13 22:58:47: enableSpinDownUnconfigured=1, disableSpinDownHS=0, spinDownTime=1e, autoEnhancedImport=0, enableSecretKeyControl=0, disableOnlineCtrlReset=0
dm01cel05_megacli64-FwTermLog_2013_11_21_13_33.out:T5: prCorrectUnconfiguredAreas=1, useFdeOnly=1
dm01cel05_megacli64-FwTermLog_2013_11_21_13_33.out:T5: enableSpinDownUnconfigured=1, disableSpinDownHS=0, spinDownTime=1e, autoEnhancedImport=0, enableSecretKeyControl=0, disableOnlineCtrlReset=0
MegaSAS.log:PR Correct Unconfigured Areas : Yes
MegaSAS.log:T5: prCorrectUnconfiguredAreas=1, useFdeOnly=1
MegaSAS.log:T5: enableSpinDownUnconfigured=1, disableSpinDownHS=0, spinDownTime=1e, autoEnhancedImport=0, enableSecretKeyControl=0, disableOnlineCtrlReset=0
MegaSAS.log:09/29/13 23:40:27: prCorrectUnconfiguredAreas=1, useFdeOnly=1
MegaSAS.log:09/29/13 23:40:27: enableSpinDownUnconfigured=1, disableSpinDownHS=0, spinDownTime=1e, autoEnhancedImport=0, enableSecretKeyControl=0, disableOnlineCtrlReset=0
MegaSAS.log:T5: prCorrectUnconfiguredAreas=1, useFdeOnly=1
MegaSAS.log:T5: enableSpinDownUnconfigured=1, disableSpinDownHS=0, spinDownTime=1e, autoEnhancedImport=0, enableSecretKeyControl=0, disableOnlineCtrlReset=0
MegaSAS.log:10/11/13 22:58:47: prCorrectUnconfiguredAreas=1, useFdeOnly=1
MegaSAS.log:10/11/13 22:58:47: enableSpinDownUnconfigured=1, disableSpinDownHS=0, spinDownTime=1e, autoEnhancedImport=0, enableSecretKeyControl=0, disableOnlineCtrlReset=0
MegaSAS.log:T5: prCorrectUnconfiguredAreas=1, useFdeOnly=1
MegaSAS.log:T5: enableSpinDownUnconfigured=1, disableSpinDownHS=0, spinDownTime=1e, autoEnhancedImport=0, enableSecretKeyControl=0, disableOnlineCtrlReset=0
MegaSAS.log:PR Correct Unconfigured Areas : Yes
MegaSAS.log:PR Correct Unconfigured Areas : Yes
MegaSAS.log:PR Correct Unconfigured Areas : Yes
[root@lunar sundiag_2013_11_21_13_33]#
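In the "I/O error" output above, the kernel errors fall on two devices, sda and sdi, and the ms-odl.trc lines show that /dev/sdi is LUN 0_8 in slot 8, the disk that entered confinement. To see at a glance which devices are affected, the kernel messages can be tallied per device. This is only a minimal sketch: the function name io_errors_by_device is hypothetical, and it assumes the kernel message format "I/O error, dev <name>, sector <n>" seen in the dmesg and messages files above.

```shell
#!/bin/bash
# Hypothetical helper: count kernel "I/O error" lines per block device.
# Lines without ", dev <name>," (such as the cell alert.log entries) are ignored.
io_errors_by_device() {
    grep -h "I/O error, dev " "$@" |
    # Keep only the device name that follows "dev ".
    sed -n 's/.*I\/O error, dev \([^, ]*\),.*/\1/p' |
    # Tally identical device names and print "device count".
    sort | uniq -c | awk '{print $2, $1}'
}

# Usage, from inside the sundiag directory:
#   io_errors_by_device *dmesg*.out messages
```

The device names can then be mapped back to slots through the lunOSName and slotNumber fields in ms-odl.trc or the lun-detail file.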
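Use cellcli to check the disks for errors (the first command runs from the shell; the other two are entered at the CellCLI prompt):
cellcli -e list physicaldisk detail|grep err
list physicaldisk where disktype=harddisk and status like ".*predictive failure.*"
list physicaldisk where disktype=harddisk and status=critical detail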
Check the ASM alert log for warnings similar to the following:
1. WARNING: failed to read mirror
2. IO Error
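Both messages can be searched for in one pass. The sketch below is a minimal example rather than a definitive check: the function name scan_asm_alert is hypothetical, the location of the ASM alert log depends on the installation and must be passed in as an argument, and the match is made case-insensitive on the assumption that the capitalization of "IO Error" may vary.

```shell
#!/bin/bash
# Hypothetical helper: list ASM alert-log lines that match either message,
# with line numbers so the surrounding context can be looked up afterwards.
scan_asm_alert() {
    grep -n -i -E "WARNING: failed to read mirror|IO Error" "$@"
}

# Usage (the path is a placeholder for the actual ASM alert log):
#   scan_asm_alert /path/to/alert_+ASM1.log
```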