Exadata HC: Checking Whether Any Hard Disks Need to Be Replaced

Contact: QQ (5163721)

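Author: Lunar © All rights reserved [This article may be reposted, but the source address must be credited with a link; otherwise legal action will be pursued.]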

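When performing an Exadata health check, we usually collect the following information:
1. exachk
2. sundiag
3. diagcollect (starting with GI 11.2.0.4.x, TFA Collector can be used instead)
4. AWR
5. alert logs of the DB nodes and cell nodes
6. OSW
Depending on whether the items above show anything abnormal, CheckHWnFWProfile and other tools may also be needed.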

This article focuses on how to identify disk failures.

++++++++++++++++++++++++++Check the cell alerts for messages indicating that a disk needs to be replaced:
Check the alert history of the cells:
dcli -g cell_group -l root "cellcli -e list alerthistory"

Search the saved output for the key messages:

grep "PREDICTIVE FAILURE" cell-alerthistory.txt
grep "Logical drive status changed" cell-alerthistory.txt
grep "NOT PRESENT" cell-alerthistory*.txt
grep "POOR PERFORMANCE" cell-alerthistory*.txt
grep "critical" cell-alerthistory.txt

For example:

[root@lunar tmp]# grep "NOT PRESENT" cell-alerthistory*.txt
cell-alerthistory1.txt:dm01cel05: 44_1  2013-08-13T16:13:07+08:00       critical        "Hard disk was removed.  Status        : NOT PRESENT  Manufacturer  : HITACHI  Model Number  : HUS1560SCSUN600G  Size          : 600G  Serial Number : 1216KLMTLN  Firmware      : A700  Slot Number   : 0  Cell Disk     : CD_00_dm01cel05  Grid Disk     : RECO_DM01_CD_00_dm01cel05, DATA_DM01_CD_00_dm01cel05"
cell-alerthistory1.txt:dm01cel05: 47_1  2013-09-24T00:16:46+08:00       critical        "Hard disk was removed.  Status        : NOT PRESENT  Manufacturer  : HITACHI  Model Number  : HUS1560SCSUN600G  Size          : 600G  Serial Number : 1216KLMTLN  Firmware      : A700  Slot Number   : 0  Cell Disk     : CD_00_dm01cel05  Grid Disk     : RECO_DM01_CD_00_dm01cel05, DATA_DM01_CD_00_dm01cel05"
cell-alerthistory1.txt:dm01cel05: 48_1  2013-09-25T04:45:31+08:00       critical        "Hard disk was removed.  Status        : NOT PRESENT  Manufacturer  : HITACHI  Model Number  : HUS1560SCSUN600G  Size          : 600G  Serial Number : 1216KLMTLN  Firmware      : A700  Slot Number   : 0  Cell Disk     : CD_00_dm01cel05  Grid Disk     : RECO_DM01_CD_00_dm01cel05, DATA_DM01_CD_00_dm01cel05"
cell-alerthistory1.txt:dm01cel05: 56_1  2013-10-25T16:29:21+08:00       critical        "Hard disk was removed.  Status        : NOT PRESENT  Manufacturer  : HITACHI  Model Number  : HUS1560SCSUN600G  Size          : 600G  Serial Number : 1216KLMTLN  Firmware      : A700  Slot Number   : 0  Cell Disk     : CD_00_dm01cel05  Grid Disk     : RECO_DM01_CD_00_dm01cel05, DATA_DM01_CD_00_dm01cel05"
cell-alerthistory1.txt:dm01cel05: 57_1  2013-10-28T01:26:58+08:00       critical        "Hard disk was removed.  Status        : NOT PRESENT  Manufacturer  : HITACHI  Model Number  : HUS1560SCSUN600G  Size          : 600G  Serial Number : 1216KLMTLN  Firmware      : A700  Slot Number   : 0  Cell Disk     : CD_00_dm01cel05  Grid Disk     : RECO_DM01_CD_00_dm01cel05, DATA_DM01_CD_00_dm01cel05"
cell-alerthistory1.txt:dm01cel05: 59_1  2013-10-30T10:24:27+08:00       critical        "Hard disk was removed.  Status        : NOT PRESENT  Manufacturer  : HITACHI  Model Number  : HUS1560SCSUN600G  Size          : 600G  Serial Number : 1216KLMTLN  Firmware      : A700  Slot Number   : 0  Cell Disk     : CD_00_dm01cel05  Grid Disk     : RECO_DM01_CD_00_dm01cel05, DATA_DM01_CD_00_dm01cel05"
[root@lunar tmp]# 
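
The same disk can raise the same alert many times, as in the output above. A minimal sketch for collapsing such repeats into one line per cell, slot and serial number follows. It assumes the alert history was saved from dcli in the layout shown above (each line prefixed with the cell name); the function name summarize_removed_disks is only illustrative.

```shell
#!/bin/bash
# Count "NOT PRESENT" alerts per cell, slot and disk serial number.
# Assumption: input lines look like the dcli output above, i.e.
#   <cell>: <alert id> <timestamp> <severity> "<message with Slot/Serial fields>"
summarize_removed_disks() {
    grep -h "NOT PRESENT" "$@" |
    awk '{
        cell = $1; sub(/:$/, "", cell)      # strip the trailing ":" added by dcli
        serial = "unknown"; slot = "unknown"
        if (match($0, /Serial Number *: *[^ ]+/)) {
            serial = substr($0, RSTART, RLENGTH); sub(/.*: */, "", serial)
        }
        if (match($0, /Slot Number *: *[0-9]+/)) {
            slot = substr($0, RSTART, RLENGTH); sub(/.*: */, "", slot)
        }
        count[cell " slot=" slot " serial=" serial]++
    }
    END { for (k in count) print count[k], k }' | sort -rn
}

# Usage: summarize_removed_disks cell-alerthistory*.txt
```

A disk that shows up repeatedly with the same serial number in the same slot is a strong candidate for a closer look.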

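+++++++++++++++++++++++++++Check the sundiag information:
After collecting sundiag output, you will find a large number of files for each DB node and cell node, covering RAID, HCA, InfiniBand, and so on.
For example: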

[root@lunar sundiag_2013_11_21_13_33]# ls
alert.log                                         dm01cel05_megacli64-AdpAllInfo_2013_11_21_13_33.out
CmdTool.log                                       dm01cel05_megacli64-BbuCmd_2013_11_21_13_33.out
dm01cel05_alerthistory_2013_11_21_13_33.out       dm01cel05_megacli64-CfgDsply_2013_11_21_13_33.out
dm01cel05_aurasmart_2013_11_21_13_33.out          dm01cel05_megacli64-FwTermLog_2013_11_21_13_33.out
dm01cel05_cell-detail_2013_11_21_13_33.out        dm01cel05_megacli64-GetEvents-all_2013_11_21_13_33.out
dm01cel05_celldisk-detail_2013_11_21_13_33.out    dm01cel05_megacli64-LdInfo_2013_11_21_13_33.out
dm01cel05_disk_devices_2013_11_21_13_33.out       dm01cel05_megacli64-LdPdInfo_2013_11_21_13_33.out
dm01cel05_dmesg_2013_11_21_13_33.out              dm01cel05_megacli64-PdList_long_2013_11_21_13_33.out
dm01cel05_fdisk-l_2013_11_21_13_33.out            dm01cel05_megacli64-PdList_short_2013_11_21_13_33.out
dm01cel05_fdom-l_2013_11_21_13_33.out             dm01cel05_megacli64-status_2013_11_21_13_33.out
dm01cel05_flashcache-detail_2013_11_21_13_33.out  dm01cel05_physicaldisk-detail_2013_11_21_13_33.out
dm01cel05_griddisk-detail_2013_11_21_13_33.out    dm01cel05_physicaldisk-fail_2013_11_21_13_33.out
dm01cel05_imageinfo-all_2013_11_21_13_33.out      dm01cel05_scripts-aura_2013_11_21_13_33.out
dm01cel05_lspci_2013_11_21_13_33.out              dm01cel05_sel-list_2013_11_21_13_33.out
dm01cel05_lspci-xxxx_2013_11_21_13_33.out         MegaSAS.log
dm01cel05_lsscsi_2013_11_21_13_33.out             messages
dm01cel05_lun-detail_2013_11_21_13_33.out         ms-odl.trc
[root@lunar sundiag_2013_11_21_13_33]# 

For disk failures, the main things to check are:

grep "Failed Disks" * 
grep "Predictive Failure Count" * 
grep "warning" * 
grep "Error Count" *|grep -v "Error Count: 0"
grep "I/O error" * 
grep "Unconfigured" * 

—————–Check for failed disks:

[root@lunar sundiag_2013_11_21_13_33]# grep "Failed Disks" *
dm01cel05_megacli64-AdpAllInfo_2013_11_21_13_33.out:  Failed Disks    : 0 
dm01cel05_megacli64-status_2013_11_21_13_33.out:Failed Disks : 0
MegaSAS.log:  Failed Disks    : 0 
MegaSAS.log:  Failed Disks    : 0 
MegaSAS.log:  Failed Disks    : 0 
MegaSAS.log:  Failed Disks    : 0 
[root@lunar sundiag_2013_11_21_13_33]# 

———————Check for disks that have reported "predictive failure":

[root@lunar sundiag_2013_11_21_13_33]# grep "Predictive Failure Count" * 
dm01cel05_megacli64-CfgDsply_2013_11_21_13_33.out:Predictive Failure Count: 0

dm01cel05_megacli64-CfgDsply_2013_11_21_13_33.out:Predictive Failure Count: 0
.......................
dm01cel05_megacli64-LdPdInfo_2013_11_21_13_33.out:Predictive Failure Count: 0
dm01cel05_megacli64-LdPdInfo_2013_11_21_13_33.out:Predictive Failure Count: 0
dm01cel05_megacli64-LdPdInfo_2013_11_21_13_33.out:Predictive Failure Count: 0
dm01cel05_megacli64-PdList_long_2013_11_21_13_33.out:Predictive Failure Count: 0
dm01cel05_megacli64-PdList_long_2013_11_21_13_33.out:Predictive Failure Count: 0
.......................

dm01cel05_megacli64-PdList_long_2013_11_21_13_33.out:Predictive Failure Count: 0
dm01cel05_megacli64-PdList_long_2013_11_21_13_33.out:Predictive Failure Count: 0
MegaSAS.log:Predictive Failure Count: 0
MegaSAS.log:Predictive Failure Count: 0
.......................
MegaSAS.log:Predictive Failure Count: 0
MegaSAS.log:Predictive Failure Count: 0
[root@lunar sundiag_2013_11_21_13_33]# 
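
Most of these counters are zero, so the interesting lines are easy to miss. A small sketch that hides the zero values follows; the helper name nonzero_counters is made up for this example, and it only assumes that each counter line ends in ": <number>" as in the output above.

```shell
#!/bin/bash
# Print only the counter lines whose value is not zero.
# $1: counter name to search for; remaining arguments: files to search.
nonzero_counters() {
    local name=$1
    shift
    # First grep finds the counter, second grep drops lines ending in ": 0"
    grep -H "$name" "$@" | grep -Ev ":[[:space:]]*0[[:space:]]*$"
}

# Usage: nonzero_counters "Predictive Failure Count" *megacli64-PdList_long*.out
```

If every counter is zero the function prints nothing and returns a non-zero exit status, which makes it easy to use in a script.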

———-Check for disks that raised warnings:

[root@lunar sundiag_2013_11_21_13_33]# grep "warning" * 
dm01cel05_alerthistory_2013_11_21_13_33.out:     61_1    2013-11-08T02:44:22+08:00       warning         "Hard disk entered confinement offline status. The LUN 0_8 changed status to warning - confinedOffline. CellDisk changed status to normal - confinedOffline. All subsequent I/Os on this disk are failed immediately. Confinement tests will be run on the disk to determine if the disk should be dropped. Status                      : WARNING - CONFINEDOFFLINE  Manufacturer                : HITACHI  Model Number                : HUS1560SCSUN600G  Size                        : 600G  Serial Number               : 1216KLN0HN  Firmware                    : A700  Slot Number                 : 8  Cell Disk                   : CD_08_dm01cel05  Grid Disk                   : RECO_DM01_CD_08_dm01cel05, DBFS_DG_CD_08_dm01cel05, DATA_DM01_CD_08_dm01cel05  Reason for confinement      : threshold for service time exceeded"
dm01cel05_cell-detail_2013_11_21_13_33.out:      notificationPolicy:     critical,warning,clear
dm01cel05_dmesg_2013_11_21_13_33.out:warning: `ntpdate' uses 32-bit capabilities (legacy support in use)
messages:Jun 22 16:58:56 dm01cel05 kernel: warning: `dbus-daemon' uses 32-bit capabilities (legacy support in use)
messages:Jun 22 17:02:03 dm01cel05 setfiles: labeling /usr/share/snmp/mib2c-data/m2c-internal-warning.m2i to system_u:object_r:usr_t:s0 
messages:Jun 22 17:02:05 dm01cel05 setfiles: labeling /usr/share/man/man3/warnings.3pm.gz to system_u:object_r:man_t:s0 
messages:Jun 22 17:02:05 dm01cel05 setfiles: labeling /usr/share/man/man3/warnings::register.3pm.gz to system_u:object_r:man_t:s0 
messages:Jun 22 17:02:17 dm01cel05 setfiles: relabeling /usr/share/swig/1.3.29/swigwarnings.swg from root:object_r:file_t:s0 to system_u:object_r:usr_t:s0 
messages:Jun 22 17:02:19 dm01cel05 setfiles: relabeling /usr/lib/perl5/5.8.8/warnings from root:object_r:file_t:s0 to system_u:object_r:lib_t:s0 
messages:Jun 22 17:02:19 dm01cel05 setfiles: labeling /usr/lib/perl5/5.8.8/warnings/register.pm to system_u:object_r:lib_t:s0 
messages:Jun 22 17:02:19 dm01cel05 setfiles: labeling /usr/lib/perl5/5.8.8/warnings.pm to system_u:object_r:lib_t:s0 
messages:Jun 22 17:02:21 dm01cel05 setfiles: labeling /usr/lib64/perl5/5.8.8/x86_64-linux-thread-multi/CORE/warnings.h to system_u:object_r:lib_t:s0 
messages:Jun 22 17:02:22 dm01cel05 setfiles: labeling /usr/lib64/python2.4/warnings.pyo to system_u:object_r:lib_t:s0 
messages:Jun 22 17:02:22 dm01cel05 setfiles: labeling /usr/lib64/python2.4/warnings.pyc to system_u:object_r:lib_t:s0 
messages:Jun 22 17:02:23 dm01cel05 setfiles: labeling /usr/lib64/python2.4/warnings.py to system_u:object_r:lib_t:s0 
messages:Jun 22 17:13:44 dm01cel05 kernel: warning: `ntpdate' uses 32-bit capabilities (legacy support in use)
.......................
messages:Nov  1 19:57:36 dm01cel05 kernel: warning: `ntpdate' uses 32-bit capabilities (legacy support in use)
ms-odl.trc:[2013-11-08T02:44:22.980+08:00] [ossmgmt] [NOTIFICATION] [] [ms.core.MSAlertHistory] [tid: 17] [ecid: 10.48.27.14:63987:1383849862863:10,0] AlertHistory 61_1 created. Severity: warning. Message: Hard disk entered confinement offline status. The LUN 0_8 changed status to warning - confinedOffline. CellDisk changed status to normal - confinedOffline. All subsequent I/Os on this disk are failed immediately. Confinement tests will be run on the disk to determine if the disk should be dropped.<exadata:br/>Status                      : WARNING - CONFINEDOFFLINE <exadata:br/>Manufacturer                : HITACHI <exadata:br/>Model Number                : HUS1560SCSUN600G <exadata:br/>Size                        : 600G <exadata:br/>Serial Number               : 1216KLN0HN <exadata:br/>Firmware                    : A700 <exadata:br/>Slot Number                 : 8 <exadata:br/>Cell Disk                   : CD_08_dm01cel05 <exadata:br/>Grid Disk                   : RECO_DM01_CD_08_dm01cel05, DBFS_DG_CD_08_dm01cel05, DATA_DM01_CD_08_dm01cel05 <exadata:br/>Reason for confinement      : threshold for service time exceeded
ms-odl.trc:[2013-11-08T02:48:03.652+08:00] [ossmgmt] [NOTIFICATION] [] [ms.hwadapter.diskadp.MSLUNImpl] [tid: 19] [ecid: 10.48.27.14:63987:1383849872786:12,0] Called reenableLun for: 0 LUN: 0_8 lunOSName: /dev/sdi phys: 20:8 slotNumber: 8 lunStatus: warning reenableForce: true
ms-odl.trc:[2013-11-08T02:48:03.859+08:00] [ossmgmt] [NOTIFICATION] [] [ms.hwadapter.diskadp.MSLUNImpl] [tid: 19] [ecid: 10.48.27.14:63987:1383849872786:12,0] LUN 0_8 is in state warning and is not a system disk. No further action is required at this time.[[
[root@lunar sundiag_2013_11_21_13_33]# 
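
A plain grep for "warning" also matches kernel messages and file-labeling lines that have nothing to do with disks, as the output above shows. One way to cut the noise is to look only at the alert history file and filter on its severity column. The sketch below assumes the sundiag *_alerthistory_*.out layout seen above (alert ID, timestamp, severity, message); the function name alerts_by_severity is illustrative.

```shell
#!/bin/bash
# Keep only alert history entries whose severity is warning or critical.
# Assumption: the third whitespace-separated field is the severity,
# as in the sundiag *_alerthistory_*.out file shown above.
alerts_by_severity() {
    awk '$3 == "warning" || $3 == "critical"' "$@"
}

# Usage: alerts_by_severity *_alerthistory_*.out
```

Note that output collected through dcli carries an extra cell-name column at the front, so the field number would need to be adjusted there.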


[root@lunar sundiag_2013_11_21_13_33]# grep "Error Count" *|grep -v "Error Count: 0"
dm01cel05_megacli64-AdpAllInfo_2013_11_21_13_33.out:                Error Counters
dm01cel05_megacli64-CfgDsply_2013_11_21_13_33.out:Media Error Count: 51
dm01cel05_megacli64-LdPdInfo_2013_11_21_13_33.out:Media Error Count: 51
dm01cel05_megacli64-PdList_long_2013_11_21_13_33.out:Media Error Count: 51
MegaSAS.log:                Error Counters
MegaSAS.log:Media Error Count: 51
MegaSAS.log:Media Error Count: 51
MegaSAS.log:Media Error Count: 51
MegaSAS.log:Media Error Count: 51
MegaSAS.log:                Error Counters
MegaSAS.log:Media Error Count: 51
MegaSAS.log:                Error Counters
MegaSAS.log:                Error Counters
[root@lunar sundiag_2013_11_21_13_33]# 
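
The grep above shows that some disk has 51 media errors, but not which one. The sketch below pairs each non-zero counter with the slot it belongs to. It assumes the usual MegaCli -PdList layout, in which every drive's block contains a "Slot Number: N" line before its "Media Error Count", "Other Error Count" and "Predictive Failure Count" lines; the function name errors_by_slot is made up for this example.

```shell
#!/bin/bash
# Report non-zero error counters together with the slot number of the drive.
# Assumption: within each drive block, "Slot Number: N" precedes the counters.
errors_by_slot() {
    awk -F': *' '
        /^Slot Number/ { slot = $2 + 0 }     # remember the current drive slot
        /^(Media Error Count|Other Error Count|Predictive Failure Count)/ && $2 + 0 > 0 {
            print "slot " slot ": " $1 " = " ($2 + 0)
        }' "$@"
}

# Usage: errors_by_slot *megacli64-PdList_long*.out
```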


[root@lunar sundiag_2013_11_21_13_33]# grep "I/O error" * 
alert.log:Redo log write error 201 (Generic I/O error) on griddisk DATA_DM01_CD_00_dm01cel05: use of Flash Log for this device has been disabled
alert.log:Redo log write error 201 (Generic I/O error) on griddisk DATA_DM01_CD_00_dm01cel05: use of Flash Log for this device has been disabled
dm01cel05_dmesg_2013_11_21_13_33.out:end_request: I/O error, dev sdi, sector 15595520
dm01cel05_dmesg_2013_11_21_13_33.out:end_request: I/O error, dev sdi, sector 15595520
dm01cel05_dmesg_2013_11_21_13_33.out:end_request: I/O error, dev sdi, sector 15595520
dm01cel05_dmesg_2013_11_21_13_33.out:end_request: I/O error, dev sdi, sector 349304832
messages:Jun 29 21:44:18 dm01cel05 kernel: end_request: I/O error, dev sda, sector 66141200
messages:Jun 29 21:44:18 dm01cel05 kernel: end_request: I/O error, dev sda, sector 66139152
messages:Jun 29 21:44:18 dm01cel05 kernel: end_request: I/O error, dev sda, sector 66137104
messages:Jun 29 21:44:19 dm01cel05 kernel: end_request: I/O error, dev sda, sector 66143248
messages:Jun 29 21:44:19 dm01cel05 kernel: end_request: I/O error, dev sda, sector 1130092206
messages:Jun 29 21:44:19 dm01cel05 kernel: end_request: I/O error, dev sda, sector 1130092206
messages:Jun 29 21:44:19 dm01cel05 kernel: end_request: I/O error, dev sda, sector 1338480
messages:Jun 29 21:44:19 dm01cel05 kernel: end_request: I/O error, dev sda, sector 134319216
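
To see at a glance which block devices are affected, the kernel "end_request" lines can be counted per device. This is only a sketch; the function name io_errors_by_device is illustrative. In the ms-odl.trc lines shown earlier, LUN 0_8 is reported with lunOSName /dev/sdi and slotNumber 8, so the sdi errors belong to the disk in slot 8. For other devices the mapping should be checked in the lun-detail output rather than guessed.

```shell
#!/bin/bash
# Count kernel I/O errors per block device.
# Only "end_request: I/O error, dev <name>, sector <n>" lines are considered.
io_errors_by_device() {
    grep -h "end_request: I/O error" "$@" |
    sed -n 's|.*I/O error, dev \([^,]*\),.*|\1|p' |
    sort | uniq -c | awk '{ print $2, $1 }'
}

# Usage: io_errors_by_device messages *_dmesg_*.out
```

Each output line is a device name followed by the number of error lines found for it.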

[root@lunar sundiag_2013_11_21_13_33]# grep "Unconfigured" * 
dm01cel05_megacli64-AdpAllInfo_2013_11_21_13_33.out:PR Correct Unconfigured Areas           : Yes
dm01cel05_megacli64-FwTermLog_2013_11_21_13_33.out:T5:  prCorrectUnconfiguredAreas=1, useFdeOnly=1 
dm01cel05_megacli64-FwTermLog_2013_11_21_13_33.out:T5:  enableSpinDownUnconfigured=1,   disableSpinDownHS=0,    spinDownTime=1e,        autoEnhancedImport=0, enableSecretKeyControl=0, disableOnlineCtrlReset=0 
dm01cel05_megacli64-FwTermLog_2013_11_21_13_33.out:09/29/13 23:40:27:   prCorrectUnconfiguredAreas=1, useFdeOnly=1 
dm01cel05_megacli64-FwTermLog_2013_11_21_13_33.out:09/29/13 23:40:27:   enableSpinDownUnconfigured=1,   disableSpinDownHS=0,    spinDownTime=1e,   autoEnhancedImport=0, enableSecretKeyControl=0, disableOnlineCtrlReset=0 
dm01cel05_megacli64-FwTermLog_2013_11_21_13_33.out:T5:  prCorrectUnconfiguredAreas=1, useFdeOnly=1 
dm01cel05_megacli64-FwTermLog_2013_11_21_13_33.out:T5:  enableSpinDownUnconfigured=1,   disableSpinDownHS=0,    spinDownTime=1e,        autoEnhancedImport=0, enableSecretKeyControl=0, disableOnlineCtrlReset=0 
dm01cel05_megacli64-FwTermLog_2013_11_21_13_33.out:10/11/13 22:58:47:   prCorrectUnconfiguredAreas=1, useFdeOnly=1 
dm01cel05_megacli64-FwTermLog_2013_11_21_13_33.out:10/11/13 22:58:47:   enableSpinDownUnconfigured=1,   disableSpinDownHS=0,    spinDownTime=1e,   autoEnhancedImport=0, enableSecretKeyControl=0, disableOnlineCtrlReset=0 
dm01cel05_megacli64-FwTermLog_2013_11_21_13_33.out:T5:  prCorrectUnconfiguredAreas=1, useFdeOnly=1 
dm01cel05_megacli64-FwTermLog_2013_11_21_13_33.out:T5:  enableSpinDownUnconfigured=1,   disableSpinDownHS=0,    spinDownTime=1e,        autoEnhancedImport=0, enableSecretKeyControl=0, disableOnlineCtrlReset=0 
MegaSAS.log:PR Correct Unconfigured Areas           : Yes
MegaSAS.log:T5:         prCorrectUnconfiguredAreas=1, useFdeOnly=1 
MegaSAS.log:T5:         enableSpinDownUnconfigured=1,   disableSpinDownHS=0,    spinDownTime=1e,        autoEnhancedImport=0, enableSecretKeyControl=0, disableOnlineCtrlReset=0 
MegaSAS.log:09/29/13 23:40:27:  prCorrectUnconfiguredAreas=1, useFdeOnly=1 
MegaSAS.log:09/29/13 23:40:27:  enableSpinDownUnconfigured=1,   disableSpinDownHS=0,    spinDownTime=1e,        autoEnhancedImport=0, enableSecretKeyControl=0, disableOnlineCtrlReset=0 
MegaSAS.log:T5:         prCorrectUnconfiguredAreas=1, useFdeOnly=1 
MegaSAS.log:T5:         enableSpinDownUnconfigured=1,   disableSpinDownHS=0,    spinDownTime=1e,        autoEnhancedImport=0, enableSecretKeyControl=0, disableOnlineCtrlReset=0 
MegaSAS.log:10/11/13 22:58:47:  prCorrectUnconfiguredAreas=1, useFdeOnly=1 
MegaSAS.log:10/11/13 22:58:47:  enableSpinDownUnconfigured=1,   disableSpinDownHS=0,    spinDownTime=1e,        autoEnhancedImport=0, enableSecretKeyControl=0, disableOnlineCtrlReset=0 
MegaSAS.log:T5:         prCorrectUnconfiguredAreas=1, useFdeOnly=1 
MegaSAS.log:T5:         enableSpinDownUnconfigured=1,   disableSpinDownHS=0,    spinDownTime=1e,        autoEnhancedImport=0, enableSecretKeyControl=0, disableOnlineCtrlReset=0 
MegaSAS.log:PR Correct Unconfigured Areas           : Yes
MegaSAS.log:PR Correct Unconfigured Areas           : Yes
MegaSAS.log:PR Correct Unconfigured Areas           : Yes
[root@lunar sundiag_2013_11_21_13_33]# 

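Use cellcli to view disk error information (the two list commands below are entered at the CellCLI prompt, or passed to cellcli -e):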

cellcli -e list physicaldisk detail|grep err

list physicaldisk where disktype=harddisk and status like ".*predictive failure.*" 
list physicaldisk where disktype=harddisk and status=critical detail

Check whether the ASM alert log contains warnings similar to the following:
1. WARNING: failed to read mirror
2. IO Error
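
The two messages above can be searched for in one pass. The following is a rough sketch; the function name scan_asm_alert and the example path are hypothetical, and the location of the ASM alert log depends on the installation.

```shell
#!/bin/bash
# Search ASM alert logs for the two warning patterns listed above.
# -n prints line numbers, -i ignores case, -E enables alternation.
scan_asm_alert() {
    grep -n -i -E "WARNING: failed to read mirror|IO Error" "$@"
}

# Usage (hypothetical path): scan_asm_alert /path/to/alert_+ASM1.log
```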
