联系:QQ(5163721)
标题:修复由于修改主机名造成Standbalone异常(ORA-29701 raised in ASM I/O path)
作者:Lunar©版权所有[文章允许转载,但必须以链接方式注明源地址,否则追究法律责任.]
前不久,我同事修改standbalone的主机名,只是单纯的从主机层面修改了,has等并没有做调整,虽然数据库奇迹般的open了(有些时候,数据库在此情况下是不能open的,直接就会报错),但是数据库更改主机名以后,ASM和数据库的alert都有报错,而且分析后,觉得这个报错不解决,会在后期使用中造成数据库crash。
修复这个问题的程很简单,有几点说明:
1,8i和8i以前叫做OPS,Oracle Parallel Server,9.0.1~11.1之间叫做RAC(Real Application Cluster),11.2以后叫做GI(Grid Infrastructure)
2,Standbalone结构就是单实例数据库使用ASM的场景,他只需要db和asm通信的cssd等结构,不需要crs。
3,启动,关闭和管理has流程类似rac,但是千万别混淆,Standbalone的叫做crsctl start/stop has,RAC的叫做crsctl start/stop crs(整套架构中所有进程由ohasd创建),如果弄错了,会出现一些异常情况,在其他案例中,我们遇到过直接ASM报错的,后续在总结了那个案例分享出来。
4,不管是Standbalone还是RAC,修改主机名和IP等都需要特别小心,因为他们不像单机数据库一样,单纯的主机层面就该就可以,他们需要分别在has和crs中进行重配或者修改。
5,无论发生了什么,只要没有动ASM和DB,那么不用担心丢数据。因为has或者crs的结构都跟db是独立的,一般不会丢数据,最差的结果,重装一下,也可以把asm和db拉起来。
6,很多时候,在做类似该ip或者主机名,或者升级crs或者has的版本,或者升级数据库软件时,都建议对现有环境进行备份,要么借助NBU之类的工具,要么使用tar命令。千万别用ftp或者直接copy,这类的否不靠谱,不能用于文件恢复。
具体过程如下:
ASM ALERT:
Tue Mar 03 14:58:34 2015 Warning: VKTM detected a time drift. Time drifts can result in an unexpected behavior such as time -outs. Please check trace file for more details. Wed Mar 18 17:01:23 2015 NOTE: client lunarogg:lunarogg deregistered Wed Mar 18 17:07:18 2015 NOTE: client lunarogg:lunarogg registered, osid 16085, mbr 0x1 ~ |
数据库的日志,DB ALERT:
Fri Mar 20 11:29:44 2015 Archived Log entry 650 added for thread 1 sequence 107161 ID 0x9911db3 dest 1: Fri Mar 20 11:30:15 2015 ERROR: unrecoverable error ORA-29701 raised in ASM I /O path; terminating process 25695 Fri Mar 20 11:32:05 2015 ERROR: unrecoverable error ORA-29701 raised in ASM I /O path; terminating process 25711 Fri Mar 20 11:32:54 2015 Thread 1 cannot allocate new log, sequence 107163 Private strand flush not complete Current log # 3 seq# 107162 mem# 0: /u01/oradata/redo/lunarogg/onlinelog/redo03.log Fri Mar 20 11:32:55 2015 ERROR: unrecoverable error ORA-29701 raised in ASM I /O path; terminating process 25766 Thread 1 advanced to log sequence 107163 (LGWR switch) Current log # 4 seq# 107163 mem# 0: /u01/oradata/redo/lunarogg/onlinelog/redo04.log Fri Mar 20 11:32:59 2015 Archived Log entry 651 added for thread 1 sequence 107162 ID 0x9911db3 dest 1: |
CRS和ohasd也有明显报错:
2015-03-03 14:32:22.832: [UiServer][1135483200]{0:0:2} Communication exception sending reply back to client.FatalCommsException : Failed to send response to client. 2015-03-03 14:32:22.833: [UiServer][1135483200]{0:0:2} Communication exception sending reply back to client.FatalCommsException : Failed to send response to client. 2015-03-03 14:32:22.836: [UiServer][1135483200]{0:0:54} Communication exception sending reply back to client.FatalCommsException : Failed to send response to client. 2015-03-03 14:32:22.836: [UiServer][1135483200]{0:0:54} Communication exception sending reply back to client.FatalCommsException : Failed to send response to client. 2015-03-03 14:32:53.108: [UiServer][1135483200]{0:0:78} Communication exception sending reply back to client.FatalCommsException : Failed to send response to client. 2015-03-03 14:32:53.109: [UiServer][1135483200]{0:0:78} Communication exception sending reply back to client.FatalCommsException : Failed to send response to client. 2015-03-03 14:32:53.118: [UiServer][1135483200]{0:0:79} Communication exception sending reply back to client.FatalCommsException : Failed to send response to client. 2015-03-03 14:32:53.119: [UiServer][1135483200]{0:0:79} Communication exception sending reply back to client.FatalCommsException : Failed to send response to client. 2015-03-03 14:33:22.609: [ CRSSEC][1133381952]{0:0:95} Exception: setGroupIdentity failed while checking groupname: asmadmin 2015-03-03 14:33:22.615: [ CRSPE][1133381952]{0:0:95} Exception in setGroupIdentity (ignoring groupIdentity): Authorization Exception:The input identity user id does not belong to the group is invalid 2015-03-03 14:33:53.733: [UiServer][1135483200]{0:0:91} Communication exception sending reply back to client.FatalCommsException : Failed to send response to client. 2015-03-03 14:33:53.738: [UiServer][1135483200]{0:0:94} Communication exception sending reply back to client.FatalCommsException : Failed to send response to client. 2015-03-03 14:33:53.739: [UiServer][1135483200]{0:0:95} Communication exception sending reply back to client.FatalCommsException : Failed to send response to client. 2015-03-03 14:33:53.740: [UiServer][1135483200]{0:0:95} Communication exception sending reply back to client.FatalCommsException : Failed to send response to client. 2015-03-18 17:01:17.748: [ CRSSEC][1133381952]{0:0:8811} Exception: setGroupIdentity failed while checking groupname: asmadmin 2015-03-18 17:01:17.748: [ CRSPE][1133381952]{0:0:8811} Exception in setGroupIdentity (ignoring groupIdentity): Authorization Exception:The input identity user id does not belong to the group is invalid 2015-03-18 17:01:59.271: [ CRSSEC][1133381952]{0:0:8812} Exception: setGroupIdentity failed while checking groupname: asmadmin 2015-03-18 17:01:59.271: [ CRSPE][1133381952]{0:0:8812} Exception in setGroupIdentity (ignoring groupIdentity): Authorization Exception:The input identity user id does not belong to the group is invalid 2015-03-18 17:01:59.277: [UiServer][1135483200]{0:0:8812} Communication exception sending reply back to client.FatalCommsException : Failed to send response to client. 2015-03-18 17:01:59.295: [UiServer][1135483200]{0:0:8812} Communication exception sending reply back to client.FatalCommsException : Failed to send response to client. 2015-03-18 17:01:59.296: [UiServer][1135483200]{0:0:8812} Communication exception sending reply back to client.FatalCommsException : Failed to send response to client. 2015-03-18 17:34:14.510: [UiServer][1135483200]{0:0:8834} Communication exception sending reply back to client.FatalCommsException : Failed to send response to client. 2015-03-18 17:34:14.510: [UiServer][1135483200]{0:0:8834} Communication exception sending reply back to client.FatalCommsException : Failed to send response to client. |
我猜测,此时如果做类似增加/删除文件数据库都会crash,果然,我让同事做rman备份,数据库就crash了。
处理的方法很简单,就是修改has的相关配置,讲新主机名配置进去。具体步骤如下:
首先记录当前的系统关键信息,因为修改完has后,需要将asm,db,diskgroup等关键信息重新注册到has中:
[grid@lunardb ~]$ srvctl config asm ASM home: /u01/app/grid/product/11 .2.0 /grid ASM listener: LISTENER Spfile: +DATA /asm/asmparameterfile/registry .253.847035651 ASM diskgroup discovery string: /dev/mapper/ * [grid@lunardb ~]$ [oracle@lunardb ~]$ srvctl config database -d lunarogg Database unique name: lunarogg Database name: lunarogg Oracle home: /u01/app/oracle/product/11 .2.0 /db_1 Oracle user: oracle Spfile: +DATA /lunarogg/spfilelunarogg .ora Domain: Start options: open Stop options: immediate Database role: PRIMARY Management policy: AUTOMATIC Database instance: lunarogg Disk Groups: DATA,DATA2 Services: [oracle@lunardb ~]$ |
这里还发现了一个问题,这个库的ASM的在has中的信息和实际使用的spfile不一致
[grid@lunardb ~]$ asmcmd spbackup +DATA /asm/asmparameterfile/registry .253.847035651 /tmp/ASMspfile .backup ASMCMD-8303: invalid SPFILE '+DATA/asm/asmparameterfile/registry.253.847035651' ORA-15056: additional error message ORA-15173: entry 'asm' does not exist in directory '/' ORA-06512: at "SYS.X$DBMS_DISKGROUP" , line 322 ORA-06512: at line 3 (DBD ERROR: OCIStmtExecute) [grid@lunardb ~]$ 检查ASM实例,确实使用的是pfile启动的,这个从alert中也可以得到证实: [grid@lunardb ~]$ ss SQL*Plus: Release 11.2.0.4.0 Production on Mon Mar 23 13:56:01 2015 Copyright (c) 1982, 2013, Oracle. All rights reserved. Connected to: Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production With the Automatic Storage Management option SQL> show parameter pfile NAME TYPE VALUE ------------------------------------ ----------- ------------------------------ spfile string SQL> |
在所有的磁盘组中,我们发现实际上是找不到在has中注册的ASM实例使用的spfile的:
[grid@lunardb ~]$ asmcmd find -- type ASMPARAMETERFILE +DATA2 "*" [grid@lunardb ~]$ asmcmd find -- type ASMPARAMETERFILE +DATA "*" |
使用kfod检查会发现,大量磁盘的报错:
[grid@lunardb ~]$ kfod asm_diskstring= '/dev/mapper/*' disks=all -------------------------------------------------------------------------------- Disk Size Path User Group ================================================================================ 1: 1429504 Mb /dev/mapper/D_R5_01_LUN01 grid oinstall 2: 1429504 Mb /dev/mapper/D_R5_01_LUN02 grid oinstall 3: 1425408 Mb /dev/mapper/D_R5_02_LUN01 grid oinstall 4: 1425408 Mb /dev/mapper/D_R5_02_LUN02 grid oinstall 5: 1425408 Mb /dev/mapper/D_R5_03_LUN01 grid oinstall 6: 1425408 Mb /dev/mapper/D_R5_03_LUN02 grid oinstall 7: 1425408 Mb /dev/mapper/D_R5_04_LUN01 grid oinstall 8: 1425408 Mb /dev/mapper/D_R5_04_LUN02 grid oinstall KFOD-00301: Unable to contact Cluster Synchronization Services (CSS). Return code 2 from kgxgncin. KFOD-00311: Error scanning device /dev/mapper/control ORA-15025: could not open disk "/dev/mapper/control" Linux-x86_64 Error: 13: Permission denied Additional information: 42 KFOD-00311: Error scanning device /dev/mapper/VolGroup00-LogVol00 ORA-15025: could not open disk "/dev/mapper/VolGroup00-LogVol00" Linux-x86_64 Error: 13: Permission denied Additional information: 42 KFOD-00311: Error scanning device /dev/mapper/VolGroup00-LogVol03 ORA-15025: could not open disk "/dev/mapper/VolGroup00-LogVol03" Linux-x86_64 Error: 13: Permission denied Additional information: 42 KFOD-00311: Error scanning device /dev/mapper/VolGroup00-LogVol04 ORA-15025: could not open disk "/dev/mapper/VolGroup00-LogVol04" Linux-x86_64 Error: 13: Permission denied Additional information: 42 KFOD-00311: Error scanning device /dev/mapper/VolGroup00-LogVol02 ORA-15025: could not open disk "/dev/mapper/VolGroup00-LogVol02" Linux-x86_64 Error: 13: Permission denied Additional information: 42 KFOD-00311: Error scanning device /dev/mapper/VolGroup00-LogVol01 ORA-15025: could not open disk "/dev/mapper/VolGroup00-LogVol01" Linux-x86_64 Error: 13: Permission denied Additional information: 42 KFOD-00311: Error scanning device /dev/mapper/Vol_oradata01-lv_oradata1 ORA-15025: could not open disk "/dev/mapper/Vol_oradata01-lv_oradata1" Linux-x86_64 Error: 13: Permission denied Additional information: 42 KFOD-00311: Error scanning device /dev/mapper/lunarogg01lun1 ORA-15025: could not open disk "/dev/mapper/lunarogg01lun1" Linux-x86_64 Error: 13: Permission denied Additional information: 42 Additional information: 22621383 Additional information: 1598903119 [grid@lunardb ~]$ |
接下来,我就开始重建has了,首先是使用force都无法停止has,因为已经异常了,这个在意料之中的,没关系:
[root@lunardb ~] # crsctl stop has -f CLSU-00100: Operating System function : opendir failed with error data: 2 CLSU-00101: Operating System error message: No such file or directory CLSU-00103: error location: scrsearch1 CLSU-00104: additional error information: cant open scr home dir scls_scr_getval CRS-4544: Unable to connect to OHAS CRS-4000: Command Stop failed, or completed with errors. [root@lunardb ~] # [root@lunardb ~] # crsctl stop crs CRS-4013: This command is not supported in a single-node configuration. CRS-4000: Command Stop failed, or completed with errors. [root@lunardb ~] # ps -ef|grep d.bin grid 9107 1 0 Mar03 ? 00:01:43 /u01/app/grid/product/11 .2.0 /grid/bin/evmd .bin grid 9140 9107 0 Mar03 ? 00:00:00 /u01/app/grid/product/11 .2.0 /grid/bin/evmlogger .bin -o /u01/app/grid/product/11 .2.0 /grid/evm/log/evmlogger .info -l /u01/app/grid/product/11 .2.0 /grid/evm/log/evmlogger .log grid 9149 1 0 Mar03 ? 00:15:09 /u01/app/grid/product/11 .2.0 /grid/bin/cssdagent grid 9172 1 0 Mar03 ? 00:01:50 /u01/app/grid/product/11 .2.0 /grid/bin/ocssd .bin grid 16751 1 0 Mar18 ? 00:00:17 /u01/app/grid/product/11 .2.0 /grid/bin/tnslsnr LISTENER -inherit root 31046 30878 0 14:17 pts /8 00:00:00 grep d.bin [root@lunardb ~] # |
这些进程已经停不掉了,于是只能重启主机(kill ocssd.bin 一样会导致主机重启,因此直接手工reboot了)。
起来后,先删除现有的has配置:
[root@lunardb ~] # $ORACLE_HOME/crs/install/roothas.pl -deconfig -force Using configuration parameter file : /u01/app/grid/product/11 .2.0 /grid/crs/install/crsconfig_params CRS-4639: Could not contact Oracle High Availability Services CRS-4000: Command Stop failed, or completed with errors. CRS-4639: Could not contact Oracle High Availability Services CRS-4000: Command Delete failed, or completed with errors. CLSU-00100: Operating System function : opendir failed with error data: 2 CLSU-00101: Operating System error message: No such file or directory CLSU-00103: error location: scrsearch1 CLSU-00104: additional error information: cant open scr home dir scls_scr_getval CRS-4544: Unable to connect to OHAS CRS-4000: Command Stop failed, or completed with errors. Successfully deconfigured Oracle Restart stack [root@lunardb ~] # |
然后执行重新配置:
[root@lunardb ~] # $ORACLE_HOME/crs/install/roothas.pl Using configuration parameter file : /u01/app/grid/product/11 .2.0 /grid/crs/install/crsconfig_params LOCAL ADD MODE Creating OCR keys for user 'grid' , privgrp 'oinstall' .. Operation successful. LOCAL ONLY MODE Successfully accumulated necessary OCR keys. Creating OCR keys for user 'root' , privgrp 'root' .. Operation successful. CRS-4664: Node lunardb successfully pinned. Adding Clusterware entries to inittab lunardb 2015 /03/23 14:29:29 /u01/app/grid/product/11 .2.0 /grid/cdata/lunardb/backup_20150323_142929 .olr Successfully configured Oracle Grid Infrastructure for a Standalone Server You have new mail in /var/spool/mail/root [root@lunardb ~] # |
这过程很快,比GI快多了,HAS还是结构简单啊,O(∩_∩)O哈哈~。
配置完成后,检查一下,服务都正常,只是需要将下面的服务改为自动启动:
[grid@lunardb ~]$ crsctl stat res -t -------------------------------------------------------------------------------- NAME TARGET STATE SERVER STATE_DETAILS -------------------------------------------------------------------------------- Local Resources -------------------------------------------------------------------------------- ora.ons OFFLINE OFFLINE lunardb -------------------------------------------------------------------------------- Cluster Resources -------------------------------------------------------------------------------- ora.cssd 1 OFFLINE OFFLINE ora.diskmon 1 OFFLINE OFFLINE ora.evmd 1 ONLINE ONLINE lunardb [grid@lunardb ~]$ [grid@lunardb ~]$ crsctl modify resource "ora.cssd" -attr "AUTO_START=1" [grid@lunardb ~]$ crsctl modify resource "ora.evmd" -attr "AUTO_START=1" [grid@lunardb ~]$ crsctl modify resource "ora.ons" -attr "AUTO_START=1" |
使用NETCA重建监听:
[grid@lunardb ~]$ crsctl stat res -t -------------------------------------------------------------------------------- NAME TARGET STATE SERVER STATE_DETAILS -------------------------------------------------------------------------------- Local Resources -------------------------------------------------------------------------------- ora.LISTENER.lsnr ONLINE ONLINE lunardb ora.ons OFFLINE OFFLINE lunardb -------------------------------------------------------------------------------- Cluster Resources -------------------------------------------------------------------------------- ora.cssd 1 OFFLINE OFFLINE ora.diskmon 1 OFFLINE OFFLINE ora.evmd 1 ONLINE ONLINE lunardb [grid@lunardb ~]$ |
添加asm:
[grid@lunardb ~]$ srvctl add asm [grid@lunardb ~]$ crsctl modify resource "ora.asm" -attr "AUTO_START=1" [grid@lunardb ~]$ srvctl config asm ASM home: /u01/app/grid/product/11 .2.0 /grid ASM listener: LISTENER Spfile: ASM diskgroup discovery string: ++no-value-at-resource-creation--never-updated-through-ASM++ [grid@lunardb ~]$ 修改asm和listener的配置: [grid@lunardb dbs]$ srvctl modify asm -l LISTENER -p "/u01/app/grid/product/11.2.0/grid/dbs/init+ASM.ora" -d "/dev/mapper/*" [grid@lunardb dbs]$ crsctl modify resource "ora.asm" -attr "AUTO_START=1" [grid@lunardb dbs]$ srvctl config asm ASM home: /u01/app/grid/product/11 .2.0 /grid ASM listener: LISTENER Spfile: /u01/app/grid/product/11 .2.0 /grid/dbs/init +ASM.ora ASM diskgroup discovery string: /dev/mapper/ * [grid@lunardb dbs]$ |
在has中添加db:
[oracle@lunardb ~]$ srvctl add database -d lunarogg -n lunardb -o /u01/app/oracle/product/11 .2.0 /db_1 -p +DATA /lunarogg/spfilelunarogg .ora [oracle@lunardb ~]$ srvctl modify database -d lunarogg -a "DATA,DATA2" [oracle@lunardb ~]$ |
现在再次检查has的配置,都已经好了:
[grid@lunardb dbs]$ crsctl status res -t -------------------------------------------------------------------------------- NAME TARGET STATE SERVER STATE_DETAILS -------------------------------------------------------------------------------- Local Resources -------------------------------------------------------------------------------- ora.DATA.dg ONLINE ONLINE lunardb ora.DATA2.dg ONLINE ONLINE lunardb ora.LISTENER.lsnr ONLINE ONLINE lunardb ora.asm ONLINE ONLINE lunardb Started ora.ons OFFLINE OFFLINE lunardb -------------------------------------------------------------------------------- Cluster Resources -------------------------------------------------------------------------------- ora.cssd 1 ONLINE ONLINE lunardb ora.diskmon 1 OFFLINE OFFLINE ora.evmd 1 ONLINE ONLINE lunardb ora.lunarogg.db 1 OFFLINE OFFLINE [grid@lunardb dbs]$ |
在使用kfod检查asmdisk已经不抱错了:
[grid@lunardb dbs]$ kfod asm_diskstring= '/dev/mapper/*' disks=all -------------------------------------------------------------------------------- Disk Size Path User Group ================================================================================ 1: 1429504 Mb /dev/mapper/D_R5_01_LUN01 grid oinstall 2: 1429504 Mb /dev/mapper/D_R5_01_LUN02 grid oinstall 3: 1425408 Mb /dev/mapper/D_R5_02_LUN01 grid oinstall 4: 1425408 Mb /dev/mapper/D_R5_02_LUN02 grid oinstall 5: 1425408 Mb /dev/mapper/D_R5_03_LUN01 grid oinstall 6: 1425408 Mb /dev/mapper/D_R5_03_LUN02 grid oinstall 7: 1425408 Mb /dev/mapper/D_R5_04_LUN01 grid oinstall 8: 1425408 Mb /dev/mapper/D_R5_04_LUN02 grid oinstall -------------------------------------------------------------------------------- ORACLE_SID ORACLE_HOME ================================================================================ +ASM /u01/app/grid/product/11 .2.0 /grid [grid@lunardb dbs]$ |
使用has重新启动,把所有资源和db都带起来:
[grid@lunardb dbs]$ crsctl status res -t -------------------------------------------------------------------------------- NAME TARGET STATE SERVER STATE_DETAILS -------------------------------------------------------------------------------- Local Resources -------------------------------------------------------------------------------- ora.DATA.dg ONLINE ONLINE lunardb ora.DATA2.dg ONLINE ONLINE lunardb ora.LISTENER.lsnr ONLINE ONLINE lunardb ora.asm ONLINE ONLINE lunardb Started ora.ons OFFLINE OFFLINE lunardb -------------------------------------------------------------------------------- Cluster Resources -------------------------------------------------------------------------------- ora.cssd 1 ONLINE ONLINE lunardb ora.diskmon 1 OFFLINE OFFLINE ora.evmd 1 ONLINE ONLINE lunardb ora.lunarogg.db 1 ONLINE ONLINE lunardb Open [grid@lunardb dbs]$ |
好了,都ok了。