/etc/init.d/cell.d and celladmin on Exadata storage nodes

Contact: QQ (5163721)

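Author: Lunar © All rights reserved. [This article may be reposted, provided the source address is credited with a link; otherwise legal action may be taken.]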

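On a cell node, Oracle runs /etc/init.d/cell.d at system startup. The script executes as the celladmin operating system user and runs "alter cell startup services rs" to start the Restart Server (RS).
After sleeping for one second, the script runs "alter cell startup services all" to start all the other processes, including CELLSRV and MS.
The /etc/init.d/cell.d script also contains a check that detects failures or an incorrect configuration. If it finds one, Oracle tries to restart from the last known good state.
The relevant part of /etc/init.d/cell.d is as follows: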
#this is just to make sure that if MS was running then it is down (with/without RS)
su celladmin -c ". /etc/profile.d/cell_env.sh; cellcli -e \"alter cell startup services rs\"; sleep 1; cellcli -e \"alter cell shutdown services ms\""
perl $OSS_SCRIPTS_HOME/msCheck.pl stop_ms

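This script matters. During one Exadata image upgrade, a cell failed to upgrade. Following the error message and reading the upgrade script patchmgr, we found that patchmgr actually calls dostep.sh, and the part of dostep.sh that reported the error is: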


check_cell_services ()
{
  local -i ret_code=0
  local -i running_services=0
 
  running_services=`service celld status | grep -i running | wc -l`
  
  if [ $running_services -lt 3 ] && [ "X$g_rolling" == "Xnon_rolling" ]; then
    service celld restart > /dev/null 2>/dev/null
    running_services=`service celld status | grep -i running | wc -l`
  fi
  if [ $running_services -lt 3 ]; then
    echo "[ERROR] Can not continue. All 3 cell services do not seem to be up or able to come up."
    ret_code=1
  fi
  return $ret_code
}
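
To make the check above easier to follow, here is a minimal sketch of the same counting logic as standalone functions that read status text from standard input. The function names count_running and services_ok are hypothetical and do not come from dostep.sh; only the "count the lines containing running and require at least 3" rule is taken from the excerpt.

```shell
#!/bin/bash
# Hypothetical helpers that mirror the counting rule in check_cell_services.

# count_running: read "service celld status" style text on stdin and print
# how many lines mention "running" (case-insensitive), like
# `grep -i running | wc -l` in the excerpt.
count_running() {
    # grep -c prints 0 and exits non-zero when nothing matches, so the exit
    # status is neutralised to keep the helper safe under `set -e`.
    grep -ci running || true
}

# services_ok: succeed only when at least 3 lines on stdin report "running".
services_ok() {
    local running
    running=$(count_running)
    [ "$running" -ge 3 ]
}
```

Piping the real status output into it, for example "service celld status | services_ok", would succeed only when all three of RS, MS and CELLSRV report running.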

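In other words, the upgrade cannot continue because the cell's three core services cannot come up.
So we tried to start them manually with "service celld restart" and checked the state with "service celld status", as shown below.
The manual start also failed with the following error: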

[root@dm01cel02 ~]# service celld start

CELL-01526: Local hostname mapping is inconsistent.  Verify cell /etc/hosts file content.
[root@dm01cel02 ~]# 

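On Exadata, /etc/hosts is normally populated at install time by the onecommand scripts with the IP addresses of the internal interconnect, the switches and so on; other addresses are resolved through DNS. Since the installation had succeeded and nobody had changed the file afterwards, /etc/hosts should have been fine.
In fact, after checking it, /etc/hosts showed nothing abnormal.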

A restart attempt reported a permission error:

[root@dm01cel02 ~]# service celld restart

Stopping the RS, CELLSRV, and MS services...
CELL-01509: Restart Server (RS) not responding.
Starting the RS, CELLSRV, and MS services...
CELL-01512: Cannot start a new Restart Server (RS).  Exception received: Permission denied
[root@dm01cel02 ~]# 
[root@dm01cel02 ~]# 

Checking the status showed that none of the three key services was up:

[root@dm01cel02 ~]# service celld status
         rsStatus:               stopped
         msStatus:               unknown
         cellsrvStatus:          unknown
[root@dm01cel02 ~]# 

For comparison, on a node that had upgraded successfully, the celladmin user looked like this:

[celladmin@dm01cel05 ~]$ id celladmin
uid=1000(celladmin) gid=500(celladmin) groups=500(celladmin),502(cellusers) context=root:system_r:unconfined_t:s0-s0:c0.c1023
[celladmin@dm01cel05 ~]$ 

On the failed cell node, the celladmin user looked like this:

[celladmin@dm01cel02 ~]$ id
uid=1000(celladmin) gid=500(celladmin) groups=500(celladmin),502(cellusers)
[celladmin@dm01cel02 ~]$ id celladmin
uid=1000(celladmin) gid=500(celladmin) groups=500(celladmin),502(cellusers)
[celladmin@dm01cel02 ~]$ 
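
The two id outputs can also be compared field by field instead of by eye. The sketch below is only an illustration: the function name only_in_first is hypothetical, and it simply prints the space-separated fields that appear in the first string but not in the second. Applied to the outputs above, the only field it reports is the context= entry from the healthy node; whether that difference is related to the failure was not established.

```shell
#!/bin/bash
# only_in_first (hypothetical helper): print every space-separated field of
# "$1" that does not appear as a whole field in "$2", one per line.
only_in_first() {
    local field
    # $1 is deliberately unquoted so that it splits on whitespace.
    for field in $1; do
        case " $2 " in
            *" $field "*) ;;                 # field exists in both strings
            *) printf '%s\n' "$field" ;;     # field is missing from "$2"
        esac
    done
}
```

A typical call would pass the output of "id celladmin" captured on each node as the two arguments.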

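A search online turned up nothing useful, and an attempt to adjust the permissions did not fix the problem. With a limited upgrade window, we traced the start with strace:
strace -f -o /tmp/service_celld_start.log service celld start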
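
A trace like this is usually large, so one way to narrow it down is to filter for system calls that failed with a permission error. The following is a minimal sketch, not the procedure used at the time: the function name find_denied is hypothetical, and it relies only on the standard strace line format, in which a failed call ends with "= -1 EACCES (Permission denied)" or "= -1 EPERM (Operation not permitted)".

```shell
#!/bin/bash
# find_denied (hypothetical helper): print the lines of an strace log whose
# system call failed with EACCES or EPERM.
# $1: path to a log file written with `strace -o`.
find_denied() {
    grep -E '= -1 (EACCES|EPERM) ' "$1"
}
```

For the log written above, the call would be "find_denied /tmp/service_celld_start.log"; each line it prints names the system call and the path or object that was refused.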

The trace showed that "service celld start" actually calls the following part of /etc/init.d/cell.d:

start()
{
    dynamic_deploy
    su celladmin -c ". /etc/profile.d/cell_env.sh; cellcli -e \"alter cell startup services all\""
}

Similarly, "service celld stop" calls:

stop()
{
    su celladmin -c ". /etc/profile.d/cell_env.sh; cellcli -e \"alter cell shutdown services all\""
}

"service celld restart" calls:

restart()
{
    dynamic_deploy
    su celladmin -c ". /etc/profile.d/cell_env.sh; cellcli -e \"alter cell restart services all\"" 
}

"service celld status" calls:

status()
{
    su celladmin -c ". /etc/profile.d/cell_env.sh; cellcli -e \"list cell attributes rsStatus, msStatus, cellsrvStatus detail\""
}
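
The four functions above amount to a small mapping from a service action to a CellCLI statement. The sketch below summarises that mapping in one place; the function name celld_statement is hypothetical, while the statements themselves are copied from the script excerpts.

```shell
#!/bin/bash
# celld_statement (hypothetical helper): print the CellCLI statement that
# the init script runs for a given "service celld <action>".
celld_statement() {
    case "$1" in
        start)   echo "alter cell startup services all" ;;
        stop)    echo "alter cell shutdown services all" ;;
        restart) echo "alter cell restart services all" ;;
        status)  echo "list cell attributes rsStatus, msStatus, cellsrvStatus detail" ;;
        *)       echo "unknown action: $1" >&2; return 1 ;;
    esac
}
```

In the real script each statement is passed to cellcli -e inside "su celladmin -c", after sourcing /etc/profile.d/cell_env.sh, and start and restart call dynamic_deploy first.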

Clearly, the cell services are managed by the celladmin user. So we ran the command manually as celladmin to see why the "Exception received: Permission denied" error appears:

[root@dm01cel02 trace]# su - celladmin
[celladmin@dm01cel02 ~]$ sh -x cellcli -e alter cell restart services all
+ PS=/bin/ps
+ GREP=/bin/grep
+ SLEEP=/bin/sleep
+ ECHO=/bin/echo
+ MYPID=9280
+ MYPPID=9128
+ /bin/ps -fp 9128
+ /bin/grep @notty
+ /bin/grep ' sshd: '
+ '[' 1 -eq 0 ']'
+ SSHD_NOTTY_PARENT=0
+ [[ -z 1 ]]
+ [[ -z 1 ]]
+ [[ -z 1 ]]
+ JRE_HOME=/usr/java/jdk1.5.0_15/
+ JLINE=jline.ConsoleRunner
+ CMDECHO=0
+ AWK=/bin/awk
+ DEPENDENT_JARS_DIR=/opt/oracle/cell11.2.2.4.0_LINUX.X64_110929/cellsrv/deploy/dependencies
+ '[' '' -a '' '!=' '' ']'
+ for arg in '$*'
+ [[ -e == \-\n ]]
+ [[ -e == \-\m ]]
+ [[ -e == \-\e ]]
+ JLINE=
+ for arg in '$*'
+ [[ alter == \-\n ]]
+ [[ alter == \-\m ]]
+ [[ alter == \-\e ]]
+ for arg in '$*'
+ [[ cell == \-\n ]]
+ [[ cell == \-\m ]]
+ [[ cell == \-\e ]]
+ for arg in '$*'
+ [[ restart == \-\n ]]
+ [[ restart == \-\m ]]
+ [[ restart == \-\e ]]
+ for arg in '$*'
+ [[ services == \-\n ]]
+ [[ services == \-\m ]]
+ [[ services == \-\e ]]
+ for arg in '$*'
+ [[ all == \-\n ]]
+ [[ all == \-\m ]]
+ [[ all == \-\e ]]
+ pfile=/opt/oracle/cell11.2.2.4.0_LINUX.X64_110929/cellsrv/deploy/config/cellinit.ora
++ /bin/grep HTTP_PORT /opt/oracle/cell11.2.2.4.0_LINUX.X64_110929/cellsrv/deploy/config/cellinit.ora
++ /bin/awk '{print substr($0,11)}'
+ HTTP_PORT=8888
+ CELLCLI_JAVACMD='/usr/java/jdk1.5.0_15//bin/java -client -Dpid=9280 -classpath  /opt/oracle/cell11.2.2.4.0_LINUX.X64_110929/cellsrv/deploy/dependencies/jline-0.9.9.jar:/opt/oracle/cell11.2.2.4.0_LINUX.X64_110929/cellsrv/lib/ossmgmt-cli.jar:/opt/oracle/cell11.2.2.4.0_LINUX.X64_110929/cellsrv/deploy/dependencies/wsclient_extended.jar -Djava.util.logging.config.file=/opt/oracle/cell11.2.2.4.0_LINUX.X64_110929/cellsrv/deploy/config/jse-logging.properties  oracle.ossmgmt.ms.cli.CellCLI 8888 -e alter cell restart services all'
+ '[' 0 -ne 0 ']'
+ '[' 0 -eq 1 ']'
+ /usr/java/jdk1.5.0_15//bin/java -client -Dpid=9280 -classpath /opt/oracle/cell11.2.2.4.0_LINUX.X64_110929/cellsrv/deploy/dependencies/jline-0.9.9.jar:/opt/oracle/cell11.2.2.4.0_LINUX.X64_110929/cellsrv/lib/ossmgmt-cli.jar:/opt/oracle/cell11.2.2.4.0_LINUX.X64_110929/cellsrv/deploy/dependencies/wsclient_extended.jar -Djava.util.logging.config.file=/opt/oracle/cell11.2.2.4.0_LINUX.X64_110929/cellsrv/deploy/config/jse-logging.properties oracle.ossmgmt.ms.cli.CellCLI 8888 -e alter cell restart services all

Stopping the RS, CELLSRV, and MS services...
CELL-01509: Restart Server (RS) not responding.
Starting the RS, CELLSRV, and MS services...
CELL-01512: Cannot start a new Restart Server (RS).  Exception received: Permission denied
[celladmin@dm01cel02 ~]$ 

This does reproduce the error:
Stopping the RS, CELLSRV, and MS services...
CELL-01509: Restart Server (RS) not responding.
Starting the RS, CELLSRV, and MS services...
CELL-01512: Cannot start a new Restart Server (RS). Exception received: Permission denied

So why is permission denied? Let's try the same command as root:

[root@dm01cel02 trace]# id
uid=0(root) gid=0(root) groups=0(root),1(bin),2(daemon),3(sys),4(adm),6(disk),10(wheel)
[root@dm01cel02 trace]# sh -x cellcli -e alter cell restart services all
+ PS=/bin/ps
+ GREP=/bin/grep
+ SLEEP=/bin/sleep
+ ECHO=/bin/echo
+ MYPID=4356
+ MYPPID=8533
+ /bin/ps -fp 8533
+ /bin/grep ' sshd: '
+ /bin/grep @notty
+ '[' 1 -eq 0 ']'
+ SSHD_NOTTY_PARENT=0
+ [[ -z 1 ]]
+ [[ -z 1 ]]
+ [[ -z 1 ]]
+ JRE_HOME=/usr/java/jdk1.5.0_15/
+ JLINE=jline.ConsoleRunner
+ CMDECHO=0
+ AWK=/bin/awk
+ DEPENDENT_JARS_DIR=/opt/oracle/cell11.2.2.4.0_LINUX.X64_110929/cellsrv/deploy/dependencies
+ '[' '' -a '' '!=' '' ']'
+ for arg in '$*'
+ [[ -e == \-\n ]]
+ [[ -e == \-\m ]]
+ [[ -e == \-\e ]]
+ JLINE=
+ for arg in '$*'
+ [[ alter == \-\n ]]
+ [[ alter == \-\m ]]
+ [[ alter == \-\e ]]
+ for arg in '$*'
+ [[ cell == \-\n ]]
+ [[ cell == \-\m ]]
+ [[ cell == \-\e ]]
+ for arg in '$*'
+ [[ restart == \-\n ]]
+ [[ restart == \-\m ]]
+ [[ restart == \-\e ]]
+ for arg in '$*'
+ [[ services == \-\n ]]
+ [[ services == \-\m ]]
+ [[ services == \-\e ]]
+ for arg in '$*'
+ [[ all == \-\n ]]
+ [[ all == \-\m ]]
+ [[ all == \-\e ]]
+ pfile=/opt/oracle/cell11.2.2.4.0_LINUX.X64_110929/cellsrv/deploy/config/cellinit.ora
++ /bin/grep HTTP_PORT /opt/oracle/cell11.2.2.4.0_LINUX.X64_110929/cellsrv/deploy/config/cellinit.ora
++ /bin/awk '{print substr($0,11)}'
+ HTTP_PORT=8888
+ CELLCLI_JAVACMD='/usr/java/jdk1.5.0_15//bin/java -client -Dpid=4356 -classpath  /opt/oracle/cell11.2.2.4.0_LINUX.X64_110929/cellsrv/deploy/dependencies/jline-0.9.9.jar:/opt/oracle/cell11.2.2.4.0_LINUX.X64_110929/cellsrv/lib/ossmgmt-cli.jar:/opt/oracle/cell11.2.2.4.0_LINUX.X64_110929/cellsrv/deploy/dependencies/wsclient_extended.jar -Djava.util.logging.config.file=/opt/oracle/cell11.2.2.4.0_LINUX.X64_110929/cellsrv/deploy/config/jse-logging.properties  oracle.ossmgmt.ms.cli.CellCLI 8888 -e alter cell restart services all'
+ '[' 0 -ne 0 ']'
+ '[' 0 -eq 1 ']'
+ /usr/java/jdk1.5.0_15//bin/java -client -Dpid=4356 -classpath /opt/oracle/cell11.2.2.4.0_LINUX.X64_110929/cellsrv/deploy/dependencies/jline-0.9.9.jar:/opt/oracle/cell11.2.2.4.0_LINUX.X64_110929/cellsrv/lib/ossmgmt-cli.jar:/opt/oracle/cell11.2.2.4.0_LINUX.X64_110929/cellsrv/deploy/dependencies/wsclient_extended.jar -Djava.util.logging.config.file=/opt/oracle/cell11.2.2.4.0_LINUX.X64_110929/cellsrv/deploy/config/jse-logging.properties oracle.ossmgmt.ms.cli.CellCLI 8888 -e alter cell restart services all

Stopping the RS, CELLSRV, and MS services...
CELL-01509: Restart Server (RS) not responding.
Starting the RS, CELLSRV, and MS services...
Getting the state of RS services... 
 running
Starting CELLSRV services...
The STARTUP of CELLSRV services was successful.
Starting MS services...
The STARTUP of MS services was successful.
[root@dm01cel02 trace]# 

As shown, the command succeeds when run as root.

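Because the upgrade window was limited, we did not find out why the celladmin user was being denied permission. What was certain was that celladmin no longer worked on this cell, so we took the following measure:
1. Edit /etc/init.d/cell.d on this cell and replace every "su celladmin" invocation so that the command runs directly as root. For example,
change the following: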

start()
{
    dynamic_deploy
    su celladmin -c ". /etc/profile.d/cell_env.sh; cellcli -e \"alter cell startup services all\""
}

to:

start()
{
    dynamic_deploy
    cellcli -e "alter cell startup services all"
}
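
The edit was made by hand, but the same rewrite can be expressed as a sed filter. This is a sketch under stated assumptions, not the command that was used: the function name strip_su_wrapper is hypothetical, it needs a sed that supports -E, and it only handles lines of exactly the form shown above, where a single cellcli -e statement is wrapped in su celladmin -c. Lines that chain several commands inside the su call are left unchanged.

```shell
#!/bin/bash
# strip_su_wrapper (hypothetical helper): read script text on stdin and
# rewrite
#   su celladmin -c ". /etc/profile.d/cell_env.sh; cellcli -e \"STATEMENT\""
# into
#   cellcli -e "STATEMENT"
# Only single-statement lines match; everything else passes through as is.
strip_su_wrapper() {
    sed -E 's|su celladmin -c "\. /etc/profile\.d/cell_env\.sh; cellcli -e \\"([^"\\]*)\\""|cellcli -e "\1"|'
}
```

A cautious way to use it is to keep a backup copy of /etc/init.d/cell.d, run the filter on the backup, and review the result with diff before installing it.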

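The start, stop, restart and status functions were all changed in this way, and the upgrade then succeeded. From then on, the customer could only use root to maintain this cell.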

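We did not follow up on the celladmin permissions, but the cell is usable again (other concerns such as system security are set aside for now).
Looking back, had there been time, we could have consulted the user-creation parts of the onecommand installation scripts and recreated the celladmin user. Whether that would have fixed the problem is unknown...
Anyway, there will be more options next time. O(∩_∩)O haha~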
