加载中…
个人资料
  • 博客等级:
  • 博客积分:
  • 博客访问:
  • 关注人气:
  • 获赠金笔:0支
  • 赠出金笔:0支
  • 荣誉徽章:
正文 字体大小:

oracle_调整主机时间导致的宕库-ntpupdate

(2017-01-06 22:34:42)
标签:

ntpupdate

date

12c

gen0

rac

分类: oraerror

由于调整数据库主机时间导致的数据库实例宕机重启的故障。

应用反应主机时间与正确的时间相差有8分多钟,影响了正常的业务, 登录发现主机的NTP服务是开启的,查看NTP同步状态:

     remote           refid      st t when poll reach   delay   offset  jitter
================================================================
*172.xx.xx.xx    139.114.32.135   2 u   22   64  377    0.180   0.051   0.028

可以看到offset是0.051s,基本没有延迟,那么问题就出在Ntpserver的时间存在不准确的可能,通过主机工程师
查看,果然server端存在延迟的情况。
为尽快恢复业务,通过以下方式来处理时间延迟,停止NTP服务更改服务端到一个正常的NTP服务器,在不停库的情况下,手工微调时间,来追平发生的延迟,步骤如下:
1.停止NTP服务修改服务器地址
#/etc/init.d/ntpd stop
#vi /etc/ntp.conf

# Enable writing of statistics records.
#statistics clockstats cryptostats loopstats peerstats
#server 172.72.20.131 prefer minpoll 6 maxpoll 6
server 10.19.244.52 prefer minpoll 6 maxpoll 6
logfile /var/log/dsware_ntp.log.0
2.每半分钟调一次, 等半分钟,再跳一次

date -s "10:41:00 2017-01-06";clock -w
date -s "10:42:00 2017-01-06";clock -w
date -s "10:43:00 2017-01-06";clock -w
date -s "10:44:00 2017-01-06";clock -w
date -s "10:45:00 2017-01-06";clock -w
date -s "10:46:00 2017-01-06";clock -w
date -s "10:47:00 2017-01-06";clock -w
date -s "10:48:00 2017-01-06";clock -w
date -s "10:49:00 2017-01-06";clock -w
date -s "10:50:00 2017-01-06";clock -w
date -s "10:51:00 2017-01-06";clock -w
date -s "10:52:00 2017-01-06";clock -w
date -s "10:53:00 2017-01-06";clock -w
date -s "10:54:00 2017-01-06";clock -w
date -s "10:55:00 2017-01-06";clock -w
date -s "10:56:00 2017-01-06";clock -w
date -s "10:57:00 2017-01-06";clock -w
date -s "10:58:00 2017-01-06";clock -w

3. 启动NTP服务
#/etc/init.d/ntpd start

以上操作在一个数据库主机上正常执行后,数据库没有发生任何异常的情况。

在调整另一台数据库主机服务器时间时,发生了宕机的情况,
主机工程师手动调整server的时间到正确的时间,然后又通过ntpdate调整数据库服务器的时间追平服务端。
也就是数据库主机一下子调整了8分多钟的时间跨度,当调整完成后,悲剧就放生了,数据库宕机,分析如下:

ALERT报错:
Fri Jan 06 11:33:30 2017
Errors in file /oracle_log/diag/rdbms/cdbgx2/ogx2a/trace/ogx2a_asmb_67035.trc:
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel
Process ID:
Session ID: 90 Serial number: 56760
Fri Jan 06 11:33:30 2017
Errors in file /oracle_log/diag/rdbms/cdbgx2/ogx2a/trace/ogx2a_asmb_67035.trc:
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel
Process ID:
Session ID: 90 Serial number: 56760
USER (ospid: 67035): terminating the instance due to error 15064
Fri Jan 06 11:33:30 2017
opiodr aborting process unknown ospid (22340) as a result of ORA-1092
Fri Jan 06 11:33:30 2017
ORA-1092 : opitsk aborting process

报错无法与ASM实例发生通信,那么接下来我们查看ASM的ALERT日志。
2016-12-27 23:05:53.756000 +08:00
Warning: VKTM detected a time drift.
Time drifts can result in an unexpected behavior such as time-outs. Please check trace file for more details.
2017-01-06 11:33:30.143000 +08:00
WARNING: client [+ASM1:+ASM:c5ogx2-cluster] not responsive for 494s; state=0x1. pid 121601
NOTE: umbilicus traces dumped to /oracle_log/diag/asm/+asm/+ASM1/trace/+ASM1_gen0_97907.trc
WARNING: client [ogx2a:cdbgx2:c5ogx2-cluster] not responsive for 494s; state=0x1. killing pid 67039
NOTE: umbilicus traces dumped to /oracle_log/diag/asm/+asm/+ASM1/trace/+ASM1_gen0_97907.trc
WARNING: fencing client [ogx2a:cdbgx2:c5ogx2-cluster] after 494 seconds (mbr 1)
WARNING: ASMB has not responded for 494 seconds
NOTE: ASM umbilicus running slower than expected, ASMB diagnostic requested after 494 seconds
NOTE: ASMB process state dumped to trace file /oracle_log/diag/asm/+asm/+ASM1/trace/+ASM1_gen0_97907.trc
ERROR: terminating instance because ASMB is stuck for 494 seconds
System State dumped to trace file /oracle_log/diag/asm/+asm/+ASM1/trace/+ASM1_gen0_97907.trc
2017-01-06 11:33:32.261000 +08:00

报错,客户端-cluster在494s内无法响应,导致ASMB阻塞终止了ASM实例,顺理成章的,DB实例无法连接ASM实例,也就宕机。

查看指定TRACE文件内容如下:
*** 2017-01-06 11:33:32.261
GEN0 (ospid: 97907): terminating the instance due to error 15082
ksuitm: waiting up to [5] seconds before killing DIAG(97913)

[/home/oracle] oerr ora 15082
15082, 00000, "ASM failed to communicate with client"
// *Cause:  There was a failure or time out when ASM tried to communicate with
//          a connected RDBMS or Oracle ASM Dynamic Volume Manager
//          (Oracle ADVM) client.
// *Action: Check the accompanying error messages and alert logs
//          for more information on the reason for the failure.
//          Check system specific logs (/var/log/messages on Linux,
//          Event Log on Windows) for Oracle ADVM messages.

通过错误提示,表明是ASM无法与客户端通信,或超时,让检查相关日志,比如系统日志。
Jan  6 11:21:34 c5ogx2b ntpd[18672]: ntpd 4.2.6p5@1.2349-o Fri Oct 11 03:18:05 UTC 2013 (1)

当然也就是主机工程师做的ntpupdate操作;

发现日志中的超时494s,换算成分钟,也就是8.33分钟,正好是修改的时间跨度。那么就可以认为是大跨度修改主机时间导致的宕机,  我理解,这里正常的timeout时间,也就是<1秒的时间,当时由于调整时间,两次获取操作系统的时间大于了允许的超时时间,导致ASM误认为现行是有问题的。


所以,当我们在调整数据库主机时间的时候,还是建议微调,禁止一次跨度太大。


在MOS上也有相关文章:
GEN0 terminating the ASM instance due to error 15082 (文档 ID 2096988.1)
ALERT日志也类似:
WARNING: client [+ASM1:+ASM:padcdbo-cluster] not responsive for 22270121s; state=0x1. pid 57967
The issue is due to Bug 19032250.
The issue is fixed in 12.1.0.2,Please request for patch if it is not available for your version.
不过版本是12.1.0.1,而我们的版本是12.1.0.2.

0

阅读 收藏 喜欢 打印举报/Report
  

新浪BLOG意见反馈留言板 欢迎批评指正

新浪简介 | About Sina | 广告服务 | 联系我们 | 招聘信息 | 网站律师 | SINA English | 产品答疑

新浪公司 版权所有