oracle_调整主机时间导致的宕库-ntpupdate
(2017-01-06 22:34:42)
标签:
ntpupdatedate12cgen0rac |
分类: oraerror |
由于调整数据库主机时间导致的数据库实例宕机重启的故障。
应用反应主机时间与正确的时间相差有8分多钟,影响了正常的业务,
登录发现主机的NTP服务是开启的,查看NTP同步状态:
================================================================
*172.xx.xx.xx
139.114.32.135 2 u
22 64 377
0.180
0.051 0.028
可以看到offset是0.051s,基本没有延迟,那么问题就出在Ntpserver的时间存在不准确的可能,通过主机工程师
查看,果然server端存在延迟的情况。
为尽快恢复业务,通过以下方式来处理时间延迟,停止NTP服务更改服务端到一个正常的NTP服务器,在不停库的情况下,手工微调时间,来追平发生的延迟,步骤如下:
1.停止NTP服务修改服务器地址
#/etc/init.d/ntpd stop
#vi /etc/ntp.conf
# Enable writing of statistics records.
#statistics clockstats cryptostats loopstats peerstats
#server 172.72.20.131 prefer minpoll 6 maxpoll 6
server 10.19.244.52 prefer minpoll 6 maxpoll 6
logfile /var/log/dsware_ntp.log.0
2.每半分钟调一次, 等半分钟,再跳一次
date -s "10:41:00 2017-01-06";clock -w
date -s "10:42:00 2017-01-06";clock -w
date -s "10:43:00 2017-01-06";clock -w
date -s "10:44:00 2017-01-06";clock -w
date -s "10:45:00 2017-01-06";clock -w
date -s "10:46:00 2017-01-06";clock -w
date -s "10:47:00 2017-01-06";clock -w
date -s "10:48:00 2017-01-06";clock -w
date -s "10:49:00 2017-01-06";clock -w
date -s "10:50:00 2017-01-06";clock -w
date -s "10:51:00 2017-01-06";clock -w
date -s "10:52:00 2017-01-06";clock -w
date -s "10:53:00 2017-01-06";clock -w
date -s "10:54:00 2017-01-06";clock -w
date -s "10:55:00 2017-01-06";clock -w
date -s "10:56:00 2017-01-06";clock -w
date -s "10:57:00 2017-01-06";clock -w
date -s "10:58:00 2017-01-06";clock -w
3. 启动NTP服务
#/etc/init.d/ntpd start
以上操作在一个数据库主机上正常执行后,数据库没有发生任何异常的情况。
在调整另一台数据库主机服务器时间时,发生了宕机的情况,
主机工程师手动调整server的时间到正确的时间,然后又通过ntpdate调整数据库服务器的时间追平服务端。
也就是数据库主机一下子调整了8分多钟的时间跨度,当调整完成后,悲剧就放生了,数据库宕机,分析如下:
ALERT报错:
Fri Jan 06 11:33:30 2017
Errors in file
/oracle_log/diag/rdbms/cdbgx2/ogx2a/trace/ogx2a_asmb_67035.trc:
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel
Process ID:
Session ID: 90 Serial number: 56760
Fri Jan 06 11:33:30 2017
Errors in file
/oracle_log/diag/rdbms/cdbgx2/ogx2a/trace/ogx2a_asmb_67035.trc:
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel
Process ID:
Session ID: 90 Serial number: 56760
USER (ospid: 67035): terminating the instance due to error
15064
Fri Jan 06 11:33:30 2017
opiodr aborting process unknown ospid (22340) as a result of
ORA-1092
Fri Jan 06 11:33:30 2017
ORA-1092 : opitsk aborting process
报错无法与ASM实例发生通信,那么接下来我们查看ASM的ALERT日志。
2016-12-27 23:05:53.756000 +08:00
Warning: VKTM detected a time drift.
Time drifts can result in an unexpected behavior such as
time-outs. Please check trace file for more details.
2017-01-06 11:33:30.143000 +08:00
WARNING: client
[+ASM1:+ASM:c5ogx2-cluster] not responsive for 494s; state=0x1. pid
121601
NOTE: umbilicus traces dumped to
/oracle_log/diag/asm/+asm/+ASM1/trace/+ASM1_gen0_97907.trc
WARNING: client [ogx2a:cdbgx2:c5ogx2-cluster] not responsive
for 494s; state=0x1. killing pid 67039
NOTE: umbilicus traces dumped to
/oracle_log/diag/asm/+asm/+ASM1/trace/+ASM1_gen0_97907.trc
WARNING: fencing client [ogx2a:cdbgx2:c5ogx2-cluster] after
494 seconds (mbr 1)
WARNING: ASMB has not responded for 494 seconds
NOTE: ASM umbilicus running slower than expected, ASMB
diagnostic requested after 494 seconds
NOTE: ASMB process state dumped to trace file
/oracle_log/diag/asm/+asm/+ASM1/trace/+ASM1_gen0_97907.trc
ERROR: terminating instance because ASMB is stuck for 494
seconds
System State dumped to trace file
/oracle_log/diag/asm/+asm/+ASM1/trace/+ASM1_gen0_97907.trc
2017-01-06 11:33:32.261000 +08:00
报错,客户端-cluster在494s内无法响应,导致ASMB阻塞终止了ASM实例,顺理成章的,DB实例无法连接ASM实例,也就宕机。
查看指定TRACE文件内容如下:
*** 2017-01-06 11:33:32.261
GEN0 (ospid: 97907): terminating the instance due to error
15082
ksuitm: waiting up to [5] seconds before killing
DIAG(97913)
[/home/oracle] oerr ora 15082
15082, 00000, "ASM failed to communicate with client"
// *Cause: There was a failure or time out
when ASM tried to communicate with
//
a connected RDBMS or Oracle
ASM Dynamic Volume Manager
//
(Oracle ADVM) client.
// *Action: Check the accompanying error messages and alert
logs
//
for more information on the
reason for the failure.
//
Check system specific logs
(/var/log/messages on Linux,
//
Event Log on Windows) for
Oracle ADVM messages.
通过错误提示,表明是ASM无法与客户端通信,或超时,让检查相关日志,比如系统日志。
Jan 6 11:21:34 c5ogx2b ntpd[18672]: ntpd
4.2.6p5@1.2349-o Fri Oct 11 03:18:05 UTC 2013 (1)
当然也就是主机工程师做的ntpupdate操作;
发现日志中的超时494s,换算成分钟,也就是8.33分钟,正好是修改的时间跨度。那么就可以认为是大跨度修改主机时间导致的宕机,
我理解,这里正常的timeout时间,也就是<1秒的时间,当时由于调整时间,两次获取操作系统的时间大于了允许的超时时间,导致ASM误认为现行是有问题的。
所以,当我们在调整数据库主机时间的时候,还是建议微调,禁止一次跨度太大。
在MOS上也有相关文章:
GEN0 terminating the ASM instance due to error 15082 (文档 ID
2096988.1)
ALERT日志也类似:
WARNING: client [+ASM1:+ASM:padcdbo-cluster]
not responsive for 22270121s; state=0x1. pid 57967
The issue is
due to Bug 19032250.
The
issue is fixed in 12.1.0.2,Please request for patch if it is not
available for your version.
不过版本是12.1.0.1,而我们的版本是12.1.0.2.