oracle_调整主机时间导致的宕库-ntpupdate_孙成子子

http://blog.sina.com.cn/u/2199343467

首页博文目录关于我

个人资料

微博

加好友发纸条

写留言加关注

博客等级：
博客积分：

博客访问：
关注人气：
获赠金笔：0支
赠出金笔：0支
荣誉徽章：

正文字体大小：大中小

oracle_调整主机时间导致的宕库-ntpupdate

(2017-01-06 22:34:42)

标签：

ntpupdate

date

12c

gen0

rac

分类： oraerror

由于调整数据库主机时间导致的数据库实例宕机重启的故障。

应用反应主机时间与正确的时间相差有8分多钟，影响了正常的业务，登录发现主机的NTP服务是开启的，查看NTP同步状态：

remote refid st t when poll reach delay offset jitter

================================================================

*172.xx.xx.xx 139.114.32.135 2 u 22 64 377 0.180 0.051 0.028

可以看到offset是0.051s,基本没有延迟，那么问题就出在Ntpserver的时间存在不准确的可能，通过主机工程师

查看，果然server端存在延迟的情况。

为尽快恢复业务，通过以下方式来处理时间延迟，停止NTP服务更改服务端到一个正常的NTP服务器，在不停库的情况下，手工微调时间，来追平发生的延迟，步骤如下：

1.停止NTP服务修改服务器地址

#/etc/init.d/ntpd stop

#vi /etc/ntp.conf

# Enable writing of statistics records.

#statistics clockstats cryptostats loopstats peerstats

#server 172.72.20.131 prefer minpoll 6 maxpoll 6

server 10.19.244.52 prefer minpoll 6 maxpoll 6

logfile /var/log/dsware_ntp.log.0

2.每半分钟调一次，等半分钟，再跳一次

date -s "10:41:00 2017-01-06";clock -w

date -s "10:42:00 2017-01-06";clock -w

date -s "10:43:00 2017-01-06";clock -w

date -s "10:44:00 2017-01-06";clock -w

date -s "10:45:00 2017-01-06";clock -w

date -s "10:46:00 2017-01-06";clock -w

date -s "10:47:00 2017-01-06";clock -w

date -s "10:48:00 2017-01-06";clock -w

date -s "10:49:00 2017-01-06";clock -w

date -s "10:50:00 2017-01-06";clock -w

date -s "10:51:00 2017-01-06";clock -w

date -s "10:52:00 2017-01-06";clock -w

date -s "10:53:00 2017-01-06";clock -w

date -s "10:54:00 2017-01-06";clock -w

date -s "10:55:00 2017-01-06";clock -w

date -s "10:56:00 2017-01-06";clock -w

date -s "10:57:00 2017-01-06";clock -w

date -s "10:58:00 2017-01-06";clock -w

3. 启动NTP服务

#/etc/init.d/ntpd start

以上操作在一个数据库主机上正常执行后，数据库没有发生任何异常的情况。

在调整另一台数据库主机服务器时间时，发生了宕机的情况，

主机工程师手动调整server的时间到正确的时间，然后又通过ntpdate调整数据库服务器的时间追平服务端。

也就是数据库主机一下子调整了8分多钟的时间跨度，当调整完成后，悲剧就放生了，数据库宕机，分析如下：

ALERT报错：

Fri Jan 06 11:33:30 2017

Errors in file /oracle_log/diag/rdbms/cdbgx2/ogx2a/trace/ogx2a_asmb_67035.trc:

ORA-15064: communication failure with ASM instance

ORA-03113: end-of-file on communication channel

Process ID:

Session ID: 90 Serial number: 56760

Fri Jan 06 11:33:30 2017

Errors in file /oracle_log/diag/rdbms/cdbgx2/ogx2a/trace/ogx2a_asmb_67035.trc:

ORA-15064: communication failure with ASM instance

ORA-03113: end-of-file on communication channel

Process ID:

Session ID: 90 Serial number: 56760

USER (ospid: 67035): terminating the instance due to error 15064

Fri Jan 06 11:33:30 2017

opiodr aborting process unknown ospid (22340) as a result of ORA-1092

Fri Jan 06 11:33:30 2017

ORA-1092 : opitsk aborting process

报错无法与ASM实例发生通信，那么接下来我们查看ASM的ALERT日志。

2016-12-27 23:05:53.756000 +08:00

Warning: VKTM detected a time drift.

Time drifts can result in an unexpected behavior such as time-outs. Please check trace file for more details.

2017-01-06 11:33:30.143000 +08:00

WARNING: client [+ASM1:+ASM:c5ogx2-cluster] not responsive for 494s; state=0x1. pid 121601

NOTE: umbilicus traces dumped to /oracle_log/diag/asm/+asm/+ASM1/trace/+ASM1_gen0_97907.trc

WARNING: client [ogx2a:cdbgx2:c5ogx2-cluster] not responsive for 494s; state=0x1. killing pid 67039

NOTE: umbilicus traces dumped to /oracle_log/diag/asm/+asm/+ASM1/trace/+ASM1_gen0_97907.trc

WARNING: fencing client [ogx2a:cdbgx2:c5ogx2-cluster] after 494 seconds (mbr 1)

WARNING: ASMB has not responded for 494 seconds

NOTE: ASM umbilicus running slower than expected, ASMB diagnostic requested after 494 seconds

NOTE: ASMB process state dumped to trace file /oracle_log/diag/asm/+asm/+ASM1/trace/+ASM1_gen0_97907.trc

ERROR: terminating instance because ASMB is stuck for 494 seconds

System State dumped to trace file /oracle_log/diag/asm/+asm/+ASM1/trace/+ASM1_gen0_97907.trc

2017-01-06 11:33:32.261000 +08:00

报错，客户端-cluster在494s内无法响应，导致ASMB阻塞终止了ASM实例，顺理成章的，DB实例无法连接ASM实例，也就宕机。

查看指定TRACE文件内容如下：

*** 2017-01-06 11:33:32.261

GEN0 (ospid: 97907): terminating the instance due to error 15082

ksuitm: waiting up to [5] seconds before killing DIAG(97913)

[/home/oracle] oerr ora 15082

15082, 00000, "ASM failed to communicate with client"

// *Cause: There was a failure or time out when ASM tried to communicate with

// a connected RDBMS or Oracle ASM Dynamic Volume Manager

// (Oracle ADVM) client.

// *Action: Check the accompanying error messages and alert logs

// for more information on the reason for the failure.

// Check system specific logs (/var/log/messages on Linux,

// Event Log on Windows) for Oracle ADVM messages.

通过错误提示，表明是ASM无法与客户端通信，或超时，让检查相关日志，比如系统日志。

Jan 6 11:21:34 c5ogx2b ntpd[18672]: ntpd 4.2.6p5@1.2349-o Fri Oct 11 03:18:05 UTC 2013 (1)

当然也就是主机工程师做的ntpupdate操作；

发现日志中的超时494s,换算成分钟，也就是8.33分钟，正好是修改的时间跨度。那么就可以认为是大跨度修改主机时间导致的宕机，我理解，这里正常的timeout时间，也就是<1秒的时间，当时由于调整时间，两次获取操作系统的时间大于了允许的超时时间，导致ASM误认为现行是有问题的。

所以，当我们在调整数据库主机时间的时候，还是建议微调，禁止一次跨度太大。

在MOS上也有相关文章：

GEN0 terminating the ASM instance due to error 15082 (文档 ID 2096988.1)

ALERT日志也类似：

WARNING: client [+ASM1:+ASM:padcdbo-cluster] not responsive for 22270121s; state=0x1. pid 57967

The issue is due to Bug 19032250.

The issue is fixed in 12.1.0.2,Please request for patch if it is not available for your version.

不过版本是12.1.0.1，而我们的版本是12.1.0.2.

阅读┊ 收藏 ┊ 喜欢 ▼ ┊打印┊举报/Report

前一篇：2016 was gone, 2017 is coming.

后一篇：oracle_xtts之源端11203目标端版本11204，使用dbms_file_transfer方式

新浪BLOG意见反馈留言板　欢迎批评指正