NODE1 and NODE2 are the two nodes of a cluster.
OS: Red Hat Linux 4.6, 32-bit; the cluster stack is Red Hat Cluster Suite.
The relevant settings in the system's /etc/lvm/lvm.conf are:

volume_list = [ "vg00", "cisvg01", "cisvg04", "cisvg05", "@NODE1" ]
locking_type = 1

cisvg02 and cisvg03 are the two VGs carried by the cluster resource group: they are activated on whichever node the service's file systems are switched to, while the VGs named explicitly in volume_list are activated automatically at boot.
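This is the usual HA-LVM tagging arrangement: a VG whose name is not in volume_list can only be activated on a node whose name matches an ownership tag on the volume. A minimal sketch of how the two sides fit together (the lvchange commands are inferred from the "ownership tag" log messages below, not copied from the cluster scripts):

```
# /etc/lvm/lvm.conf on each node: plain names are auto-activated at boot;
# "@NODE1" permits activation of any volume tagged with this node's name.
volume_list = [ "vg00", "cisvg01", "cisvg04", "cisvg05", "@NODE1" ]

# What the cluster effectively does when it starts the service here:
#   lvchange --addtag NODE1 cisvg02/cisv02l5001
#   lvchange -ay cisvg02/cisv02l5001
# and the reverse (lvchange -an, then --deltag) when it stops the service.
```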
Because of a migration from the old storage array to a new one, the system administrator stopped the cluster service for the check-imaging application. There were no errors and the cluster stopped cleanly:
Jun 2 19:07:11 NODE1 clurgmgrd[21037]: <notice> Stopping service cis1res1
Jun 2 19:07:11 NODE1 clurgmgrd: [21037]: <info> Executing /etc/rc.d/init.d/ccb_cis stop
Jun 2 19:07:11 NODE1 su(pam_unix)[23404]: session opened for user xwj by (uid=0)
Jun 2 19:07:23 NODE1 su(pam_unix)[23404]: session closed for user xwj
Jun 2 19:07:23 NODE1 su(pam_unix)[23493]: session opened for user informix by (uid=0)
Jun 2 19:07:26 NODE1 su(pam_unix)[23493]: session closed for user informix
Jun 2 19:07:26 NODE1 clurgmgrd: [21037]: <info> Removing IPv4 address 52.0.33.11 from bond0
Jun 2 19:07:37 NODE1 clurgmgrd: [21037]: <info> unmounting /home/ap/xwj/Log
Jun 2 19:07:37 NODE1 clurgmgrd: [21037]: <info> unmounting /home/ap/xwj/Bin/CISFiles
Jun 2 19:07:38 NODE1 clurgmgrd: [21037]: <notice> Deactivating cisvg03/cisv03l5002
Jun 2 19:07:38 NODE1 clurgmgrd: [21037]: <notice> Making resilient : lvchange -an cisvg03/cisv03l5002
Jun 2 19:07:38 NODE1 clurgmgrd: [21037]: <notice> Resilient command: lvchange -an cisvg03/cisv03l5002 --config devices{filter=["a|/dev/cciss/c0d0p2|","a|/dev/cciss/c0d0p3|","a|/dev/sdg|","a|/dev/sdh1|","a|/dev/sdh2|","a|/dev/sdh3|","r|.*|"]}
Jun 2 19:07:38 NODE1 clurgmgrd: [21037]: <notice> Removing ownership tag (NODE1) from cisvg03/cisv03l5002
Jun 2 19:07:38 NODE1 clurgmgrd: [21037]: <notice> Deactivating cisvg02/cisv02l5001
Jun 2 19:07:38 NODE1 clurgmgrd: [21037]: <notice> Making resilient : lvchange -an cisvg02/cisv02l5001
Jun 2 19:07:38 NODE1 clurgmgrd: [21037]: <notice> Resilient command: lvchange -an cisvg02/cisv02l5001 --config devices{filter=["a|/dev/cciss/c0d0p2|","a|/dev/cciss/c0d0p3|","a|/dev/sdg|","a|/dev/sdh1|","a|/dev/sdh2|","a|/dev/sdh3|","r|.*|"]}
Jun 2 19:07:39 NODE1 clurgmgrd: [21037]: <notice> Removing ownership tag (NODE1) from cisvg02/cisv02l5001
Jun 2 19:07:39 NODE1 clurgmgrd[21037]: <notice> Service cis1res1 is disabled
Jun 2 19:09:23 NODE1 rgmanager: [25472]: <notice> Shutting down Cluster Service Manager...
Jun 2 19:09:23 NODE1 clurgmgrd[21037]: <notice> Shutting down
Jun 2 19:09:25 NODE1 clurgmgrd[21037]: <notice> Shutdown complete, exiting
Jun 2 19:09:25 NODE1 rgmanager: [25472]: <notice> Cluster Service Manager is stopped.
Jun 2 19:09:26 NODE1 fenced: shutdown succeeded
Jun 2 19:09:27 NODE1 kernel: CMAN: we are leaving the cluster. Removed
Jun 2 19:09:27 NODE1 kernel: WARNING: dlm_emergency_shutdown
Jun 2 19:09:27 NODE1 kernel: WARNING: dlm_emergency_shutdown
Jun 2 19:09:27 NODE1 ccsd[20304]: Cluster manager shutdown. Attemping to reconnect...
Jun 2 19:09:30 NODE1 kernel: NET: Unregistered protocol family 30
Jun 2 19:09:30 NODE1 cman: shutdown succeeded
Jun 2 19:09:31 NODE1 ccsd[20304]: Stopping ccsd, SIGTERM received.
Jun 2 19:09:32 NODE1 ccsd: shutdown succeeded
Next, the storage vendor's engineer rebooted the 480 array's two storage processors, SPA and SPB, one after the other. Corresponding logs:
Jun 2 20:13:23 NODE1 kernel: Error:Mpx:Path Bus 0 Tgt 1 Lun 1 to FCN00114400004 is dead.
Jun 2 20:13:23 NODE1 kernel: Error:Mpx:Killing bus 0 to Clariion FCN00114400004 port SP A2.
Jun 2 20:13:23 NODE1 kernel: Error:Mpx:Path Bus 0 Tgt 1 Lun 0 to FCN00114400004 is dead.
Jun 2 20:15:46 NODE1 kernel: Info:Mpx:Assigned volume 6006016084802E001057FFF3D37FE111 to SPB
Jun 2 20:15:46 NODE1 kernel: SCSI device sdg: 461373440 512-byte hdwr sectors (236223 MB)
Jun 2 20:15:46 NODE1 kernel: SCSI device sdg: drive cache: write through
Jun 2 20:15:46 NODE1 kernel: sdg: unknown partition table
Jun 2 20:23:43 NODE1 kernel: Info:Mpx:Path Bus 0 Tgt 1 Lun 0 to FCN00114400004 is alive.
Jun 2 20:23:43 NODE1 kernel: Info:Mpx:Assigned volume 6006016084802E001057FFF3D37FE111 to SPA
Jun 2 20:23:53 NODE1 kernel: Info:Mpx:Path Bus 0 Tgt 1 Lun 1 to FCN00114400004 is alive.
Jun 2 20:27:35 NODE1 kernel: Error:Mpx:Path Bus 1 Tgt 1 Lun 0 to FCN00114400004 is dead.
Jun 2 20:27:35 NODE1 kernel: Error:Mpx:Killing bus 1 to Clariion FCN00114400004 port SP B2.
Jun 2 20:27:35 NODE1 kernel: Error:Mpx:Path Bus 1 Tgt 1 Lun 1 to FCN00114400004 is dead.
Jun 2 20:35:45 NODE1 kernel: Info:Mpx:Path Bus 1 Tgt 1 Lun 0 to FCN00114400004 is alive.
Jun 2 20:35:45 NODE1 kernel: Info:Mpx:Path Bus 1 Tgt 1 Lun 1 to FCN00114400004 is alive.
Afterwards the system administrator tried to bring the cluster back up, but it failed with errors: the cluster resources would not start. Suspecting the storage migration, we checked the logs. At Jun 2 20:52 the host was rebooted and another attempt to start the cluster also failed (the errors point at the cisvg02 VG):
Jun 2 20:56:05 NODE1 clurgmgrd: [22072]: <err> stop: Could not match /dev/cisvg03/cisv03l5002 with a real device
Jun 2 20:56:05 NODE1 clurgmgrd[22072]: <notice> stop on fs:log returned 2 (invalid argument(s))
Jun 2 20:56:05 NODE1 clurgmgrd: [22072]: <notice> Deactivating cisvg03/cisv03l5002
Jun 2 20:56:05 NODE1 clurgmgrd: [22072]: <notice> Making resilient : lvchange -an cisvg03/cisv03l5002
Jun 2 20:56:05 NODE1 clurgmgrd: [22072]: <notice> Resilient command: lvchange -an cisvg03/cisv03l5002 --config devices{filter=["a|/dev/cciss/c0d0p2|","a|/dev/cciss/c0d0p3|","a|/dev/emcpowerc|","a|/dev/emcpowerd1|","a|/dev/emcpowerd2|","a|/dev/emcpowerd3|","r|.*|"]}
Jun 2 20:56:06 NODE1 clurgmgrd: [22072]: <notice> Removing ownership tag (NODE1) from cisvg03/cisv03l5002
Jun 2 20:56:06 NODE1 clurgmgrd: [22072]: <notice> Deactivating cisvg02/cisv02l5001
Jun 2 20:56:06 NODE1 clurgmgrd: [22072]: <notice> Making resilient : lvchange -an cisvg02/cisv02l5001
Jun 2 20:56:06 NODE1 clurgmgrd: [22072]: <notice> Resilient command: lvchange -an cisvg02/cisv02l5001 --config devices{filter=["a|/dev/cciss/c0d0p2|","a|/dev/cciss/c0d0p3|","a|/dev/emcpowerc|","a|/dev/emcpowerd1|","a|/dev/emcpowerd2|","a|/dev/emcpowerd3|","r|.*|"]}
Jun 2 20:56:06 NODE1 clurgmgrd: [22072]: <err> lvm_exec_resilient failed
Jun 2 20:56:06 NODE1 clurgmgrd: [22072]: <err> lv_activate_resilient stop failed on cisvg02/cisv02l5001
Jun 2 20:56:06 NODE1 clurgmgrd: [22072]: <err> Unable to deactivate cisvg02/cisv02l5001
Jun 2 20:56:06 NODE1 clurgmgrd: [22072]: <err> Failed to stop cisvg02/cisv02l5001
Jun 2 20:56:06 NODE1 clurgmgrd: [22072]: <err> Failed to stop cisvg02/cisv02l5001
Jun 2 20:56:06 NODE1 clurgmgrd[22072]: <notice> stop on lvm:cisvg02 returned 1 (generic error)
The log also contains "/dev/emcpowerd1: Checksum error" messages (/dev/emcpowerd1 is the PV backing cisvg02):
Jun 2 21:35:12 NODE1 vgchange: /dev/sdd1: Checksum error
Jun 2 21:35:37 NODE1 kernel: PCI: HP ProLiant BL460c G1 detected, enabling pci=bfsort.
Jun 2 21:35:12 NODE1 vgchange: Found duplicate PV g8eRGtYloHimccQfSlk3iScGM6GQhJ8Q: using /dev/emcpowerd1 not /dev/sdd1
Jun 2 21:35:37 NODE1 kernel: PCI: Probing PCI hardware (bus 00)
Jun 2 21:35:12 NODE1 vgchange: /dev/emcpowerd1: Checksum error
Jun 2 21:35:37 NODE1 kernel: PCI: Transparent bridge - 0000:00:1e.0
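The "Found duplicate PV" line appears because LVM scans both the native SCSI path (/dev/sdd1) and the PowerPath pseudo-device (/dev/emcpowerd1) layered on top of it. A hypothetical global lvm.conf filter that hides the raw sd* paths so only the pseudo-devices are scanned (the cluster here instead passes a per-command filter via --config, as the "Resilient command" lines show):

```
# Accept local cciss disks and PowerPath pseudo-devices, reject all else:
filter = [ "a|/dev/cciss/.*|", "a|/dev/emcpower.*|", "r|.*|" ]
```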
From these errors we concluded that the shared disk itself had a problem. Inspecting the physical volumes at the OS level showed that the cisvg02 information on the /dev/emcpowerd1 partition had been lost, which is why the cluster could not start:
PV                 VG       Fmt   Attr  PSize     PFree
/dev/cciss/c0d0p2  cisvg01  lvm2  a-     96.50G    1.50G
/dev/cciss/c0d0p3  vg00     lvm2  a-     39.97G    2.97G
/dev/emcpowerc     cisvg04  lvm2  a-    219.88G   20.00G
/dev/emcpowerd1             lvm2  --    499.99G  499.99G   (VG information lost)
/dev/emcpowerd2    cisvg03  lvm2  a-     99.88G    2.38G
/dev/emcpowerd3    cisvg05  lvm2  a-    108.00M   28.00M
We then used dd to dump the first 2 GB of this physical volume to a file; running strings on the dump still showed readable content, which meant the data on the disk was intact:

dd if=/dev/emcpowerd1 of=/tmp/d1 bs=1024k count=2048
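The same sampling check can be sketched against an ordinary file instead of the real PV (all paths here are hypothetical stand-ins for /dev/emcpowerd1):

```shell
# Create a stand-in "PV" containing a recognizable string, sample it with
# dd, then confirm strings can still find readable text in the sample.
printf 'LVM2 metadata for cisvg02 ...' > /tmp/fake_pv
dd if=/tmp/fake_pv of=/tmp/d1_sample bs=1024k count=2 2>/dev/null
strings /tmp/d1_sample | grep -q cisvg02 && echo "data still readable"
```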
Based on this we decided to attempt a repair. Before repairing, a temporary 1 TB LUN was allocated and the whole PV was backed up onto it with dd (the backup file came to about 500 GB):

nohup dd if=/dev/emcpowerd1 of=/mnt/emcpowerd1/d1 bs=1024k &
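Before relying on such an image, it is worth verifying it against the source; a file-based sketch (hypothetical paths standing in for /dev/emcpowerd1 and the scratch LUN):

```shell
# Copy a stand-in source with dd, then compare checksums of source and
# image; identical sums mean the copy is faithful.
printf 'pv payload' > /tmp/src_pv
dd if=/tmp/src_pv of=/tmp/pv_backup bs=1024k 2>/dev/null
[ "$(md5sum < /tmp/src_pv)" = "$(md5sum < /tmp/pv_backup)" ] && echo "backup verified"
```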
With the backup complete, we ran the repair command:

vgcfgrestore cisvg02

(the system's LVM directory keeps a record of the VG's metadata changes, so the configuration could be restored). The cluster then started without errors and the application returned to normal.
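vgcfgrestore works because LVM snapshots every metadata change as a plain-text file under /etc/lvm/archive (with the latest copy under /etc/lvm/backup). A fake, heavily trimmed archive file to illustrate the format (filename, UUID and seqno are made up; real files carry the full PV/LV layout):

```shell
# Write a minimal mock archive file; the VG name is the top-level block,
# which is what vgcfgrestore restores by name.
mkdir -p /tmp/lvm_archive
cat > /tmp/lvm_archive/cisvg02_00012.vg <<'EOF'
# Generated by LVM2 -- trimmed illustration, not a real archive
cisvg02 {
    id = "g8eRGt-YloH-imcc-QfSl-k3iS-cGM6-GQhJ8Q"
    seqno = 12
}
EOF
grep -m1 -o '^cisvg02' /tmp/lvm_archive/cisvg02_00012.vg
```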
Fault analysis:
Most likely the system lost part of the disk-header (LVM metadata) information on the shared disk during the storage migration; restoring the metadata is enough to fix it.
PS
=====================================================
pvdisplay still shows the PV, but the VG name is empty and LVM reports a checksum error, so the VG cannot be activated; the disks in this VG are SAN disks. This can be repaired with vgcfgrestore:

# vgcfgrestore -f /etc/lvm/archive/VG_Name_XXXXX.vg <vgname>

If you were to rebuild the PV header with pvcreate instead, you would probably have to run, for each PV involved:

# pvcreate --uuid "<UUID>" --restorefile /etc/lvm/archive/VG_Name_XXXXX.vg <PhysicalVolume>