NODE1 and NODE2 are the two nodes of a cluster.
OS: Red Hat Linux 4.6, 32-bit; the cluster stack is Red Hat Cluster Suite.
The relevant settings in the system's /etc/lvm/lvm.conf are:

volume_list = [ "vg00", "cisvg01", "cisvg04", "cisvg05", "@NODE1" ]
locking_type = 1

cisvg02 and cisvg03 are the two VGs carried by the cluster resource group: they are activated on whichever node the service's file systems are switched to, while the VGs named explicitly in volume_list are activated automatically at boot.
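This is the usual HA-LVM tagging arrangement: a VG whose name is not in volume_list can only be activated on a node whose name matches an ownership tag on the volume. A minimal sketch of how the two sides fit together (the lvchange commands are inferred from the "ownership tag" log messages below, not copied from the cluster scripts):

```
# /etc/lvm/lvm.conf on each node: plain names are auto-activated at boot;
# "@NODE1" permits activation of any volume tagged with this node's name.
volume_list = [ "vg00", "cisvg01", "cisvg04", "cisvg05", "@NODE1" ]

# What the cluster effectively does when it starts the service here:
#   lvchange --addtag NODE1 cisvg02/cisv02l5001
#   lvchange -ay cisvg02/cisv02l5001
# and the reverse (lvchange -an, then --deltag) when it stops the service.
```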
Because of a migration from the old storage array to a new one, the system administrator stopped the cluster service for the check-imaging application. There were no errors and the cluster stopped cleanly:
Jun 2 19:07:11 NODE1 clurgmgrd[21037]: <notice> Stopping service cis1res1
Jun 2 19:07:11 NODE1 clurgmgrd: [21037]: <info> Executing /etc/rc.d/init.d/ccb_cis stop
Jun 2 19:07:11 NODE1 su(pam_unix)[23404]: session opened for user xwj by (uid=0)
Jun 2 19:07:23 NODE1 su(pam_unix)[23404]: session closed for user xwj
Jun 2 19:07:23 NODE1 su(pam_unix)[23493]: session opened for user informix by (uid=0)
Jun 2 19:07:26 NODE1 su(pam_unix)[23493]: session closed for user informix
Jun 2 19:07:26 NODE1 clurgmgrd: [21037]: <info> Removing IPv4 address 52.0.33.11 from bond0
Jun 2 19:07:37 NODE1 clurgmgrd: [21037]: <info> unmounting /home/ap/xwj/Log
Jun 2 19:07:37 NODE1 clurgmgrd: [21037]: <info> unmounting /home/ap/xwj/Bin/CISFiles
Jun 2 19:07:38 NODE1 clurgmgrd: [21037]: <notice> Deactivating cisvg03/cisv03l5002
Jun 2 19:07:38 NODE1 clurgmgrd: [21037]: <notice> Making resilient : lvchange -an cisvg03/cisv03l5002
Jun 2 19:07:38 NODE1 clurgmgrd: [21037]: <notice> Resilient command: lvchange -an cisvg03/cisv03l5002 --config devices{filter=["a|/dev/cciss/c0d0p2|","a|/dev/cciss/c0d0p3|","a|/dev/sdg|","a|/dev/sdh1|","a|/dev/sdh2|","a|/dev/sdh3|","r|.*|"]}
Jun 2 19:07:38 NODE1 clurgmgrd: [21037]: <notice> Removing ownership tag (NODE1) from cisvg03/cisv03l5002
Jun 2 19:07:38 NODE1 clurgmgrd: [21037]: <notice> Deactivating cisvg02/cisv02l5001
Jun 2 19:07:38 NODE1 clurgmgrd: [21037]: <notice> Making resilient : lvchange -an cisvg02/cisv02l5001
Jun 2 19:07:38 NODE1 clurgmgrd: [21037]: <notice> Resilient command: lvchange -an cisvg02/cisv02l5001 --config devices{filter=["a|/dev/cciss/c0d0p2|","a|/dev/cciss/c0d0p3|","a|/dev/sdg|","a|/dev/sdh1|","a|/dev/sdh2|","a|/dev/sdh3|","r|.*|"]}
Jun 2 19:07:39 NODE1 clurgmgrd: [21037]: <notice> Removing ownership tag (NODE1) from cisvg02/cisv02l5001
Jun 2 19:07:39 NODE1 clurgmgrd[21037]: <notice> Service cis1res1 is disabled
Jun 2 19:09:23 NODE1 rgmanager: [25472]: <notice> Shutting down Cluster Service Manager...
Jun 2 19:09:23 NODE1 clurgmgrd[21037]: <notice> Shutting down
Jun 2 19:09:25 NODE1 clurgmgrd[21037]: <notice> Shutdown complete, exiting
Jun 2 19:09:25 NODE1 rgmanager: [25472]: <notice> Cluster Service Manager is stopped.
Jun 2 19:09:26 NODE1 fenced: shutdown succeeded
Jun 2 19:09:27 NODE1 kernel: CMAN: we are leaving the cluster. Removed
Jun 2 19:09:27 NODE1 kernel: WARNING: dlm_emergency_shutdown
Jun 2 19:09:27 NODE1 kernel: WARNING: dlm_emergency_shutdown
Jun 2 19:09:27 NODE1 ccsd[20304]: Cluster manager shutdown. Attemping to reconnect...
Jun 2 19:09:30 NODE1 kernel: NET: Unregistered protocol family 30
Jun 2 19:09:30 NODE1 cman: shutdown succeeded
Jun 2 19:09:31 NODE1 ccsd[20304]: Stopping ccsd, SIGTERM received.
Jun 2 19:09:32 NODE1 ccsd: shutdown succeeded
Next, the storage vendor's engineer rebooted the 480 array's two storage processors, SPA and SPB, one after the other. Corresponding logs:
Jun 2 20:13:23 NODE1 kernel: Error:Mpx:Path Bus 0 Tgt 1 Lun 1 to FCN00114400004 is dead.
Jun 2 20:13:23 NODE1 kernel: Error:Mpx:Killing bus 0 to Clariion FCN00114400004 port SP A2.
Jun 2 20:13:23 NODE1 kernel: Error:Mpx:Path Bus 0 Tgt 1 Lun 0 to FCN00114400004 is dead.
Jun 2 20:15:46 NODE1 kernel: Info:Mpx:Assigned volume 6006016084802E001057FFF3D37FE111 to SPB
Jun 2 20:15:46 NODE1 kernel: SCSI device sdg: 461373440 512-byte hdwr sectors (236223 MB)
Jun 2 20:15:46 NODE1 kernel: SCSI device sdg: drive cache: write through
Jun 2 20:15:46 NODE1 kernel: sdg: unknown partition table
Jun 2 20:23:43 NODE1 kernel: Info:Mpx:Path Bus 0 Tgt 1 Lun 0 to FCN00114400004 is alive.
Jun 2 20:23:43 NODE1 kernel: Info:Mpx:Assigned volume 6006016084802E001057FFF3D37FE111 to SPA
Jun 2 20:23:53 NODE1 kernel: Info:Mpx:Path Bus 0 Tgt 1 Lun 1 to FCN00114400004 is alive.
Jun 2 20:27:35 NODE1 kernel: Error:Mpx:Path Bus 1 Tgt 1 Lun 0 to FCN00114400004 is dead.
Jun 2 20:27:35 NODE1 kernel: Error:Mpx:Killing bus 1 to Clariion FCN00114400004 port SP B2.
Jun 2 20:27:35 NODE1 kernel: Error:Mpx:Path Bus 1 Tgt 1 Lun 1 to FCN00114400004 is dead.
Jun 2 20:35:45 NODE1 kernel: Info:Mpx:Path Bus 1 Tgt 1 Lun 0 to FCN00114400004 is alive.
Jun 2 20:35:45 NODE1 kernel: Info:Mpx:Path Bus 1 Tgt 1 Lun 1 to FCN00114400004 is alive.
Afterwards the system administrator tried to bring the cluster back up, but it failed with errors: the cluster resources would not start. Suspecting the storage migration, we checked the logs. At Jun 2 20:52 the host was rebooted and another attempt to start the cluster also failed (the errors point at the cisvg02 VG):
Jun 2 20:56:05 NODE1 clurgmgrd: [22072]: <err> stop: Could not match /dev/cisvg03/cisv03l5002 with a real device
Jun 2 20:56:05 NODE1 clurgmgrd[22072]: <notice> stop on fs:log returned 2 (invalid argument(s))
Jun 2 20:56:05 NODE1 clurgmgrd: [22072]: <notice> Deactivating cisvg03/cisv03l5002
Jun 2 20:56:05 NODE1 clurgmgrd: [22072]: <notice> Making resilient : lvchange -an cisvg03/cisv03l5002
Jun 2 20:56:05 NODE1 clurgmgrd: [22072]: <notice> Resilient command: lvchange -an cisvg03/cisv03l5002 --config devices{filter=["a|/dev/cciss/c0d0p2|","a|/dev/cciss/c0d0p3|","a|/dev/emcpowerc|","a|/dev/emcpowerd1|","a|/dev/emcpowerd2|","a|/dev/emcpowerd3|","r|.*|"]}
Jun 2 20:56:06 NODE1 clurgmgrd: [22072]: <notice> Removing ownership tag (NODE1) from cisvg03/cisv03l5002
Jun 2 20:56:06 NODE1 clurgmgrd: [22072]: <notice> Deactivating cisvg02/cisv02l5001
Jun 2 20:56:06 NODE1 clurgmgrd: [22072]: <notice> Making resilient : lvchange -an cisvg02/cisv02l5001
Jun 2 20:56:06 NODE1 clurgmgrd: [22072]: <notice> Resilient command: lvchange -an cisvg02/cisv02l5001 --config devices{filter=["a|/dev/cciss/c0d0p2|","a|/dev/cciss/c0d0p3|","a|/dev/emcpowerc|","a|/dev/emcpowerd1|","a|/dev/emcpowerd2|","a|/dev/emcpowerd3|","r|.*|"]}
Jun 2 20:56:06 NODE1 clurgmgrd: [22072]: <err> lvm_exec_resilient failed
Jun 2 20:56:06 NODE1 clurgmgrd: [22072]: <err> lv_activate_resilient stop failed on cisvg02/cisv02l5001
Jun 2 20:56:06 NODE1 clurgmgrd: [22072]: <err> Unable to deactivate cisvg02/cisv02l5001
Jun 2 20:56:06 NODE1 clurgmgrd: [22072]: <err> Failed to stop cisvg02/cisv02l5001
Jun 2 20:56:06 NODE1 clurgmgrd: [22072]: <err> Failed to stop cisvg02/cisv02l5001
Jun 2 20:56:06 NODE1 clurgmgrd[22072]: <notice> stop on lvm:cisvg02 returned 1 (generic error)
The log also contains "/dev/emcpowerd1: Checksum error" messages (/dev/emcpowerd1 is the PV backing cisvg02):
Jun 2 21:35:12 NODE1 vgchange: /dev/sdd1: Checksum error
Jun 2 21:35:37 NODE1 kernel: PCI: HP ProLiant BL460c G1 detected, enabling pci=bfsort.
Jun 2 21:35:12 NODE1 vgchange: Found duplicate PV g8eRGtYloHimccQfSlk3iScGM6GQhJ8Q: using /dev/emcpowerd1 not /dev/sdd1
Jun 2 21:35:37 NODE1 kernel: PCI: Probing PCI hardware (bus 00)
Jun 2 21:35:12 NODE1 vgchange: /dev/emcpowerd1: Checksum error
Jun 2 21:35:37 NODE1 kernel: PCI: Transparent bridge - 0000:00:1e.0
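The "Found duplicate PV" line appears because LVM scans both the native SCSI path (/dev/sdd1) and the PowerPath pseudo-device (/dev/emcpowerd1) layered on top of it. A hypothetical global lvm.conf filter that hides the raw sd* paths so only the pseudo-devices are scanned (the cluster here instead passes a per-command filter via --config, as the "Resilient command" lines show):

```
# Accept local cciss disks and PowerPath pseudo-devices, reject all else:
filter = [ "a|/dev/cciss/.*|", "a|/dev/emcpower.*|", "r|.*|" ]
```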
From these errors we concluded that the shared disk itself had a problem. Inspecting the physical volumes at the OS level showed that the cisvg02 information on the /dev/emcpowerd1 partition had been lost, which is why the cluster could not start:
PV                 VG       Fmt   Attr  PSize     PFree
/dev/cciss/c0d0p2  cisvg01  lvm2  a-     96.50G    1.50G
/dev/cciss/c0d0p3  vg00     lvm2  a-     39.97G    2.97G
/dev/emcpowerc     cisvg04  lvm2  a-    219.88G   20.00G
/dev/emcpowerd1             lvm2  --    499.99G  499.99G   (VG information lost)
/dev/emcpowerd2    cisvg03  lvm2  a-     99.88G    2.38G
/dev/emcpowerd3    cisvg05  lvm2  a-    108.00M   28.00M
We then used dd to dump the first 2 GB of this physical volume to a file; running strings on the dump still showed readable content, which meant the data on the disk was intact:

dd if=/dev/emcpowerd1 of=/tmp/d1 bs=1024k count=2048
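The same sampling check can be sketched against an ordinary file instead of the real PV (all paths here are hypothetical stand-ins for /dev/emcpowerd1):

```shell
# Create a stand-in "PV" containing a recognizable string, sample it with
# dd, then confirm strings can still find readable text in the sample.
printf 'LVM2 metadata for cisvg02 ...' > /tmp/fake_pv
dd if=/tmp/fake_pv of=/tmp/d1_sample bs=1024k count=2 2>/dev/null
strings /tmp/d1_sample | grep -q cisvg02 && echo "data still readable"
```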
Based on this we decided to attempt a repair. Before repairing, a temporary 1 TB LUN was allocated and the whole PV was backed up onto it with dd (the backup file came to about 500 GB):

nohup dd if=/dev/emcpowerd1 of=/mnt/emcpowerd1/d1 bs=1024k &
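Before relying on such an image, it is worth verifying it against the source; a file-based sketch (hypothetical paths standing in for /dev/emcpowerd1 and the scratch LUN):

```shell
# Copy a stand-in source with dd, then compare checksums of source and
# image; identical sums mean the copy is faithful.
printf 'pv payload' > /tmp/src_pv
dd if=/tmp/src_pv of=/tmp/pv_backup bs=1024k 2>/dev/null
[ "$(md5sum < /tmp/src_pv)" = "$(md5sum < /tmp/pv_backup)" ] && echo "backup verified"
```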
With the backup complete, we ran the repair command:

vgcfgrestore cisvg02

(the system's LVM directory keeps a record of the VG's metadata changes, so the configuration could be restored). The cluster then started without errors and the application returned to normal.
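vgcfgrestore works because LVM snapshots every metadata change as a plain-text file under /etc/lvm/archive (with the latest copy under /etc/lvm/backup). A fake, heavily trimmed archive file to illustrate the format (filename, UUID and seqno are made up; real files carry the full PV/LV layout):

```shell
# Write a minimal mock archive file; the VG name is the top-level block,
# which is what vgcfgrestore restores by name.
mkdir -p /tmp/lvm_archive
cat > /tmp/lvm_archive/cisvg02_00012.vg <<'EOF'
# Generated by LVM2 -- trimmed illustration, not a real archive
cisvg02 {
    id = "g8eRGt-YloH-imcc-QfSl-k3iS-cGM6-GQhJ8Q"
    seqno = 12
}
EOF
grep -m1 -o '^cisvg02' /tmp/lvm_archive/cisvg02_00012.vg
```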
Fault analysis:
Most likely the system lost part of the disk-header (LVM metadata) information on the shared disk during the storage migration; restoring the metadata is enough to fix it.
PS
=====================================================
pvdisplay still shows the PV, but the VG name is empty and LVM reports a checksum error, so the VG cannot be activated; the disks in this VG are SAN disks. This can be repaired with vgcfgrestore:

# vgcfgrestore -f /etc/lvm/archive/VG_Name_XXXXX.vg <vgname>

If you were to rebuild the PV header with pvcreate instead, you would probably have to run, for each PV involved:

# pvcreate --uuid "<UUID>" --restorefile /etc/lvm/archive/VG_Name_XXXXX.vg <PhysicalVolume>