
A Case of Lost Shared-Disk VG Metadata on a Linux 4 Cluster

(2012-06-05 14:21:23)
Tags: linux, cluster, lost disk, vg repair, vgcfgrestore, checksum, error, it

Category: LINUX

 

NODE1 and NODE2 are the two nodes of one cluster.

OS version: Red Hat Linux 4.6, 32-bit; the cluster stack is Red Hat's own (Red Hat Cluster Suite).

 

The relevant settings in /etc/lvm/lvm.conf on the system are:

     volume_list = [ "vg00","cisvg01","cisvg04","cisvg05","@NODE1" ]

    locking_type = 1

cisvg02 and cisvg03 are the two VGs placed inside the cluster service (resource group): their filesystems are activated on whichever node the service is running. The VGs named in volume_list are activated automatically at boot.
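This tag-based (HA-LVM) activation can be inspected and exercised from the shell. A minimal sketch, assuming the node's ownership tag is NODE1 and using cisvg02 as the service VG; these are standard LVM2 commands, not commands taken from the incident logs:

```shell
# VGs listed by name in volume_list (vg00, cisvg01, ...) auto-activate at boot.
# Service VGs such as cisvg02 carry a hostname tag instead; they match the
# "@NODE1" entry, so only the node that owns the service may activate them.

# Show each VG with its tags (a VG held by a node carries that node's tag):
vgs -o vg_name,vg_tags

# What rgmanager effectively does on service start (sketch):
vgchange --addtag NODE1 cisvg02      # claim ownership for this node
lvchange -ay cisvg02/cisv02l5001     # now matches @NODE1, activation allowed

# ...and on service stop:
lvchange -an cisvg02/cisv02l5001     # deactivate the LV
vgchange --deltag NODE1 cisvg02      # release the ownership tag
```

This is why the syslog excerpts below show "Removing ownership tag (NODE1)" messages during a clean service stop.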

 

As part of a cutover from the old storage array to the new one, the system administrator stopped the cluster service for the check-imaging application. No errors were reported and the cluster shut down cleanly.

Jun  2 19:07:11 NODE1 clurgmgrd[21037]: <notice> Stopping service cis1res1

Jun  2 19:07:11 NODE1 clurgmgrd: [21037]: <info> Executing /etc/rc.d/init.d/ccb_cis stop

Jun  2 19:07:11 NODE1 su(pam_unix)[23404]: session opened for user xwj by (uid=0)

Jun  2 19:07:23 NODE1 su(pam_unix)[23404]: session closed for user xwj

Jun  2 19:07:23 NODE1 su(pam_unix)[23493]: session opened for user informix by (uid=0)

Jun  2 19:07:26 NODE1 su(pam_unix)[23493]: session closed for user informix

Jun  2 19:07:26 NODE1 clurgmgrd: [21037]: <info> Removing IPv4 address 52.0.33.11 from bond0

Jun  2 19:07:37 NODE1 clurgmgrd: [21037]: <info> unmounting /home/ap/xwj/Log

Jun  2 19:07:37 NODE1 clurgmgrd: [21037]: <info> unmounting /home/ap/xwj/Bin/CISFiles

Jun  2 19:07:38 NODE1 clurgmgrd: [21037]: <notice> Deactivating cisvg03/cisv03l5002

Jun  2 19:07:38 NODE1 clurgmgrd: [21037]: <notice> Making resilient : lvchange -an cisvg03/cisv03l5002

Jun  2 19:07:38 NODE1 clurgmgrd: [21037]: <notice> Resilient command: lvchange -an cisvg03/cisv03l5002 --config devices{filter=["a|/dev/cciss/c0d0p2|","a|/dev/cciss/c0d0p3|","a|/dev/sdg|","a|/dev/sdh1|","a|/dev/sdh2|","a|/dev/sdh3|","r|.*|"]}

Jun  2 19:07:38 NODE1 clurgmgrd: [21037]: <notice> Removing ownership tag (NODE1) from cisvg03/cisv03l5002

Jun  2 19:07:38 NODE1 clurgmgrd: [21037]: <notice> Deactivating cisvg02/cisv02l5001

Jun  2 19:07:38 NODE1 clurgmgrd: [21037]: <notice> Making resilient : lvchange -an cisvg02/cisv02l5001

Jun  2 19:07:38 NODE1 clurgmgrd: [21037]: <notice> Resilient command: lvchange -an cisvg02/cisv02l5001 --config devices{filter=["a|/dev/cciss/c0d0p2|","a|/dev/cciss/c0d0p3|","a|/dev/sdg|","a|/dev/sdh1|","a|/dev/sdh2|","a|/dev/sdh3|","r|.*|"]}

Jun  2 19:07:39 NODE1 clurgmgrd: [21037]: <notice> Removing ownership tag (NODE1) from cisvg02/cisv02l5001

Jun  2 19:07:39 NODE1 clurgmgrd[21037]: <notice> Service cis1res1 is disabled

Jun  2 19:09:23 NODE1 rgmanager: [25472]: <notice> Shutting down Cluster Service Manager...

Jun  2 19:09:23 NODE1 clurgmgrd[21037]: <notice> Shutting down

Jun  2 19:09:25 NODE1 clurgmgrd[21037]: <notice> Shutdown complete, exiting

Jun  2 19:09:25 NODE1 rgmanager: [25472]: <notice> Cluster Service Manager is stopped.

Jun  2 19:09:26 NODE1 fenced: shutdown succeeded

Jun  2 19:09:27 NODE1 kernel: CMAN: we are leaving the cluster. Removed

Jun  2 19:09:27 NODE1 kernel: WARNING: dlm_emergency_shutdown

Jun  2 19:09:27 NODE1 kernel: WARNING: dlm_emergency_shutdown

Jun  2 19:09:27 NODE1 ccsd[20304]: Cluster manager shutdown.  Attemping to reconnect...

Jun  2 19:09:30 NODE1 kernel: NET: Unregistered protocol family 30

Jun  2 19:09:30 NODE1 cman: shutdown succeeded

Jun  2 19:09:31 NODE1 ccsd[20304]: Stopping ccsd, SIGTERM received.

Jun  2 19:09:32 NODE1 ccsd: shutdown succeeded

 

Next, the storage vendor's engineer rebooted the 480 array's controllers SP A and SP B, one after the other. The corresponding log:

Jun  2 20:13:23 NODE1 kernel: Error:Mpx:Path Bus 0 Tgt 1 Lun 1 to FCN00114400004 is dead.

Jun  2 20:13:23 NODE1 kernel: Error:Mpx:Killing bus 0 to Clariion   FCN00114400004 port SP A2.

Jun  2 20:13:23 NODE1 kernel: Error:Mpx:Path Bus 0 Tgt 1 Lun 0 to FCN00114400004 is dead.

Jun  2 20:15:46 NODE1 kernel: Info:Mpx:Assigned volume 6006016084802E001057FFF3D37FE111 to SPB

Jun  2 20:15:46 NODE1 kernel: SCSI device sdg: 461373440 512-byte hdwr sectors (236223 MB)

Jun  2 20:15:46 NODE1 kernel: SCSI device sdg: drive cache: write through

Jun  2 20:15:46 NODE1 kernel:  sdg: unknown partition table

Jun  2 20:23:43 NODE1 kernel: Info:Mpx:Path Bus 0 Tgt 1 Lun 0 to FCN00114400004 is alive.

Jun  2 20:23:43 NODE1 kernel: Info:Mpx:Assigned volume 6006016084802E001057FFF3D37FE111 to SPA

Jun  2 20:23:53 NODE1 kernel: Info:Mpx:Path Bus 0 Tgt 1 Lun 1 to FCN00114400004 is alive.

Jun  2 20:27:35 NODE1 kernel: Error:Mpx:Path Bus 1 Tgt 1 Lun 0 to FCN00114400004 is dead.

Jun  2 20:27:35 NODE1 kernel: Error:Mpx:Killing bus 1 to Clariion   FCN00114400004 port SP B2.

Jun  2 20:27:35 NODE1 kernel: Error:Mpx:Path Bus 1 Tgt 1 Lun 1 to FCN00114400004 is dead.

Jun  2 20:35:45 NODE1 kernel: Info:Mpx:Path Bus 1 Tgt 1 Lun 0 to FCN00114400004 is alive.

Jun  2 20:35:45 NODE1 kernel: Info:Mpx:Path Bus 1 Tgt 1 Lun 1 to FCN00114400004 is alive.

 

Afterwards, the system administrator tried to bring the cluster back up; startup reported errors and the cluster resources failed to start.

Suspecting a link to the cutover, we examined the logs.

At Jun 2 20:52 the host was rebooted and another cluster start was attempted; it failed again (the errors point to the cisvg02 VG):

Jun  2 20:56:05 NODE1 clurgmgrd: [22072]: <err> stop: Could not match /dev/cisvg03/cisv03l5002 with a real device

Jun  2 20:56:05 NODE1 clurgmgrd[22072]: <notice> stop on fs:log returned 2 (invalid argument(s))

Jun  2 20:56:05 NODE1 clurgmgrd: [22072]: <notice> Deactivating cisvg03/cisv03l5002

Jun  2 20:56:05 NODE1 clurgmgrd: [22072]: <notice> Making resilient : lvchange -an cisvg03/cisv03l5002

Jun  2 20:56:05 NODE1 clurgmgrd: [22072]: <notice> Resilient command: lvchange -an cisvg03/cisv03l5002 --config devices{filter=["a|/dev/cciss/c0d0p2|","a|/dev/cciss/c0d0p3|","a|/dev/emcpowerc|","a|/dev/emcpowerd1|","a|/dev/emcpowerd2|","a|/dev/emcpowerd3|","r|.*|"]}

Jun  2 20:56:06 NODE1 clurgmgrd: [22072]: <notice> Removing ownership tag (NODE1) from cisvg03/cisv03l5002

Jun  2 20:56:06 NODE1 clurgmgrd: [22072]: <notice> Deactivating cisvg02/cisv02l5001

Jun  2 20:56:06 NODE1 clurgmgrd: [22072]: <notice> Making resilient : lvchange -an cisvg02/cisv02l5001

Jun  2 20:56:06 NODE1 clurgmgrd: [22072]: <notice> Resilient command: lvchange -an cisvg02/cisv02l5001 --config devices{filter=["a|/dev/cciss/c0d0p2|","a|/dev/cciss/c0d0p3|","a|/dev/emcpowerc|","a|/dev/emcpowerd1|","a|/dev/emcpowerd2|","a|/dev/emcpowerd3|","r|.*|"]}

Jun  2 20:56:06 NODE1 clurgmgrd: [22072]: <err> lvm_exec_resilient failed

Jun  2 20:56:06 NODE1 clurgmgrd: [22072]: <err> lv_activate_resilient stop failed on cisvg02/cisv02l5001

Jun  2 20:56:06 NODE1 clurgmgrd: [22072]: <err> Unable to deactivate cisvg02/cisv02l5001

Jun  2 20:56:06 NODE1 clurgmgrd: [22072]: <err> Failed to stop cisvg02/cisv02l5001

Jun  2 20:56:06 NODE1 clurgmgrd: [22072]: <err> Failed to stop cisvg02/cisv02l5001

Jun  2 20:56:06 NODE1 clurgmgrd[22072]: <notice> stop on lvm:cisvg02 returned 1 (generic error)

 

The log contains a "/dev/emcpowerd1: Checksum error" message (/dev/emcpowerd1 is the PV backing cisvg02):

Jun  2 21:35:12 NODE1 vgchange:   /dev/sdd1: Checksum error

Jun  2 21:35:12 NODE1 vgchange:   Found duplicate PV g8eRGtYloHimccQfSlk3iScGM6GQhJ8Q: using /dev/emcpowerd1 not /dev/sdd1

Jun  2 21:35:12 NODE1 vgchange:   /dev/emcpowerd1: Checksum error

From these errors we concluded that the shared disk itself had a problem.

 

We then examined the physical volumes from the OS and found that the cisvg02 metadata on the /dev/emcpowerd1 partition had been lost, which is why the cluster could not start:

  PV                VG      Fmt  Attr PSize   PFree 

  /dev/cciss/c0d0p2 cisvg01 lvm2 a-    96.50G   1.50G

  /dev/cciss/c0d0p3 vg00    lvm2 a-    39.97G   2.97G

  /dev/emcpowerc    cisvg04 lvm2 a-   219.88G  20.00G

  /dev/emcpowerd1           lvm2 --   499.99G 499.99G   (VG metadata lost)

  /dev/emcpowerd2   cisvg03 lvm2 a-    99.88G   2.38G

  /dev/emcpowerd3   cisvg05 lvm2 a-   108.00M  28.00M
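The listing above matches the default columns of `pvs`. The same check can be run explicitly (a sketch; any PV whose VG column is empty has lost its LVM metadata):

```shell
# List every PV with the VG it belongs to; an empty VG column means the
# LVM metadata for that PV is missing or unreadable.
pvs -o pv_name,vg_name,pv_fmt,pv_attr,pv_size,pv_free

# More detail on the suspect partition (UUID, allocatable state, etc.):
pvdisplay /dev/emcpowerd1
```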

 

 

We then used dd to export a 2 GB sample from the physical disk; reading it with strings still produced meaningful text, showing that the data on the disk was intact:

dd if=/dev/emcpowerd1 of=/tmp/d1 bs=1024k count=2048
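The "still has data" check can be done by scanning the sample for printable text; a sketch, with the grep pattern here being only an illustration:

```shell
# Pull printable strings out of the 2 GB sample. Seeing filesystem or
# application text means the data blocks survived; only the LVM
# header/metadata area was damaged.
strings /tmp/d1 | less

# Or look specifically for leftover LVM metadata text naming the VG
# (-a forces grep to treat the input as text):
strings /tmp/d1 | grep -a cisvg02 | head
```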

 

On that basis we decided to proceed with the repair. Before repairing, we temporarily carved out a 1 TB LUN and backed the whole disk up with dd (the backup file came to roughly 500 GB):

nohup dd if=/dev/emcpowerd1 of=/mnt/emcpowerd1/d1 bs=1024k &

 

With the backup complete, we ran the repair command:

vgcfgrestore cisvg02    (the system's /etc/lvm directory keeps a history of the disk's metadata changes, so the VG could be restored)
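A fuller version of the restore step, as a hedged sketch; the archive location is LVM's default, and the exact archive chosen would come from `--list` output on the real system:

```shell
# LVM automatically archives VG metadata (under /etc/lvm/archive) before
# each metadata-changing operation. List what is available for cisvg02:
vgcfgrestore --list cisvg02

# Restore the most recent archived copy (add -f <archive-file> to pick
# a specific older one):
vgcfgrestore cisvg02

# Re-scan and activate; with HA-LVM tag filtering, activation may also
# require the node's ownership tag as shown earlier:
vgscan
vgchange -ay cisvg02
```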

 

The cluster then started without errors and the application returned to normal.

 

Fault analysis: most likely the system lost part of the disk-header (LVM metadata) information on the shared disk during the storage cutover; restoring the metadata was all that was needed.

 

 

 

PS =====================================================

Symptom recap: pvdisplay can see the disk's PV, but its VG Name is empty and a checksum error is reported, so the VG cannot be activated; the disks in the VG are SAN storage disks.

This can be repaired with vgcfgrestore:

# vgcfgrestore -f /etc/lvm/archive/VG_Name_XXXXX.vg <vgname>

If instead you rebuild the PV header with pvcreate, you would presumably have to run the following for each PV involved:

# pvcreate --uuid "<UUID>" --restorefile /etc/lvm/archive/VG_Name_XXXXX.vg <PhysicalVolume>
  
