corrupt vgda - recovery impossible?

mr_t_has_a_van

2006-04-10 20:17:45 UTC

Hello,

I've thoroughly read both redbooks on this topic and googled and been
unable to fix this. I wonder now if calling IBM will even help? We
don't currently have software support and I'm not certain if we're
allowed (university) to get per hour support -- I think we have to
plunk down 1500 or so, which is big bucks for us. Here's what's going
on....

There was a hardware failure, and two of the four "disks" in the volume
group were completely trashed. The vg was reduced by two disks. Then
the ODM info was totally removed. No backups of any kind. At this point
I took over. The volume group was of course not in the ODM. I did
redefinevg -d hdisk3 vgdvd, etc. No vary on worked. (I'm condensing
enormously.) Since quorum value is "3", and only two disks remain, I
tried to "chvg -Q n vgdvd" and rebooting. Still, here's the problem:

# varyonvg -f vgdvd
PV Status: hdisk3 000c444d54876714 PVNOTINVG
hdisk4 000c444d5487695e PVNOTINVG
0516-013 varyonvg: The volume group cannot be varied on because
there are no good copies of the descriptor area.

Examining the vgdas with od, I can see the vgdvd volume group's id in
there, and things seem to be in the proper place, so I don't know why
it won't vary on. Now I wonder if I call IBM I'll get someone who's
basically going from the redbook, or might I get some hardcore person
who'll hack the VGDA by hand? My only clue is this from lvm.h:
/*
 * PV status values which may be returned from lvm_varyonvg in the
 * varyonvg output structure if a quorum is not obtained. (Error return
 * of LVM_NOQUORUM or LVM_NOVGDAS).
 */
#define LVM_PVNOTFND  10 /* physical volume could not be opened */
                         /* or its IPL record or LVM record */
                         /* could not be read */
#define LVM_PVNOTINVG 11 /* the PV's LVM record indicates it is */
                         /* not a member of the specified VG */
#define LVM_PVINVG    12 /* the PV's LVM record indicates it is */
                         /* a member of the specified VG */

But I can see the vgid in the vgda with od.

By way of providing all relevant info, the only lvm commands that will
give me meaningful information without a vary on are:
# lsvg -o -n hdisk3
VOLUME GROUP:   vgdvd           VG IDENTIFIER:  000c444dcf17e567
VG STATE:       inactive        PP SIZE:        1024 megabyte(s)
VG PERMISSION:  read/write      TOTAL PPs:      3210 (3287040 megabytes)
MAX LVs:        256             FREE PPs:       0 (0 megabytes)
LVs:            5               USED PPs:       3210 (3287040 megabytes)
OPEN LVs:       0               QUORUM:         1
TOTAL PVs:      4               VG DESCRIPTORS: 4
STALE PVs:      0               STALE PPs:      0
ACTIVE PVs:     0               AUTO ON:        yes
MAX PPs per PV: 1016            MAX PVs:        32

# lqueryvg -p hdisk3 -tA
Max LVs: 256
PP Size: 30
Free PPs: 0
LV count: 5
PV count: 4
Total VGDAs: 4
Conc Allowed 0
MAX PPs per 1016
MAX PVs: 32
Conc Autovar 0
Varied on Co 0
Logical: 000c444dcf17e567.1 data1 1
000c444dcf17e567.2 loglv02 1
000c444dcf17e567.3 data2 1
000c444dcf17e567.4 data3 1
000c444dcf17e567.5 data4 1
Physical: 00093560987bcfe0 1 0
00093560987bd3db 1 0
000c444d6e1c94f1 1 0
000c444da0462df9 1 0
Total PPs: 3210

# lquerypv -h /dev/hdisk3 80 10
00000080 000C444D 54876714 00000000 00000000 |..DMT.g.........|

# lspv hdisk3
0516-010 lspv: Volume group must be varied on; use varyonvg command.
PHYSICAL VOLUME: hdisk3 VOLUME GROUP: vgdvd
PV IDENTIFIER: 000c444d54876714 VG IDENTIFIER 000c444dcf17e567
PV STATE: ???????
STALE PARTITIONS: ??????? ALLOCATABLE: ???????
PP SIZE: ??????? LOGICAL VOLUMES: ???????
TOTAL PPs: ??????? VG DESCRIPTORS: ???????
FREE PPs: ???????
USED PPs: ???????
FREE DISTRIBUTION: ???????
USED DISTRIBUTION: ???????

Dave

2006-04-10 20:36:39 UTC

Have you tried importvg -y vgdvd hdisk3, then varyonvg vgdvd, then
synclvodm vgdvd?

Markus Baertschi

2006-04-10 22:43:44 UTC

mr_t_has_a_van wrote:
...

If importing the vg as already suggested does not work, then it will be
very difficult to get at any data on the disks. How do you know that the
data you are looking for is on the two remaining good disks and not on
the two broken ones?

I would not bet too much on IBM Support to hack your data back. Support
people are not trained for data recovery. You'd need a data recovery
service for that.

Start by defining what kind of data was on these disks that you want to
recover. Files? Binary? Database? If you can locate some of it by
reading the raw disk (dd | strings | grep) then you might get lucky,
but it will be a lot of work.

Markus

Dmitri Pasyutin

2006-04-11 07:51:15 UTC

Post by mr_t_has_a_van
There was a hardware failure, and two of the four "disks" in
the volume group were completely trashed. The vg was reduced by
two disks. Then the ODM info was totally removed. No backups of
any kind. At this point I took over. The volume group was of
course not in the ODM. I did redefinevg -d hdisk3 vgdvd, etc.
No vary on worked. (I'm condensing enormously.) Since quorum
value is "3", and only two disks remain, I tried to "chvg -Q n
vgdvd" and rebooting. Still, here's the problem:

# varyonvg -f vgdvd
PV Status: hdisk3 000c444d54876714 PVNOTINVG
hdisk4 000c444d5487695e PVNOTINVG
0516-013 varyonvg: The volume group cannot be varied on because
there are no good copies of the descriptor area.

[snip]

Post by mr_t_has_a_van
# lsvg -o -n hdisk3
VOLUME GROUP:   vgdvd           VG IDENTIFIER:  000c444dcf17e567
VG STATE:       inactive        PP SIZE:        1024 megabyte(s)
VG PERMISSION:  read/write      TOTAL PPs:      3210 (3287040 megabytes)
MAX LVs:        256             FREE PPs:       0 (0 megabytes)
LVs:            5               USED PPs:       3210 (3287040 megabytes)
OPEN LVs:       0               QUORUM:         1
TOTAL PVs:      4               VG DESCRIPTORS: 4
STALE PVs:      0               STALE PPs:      0
ACTIVE PVs:     0               AUTO ON:        yes
MAX PPs per PV: 1016            MAX PVs:        32

# lqueryvg -p hdisk3 -tA
Max LVs: 256
PP Size: 30
Free PPs: 0
LV count: 5
PV count: 4
Total VGDAs: 4
Conc Allowed 0
MAX PPs per 1016
MAX PVs: 32
Conc Autovar 0
Varied on Co 0
Logical: 000c444dcf17e567.1 data1 1
000c444dcf17e567.2 loglv02 1
000c444dcf17e567.3 data2 1
000c444dcf17e567.4 data3 1
000c444dcf17e567.5 data4 1
Physical: 00093560987bcfe0 1 0
00093560987bd3db 1 0
000c444d6e1c94f1 1 0
000c444da0462df9 1 0
Total PPs: 3210

I lost an entire VG in a similar way last week; this is how I
recovered it. I called IBM support, and their answer was to restore
from tape backup (which I didn't have), but they did point me to
the readvgda command (see below). The LVM redbook vol. 2 did the
rest.

As always, *use this at your own risk!!*

1. Use the (undocumented) readvgda command to get the detailed
VGDA from one of the disks, e.g. "readvgda /dev/hdisk3". I'd
advise you to do the same on a disk that *didn't* fail and
compare (a disk that didn't fail might be a more reliable
source). Check the list of PVIDs and make sure they are all
present and accessible by the system (no hardware problems).
Save the readvgda output in a file; you will need it if you make
a mistake later and want to start over.

2. Among the readvgda output is the layout of each LV on the PVs,
from which you can construct LV map files for the mklv command.

3. You will also need a good copy of the /etc/filesystems file,
unless you remember the original LV mount points and filesystem
types.

4. When you have all the LV map files, export the volume group
and re-create it on the original PVs, with the original PP size:
mkvg -f -y vgdvd -s 1024 hdisk3 hdisk4 ...

5. Re-create each LV using the map files. I assume here that your
filesystems were jfs (change accordingly if you had jfs2):
mklv -y loglv02 -m <loglv02_mapfile> -t jfslog vgdvd <numlp>
mklv -y data1 -m <data1_mapfile> -t jfs vgdvd <numlp>
...

6. Restore /etc/filesystems or add the missing stanzas manually.
*Do not mount the filesystems yet.*

7. For each filesystem LV, update the LVCB with the log and label
information:
chfs -a log=/dev/loglv02 <mountpoint>
chlv -L <mountpoint> data1
...

8. Run a fsck on each filesystem. This is the critical step - if
the command fails because it can't find the superblock, then the
LVs were not re-created correctly, stop and check your map
files. If the jfslog LV was not re-created correctly, fsck will
not be able to replay the log and you will probably lose some
data. Consider stopping and double-checking the map files before
answering yes to the fsck "fix" questions. Also, if filesystems
were active at the time of the crash, fsck might find
unrecoverable errors on some inodes or blocks; your only option
then is to fix these errors at the risk of losing some inodes
(the price for not having a tape backup). If errors were fixed,
re-run fsck until there are no errors.

9. When fsck runs without errors on each FS, mount the
filesystems. If errors were fixed during fsck, check the
lost+found directories; they might contain missing files or
directories you can salvage.
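
For step 2, a rough sketch of building a map file. It assumes you have already written down, by hand from the readvgda output, one "PVname PPnum" pair per logical partition in LP order; that input layout is hypothetical and is not readvgda's actual format. The output lines use the PVname:PPnum1[-PPnum2] form that mklv map files take, and the function name is made up.

```shell
# Turn "PVname PPnum" lines (one per logical partition, in LP order)
# into map-file lines of the form PVname:PPnum1[-PPnum2].
# Consecutive PPs on the same PV are collapsed into one range.
make_lv_map() {
    awk '
    function emit() {
        if (pv == "") return
        if (first == last) { print pv ":" first } else { print pv ":" first "-" last }
    }
    {
        if ($1 == pv && $2 == last + 1) { last = $2; next }
        emit()
        pv = $1; first = $2; last = $2
    }
    END { emit() }
    '
}

# Example: three consecutive PPs on hdisk3, then one on hdisk4.
printf 'hdisk3 101\nhdisk3 102\nhdisk3 103\nhdisk4 7\n' | make_lv_map
```

Redirect the output to a file per LV and pass it to "mklv -m" as in step 5. The order of the lines matters, since it decides which physical partition backs each logical partition, so check each map against the saved readvgda output before creating anything.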

--
Dmitri

mr_t_has_a_van

2006-04-12 19:34:22 UTC

Sounds good, but I can't locate "readvgda". If it's a script, would you
post it? If it's a binary, would you email a copy or post for download?
My hardware's working, so I could try it.
# oslevel
4.3.3.0

Tao Chen

2006-04-12 20:23:51 UTC

Post by mr_t_has_a_van
Sounds good, but I can't locate "readvgda". If it's a script, would you
post it? If it's a binary, would you email a copy or post for download?
My hardware's working, so I could try it.
# oslevel
4.3.3.0

Unfortunately 4.3.3 is only supported under special contract with best
effort now.
Also since you don't have a regular AIX contract, it would be "consult
line" even for 5.x anyway.

In theory, as long as you have at least one good copy of VGDA, the
whole VG can be recovered. Folks in LVM support (L2/L3) do this from
time to time.

I say give IBM a call to see what your options are - if the data is
really important and you don't have backup (which is unfortunate).

Good luck.

Tao

Dmitri Pasyutin

2006-04-14 07:57:05 UTC

Post by mr_t_has_a_van
# oslevel
4.3.3.0

Looks like readvgda is available as of AIX 5.1. Do you have a
more recent box to copy it from?

--
Dmitri

mr_t_has_a_van

2006-04-18 16:17:48 UTC

Not yet, but I'll be upgrading the box to 5.1 later this week. I assume
that won't harm the vgdas further, and I'll gain use of the readvgda
command.

2006-04-11 09:46:15 UTC

Use the "errpt" command to check what kind of problem it is. Use
"errpt -a" to get details on the hardware problems. While varying
on the vg, check the LED codes shown and look up their meaning in the
message guide; you can also find it easily on the net.

If the data itself has been completely lost, it cannot be recovered
without a backup. But if the problem is limited to the
superblock and does not affect the data blocks, it can be fixed by
copying the superblock from an alternate location.

So, first do 'importvg, varyonvg, synclvodm' and bring the volume group
online. Then run "fsck" on all the filesystems and check whether the
problem is related to the superblock. Also check the detailed report of
the hardware problems in the "errpt -a" output. If the
problem is limited to the superblock, it can be copied from an
alternate location with the command:
dd count=1 bs=4k skip=31 seek=1 if=/dev/hdn of=/dev/hdn
where n is the number of the filesystem.
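
To see what that dd command does before pointing it at a real logical volume, it can be tried on a scratch image file. The sketch below is only an illustration: the function name and image path are made up, and conv=notrunc is an addition needed because the target here is a regular file and not a device.

```shell
# Copy the 4 KB block at index 31 over the block at index 1 of the same
# target, as in the command quoted above.
# conv=notrunc is an addition for practising on a regular image file,
# where dd would otherwise truncate the output; it is not in the post.
copy_backup_superblock() {
    dd if="$1" of="$1" bs=4k skip=31 seek=1 count=1 conv=notrunc 2>/dev/null
}

# Practice run on a scratch image (hypothetical path), never a live LV:
#   dd if=/dev/zero of=/tmp/scratch.img bs=4k count=32
#   copy_backup_superblock /tmp/scratch.img
```

The command overwrites block 1 in place, so it is worth saving a copy of the original block first if there is any doubt about which filesystem type is on the LV.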