Discussion:
[linux-lvm] LVM RAID repair trouble
Giuliano Procida
2016-09-26 07:54:02 UTC
Permalink
Original post possibly lost in moderation, this is a repost with more
details and updated information.

Linux hilfy 4.6.0-1-amd64 #1 SMP Debian 4.6.4-1a~test (2016-09-16)
x86_64 GNU/Linux
(essentially stock Debian kernel)

LVM version: 2.02.164(2) (2016-08-15)
Library version: 1.02.133 (2016-08-15)
Driver version: 4.34.0

I have a 5 disk setup. There is one large PV per disk and a single VG.
LVs are mostly raid5 and raid6.

I needed to replace one disk so I researched and tested in a VM
(debian-8.6.0-amd64-netinst.iso) first. Web searching LVM RAID repair
was not too encouraging.

The following recipe worked in the VM:

Shutdown.
Remove one virtual disk.
Add new empty virtual disk.
Boot.
partition
pvcreate, vgextend
3x lvconvert --repair LV
vgreduce --removemissing

I tested this with raid1, raid5 and raid6 LVs spanning all the PVs. It
worked just fine.

I came to do the same with a physical disk change.
Take backups.
Shutdown
Remove one failing disk.
Add new empty disk.
Boot.
partition, pcreate, vgextend

I first repaired the root volume and this worked just fine.

Then I tried to repair the var volume.
I very unfortunately did not keep the error messages (but see below).
They included a complaint about "new device injected into existing
array without 'rebuild' parameter specified" and ended with a failure
to lock something. Based on later research the output was probably:

WARNING: Failed to connect to lvmetad. Falling back to device scanning.
Couldn't find device with uuid [UUID].
Attempt to replace failed RAID images (requires full device resync)?
[y/n]: Accepted input: [y]
device-mapper: reload ioctl on ([major:minor]) failed: Invalid argument
Failed to lock logical volume vg0/[LV].
Failed to replace faulty devices in vg0/[LV].

Various tests showed that /var was still functioning fine (and there
appeared to have been a resync, possibly after I re-ran the command
or touched the var FS).

It was a bit disturbing to see duplicate symlinks (-> ../dm-$n) in
/dev/mapper, but each duplicate was either a _missing or [inactive]
device. The duplicates are no longer there (at least two reboots after
repairing var and one after repairing home). I thought there might be
some odd state in-kernel after the first repair; this is what prompted
me to try repairing the next LV after a reboot.

The net effect is that I now have RAID volumes in an odd state with
extracted / missing images (I think just a cosmetic issue) and which
rebuild on every boot (or rather re-activation and use).
The volumes take a long time to resync, especially concurrently!

I've verified parity (and syndrome) manually - the rebuilds are good
(on to the replacement disk) and no data appear to have been lost or
corrupted. The problem appears to be with the metadata.

I've since tried to repair a fourth RAID volume and collected detailed
diagnostics.

The obvious symptoms are visible in lvs output and in /dev/mapper below.

I've included much relevant information here:
https://drive.google.com/drive/folders/0B8dHrWSoVcaDN0RHTjFjQjdQNW8

This includes kernel logs and full -vvvdddddd output and error of some
commands, plus kernel logs as well as rmeta subLVs and a few utilities
I wrote. Please let me know if any more information would be useful or
whether this all belongs in a bug tracker somewhere.

Thanks for any assistance. I really hope I can get this working properly again.

--gp

Earlier excerpt from lvs -a (I've since lvremove'd the extracted LVs
and tried repairing photos):

root vg0 rwi-aor--- 25.00g
100.00
[root_rimage_0] vg0 iwi-aor--- 6.25g
[root_rimage_1] vg0 iwi-aor--- 6.25g
[root_rimage_2] vg0 iwi-aor--- 6.25g
[root_rimage_3] vg0 iwi-aor--- 6.25g
[root_rimage_4] vg0 iwi-aor--- 6.25g
[root_rmeta_0] vg0 ewi-aor--- 4.00m
[root_rmeta_1] vg0 ewi-aor--- 4.00m
[root_rmeta_2] vg0 ewi-aor--- 4.00m
[root_rmeta_3] vg0 ewi-aor--- 4.00m
[root_rmeta_4] vg0 ewi-aor--- 4.00m
var vg0 rwi-a-r--- 25.00g
100.00
var_rimage_0__extracted vg0 -wi-----p- 6.25g
[var_rimage_1] vg0 iwi-aor--- 6.25g
[var_rimage_2] vg0 iwi-aor--- 6.25g
[var_rimage_3] vg0 iwi-aor--- 6.25g
[var_rimage_4] vg0 iwi-aor--- 6.25g
[var_rimage_5] vg0 iwi-aor--- 6.25g
var_rmeta_0__extracted vg0 -wi-----p- 4.00m
[var_rmeta_1] vg0 ewi-aor--- 4.00m
[var_rmeta_2] vg0 ewi-aor--- 4.00m
[var_rmeta_3] vg0 ewi-aor--- 4.00m
[var_rmeta_4] vg0 ewi-aor--- 4.00m
[var_rmeta_5] vg0 ewi-aor--- 4.00m
home vg0 rwi-aor--- 400.01g
100.00
home_rimage_0__extracted vg0 -wi-----p- 133.34g
[home_rimage_1] vg0 iwi-aor--- 133.34g
[home_rimage_2] vg0 iwi-aor--- 133.34g
[home_rimage_3] vg0 iwi-aor--- 133.34g
[home_rimage_4] vg0 iwi-aor--- 133.34g
[home_rimage_5] vg0 iwi-aor--- 133.34g
home_rmeta_0__extracted vg0 -wi-----p- 4.00m
[home_rmeta_1] vg0 ewi-aor--- 4.00m
[home_rmeta_2] vg0 ewi-aor--- 4.00m
[home_rmeta_3] vg0 ewi-aor--- 4.00m
[home_rmeta_4] vg0 ewi-aor--- 4.00m
[home_rmeta_5] vg0 ewi-aor--- 4.00m
photos vg0 rwi-aor-p- 600.00g
100.00
[photos_rimage_0] vg0 iwi-a-r-p- 200.00g
[photos_rimage_1] vg0 iwi-aor--- 200.00g
[photos_rimage_2] vg0 iwi-aor--- 200.00g
[photos_rimage_3] vg0 iwi-aor--- 200.00g
[photos_rimage_4] vg0 iwi-aor--- 200.00g
[photos_rmeta_0] vg0 ewi-a-r-p- 4.00m
[photos_rmeta_1] vg0 ewi-aor--- 4.00m
[photos_rmeta_2] vg0 ewi-aor--- 4.00m
[photos_rmeta_3] vg0 ewi-aor--- 4.00m
[photos_rmeta_4] vg0 ewi-aor--- 4.00m

Earlier excerpt from ls -l /dev/mapper:

lrwxrwxrwx 1 root root 8 Sep 24 01:27 vg0-root -> ../dm-10
lrwxrwxrwx 1 root root 7 Sep 24 01:27 vg0-root_rimage_0 -> ../dm-1
lrwxrwxrwx 1 root root 7 Sep 24 01:27 vg0-root_rimage_1 -> ../dm-3
lrwxrwxrwx 1 root root 7 Sep 24 01:27 vg0-root_rimage_2 -> ../dm-5
lrwxrwxrwx 1 root root 7 Sep 24 01:27 vg0-root_rimage_3 -> ../dm-7
lrwxrwxrwx 1 root root 7 Sep 24 01:27 vg0-root_rimage_4 -> ../dm-9
lrwxrwxrwx 1 root root 7 Sep 24 01:27 vg0-root_rmeta_0 -> ../dm-0
lrwxrwxrwx 1 root root 7 Sep 24 01:27 vg0-root_rmeta_1 -> ../dm-2
lrwxrwxrwx 1 root root 7 Sep 24 01:27 vg0-root_rmeta_2 -> ../dm-4
lrwxrwxrwx 1 root root 7 Sep 24 01:27 vg0-root_rmeta_3 -> ../dm-6
lrwxrwxrwx 1 root root 7 Sep 24 01:27 vg0-root_rmeta_4 -> ../dm-8

lrwxrwxrwx 1 root root 8 Sep 24 02:19 vg0-var -> ../dm-21
lrwxrwxrwx 1 root root 8 Sep 24 02:19 vg0-var_rimage_1 -> ../dm-14
lrwxrwxrwx 1 root root 8 Sep 24 02:19 vg0-var_rimage_2 -> ../dm-16
lrwxrwxrwx 1 root root 8 Sep 24 02:19 vg0-var_rimage_3 -> ../dm-18
lrwxrwxrwx 1 root root 8 Sep 24 02:19 vg0-var_rimage_4 -> ../dm-20
lrwxrwxrwx 1 root root 8 Sep 24 02:19 vg0-var_rimage_5 -> ../dm-12
lrwxrwxrwx 1 root root 8 Sep 24 02:19 vg0-var_rmeta_1 -> ../dm-13
lrwxrwxrwx 1 root root 8 Sep 24 02:19 vg0-var_rmeta_2 -> ../dm-15
lrwxrwxrwx 1 root root 8 Sep 24 02:19 vg0-var_rmeta_3 -> ../dm-17
lrwxrwxrwx 1 root root 8 Sep 24 02:19 vg0-var_rmeta_4 -> ../dm-19
lrwxrwxrwx 1 root root 8 Sep 24 02:19 vg0-var_rmeta_5 -> ../dm-11

lrwxrwxrwx 1 root root 8 Sep 24 01:27 vg0-home -> ../dm-32
lrwxrwxrwx 1 root root 8 Sep 24 01:27 vg0-home_rimage_1 -> ../dm-25
lrwxrwxrwx 1 root root 8 Sep 24 01:27 vg0-home_rimage_2 -> ../dm-27
lrwxrwxrwx 1 root root 8 Sep 24 01:27 vg0-home_rimage_3 -> ../dm-29
lrwxrwxrwx 1 root root 8 Sep 24 01:27 vg0-home_rimage_4 -> ../dm-31
lrwxrwxrwx 1 root root 8 Sep 24 01:27 vg0-home_rimage_5 -> ../dm-23
lrwxrwxrwx 1 root root 8 Sep 24 01:27 vg0-home_rmeta_1 -> ../dm-24
lrwxrwxrwx 1 root root 8 Sep 24 01:27 vg0-home_rmeta_2 -> ../dm-26
lrwxrwxrwx 1 root root 8 Sep 24 01:27 vg0-home_rmeta_3 -> ../dm-28
lrwxrwxrwx 1 root root 8 Sep 24 01:27 vg0-home_rmeta_4 -> ../dm-30
lrwxrwxrwx 1 root root 8 Sep 24 01:27 vg0-home_rmeta_5 -> ../dm-22

lrwxrwxrwx 1 root root 8 Sep 24 01:27 vg0-photos -> ../dm-45
lrwxrwxrwx 1 root root 8 Sep 24 01:27 vg0-photos_rimage_0 -> ../dm-36
lrwxrwxrwx 1 root root 8 Sep 24 01:27
vg0-photos_rimage_0-missing_0_0 -> ../dm-35
lrwxrwxrwx 1 root root 8 Sep 24 01:27 vg0-photos_rimage_1 -> ../dm-38
lrwxrwxrwx 1 root root 8 Sep 24 01:27 vg0-photos_rimage_2 -> ../dm-40
lrwxrwxrwx 1 root root 8 Sep 24 01:27 vg0-photos_rimage_3 -> ../dm-42
lrwxrwxrwx 1 root root 8 Sep 24 01:27 vg0-photos_rimage_4 -> ../dm-44
lrwxrwxrwx 1 root root 8 Sep 24 01:27 vg0-photos_rmeta_0 -> ../dm-34
lrwxrwxrwx 1 root root 8 Sep 24 01:27
vg0-photos_rmeta_0-missing_0_0 -> ../dm-33
lrwxrwxrwx 1 root root 8 Sep 24 01:27 vg0-photos_rmeta_1 -> ../dm-37
lrwxrwxrwx 1 root root 8 Sep 24 01:27 vg0-photos_rmeta_2 -> ../dm-39
lrwxrwxrwx 1 root root 8 Sep 24 01:27 vg0-photos_rmeta_3 -> ../dm-41
lrwxrwxrwx 1 root root 8 Sep 24 01:27 vg0-photos_rmeta_4 -> ../dm-43
Giuliano Procida
2016-09-28 06:51:02 UTC
Permalink
On 26 September 2016 at 08:54, Giuliano Procida
Post by Giuliano Procida
The net effect is that I now have RAID volumes in an odd state with
extracted / missing images (I think just a cosmetic issue) and which
rebuild on every boot (or rather re-activation and use).
The volumes take a long time to resync, especially concurrently!
I'm mostly interested in fixing my system but I had a look at the
stderr for the last repair attempt. If I had to make a guess, it would
be that lvconvert is racing against state changes in the kernel and
eventually asking dm-raid to set up an array with mismatched
expectations about whether device 0 should be flagged for rebuild or
not.

Looking through the LVM code, there is clearly an effort to clear
rebuild flags to prevent just the problem I'm having. Unfortunately,
with lvconvert exiting early these paths are presumably not taken. I
also noticed that the rebuild flag is singled out for exclusion when
generating (backup?) metadata.

I've extracted metadata from the PVs using another small utility I
wrote. The format is not quite the same as the /etc/lvm one and in
particular REBUILD status flags are NOT filtered out here.

It turns out that I do have the REBUILD flag on the rmeta_5 and rimage_5 subLVs!

Questions:
How can I clear REBUILD safely?
Can I renumber _5 to _0 as well at the same time?
Does it make sense to try the latest version of LVM user space tools
to try the final repair conversions?

--gp
Giuliano Procida
2016-09-28 08:33:05 UTC
Permalink
On 28 September 2016 at 07:51, Giuliano Procida
Post by Giuliano Procida
On 26 September 2016 at 08:54, Giuliano Procida
Post by Giuliano Procida
The net effect is that I now have RAID volumes in an odd state with
extracted / missing images (I think just a cosmetic issue) and which
rebuild on every boot (or rather re-activation and use).
The volumes take a long time to resync, especially concurrently!
I'm mostly interested in fixing my system but I had a look at the
stderr for the last repair attempt. If I had to make a guess, it would
be that lvconvert is racing against state changes in the kernel and
eventually asking dm-raid to set up an array with mismatched
expectations about whether device 0 should be flagged for rebuild or
not.
Looking through the LVM code, there is clearly an effort to clear
rebuild flags to prevent just the problem I'm having. Unfortunately,
with lvconvert exiting early these paths are presumably not taken. I
also noticed that the rebuild flag is singled out for exclusion when
generating (backup?) metadata.
The last bit's not true. Also, I *do* have consistently REBUILD flags
in the on-disk and backed up metadata for the new rmeta and rimage
subLVs. I must have been grepping a typo.
Post by Giuliano Procida
How can I clear REBUILD safely?
vgcfgrestore -t and the source code show that trying this will fail
until I lvconvert --repair everything off the missing PV and vgreduce
--removemissing.

The only other thing I can think of at present is to overwrite the
text [, "REBUILD"] with spaces and recalculate and re-write the CRC
for the metadata; sync and power off the machine (well after running
various LVM commands that read the on-disk format!).
Post by Giuliano Procida
Can I renumber _5 to _0 as well at the same time?
Does it make sense to try the latest version of LVM user space tools
to try the final repair conversions?
--gp
Giuliano Procida
2016-09-30 16:22:40 UTC
Permalink
I have resolved this myself.

I wrote a tool to modify on-disk PV metadata in-place, recalculating checksums.
I had to write a bitmap superblock fixer script as well to get all my
testing done.

As I see things, these are the outstanding issues, not all of which
are LVM's fault.

0. bitm magic not always written to rmeta subLVs
reported as https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=839189

1. An aborted lvconvert --repair (for a RAID LV) can leave behind
__extracted subLVs and leave the new subLVs with their new name.

Might it make more sense to rename the missing subLVs out of the way
first and create new subLVs with the right name?

2. lvconvert --repair (often?) dies, probably due to a bad DM ioctl -
see logs previously gathered

3. A working but interrupted (e.g. due to power failure) or an aborted
lvconvert --repair will leave a sticky REBUILD flag on the new subLVs;
this will cause a full rebuild on every deactivated->open transition.

4. vgcfgrestore cannot be used to remove the REBUILD flag until all
LVs with missing devices have been repaired AND the missing PV has
been removed from metadata.
This really limits the usefulness of this as emergency repair tool
when RAID volumes are present.

5. Multiple REBUILDs happen concurrently even when the same PVs are in
use for each RAID LV - things really slow down.

I'll update the Google Drive folder with the utility source, scripts
and other notes in a day or so.

If someone finds these notes useful, let me know.
Giuliano Procida
2016-10-01 12:24:23 UTC
Permalink
One additional issue.

6. Once the PV metadata ring has been used a lot, pvck finds few or
zero metadata regions, often not including the current one.

This should be straightforward to reproduce. Here's what I currently see.

# pvck -v /dev/sdb2 # the replacement disk
Scanning /dev/sdb2
Found label on /dev/sdb2, sector 1, type=LVM2 001
Found text metadata area: offset=4096, size=1044480
Found LVM2 metadata record at offset=481280, size=22016, offset2=0 size2=0
Found LVM2 metadata record at offset=459264, size=22016, offset2=0 size2=0
Found LVM2 metadata record at offset=437248, size=22016, offset2=0 size2=0
Found LVM2 metadata record at offset=415232, size=22016, offset2=0 size2=0
Found LVM2 metadata record at offset=392704, size=22528, offset2=0 size2=0
Found LVM2 metadata record at offset=370176, size=22528, offset2=0 size2=0
Found LVM2 metadata record at offset=347136, size=23040, offset2=0 size2=0
Found LVM2 metadata record at offset=324608, size=22528, offset2=0 size2=0
Found LVM2 metadata record at offset=302080, size=22528, offset2=0 size2=0
Found LVM2 metadata record at offset=279040, size=23040, offset2=0 size2=0
Found LVM2 metadata record at offset=256512, size=22528, offset2=0 size2=0
Found LVM2 metadata record at offset=233984, size=22528, offset2=0 size2=0
Found LVM2 metadata record at offset=210944, size=23040, offset2=0 size2=0
Found LVM2 metadata record at offset=188416, size=22528, offset2=0 size2=0
Found LVM2 metadata record at offset=165888, size=22528, offset2=0 size2=0
Found LVM2 metadata record at offset=142848, size=23040, offset2=0 size2=0
Found LVM2 metadata record at offset=119296, size=23552, offset2=0 size2=0
Found LVM2 metadata record at offset=95744, size=23552, offset2=0 size2=0
Found LVM2 metadata record at offset=72704, size=23040, offset2=0 size2=0
Found LVM2 metadata record at offset=50176, size=22528, offset2=0 size2=0
Found LVM2 metadata record at offset=27136, size=23040, offset2=0 size2=0

# .../parse_pv_meta /dev/sdb2 | wc
/dev/sdb2
Found label at 512 of type LVM2 001 with CRC 668745670
PV header for UUID UTRUfMiTVIWpuGbVMYbRh8te3c2QvuSE with device_size
1999861972992
data area at 1048576 size 0
metadata area at 4096 size 1044480
reading metadata at 4096 of size 1044480
metadata header looks OK
region at 499200 rsize 21907 rflags 0 and CRC 3460382297
1532 3190 21906

# pvck -v /dev/sda2 # an older disk
Scanning /dev/sda2
Found label on /dev/sda2, sector 1, type=LVM2 001
Found text metadata area: offset=4096, size=1044480
Found LVM2 metadata record at offset=935936, size=22016, offset2=0 size2=0

# .../parse_pv_meta /dev/sda2 | wc # my utility
/dev/sda2
Found label at 512 of type LVM2 001 with CRC 3061781507
PV header for UUID 9WMRxLdsD6Or0ehN0xuRsRqmiprfdJtK with device_size
1999860997632
data area at 1048576 size 0
metadata area at 4096 size 1044480
reading metadata at 4096 of size 1044480
metadata header looks OK
region at 953856 rsize 21907 rflags 0 and CRC 3460382297
1532 3190 21906

Loading...