[linux-lvm] LVM hangs

Discussion:

Zdenek Kabelac

2017-11-13 14:51:21 UTC

Hi!
I have an EL7 desktop box with two SATA hard disks and two SSDs in an
LVM raid1 - thin pool - cache configuration. (Just migrated to this
setup a few weeks ago.)
After some days, individual processes start to block in disk wait.
I don't know if the problem resides in the cache-, thin- or raid1-layer
but the underlying block-devices are fully responsive.
http://leo.kloburg.at/tmp/lvm-blocks/
Do the stack backtraces provide enough information to locate the source
of the blocks?
I'd be happy to provide additional info, if necessary.
Meanwhile I'll disable the LVM cache layer to eliminate this potential
candidate.

Hi

It would probably be nice to see the output of 'dmsetup status'.

I'd guess you are hitting the 'frozen' raid state,
which is unfortunately an existing upstream bug.

Regards

Zdenek

Zdenek Kabelac

2017-11-13 21:56:35 UTC

Are you talking about RH bug 1388632?
https://bugzilla.redhat.com/show_bug.cgi?id=1388632
Unfortunately I can only view the google-cached version of the bugzilla
page, since the bug is restricted to internal view only.

That could be a similar issue, yes.

But the google-cached version suggests that the bug is mainly hit when
removing the raid-backed cache pool under IO.
In my scenario, no modification (like cache removal) of the LVM setup was
made when the blocks occurred.

The easiest thing is to check 'dmsetup status', just to rule out the frozen raid
case.
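
As a rough sketch of that check: the kernel's dm-raid target reports its sync action (idle, frozen, resync, ...) in its status line, so the output of 'dmsetup status' can be filtered for it. The helper name frozen_raid_devices is made up for illustration, and the field position assumes the dm-raid status layout of recent kernels (raid type, device count, health characters, sync ratio, then sync action); older targets may not print a sync action at all.

```shell
#!/bin/sh
# Hypothetical helper: print the names of dm-raid devices whose sync
# action is "frozen". It reads `dmsetup status` output on stdin, one
# "name: start length target args..." line per device.
# Assumed field layout for raid targets:
#   $1=name:  $2=start  $3=length  $4=raid  $5=raid_type  $6=#devices
#   $7=health_chars  $8=sync_ratio  $9=sync_action
frozen_raid_devices() {
    awk '$4 == "raid" && $9 == "frozen" { sub(/:$/, "", $1); print $1 }'
}

# Typical use (needs root):
#   dmsetup status | frozen_raid_devices
```

No output means no raid device reported a frozen sync action; it does not rule out a problem in the thin or cache layers.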

Hi Zdenek,
given how easy it is to trigger the bug, it seems a very serious problem to me.
As the bug report is for internal use only, can you shed some light on what
causes it and how to avoid it?
Specifically, can you confirm that the bug does not apply when using an
"old-school" mdadm RAID device?

IMHO this particular issue is probably not triggerable (at least not so
easily) by mdadm.

Compared to mdadm, lvm2 has some sort of problem: it's able to 'generate' more
device state changes per second than mdadm does.

BZ is still being examined AFAIK....

Zdenek

Alexander 'Leo' Bergolth

2017-11-16 11:02:58 UTC

Post by Zdenek Kabelac

I have an EL7 desktop box with two SATA hard disks and two SSDs in an
LVM raid1 - thin pool - cache configuration. (Just migrated to this
setup a few weeks ago.)
After some days, individual processes start to block in disk wait.
I don't know if the problem resides in the cache-, thin- or raid1-layer
but the underlying block-devices are fully responsive.

It would probably be nice to see the output of 'dmsetup status'.
I'd guess you are hitting the 'frozen' raid state,
which is unfortunately an existing upstream bug.

As it just happened again, I have collected some additional info like
dmsetup status
dmsetup info -c (do the event counts look suspicious?)

https://leo.kloburg.at/tmp/lvm-blocks/2017-11-16/

I don't see any volume in "frozen" state.
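
To make the event-count question easier to eyeball, one option is to rank the devices by their event counter. This is only a sketch: top_event_counts is a made-up helper name, and it assumes input lines of the form "name events", which 'dmsetup info -c' can produce with the options shown in the comment.

```shell
#!/bin/sh
# Hypothetical helper: read "name events" lines on stdin and print the
# N devices with the highest event count (N defaults to 5).
top_event_counts() {
    sort -k2,2nr | head -n "${1:-5}"
}

# Typical use (needs root):
#   dmsetup info -c --noheadings --separator ' ' -o name,events | top_event_counts 10
```

A high count on its own proves nothing; comparing two snapshots taken some minutes apart shows which devices are actually generating events.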

I haven't rebooted the box yet. Can I provide some more info?

Cheers,
--leo

--
e-mail ::: Leo.Bergolth (at) wu.ac.at
fax ::: +43-1-31336-906050
location ::: IT-Services | Vienna University of Economics | Austria

Zdenek Kabelac

2017-11-16 11:47:42 UTC

Post by Alexander 'Leo' Bergolth

Post by Zdenek Kabelac

It would probably be nice to see the output of 'dmsetup status'.
I'd guess you are hitting the 'frozen' raid state,
which is unfortunately an existing upstream bug.

As it just happened again, I have collected some additional info like
dmsetup status
dmsetup info -c (do the event counts look suspicious?)
https://leo.kloburg.at/tmp/lvm-blocks/2017-11-16/
I don't see any volume in "frozen" state.
I haven't rebooted the box yet. Can I provide some more info?

From a quick look over those files, it doesn't seem that there is anything
wrong with the dm devices as such.

So it looks like XFS possibly got into some unhappy state.

I'd recommend opening a regular Bugzilla case and attaching the files from
your directory.

You can check whether individual devices in the 'stack' are blocked,

i.e. try a 'dd' read from every 'dm' device to see if anything is blocked.

From the status, all devices look fully operational, and the process stack
traces also look reasonably idle.

I'm not sure how 'afs' is involved here - can you reproduce it without afs?

Zdenek

Alexander 'Leo' Bergolth

2017-11-16 14:16:57 UTC

Post by Zdenek Kabelac

OK.

Post by Zdenek Kabelac
You can check whether individual devices in the 'stack' are blocked,
i.e. try a 'dd' read from every 'dm' device to see if anything is blocked.

No device is currently blocking. I can read from all LV devices
(including meta devices), all underlying PVs and all filesystems:

# Read from every LV device node, including hidden sub-LVs
for dev in $(lvs -a -olv_dm_path --noheadings); do
  echo "$dev"
  dd if="$dev" of=/dev/null bs=4k count=10000 iflag=direct
done
# Read from every underlying PV
for pv in $(pvs -oname --noheadings); do
  echo "$pv"
  dd if="$pv" of=/dev/null bs=4k count=10000 iflag=direct
done
# Drop the page cache so the filesystem reads below hit the disks
echo 3 >/proc/sys/vm/drop_caches
# Read up to 1 GiB from every mounted XFS/ext4 filesystem
for mp in $(findmnt -t xfs,ext4 -o TARGET -l -n); do
  echo "$mp"
  tar -cf- --one-file-system "$mp" 2>/dev/null | head -c $((1024**3)) >/dev/null
done

Post by Zdenek Kabelac
From the status, all devices look fully operational, and the process stack
traces also look reasonably idle.
I'm not sure how 'afs' is involved here - can you reproduce it without afs?

OK. I'll try.

Thanks for your help!

--leo
