Discussion:
[linux-lvm] LVM RAID: task mdX_raid1:221 blocked for more than 120 seconds
Cesare Leonardi
2018-11-24 23:30:05 UTC
Since my message did not reach the list, I'm resending, but first I've
subscribed myself.

------------------
Hello, I'm writing here to have your opinion and possibly some advice
about some Debian bugs related to LVM RAID that are still unresolved:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=913119
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=913138
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=904822

Bug #913119 was filed by me, so I can personally provide some more
information and do tests.

Premises related to Debian unstable:
* Debian's kernel is currently 4.18.20.
* From kernel 4.17~rc7 Debian enabled SCSI_MQ_DEFAULT and DM_MQ_DEFAULT.
* Debian's LVM userland is 2.02.176

The above reports show blocked I/O with different types of LVM RAID, and
for #913119 I've successfully worked around it by passing the following
kernel parameters:
scsi_mod.use_blk_mq=0 dm_mod.use_blk_mq=0

I've read that RHEL will default to enabling SCSI_MQ_DEFAULT and
DM_MQ_DEFAULT and will use kernel 4.18. Maybe you have already
encountered this bug and it's already resolved. Or there are patches
pending.

What do you think? Should I file a bug in the Red Hat bug tracker?

Cesare.
Jack Wang
2018-11-26 07:25:21 UTC
+cc linux-raid

The call trace looks like some kind of deadlock in raid.
Cesare Leonardi
2018-11-24 15:43:31 UTC
Hello, I'm writing here to have your opinion and possibly some advice
about some Debian bugs related to LVM RAID that are still unresolved:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=913119
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=913138
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=904822

Bug #913119 was filed by me, so I can personally provide some more
information and do tests.

Premises related to Debian unstable:
* Debian's kernel is currently 4.18.20.
* From kernel 4.17~rc7 Debian enabled SCSI_MQ_DEFAULT and DM_MQ_DEFAULT.
* Debian's LVM userland is 2.02.176

The above reports show blocked I/O with different types of LVM RAID, and
for #913119 I've successfully worked around it by passing the following
kernel parameters:
scsi_mod.use_blk_mq=0 dm_mod.use_blk_mq=0

I've read that RHEL will default to enabling SCSI_MQ_DEFAULT and
DM_MQ_DEFAULT and will use kernel 4.18. Maybe you have already
encountered this bug and it's already resolved. Or there are patches
pending.

What do you think? Should I file a bug in the Red Hat bug tracker?

Please, keep me in CC as I'm not subscribed to the list.

Cesare.
Zdenek Kabelac
2018-11-26 08:49:21 UTC
Hi


Traces are completely misleading.

It does look like the 'freeze' happens during an LV resize of the device
(just a wild guess from bug=913138).

To track down the issue, there would probably need to be some communication
with the bug reporters: they would need to describe what they were doing, plus
the state of the dm tables and a number of other things.

It's nearly impossible to guess just from this 'trace' of a sleeping process.

From the traces it seems the raid kernel driver is sleeping, so it could mean,
for example, that some 'dm' target was left in a suspended state, possibly due
to a bug(?) in an lvm2 command that crashed and left the table in an incorrect
state.
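
One way to check the "left in suspended state" hypothesis is to look at the State line that `dmsetup info` prints for each device-mapper device. The following is only a minimal Python sketch under that assumption: the function names are hypothetical, the parser expects the usual "Name:" / "State:" key-value layout, and `dmsetup` has to be run as root.

```python
import subprocess


def suspended_devices(info_text):
    """Return names of dm devices whose State line reads SUSPENDED.

    Assumes the "Key: value" layout printed by `dmsetup info`,
    where each device block starts with a "Name:" line.
    """
    suspended = []
    name = None
    for line in info_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank or malformed line
        key, value = key.strip(), value.strip()
        if key == "Name":
            name = value
        elif key == "State" and name is not None and "SUSPENDED" in value.upper():
            suspended.append(name)
    return suspended


def read_dmsetup_info():
    """Run `dmsetup info` (needs root) and return its text output."""
    result = subprocess.run(
        ["dmsetup", "info"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

Calling `suspended_devices(read_dmsetup_info())` while the system is frozen (or right after it recovers) would show whether any LV sub-device is stuck in the suspended state; an empty list would argue against that explanation.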

Anyway, without much more info such a bug report is meaningless.


Regards


Zdenek
Cesare Leonardi
2018-11-26 11:31:41 UTC
Resending, I erroneously replied only to Zdenek, sorry.
Post by Zdenek Kabelac
It does look like 'freeze' happens during LV resize of device
(just wild guess from bug=913138)
To track down the issue - there would need to be probably some
communication with bug reporters - they would need to expose what they
were doing plus state of dm tables and number of other things.
I can provide details about this, that was filed by me:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=913119

It's about a desktop PC with two SSDs (Samsung 850 EVO) on which I built
RAID1 using LVM.
# pvs
  PV         VG  Fmt  Attr PSize    PFree
  /dev/sdb3  vg0 lvm2 a--  <250,00g 15,98g
  /dev/sdc3  vg0 lvm2 a--  <250,00g 15,98g

# lvs
  LV    VG  Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  home  vg0 rwi-aor--- 200,00g                                    100,00
  root  vg0 rwi-aor---  30,00g                                    100,00
  swap0 vg0 rwi-aor---   4,00g                                    100,00

It's a desktop PC using Debian unstable, so it's rebooted quite often
due to frequent updates.
The freezes happen during normal work, with no resizing or other
maintenance on LVM going on. Most of the time I noticed the freeze while I
was using Thunderbird. They eventually resolve by themselves: I wait a few
minutes and the system suddenly becomes responsive again. Sometimes I've
noticed freezes without any message in dmesg: maybe they resolved
before some kernel threshold.
But most of the time another freeze happens soon (it could be 1-2
hours but also minutes), so a reboot is really necessary.
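
The "blocked for more than 120 seconds" message in the subject comes from the kernel's hung-task detector, whose threshold is the kernel.hung_task_timeout_secs sysctl (120 seconds by default), so a stall shorter than that would leave no trace in dmesg. Below is a minimal Python sketch for pulling such reports out of a saved dmesg dump; the function name is hypothetical and the sample line in the usage note is built from this thread's subject, not taken from the attached log.

```python
import re

# Matches the hung-task report, e.g.
# "INFO: task mdX_raid1:221 blocked for more than 120 seconds."
HUNG_TASK_RE = re.compile(
    r"task (?P<comm>.+?):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds"
)


def find_hung_tasks(log_text):
    """Return (task name, pid, seconds) for every hung-task line in log_text."""
    hits = []
    for line in log_text.splitlines():
        match = HUNG_TASK_RE.search(line)
        if match:
            hits.append(
                (match.group("comm"), int(match.group("pid")), int(match.group("secs")))
            )
    return hits
```

For example, `find_hung_tasks("INFO: task mdX_raid1:221 blocked for more than 120 seconds.")` yields one entry for task mdX_raid1 with pid 221. Counting how many freezes were noticed versus how many reports appear in dmesg would help tell short stalls from ones that crossed the threshold.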

I've not noticed any corruption due to these freezes, but they are often
very long and very disruptive. The only reliable workaround I found was to
reboot with:
scsi_mod.use_blk_mq=0 dm_mod.use_blk_mq=0

Or to reboot with Debian kernel 4.16.16 (linux-image-4.16.0-2-amd), the
last one that works without problems but also the last one before the Debian
maintainers activated SCSI_MQ_DEFAULT and DM_MQ_DEFAULT.

To me the only evidence is that with blk-mq disabled the problem doesn't
happen, so it looks like an interaction with blk-mq.
I've read in the RHEL8 release notes that it will be enabled by default, so I
wonder whether this has happened to others. I have a fedora-server 29 VM,
upgraded from 28, but there, if I recall correctly, SCSI_MQ_DEFAULT and
DM_MQ_DEFAULT are not set.
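
Whether blk-mq is actually in effect on a running system can be read back from sysfs, which avoids relying on memory of the kernel config. This is a minimal Python sketch, assuming the two module parameters named above are exposed under /sys/module/<module>/parameters/ on the kernel in question; the function name is hypothetical, and a missing file is reported as None since not every kernel provides these parameters.

```python
from pathlib import Path

# Assumed sysfs locations of the two parameters discussed in this thread.
PARAMS = {
    "scsi_mod.use_blk_mq": "/sys/module/scsi_mod/parameters/use_blk_mq",
    "dm_mod.use_blk_mq": "/sys/module/dm_mod/parameters/use_blk_mq",
}


def read_blk_mq_state(paths=PARAMS):
    """Return {parameter: value string, or None if the file cannot be read}."""
    state = {}
    for name, path in paths.items():
        try:
            state[name] = Path(path).read_text().strip()
        except OSError:
            state[name] = None  # parameter not exposed on this kernel
    return state
```

Printing `read_blk_mq_state()` on both the Debian machine and the Fedora VM would give a like-for-like comparison of the two setups.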
Post by Zdenek Kabelac
Anyway without way more info such bug report is meaningless.
Please ask, I'll do my best to provide any info you need.

Cesare.
Zdenek Kabelac
2018-11-26 11:40:40 UTC
Post by Cesare Leonardi
Resending, I erroneously replied only to Zdenek, sorry.
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=913119
It's about a desktop PC with two SSDs (Samsung 850 EVO) on which I built RAID1
using LVM.
# pvs
  PV         VG  Fmt  Attr PSize    PFree
  /dev/sdb3  vg0 lvm2 a--  <250,00g 15,98g
  /dev/sdc3  vg0 lvm2 a--  <250,00g 15,98g
# lvs
  LV    VG  Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  home  vg0 rwi-aor--- 200,00g                                    100,00
  root  vg0 rwi-aor---  30,00g                                    100,00
  swap0 vg0 rwi-aor---   4,00g                                    100,00
It's a desktop PC using Debian unstable, so it's rebooted quite often due to
frequent updates.
So you should probably start by running the latest available kernel, 4.19.

You should also collect a 'dmesg' report.
Post by Cesare Leonardi
The freezes happen during normal work, without any resizing or any
maintenance on LVM going on. Most of the time I noted the freeze while I was
using Thunderbird. But eventually they resolve by themselves: I wait minutes and
Aren't you running out of memory?

Install some CPU/MEM monitoring service and watch out for problems
(AFAIK the OOM killer doesn't really work on my machine, and the Firefox +
Thunderbird combo often brings it to a state where the mouse barely moves and
the CPU spins in kswapd...)
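
As a lightweight alternative to a full monitoring service, the memory figures can be sampled directly from /proc/meminfo. What follows is a minimal Python sketch under stated assumptions: the function names and the 10% threshold are hypothetical choices, and the field names (MemTotal, MemAvailable) are the ones /proc/meminfo reports on current kernels.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: value in kB}."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def low_on_memory(values, min_available_fraction=0.10):
    """Return True if MemAvailable is below the given fraction of MemTotal.

    Returns None when the needed fields are missing. The 10% default is an
    arbitrary illustrative threshold, not a kernel constant.
    """
    total = values.get("MemTotal")
    available = values.get("MemAvailable")
    if not total or available is None:
        return None
    return available / total < min_available_fraction


def read_meminfo(path="/proc/meminfo"):
    """Read and parse the live memory counters."""
    with open(path) as handle:
        return parse_meminfo(handle.read())
```

Logging `low_on_memory(read_meminfo())` with a timestamp every few seconds and comparing it against the times of the freezes would show whether memory pressure coincides with them.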
Post by Cesare Leonardi
I've not noticed any corruption due to these freezes but often they are very
long and very impacting. The only reliable workaround found was to reboot with:
scsi_mod.use_blk_mq=0 dm_mod.use_blk_mq=0
I doubt this has anything to do with the problem.

Regards

Zdenek
Cesare Leonardi
2018-11-26 12:43:29 UTC
On Mon, 26 Nov 2018 at 12:40, Zdenek Kabelac wrote:
Post by Zdenek Kabelac
So you should probably start first with running latest available kernel - 4.19.
You also should collect 'dmesg' report
I'm already running the latest Debian kernel (4.18.20). Kernel 4.19 is not
packaged yet. In #913119 there is a complete dmesg, which I've also attached
here.
Post by Zdenek Kabelac
Aren't you running out of memory?
Not that I'm aware of. In the top bar I have an applet that shows CPU, RAM
and swap usage, and I've never noticed particularly high RAM usage. I'll pay
more attention to that the next time I run a test.
Post by Zdenek Kabelac
Post by Cesare Leonardi
I've not noticed any corruption due to these freezes but often they are very
long and very impacting. The only reliable workaround found was to reboot with:
scsi_mod.use_blk_mq=0 dm_mod.use_blk_mq=0
I doubt this has anything to do with the problem.
That's surprising to me. I assure you that, so far, it's the only
thing that has really solved the problem for me and that lets me use Debian
kernels from 4.17 to 4.18.
I look forward to testing 4.19 as soon as it is available.

Cesare.