Discussion:
[linux-lvm] [lvmlockd] "VGLK res_unlock lm error -250" and lvm command hung forever
Damon Wang
2018-05-24 14:44:05 UTC
Permalink
Hi all,

I'm using lvmlockd + sanlock on iSCSI, and sometimes (usually
intensive operations), it shows vglock is failed:

/var/log/messages:

May 24 21:14:29 dev1 sanlock[1108]: 2018-05-24 21:14:29 605471
[1112]: r627 paxos_release 8255 other lver 8258
May 24 21:14:29 dev1 sanlock[1108]: 2018-05-24 21:14:29 605471
[1112]: r627 release_token release leader -250
May 24 21:14:29 dev1 lvmlockd[1061]: 1527167669 S
lvm_ff35ecc8217543e0a5be9cbe935ffc84 R VGLK unlock_san release error
-250
May 24 21:14:29 dev1 lvmlockd[1061]: 1527167669 S
lvm_ff35ecc8217543e0a5be9cbe935ffc84 R VGLK res_unlock lm error -250
May 24 21:14:29 dev1 lvmlockd[1061]: 1527167669 S
lvm_ff35ecc8217543e0a5be9cbe935ffc84 R VGLK unlock_san release error
-1
May 24 21:14:29 dev1 lvmlockd[1061]: 1527167669 S
lvm_ff35ecc8217543e0a5be9cbe935ffc84 R VGLK res_unlock lm error -1
May 24 21:14:29 dev1 sanlock[1108]: 2018-05-24 21:14:29 605471
[1111]: cmd_release 2,9,1061 no resource VGLK


meanwhile lvmlockctl -d:

1527167668 close lvchange[34042] cl 476 fd 9
1527167669 S lvm_ff35ecc8217543e0a5be9cbe935ffc84 R VGLK res_lock
rv 0 read vb 101 0 1761
1527167669 send lvcreate[34073] cl 478 lock vg rv 0
1527167669 recv lvcreate[34073] cl 478 find_free_lock vg
"ff35ecc8217543e0a5be9cbe935ffc84" mode iv flags 0
1527167669 S lvm_ff35ecc8217543e0a5be9cbe935ffc84 find free lock
1527167669 S lvm_ff35ecc8217543e0a5be9cbe935ffc84
find_free_lock_san found unused area at 127926272
1527167669 S lvm_ff35ecc8217543e0a5be9cbe935ffc84 find free lock 0
offset 127926272
1527167669 send lvcreate[34073] cl 478 find_free_lock vg rv 0
1527167669 recv lvcreate[34073] cl 478 init lv
"ff35ecc8217543e0a5be9cbe935ffc84" mode iv flags 0
1527167669 work init_lv
ff35ecc8217543e0a5be9cbe935ffc84/c0dda9c02a7347a098203fbe2a7df1d0 uuid
Wnj8vm-2Hiy-gdqs-Uh1i-fUiA-W3mj-zdASzK
1527167669 S lvm_ff35ecc8217543e0a5be9cbe935ffc84 init_lv_san
Wnj8vm-2Hiy-gdqs-Uh1i-fUiA-W3mj-zdASzK found unused area at 127926272
1527167669 send lvcreate[34073] cl 478 init lv rv 0 vg_args
1.0.0:lvmlock lv_args 1.0.0:127926272
1527167669 recv lvcreate[34073] cl 478 update vg
"ff35ecc8217543e0a5be9cbe935ffc84" mode iv flags 0
1527167669 S lvm_ff35ecc8217543e0a5be9cbe935ffc84 R VGLK action update iv
1527167669 S lvm_ff35ecc8217543e0a5be9cbe935ffc84 R VGLK
res_update cl 478 lk version to 1762
1527167669 send lvcreate[34073] cl 478 update vg rv 0
1527167669 recv lvcreate[34073] cl 478 lock lv
"ff35ecc8217543e0a5be9cbe935ffc84" mode un flags 1
1527167669 S lvm_ff35ecc8217543e0a5be9cbe935ffc84 R
Wnj8vm-2Hiy-gdqs-Uh1i-fUiA-W3mj-zdASzK action lock un
1527167669 S lvm_ff35ecc8217543e0a5be9cbe935ffc84 R
Wnj8vm-2Hiy-gdqs-Uh1i-fUiA-W3mj-zdASzK res_unlock cl 478 no locks
1527167669 send lvcreate[34073] cl 478 lock lv rv -2
1527167669 recv lvcreate[34073] cl 478 lock vg
"ff35ecc8217543e0a5be9cbe935ffc84" mode un flags 0
1527167669 S lvm_ff35ecc8217543e0a5be9cbe935ffc84 R VGLK action lock un
1527167669 S lvm_ff35ecc8217543e0a5be9cbe935ffc84 R VGLK res_unlock cl 478
1527167669 S lvm_ff35ecc8217543e0a5be9cbe935ffc84 R VGLK
res_unlock r_version new 1762
1527167669 S lvm_ff35ecc8217543e0a5be9cbe935ffc84 R VGLK
unlock_san ex r_version 1762 flags 0
1527167669 S lvm_ff35ecc8217543e0a5be9cbe935ffc84 R VGLK
unlock_san set r_version 1762
1527167669 S lvm_ff35ecc8217543e0a5be9cbe935ffc84 R VGLK
unlock_san release error -250
1527167669 S lvm_ff35ecc8217543e0a5be9cbe935ffc84 R VGLK
res_unlock lm error -250
1527167669 send lvcreate[34073] cl 478 lock vg rv -250
1527167669 close lvcreate[34073] cl 478 fd 11
1527167669 S lvm_ff35ecc8217543e0a5be9cbe935ffc84 R VGLK
res_unlock cl 478 from close
1527167669 S lvm_ff35ecc8217543e0a5be9cbe935ffc84 R VGLK
unlock_san ex r_version 0 flags 0
1527167669 S lvm_ff35ecc8217543e0a5be9cbe935ffc84 R VGLK
unlock_san release error -1
1527167669 S lvm_ff35ecc8217543e0a5be9cbe935ffc84 R VGLK
res_unlock lm error -1
1527167670 new cl 479 pi 2 fd 9


and then any lvm related commands will hung:

[***@dev1 ~]# lvmlockctl -i
VG ff35ecc8217543e0a5be9cbe935ffc84 lock_type=sanlock
Nz4PmR-Bwfi-pItr-C3om-Ewnj-SICA-5rFuFg
LS sanlock lvm_ff35ecc8217543e0a5be9cbe935ffc84
LK VG ex ver 1762 pid 0 ()
LW VG sh ver 0 pid 34216 (lvchange)
LW VG sh ver 0 pid 75685 (lvs)
LW VG sh ver 0 pid 83741 (lvdisplay)
LW VG sh ver 0 pid 90569 (lvchange)
LW VG sh ver 0 pid 92735 (lvchange)
LW VG sh ver 0 pid 99982 (lvs)
LW VG sh ver 0 pid 14069 (lvchange)
LK LV sh onuPlt-YXI0-nYtv-CV8Q-yS93-6Zt9-QvlfyH
LK GL sh ver 0 pid 75685 (lvs)
LK GL sh ver 0 pid 99982 (lvs)
LK LV un XOgw6I-3Nuh-ejIP-ydPq-rr7g-LMjb-oklfup

My questions are:

1. why VGLK failed, is it because network failure(cause iSCSI fail and
sanlock could not find VGLK volume), can I find a direct proof?
2. Is it recoverable? I have tried kill all hung commands but new
command still hung forever.

Thanks a lot,
Damon
David Teigland
2018-05-24 15:46:23 UTC
Permalink
Post by Damon Wang
Hi all,
I'm using lvmlockd + sanlock on iSCSI, and sometimes (usually
Hi, thanks for this report.
Post by Damon Wang
May 24 21:14:29 dev1 sanlock[1108]: 2018-05-24 21:14:29 605471
[1112]: r627 paxos_release 8255 other lver 8258
I believe this is the sanlock bug that was fixed here:
https://pagure.io/sanlock/c/735781d683e99cccb3be7ffe8b4fff1392a2a4c8?branch=master

By itself, the bug isn't a big problem, the lock was released but sanlock
returns an error. The bigger problem is that lvmlockd then believes that
Post by Damon Wang
1527167669 S lvm_ff35ecc8217543e0a5be9cbe935ffc84 R VGLK
unlock_san release error -1
LW VG sh ver 0 pid 34216 (lvchange)
LW VG sh ver 0 pid 75685 (lvs)
LW VG sh ver 0 pid 83741 (lvdisplay)
LW VG sh ver 0 pid 90569 (lvchange)
LW VG sh ver 0 pid 92735 (lvchange)
LW VG sh ver 0 pid 99982 (lvs)
LW VG sh ver 0 pid 14069 (lvchange)
1. why VGLK failed, is it because network failure(cause iSCSI fail and
sanlock could not find VGLK volume), can I find a direct proof?
I believe the bug. Failures of the storage network can also cause similar
issues, but you would see error messages related to i/o timeouts.
Post by Damon Wang
2. Is it recoverable? I have tried kill all hung commands but new
command still hung forever.
There are recently added options for this kind of situation, but I don't
believe there is an lvm release with those yet.

If you are prepared to build your own version of lvm, build lvm release
2.02.178 (which should be ready shortly, if it's not, take git master
branch). Be sure to configure with --enable-lvmlockd-sanlock. Then try:

lvchange -an --lockopt skipvg <vgname>
lvmlockctl --drop <vgname>
stop lvmlockd, stop sanlock
restart everything as usual

If that doesn't work, or if you don't want to build lvm, then unmount file
systems, kill lvmlockd, kill sanlock, you might need to do some dm cleanup
if LVs were active (or perhaps just reboot the machine.) Restart
everything as usual.

Dave
Damon Wang
2018-05-24 16:50:27 UTC
Permalink
Thank you for your reply!

I'll try to sanlock-3.6.0 first (currently I'm using 3.5.0) and try
whether it happen again

Damon
Post by David Teigland
Post by Damon Wang
Hi all,
I'm using lvmlockd + sanlock on iSCSI, and sometimes (usually
Hi, thanks for this report.
Post by Damon Wang
May 24 21:14:29 dev1 sanlock[1108]: 2018-05-24 21:14:29 605471
[1112]: r627 paxos_release 8255 other lver 8258
https://pagure.io/sanlock/c/735781d683e99cccb3be7ffe8b4fff1392a2a4c8?branch=master
By itself, the bug isn't a big problem, the lock was released but sanlock
returns an error. The bigger problem is that lvmlockd then believes that
Post by Damon Wang
1527167669 S lvm_ff35ecc8217543e0a5be9cbe935ffc84 R VGLK
unlock_san release error -1
LW VG sh ver 0 pid 34216 (lvchange)
LW VG sh ver 0 pid 75685 (lvs)
LW VG sh ver 0 pid 83741 (lvdisplay)
LW VG sh ver 0 pid 90569 (lvchange)
LW VG sh ver 0 pid 92735 (lvchange)
LW VG sh ver 0 pid 99982 (lvs)
LW VG sh ver 0 pid 14069 (lvchange)
1. why VGLK failed, is it because network failure(cause iSCSI fail and
sanlock could not find VGLK volume), can I find a direct proof?
I believe the bug. Failures of the storage network can also cause similar
issues, but you would see error messages related to i/o timeouts.
Post by Damon Wang
2. Is it recoverable? I have tried kill all hung commands but new
command still hung forever.
There are recently added options for this kind of situation, but I don't
believe there is an lvm release with those yet.
If you are prepared to build your own version of lvm, build lvm release
2.02.178 (which should be ready shortly, if it's not, take git master
lvchange -an --lockopt skipvg <vgname>
lvmlockctl --drop <vgname>
stop lvmlockd, stop sanlock
restart everything as usual
If that doesn't work, or if you don't want to build lvm, then unmount file
systems, kill lvmlockd, kill sanlock, you might need to do some dm cleanup
if LVs were active (or perhaps just reboot the machine.) Restart
everything as usual.
Dave
Damon Wang
2018-05-30 13:17:31 UTC
Permalink
After days testing, I'm pretty sure the problem has been solved by
upgrade sanlock to 3.6.0, thanks Dave!

Damon
Post by Damon Wang
Thank you for your reply!
I'll try to sanlock-3.6.0 first (currently I'm using 3.5.0) and try
whether it happen again
Damon
Post by David Teigland
Post by Damon Wang
Hi all,
I'm using lvmlockd + sanlock on iSCSI, and sometimes (usually
Hi, thanks for this report.
Post by Damon Wang
May 24 21:14:29 dev1 sanlock[1108]: 2018-05-24 21:14:29 605471
[1112]: r627 paxos_release 8255 other lver 8258
https://pagure.io/sanlock/c/735781d683e99cccb3be7ffe8b4fff1392a2a4c8?branch=master
By itself, the bug isn't a big problem, the lock was released but sanlock
returns an error. The bigger problem is that lvmlockd then believes that
Post by Damon Wang
1527167669 S lvm_ff35ecc8217543e0a5be9cbe935ffc84 R VGLK
unlock_san release error -1
LW VG sh ver 0 pid 34216 (lvchange)
LW VG sh ver 0 pid 75685 (lvs)
LW VG sh ver 0 pid 83741 (lvdisplay)
LW VG sh ver 0 pid 90569 (lvchange)
LW VG sh ver 0 pid 92735 (lvchange)
LW VG sh ver 0 pid 99982 (lvs)
LW VG sh ver 0 pid 14069 (lvchange)
1. why VGLK failed, is it because network failure(cause iSCSI fail and
sanlock could not find VGLK volume), can I find a direct proof?
I believe the bug. Failures of the storage network can also cause similar
issues, but you would see error messages related to i/o timeouts.
Post by Damon Wang
2. Is it recoverable? I have tried kill all hung commands but new
command still hung forever.
There are recently added options for this kind of situation, but I don't
believe there is an lvm release with those yet.
If you are prepared to build your own version of lvm, build lvm release
2.02.178 (which should be ready shortly, if it's not, take git master
lvchange -an --lockopt skipvg <vgname>
lvmlockctl --drop <vgname>
stop lvmlockd, stop sanlock
restart everything as usual
If that doesn't work, or if you don't want to build lvm, then unmount file
systems, kill lvmlockd, kill sanlock, you might need to do some dm cleanup
if LVs were active (or perhaps just reboot the machine.) Restart
everything as usual.
Dave
Loading...