Gang He
2018-04-25 05:00:21 UTC
Hello List,
This is another pvmove problem; the LVM version is 2.02.120(2) (2015-05-15).
The bug is reproducible (not on every attempt, but very easily):
an online pvmove makes the file system on top of the LV hang.
The environment is a three-node cluster (CLVM + OCFS2).
1) Create two PVs, one VG, and one LV.
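Roughly like this (a sketch; the clustered flag and LV size are inferred from the pvs/vgs/lvs output below):
sles12sp3r1-nd1:/ # pvcreate /dev/sda1 /dev/sdb1
sles12sp3r1-nd1:/ # vgcreate -c y cluster-vg2 /dev/sda1 /dev/sdb1   # -c y: clustered VG (the "c" in the vgs Attr field)
sles12sp3r1-nd1:/ # lvcreate -n test-lv -L 20G cluster-vg2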
sles12sp3r1-nd1:/ # pvs
  PV         VG          Fmt  Attr PSize   PFree
  /dev/sda1  cluster-vg2 lvm2 a--  120.00g  60.00g
  /dev/sda2              lvm2 ---   30.00g  30.00g
  /dev/sdb1  cluster-vg2 lvm2 a--  120.00g  60.00g
  /dev/sdb2              lvm2 ---   30.00g  30.00g
sles12sp3r1-nd1:/ # vgs
  VG          #PV #LV #SN Attr   VSize   VFree
  cluster-vg2   2   2   0 wz--nc 239.99g 119.99g
sles12sp3r1-nd1:/ # lvs
  LV      VG          Attr       LSize  Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  test-lv cluster-vg2 -wI-ao---- 20.00g
2) Format the test-lv LV with OCFS2, then mount it on each node.
mkfs.ocfs2 -N 4 /dev/cluster-vg2/test-lv (on one node)
mount /dev/cluster-vg2/test-lv /mnt/shared (on each node)
3) Continually write and truncate files in /mnt/shared from each node.
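For example, with a loop like this on each node (a sketch; the file names and sizes are arbitrary):
sles12sp3r1-ndN:/ # while true; do
>   dd if=/dev/zero of=/mnt/shared/file.$HOSTNAME bs=1M count=32 oflag=direct 2>/dev/null   # write 32M
>   truncate -s 1M /mnt/shared/file.$HOSTNAME                                               # shrink it back
> done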
4) While step 3) is still running on every node, run pvmove on node1.
sles12sp3r1-nd1:/ # pvmove -i 5 /dev/sda1 /dev/sdb1
The pvmove process then sits in this stack:
sles12sp3r1-nd1:/ # cat /proc/12748/stack
[<ffffffff810f429f>] hrtimer_nanosleep+0xaf/0x170
[<ffffffff810f43b6>] SyS_nanosleep+0x56/0x70
[<ffffffff8160916e>] entry_SYSCALL_64_fastpath+0x12/0x6d
[<ffffffffffffffff>] 0xffffffffffffffff
5) Then the OCFS2 write/truncate processes hang on every node.
They are blocked trying to take the journal lock, but the journal lock is
held by the ocfs2 commit thread, and that thread is in turn blocked flushing
the journal to the disk (the LVM disk); a way to check the device state is
sketched after the stack trace below.
sles12sp3r1-nd3:/ # cat /proc/2310/stack
[<ffffffffa021ab4a>] jbd2_log_wait_commit+0x8a/0xf0 [jbd2]
[<ffffffffa021e5c7>] jbd2_journal_flush+0x47/0x180 [jbd2]
[<ffffffffa04d2621>] ocfs2_commit_thread+0xa1/0x350 [ocfs2]
[<ffffffff8109b627>] kthread+0xc7/0xe0
[<ffffffff8160617f>] ret_from_fork+0x3f/0x70
[<ffffffff8109b560>] kthread+0x0/0xe0
[<ffffffffffffffff>] 0xffffffffffffffff
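One way to check on the hung node whether I/O is blocked below the file
system: pvmove suspends the LV's device-mapper device while it loads the
mirror table, and if a node is left with the device suspended, all I/O
(including the journal flush above) queues forever. The state can be read
with dmsetup (the dm name is the VG/LV name with hyphens doubled):
sles12sp3r1-nd3:/ # dmsetup info cluster--vg2-test--lv
If the State line reports SUSPENDED instead of ACTIVE, the flush is stuck
below the file system, not in OCFS2 itself.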
So, I want to confirm: is online pvmove supported by LVM 2.02.120(2)
(2015-05-15)?
If yes, how should I debug this bug? It looks as if the OCFS2 journal thread
cannot flush data to the underlying LVM disk.
Thanks
Gang