Discussion:
[linux-lvm] pvmove speed
Roy Sigurd Karlsbakk
2017-02-11 08:59:02 UTC
Hi all

I'm doing a pvmove of some rather large volumes from a Dell EqualLogic system to a Dell Compellent. Both are connected over iSCSI to VMware, and the guest OS seems to handle this well, but it's slow. I get around 50MB/s at most, even though the EqualLogic is connected at 4x1Gbps and the Compellent is on 10Gbps. Without multipath, this should give me 100MB/s or so, but I get half of that. Interestingly, the "utilisation" reported by munin shows me a 100% total across the two devices combined, as in https://karlsbakk.net/tmp/pvmove-dev-util.png

Any idea why this utilisation is so high, and what I can do to speed this up? I'm looking at around 14 days to move these 40TiB or so, and I'd like to reduce that if possible. The backend is obviously not the problem here.
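(For reference, munin's per-device "utilisation" is the same %util that iostat reports, so a quick way to watch the same numbers during the pvmove would be something like the following; the device names are from my setup and may differ on yours:)

    # extended per-device stats in MB, 5-second intervals;
    # %util near 100 means the device (or its path) has I/O
    # outstanding nearly all the time, i.e. it's the saturated leg
    iostat -xmd 5 sdc sde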

Vennlig hilsen / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 98013356
http://blogg.karlsbakk.net/
GPG Public key: http://karlsbakk.net/roysigurdkarlsbakk.pubkey.txt
--
Da mihi sis bubulae frustrum assae, solana tuberosa in modo Gallico fricta, ac quassum lactatum coagulatum crassum. Quod me nutrit me destruit.
Roy Sigurd Karlsbakk
2017-02-16 18:22:53 UTC
Post by Roy Sigurd Karlsbakk
I'm doing a pvmove of some rather large volumes from a Dell EqualLogic system to
a Dell Compellent. Both are connected over iSCSI to VMware, and the guest OS
seems to handle this well, but it's slow. I get around 50MB/s at most, even
though the EqualLogic is connected at 4x1Gbps and the Compellent is on 10Gbps.
Without multipath, this should give me 100MB/s or so, but I get half of that.
Interestingly, the "utilisation" reported by munin shows me a 100% total across
the two devices combined, as in https://karlsbakk.net/tmp/pvmove-dev-util.png
Any idea why this utilisation is so high, and what I can do to speed this up?
I'm looking at around 14 days to move these 40TiB or so, and I'd like to reduce
that if possible. The backend is obviously not the problem here.
Anyone with an idea on this? I just don't think it makes sense...

Vennlig hilsen / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 98013356
http://blogg.karlsbakk.net/
GPG Public key: http://karlsbakk.net/roysigurdkarlsbakk.pubkey.txt
--
Da mihi sis bubulae frustrum assae, solana tuberosa in modo Gallico fricta, ac quassum lactatum coagulatum crassum. Quod me nutrit me destruit.
L A Walsh
2017-02-17 00:44:36 UTC
Post by Roy Sigurd Karlsbakk
I'm doing pvmove of some rather large volumes from a Dell Equallogic system to
Dell Compellent. Both are connected on iSCSI ....
----
I've never had very great speeds over a network. I've gotten the
impression that iSCSI is slower than some other network protocols.

Locally (RAID=>RAID) I got about 400-500MB/s, but the best I've
gotten, recently, over a 10Gb network card has been about 200MB/s.
Oddly, when I first got the cards, I was getting up to 400-600MB/s,
but after MS started pushing Win10 and "updates" to Win7 (my
communication has been between Win7SP1<->linux server), my speed
dropped to barely over 100MB/s, which is about what I got with a 1Gb
card. I wasn't able to get any better speeds using the *windows*
single-threaded SMB protocol even using 2x10Gb (I have a dedicated
link between workstation and server) -- but I did notice the CPU
maxing out on either the Windows or the Samba side depending on
packet size and who was doing the sending.

50MB/s sounds awfully slow, but not out of the ballpark -- I had
benched a few NAS solutions at home, but could rarely get above
10MB/s (usually slower), so I gave up on those and went with a linux
server -- but that's still a lot slower than I'd like (100-200MB/s
sustained, but those figures may change with the next MS "update").
So I gave up on commercial, out-of-the-box solutions, and the 4x1Gb
connect you have may be costing you more CPU than it's worth... The
problem I noted on 2x10G was too many duplicate packets -- so I'm
running 1x10Gb now but still maxing out around 200MB/s over an
unencrypted SMB/CIFS session.

I'm not sure it could be an LVM problem given its local speed for
pvmoves -- do you have some measurement of faster file I/O throughput
using iSCSI over your connections?
p***@yahoo.com
2017-02-17 01:24:11 UTC
Since they're both Dell products, the smartest method is to do array-to-array snapshot replication, then at some point quiesce the source and do the final sync. pvmove has an implicit transaction size, which is probably too small for good efficiency.

Unless you are running parallel TCP copy threads, more links don't help.
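(If it helps, a quick way to check how many iSCSI sessions and paths are actually carrying traffic -- these are the standard open-iscsi/multipath-tools/sysstat commands, adjust for your distro:)

    # one session per portal/initiator-NIC pair
    iscsiadm -m session
    # multipath topology: which paths are active vs. merely enabled
    multipath -ll
    # per-NIC throughput, to see if the copy rides one link or all
    sar -n DEV 5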
Roy Sigurd Karlsbakk
2017-02-17 15:12:09 UTC
I get 200MB/s on a cold day and 500 on a nice one from this SAN. This 40-50MB/s really does seem to be a limit of pvmove alone. Nothing else shows this limit. That's why I'm asking.

Vennlig hilsen / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 98013356
http://blogg.karlsbakk.net/
GPG Public key: http://karlsbakk.net/roysigurdkarlsbakk.pubkey.txt
--
Da mihi sis bubulae frustrum assae, solana tuberosa in modo Gallico fricta, ac quassum lactatum coagulatum crassum. Quod me nutrit me destruit.

John Stoffel
2017-02-17 17:01:51 UTC
Roy> I get 200MB on a cold day and 500 on a nice one from this
Roy> SAN. These 40-50MB/s seem really being limited by pvmove
Roy> alone. Nothing else shows this limit. That's why I'm asking.
Roy> Vennlig hilsen / Best regards

Is any one CPU pegged on your server running the pvmove? It's
probably single threaded, and if it's a single LV you're moving, that
might explain the problem.

It might also be that even though you have 4 x 1gb to one server,
since it's the same IP pair on each end, it's not really spreading the
load across all the links.

It might be faster to do a clone on the source side, and then use dd
to copy the data from the quiet clone to the destination. Then mount
the destination and use rsync to bring them up to sync.
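(A rough sketch of that approach -- the VG/LV names and mount points here are made up, and you'd want to triple-check the target device before writing to it:)

    # copy the quiesced clone to the new LUN with direct I/O
    dd if=/dev/vg_old/archive_snap of=/dev/vg_new/archive \
       bs=16M iflag=direct oflag=direct status=progress
    # then mount both sides and catch up whatever changed since
    mount /dev/vg_new/archive /mnt/new
    rsync -aHAX --delete /mnt/old/ /mnt/new/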

But it would be helpful to see more details about the OS, etc.

John
Roy Sigurd Karlsbakk
2017-02-17 17:33:31 UTC
Post by John Stoffel
Roy> I get 200MB on a cold day and 500 on a nice one from this
Roy> SAN. These 40-50MB/s seem really being limited by pvmove
Roy> alone. Nothing else shows this limit. That's why I'm asking.
Roy> Vennlig hilsen / Best regards
Is any one CPU pegged on your server running the pvmove? It's
probably single threaded, and if it's a single LV you're moving, that
might explain the problem.
No. The CPU is hardly in use at all, and what little there is, is spread across all four vcores. It's running on a very low-traffic host.
Post by John Stoffel
It might also be that even though you have 4 x 1gb to one server,
since it's the same IP pair on each end, it's not really spreading the
load across all the links.
That would make sense if I got around 100MB/s from it, but I get a *total* of 100MB/s, read and write combined. The host machine is connected at 10Gbps, the new storage, Dell Compellent, is on 10Gbps; everything, really, is 10Gbps except the old EqualLogic stuff, which is on 4x1Gbps. For utilisation and bandwidth, see the munin graphs linked earlier. A pvmove /dev/sdc /dev/sde is currently running.
Post by John Stoffel
It might be faster to do a clone on the source side, and then use DD
to copy the data from the quiet clone to the destination. Then mount
the destination and use rsync to bring them up to sync.
That's not practically possible, since the amount of data is too large. Hence pvmove, which allows for concurrent use.
Post by John Stoffel
But it would be helpful to see more details about the OS, etc.
There's not much to say - standard RHEL7 with some bits from EPEL, but basically just the normal bits. The system is running on an ESXi 6.5 host. Old storage is raw device mappings to an EqualLogic box (or two); new storage is VMFS 6. I really can't see any issues here, except the fact that the utilisation of sdc + sde == 100% and that this is constant.

So - any ideas?

roy
L A Walsh
2017-02-17 23:00:25 UTC
Post by Roy Sigurd Karlsbakk
I get 200MB on a cold day and 500 on a nice one from this SAN. These 40-50MB/s seem really being limited by pvmove alone. Nothing else shows this limit. That's why I'm asking.
---
200-500... impressive for a SAN... but considering the bandwidth
you have to the box (4x1+10), I'd hope for at least 200 (what I get
with just a 10)... so there must be some parallel TCP channels in
there... heh. What showed those speeds? I'm _guessing_, but it's
likely that pvmove is single threaded, so it could be related to the
I/O transfer size, as @pattonme was touching on, since multi-threaded
I/O can slow things down for local I/O when local I/O is in the
.5-1GB/s and higher range.

Curious -- do you know your network cards' MTU size?
I know that even with 1Gb cards I got a 2-4X speed improvement over
standard 1500-byte packets (I run 9000/9014-byte MTUs over the local net).
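(Quick to check, and to verify end-to-end, with something like the following -- the interface name is just an example:)

    # show current MTU on the storage-facing interface
    ip link show dev eth1
    # set jumbo frames (must match the switch and target ports too)
    ip link set dev eth1 mtu 9000
    # verify: 8972 = 9000 minus 28 bytes of IP+ICMP headers;
    # -M do forbids fragmentation, so oversize frames fail loudly
    ping -M do -s 8972 <target-ip>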

I'd have to look at the source for more info...

-l
Roy Sigurd Karlsbakk
2017-02-18 08:12:01 UTC
Post by L A Walsh
200-500... impressive for a SAN... but considering the bandwidth
you have to the box (4x1+10), I'd hope for at least 200 (what I get
with just a 10)... so there must be some parallel TCP channels in
there... heh. What showed those speeds? I'm _guessing_, but it's
likely that pvmove is single threaded, so it could be related to the
I/O transfer size, as @pattonme was touching on, since multi-threaded
I/O can slow things down for local I/O when local I/O is in the
.5-1GB/s and higher range.
Well, it’s a Compellent thing with net storage of around 350TiB, of which around 20TiB is on SSDs, so really, it should be good. Tiering is turned off during the migration of this data, though (that is, we’re migrating the data directly to a low tier, since it’s 40TiB worth of archives).
Post by L A Walsh
Curious -- do you know your network cards' MTU size?
I know that even with 1Gb cards I got a 2-4X speed improvement over
standard 1500-byte packets (I run 9000/9014-byte MTUs over the local net).
Everything’s using jumbo frames (9000) on the SAN (storage network), and it’s a dedicated network with its own switches and copper/fiber. The rest of the system works well (at least the Compellent things; the EqualLogic is having a bad nervous breakdown on its way to the cemetery, but that’s another story). The Exchange servers running off it gave us around 400MB/s (that is, wire speed) during the last backup. That wasn’t raw I/O from VMware, and this is, but then again, I should at least be able to sustain a gigabit link (the EQL storage is hardly in use anymore; perhaps that’s why it’s depressed), and as shown, I’m limited to around half of that.

Vennlig hilsen / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 98013356
http://blogg.karlsbakk.net/
GPG Public key: http://karlsbakk.net/roysigurdkarlsbakk.pubkey.txt
--
Da mihi sis bubulae frustrum assae, solana tuberosa in modo Gallico fricta, ac quassum lactatum coagulatum crassum. Quod me nutrit me destruit.
Mark Mielke
2017-02-18 16:55:43 UTC
One aspect that has confused me in this discussion, that I was hoping
somebody would address...

I believe I have seen slower than expected pvmove times in the past (but I
only rarely do it, so it has never particularly concerned me). When I saw
it, my first assumption was that the pvmove had to be done "carefully" to
ensure that every segment was safely moved in such a way that it was
definitely in one place, or definitely in the other, and not "neither" or
"both". This is particularly important if the volume is mounted, and is
being actively used, which was my case.

Would these safety checks not reduce overall performance? Sure, it would
transfer one segment at full speed, but then it might pause to do some
book-keeping, making sure to fully sync the data and metadata out to both
physical volumes and ensure that it was still crash-safe?
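(For anyone curious, the temporary mirror that pvmove sets up is visible to device-mapper while the move runs, so this book-keeping can be watched from outside -- a sketch, exact output varies with kernel and lvm2 version:)

    # the pvmove LV shows up as a dm device using the mirror target
    dmsetup table | grep pvmove
    # the status lines include a synced/total region counter
    watch -n5 'dmsetup status | grep pvmove'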

For SAN speeds - I don't think LVM has ever been proven to be a bottleneck
for me. On our new OpenStack cluster, I am seeing 550+ MByte/s with iSCSI
backed disks, and 700+ MByte/s with NFS backed disks (with read and write
caches disabled). I don't even look at LVM as a cause of concern here as
there is usually something else at play. In fact, on the same OpenStack
cluster, I am using LVM on NVMe drives, with an XFS LV to back the QCOW2
images, and I can get 2,000+ MByte/s sustained with this setup. Again, LVM
isn't even a performance consideration for me.
--
Mark Mielke <***@gmail.com>
Zdenek Kabelac
2017-02-20 09:59:26 UTC
So let's recap some facts first:

lvm2 is NOT doing any device I/O itself - all lvm2 does is manage dm
tables and keep the metadata for them in sync.
So it's always some 'dm' device that does the actual work.

For pvmove there is currently the somewhat 'oldish' dm mirror target
(see 'dmsetup targets' for the available ones).
Once it becomes possible, lvm2 will switch to using the 'raid' target,
which might provide slightly better speed for some tasks.

There is a 'known' issue with the old mirror target and smaller region
sizes when there are parallel reads & writes into a mirror - this has
not yet been fully addressed, but if the devices in the mirror have
'bigger' latencies, using a bigger region size does help to increase
throughput.
(In simple words - a bigger --regionsize means fewer commit points.)
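(If I read this right, pvmove takes its region size from the configuration rather than from a command-line flag, so the tuning would look something like the following -- the 2048 value is just an illustration, and whether the option is named mirror_region_size in your lvm2 version is something to verify against your own lvm.conf:)

    # /etc/lvm/lvm.conf
    activation {
        # region size in KiB for mirror copies, including the
        # temporary pvmove mirror; bigger regions mean fewer
        # sync-bitmap commit points
        mirror_region_size = 2048
    }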


However this is likely not the case here - all devices are supposedly
very fast and attached over a hyperfast network.

When looking at this graph: https://karlsbakk.net/tmp/pvmove-dev-util.png
it is striking that the initial couple of hours ran fine, but after
a while the 'controller' started to prefer /dev/sdd over /dev/sde, and the
usage is mostly 'mirrored'.

So my question would be - how well does the controller hold up over a longer
period of sustained load?
To me this looks more like a 'driver' issue for this iSCSI hardware blackbox.

Could you also try the same load with 'dd'?

i.e. running 'dd' for half a day, to see whether the performance starts
to drop the way it can be observed to do with pvmove?

The dm mirror target basically just uses the kernel kcopyd thread to copy
device 'A' to device 'B', plus it maintains a sync bitmap (a bit of a
slowdown factor). So in theory it should work just like 'dd'. For 'dd'
you could, however, configure better options for direct I/O and buffer
sizes.
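(A sketch of such a test -- read-only against the source LUN so nothing gets overwritten, using the device named earlier in this thread:)

    # sustained sequential read from the EqualLogic LUN with direct
    # I/O and big blocks; leave it running and watch whether the
    # reported MB/s degrades over time the way the pvmove does
    dd if=/dev/sdc of=/dev/null bs=16M iflag=direct status=progress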


Regards

Zdenek
Roy Sigurd Karlsbakk
2017-03-31 16:27:53 UTC
Just to cap this up: we kept on using pvmove for the disks where we couldn't vMotion the data (that is, on raw devices), or where we set up new things and rsynced the data over. This took some time for the large servers (the 45TiB machine spent three weeks or more on this), but the data was moved, and now, a few weeks later, no issues have turned up. It seems that although pvmove may be slow compared to the hardware, at least it works flawlessly.

Thanks

roy
--
Roy Sigurd Karlsbakk
(+47) 98013356
http://blogg.karlsbakk.net/
GPG Public key: http://karlsbakk.net/roysigurdkarlsbakk.pubkey.txt
--
Da mihi sis bubulae frustrum assae, solana tuberosa in modo Gallico fricta, ac quassum lactatum coagulatum crassum. Quod me nutrit me destruit.