Discussion:
[linux-lvm] Why LVM metadata locations are not properly aligned
Ming-Hung Tsai
2016-04-21 04:08:55 UTC
Hi,

I'm trying to find any opportunity to accelerate LVM metadata IO, in order to
take lvm-thin snapshots in a very short time. My scenario is connecting
lvm-thin volumes to a Windows host, then taking snapshots of those volumes for
Windows VSS (Volume Shadow Copy Service). Since Windows VSS can only suspend
IO for 10 seconds, LVM must finish taking snapshots within that window.

However, it's hard to achieve that if the PV is busy running IO. The major
overhead is LVM metadata IO. There are some issues:

1. The metadata locations (raw_locn::offset) are not properly aligned.
Function _aligned_io() requires the IO to be logical-block aligned,
but the metadata locations returned by next_rlocn_offset() are only 512-byte
aligned. If a device's logical block size is greater than 512 bytes, LVM needs
to use a bounce buffer for the IO.
How about setting raw_locn::offset to a logical-block boundary?
(or max(logical_block_size, physical_block_size) for 512-byte logical-/4KB
physical-block drives?)
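(For reference, the logical and physical block sizes in question can be read
with blockdev; /dev/sdX below is just a placeholder:)

  blockdev --getss /dev/sdX    # logical block size in bytes
  blockdev --getpbsz /dev/sdX  # physical block size in bytes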

2. In most cases, the memory buffers passed to dev_read() and dev_write() are
not aligned (e.g., raw_read_mda_header(), _find_vg_rlocn()).

3. Why does LVM use such a complex process to update metadata?
There are three operations to update metadata: write, pre-commit, then commit.
Each operation requires one header read (raw_read_mda_header()),
one metadata check (_find_vg_rlocn()), and a metadata update via the bounce
buffer. So we need at least 9 reads and 3 writes for one PV.
Could we simplify that?

4. Commits fb003cdf & a3686986 cause an additional metadata read.
Could we improve that? (We already checked the metadata in _find_vg_rlocn().)

5. Feature request: could we take multiple snapshots in a batch, to reduce
the number of metadata IO operations?
e.g., lvcreate vg1/lv1 vg1/lv2 vg1/lv3 --snapshot
(I know that it would be trouble for the --addtag options...)

This post mentioned that lvresize will support resizing multiple volumes,
but I think that taking multiple snapshots is also helpful.
https://www.redhat.com/archives/linux-lvm/2016-February/msg00023.html
There is also some ongoing work on better lvresize support for more then 1
single LV. This will also implement better approach to resize of lvmetad
which is using different mechanism in kernel.
Possible IOCTL sequence:
dm-suspend origin0
dm-message create_snap 3 0
dm-message set_transaction_id 3 4
dm-resume origin0
dm-suspend origin1
dm-message create_snap 4 1
dm-message set_transaction_id 4 5
dm-resume origin1
dm-suspend origin2
dm-message create_snap 5 2
dm-message set_transaction_id 5 6
dm-resume origin2
...
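(For illustration, roughly the same sequence expressed as dmsetup commands;
the device names are placeholders, and the thin-pool messages go to the pool
device at sector 0:)

  dmsetup suspend vg1-origin0
  dmsetup message vg1-pool-tpool 0 "create_snap 3 0"
  dmsetup message vg1-pool-tpool 0 "set_transaction_id 3 4"
  dmsetup resume vg1-origin0
  # ...repeated per origin, as in the list above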

6. Is there any other way to accelerate LVM operations? I have enabled
lvmetad, and set global_filter and md_component_detection=0 in lvm.conf.
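(As a side note, the effective values of these settings can be checked with
lvmconfig - available in recent lvm2; older builds have 'lvm dumpconfig':)

  lvmconfig global/use_lvmetad
  lvmconfig devices/global_filter
  lvmconfig devices/md_component_detection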


Thanks,
Ming-Hung Tsai
Zdenek Kabelac
2016-04-21 09:54:49 UTC
Post by Ming-Hung Tsai
Hi,
I'm trying to find any opportunity to accelerate LVM metadata IO, in order to
take lvm-thin snapshots in a very short time. My scenario is connecting
lvm-thin volumes to a Windows host, then taking snapshots on those volumes for
Windows VSS (Volume Shadow Copy Service). Since that the Windows VSS can only
suspend IO for 10 seconds, LVM should finish taking snapshots within 10 seconds.
Hmm, do you observe that taking a snapshot takes more than a second?
IMHO the largest portion of the time should be the 'disk' synchronization
when suspending (full flush and fs sync).
Unless you have lvm2 metadata in the range of MiB (and lvm2 was not designed
for that) - you should be well below a second...
Post by Ming-Hung Tsai
However, it's hard to achieve that if the PV is busy running IO. The major
Changing the disk scheduler to deadline?
Lowering the percentage of dirty pages?
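(For illustration, the kind of knobs meant here - sdX and the values are
placeholders, not tuned recommendations:)

  echo deadline > /sys/block/sdX/queue/scheduler
  sysctl -w vm.dirty_ratio=10
  sysctl -w vm.dirty_background_ratio=5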
While your questions are valid points for discussion - you will save a couple
of disk reads - but this will not help your timing problem much if you have an
overloaded disk I/O system.
Note lvm2 is using direct I/O, which is your trouble-maker here, I guess...
Post by Ming-Hung Tsai
1. The metadata locations (raw_locn::offset) are not properly aligned.
Function _aligned_io() requires the IO to be logical-block aligned,
but metadata locations returned by next_rlocn_offset() are 512-byte aligned.
If a device's logical block size is greater than 512b, then LVM need to use
bounce buffer to do the IO.
How about setting raw_locn::offset to logical-block boundary?
(or max(logical_block_size, physical_block_size) for 512-byte logical-/4KB
physical-block drives?)
This looks like a bug - lvm2 should always start writing metadata at a
physical-block-aligned position.
Post by Ming-Hung Tsai
2. In most cases, the memory buffers passed to dev_read() and dev_write() are
not aligned. (e.g, raw_read_mda_header(), _find_vg_rlocn())
3. Why LVM uses such complex process to update metadata?
The are three operations to update metadata: write, pre-commit, then commit.
Each operation requires one header read (raw_read_mda_header),
one metadata checking (_find_vg_rlocn()), and metadata update via bounce
buffer. So we need at least 9 reads and 3 writes for one PV.
Could we simplify that?
It has already been simplified once ;) and we lost a quite important property:
validation of the written data during pre-commit - which is quite useful when
a user is running on a misconfigured multipath device...

Each state has its logic, and with each state we need to be sure the data are
there. This doesn't sound like a problem with a single PV - but in a server
world of many different kinds of misconfiguration and failing devices it may
be more important than you might think.

A valid idea might be to support a 'riskier' variant of metadata update, where
lvm2 might skip some of the disk safety checking, but then it may not catch
all the associated trouble - so you could run for days with a dm table you
will later not find in your lvm2 metadata....
Post by Ming-Hung Tsai
4. Commit fb003cdf & a3686986 causes additional metadata read.
Could we improve that? (We had checked the metadata in _find_vg_rlocn())
Fighting disk corruption and duplicates is a major topic in lvm2....
But ATM we are fishing for bigger fish :)
So yes, these optimizations are in the queue - but not as a top priority.
Post by Ming-Hung Tsai
5. Feature request: could we take multiple snapshots in a batch, to reduce
the number of metadata IO operations?
e.g., lvcraete vg1/lv1 vg1/lv2 vg1/lv3 --snapshot
(I know that it would be trouble for the --addtag options...)
Yes - another already existing and planned RFE, to have support for
atomic snapshots of multiple devices at once - it's in the queue.
Post by Ming-Hung Tsai
This post mentioned that lvresize will support resizing multiple volumes,
It's not about resizing multiple volumes with one command;
it's about resizing data & metadata in one command, via policy, more correctly.
Post by Ming-Hung Tsai
but I think that taking multiple snapshots is also helpful.
https://www.redhat.com/archives/linux-lvm/2016-February/msg00023.html
There is also some ongoing work on better lvresize support for more then 1
single LV. This will also implement better approach to resize of lvmetad
which is using different mechanism in kernel.
dm-suspend origin0
dm-message create_snap 3 0
dm-message set_transaction_id 3 4
Every transaction update here needs lvm2 metadata confirmation - i.e. a
double commit. lvm2 does not allow jumping by more than 1 transaction here,
and the error path also cleans up 1 transaction.
Post by Ming-Hung Tsai
dm-resume origin0
dm-suspend origin1
dm-message create_snap 4 1
dm-message set_transaction_id 4 5
dm-resume origin1
dm-suspend origin2
dm-message create_snap 5 2
dm-message set_transaction_id 5 6
dm-resume origin2
...
6. Is there any other way to accelerate LVM operation? I had enabled lvmetad,
setting global_filter and md_component_detection=0 in lvm.conf.
Reducing the number of PVs holding metadata, in case your VG has lots of PVs
(this may reduce metadata resilience if the PVs carrying it are lost...).

Filters are magic - try to accept only devices which are potential PVs and
reject everything else (by default every device is accepted and scanned...).

Disabling archiving & backup to the filesystem (in lvm.conf) may help a lot if
you run lots of lvm2 commands and you do not care about the archive.

Check that /etc/lvm/archive is not full of thousands of files.

Check with 'strace -tttt' what delays your command.
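(For example - the trace file path and the VG/LV names are placeholders:)

  ls /etc/lvm/archive | wc -l
  strace -tttt -o /tmp/lvcreate.trace lvcreate -s vg1/lv1 -n snap1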

And yes - there are always a couple of ongoing transmutations in lvm2 which
may have introduced some performance regression - so opening a BZ is always
useful if you spot such a thing.

Regards

Zdenek
Zdenek Kabelac
2016-04-21 13:22:26 UTC
Post by Zdenek Kabelac
Post by Ming-Hung Tsai
Hi,
1. The metadata locations (raw_locn::offset) are not properly aligned.
Function _aligned_io() requires the IO to be logical-block aligned,
but metadata locations returned by next_rlocn_offset() are 512-byte aligned.
If a device's logical block size is greater than 512b, then LVM need to use
bounce buffer to do the IO.
How about setting raw_locn::offset to logical-block boundary?
(or max(logical_block_size, physical_block_size) for 512-byte logical-/4KB
physical-block drives?)
This looks like a bug - lvm2 should start to write metadata always on physical
block aligned position.
Hi

I've opened RFE BZ for this one - https://bugzilla.redhat.com/1329234
It's not completely trivial to fix this in a backward-compatible way - but I'm
mostly 100% sure it's not the cause of your 10s delay.

Regards

Zdenek
Ming-Hung Tsai
2016-04-22 08:43:16 UTC
Post by Alasdair G Kergon
Post by Ming-Hung Tsai
However, it's hard to achieve that if the PV is busy running IO.
So flush your data in advance of running the snapshot commands so there is only
minimal data to sync during the snapshot process itself.
Post by Ming-Hung Tsai
The major overhead is LVM metadata IO.
Are you sure? That would be unusual. How many copies of the metadata have you
chosen to keep? (metadata/vgmetadatacopies) How big is this metadata? (E.g.
size of /etc/lvm/backup/<vgname> file.)
My configuration:
- Only one PV in the volume group
- A thin pool with several thin volumes
- The size of a metadata record is less than 16KB
- lvm.conf:
  metadata/vgmetadatacopies=1
  devices/md_component_detection=0, because it requires disk IO
    (the other filters are relatively fast)
  devices/global_filter=[ "a/md/", "r/.*/" ]
  backup/retain_days=0 and backup/retain_min=30, so there are at most
    30 backups

Even though there is no IO on the volume being snapshotted, the system is
still doing IO on other volumes, which increases the latency of the direct IOs
issued by LVM.
Post by Zdenek Kabelac
Hmm do you observe taking a snapshot takes more then a second ?
IMHO the largest portion of time should be the 'disk' synchronization
when suspending (full flush and fs sync)
Unless you have lvm2 metadata in range of MiB (and lvm2 was not designed for
that) - you should be well bellow a second...
you will save couple
disk reads - but this will not save your time problem a lot if you have
overloaded disk I/O system.
Note lvm2 is using direct I/O which is your trouble maker here I guess...
That's the point. I should not have said "LVM metadata IO is the overhead".
LVM simply suffers from the system load, so it cannot finish its metadata
direct IOs within seconds. I can try to manage data flushing and filesystem
sync before taking snapshots, but on the other hand, I wish to reduce the
number of IOs issued by LVM.
Post by Zdenek Kabelac
Changing disk scheduler to deadline ?
Lowering percentage of dirty-pages ?
In my previous testing on kernel 3.12, CFQ+ionice performed better than
deadline in this case, but now it seems that the schedulers for blk-mq are not
yet ready. I also tried using cgroups to do IO throttling when taking
snapshots. I can do some more testing.
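(For illustration only - not my exact commands; the PID, device numbers and
limit below are placeholders, and the blkio controller shown is cgroup v1:)

  ionice -c2 -n7 -p 1234      # deprioritize a heavy writer under CFQ
  mkdir -p /sys/fs/cgroup/blkio/heavy
  echo 1234 > /sys/fs/cgroup/blkio/heavy/cgroup.procs
  echo "8:16 10485760" > /sys/fs/cgroup/blkio/heavy/blkio.throttle.write_bps_device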
Post by Zdenek Kabelac
Post by Ming-Hung Tsai
3. Why LVM uses such complex process to update metadata?
It's been already simplified once ;) and we have lost quite important
property of validation of written data during pre-commit -
which is quite useful when user is running on misconfigured multipath device...
Each state has its logic and with each state we need to be sure data are
there.
The valid idea might be - to maybe support 'riskier' variant of metadata
update
I don't quite understand the purpose of pre-commit. Why not write the metadata
and then update the mda header immediately? Could you give me an example?
Post by Zdenek Kabelac
Post by Ming-Hung Tsai
5. Feature request: could we take multiple snapshots in a batch, to reduce
the number of metadata IO operations?
Every transaction update here - needs lvm2 metadata confirmation - i.e.
double-commit lvm2 does not allow to jump by more then 1 transaction here,
and the error path also cleans 1 transaction.
How about setting the snapshots up with the same transaction_id?

IOCTL sequence:
LVM commit metadata with queued create_snap messages
dm-suspend origin0
dm-message create_snap 3 0
dm-resume origin0
dm-suspend origin1
dm-message create_snap 4 1
dm-resume origin1
dm-message set_transaction_id 3 4
LVM commit metadata with updated transaction_id

Related post: https://www.redhat.com/archives/dm-devel/2016-March/msg00071.html
Post by Zdenek Kabelac
Post by Ming-Hung Tsai
6. Is there any other way to accelerate LVM operation?
Reducing number of PVs with metadata in case your VG has lots of PVs
(may reduce metadata resistance in case PVs with them are lost...)
There's only one PV in my case. For multiple-PV cases, I think I could
temporarily disable metadata writing on some PVs by setting --metadataignore.
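(For illustration - the device name is a placeholder:)

  pvchange --metadataignore y /dev/sdb1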
Post by Zdenek Kabelac
Filters are magic - try to accept only devices which are potential PVs and
reject everything else. (by default every device is accepted and scanned...)
One more question: why is the filter cache disabled when lvmetad is used?
(See the comment in init_filters(): "... Also avoid it when lvmetad is enabled.")
Thus LVM needs to check all the devices under /dev when it starts.

Alternatively, is there any way to let lvm_cache handle only some specific
devices, instead of checking the entire directory?
(e.g., allow devices/scan=["/dev/md[0-9]*"], to filter devices at an earlier
stage. The current strategy is to call dev_cache_add_dir("/dev") and then
check individual devices, which requires a lot of unnecessary stat() syscalls.)

There's also an undocumented configuration option, devices/loopfiles. It seems
to be for loop-device files.
Post by Zdenek Kabelac
Disabling archiving & backup in filesystem (in lvm.conf) may help a lot if
you run lots of lvm2 commands and you do not care about archive.
I know there's the -An option for lvcreate, but right now the system load and
direct IO are the main issue.
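(E.g., for a single command - the names are placeholders:)

  lvcreate -An -s vg1/lv1 -n snap1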


Thanks,
Ming-Hung Tsai
Zdenek Kabelac
2016-04-22 09:49:44 UTC
Post by Ming-Hung Tsai
Post by Alasdair G Kergon
Post by Ming-Hung Tsai
However, it's hard to achieve that if the PV is busy running IO.
So flush your data in advance of running the snapshot commands so there is only
minimal data to sync during the snapshot process itself.
Post by Ming-Hung Tsai
The major overhead is LVM metadata IO.
Note lvm2 is using direct I/O which is your trouble maker here I guess...
Post by Ming-Hung Tsai
That's the point. I should not say "LVM metadata IO is the overhead".
LVM just suffered from the system loading, so it cannot finish metadata
direct IOs within seconds. I can try to manage data flushing and filesystem sync
before taking snapshots, but on the other hand, I wish to reduce
the number of IOs issued by LVM.
Post by Zdenek Kabelac
Changing disk scheduler to deadline ?
Lowering percentage of dirty-pages ?
In my previous testing on kernel 3.12, CFQ+ionice performs better than
deadline in this case, but now it seems that the schedulers for blk-mq are not
yet ready.
I also tried to use cgroup to do IO throttling when taking snapshots.
I can do some more testing.
Yep - if a simple set of I/Os takes several seconds, it's not really
a problem lvm2 can solve.

You should consider lowering the amount of dirty pages so you are
not running the system with an extreme delay in the write queue.

The defaults allow something like 60% of RAM to be dirty, and if you have a
lot of RAM it may take quite a while to sync all of this to the device - and
that's what will happen on 'suspend'.

You may just try to measure it with a plain 'dmsetup suspend/resume'
on the device you want to snapshot, on your loaded hw.

An interesting thing to play with could be 'dmstats' (a relatively recent
addition) for tracking latencies and I/O load on disk areas...
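(For illustration - a crude way to time the suspend/resume and to try dmstats;
the device name is a placeholder and the dmstats invocation is only a sketch:)

  time dmsetup suspend vg1-origin0
  time dmsetup resume vg1-origin0
  dmstats create vg1-origin0
  dmstats report vg1-origin0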
Post by Ming-Hung Tsai
Post by Zdenek Kabelac
Post by Ming-Hung Tsai
3. Why LVM uses such complex process to update metadata?
It's been already simplified once ;) and we have lost quite important
property of validation of written data during pre-commit -
which is quite useful when user is running on misconfigured multipath device...
Each state has its logic and with each state we need to be sure data are
there.
The valid idea might be - to maybe support 'riskier' variant of metadata
update
I'm not well understand the purpose of pre-commit. Why not write the metadata
then update the mda header immediately?. Could you give me an example?
You need to see the 'command' and the 'activation/locking' parts as 2 different
entities/processes - which may not have any data in common.

The command knows the data and does some operation on them.

The locking code then only sees the data written on disk (plus a couple of
extra bits of passed info).

So in a cluster, one node runs the command and a different node might be
activating a device purely from the written metadata - having no structure in
common with the command code.
Now there are 'some' bypass code paths to avoid re-reading the info if it is a
single command that also does the locking part...

The 'magic' is the 'suspend' operation - which is the ONLY operation that
sees both 'committed' & 'pre-committed' metadata (lvm2 has 2 slots).
If anything fails in 'pre-commit' - the metadata are dropped
and the state remains 'committed'.
When the pre-commit suspend is successful - then we may commit and resume
with the now-committed metadata.
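(A rough sketch of the ordering, using the write/pre-commit/commit terms from
earlier in the thread - simplified, not the exact code path:)

  # write      - new metadata text goes into the free slot on each PV
  # pre-commit - the mda header marks that slot as 'pre-committed'
  # suspend    - the only state that may see both committed & pre-committed slots
  # commit     - a header update turns the pre-committed slot into the committed one
  # resume     - the device now runs against committed metadata
  # on failure before commit, the pre-committed slot is dropped and the
  # previously committed state stays in effect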

It's quite a complicated state machine with many constraints, and obviously
still with some bugs and tweaks.

Sometimes we miss some bits of information, and trying to remain
compatible makes it challenging....
Post by Ming-Hung Tsai
Post by Zdenek Kabelac
Post by Ming-Hung Tsai
5. Feature request: could we take multiple snapshots in a batch, to reduce
the number of metadata IO operations?
Every transaction update here - needs lvm2 metadata confirmation - i.e.
double-commit lvm2 does not allow to jump by more then 1 transaction here,
and the error path also cleans 1 transaction.
How about setting the snapshots with same transaction_id
Yes - that's how it will work - it's in the plan....
It's the error-path handling that needs some thinking.
First I want to improve the check for free space in the metadata to match the
kernel logic more closely..
Post by Ming-Hung Tsai
Post by Zdenek Kabelac
Filters are magic - try to accept only devices which are potential PVs and
reject everything else. (by default every device is accepted and scanned...)
One more question: Why the filter cache is disabled when using lvmetad?
(comments in init_filters(): "... Also avoid it when lvmetad is enabled.")
Thus LVM needs to check all the devices under /dev when it start.
lvmetad is only a "cache" for the metadata - however we do not treat lvmetad
as a trustworthy source of info, for many reasons - primarily because 'udevd'
is a toy-tool process with many unhandled corner cases - particularly whenever
you have duplicate/dead devices it becomes useless...

So the purpose is to avoid scanning for metadata - but whenever we write new
metadata we grab protecting locks and need to be sure there are no racing
commands - this can't be ensured by a udev-controlled lvmetad with completely
unpredictable update timing and synchronization
(udev has a built-in 30sec timeout for rule processing, which might be far too
small on a loaded system...).

In other words - 'lvmetad' is somewhat useful for 'lvs', but it cannot be
trusted for lvcreate/lvconvert...
Post by Ming-Hung Tsai
Alternatively, is there any way to let lvm_cache handles some specific
devices only, instead of check the entire directory?
(e.g, allow devices/scan=["/dev/md[0-9]*"], to filter devices at earlier
stage. The current strategy is calling dev_cache_add_dir("/dev"),
then checking individual devices, which requires a lot of unnecessary
stat() syscalls)
There's also an undocumented configuration devices/loopfiles. Seems for loop
loop device files.
It's always best to open an RHBZ for such items so they are not lost...
Post by Ming-Hung Tsai
Post by Zdenek Kabelac
Disabling archiving & backup in filesystem (in lvm.conf) may help a lot if
you run lots of lvm2 commands and you do not care about archive.
I know there's -An option in lvcreate, but now the system loading and direct IO
is the main issue.
Direct IO is mostly mandatory - since many caching layers these days may ruin
everything - e.g. using qemu over a SAN you may get completely unpredictable
races without direct IO.
But maybe supporting some 'untrustful' cached write might be usable for
some users... not sure - but I'd imagine an lvm.conf option for this.
Such an lvm2 would then not be supportable for customers...
(so we would need to track that the user has been using such an option...)

Regards

Zdenek

Alasdair G Kergon
2016-04-21 10:11:53 UTC
Post by Ming-Hung Tsai
However, it's hard to achieve that if the PV is busy running IO.
So flush your data in advance of running the snapshot commands so there is only
minimal data to sync during the snapshot process itself.
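(For example, something as simple as this just before the snapshot command -
purely illustrative:)

  sync                                     # push dirty data out beforehand
  grep -E 'Dirty|Writeback' /proc/meminfo  # confirm the backlog is small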
Post by Ming-Hung Tsai
The major overhead is LVM metadata IO.
Are you sure? That would be unusual. How many copies of the metadata have you
chosen to keep? (metadata/vgmetadatacopies) How big is this metadata? (E.g.
size of /etc/lvm/backup/<vgname> file.)
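(For example - vg1 is a placeholder:)

  vgs -o vg_name,vg_mda_count,vg_mda_copies vg1
  ls -lh /etc/lvm/backup/vg1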

Alasdair