Discussion:
[linux-lvm] Higher than expected metadata usage?
Gionatan Danti
2018-03-27 07:44:22 UTC
Hi all,
I can't wrap my head around the following reported data vs. metadata usage
before/after a snapshot deletion.

System is an updated CentOS 7.4 x64

BEFORE SNAP DEL:
[root@ ~]# lvs
  LV           VG         Attr       LSize  Pool         Origin  Data%  Meta% Move Log Cpy%Sync Convert
  000-ThinPool vg_storage twi-aot---  7.21t                      80.26  56.88
  Storage      vg_storage Vwi-aot---  7.10t 000-ThinPool         76.13
  ZZZSnap      vg_storage Vwi---t--k  7.10t 000-ThinPool Storage

As you can see, a ~80% full data pool resulted in ~57% metadata usage.

AFTER SNAP DEL:
[root@ ~]# lvremove vg_storage/ZZZSnap
Logical volume "ZZZSnap" successfully removed
[root@ ~]# lvs
  LV           VG         Attr       LSize  Pool         Origin  Data%  Meta% Move Log Cpy%Sync Convert
  000-ThinPool vg_storage twi-aot---  7.21t                      74.95  36.94
  Storage      vg_storage Vwi-aot---  7.10t 000-ThinPool         76.13

Now data is at ~75% (5 points lower), but metadata is at only ~37%: a
whopping 20-point metadata drop for a mere 5% of data freed.

This was unexpected: I thought there was a more or less linear relation
between data and metadata usage since, after all, the former is just
allocated chunks tracked by the latter. I know that snapshots put
additional overhead on metadata tracking, but based on previous tests I
expected this overhead to be much smaller. In this case we are talking
about a 4X amplification for a single snapshot. This is concerning
because I want to *never* run out of metadata space.
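
As a safety net I am thinking about the stock dmeventd autoextend knobs;
a minimal sketch of what I have in mind, assuming the usual lvm.conf
option names and purely illustrative thresholds (as far as I understand
the same policy also grows the metadata LV on current lvm2, but that is
to be verified):

# /etc/lvm/lvm.conf (activation section)
activation {
        monitoring = 1
        # grow the pool by 20% whenever usage crosses 80%
        thin_pool_autoextend_threshold = 80
        thin_pool_autoextend_percent = 20
}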

If it can help: just after taking the snapshot I sparsified some files on
the mounted filesystem, *without* running fstrim on it (so, from the
lvmthin standpoint, nothing changed in chunk allocation).

What am I missing? Is the "Data%" field a measure of how many data
chunks are allocated, or does it also track "how full" these data chunks
are? The latter would benignly explain the observed discrepancy, as
partially-full data chunks can be used to store other data without any
new metadata allocation.

Full LVM information:

[root@ ~]# lvs -a -o +chunk_size
  LV                   VG         Attr       LSize   Pool         Origin Data%  Meta% Move Log Cpy%Sync Convert Chunk
  000-ThinPool         vg_storage twi-aot---   7.21t                     74.95  36.94                           4.00m
  [000-ThinPool_tdata] vg_storage Twi-ao----   7.21t                                                                0
  [000-ThinPool_tmeta] vg_storage ewi-ao---- 116.00m                                                                0
  Storage              vg_storage Vwi-aot---   7.10t 000-ThinPool        76.13                                      0
  [lvol0_pmspare]      vg_storage ewi------- 116.00m                                                                0

Thanks.
--
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: ***@assyoma.it - ***@assyoma.it
GPG public key ID: FF5F32A8
Zdenek Kabelac
2018-03-27 08:30:17 UTC
Post by Gionatan Danti
Hi all,
I can't wrap my head around the following reported data vs. metadata usage
before/after a snapshot deletion.
System is an updated CentOS 7.4 x64
  LV           VG         Attr       LSize  Pool         Origin  Data% Meta%
Move Log Cpy%Sync Convert
  000-ThinPool vg_storage twi-aot---  7.21t                      80.26 56.88
  Storage      vg_storage Vwi-aot---  7.10t 000-ThinPool         76.13
  ZZZSnap      vg_storage Vwi---t--k  7.10t 000-ThinPool Storage
As you can see, a ~80% full data pool resulted in a ~57% metadata usage
  Logical volume "ZZZSnap" successfully removed
  LV           VG         Attr       LSize  Pool         Origin Data% Meta%
Move Log Cpy%Sync Convert
  000-ThinPool vg_storage twi-aot---  7.21t                     74.95 36.94
  Storage      vg_storage Vwi-aot---  7.10t 000-ThinPool        76.13
Now data is at ~75 (5% lower), but metadata is at only ~37%: a whopping 20%
metadata difference for a mere 5% data freed.
This was unexpected: I thought there was a more or less linear relation
between data and metadata usage as, after all, the first is about allocated
chunks tracked by the latter. I know that snapshots pose additional overhead
on metadata tracking, but based on previous tests I expected this overhead to
be much smaller. In this case, we are speaking about a 4X amplification for a
single snapshot. This is concerning because I want to *never* run out of
metadata space.
If it can help, just after taking the snapshot I sparsified some file on the
mounted filesystem, *without* fstrimming it (so, from lvmthin standpoint,
nothing changed on chunk allocation).
What am I missing? Is the "data%" field a measure of how many data chunks are
allocated, or does it even track "how full" are these data chunks? This would
benignly explain the observed discrepancy, as partially-full data chunks can
be used to store other data without any new metadata allocation.
  LV                   VG         Attr       LSize   Pool         Origin Data%  Meta% Move Log Cpy%Sync Convert Chunk
  000-ThinPool         vg_storage twi-aot---   7.21t                     74.95  36.94                           4.00m
  [000-ThinPool_tdata] vg_storage Twi-ao----   7.21t                                                                0
  [000-ThinPool_tmeta] vg_storage ewi-ao---- 116.00m                                                                0
  Storage              vg_storage Vwi-aot---   7.10t 000-ThinPool        76.13                                      0
  [lvol0_pmspare]      vg_storage ewi------- 116.00m                                                                0
Hi

Well, just at first look: 116MB of metadata for 7.21TB is a *VERY* small
size. I'm not sure what the data 'chunk-size' is, but sooner or later you
will need to extend the pool's metadata considerably - I'd suggest at least
2-4GB for this data size range.
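
Something along these lines should do it once you have free extents in the
VG (the +2G figure is just an example):

[root@ ~]# lvs -a -o lv_name,lv_size,data_percent,metadata_percent vg_storage
[root@ ~]# lvextend --poolmetadatasize +2G vg_storage/000-ThinPool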

Metadata itself is also allocated in internal chunks - so releasing a thin
volume doesn't necessarily free whole metadata chunks; such chunks remain
allocated, and there is no more detailed free-space tracking, since the
space within chunks is shared between multiple thin volumes and is tied to
efficient storage of the b-trees...

There is no 'direct' connection between releasing space in the data and
metadata volumes - so it's quite natural that you will see different
free-space percentages in those two volumes after a thin volume removal.

The only problem would be if repeated operations led to some permanent
growth...

Regards

Zdenek
Gionatan Danti
2018-03-27 09:40:55 UTC
Hi
Well, just at first look: 116MB of metadata for 7.21TB is a *VERY* small
size. I'm not sure what the data 'chunk-size' is, but sooner or later you
will need to extend the pool's metadata considerably - I'd suggest at least
2-4GB for this data size range.
Hi Zdenek,
as shown by the last lvs command, the data chunk size is 4MB. Data chunk
size and metadata volume size were automatically selected at thin pool
creation - i.e. they are the default values.

Indeed, running "thin_metadata_size -b4m -s7t -m1000 -um" shows
"thin_metadata_size - 60.80 mebibytes estimated metadata area size"
Metadata itself is also allocated in internal chunks - so releasing a thin
volume doesn't necessarily free whole metadata chunks; such chunks remain
allocated, and there is no more detailed free-space tracking, since the
space within chunks is shared between multiple thin volumes and is tied to
efficient storage of the b-trees...
Ok, so removing a snapshot/volume can free a smaller-than-expected amount
of metadata. I fully understand that. However, I saw the *reverse*:
removing a volume shrank metadata usage (much) more than expected. This
also means that snapshot creation and data writes on the main volume
caused a *much* larger than expected increase in metadata usage.
There is no 'direct' connection between releasing space in the data and
metadata volumes - so it's quite natural that you will see different
free-space percentages in those two volumes after a thin volume removal.
I understand that if data is shared between two or more volumes, deleting
one volume will not change much from a metadata standpoint. However, this
is true for the data pool as well: it will continue to show the same
utilization. After all, removing a volume with shared chunks only means
those data chunks remain mapped in another volume.

However, I was under the impression that a more or less direct connection
between allocated pool data chunks and metadata existed: otherwise, a tool
such as thin_metadata_size would lose its purpose.

So, where am I wrong?

Thanks.
--
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: ***@assyoma.it - ***@assyoma.it
GPG public key ID: FF5F32A8
Zdenek Kabelac
2018-03-27 10:18:08 UTC
Post by Gionatan Danti
Hi
Well just for the 1st. look -  116MB for metadata for 7.21TB is *VERY* small
size. I'm not sure what is the data 'chunk-size'  - but you will need to
extend pool's metadata sooner or later considerably - I'd suggest at least
2-4GB for this data size range.
Hi Zdenek,
as shown by the last lvs command, the data chunk size is 4MB. Data chunk size
and metadata volume size were automatically selected at thin pool creation -
i.e. they are the default values.
Indeed, running "thin_metadata_size -b4m -s7t -m1000 -um" shows
"thin_metadata_size - 60.80 mebibytes estimated metadata area size"
Metadata itself are also allocated in some internal chunks - so releasing a
thin-volume doesn't necessarily free space in the whole metadata chunks thus
such chunk remains allocated and there is not a more detailed free-space
tracking as space in chunks is shared between multiple thin volumes and is
related to efficient storage of b-Trees...
Ok, so removing a snapshot/volume can free a smaller-than-expected amount of
metadata. I fully understand that. However, I saw the *reverse*: removing a
volume shrank metadata usage (much) more than expected. This also means that
snapshot creation and data writes on the main volume caused a *much* larger
than expected increase in metadata usage.
As said - the 'metadata' usage is chunk-based and journal-driven (i.e. there
is never an in-place overwrite of valid data) - so the storage pattern always
depends on the existing layout and its transition to the new state.
Post by Gionatan Danti
There is no 'direct' connection between releasing space in data and metadata
volume - so it's quite natural you will see different percentage of free
space after thin volume removal between those two volumes.
I understand that if data is shared between two or more volumes, deleting one
volume will not change much from a metadata standpoint. However, this is true
for the data pool as well: it will continue to show the same utilization.
After all, removing a volume with shared chunks only means those data chunks
remain mapped in another volume.
However, I was under the impression that a more or less direct connection
between allocated pool data chunks and metadata existed: otherwise, a tool
such as thin_metadata_size would lose its purpose.
So, where am I wrong?
The size-estimation tool gives only a 'rough' first-guess / first-choice number.

The metadata usage comes from real-world data manipulation - so while it's
relatively easy to 'cap' the metadata usage of a single thin LV, once there
is a lot of sharing between many different volumes the exact size estimation
is difficult, as it depends on the order in which the 'btree' has been
constructed.

It is surely true that e.g. defragmentation of the thin-pool could give you a
more compact tree consuming less space - but the amount of work needed to get
the thin-pool into the most optimal configuration doesn't pay off. So you need
to live with cases where the metadata usage behaves in a somewhat unpredictable
manner - speed is preferred over the smallest possible consumption, which could
be very pricey in terms of CPU and memory usage.

So, as has been said - metadata is 'accounted' in chunks for a userspace app
(like lvm2, or what you get with 'dmsetup status') - but how much free space
is left inside these individual chunks is kernel-internal...
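
If you want to see the raw counters lvm2 works from, you can query the pool's
own dm status line (device name assumed from how lvm2 usually names the hidden
-tpool device for this VG/pool):

[root@ ~]# dmsetup status vg_storage-000--ThinPool-tpool

Among other fields it reports <used metadata blocks>/<total metadata blocks>
and <used data blocks>/<total data blocks> - the same counters lvs turns into
Meta% and Data%.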

Time to move on - you are addressing 7TB and 'extremely' caring about a couple
of MB. Hint here: try to investigate how much space is wasted in the filesystem
itself ;)



Regards

Zdenek
Gionatan Danti
2018-03-27 10:58:40 UTC
Post by Zdenek Kabelac
Tool for size estimation is giving some 'rough' first guess/first choice number.
The metadata usage comes from real-world data manipulation - so while it's
relatively easy to 'cap' the metadata usage of a single thin LV, once there
is a lot of sharing between many different volumes the exact size estimation
is difficult, as it depends on the order in which the 'btree' has been
constructed.
I.e. it is surely true the i.e. defragmentation of thin-pool may give
you a more compact tree consuming less space - but the amount of work
needed to get thin-pool into the most optimal configuration doesn't pay
off.  So you need to live with cases, where the metadata usage behaves
in a bit unpredictable manner - since it's more preferred speed over the
smallest consumed space - which could be very pricey in terms of CPU and
memory usage.
So as it has been said - metadata is 'accounted' in chunks for a
userspace app (like lvm2 is or what you get with 'dmsetup status') - but
how much free space is left in these individual chunks is kernel
internal...
Ok, understood.
Post by Zdenek Kabelac
It's time to move on, you address 7TB and you 'extremely' care about
couple MB 'hint here' - try to investigate how much space is wasted in
filesystem itself ;)
Mmm no, I am caring for the couple MBs themselves. I was concerned about
the possibility to get a full metadata device by writing far less data
than expected. But I now get the point.

Thanks.
--
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: ***@assyoma.it - ***@assyoma.it
GPG public key ID: FF5F32A8
Gionatan Danti
2018-03-27 11:06:19 UTC
Post by Gionatan Danti
Mmm no, I am caring for the couple MBs themselves. I was concerned about
the possibility to get a full metadata device by writing far less data
than expected. But I now get the point.
Sorry, I really meant "I am NOT caring for the couple MBs themselves"
--
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: ***@assyoma.it - ***@assyoma.it
GPG public key ID: FF5F32A8
Zdenek Kabelac
2018-03-27 10:39:35 UTC
Post by Gionatan Danti
What am I missing? Is the "data%" field a measure of how many data chunks are
allocated, or does it even track "how full" are these data chunks? This would
benignly explain the observed discrepancy, as partially-full data chunks can
be used to store other data without any new metadata allocation.
Hi

I forgot to mention there is a "thin_ls" tool (it comes with the
device-mapper-persistent-data package, together with thin_check) for those
who want to know the precise amount of allocation, and what amount of blocks
is owned exclusively by a single thin LV and what is shared.

It's worth noting that the numbers printed by 'lvs' are *JUST* really rough
estimations of data usage, for both the thin pool and the thin volumes.

The kernel is not maintaining the full data set - only the portion it needs -
and since a 'detailed', precise evaluation is expensive, it's deferred to the
thin_ls tool...
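
Roughly like this, assuming the dm names lvm2 usually creates for this pool
(check thin_ls --help for the exact option spelling; on a live pool you first
reserve a metadata snapshot so thin_ls reads a consistent view):

[root@ ~]# dmsetup message vg_storage-000--ThinPool-tpool 0 reserve_metadata_snap
[root@ ~]# thin_ls --metadata-snap /dev/mapper/vg_storage-000--ThinPool_tmeta
[root@ ~]# dmsetup message vg_storage-000--ThinPool-tpool 0 release_metadata_snap

It then reports per-device mapped, exclusive and shared block counts.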


And a last-but-not-least comment - the 4MB extent usage you pointed out is a
relatively huge chunk - and for 'fstrim' to succeed, the 4MB blocks matching
the thin-pool chunks need to be fully released.

So if e.g. some 'sparse' filesystem metadata blocks are placed within a chunk,
they may prevent TRIM from succeeding - so while your filesystem may have a
lot of free space for its data, the actual amount of physically trimmed space
can be much, much smaller.

So beware whether the 4MB chunk-size is a good fit for this thin-pool...
The smaller the chunk, the better the chance TRIM has...
For a heavily fragmented XFS, even 64K chunks might be a challenge...
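
You can check how much actually comes back by comparing Data% around a trim
(the mountpoint below is just a placeholder):

[root@ ~]# lvs vg_storage/000-ThinPool
[root@ ~]# fstrim -v /srv/storage
[root@ ~]# lvs vg_storage/000-ThinPool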


Regards


Zdenek
Gionatan Danti
2018-03-27 11:05:22 UTC
Hi
I forgot to mention there is a "thin_ls" tool (it comes with the
device-mapper-persistent-data package, together with thin_check) for those
who want to know the precise amount of allocation, and what amount of blocks
is owned exclusively by a single thin LV and what is shared.
It's worth noting that the numbers printed by 'lvs' are *JUST* really rough
estimations of data usage, for both the thin pool and the thin volumes.
The kernel is not maintaining the full data set - only the portion it needs -
and since a 'detailed', precise evaluation is expensive, it's deferred to the
thin_ls tool...
Ok, thanks for the reminder about "thin_ls" (I often forget about these
"minor" but very useful utilities...)
And a last-but-not-least comment - the 4MB extent usage you pointed out is a
relatively huge chunk - and for 'fstrim' to succeed, the 4MB blocks matching
the thin-pool chunks need to be fully released.
So if e.g. some 'sparse' filesystem metadata blocks are placed within a chunk,
they may prevent TRIM from succeeding - so while your filesystem may have a
lot of free space for its data, the actual amount of physically trimmed space
can be much, much smaller.
So beware whether the 4MB chunk-size is a good fit for this thin-pool...
The smaller the chunk, the better the chance TRIM has...
Sure, I understand that. Anyway, please note that 4MB chunk size was
*automatically* chosen by the system during pool creation. It seems to
me that the default is to constrain the metadata volume to be < 128 MB,
right?
For heavily fragmented XFS even 64K chunks might be a challenge....
True, but chunk size is *always* a performance/efficiency tradeoff.
Making a 64K-chunk volume will end up with even more fragmentation for the
underlying disk subsystem. Obviously, if many snapshots are expected, a
small chunk size is the right choice (CoW filesystems such as BTRFS and ZFS
face similar problems, by the way).

Thanks.
--
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: ***@assyoma.it - ***@assyoma.it
GPG public key ID: FF5F32A8
Zdenek Kabelac
2018-03-27 12:52:25 UTC
Post by Gionatan Danti
Hi
And a last-but-not-least comment - the 4MB extent usage you pointed out is a
relatively huge chunk - and for 'fstrim' to succeed, the 4MB blocks matching
the thin-pool chunks need to be fully released.
So if e.g. some 'sparse' filesystem metadata blocks are placed within a chunk,
they may prevent TRIM from succeeding - so while your filesystem may have a
lot of free space for its data, the actual amount of physically trimmed space
can be much, much smaller.
So beware whether the 4MB chunk-size is a good fit for this thin-pool...
The smaller the chunk, the better the chance TRIM has...
Sure, I understand that. Anyway, please note that 4MB chunk size was
*automatically* chosen by the system during pool creation. It seems to me that
the default is to constrain the metadata volume to be < 128 MB, right?
Yes - by default lvm2 'targets' to fit the metadata into this 128MB size.

Obviously there is nothing like 'one size fits all' - so it's really up to the
user to think about the use-case and pick better parameters than the defaults.

The 128MB size is picked so that the metadata easily fits in RAM.
Post by Gionatan Danti
For heavily fragmented XFS even 64K chunks might be a challenge....
True, but chunk size is *always* a performance/efficiency tradeoff. Making a
64K-chunk volume will end up with even more fragmentation for the underlying
disk subsystem. Obviously, if many snapshots are expected, a small chunk size
is the right choice (CoW filesystems such as BTRFS and ZFS face similar
problems, by the way).
Yep - the smaller the chunk, the smaller the 'max' data device size that can
be supported, since there is a finite number of chunks you can address from
the maximal metadata size, which is ~16GB and can't get any bigger.

The bigger the chunk, the less sharing between snapshots happens, but you get
fewer fragments.
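
To put numbers on it for a pool of this size, the parameters would be picked
explicitly at creation time - the figures below are purely illustrative, so
size the metadata area with thin_metadata_size first:

[root@ ~]# thin_metadata_size -b64k -s7t -m1000 -um
[root@ ~]# lvcreate --type thin-pool -L 7.21t --chunksize 64k \
           --poolmetadatasize 4G -n 000-ThinPool vg_storage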

Regards

Zdenek
