Discussion:
[linux-lvm] thinpool metadata size
Paul B. Henson
2014-03-12 21:21:36 UTC
While researching thinpool provisioning, it seems one of the issues is that
the size of the metadata is fixed as of creation, and that if the metadata
allocation fills up, your pool is corrupted? In many of the places that
concern was mentioned, it was also said that extending the size of the
metadata lv was a feature coming soon, but I didn't find anything confirming
whether or not that functionality had been released. Is the size of the
metadata lv still fixed?

My intention is to have a 4TB PV (4 x 2TB RAID10), allocated completely to a
thin pool, with the metadata stored separately on a 256G RAID1 of a couple
SSDs (the rest of the SSD mirror will eventually be used for dm-cache when
lvm support for that is released). This storage will be used for
virtualization, with fairly heavy snapshots, where there will be half a
dozen or so template volumes which will be snapshotted when a new vm is
created, then each of those will have some number of snapshots for backup
purposes (although those snapshots will never be written to). Given such a
usage pattern, is there a best practice recommendation for sizing the
metadata lv? It looks like going with the defaults would result in
approximately 3.6G allocated for metadata.
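
Roughly, I expect the creation to look something like the following, with
the data LV carved out of the RAID10 PV and the metadata LV out of the SSD
mirror, then joined with lvconvert (device and VG names here are made up,
and the exact option syntax may differ between lvm2 releases):

# pvcreate /dev/md0 /dev/md1
# vgcreate vg0 /dev/md0 /dev/md1
# lvcreate -n pool -l 100%PVS vg0 /dev/md0
# lvcreate -n poolmeta -L 4G vg0 /dev/md1
# lvconvert --thinpool vg0/pool --poolmetadata vg0/poolmeta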

Thanks.
Mike Snitzer
2014-03-12 23:35:05 UTC
On Wed, Mar 12 2014 at 5:21pm -0400,
Post by Paul B. Henson
While researching thinpool provisioning, it seems one of the issues is that
the size of the metadata is fixed as of creation, and that if the metadata
allocation fills up, your pool is corrupted? In many of the places that
concern was mentioned, it was also said that extending the size of the
metadata lv was a feature coming soon, but I didn't find anything confirming
whether or not that functionality had been released. Is the size of the
metadata lv still fixed?
No, metadata resize is now available. But you definitely want to be
using the latest kernel (there have been various fixes for this
feature).

Completely exhausting all space in the metadata device will expose you
to a corner case that still needs work... so best to avoid that by
sizing your metadata device conservatively (larger).
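
For the record, with a current lvm2 the metadata LV can also be grown
online much like the data LV; something along these lines should work
(the VG and pool names are just placeholders):

# lvextend --poolmetadatasize +1G vg0/pool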

We'll soon be assessing whether a fix is needed for metadata resize once
all metadata space is exhausted (but last I knew we have a bug lurking
in dm-persistent-data for this case).
Post by Paul B. Henson
My intention is to have a 4TB PV (4 x 2TB RAID10), allocated completely to a
thin pool, with the metadata stored separately on a 256G RAID1 of a couple
SSDs (the rest of the SSD mirror will eventually be used for dm-cache when
lvm support for that is released). This storage will be used for
virtualization, with fairly heavy snapshots, where there will be half a
dozen or so template volumes which will be snapshotted when a new vm is
created, then each of those will have some number of snapshots for backup
purposes (although those snapshots will never be written to). Given such a
usage pattern, is there a best practice recommendation for sizing the
metadata lv? It looks like going with the defaults would result in
approximately 3.6G allocated for metadata.
The largest the metadata volume can be is just under 16GB. The size of
the metadata device will depend on the blocksize and number of expected
snapshots.

The thin_metadata_size utility should be able to provide you with an
approximation for the total metadata size needed.
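
As a rough cross-check on that tool, the guideline in the kernel's
thin-provisioning.txt is on the order of 48 bytes of metadata per data
chunk (treat the constant as approximate); for a 4TiB pool with 64KiB
chunks that works out to about 3GiB:

# echo "48 * (4 * 2^40) / (64 * 2^10)" | bc
3221225472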
Paul B. Henson
2014-03-13 01:32:12 UTC
From: Mike Snitzer
Sent: Wednesday, March 12, 2014 4:35 PM
No, metadata resize is now available.
Oh, cool; that makes the initial allocation decision a little less critical
:).
But you definitely want to be
using the latest kernel (there have been various fixes for this
feature).
I thought I saw a thin pool metadata corruption issue fly by recently with a
fix destined for 3.14, so I was tentatively thinking of waiting for the 3.14
release before migrating my box to thin provisioning. I'm currently running
3.12; it looks like that was designated a long-term support kernel? Are thin
provisioning (and dm-cache, as I'm going to add that to the mix as soon as
lvm supports it) patches going to be backported to that, or would it be
better to track mainline stable kernels as they are released?
Completely exhausting all space in the metadata device will expose you
to a corner case that still needs work... so best to avoid that by
sizing your metadata device conservatively (larger).
On the grand scale of things it doesn't look like it wants that much space,
so over allocation sounds like a good idea.
The largest the metadata volume can be is just under 16GB. The size of
the metadata device will depend on the blocksize and number of expected
snapshots.
Interesting; for some reason I thought metadata usage was also dependent on
changes between origin and snapshots. So, if you had one origin lv and 100
snapshots of it that were all identical, it would use less metadata than if
you had 100 snapshots that had been written to and were all wildly divergent
from each other. Evidently not though?

In regards to blocksize, from what I read the recommendation was that if
you're only looking for thin provisioning, but not planning to have lots of
snapshots, it's better to have a larger blocksize, whereas if you're going
to have a lot of snapshots a smaller blocksize is better? I think I'm just
going to stick with the default 64k for now.
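
If I understand correctly, the chunk size has to be picked when the pool is
created and can't be changed afterwards, so it would just be part of the
creation command, something like this (names and sizes are only
illustrative, and presumably the same option applies to the lvconvert route
I'd actually use to keep the metadata on the SSDs):

# lvcreate -L 3.9T --chunksize 64k -T vg0/pool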
The thin_metadata_size utility should be able to provide you with an
approximation for the total metadata size needed.
A short tangent; typically when you distinguish between gigabytes and
gibibytes, the former are powers of 10 and the latter powers of 2, no? 1
gigabyte = 1000000000 bytes, 1 gibibyte = 1073741824 bytes? It looks like
the thin_metadata_size utility has those reversed?

# thin_metadata_size -b 64k -s 4t -m 100000 -u gigabytes
thin_metadata_size - 2.41 gigabytes estimated metadata area size

# thin_metadata_size -b 64k -s 4t -m 100000 -u gibibytes
thin_metadata_size - 2.59 gibibytes estimated metadata area size

# thin_metadata_size -b 64k -s 4t -m 100000 -u bytes
thin_metadata_size - 2591174656 bytes estimated metadata area size
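
Doing the math by hand from the byte figure seems to bear that out: dividing
by 2^30 gives the number it labels gigabytes, and dividing by 10^9 gives the
number it labels gibibytes:

# echo "scale=2; 2591174656 / 2^30" | bc
2.41
# echo "scale=2; 2591174656 / 10^9" | bc
2.59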

Back on subject, I guess there's some fixed overhead, as the metadata size
difference between 1 lv and 10000 lvs is pretty tiny:

# thin_metadata_size -b 64k -s 4t -m 1 -u g
thin_metadata_size - 2.03 gigabytes estimated metadata area size

# thin_metadata_size -b 64k -s 4t -m 10000 -u g
thin_metadata_size - 2.07 gigabytes estimated metadata area size

Another power of 10 increase in volumes still only adds a bit more:

# thin_metadata_size -b 64k -s 4t -m 100000 -u g
thin_metadata_size - 2.41 gigabytes estimated metadata area size

I think I'll be pretty safe allocating 2.5G, particularly given you can now
resize it later if you start getting short.
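
For what it's worth, I'm expecting the provisioning workflow itself to be
nothing more exotic than a thin volume per template plus size-less snapshots
per vm and per backup, along these lines (names and sizes made up):

# lvcreate -V 40G -T vg0/pool -n template-base
# lvcreate -s vg0/template-base -n vm01
# lvcreate -s vg0/vm01 -n vm01-backup1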

Thanks much for the info.
Mike Snitzer
2014-03-13 14:01:58 UTC
On Wed, Mar 12 2014 at 9:32pm -0400,
Post by Paul B. Henson
From: Mike Snitzer
Sent: Wednesday, March 12, 2014 4:35 PM
No, metadata resize is now available.
Oh, cool; that makes the initial allocation decision a little less critical
:).
But you definitely want to be
using the latest kernel (there have been various fixes for this
feature).
I thought I saw a thin pool metadata corruption issue fly by recently with a
fix destined for 3.14, so I was tentatively thinking of waiting for the 3.14
release before migrating my box to thin provisioning. I'm currently running
3.12; it looks like that was designated a long-term support kernel? Are thin
provisioning (and dm-cache, as I'm going to add that to the mix as soon as
lvm supports it) patches going to be backported to that, or would it be
better to track mainline stable kernels as they are released?
The important fixes for long-standing issues will be marked for stable,
e.g.: http://git.kernel.org/linus/cebc2de44d3bce53 (and yes I already
sent a note to stable@ to have them pull this in to 3.12-stable too)

But significant improvements will not be. The biggest recent example of
this is the set of improvements made in 3.14 for "out-of-data-space" mode
and all the associated error handling improvements.

So if I were relegated to using upstream kernels, I'd track latest
stable kernel if I could. Otherwise, I'd do my own backports -- but
wouldn't expect others to support my backports.
Post by Paul B. Henson
Completely exhausting all space in the metadata device will expose you
to a corner case that still needs work... so best to avoid that by
sizing your metadata device conservatively (larger).
On the grand scale of things it doesn't look like it wants that much space,
so over allocation sounds like a good idea.
The largest the metadata volume can be is just under 16GB. The size of
the metadata device will depend on the blocksize and number of expected
snapshots.
Interesting; for some reason I thought metadata usage was also dependent on
changes between origin and snapshots. So, if you had one origin lv and 100
snapshots of it that were all identical, it would use less metadata than if
you had 100 snapshots that had been written to and were all wildly divergent
from each other. Evidently not though?
I'm not sure if the tool tracks the rate of change. It may account for the
worst case, in which _every_ block of the provided number of thin devices
is _not_ shared.
Paul B. Henson
2014-03-14 02:39:06 UTC
From: Mike Snitzer
Sent: Thursday, March 13, 2014 7:02 AM
So if I were relegated to using upstream kernels, I'd track latest
stable kernel if I could.
I can, and I guess I will :); it just adds a little extra volatility and
work. Maybe by the time the next long-term stable after 3.12 gets picked,
thin provisioning and cache will have settled down enough to go with it.
I'm not sure if the tool tracks the rate of change. It may account for the
worst case, in which _every_ block of the provided number of thin devices
is _not_ shared.
Interesting; then after adding an extra order of magnitude of padding for
the number of snapshots, it's probably quite over-allocated. But still,
2.5G isn't very much in the scale of things.

Thanks.
matthew patton
2014-03-14 05:52:06 UTC
Post by Paul B. Henson
Interesting; then after adding an extra order of magnitude of padding for
the number of snapshots, it's probably quite over-allocated. But still,
2.5G isn't very much in the scale of things.
The enterprise boys allocate as much as 30% of usable space to housekeeping, so I'd say 2.5GB is a damn trifle.
Mike Snitzer
2014-03-13 17:20:25 UTC
On Wed, Mar 12 2014 at 7:35pm -0400,
Post by Mike Snitzer
On Wed, Mar 12 2014 at 5:21pm -0400,
Post by Paul B. Henson
While researching thinpool provisioning, it seems one of the issues is that
the size of the metadata is fixed as of creation, and that if the metadata
allocation fills up, your pool is corrupted? In many of the places that
concern was mentioned, it was also said that extending the size of the
metadata lv was a feature coming soon, but I didn't find anything confirming
whether or not that functionality had been released. Is the size of the
metadata lv still fixed?
No, metadata resize is now available. But you definitely want to be
using the latest kernel (there have been various fixes for this
feature).
Completely exhausting all space in the metadata device will expose you
to a corner case that still needs work... so best to avoid that by
sizing your metadata device conservatively (larger).
We'll soon be assessing whether a fix is needed for metadata resize once
all metadata space is exhausted (but last I knew we have a bug lurking
in dm-persistent-data for this case).
Yes, we're still unable to resize after metadata has been completely exhausted:

(NOTE: this is with a hacked 3.14-rc6 kernel that comments out the
dm_pool_metadata_set_needs_check() block in
drivers/md/dm-thin.c:abort_transaction()... otherwise this test would
short-circuit on the 'needs_check' flag being set and would never
attempt the resize)

# dmtest run --suite thin-provisioning -n /resize_metadata_after_exhaust/
Loaded suite thin-provisioning
Started
test_resize_metadata_after_exhaust(MetadataResizeTests): metadata_size = 512k
Mar 13 12:57:44 rhel-storage-02 kernel: device-mapper: thin: 251:5: reached low water mark for metadata device: sending event.
Mar 13 12:57:44 rhel-storage-02 kernel: device-mapper: space map metadata: unable to allocate new metadata block
Mar 13 12:57:44 rhel-storage-02 kernel: device-mapper: thin: 251:5: metadata operation 'dm_pool_alloc_data_block' failed: error = -28
Mar 13 12:57:44 rhel-storage-02 kernel: device-mapper: thin: 251:5: aborting current metadata transaction
Mar 13 12:57:44 rhel-storage-02 kernel: device-mapper: thin: 251:5: switching pool to read-only mode
wipe_device failed as expected
resizing...
Mar 13 12:57:44 rhel-storage-02 kernel: Buffer I/O error on device dm-6, logical block 4186096
Mar 13 12:57:44 rhel-storage-02 kernel: Buffer I/O error on device dm-6, logical block 4186096
Mar 13 12:57:44 rhel-storage-02 kernel: device-mapper: thin: 251:5: switching pool to write mode
Mar 13 12:57:44 rhel-storage-02 kernel: device-mapper: thin: 251:5: growing the metadata device from 128 to 192 blocks
Mar 13 12:57:44 rhel-storage-02 kernel: device-mapper: space map metadata: unable to allocate new metadata block
Mar 13 12:57:44 rhel-storage-02 kernel: device-mapper: thin: 251:5: metadata operation 'dm_pool_commit_metadata' failed: error = -28
Mar 13 12:57:44 rhel-storage-02 kernel: device-mapper: thin: 251:5: aborting current metadata transaction
Mar 13 12:57:44 rhel-storage-02 kernel: device-mapper: thin: 251:5: switching pool to read-only mode
Mar 13 12:57:44 rhel-storage-02 kernel: Buffer I/O error on device dm-6, logical block 4186111
Mar 13 12:57:44 rhel-storage-02 kernel: Buffer I/O error on device dm-6, logical block 4186111
Mar 13 12:57:44 rhel-storage-02 kernel: Buffer I/O error on device dm-6, logical block 4186111
Mar 13 12:57:44 rhel-storage-02 kernel: Buffer I/O error on device dm-6, logical block 4186111
Mar 13 12:57:45 rhel-storage-02 kernel: Buffer I/O error on device dm-6, logical block 4186111
Mar 13 12:57:45 rhel-storage-02 kernel: Buffer I/O error on device dm-6, logical block 4186111
Mar 13 12:57:45 rhel-storage-02 kernel: Buffer I/O error on device dm-6, logical block 4186111
Mar 13 12:57:45 rhel-storage-02 kernel: Buffer I/O error on device dm-6, logical block 4186104
metadata_size = 768k
wipe_device failed as expected
resizing...
Mar 13 12:57:46 rhel-storage-02 kernel: device-mapper: thin: 251:5: switching pool to write mode
Mar 13 12:57:46 rhel-storage-02 kernel: device-mapper: thin: 251:5: growing the metadata device from 128 to 256 blocks
Mar 13 12:57:46 rhel-storage-02 kernel: device-mapper: space map metadata: unable to allocate new metadata block
Mar 13 12:57:46 rhel-storage-02 kernel: device-mapper: thin: 251:5: metadata operation 'dm_pool_commit_metadata' failed: error = -28
Mar 13 12:57:46 rhel-storage-02 kernel: device-mapper: thin: 251:5: aborting current metadata transaction
Mar 13 12:57:46 rhel-storage-02 kernel: device-mapper: thin: 251:5: switching pool to read-only mode

Good news is online metadata resize works fine as long as metadata space
hasn't been exhausted:

# dmtest run --suite thin-provisioning -n /resize_metadata_with_io/
Loaded suite thin-provisioning
Started
test_resize_metadata_with_io(MetadataResizeTests):
Mar 13 13:19:10 rhel-storage-02 kernel: device-mapper: thin: 251:5: growing the metadata device from 256 to 512 blocks
Mar 13 13:19:11 rhel-storage-02 kernel: device-mapper: thin: 251:5: growing the metadata device from 512 to 1024 blocks
Mar 13 13:19:12 rhel-storage-02 kernel: device-mapper: thin: 251:5: growing the metadata device from 1024 to 1536 blocks
Mar 13 13:19:14 rhel-storage-02 kernel: device-mapper: thin: 251:5: growing the metadata device from 1536 to 2048 blocks
Mar 13 13:19:15 rhel-storage-02 kernel: device-mapper: thin: 251:5: growing the metadata device from 2048 to 2560 blocks
.

Finished in 11.653548028 seconds.
Paul B. Henson
2014-03-14 02:42:40 UTC
From: Mike Snitzer
Sent: Thursday, March 13, 2014 10:20 AM
[...]
Good news is online metadata resize works fine as long as metadata space
hasn't been exhausted.
Hmm, then it seems it would be wise to keep an eye on utilization ;). Looks
like there are some munin plug-ins for lvm, hopefully one of them already
has metadata usage support or can be easily extended.
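
If nothing suitable turns up I can always cron something trivial around lvs
in the meantime; a first cut might look like this (pool name and threshold
are made up, and it assumes the lvs metadata_percent field is available in
the installed lvm2):

#!/bin/sh
# warn when thin pool metadata usage crosses a threshold
POOL=vg0/pool
LIMIT=80
PCT=$(lvs --noheadings -o metadata_percent "$POOL" | tr -d ' ' | cut -d. -f1)
if [ "${PCT:-0}" -ge "$LIMIT" ]; then
    echo "thin pool $POOL metadata at ${PCT}% used, consider lvextend --poolmetadatasize" >&2
fi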

Thanks.
Mike Snitzer
2014-03-14 03:35:13 UTC
On Thu, Mar 13 2014 at 10:42pm -0400,
Post by Paul B. Henson
From: Mike Snitzer
Sent: Thursday, March 13, 2014 10:20 AM
[...]
Good news is online metadata resize works fine as long as metadata space
hasn't been exhausted.
Hmm, then it seems it would be wise to keep an eye on utilization ;). Looks
like there are some munin plug-ins for lvm, hopefully one of them already
has metadata usage support or can be easily extended.
lvm2 should/could monitor for metadata low water mark threshold events,
which the kernel will trigger, much like it does for data low water
mark.

But since the kernel internalizes the metadata low water mark I'm not
sure if lvm2 has taken this on. Zdenek?
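
In the meantime the raw numbers the kernel checks against its low water
mark are visible from userspace anyway: the thin-pool status line reports
the transaction id followed by used/total metadata blocks and then
used/total data blocks, so something like the following (the device name
is whatever lvm generated for the pool, typically VG-pool-tpool) is enough
for ad-hoc monitoring:

# dmsetup status vg0-pool-tpool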