Discussion:
[linux-lvm] Why use thin_pool_autoextend_threshold < 100 ?
Marc MERLIN
2018-07-26 16:31:45 UTC
Still learning about thin volumes.
Why do I want my thin pool to get auto extended? Does "extended" mean
resized?
Why would I want to have thin_pool_autoextend_threshold below 100 and
have it auto extend as needed vs having all of them be at 100, knowing
that underlying block allocation will fail if I run out of physical
blocks underneath?


Details:
I have a 14TB bcache block device.
On top, I'd like to put multiple btrfs filesystems.
There is however an issue with btrfs where it gets more unsafe (and
slower) to use if you have too many snapshots (over 50, and especially
over 100).
The fix around this is sadly to have multiple separate filesystems,
which kind of negates the nice part where you make subvolumes and let
them grow independently.

So, I'm going to make about 10 thin volumes, one for each of my btrfs
subvolumes so that they are all separate filesystems.
However, my plan is to make them all 14TB in size so that I never have
to resize the filesystem with the full understanding of course that the
sum of all is still going to be 14TiB underneath.

Right now, I'm getting this:
gargamel:~# lvcreate -V14TiB -T vgds2/thinpool2 -n debian64
Using default stripesize 64.00 KiB.
WARNING: Sum of all thin volume sizes (28.00 TiB) exceeds the size of thin pool vgds2/thinpool2 and the size of whole volume group (14.55 TiB).
WARNING: You have not turned on protection against thin pools running out of space.
WARNING: Set activation/thin_pool_autoextend_threshold below 100 to trigger automatic extension of thin pools before they get full.
Logical volume "debian64" created.
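
The arithmetic behind that warning can be sketched roughly as follows. The helper name overcommit_percent and the use of whole GiB are assumptions made for illustration; none of this is LVM output, and the pool size is taken from the "LV Size 14.50 TiB" reported later in the thread.

```shell
#!/bin/sh
# Hypothetical helper: how overcommitted is a thin pool, in percent?
# Arguments: number of thin volumes, size of each in GiB, pool size in GiB.
# Integer arithmetic only, so the result is rounded down.
overcommit_percent() {
    count=$1 each_gib=$2 pool_gib=$3
    echo $(( count * each_gib * 100 / pool_gib ))
}

# Two 14 TiB (14336 GiB) volumes on a 14.50 TiB (14848 GiB) pool.
overcommit_percent 2 14336 14848
# The planned ten volumes on the same pool.
overcommit_percent 10 14336 14848
```

Anything above 100 means the thin volumes together promise more space than the pool can actually back, which is what the lvcreate warning is pointing out.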

I'm looking at lvm.conf, and I'll be honest that it's not clear
# Configuration option activation/thin_pool_autoextend_threshold.
# Auto-extend a thin pool when its usage exceeds this percent.
# Setting this to 100 disables automatic extension.
# The minimum value is 50 (a smaller value is treated as 50.)
# Also see thin_pool_autoextend_percent.
# Automatic extension requires dmeventd to be monitoring the LV.
#
# Example
# Using 70% autoextend threshold and 20% autoextend size, when a 1G
# thin pool exceeds 700M, it is extended to 1.2G, and when it exceeds
# 840M, it is extended to 1.44G:
# thin_pool_autoextend_threshold = 70
#
thin_pool_autoextend_threshold = 100
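
The example in that comment can be reproduced with a small sketch. The function name autoextend_steps is hypothetical, sizes are treated as plain integers in "M" with 1G taken as 1000M (as the comment itself does), and rounding to physical extents is ignored.

```shell
#!/bin/sh
# Hypothetical helper: for each autoextend step, print the usage at which
# the extension triggers and the pool size after the extension.
# Arguments: initial size, threshold percent, extend percent, step count.
autoextend_steps() {
    size=$1 threshold=$2 percent=$3 steps=$4
    i=0
    while [ "$i" -lt "$steps" ]; do
        trigger=$(( size * threshold / 100 ))   # usage that triggers a resize
        size=$(( size + size * percent / 100 )) # pool size after the resize
        echo "$trigger $size"
        i=$(( i + 1 ))
    done
}

# 1G pool (1000M), 70% threshold, 20% extension, first two steps.
autoextend_steps 1000 70 20 2
```

This prints "700 1200" and then "840 1440", matching the 700M/1.2G and 840M/1.44G figures in the lvm.conf comment. The extension can only succeed while the volume group still has free extents to give to the pool.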

What's the downside of just leaving it at 100?

Thanks
Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
.... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 7F55D5F27AAF9D08
Zdenek Kabelac
2018-07-27 12:59:28 UTC
Post by Marc MERLIN
Still learning about thin volumes.
Why do I want my thin pool to get auto extended? Does "extended" mean
resized?
yes extension == resize
Post by Marc MERLIN
Why would I want to have thin_pool_autoextend_threshold below 100 and
have it auto extend as needed vs having all of them be at 100, knowing
that underlying block allocation will fail if I run out of physical
blocks underneath?
Hi

man lvmthin.


In general - do not count on 'running out of space in the thin-pool' as part
of your regular daily workflow.

Running out of space in a thin-pool (data, and even more so metadata) will
always have a MAJOR impact on the usability of your system. It's always an
unpleasant moment, and it's not even remotely comparable to running out of
space in your filesystem - it's a much more problematic case - so you should
try to avoid it at all costs.

If you want to live on the edge of out-of-space, thin-pool is probably
not the best technology to use.
Post by Marc MERLIN
I have a 14TB bcache block device.
On top, I'd like to put multiple btrfs filesystems.
IMHO it's a bad plan to combine 2 overprovisioning technologies.

btrfs HAS its own built-in volume manager (i.e. its own built-in equivalent of lvm)
Post by Marc MERLIN
There is however an issue with btrfs where it gets more unsafe (and
slower) to use if you have too many snapshots (over 50, and especially
over 100).
It's better to pair thin-pool with ext4 or XFS.

BTRFS will suffer great pain from problems with lvm2 snapshots - where btrfs
will see the very same block device present multiple times in your system - so
I'd highly discourage using thin-pool with btrfs unless you are very well
aware of the weaknesses and can avoid running into them...
Post by Marc MERLIN
I'm looking at lvm.conf, and I'll be honest that it's not clear
# Configuration option activation/thin_pool_autoextend_threshold.
# Auto-extend a thin pool when its usage exceeds this percent.
# Setting this to 100 disables automatic extension.
# The minimum value is 50 (a smaller value is treated as 50.)
# Also see thin_pool_autoextend_percent.
# Automatic extension requires dmeventd to be monitoring the LV.
#
# Example
# Using 70% autoextend threshold and 20% autoextend size, when a 1G
# thin pool exceeds 700M, it is extended to 1.2G, and when it exceeds
# 840M, it is extended to 1.44G:
# thin_pool_autoextend_threshold = 70
#
thin_pool_autoextend_threshold = 100
What's the downside of just leaving it at 100?
Possible loss of your data in case you run out of space and hit some
corner cases - note that one quite annoying bug with TRIM usage on a full
pool, which could have led to some problematic metadata recovery, will only
be fixed with the 4.18 kernel.

Regards

Zdenek
Marc MERLIN
2018-07-27 18:26:58 UTC
Hi Zdenek,

Thanks for your helpful reply.
Post by Zdenek Kabelac
Post by Marc MERLIN
Still learning about thin volumes.
Why do I want my thin pool to get auto extended? Does "extended" mean
resized?
yes extension == resize
Gotcha. Then I don't want to have to worry about my filesystem being resized
multiple times, especially since I'm not sure how it will help.
Post by Zdenek Kabelac
man lvmthin.
Thanks. Had read it, but not carefully enough.
So, I just re-read "Automatic extend settings"
I'm still not entirely sure how using extension would help me there. I
can't set it to 10% for all 10 filesystems (50% is minimum).
If I set it to anything less than 100%, it could later that it can block,
and try to extend and resize later, but ultimately I'll still have multiple
filesystems that together exceed the space available, so I can run out.
I'm not seeing how the automatic extend setting is helpful, at least in my case.
Am I missing something?

To be clear, my case is that I will have 10 filesystems in a place where the
same data was in a single filesystem that sadly I must segment now. More
than a few will take more than 1/10th of the space, but I don't want to have
to worry about which ones are going to use how much as long as all together
they stay below 100% of course.
I don't want to have to manage space for each of those 10 and have to resize
them by hand multiple times up and down to share the space, hence dm-thin.

My understanding is that I have to watch this carefully
LV Name thinpool2
VG Name vgds2
LV Pool metadata thinpool2_tmeta
LV Pool data thinpool2_tdata
LV Status available
# open 8
LV Size 14.50 TiB
Allocated pool data 20.26%
Allocated metadata 10.66%

I'll have to make sure to run fstrim so that 'Allocated pool data' never
gets too high.
Metadata, I need to read more about to see whether that may become a problem.
I think as long as I don't use LVM snapshots I should be ok (and I won't).
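
A minimal sketch of such a watch, assuming the data_percent and metadata_percent report fields of lvs. The function names and the 80% limit are hypothetical; the lvs wrapper needs root and a real pool, so it is only defined here and not called.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit 0) when a usage percentage such as
# "20.26" has reached an integer limit. Only the integer part is compared.
pool_over_limit() {
    pct=${1%%.*}
    [ "${pct:-0}" -ge "$2" ]
}

# Assumed wrapper around lvs: warn when data or metadata usage of the
# given pool has reached the limit.
check_pool() {
    pool=$1 limit=$2
    lvs --noheadings -o data_percent,metadata_percent "$pool" |
    while read -r data meta; do
        pool_over_limit "$data" "$limit" && echo "WARNING: $pool data at $data%"
        pool_over_limit "$meta" "$limit" && echo "WARNING: $pool metadata at $meta%"
    done
}

# Example with the values shown above: neither has reached an 80% limit.
pool_over_limit 20.26 80 || echo "data ok"
pool_over_limit 10.66 80 || echo "metadata ok"
```

Something like "check_pool vgds2/thinpool2 80" could then be run from cron, with the limit chosen to leave enough time to free space or trim before the pool fills.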
Post by Zdenek Kabelac
Running out-of-space in thin-pool (data and even more on metadata) will
have always MAJOR impact on usability of your system. It's always
unpleasant moment and it's not even closely comparable with something like
running out-of-space in your filesystem - it's much more problematic case -
so you should at all cost try to avoid it.
Thanks for confirming.
I suppose in my case I should set 'errorwhenfull y' so that the FS immediately
remounts read only on write failure. Delaying for up to 60 seconds is not
going to help in my case.
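
As a sketch, that setting could be changed with lvchange --errorwhenfull, which lvmthin(7) describes. The wrapper name set_error_when_full and the RUN switch are hypothetical; by default the command is only printed, so nothing is changed until it is run deliberately as root.

```shell
#!/bin/sh
# Hypothetical wrapper: print the lvchange command by default,
# execute it only when RUN=1 is set in the environment.
set_error_when_full() {
    pool=$1
    cmd="lvchange --errorwhenfull y $pool"
    if [ "${RUN:-0}" = 1 ]; then
        $cmd
    else
        echo "$cmd"
    fi
}

# Dry run for the pool from this thread.
set_error_when_full vgds2/thinpool2
```

The current behaviour of a pool can be checked afterwards with "lvs -o+lv_when_full".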
Post by Zdenek Kabelac
If you want to be living on corner case of out-of-space, thin-pool is
probably not the best technology for use.
I don't want to be using dm-thin at all, but I have too many subvolumes for
a single btrfs filesystem, so I need to segment my btrfs filesystem in 10
or so, to be safe (as discussed with btrfs developers)
Post by Zdenek Kabelac
IMHO bad plan to combine 2 overprovisioning technologies together.
btrfs HAS its own built-in volume manager (aka built-in it's own like lvm)
btrfs does not over provision, and sadly I found out that if you have more
than 50 or 100 snapshots, you are going to run into problems with balancing,
and bigger problems with filesystem corruption and repair later (as I found
out over the last 3 weeks dealing with this)
Post by Zdenek Kabelac
Post by Marc MERLIN
There is however an issue with btrfs where it gets more unsafe (and
slower) to use if you have too many snapshots (over 50, and especially
over 100).
It's better to pair thin-pool with ext4 or XFS.
I need btrfs send/receive, so that's not an option.
Post by Zdenek Kabelac
BTRFS will suffer great pain from problems of lvm2 snapshots - where btrfs
I will not be using lvm snapshots at all.
Post by Zdenek Kabelac
will see the very same block device multiple times present in your system -
so I'd highly discourage usage of thin-pool with btrfs unless you are very
well aware of the weaknesses and you can avoid running into them...
I'm only using thin-pool to allow dynamic block allocation for over
provisioning. I will use no other LVM feature. Is that ok?
Post by Zdenek Kabelac
Possible loss of your data in case you run out of space and hit some
corner cases - note that one quite annoying bug with TRIM usage on a full
pool, which could have led to some problematic metadata recovery, will only
be fixed with the 4.18 kernel.
So, as long as I run trim in btrfs and make very sure I don't run out of blocks
on the VG side, should I be safe-ish enough?

Thanks,
Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
.... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/
John Stoffel
2018-07-27 19:31:36 UTC
Marc> Hi Zdenek,
Marc> Thanks for your helpful reply.
Post by Zdenek Kabelac
Post by Marc MERLIN
Still learning about thin volumes.
Why do I want my thin pool to get auto extended? Does "extended" mean
resized?
yes extension == resize
Marc> Gotcha. Then I don't want to have to worry about my filesystem being resized
Marc> multiple times, especially since I'm not sure how it will help.
Post by Zdenek Kabelac
man lvmthin.
Marc> Thanks. Had read it, but not carefully enough.
Marc> So, I just re-read "Automatic extend settings"
Marc> I'm still I'm not entirely sure how using extension would help me there. I
Marc> can't set it to 10% for all 10 filesystems (50% is minimum).
Marc> If I set it to anything less than 100%, it could later that it can block,
Marc> and try to extend and resize later, but ultimately I'll still have multiple
Marc> filesystems that together exceed the space available, so I can run out.
Marc> I'm not seeing how the automatic extend setting is helpful, at least in my case.
Marc> Am I missing something?

Marc> To be clear, my case is that I will have 10 filesystems in a
Marc> place where the same data was in a single filesystem that sadly
Marc> I must segment now. More than a few will take more than 1/10th
Marc> of the space, but I don't want to have to worry about which ones
Marc> are going to use how much as long as all together they stay
Marc> below 100% of course.

Marc> I don't want to have to manage space for each of those 10 and
Marc> have to resize them by hand multiple times up and down to share
Marc> the space, hence dm-thin.

Why don't you run quotas on your filesystems? Also, none of the
filesystems in Linux land that I'm aware of supports shrinking the
filesystem while live, it's all an unmount, shrink FS, shrink volume
(carefully!) and then re-mount the filesystem.

But again, I think you might really prefer quotas instead, unless you
need complete logical separation.

John
Marc MERLIN
2018-07-27 19:58:18 UTC
Post by John Stoffel
Why don't you run quotas on your filesystems? Also, none of the
filesystems in Linux land that I'm aware of supports shrinking the
filesystem while live, it's all a unmount, shrink FS, shrink volume
(carefully!) and then re-mount the filesystem.
Those filesystems can be umounted, so shrinking while live is not
something I need even if btrfs might actually support it.
Post by John Stoffel
But again, I think you might really prefer quotas instead, unless you
need complete logical seperation.
Since I know more than I wish I did about btrfs :) let me explain a bit
more

0) I will not be using lvm for its own snapshot capabilities, or resize.
I'm cheating by using overcommit with dm-thin and not wanting to worry
about segmenting space between each filesystem and having to worry about
shrinking one to grow another one from time to time.

1) quotas don't work well on btrfs when you have snapshots, and by that
I mean btrfs snapshots. Because blocks are shared between snapshots,
calculating quotas is a performance problem.

2) I don't have a space or quota problem on btrfs, the problem I have is
I use btrfs send/receive a lot for backups (it's a backup server) and
history (go back a month ago or whatever).
http://marc.merlins.org/perso/btrfs/post_2014-03-22_Btrfs-Tips_-Doing-Fast-Incremental-Backups-With-Btrfs-Send-and-Receive.html
if you aren't familiar with btrfs send/receive backups.
Btrfs starts having performance problems for some operations (re-balance,
or fsck) when you have too many subvolumes (each snapshot creates a
subvolume).

3) I hit severe enough problems that filesystem checks were taking days
to complete, which was not workable. The only way around it is to have
fewer subvolumes.

4) because I still need the same amount of backups and want the same
amount of history, fewer subvolumes means moving each separate subvolume
into its own separate filesystem.

Then there is the last part that btrfs is still not super stable and can
have corruption problems (although in my case I had clear problems due
to an underlying unreliable SATA subsystem which caused writes not to
make it to all the blocks of each drive of a raid set, something that
even careful journalling does not deal with).
So, I have:

5) when things go wrong with btrfs, you're better off having smaller
filesystems with less data as they are quicker to check and repair as
well as quicker to rebuild if they are corrupted beyond repair
(btrfs can easily get into a state where all or most of your data is
still there read only, but the filesystem has extent issues that can't
be fixed at this moment and require a rebuild)

Makes sense?

Am I crazy to want to use dm-thin the way I'm trying to? :)

Thanks,
Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
.... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 7F55D5F27AAF9D08
John Stoffel
2018-07-27 21:09:54 UTC
Post by John Stoffel
Why don't you run quotas on your filesystems? Also, none of the
filesystems in Linux land that I'm aware of supports shrinking the
filesystem while live, it's all a unmount, shrink FS, shrink volume
(carefully!) and then re-mount the filesystem.
Marc> Those filesystems can be umounted, so shrinking while live is not
Marc> something I need even if btrfs might actually support it.
Post by John Stoffel
But again, I think you might really prefer quotas instead, unless you
need complete logical seperation.
Marc> Since I know more than I wish I did about btrfs :) let me explain a bit
Marc> more

Marc> 0) I will not be using lvm for its own snapshot capabilities, or
Marc> resize. I'm cheating by using overcommit with dm-thin and not
Marc> wanting to worry about segmenting space between each fileystem
Marc> and having to worry about shrinking one to grow another one from
Marc> time to time.

Marc> 1) quotas don't work well on btrfs when you have snapshots, and
Marc> by that I mean btfrs snapshots. Because blocks are shared
Marc> between snapshots, calculating quotas is a performance problem.

Marc> 2) I don't have a space or quota problem on btrfs, the problem I
Marc> have is I use btrfs send/receive a lot for backups (it's a
Marc> backup server) and history (go back a month ago or whatever).
Marc> http://marc.merlins.org/perso/btrfs/post_2014-03-22_Btrfs-Tips_-Doing-Fast-Incremental-Backups-With-Btrfs-Send-and-Receive.html
Marc> if you aren't familiar with btrfs send/receive backups. Btrfs
Marc> starts having performance problems for some operations
Marc> (re-balance, or fsck) when you have too many subvolumes (each
Marc> snapshot creates a subvolume).

That's the key part that I didn't realize. And this is why I'm still
leery of btrfs (and zfs for that matter) since as you push the limits,
they tend to fall off a cliff performance-wise, instead of degrading
more gracefully. So you're obviously also using source btrfs
volume(s) for your data being backed up. So I can understand what
you're trying to do...

So is it a single 14TB source btrfs volume, and did you make snapshots
on a rotating basis to the destination volumes?

Marc> 3) I hit severe enough problems that filesystem checks were
Marc> taking days to complete, which was not workable. The only way
Marc> around it is to have fewer subvolumes.

Ouch! This is not an easy space to be in.

Marc> 4) because I still need the same amount of backups and want the same
Marc> amount of history, fewer subvolumes means moving each separate subvolume
Marc> into its own separate filesystem.

So you're doing snapshots of source sub-volumes? I figure you must be
running into performance problems no matter which end you're talking
about here, because the btrfs stuff is just going to bite you one way
or another.

Marc> Then there is the last part that btrfs is still not super stable
Marc> and can have corruption problems (although in my case I had
Marc> clear problems due to an underlying unreliable SATA subsystem
Marc> which caused writes not to make it to all the blocks of each
Marc> drive of a raid set, something that even careful journalling
Marc> does not deal with with). So, I have:

Man, you love living dangerously! *grin*

Marc> 5) when things go wrong with btrfs, you're better off having smaller
Marc> filesystems with less data as they are quicker to check and repair as
Marc> well we quicker to rebuild if they are corrupted beyond repair
Marc> (btrfs can easily get into a state where all or most of your data is
Marc> still there read only, but the filesystem has extent issues that can't
Marc> be fixed at this moment and require a rebuild)

Ouch! You really enjoy living on the edge. :-)

Marc> Am I crazy to want to use dm-thin the way I'm trying to? :)

I think you're a little crazy using btrfs in this way, *grin* since
losing my data is a big no-no in my world. Personally I love my Netapps
because they're super reliable and super easy to grow-shrink volumes
and snapshots just work, along with cloning volumes across to other
systems.

But I also agree that backups are a pain in the ass, no matter how you
look at it, and it's only gotten worse as filesystem size, and file
counts have gone up, but underlying filesystems and such haven't
managed to keep up.

Good luck for sure!
John
Marc MERLIN
2018-07-27 23:35:18 UTC
Post by John Stoffel
That's the key part that I didn't realize. And this is why I'm still
leery of btrfs (and zfs for that matter) since as you push the limits,
they tend to fall off a cliff performance-wise, instead of degrading
more gracefully. So you're obviously also using source btrfs
volume(s) for your data being backed up. So I can understand what
you're trying to do...
So is it a single 14TB source btrfs volume, and did you make snapshots
on a rotating basis to the destination volumes?
Maybe we should continue this on the btrfs list, I don't want to spam
people here who don't care about btrfs :) but I'll answer this and if we
continue, let's move lists if you don't mind.

btrfs send/receive needs a snapshot for each copy. I then have a script
that decides that I keep X of the older snapshots I don't need anymore
for send/receive to work, but that I keep around for posterity.

Snapshots do not actually cause performance issues that I've noticed day
to day with btrfs, but if you do quotas, or balance (which is a
complicated operation), or btrfsck, then the number of snapshots
matters, and performance gets hurt quite a bit if you have 270
snapshots, like I ended up having in the end :)
Post by John Stoffel
Marc> 4) because I still need the same amount of backups and want the same
Marc> amount of history, fewer subvolumes means moving each separate subvolume
Marc> into its own separate filesystem.
So you're doing snapshots of source sub-volumes? I figure you must be
running into performance problems no matter which end you're talking
about here, because the btrfs stuff is just going to bite you one way
or another.
Not really, performance was fine. It was so much better than using
rsync (sometimes by 100x or more)
But yeah, send/receive makes a snapshot of the source, and leaves a
snapshot on the destination volume.
You can work with only 2 snapshots, but I keep more for historical
restores.
Post by John Stoffel
Marc> Then there is the last part that btrfs is still not super stable
Marc> and can have corruption problems (although in my case I had
Marc> clear problems due to an underlying unreliable SATA subsystem
Marc> which caused writes not to make it to all the blocks of each
Marc> drive of a raid set, something that even careful journalling
Man, you love living dangerously! *grin*
Is it a good time to say that I actually use all of this on one
filesystem?

mdadm raid5
bcache
dmcrypt
dm-thin
lvm
btrfs

:)
Post by John Stoffel
I think you're a little crazy using btrfs in this way, *grin* since
losing my data is a big no-no in my world. Personally love my Netapps
because they're super reliable and super easy to grow-shrink volumes
and snapshots just work, along with cloning volumes across to other
systems.
I used to work at netapp, they're great, but they don't work inside my
laptop, obviously they're not open source and I'd rather avoid using
NFS if I can at this point (ok, they also do iscsi).

Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
.... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 7F55D5F27AAF9D08
Chris Murphy
2018-07-31 04:52:53 UTC
Post by John Stoffel
Why don't you run quotas on your filesystems? Also, none of the
filesystems in Linux land that I'm aware of supports shrinking the
filesystem while live, it's all a unmount, shrink FS, shrink volume
(carefully!) and then re-mount the filesystem.
Btrfs supports grow and shrink resizes only when mounted. It's not
possible to resize when unmounted.
--
Chris Murphy
John Stoffel
2018-08-01 01:33:13 UTC
Post by John Stoffel
Why don't you run quotas on your filesystems? Also, none of the
filesystems in Linux land that I'm aware of supports shrinking the
filesystem while live, it's all a unmount, shrink FS, shrink volume
(carefully!) and then re-mount the filesystem.
Chris> Btrfs supports grow and shrink resizes only when mounted. It's
Chris> not possible to resize when unmounted.

That's... bizarre. Good to know, but bizarre. That does make it more
appealing to use in day to day situations for sure. Any thoughts on
how stable this is in real life?

John
Chris Murphy
2018-08-01 02:43:38 UTC
Post by John Stoffel
Post by John Stoffel
Why don't you run quotas on your filesystems? Also, none of the
filesystems in Linux land that I'm aware of supports shrinking the
filesystem while live, it's all a unmount, shrink FS, shrink volume
(carefully!) and then re-mount the filesystem.
Chris> Btrfs supports grow and shrink resizes only when mounted. It's
Chris> not possible to resize when unmounted.
That's... bizarre. Good to know, but bizarre. That does make it more
appealing to use in day to day situations for sure. Any thoughts on
how stable this is in real life?
I've never heard of it failing in many years of being on the Btrfs
list. The resize leverages the same block group handling as balance
code, so the relocation of block groups during resize is the same as
you'd get with a filtered balance, it's integral to the file system's
operation.

The shrink operation first moves block groups in the region subject to
shrink (the part that's going away), and this is an atomic operation
per block group. You could pull the plug on it (and I have) in
progress and you'd just get a reversion to a prior state before the
last file system metadata and superblock commit (assumes the hardware
isn't lying and some hardware does lie). Once all the block groups are
moved, and the dev and chunk trees are updated to reflect the new
location of those chunks (block groups), the superblocks are updated
to reflect the new device size.

Literally the shrink operation changes very little metadata, it's just
moving block groups, and then the actual "resize" is merely a
superblock change. The file system metadata doesn't change much
because Btrfs uses an internal logical block addressing to reference
file extents and those references stay the same during a resize. The
logical block range mapping to physical block range mapping is a tiny
update (maybe 1/2 dozen 16K leaf and node writes) and those updates
are always COW, not overwrites. That's also how this is an atomic
operation. If the block group copy fails, the dev and chunk trees that
are used to translate between logical and physical block ranges never
get updated.
--
Chris Murphy
Chris Murphy
2018-08-02 17:42:16 UTC
Post by Chris Murphy
Post by John Stoffel
Post by John Stoffel
Why don't you run quotas on your filesystems? Also, none of the
filesystems in Linux land that I'm aware of supports shrinking the
filesystem while live, it's all a unmount, shrink FS, shrink volume
(carefully!) and then re-mount the filesystem.
Chris> Btrfs supports grow and shrink resizes only when mounted. It's
Chris> not possible to resize when unmounted.
That's... bizarre. Good to know, but bizarre. That does make it more
appealing to use in day to day situations for sure. Any thoughts on
how stable this is in real life?
I've never heard of it failing in many years of being on the Btrfs
list. The resize leverages the same block group handling as balance
code, so the relocation of block groups during resize is the same as
you'd get with a filtered balance, it's integral to the file system's
operation.
The shrink operation first moves block groups in the region subject to
shrink (the part that's going away), and this is an atomic operation
per block group. You could pull the plug on it (and I have) in
progress and you'd just get a reversion to a prior state before the
last file system metadata and superblock commit (assumes the hardware
isn't lying and some hardware does lie). Once all the block groups are
moved, and the dev and chunk trees are updated to reflect the new
location of those chunks (block groups), the superblocks are updated
to reflect the new device size.
Literally the shrink operation changes very little metadata, it's just
moving block groups, and then the actual "resize" is merely a
superblock change. The file system metadata doesn't change much
because Btrfs uses an internal logical block addressing to reference
file extents and those references stay the same during a resize. The
logical block range mapping to physical block range mapping is a tiny
update (maybe 1/2 dozen 16K leaf and node writes) and those updates
are always COW, not overwrites. That's also how this is an atomic
operation. If the block group copy fails, the dev and chunk trees that
are used to translate between logical and physical block ranges never
get updated.
--
Chris Murphy
Also, fs resize always happens when doing device add or device remove.
So resize is integral for Btrfs multiple device support. Device add
and remove can likewise only be done while the file system is mounted.
Removing a device means migrating block groups off that device,
shrinking the file system by an amount identical to the device size,
updating superblocks on remaining devices, and wiping the Btrfs
signature on the removed device. And there are similar behaviors when
converting block group profiles: e.g. from single to raid1, single to
DUP, DUP to single, raid5 to raid6 or vice versa and so on.
Conversions are only possible while the file system is mounted.

LVM pvmove isn't entirely different in concept. The LVM extents are
smaller (4MB by default) than Btrfs block groups (dynamically variable
in size, but most typically 1GiB for data bg's, 256MB for
metadata bg's, and 32MB for system bg's; Btrfs block groups are
collections of extents). But basically the file system just keeps on
reading and writing to its usual LBA's which are abstracted and
translated into real physical LBA's and a device by LVM. I don't know
how atomic pvmove is without the --atomic flag, or what the chances
of resuming pvmove are after a crash or an urgent reboot.

The gotcha with ext4 and XFS is they put filesystem metadata in fixed
locations on a block device, so those all have to be relocated to new
fixed positions based on the new block device size, as well as the data.
The shrink operation is probably sufficiently complicated for ext2/3/4
that they just don't want concurrent read/write operations happening
while shrinking. The resize also introduces inherent inefficiency
in subsequent operation: the greater the difference between the mkfs
volume size and the resized size, the greater the inefficiency. That
applies to both ext4 and XFS, whether shrinking or growing. Of course XFS
doesn't have shrink at all; the expectation for its more sophisticated
use cases was that it would only ever be grown.

Whereas Btrfs has no fixed locations for any of its block groups, so
from its perspective a resize is just not that unique of an operation,
leveraging code that's regularly exercised in normal operation anyway.
And it also doesn't suffer from any resize inefficiencies either; in
fact depending on the operation it might become more efficient.

Anyway, probably a better way of handling shrink with ext4 and XFS is
having them on LVM thin volumes, and just using fstrim to remove
unused LVM extents from the LV, releasing them back to the pool for
use by any other LV in that pool. It's not exactly the same thing as a
shrink of course, but if the idea is to let a file system use the
unused but "reserved" space of a second file system, merely trimming
the second file system on a thin LV does achieve that. Bigger issue
here is you can't then shrink the pool, so you can still get stuck in
some circumstances.
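
A minimal sketch of that trimming step, assuming fstrim from util-linux. The helper name trim_commands and the mount points are placeholders, not paths from this thread; the helper only prints the commands, so nothing is trimmed until its output is run as root.

```shell
#!/bin/sh
# Hypothetical helper: print one fstrim command per mount point.
# Pipe the output to sh (as root) to actually run the commands.
trim_commands() {
    for mnt in "$@"; do
        echo "fstrim -v $mnt"
    done
}

# Placeholder mount points for two file systems on thin LVs.
trim_commands /mnt/fs1 /mnt/fs2
```

After a trim, the pool's allocated data percentage reported by lvs should drop if the file system had freed blocks that the pool still considered in use.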
--
Chris Murphy
Marc MERLIN
2018-07-31 02:44:10 UTC
Post by Marc MERLIN
Hi Zdenek,
Thanks for your helpful reply.
Hi again Zdenek,

Just to confirm, am I going to be ok enough with the scheme I described
as long as I ensure that 'Allocated pool data' does not get to 100% ?

For now, I have my btrfs filesystems mounted with "discard", so
hopefully it should tell dm-thin when it can free up/reuse blocks.

Given that, am I more or less ok using dm-thin that way?

And for my own understanding, is there any reason why I would even want
to consider thin_pool_autoextend_threshold < 100 ?

As a reminder, I have:
mdadm raid5
bcache
dmcrypt
dm-thin
lvm
btrfs

Thanks,
Marc
Post by Marc MERLIN
Post by Zdenek Kabelac
Post by Marc MERLIN
Still learning about thin volumes.
Why do I want my thin pool to get auto extended? Does "extended" mean
resized?
yes extension == resize
Gotcha. Then I don't want to have to worry about my filesystem being resized
multiple times, especially since I'm not sure how it will help.
Post by Zdenek Kabelac
man lvmthin.
Thanks. Had read it, but not carefully enough.
So, I just re-read "Automatic extend settings"
I'm still not entirely sure how using extension would help me there. I
can't set it to 10% for all 10 filesystems (50% is minimum).
If I set it to anything less than 100%, it could block,
and try to extend and resize later, but ultimately I'll still have multiple
filesystems that together exceed the space available, so I can run out.
I'm not seeing how the automatic extend setting is helpful, at least in my case.
Am I missing something?
To be clear, my case is that I will have 10 filesystems in a place where the
same data was in a single filesystem that sadly I must segment now. More
than a few will take more than 1/10th of the space, but I don't want to have
to worry about which ones are going to use how much as long as all together
they stay below 100% of course.
I don't want to have to manage space for each of those 10 and have to resize
them by hand multiple times up and down to share the space, hence dm-thin.
My understanding is that I have to watch this carefully
  LV Name                thinpool2
  VG Name                vgds2
  LV Pool metadata       thinpool2_tmeta
  LV Pool data           thinpool2_tdata
  LV Status              available
  # open                 8
  LV Size                14.50 TiB
  Allocated pool data    20.26%
  Allocated metadata     10.66%
I'll have to make sure to run fstrim so that 'Allocated pool data' never
gets too high.
Metadata, I need to read more about to see whether that may become a problem.
I think as long as I don't use LVM snapshots I should be ok (and I won't).
Post by Zdenek Kabelac
Running out-of-space in thin-pool (data and even more on metadata) will
have always MAJOR impact on usability of your system. It's always
unpleasant moment and it's not even closely comparable with something like
running out-of-space in your filesystem - it's much more problematic case -
so you should at all cost try to avoid it.
Thanks for confirming.
I suppose in my case I should set 'errorwhenfull y' so that the FS immediately
remounts read only on write failure. Delaying for up to 60 seconds is not
going to help in my case.
Post by Zdenek Kabelac
If you want to be living on corner case of out-of-space, thin-pool is
probably not the best technology for use.
I don't want to be using dm-thin at all, but I have too many subvolumes for
a single btrfs filesystem, so I need to segment my btrfs filesystem into 10
or so, to be safe (as discussed with btrfs developers)
Post by Zdenek Kabelac
IMHO bad plan to combine 2 overprovisioning technologies together.
btrfs HAS its own built-in volume manager (aka its own built-in equivalent of lvm)
btrfs does not over provision, and sadly I found out that if you have more
than 50 or 100 snapshots, you are going to run into problems with balancing,
and bigger problems with filesystem corruption and repair later (as I found
out over the last 3 weeks dealing with this)
Post by Zdenek Kabelac
Post by Marc MERLIN
There is however an issue with btrfs where it gets more unsafe (and
slower) to use if you have too many snapshots (over 50, and especially
over 100).
It's better to pair thin-pool with ext4 or XFS.
I need btrfs send/receive, so that's not an option.
Post by Zdenek Kabelac
BTRFS will suffer great pain from problems of lvm2 snapshots - where btrfs
I will not be using lvm snapshots at all.
Post by Zdenek Kabelac
will see the very same block device multiple times present in your system -
so I'd highly discourage usage of thin-pool with btrfs unless you are very
well aware of the weaknesses and you can avoid running into them...
I'm only using thin-pool to allow dynamic block allocation for over
provisioning. I will use no other LVM feature. Is that ok?
Post by Zdenek Kabelac
Possible loss of your data in case you run out of space and you hit some
corner cases - note that only with the 4.18 kernel will one quite annoying
bug be fixed, involving the usage of TRIM on a full pool, which could have
led to some problematic metadata recovery.
So, as long as I run trim in btrfs and make very sure I don't run out of blocks
on the VG side, should I be safe-ish enough?
Thanks,
Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
.... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 7F55D5F27AAF9D08
Zdenek Kabelac
2018-07-31 12:35:42 UTC
Permalink
Post by Marc MERLIN
Post by Marc MERLIN
Hi Zdenek,
Thanks for your helpful reply.
Hi again Zdenek,
Just to confirm, am I going to be ok enough with the scheme I described
as long as I ensure that 'Allocated pool data' does not get to 100% ?
For now, I have my btrfs filesystems mounted with "discard", so
hopefully it should tell dm-thin when it can free up/reuse blocks.
Given that, am I more or less ok using dm-thin that way?
And for my own understanding, is there any reason why I would even want
to consider thin_pool_autoextend_threshold < 100 ?
Hi

If you monitor the amount of free space for data AND for metadata in the
thin-pool yourself, you can easily keep threshold == 100.

Just don't forget that when you upsize 'data', you should typically also
extend metadata. It's a not uncommon issue that a user starts with small
'data' & 'metadata' LVs in a thin-pool, then continues to extend only the
thin-pool 'data' volume and ignores/forgets about metadata completely,
and hits a full metadata device, which can lead to many troubles
(hitting a full data LV is normally not a big deal).
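
A minimal sketch of what "monitor it yourself" could look like, assuming a cron-style check. The function name check_pool and the 80% limit are illustrative choices, not anything LVM prescribes. The percentages would normally come from lvs, as shown in the comment; here the values from the lvdisplay output earlier in the thread are passed in directly so the snippet runs without a pool.

```shell
#!/bin/sh
# check_pool DATA_PERCENT META_PERCENT [LIMIT]
# Prints a warning for each value at or above LIMIT (default 80) and
# returns non-zero if any warning was printed.
check_pool() {
    data=${1%%.*}      # drop the fractional part: 20.26 -> 20
    meta=${2%%.*}
    limit=${3:-80}
    status=0
    if [ "$data" -ge "$limit" ]; then
        echo "WARNING: pool data at $1% (limit $limit%)"
        status=1
    fi
    if [ "$meta" -ge "$limit" ]; then
        echo "WARNING: pool metadata at $2% (limit $limit%)"
        status=1
    fi
    return $status
}

# With a real pool (needs root), the values could be read like this:
#   set -- $(lvs --noheadings -o data_percent,metadata_percent vgds2/thinpool2)
#   check_pool "$1" "$2" 80
# If metadata runs short while the VG still has free extents, it can be
# grown separately, for example:
#   lvextend --poolmetadatasize +1G vgds2/thinpool2
check_pool 20.26 10.66 && echo "pool usage OK"
```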

Regards

Zdenek
Marc MERLIN
2018-07-31 21:17:06 UTC
Permalink
Post by Zdenek Kabelac
If you monitor the amount of free space for data AND for metadata in the
thin-pool yourself, you can easily keep threshold == 100.
Understood. Two things:
1) basically threshold < 100 allows you to hit the limit, have LVM pause
IO, allocate more blocks, and resize the filesystem for you.
However, if you're not monitoring this, it's ultimately just the same as
having threshold = 100 and hoping that you won't hit the limit, except
that you're adding the complexity of resizes in the mix. Correct?

2) I wasn't quite clear on what metadata was used for, and I let
lvcreate pick a default amount for me. Am I correct that it basically
tracks block usage and maybe LVM snapshots that I'm not going to use,
and that therefore if I don't resize my LV, I don't really have to
worry about metadata running out?
Post by Zdenek Kabelac
Just don't forget that when you upsize 'data', you should typically also
extend metadata. It's a not uncommon issue that a user starts with small
'data' & 'metadata' LVs in a thin-pool, then continues to extend only the
thin-pool 'data' volume and ignores/forgets about metadata completely,
and hits a full metadata device, which can lead to many troubles
(hitting a full data LV is normally not a big deal).
Thanks for the warning. Given that I started with the maximum size and
don't plan on ever extending (to be fair, I can't), I should be ok
there, correct?

Thanks,
Marc
Zdenek Kabelac
2018-08-01 11:37:30 UTC
Permalink
Post by Marc MERLIN
Post by Zdenek Kabelac
If you monitor the amount of free space for data AND for metadata in the
thin-pool yourself, you can easily keep threshold == 100.
1) basically threshold < 100 allows you to hit the limit, have LVM pause
IO, allocate more blocks, and resize the filesystem for you.
However, if you're not monitoring this, it's ultimately just the same as
having threshold = 100 and hoping that you won't hit the limit, except
that you're adding the complexity of resizes in the mix. Correct?
Sure thing, when there is no free space to extend your overprovisioned
thin-pool and you run out-of-space you hit the limit at some point....
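
To make the threshold/percent mechanics concrete, here is a small arithmetic sketch. It does not touch LVM; it just reproduces the kind of worked example given in the lvm.conf comments for thin_pool_autoextend_threshold = 70 and thin_pool_autoextend_percent = 20. The function name autoextend_steps and the 1000 MiB starting size are illustrative assumptions. Note that what grows is the pool itself, and only while the VG still has free extents; with a pool that already fills the VG, as in this thread, there is nothing left to extend into.

```shell
#!/bin/sh
# autoextend_steps SIZE_MIB THRESHOLD PERCENT STEPS
# For each step, prints the usage (MiB) that triggers an extension and
# the pool size (MiB) after growing by PERCENT of its current size.
autoextend_steps() {
    size=$1; threshold=$2; percent=$3; steps=$4
    i=0
    while [ "$i" -lt "$steps" ]; do
        trigger=$(( size * threshold / 100 ))
        size=$(( size + size * percent / 100 ))
        echo "usage > ${trigger} MiB -> pool grows to ${size} MiB"
        i=$(( i + 1 ))
    done
}

autoextend_steps 1000 70 20 2
```

The values currently in effect on a system can be checked with `lvmconfig activation/thin_pool_autoextend_threshold activation/thin_pool_autoextend_percent`.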
Post by Marc MERLIN
2) I wasn't quite clear on what metadata was used for, and I let
lvcreate pick a default amount for me. Am I correct that it basically
tracks block usage and maybe LVM snapshots that I'm not going to use,
and that therefore if I don't resize my LV, I don't really have to
worry about metadata running out?
The kernel metadata stored in the _tmeta LV holds the mapping of all thin volumes,
i.e. which thin-pool chunk belongs to which thin volume.
Post by Marc MERLIN
Post by Zdenek Kabelac
Just don't forget that when you upsize 'data', you should typically also
extend metadata. It's a not uncommon issue that a user starts with small
'data' & 'metadata' LVs in a thin-pool, then continues to extend only the
thin-pool 'data' volume and ignores/forgets about metadata completely,
and hits a full metadata device, which can lead to many troubles
(hitting a full data LV is normally not a big deal).
Thanks for the warning. Given that I started with the maximum size and
don't plan on ever extending (to be fair, I can't), I should be ok
there, correct?
Yep - once you make the metadata ~16GiB you can't make it any bigger (a hard
internal limitation of the existing thin-pool target implementation).

But you still need to remember that you can run out of space in your metadata
if there is heavy usage of many large thin volumes - so the amount of free
space should always be monitored somehow...
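
For a feel of how metadata demand scales, the kernel's thin-provisioning documentation suggests roughly 48 * data_size / chunk_size bytes as a guide (more if many snapshots record a lot of change). The sketch below applies that rule of thumb to the 14.50 TiB pool from this thread. The chunk sizes are assumptions, since the thread does not state the pool's actual chunk size, and thin_meta_bytes is a made-up helper name.

```shell
#!/bin/sh
# thin_meta_bytes DATA_BYTES CHUNK_BYTES
# Rule-of-thumb metadata size in bytes: 48 * data size / chunk size.
thin_meta_bytes() {
    echo $(( 48 * $1 / $2 ))
}

# 14.5 TiB expressed in bytes: 29 * 2^39
pool_bytes=$(( 29 * (1 << 39) ))

# Chunk sizes below are assumed values for illustration only.
for chunk_kib in 64 256 1024; do
    bytes=$(thin_meta_bytes "$pool_bytes" $(( chunk_kib * 1024 )))
    echo "chunk ${chunk_kib} KiB -> about $(( bytes / 1048576 )) MiB of metadata"
done
```

Smaller chunks mean more mappings to track, so the estimate rises quickly as the chunk size shrinks; this is only an estimate and is no substitute for watching the real 'Allocated metadata' figure.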


Regards

Zdenek
