Post by Zdenek Kabelac
I don't know much about Grub, but I do know its lvm.c by heart now almost :p.
lvm.c by grub is mostly useless...
Then I feel we should take it out and not have grub capable of booting
LVM volumes anymore at all, right.
Post by Zdenek Kabelac
One of the things I don't think people would disagree with would be having one
- autoextend and waiting with writes so nothing fails
- no autoextend and making stuff read-only.
ATM user needs to write his own monitoring plugin tool to switch to
read-only volumes - it's really as easy as running bash script in loop.....
So you are saying every user of thin LVM must do this individually. That
means if there are 10,000 users, you now have 10,000 people needing to
write the same thing, each first having to acquire the knowledge of how
to do it.
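(For reference, the autoextend half of that choice already lives in lvm.conf; the values below are only examples, not defaults:)

```
# /etc/lvm/lvm.conf, activation section -- example values
thin_pool_autoextend_threshold = 80   # start extending when the pool hits 80%
thin_pool_autoextend_percent = 20     # grow the pool by 20% each time
```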
I take it by that loop you mean a sleep loop. It might also be that
logtail thing and then check for the dmeventd error messages in syslog.
Right? And then when you find this message, you remount ro. You have to
test a bit to make sure it works and then you are up and running. But
this does imply that this thing is only available to die-hard users. You
first have to be aware of what is going to happen. I tell you, there is
really not a lot of good documentation on LVM okay. I know there is that
LVM book. Let me get it....
First hit is CentOS. Second link is reddit. Third link is Redhat. Okay
it should be "lvm guide" not "lvm book". It hasn't been updated since 2006
and has no advanced information other than how to compile and install....
I mean: http://tldp.org/HOWTO/LVM-HOWTO/. So what people are really
going to know this stuff except the ones that are on this list?
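To make that concrete: I assume the kind of loop meant here is something like the sketch below. The names (vg/pool, /mnt/thin) and the 95% threshold are my guesses, not anything documented.

```shell
#!/bin/bash
# Poll the thin pool fill level and remount the filesystem read-only
# above a threshold. vg/pool and /mnt/thin are placeholder names.
THRESHOLD=95
while true; do
    # data_percent comes back like " 97.42"; strip spaces, keep integer part
    used=$(lvs --noheadings -o data_percent vg/pool | tr -d ' ')
    if [ "${used%%.*}" -ge "$THRESHOLD" ]; then
        mount -o remount,ro /mnt/thin
        break
    fi
    sleep 10
done
```

And that is exactly the kind of thing nobody should have to discover by reading the source.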
Unless you experiment, you won't know what will happen to begin with.
For instance, different topic, but it was impossible to find any real
information on LVM cache.
So now you want every single admin to have the knowledge (that you
obviously do have, but you are its writers and maintainers, its gods and
cohorts) to create a manual script, no matter how simple, that will
check the syslog, that you can only really know about by checking the
fucking source or running tests and then see what happens (and be smart
enough to check syslog) -- and then of course to write either a service
file for this script or put it in some form of rc.local.
Well that latter is easy enough even on my system (I was not even sure
whether that existed here :p).
But knowing about this stuff doesn't come by itself. You know. This
doesn't just fall from the sky.
I would probably be more than happy to write documentation at some point
(because I guess I did go through all of that to learn, and maybe others
shouldn't or won't have to?) but without this documentation, or this
person leading the way, this is not easy stuff.
Also "info" still sucks on Linux, the only really available resource
that is easy to use are man pages. It took me quite some time to learn
about all the available lvm commands to begin with (without reading an
encompassing manual), and imagine my horror when I was used to
Debian/Ubuntu systems automatically activating the vg upon opening a
LUKS container, but then the OpenSUSE rescue environment not doing that.
How to find out about vgchange -ay without having internet
access.........
It was impossible.
So for me it has been a hard road to begin with and I am still learning.
In fact I *had* read about vgchange -ay but that was months prior and I
had forgotten. Yes, bad sysadmin.
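For the record, the incantation I was missing in that rescue environment was roughly this (the device and mapper names are made up for the example; yours will differ):

```
cryptsetup luksOpen /dev/sda2 cryptvg   # open the LUKS container
vgchange -ay                            # activate all discovered volume groups
mount /dev/mapper/vg-root /mnt          # mapper name depends on your setup
```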
Every piece of effort a user has to expend on his own is a piece of
effort that could have been prevented by a developer, or possibly by a
(documentation) writer if such a thing could exist. And I know I can't
do it yet, if that is what you are asking or thinking.
Post by Zdenek Kabelac
We call them 'Request For Enhancements' BZ....
You mean you have a non-special non-category that only distinguishes
itself by having a [RFE] tag in the bug name, and that is your special
feature? (laughs a bit).
I mean I'm not saying it has to be anything special and if you have a
small system maybe that is enough.
But Bugzilla is just not an agreeable space to really inspire or invite
positive feedback like that.... I mean I too have been using bugzillas
for maybe a decade or longer. Not as a developer mostly, as a user. And
the thing is just a cynical place. I mean, LOOK at Jira:
https://issues.apache.org/jira/browse/log4j2/?selectedTab=com.atlassian.jira.jira-projects-plugin:issues-panel
Just an example. A "bug" is just one out of many categories. They have
issue types for Improvements, Brainstorming, New Feature, Question,
Story, and Wish. It is so entirely inviting to do whatever you want to
do. In BugZilla, a feature request is still just a bug. And in your
RedHat system, you just have added some field called "doc type" that
you've set to "enhancement" but that's it.
And a bug is a failure, it is a fault. The system is not meant for
positive feedback, only negative feedback in that sense. The user
experience of it is just vastly detrimental compared to that other
thing....
Well I didn't really want to go into this, but since you invited it
:pp....
But it is also meant for the coming thing. And I apologize.
Post by Zdenek Kabelac
First what I proposed would be for every thin volume to have a spare chunk.
But maybe that's irrelevant here.
Well the question was not asking for your 'technical' proposal, as you
have no real idea how it works and your visions/estimations/guesses
have no use at all (trust me - far deeper thinking was considered so
don't even waste your time to write those sentences...)
Well you can drop the attitude you know. If you were doing so great, you
would not be having a total lack of all useful documentation to begin
with. You would not have a system that can freeze the entire machine by
default, because "policy" is apparently not well done.
You would not be having to debate how to make the system even a little
bit safer, and excuse yourself every three lines by saying that it's the
admin's job to monitor his system, not your job to make sure he doesn't
need to do all that much, or your job to make sure the system is
fail-safe to begin with.
I mean I understand that it is a work in progress. But then don't act
like it is finished, or that it is perfect provided the administrator is
perfect too.
If I'm trying to do anything here, it is to point out that the system is
quite lacking by default. You say "policy, policy, policy" as though you
are very tired. And maybe I'm a bit less so, I don't know. And I know it
can be tiresome to have to make these... call them fine-tunings to
make sure they work well by default on every system. Especially, I don't
know. If it is a work in progress and not meant to be used by people not
willing to invest as much as you have (so to speak).
And I'm not saying you are doing a bad job in developing this. I think
LVM is one of the more sane systems existing in the Linux world today. I
mean, I wouldn't be here if I didn't like it, or if I wasn't grateful
for your work.
I think the commands themselves and their way of being used, is
outstanding, they are intuitive, they are much better than many other
systems out there (think mdadm). It takes hardly any pain to remember how
to use e.g. lvcreate, or vgcreate, or whatever. It is intuitive, it is
nice, sometimes you need a little lookup, and that is fast too. It is
bliss to use compared to other systems, certainly. Many of the
rudimentary things are possible, and the system is so nicely modular and
layered that it is always obvious what you need to do at whatever point.
Post by Zdenek Kabelac
Also forget you write a new FS - thinLV is block device so there is no
such think like 'fs allocates' space on device - this space is meant
to be there....
In this case, provided indeed none of that would happen (that we talked
about earlier), the filesystem doesn't NEED to allocate anything. But it
DOES know which parts of the block space it already has in use and which
parts it doesn't. If it is aware of this, and aware of the "real block
size" of the underlying device (which did do a form of allocation, as
LVM thin does), then suddenly it doesn't NEED to know about this
allocation other than to know that it is happening; it only needs to
know the alignment of the real blocks.
Of course that means some knowledge of the underlying device, but as
has been said earlier (by that other guy that supported it) this
knowledge is already there at some level and it would not be that weird.
Yes it is that "integration" you so despise.
You are *already* integrating e.g. extfs to more closely honour the
extent boundaries so that it is more efficient. What I am saying is not
at all out of the ordinary with that. You could not optimize if the
filesystem did not know about alignment, and if it could not "direct"
'allocation' into those aligned areas. So the filesystem already knows
what is going to happen down beneath, and it has the knowledge to choose
not to write to new areas unless it has to. You *told* me so.
That means it can also choose not to write to any NEW "aligned" blocks.
So you are just standing on principle here. You attack the idea based on the
fact that "there is no real allocation taking place of the block device
by the filesystem". But if you drop the word, there is no reason to
disagree with what I said.
The filesystem KNOWS allocation is getting done (or it could know) and
if it knows about the block alignment of those extents, then it does not
NEED to have intimate knowledge of the ACTUAL allocation getting done by
the thin volume in the thin pool.
So what are you really disagreeing with here? You are just being
pedantic right? You could tell the filesystem to enter
no-allocation-mode or no-write-to-new-areas-mode (same thing here) or
"no-cause-allocation-mode" (same thing here).
And it would work.
Even if you disagree with the term, it would still work. At least, as
far as we go here.
You never said it wouldn't work. You just disagreed with my use of
wording.
Post by Zdenek Kabelac
You have 2 thinLVs.
Origin + snapshot.
You write to origin - and you miss to write a block.
Such block may be located in 'fs' journal, it might be a 'data' block,
or fs metadata block.
Each case may have different consequences.
But that is for the filesystem to decide. The thin volume will not know
about the filesystem. In that sense. Layers, remember?
Post by Zdenek Kabelac
When you fail to write an ordinary (non-thin) block device - this
block is then usually 'unreadable/error' - but in thinLV case - upon
read you get previous 100% valid' content - so you may start to
imagine where it's all heading.
So you mean that "unreadable/error" signifies some form of "bad sector"
error. But if you fail to write to thinLV, doesn't that mean (in our
case there) that the block was not allocated by thinLV? That means you
cannot read from it either. Maybe bad example, I don't know.
Post by Zdenek Kabelac
Basically solving these troubles when pool is 'full' is 'too late'.
If user wants something 'reliable' - he needs to use different thresholds -
i.e. stopping at 90%....
Well I will try to look into it more when I have time. But I don't
believe you. I don't see a reason from the outset why it should or would
need to be so. There should be no reason a write fails unless an
allocate fails. So how could you ever read from it (unless you read
random or white data). And, provided the filesystem does try to read
from it; why would it do so if its write failed before that?
Maybe that is what you alluded to before, but a filesystem should be
able to solve that on its own without knowing those details I think. I
believe quite usually inodes are written in advance? They are not
growth-scenarios. So this metadata cannot fail to write due to a failed
block level allocate. But even that should be irrelevant for thin LVM
itself.....
Post by Zdenek Kabelac
But other users might be 'happy' with missing block (failing write
area) and rather continue to use 'fs'....
But now you are talking about human users. You are now talking about an
individual that tries to write to a thin LV, it doesn't work because the
thing is full, and he/she wants to continue to use the 'fs'. But that is
what I proposed right. If you have a fail-safe system, if you have a
system that keeps functioning even though it blocks growth writes, then
you have the best of both worlds. You have both.
It is not either/or. What I was talking about is both. You have
reliability and you can keep using the filesystem. The filesystem just
needs to be able to cope with the condition that it cannot use any new
blocks from the existing pool that it knows about. That is not very
different from having exhausted its block pool to begin with. It is
really the same condition, except right now it is rather artificial.
You artificially tell the FS: you are out of space. Or, you may not use
new (alignment) blocks. It is no different from having no free blocks at
all. The FS could deal with it in the same way.
Post by Zdenek Kabelac
You have many things to consider - but if you make policies too complex,
users will not be able to use it.
Users are already confused with 'simple' lvm.conf options like
'issue_discards'....
I understand. But that is why you create reasonable defaults that work
well together. I mean, I am not telling you you can't, or have done a
bad job in the past, or are doing a bad job now.
But I'm talking mostly about defaults. And right now I was really only
proposing this idea of a filesystem state that says "Me, the filesystem,
will not allocate any new blocks for data that are in alignment with the
underlying block device. I will not use any new (extents) from my block
device even though normally they would be available to me. I have just
been told there might be an issue, and even though I don't know why, I
will just accept that and try not to write there anymore".
It is really the simplest idea there can be here. If you didn't have
thin, and the filesystem was full, you'd have the same condition.
It is just a "stop expanding" flag.
Post by Zdenek Kabelac
Personally, I feel the condition of a filesystem getting into a "cannot
allocate" state, is superior.
As said - there is no thin-volume filesystem.
Can you just cut that, you know. I know the filesystem does not
allocate. But it does know, or can know, allocation will happen. It
might be aware of the "thin" nature, and even if it didn't, it could
still honour such a flag even if it wouldn't make sense for it.
Post by Zdenek Kabelac
However in this case it needs no other information. It is just a state. It
knows: my block devices has 4M blocks (for instance), I cannot get new ones
Your thinking is from 'msdos' era - single process, single user.
You have multiple thin volumes active, with multiple different users
all running their jobs in parallel and you do not want to stop every
user when you are recomputing space in pool.
There is really no much point in explaining further details unless you are
willing to spend your time understanding deeply surrounding details.
You are using details to escape the necessity that the overlying or
encompassing framework dictates that things do currently not work.
That is like using the trees to say that there is no forest.
Or not seeing the forest for the trees. That is exactly what it means. I
know I am a child here. But do not ignore the wisdom of a child. The
child knows more than you do. Even if it has much less data than you do.
The whole reason a child *can* know more is because it has less data.
Because of that, it can still see the outline, while you may no longer
be able to, because you are deep within the forest.
That's exactly what that saying means.
If you see planet earth from space and you see that it is turning or
maybe you can see its ice caps are melting. And then someone on earth
says "No that is not happening because such and such is so". Who is
right? The one with the overview, or the one with the details?
An outsider can often perceive directly what is the nature of something.
Only at the outside, of course. But he/she can clearly see whether it is
left or right, big or small, cold or hot. It may not know why it is
being hot or cold, but it does know that it is being cold or hot. And
the outsider may see there should be no reason why something cannot be
so.
If details are in the way, change the details.
By the above, with "user" you seem to mean a real human user. But a
filesystem queues requests, it does not have multiple users. It needs to
schedule whatever it is doing, but it all has to go through the same
channel, ending up on the same disk. So from this perspective, the only
relevant users are the various filesystems. This must be so, because if
two operating systems mount the same block device twice, you get mayhem.
So the filesystem driver is the channel. Whether it is one multitasking
process or multiple users doing the same thing, is irrelevant. Jobs, in
this sense, are also irrelevant. What is relevant is writes to different
parts, or reads from different parts.
But supposing those multiple users are multiple filesystems using the
same thin pool. Okay you have a point, perhaps. And indeed I do not know
about any delays in space calculations. I am just approaching this from
the perspective of a designer. I would not design it such that the data
on the amount of free extents, would at any one time be unavailable. It
should be available to all at any one time. It is just a number. It does
not or should not need recomputation. I am sorry if that is incorrect
here. If it does need recomputation, then of course what you say makes
sense (even to me) and that you need a time window to prepare for
disaster; to anticipate.
I don't see why a value like the number of free extents in a pool would
need recomputation though, but that is just me. Even if you had
concurrent writes (allocations/expansions) you should be able to deal
with that, people do that all the time.
The number of free extents is simply a given at any one time right?
Unless freeing them is a more involved operation. I'm just trying to
show you that there shouldn't need to be any problems here with this
idea.
Allocations should be atomic and even if they are concurrent, the
updating of this information shouldn't be concurrent. It is a single
number, only one person can change it at a time. It's a single number,
even if you wrote 10 million blocks concurrently, your system should be
able to change/increment that number 10 million times in the same time.
Right? I know you will say wrong. But this seems extraordinarily
strange to me.
I mean I am still wholly unaware of how concurrency works in the kernel
(except that I know the terms) (because I've been reading some code)
(such as RCU, refcount, spinlock, mutex, what else) but I doubt this
would be a real issue if you did it right, but that's just me.
If you can concurrently traverse data structures and keep everything
working in pristine order, you know, why shouldn't you be able to
'concurrently' update a number.
Maybe that's stupid of me, but it just doesn't make sense to me.
Post by Zdenek Kabelac
That seems pretty trivial. The mechanic for it may not. It is
preferable in my
view if the filesystem was notified about it and would not even *try* to write
There is no 'try' operation.
You have seen Star Wars too much. That statement is misunderstood, Yoda
tells a falsehood there.
There is a write operation that can fail or not fail.
Post by Zdenek Kabelac
It would probably O^2 complicate everything - and the performance would
drop by major factor - as you would need to handle cancellation....
Can you only think in troubles and worries? :P. I see you mean (I think)
that some writes would succeed and some would fail and that that would
complicate things? Other than that there is not much difference with a
read-only filesystem right?
A filesystem that cannot even write to any new blocks is dead anyway.
Why worry about performance in any case. It's a form of read-only mode
or space-full mode that is not very different from existing modes. It's
a single flag. Some writes succeed, some writes fail. System is almost
dead to begin with, space is gone. Applications start to crash left and
right. But at least the system survives.
Not sure what cancellation you are talking about or if you understood
what I said before.....
Post by Zdenek Kabelac
For simplicity here - just think about failing 'thin' write as a disk
with 'write' errors, however upon read you get last written
content....
So? And I still cannot see how that would happen. If the filesystem had
not actually written to a certain area, it would also not try to read,
right? Otherwise, the whole idea of "lazy allocation" of extents is
impossible. I don't actually know what happens if you "read" the entire
thin LV, and you could, but blocks that have never been allocated (by
thin LV) should just return zero. I don't think anything else would
happen?
I mean, there we go again: And of course the file contains nothing but
zeroes, duh. Reading from a "nonwritten" extent just returns zero space.
Obvious.
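That claim is easy enough to check on a freshly created thin LV, something like this (names are placeholders, and it needs a real thin pool, so consider it untested):

```
lvcreate -V 1G -T vg/pool -n scratch    # new, never-written thin volume
dd if=/dev/vg/scratch bs=1M count=1 2>/dev/null | hexdump -C
# expectation: nothing but 00 bytes, i.e. unprovisioned space reads as zeros
```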
There is no reason why a thin write should fail if it has succeeded
before to the same area. I mean, what's the issue here, you don't really
explain. Anyway I am grateful for your time explaining this, but it just
does not make much sense.
Then you can say "Oh I give up", but still, it does not make much sense.
Post by Zdenek Kabelac
'extX' will switch to 'ro' upon write failure (when configured this way).
Ah, you mean errors=remount-ro. Let me see what my default is :p. (The
man page does not mention the default, very nice....).
Oh, it is continue by default. Obvious....
In any case, that means if it had a 3rd mount option type (like rw,
ro, ..... rp for "read/partial" ;-)), it could also remount rp on
errors ;-).
Thanks for the pointers all.
Post by Zdenek Kabelac
'XFS' in 'most' cases now will shutdown itself as well (being improved)
extX is better since user may still continue to use it at least in
read-only mode...
Thanks. That is very welcome. But I need to be a complete expert to be
able to use this thing. I will write a manual later :p. (If I'm still
alive).
Post by Zdenek Kabelac
It seems completely obvious to me at this point, if anything from LVM (or
e.g. dmeventd) could signal every filesystem on every affected thin volume, to
enter a do-not-allocate state, and filesystems would be able to fail writes
based on that, you would already have a solution right?
'bash' loop...
I guess your --errorwhenfull y, combined with tune2fs -e remount-ro,
would also do the trick, but that works on ALL filesystem errors.
Like I said, I haven't tested it yet. Maybe we are covering nonsensical
ground here.
But a bash loop is no solution for a real system.....
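Spelled out, the untested combination I mean would be something like this (the pool and LV names are placeholders):

```
lvchange --errorwhenfull y vg/pool      # fail writes immediately when full
tune2fs -e remount-ro /dev/vg/thinlv    # extX: go read-only on first error
```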
Yes thanks for pointing it out to me. But this email is getting way too
long for me.
Anyway, we are also converging on the solution I'd like, so thank you
for your time here regardless.
Post by Zdenek Kabelac
Remember - not writing 'new' fs....
Never said I was. New state for existing fs.
Post by Zdenek Kabelac
You are preparing for lost battle.
Full pool is simply not a full fs.
And thin-pool may get out-of-data or out-of-metadata....
Does not have to be any different when the filesystem thinks and says it
is full.
You are not going from full pool to full filesystem. The filesystem is
not even full.
You are going from full pool, to a message to filesystems to enter
no-expand-mode (no-allocate-mode), which will then simply cease growing
into new "aligned" blocks.
What does it even MEAN to say that the two are not identical? I never
talked about the two being identical. It is just an expansion freeze.
Post by Zdenek Kabelac
That would normally mean that filesystem operations such as DELETE would still
You really need to sit and think for a while what the snapshot and COW
does really mean, and what is all written into a filesystem (included
with journal) when you delete a file.
Too tired now. I don't think deleting files requires growth of
filesystem. I can delete files on a full fs just fine.
You mean a deletion on origin can cause allocation on snapshot.
Still that is not a filesystem thing, that is a thin-pool thing.
That is something for LVM to handle. I don't think this delete would
fail, would it? If the snapshot is a block thing, it could write the
changed inodes of the file and its directory.... it would only overwrite
the actual data if that block was overwritten on origin.
So you run the risk of extent allocation for inodes.
But you have this problem today as well. It means clearing space could
possibly need a work buffer. Some workspace.
You would need to pre-allocate space for the snapshot, as a practical
measure. But that's not really a real solution.
The real solution is to buffer it in memory. If the deletes free space,
you get free extents that you can use to write the memory buffered data
(metadata). That's the only way to deal with that. You are just talking
inodes (and possibly journal).
(But then how is the snapshot going to know these are deletes? In any
case, you'd have the same problems with regular writes to origin. So I
guess with snapshots you run into more troubles.)
I guess with snapshots you either drop the snapshots or freeze the
entire filesystem/volume? Then how will you delete anything?
You would either have to drop a snapshot, drop a thin volume, or copy
the data first and then do that.
Right?
Too tired.
Post by Zdenek Kabelac
But one of our 'policies' visions are to also use 'fstrim' when some
threshold is reached or before thin snapshot is taken...
A filesystem mounted with the discard option will automatically do
that, right, with a slight delay, so to speak.
I guess it would be good to do that, or warn the user to mount with
"discard" option.
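I.e., either of these (device and mount point are placeholders), with the usual trade-off that online discard adds a little overhead per delete while batched fstrim returns space only periodically:

```
mount -o discard /dev/vg/thinlv /mnt/thin   # online discard on every delete
fstrim /mnt/thin                            # or batch trim via cron/systemd timer
```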