Discussion:
[linux-lvm] Add udev-md-raid-safe-timeouts.rules
Chris Murphy
2018-04-16 01:04:15 UTC
I just ran into this:
https://github.com/neilbrown/mdadm/pull/32/commits/af1ddca7d5311dfc9ed60a5eb6497db1296f1bec

This solution is inadequate. Can it be made more generic? This isn't
an md-specific problem; it affects Btrfs and LVM as well, and in fact
raid0 and even non-RAID setups.

There is no good reason to prevent deep recovery, which is what
happens with the default command timer of 30 seconds, with this class
of drive. Basically that value is going to cause data loss for the
single device and also raid0 case, where the reset happens before deep
recovery has a chance. And even if deep recovery fails to return user
data, what we need to see is the proper error message: read error UNC,
rather than a link reset message which just obfuscates the problem.
--
Chris Murphy
Austin S. Hemmelgarn
2018-04-16 11:43:39 UTC
This has been discussed at least once here before (probably more times,
hard to be sure since it usually comes up as a side discussion in an
only marginally related thread). Last I knew, the consensus here was
that it needs to be changed upstream in the kernel, not by adding a udev
rule because while the value is technically system policy, the default
policy is brain-dead for anything but the original disks it was
intended for (30 seconds works perfectly fine for actual SCSI devices
because they behave sanely in the face of media errors, but it's
horribly inadequate for ATA devices).

To re-iterate what I've said before on the subject:

For ATA drives it should probably be 150 seconds. That's 30 seconds
beyond the typical amount of time most consumer drives will keep
retrying a sector, so even if it goes the full time to try and recover a
sector this shouldn't trigger. The only people this change should
negatively impact are those who have failing drives which support SCT
ERC and have it enabled, but aren't already adjusting this timeout.

For physical SCSI devices, it should continue to be 30 seconds. SCSI
disks are sensible here and don't waste your time trying to recover a
sector. For PV-SCSI devices, it should probably be adjusted too, but I
don't know what a reasonable value is.

For USB devices it should probably be higher than 30 seconds, but again
I have no idea what a reasonable value is.
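
The per-class values suggested above could be sketched roughly as follows. This is only an illustration in Python, not a proposed kernel change: the class names and the `recommended_timeout` and `apply_timeout` helpers are hypothetical, and the only real interface assumed is the per-device command timer exposed at /sys/block/<dev>/device/timeout. Classes for which no value was suggested (USB, PV-SCSI) return None and are left alone.

```python
from pathlib import Path
from typing import Optional

# Values taken from the suggestions above; classes with no suggested
# value (USB, PV-SCSI) are deliberately left out.
SUGGESTED_TIMEOUTS = {
    "ata": 150,   # 30 s beyond the typical consumer-drive retry time
    "scsi": 30,   # current default, fine for physical SCSI disks
}


def recommended_timeout(device_class: str) -> Optional[int]:
    """Return the suggested command timer in seconds, or None if unknown."""
    return SUGGESTED_TIMEOUTS.get(device_class.lower())


def apply_timeout(dev: str, device_class: str,
                  sysfs_root: str = "/sys") -> Optional[int]:
    """Write the suggested timer for `dev` (e.g. "sda").

    Needs root on a real system; `sysfs_root` is a hypothetical
    parameter that lets the sketch be tried against a scratch directory.
    """
    seconds = recommended_timeout(device_class)
    if seconds is None:
        return None  # no suggestion for this class: keep the current value
    timer = Path(sysfs_root) / "block" / dev / "device" / "timeout"
    timer.write_text(f"{seconds}\n")
    return seconds
```

How a device gets sorted into a class is left open here; that is the hard part, and it is not something the sketch tries to solve.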
Roger Heflin
2018-04-16 15:19:21 UTC
And then there are SAN devices managed by multipath, where the timeouts
should maybe be even lower. I know in the SCSI layer there are some
extra retries going on, and that the actual timeout hits at 5x the
base timeout. So there kind of is a soft timeout on SAN devices at
the base timeout, though in the SAN case at least, no messages
indicating that the system is having issues appear before the 5x
timeout message.

For mdraid it should almost be a parameter defined on the md-device to
override the timeout since one could have some disks with ERC and some
without. Multipath.conf has a setting fast_io_fail_tmo that is
supposed to set the scsi timeout to that value if set.
Post by Austin S. Hemmelgarn
This has been discussed at least once here before (probably more times,
hard to be sure since it usually comes up as a side discussion in an only
marginally related thread).
Sorry, but where is "here"? This message is cross-posted to about three
lists at least ...
Post by Austin S. Hemmelgarn
Last I knew, the consensus here was that it needs to be changed upstream
in the kernel, not by adding a udev rule because while the value is
technically system policy, the default policy is brain-dead for anything
but the original disks it was intended for (30 seconds works perfectly
fine for actual SCSI devices because they behave sanely in the face of
media errors, but it's horribly inadequate for ATA devices).
imho (and it's probably going to be a pain to implement :-) there should be
a soft time-out and a hard time-out. The soft time-out should trigger "drive
is taking too long to respond" messages that end up in a log - so that
people who actually care can keep track of this sort of thing. The hard
timeout should be the current set-up, where the kernel just gives up.
Cheers,
Wol
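
The soft/hard split suggested above might look something like the following. It is a minimal Python sketch of the idea only, not how the kernel's timer handling is written; the `classify_wait` name and the three result strings are hypothetical.

```python
def classify_wait(elapsed: float, soft: float, hard: float) -> str:
    """Classify how long a command has been outstanding.

    soft: seconds after which a "drive is taking too long to respond"
          message would be logged.
    hard: seconds after which the command is abandoned, as happens today.
    """
    if soft > hard:
        raise ValueError("soft timeout must not exceed hard timeout")
    if elapsed >= hard:
        return "give-up"
    if elapsed >= soft:
        return "warn"
    return "ok"
```

With a soft limit of 30 and a hard limit of 150, a command outstanding for 45 seconds would only produce a log message, while one outstanding for 150 seconds would be given up on.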
Chris Murphy
2018-04-16 17:10:16 UTC
Adding linux-usb@ and linux-scsi@
(This email does contain the thread initiating email, but some replies
are on the other lists.)

I don't know how all of this is designed but it seems like there's
only one location for the command timer, and the SCSI driver owns it,
and then everyone else (ATA and USB and for all I know SAN) are on top
of that and lack any ability to have separate timeouts.

The nice thing about the udev rule is that it tests for SCT ERC before
making a change. There certainly are enterprise and almost enterprise
"NAS" SATA drives that have short SCT ERC times enabled out of the box
- and the udev method makes them immune to the change.
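
The "test for SCT ERC first" logic can be sketched as follows. This is a Python illustration of the decision, not the rule from the pull request: it assumes the text comes from `smartctl -l scterc <device>`, that an enabled limit is printed as a "Read:" line with a value in tenths of a second, and that a disabled or unsupported drive prints no such number. The function names and the 150-second raised value are placeholders.

```python
import re
from typing import Optional


def erc_read_seconds(smartctl_output: str) -> Optional[float]:
    """Return the SCT ERC read limit in seconds, or None if it is
    disabled or the drive does not support SCT ERC."""
    match = re.search(r"^\s*Read:\s+(\d+)\s", smartctl_output, re.MULTILINE)
    if match is None:
        return None
    return int(match.group(1)) / 10.0  # value is assumed to be deciseconds


def new_command_timer(smartctl_output: str,
                      current: int = 30, raised: int = 150) -> int:
    """Leave the timer alone when ERC is enabled, otherwise raise it."""
    if erc_read_seconds(smartctl_output) is not None:
        return current  # drive gives up on its own well before the timer
    return raised       # drive may attempt deep recovery: give it time
```

A drive that reports a 7.0 second read limit keeps the 30 second timer; one that reports ERC as disabled or unsupported gets the longer one.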
--
Chris Murphy
Austin S. Hemmelgarn
2018-04-17 11:28:42 UTC
Post by Chris Murphy
I don't know how all of this is designed but it seems like there's
only one location for the command timer, and the SCSI driver owns it,
and then everyone else (ATA and USB and for all I know SAN) are on top
of that and lack any ability to have separate timeouts.
On the note of SAN, iSCSI is part of the SCSI subsystem, so it gets
applied directly there. I'm pretty sure NBD has its own thing, and I
think the same is true of ATAoE.

As far as USB, UMS is essentially a stripped down version of SCSI with
its own limitations, and UAS _is_ SCSI, with both of those having
pretty much always been routed through the SCSI subsystem.
Post by Chris Murphy
The nice thing about the udev rule is that it tests for SCT ERC before
making a change. There certainly are enterprise and almost enterprise
"NAS" SATA drives that have short SCT ERC times enabled out of the box
- and the udev method makes them immune to the change.
The kernel could just as easily look for that too though. From what
I've seen however, other failure sources that wouldn't trigger SCT ERC
on SATA drives are really rare; usually it means a bad cable, bad drive
electronics, or a bad storage controller, so I don't think having it set
really high for SCT ERC enabled drives is likely to be much of an issue
most of the time.

Austin S. Hemmelgarn
2018-04-17 11:15:25 UTC
Post by Austin S. Hemmelgarn
This has been discussed at least once here before (probably more
times, hard to be sure since it usually comes up as a side discussion
in an only marginally related thread).
Sorry, but where is "here"? This message is cross-posted to about three
lists at least ...
Oops, didn't see the extra lists listed. In this case, discussed
previously on the BTRFS ML.
Post by Austin S. Hemmelgarn
Last I knew, the consensus here was that it needs to be changed upstream
in the kernel, not by adding a udev rule because while the value is
technically system policy, the default policy is brain-dead for anything
but the original disks it was intended for (30 seconds works perfectly
fine for actual SCSI devices because they behave sanely in the face of
media errors, but it's horribly inadequate for ATA devices).
imho (and it's probably going to be a pain to implement :-) there should
be a soft time-out and a hard time-out. The soft time-out should trigger
"drive is taking too long to respond" messages that end up in a log - so
that people who actually care can keep track of this sort of thing.
The hard timeout should be the current set-up, where the kernel just
gives up.
Agreed, although as pointed out by Roger in his reply to this, it kind
of already works this way in some cases.
Alan Stern
2018-04-16 17:33:37 UTC
Post by Chris Murphy
I don't know how all of this is designed but it seems like there's
only one location for the command timer, and the SCSI driver owns it,
and then everyone else (ATA and USB and for all I know SAN) are on top
of that and lack any ability to have separate timeouts.
As far as mass-storage is concerned, USB is merely a transport. It
doesn't impose any timeout rules; the appropriate timeout value is
whatever the device at the end of the USB link needs. Thus, a SCSI
drive connected over USB could use a 30-second timeout, an ATA drive
could use 150 seconds, and so on.

Unfortunately, the only way to tell what sort of drive you've got is by
looking at the Vendor/Product IDs or other information provided by the
drive itself. You can't tell anything just from knowing what sort of
bus it's on.
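
As a rough illustration of looking at what the drive itself reports, the sketch below reads the vendor and model strings the SCSI layer exposes under /sys/block/<dev>/device/. The helper names and the `sysfs_root` parameter are hypothetical, and treating a vendor string of "ATA" as meaning an ATA disk is a heuristic assumption: it holds for disks attached through libata, but a drive behind a USB bridge typically reports whatever the bridge supplies.

```python
from pathlib import Path
from typing import Dict


def describe_disk(dev: str, sysfs_root: str = "/sys") -> Dict[str, str]:
    """Return the vendor and model strings reported for `dev` (e.g. "sda").

    Missing attributes come back as empty strings.
    """
    device_dir = Path(sysfs_root) / "block" / dev / "device"
    info = {}
    for attr in ("vendor", "model"):
        path = device_dir / attr
        info[attr] = path.read_text().strip() if path.exists() else ""
    return info


def looks_like_ata(info: Dict[str, str]) -> bool:
    """Heuristic only: libata-attached disks report the vendor as "ATA"."""
    return info.get("vendor", "") == "ATA"
```

Anything beyond that (matching particular models, for instance) would need a table of known devices, which is exactly the maintenance burden described above.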

Alan Stern