[linux-lvm] LVM on top of DRBD

Discussion:

k***@knebb.de

2017-01-08 18:58:42 UTC

Hi all,

I have to cross-post to both the LVM and the DRBD mailing lists, as I have no
clue where the issue is, if it's not a bug...

I cannot get LVM working on top of DRBD: I am getting I/O errors
followed by a "diskless" state.

Steps to reproduce:

Two machines.

A: CentOS7 x64; epel-provided packages
kmod-drbd84-8.4.9-1.el7.elrepo.x86_64
drbd84-utils-8.9.8-1.el7.elrepo.x86_64

B: CentOS6 x64; epel-provided packages
kmod-drbd83-8.3.16-3.el6.elrepo.x86_64
drbd83-utils-8.3.16-1.el6.elrepo.x86_64

drbd1.res:
resource drbd1 {
    protocol A;
    startup {
        wfc-timeout 240;
        degr-wfc-timeout 120;
        become-primary-on backuppc;
    }
    net {
        max-buffers 8000;
        max-epoch-size 8000;
        sndbuf-size 128k;
        shared-secret "13Lue=3";
    }
    syncer {
        rate 500M;
    }
    on backuppc {
        device /dev/drbd1;
        disk /dev/sdc;
        address 192.168.0.1:7790;
        meta-disk internal;
    }
    on drbd {
        device /dev/drbd1;
        disk /dev/sda;
        address 192.168.2.16:7790;
        meta-disk internal;
    }
}

I was able to create the DRBD device as expected (see the first line of the
syslog below), and it gets in sync.
So I set up LVM and created filter rules so that LVM ignores the
underlying physical device:
/etc/lvm/lvm.conf [node1]:
filter = ["r|/dev/sdc|"];
/etc/lvm/lvm.conf [node2]:
filter = [ "r|/dev/sda|" ]

LVM ignores the backing disk as expected:
#> pvscan
PV /dev/sda2 VG cl lvm2 [15,00 GiB / 0 free]
Total: 1 [15,00 GiB] / in use: 1 [15,00 GiB] / in no VG: 0 [0 ]
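
As a rough illustration of how filter entries like the ones above are evaluated, here is a small shell sketch. It assumes LVM's documented rule that the first matching pattern decides and that a device matching no pattern is accepted. `lvm_filter_verdict` is a hypothetical helper, not an LVM tool, and it only approximates LVM's regex dialect with `grep -E`.

```shell
#!/bin/sh
# Hypothetical helper: emulate how an LVM filter list treats one device path.
# Each pattern is "a|regex|" (accept) or "r|regex|" (reject); the first match
# wins, and a path that matches nothing is accepted.
lvm_filter_verdict() {
    dev=$1
    shift
    for pat in "$@"; do
        action=${pat%%|*}          # leading "a" or "r"
        regex=${pat#?|}            # drop the "a|" / "r|" prefix
        regex=${regex%|}           # drop the trailing "|"
        if printf '%s\n' "$dev" | grep -Eq -- "$regex"; then
            [ "$action" = "r" ] && echo reject || echo accept
            return 0
        fi
    done
    echo accept
}

lvm_filter_verdict /dev/sdc   'r|/dev/sdc|'   # reject: the DRBD backing disk
lvm_filter_verdict /dev/drbd1 'r|/dev/sdc|'   # accept: no pattern matches
lvm_filter_verdict /dev/sda2  'r|/dev/sda|'   # reject: unanchored regex also hits partitions
```

The last call shows why an unanchored pattern such as `r|/dev/sda|` also hides partitions of that disk; anchoring the regex (for example `^/dev/sda$`) would narrow it to the whole-disk node.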

Now creating PV, VG, LV:
[***@backuppc etc]# pvcreate /dev/drbd1
Physical volume "/dev/drbd1" successfully created.
[***@backuppc etc]# vgcreate test /dev/drbd1
Volume group "test" successfully created
[***@backuppc etc]# lvcreate test -n test -L 3G
Volume group "test" has insufficient free space (767 extents): 768
required.
[***@backuppc etc]# lvcreate test -n test -L 2.9G
Rounding up size to full physical extent 2,90 GiB
Logical volume "test" created.
[***@backuppc etc]# vgdisplay -v test
--- Volume group ---
VG Name test
System ID
Format lvm2
Metadata Areas 1
Metadata Sequence No 2
VG Access read/write
VG Status resizable
MAX LV 0
Cur LV 1
Open LV 0
Max PV 0
Cur PV 1
Act PV 1
VG Size 3,00 GiB
PE Size 4,00 MiB
Total PE 767
Alloc PE / Size 743 / 2,90 GiB
Free PE / Size 24 / 96,00 MiB
VG UUID pUPkxh-oS0f-MEUY-yIeJ-3zPb-Fkg1-TW1fgh
--- Logical volume ---
LV Path /dev/test/test
LV Name test
VG Name test
LV UUID X0wpkL-niZ7-XT7u-zjT0-ETzC-hYbI-yyv13F
LV Write Access read/write
LV Creation host, time backuppc, 2017-01-07 10:57:29 +0100
LV Status available
# open 0
LV Size 2,90 GiB
Current LE 743
Segments 1
Allocation inherit
Read ahead sectors auto
- currently set to 8192
Block device 253:2
--- Physical volumes ---
PV Name /dev/drbd1
PV UUID 3tcvkG-Keqk-vplB-f9zY-1X34-ZxCI-eFYPio
PV Status allocatable
Total PE / Free PE 767 / 24
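
The failed `lvcreate -L 3G` above comes down to extent arithmetic. The following sketch only reproduces that calculation; `extents_needed` is a hypothetical helper, and the sizes are taken from the output above.

```shell
#!/bin/sh
# Hypothetical helper: number of physical extents needed for a size given in
# MiB, rounded up to whole extents.
extents_needed() {
    size_mib=$1
    pe_mib=$2
    echo $(( (size_mib + pe_mib - 1) / pe_mib ))
}

extents_needed 3072 4    # 3 GiB with 4 MiB extents: 768, one more than the 767 in the VG
```

The VG ends up one extent short of 3 GiB, presumably because DRBD's internal metadata and the LVM metadata area take up part of the disk. To use whatever the VG offers, `lvcreate -l 100%FREE -n test test` sizes the LV in extents instead of bytes.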

Creating filesystem (output translated from German):
[***@backuppc etc]# mkfs.ext4 /dev/test/test
mke2fs 1.42.9 (28-Dec-2013)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=0 blocks, Stripe width=0 blocks
190464 inodes, 760832 blocks
38041 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=780140544
24 block groups
32768 blocks per group, 32768 fragments per group
7936 inodes per group
Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912

Allocating group tables: done
Writing inode tables: done
Creating journal (16384 blocks): done
Writing superblocks and filesystem accounting information: done

Mounting it and starting to use it:
[***@backuppc etc]# mount /dev/test/test /mnt
[***@backuppc etc]# cd /mnt/
[***@backuppc mnt]# cd ..

I immediately get I/O errors in syslog (and NO, the physical disk is not
damaged; both are virtual machines (VMware ESXi 5.x) running on HW-RAID):

Jan 7 10:42:07 backuppc kernel: block drbd1: Resync done (total 166
sec; paused 0 sec; 18948 K/sec)
Jan 7 10:42:07 backuppc kernel: block drbd1: updated UUIDs
2C441CCF3B27BA41:0000000000000000:C9022D0F617A83BA:0000000000000004
Jan 7 10:42:07 backuppc kernel: block drbd1: conn( SyncSource ->
Connected ) pdsk( Inconsistent -> UpToDate )
Jan 7 10:58:44 backuppc kernel: EXT4-fs (dm-2): mounted filesystem with
ordered data mode. Opts: (null)
Jan 7 10:58:48 backuppc kernel: block drbd1: local WRITE IO error
sector 5296+3960 on sdc
Jan 7 10:58:48 backuppc kernel: block drbd1: disk( UpToDate -> Failed )
Jan 7 10:58:48 backuppc kernel: block drbd1: Local IO failed in
__req_mod. Detaching...
Jan 7 10:58:48 backuppc kernel: block drbd1: 0 KB (0 bits) marked
out-of-sync by on disk bit-map.
Jan 7 10:58:48 backuppc kernel: block drbd1: disk( Failed -> Diskless )
Jan 7 10:58:48 backuppc kernel: drbd drbd1: sock was shut down by peer
Jan 7 10:58:48 backuppc kernel: drbd drbd1: peer( Secondary -> Unknown
) conn( Connected -> BrokenPipe ) pdsk( UpToDate -> DUnknown )
Jan 7 10:58:48 backuppc kernel: drbd drbd1: short read (expected size 8)
Jan 7 10:58:48 backuppc kernel: drbd drbd1: meta connection shut down
by peer.
Jan 7 10:58:48 backuppc kernel: drbd drbd1: ack_receiver terminated
Jan 7 10:58:48 backuppc kernel: drbd drbd1: Terminating drbd_a_drbd1
Jan 7 10:58:48 backuppc kernel: block drbd1: helper command:
/sbin/drbdadm pri-on-incon-degr minor-1
Jan 7 10:58:48 backuppc kernel: block drbd1: helper command:
/sbin/drbdadm pri-on-incon-degr minor-1 exit code 0 (0x0)
Jan 7 10:58:48 backuppc kernel: block drbd1: Should have called
drbd_al_complete_io(, 5296, 2027520), but my Disk seems to have failed
Jan 7 10:58:48 backuppc kernel: drbd drbd1: Connection closed
Jan 7 10:58:48 backuppc kernel: drbd drbd1: conn( BrokenPipe ->
Unconnected )
Jan 7 10:58:48 backuppc kernel: drbd drbd1: receiver terminated
Jan 7 10:58:48 backuppc kernel: drbd drbd1: Restarting receiver thread
Jan 7 10:58:48 backuppc kernel: drbd drbd1: receiver (re)started
Jan 7 10:58:48 backuppc kernel: drbd drbd1: conn( Unconnected ->
WFConnection )
Jan 7 10:58:48 backuppc kernel: drbd drbd1: Not fencing peer, I'm not
even Consistent myself.
Jan 7 10:58:48 backuppc kernel: block drbd1: IO ERROR: neither local
nor remote data, sector 29096+3968
Jan 7 10:58:48 backuppc kernel: dm-2: WRITE SAME failed. Manually zeroing.
Jan 7 10:58:48 backuppc kernel: block drbd1: IO ERROR: neither local
nor remote data, sector 29096+256
Jan 7 10:58:48 backuppc kernel: block drbd1: IO ERROR: neither local
nor remote data, sector 29352+256
Jan 7 10:58:48 backuppc kernel: block drbd1: IO ERROR: neither local
nor remote data, sector 29608+256
Jan 7 10:58:48 backuppc kernel: block drbd1: IO ERROR: neither local
nor remote data, sector 29864+256
Jan 7 10:58:49 backuppc kernel: drbd drbd1: Handshake successful:
Agreed network protocol version 97
Jan 7 10:58:49 backuppc kernel: drbd drbd1: Feature flags enabled on
protocol level: 0x0 none.
Jan 7 10:58:49 backuppc kernel: drbd drbd1: conn( WFConnection ->
WFReportParams )
Jan 7 10:58:49 backuppc kernel: drbd drbd1: Starting ack_recv thread
(from drbd_r_drbd1 [22367])
Jan 7 10:58:49 backuppc kernel: block drbd1: receiver updated UUIDs to
effective data uuid: 2C441CCF3B27BA40
Jan 7 10:58:49 backuppc kernel: block drbd1: peer( Unknown -> Secondary
) conn( WFReportParams -> Connected ) pdsk( DUnknown -> UpToDate )

In the end my /proc/drbd looks like this:

version: 8.4.9-1 (api:1/proto:86-101)
GIT-hash: 9976da086367a2476503ef7f6b13d4567327a280 build by
***@Build64R7, 2016-12-04 01:08:48
1: cs:Connected ro:Primary/Secondary ds:Diskless/UpToDate A r-----
ns:3212879 nr:0 dw:67260 dr:3149797 al:27 bm:0 lo:0 pe:0 ua:0 ap:0
ep:1 wo:f oos:0
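
A small sketch for picking the local disk state out of this kind of output, for example to notice a detach from a script. It assumes the DRBD 8.4 `/proc/drbd` layout shown above; `drbd_local_disk_state` is a hypothetical helper.

```shell
#!/bin/sh
# Hypothetical helper: print the local disk state (the part of "ds:" before
# the slash) for a given minor number, reading /proc/drbd-style text on stdin.
drbd_local_disk_state() {
    minor=$1
    sed -n "s|^ *${minor}: .* ds:\([A-Za-z]*\)/.*|\1|p"
}

# In practice the input would be /proc/drbd; a sample line is used here.
printf ' 1: cs:Connected ro:Primary/Secondary ds:Diskless/UpToDate A r-----\n' |
    drbd_local_disk_state 1    # prints: Diskless
```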

pvscan is still fine:

[***@backuppc log]# pvscan
PV /dev/sda2 VG cl lvm2 [15,00 GiB / 0 free]
PV /dev/drbd1 VG test lvm2 [3,00 GiB / 96,00 MiB free]
Total: 2 [17,99 GiB] / in use: 2 [17,99 GiB] / in no VG: 0 [0 ]

So, does anyone have an idea what is going wrong here?

Greetings

Christian

emmanuel segura

2017-01-09 10:52:38 UTC

Permalink

Use the same OS version on both nodes.

--
.~.
/V\
// \\
/( )\
^`~'^

Tyler Hains

2017-01-09 13:19:35 UTC

Permalink

Post by k***@knebb.de
I cannot get LVM working on top of DRBD: I am getting I/O errors followed by a "diskless" state.

The process I have used to do this requires that you initialize the DRBD meta-data before you create the file system. Use "drbdadm create-md myresourcename" after your lvcreate command, and before the mkfs.ext4 command.

Tyler Hains
MySQL Consultant

Lars Ellenberg

2017-01-10 09:42:58 UTC

Permalink

For some reason, (some? not only?) VMWare virtual disks tend to pretend
to support "write same", even if they fail such requests later.

DRBD treats such failed WRITE-SAME the same way as any other backend
error, and by default detaches.

mkfs.ext4 by default uses "lazy_itable_init" and "lazy_journal_init",
which makes it complete faster, but delays initialization of some file system
meta data areas until first mount, where some kernel daemon will zero-out the
relevant areas in the background.

Older kernels (RHEL 6) and also older drbd (8.3) are not affected, because they
don't know about write-same.

Workarounds exist:

Don't use the "lazy" mkfs.
During normal operation, write-same is usually not used.
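
A minimal sketch of what a "non-lazy" invocation could look like. `lazy_itable_init` and `lazy_journal_init` are mke2fs extended options that can be switched off with `-E`. The helper name is hypothetical, and the function only prints the command instead of running it.

```shell
#!/bin/sh
# Print (do not run) an mkfs.ext4 command line with both lazy-init features
# off, so inode tables and journal are fully written at mkfs time rather than
# zeroed in the background after the first mount.
nonlazy_mkfs_cmd() {
    printf 'mkfs.ext4 -E lazy_itable_init=0,lazy_journal_init=0 %s\n' "$1"
}

nonlazy_mkfs_cmd /dev/test/test
```

This makes mkfs slower, but it avoids the background zero-out after the first mount.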

Or tell the system that the backend does not support write-same:
Check setting:
grep ^ /sys/block/*/device/scsi_disk/*/max_write_same_blocks
disable:
echo 0 | tee /sys/block/*/device/scsi_disk/*/max_write_same_blocks

You then need to re-attach DRBD (drbdadm down all; drbdadm up all)
to make it aware of this change.
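
The check-and-disable step above can be wrapped into one loop. This is only a sketch: `disable_write_same` is a hypothetical helper, and its optional root argument exists solely so the loop can be tried against a scratch directory before pointing it at the real /sys as root.

```shell
#!/bin/sh
# Sketch: report and zero every scsi_disk max_write_same_blocks knob below a
# sysfs root (default: /sys). Needs root privileges on a real system.
disable_write_same() {
    root=${1:-/sys}
    for f in "$root"/block/*/device/scsi_disk/*/max_write_same_blocks; do
        [ -e "$f" ] || continue            # the glob matched nothing
        printf '%s: %s -> 0\n' "$f" "$(cat "$f")"
        echo 0 > "$f"
    done
}
```

The setting is not persistent, so it would have to be reapplied after each reboot (for example from a boot script) before DRBD attaches its backing device.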

Fix:

Well, we need to somehow add some ugly heuristic to better detect
whether some backend really supports write-same. [*]

Or, more likely, add an option to tell DRBD to ignore any pretend-only
write-same support.

Thanks,

Lars

[*] No, it is not as easy as "just ignore any IO error if it was a write-same
request", because we try to "guarantee" that during normal operation, all
replicas are in sync (within the limits defined by the replication protocol).
If replicas fail in different ways, we can not do that (at least not without
going through some sort of "recovery" first).

--
: Lars Ellenberg
: LINBIT | Keeping the Digital World Running
: DRBD -- Heartbeat -- Corosync -- Pacemaker

DRBD® and LINBIT® are registered trademarks of LINBIT
__
please don't Cc me, but send to list -- I'm subscribed

k***@knebb.de

2017-01-11 17:23:08 UTC

Permalink

Hi Lars and all,

OK, this is beyond my knowledge, but I understand what the "write-same"
command does. But if the underlying physical disk offers the command and
reports an error when it is used, this should apply to mkfs.ext4 on the
device or partition as well, shouldn't it? DRBD detaches when an error is
reported, but why does Linux not report an error without DRBD? And why
does this only happen with LVM in between? It should be the same when
LVM is not used...

Post by Lars Ellenberg
Older kernels (RHEL 6) and also older drbd (8.3) are not affected, because they
don't know about write-same.

My primary host is running CentOS7 while the secondary is older
(CentOS6). I will try to create the ext4 on the secondary and then
switch to primary.

Post by Lars Ellenberg
grep ^ /sys/block/*/device/scsi_disk/*/max_write_same_blocks
echo 0 | tee /sys/block/*/device/scsi_disk/*/max_write_same_blocks

A "find /sys -name "*same*"" does not report any files named
"max_write_same_blocks", on neither of the two nodes. So I can neither
disable it nor verify whether it is enabled. I assume it is not, as the
file does not exist. So this might not be the reason.

Greetings

Christian

Lars Ellenberg

2017-01-12 17:00:53 UTC

Permalink

Post by k***@knebb.de
Hi Lars and all,

Post by Lars Ellenberg

In this case, it happens on first mount.
Also, it is not an "EIO", but an "EOPNOTSUPP".

What really happens is that the file system code calls
blkdev_issue_zeroout(),
which will try discard, if discard is available and discard zeroes data,
or, if discard (with discard zeroes data) is not available or returns
failure, tries write-same with ZERO_PAGE,
or, if write-same is not available or returns failure,
tries __blkdev_issue_zeroout() (which uses "normal" writes).
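
The fallback order described above can be sketched as a toy model. This is shell, not kernel code; `zeroout_path` and its yes/no arguments are hypothetical and only mirror the order of attempts.

```shell
#!/bin/sh
# Toy model of the zero-out fallback order described above; not kernel code.
# Arguments say whether each method is advertised and whether it then works:
#   $1 discard (zeroing) advertised   $2 discard succeeds
#   $3 write-same advertised          $4 write-same succeeds
zeroout_path() {
    if [ "$1" = yes ]; then
        echo "try discard"
        [ "$2" = yes ] && return 0
    fi
    if [ "$3" = yes ]; then
        echo "try write-same"
        [ "$4" = yes ] && return 0
    fi
    echo "fall back to normal writes"
}

# A backend that advertises write-same but then fails it:
zeroout_path no no yes no
```

In this model the last call prints the write-same attempt followed by the fallback to normal writes. With DRBD in the stack, the failed write-same attempt is the point where DRBD sees a backend error.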

At least in "current upstream", probably very similar in your
almost-3.10.something kernel.

DRBD sits in between, sees the failure return of write-same,
and handles it by detaching.

Post by k***@knebb.de
DRBD detaches when an error is
reported- but why does Linux not report an error without drbd? And why
does this only happen when using LVM in-between? Should be the same when
LVM is not used....

Yes. And it is, as far as I can tell.

Post by k***@knebb.de

Post by Lars Ellenberg
Older kernels (RHEL 6) and also older drbd (8.3) are not affected, because they
don't know about write-same.

My primary host is running CentOS7 while the secondary is older
(CentOS6). I will try to create the ext4 on the secondary and then
switch to primary.

Post by Lars Ellenberg
grep ^ /sys/block/*/device/scsi_disk/*/max_write_same_blocks
echo 0 | tee /sys/block/*/device/scsi_disk/*/max_write_same_blocks

A "find /sys -name "*same*"" does not report any files named

Double-check that, please.
All my CentOS 7 / RHEL 7 systems (and other distributions with a sufficiently
new kernel) have that.

there are both the read-only /sys/block/*/queue/write_same_max_bytes
and the write-able /sys/devices/*/*/*/host*/target*/*/scsi_disk/*/max_write_same_blocks
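
To compare what each layer of the stack advertises, the read-only queue attribute can be printed per device. A sketch, assuming the device names from the logs in this thread (sdc, drbd1, dm-2); `show_write_same` is a hypothetical helper, and the sysfs root is a parameter only so it can be exercised against a scratch tree.

```shell
#!/bin/sh
# Sketch: print the write_same_max_bytes value of each named block device.
# $1 is the sysfs root (normally /sys); remaining arguments are device names.
show_write_same() {
    root=$1
    shift
    for dev in "$@"; do
        f=$root/block/$dev/queue/write_same_max_bytes
        if [ -r "$f" ]; then
            printf '%s %s\n' "$dev" "$(cat "$f")"
        else
            printf '%s n/a\n' "$dev"
        fi
    done
}
```

On the CentOS 7 node this would be called as `show_write_same /sys sdc drbd1 dm-2`. A non-zero value means that layer advertises write-same to whatever sits above it.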

Post by k***@knebb.de
"max_write_same_blocks", on neither of the two nodes. So I can neither
disable it nor verify whether it is enabled. I assume it is not, as the
file does not exist. So this might not be the reason.

Show us lsblk -t and lsblk -D from the box that detaches
(the "7" one).

It may also be that a discard failed, in which case it could be
devicemapper pretending discard was supported, and the backend failing
that discard request. Or some combination there.

Your original logs show

Post by k***@knebb.de
Jan 7 10:58:44 backuppc kernel: EXT4-fs (dm-2): mounted filesystem with ordered data mode. Opts: (null)
Jan 7 10:58:48 backuppc kernel: block drbd1: local WRITE IO error sector 5296+3960 on sdc

The "+..." part is the length (number of sectors) of the request.
We don't allow "normal" requests of that size, so this is either a
discard or write-same.
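
As a worked example of reading the "start+length" notation, the sketch below splits the spec from the log line and converts the length to bytes, assuming 512-byte sectors. `sector_spec_bytes` is a hypothetical helper.

```shell
#!/bin/sh
# Split a DRBD "start+length" sector spec and convert the length to bytes,
# assuming 512-byte sectors.
sector_spec_bytes() {
    spec=$1
    start=${spec%%+*}
    len=${spec#*+}
    echo "start=$start sectors=$len bytes=$(( len * 512 ))"
}

sector_spec_bytes 5296+3960    # start=5296 sectors=3960 bytes=2027520
```

The result, 2027520 bytes (just under 2 MiB), matches the size in the "Should have called drbd_al_complete_io(, 5296, 2027520)" line of the original log.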

Post by k***@knebb.de
Jan 7 10:58:48 backuppc kernel: block drbd1: disk( UpToDate -> Failed )
Jan 7 10:58:48 backuppc kernel: block drbd1: IO ERROR: neither local nor remote data, sector 29096+3968
Jan 7 10:58:48 backuppc kernel: dm-2: WRITE SAME failed. Manually zeroing.

And here we see that at least some WRITE SAME was issued and returned failure.
Device mapper, which in your case sits above DRBD
and consumes that error, has its own fallback code for failed write-same,
which can no longer be serviced because DRBD has already detached.

So yes,
I'm pretty sure that I did not pull my "best guess" out of thin air only

;-)

Lars Ellenberg

2017-01-13 12:16:40 UTC

Permalink

Correcting myself, the presence of the warning message misled me.

The 3.10 kernel still has that warning message directly in
blkdev_issue_zeroout(), so that's not the device mapper fallback,
but simply the mechanism I described above, with additional "log that I
took the fallback because of failure".

Which means DISCARDS have not even been tried,
or we'd have a message about that as well.

k***@knebb.de

2017-01-14 06:13:36 UTC

Permalink

Hi all,

sorry to be so stubborn, but I still have no real explanation for the behaviour.

I did some tests in the meantime:

Created the DRBD device, set up an LV.

When using xfs instead of ext4 --> runs fine.
mkfs.ext4 on CentOS6, no matter on which host I mount it the first time
--> runs fine.
mkfs.ext4 on CentOS7, mounted on CentOS6 --> runs fine.
mkfs.ext4 on CentOS7, mounted on CentOS7 --> disk detached.

Then I skipped LVM in between.

mkfs.ext4 on CentOS7, mounted on CentOS7 --> runs fine (detached with LVM!)

If this is related to the lazy writes, it appears to me that LVM shows
different capabilities to mkfs than DRBD does.

Lars wrote:

What really happens is that the file system code calls
blkdev_issue_zeroout(),
which will try discard, if discard is available and discard zeroes data,
or, if discard (with discard zeroes data) is not available or returns
failure, tries write-same with ZERO_PAGE,
or, if write-same is not available or returns failure,
tries __blkdev_issue_zeroout() (which uses "normal" writes).

At least in "current upstream", probably very similar in your
almost-3.10.something kernel.

DRBD sits in between, sees the failure return of write-same,
and handles it by detaching.

So blkdev_issue_zeroout() is called, which tries the different possibilities.
DRBD sees the error on write-same (after discard failed or was not
available) and detaches. Sounds reasonable.

If I skip LVM, everything is fine. That means mkfs.ext4 either succeeds in
using discard, or uses "normal" writes without first trying discard and
write-same.

In the first case: why does it succeed with write-same (or discard?) when
there is no LVM in between?

In the second case: why does it not try the faster methods? Does
DRBD not offer these capabilities? If so, why does LVM offer them if the
underlying device does not?

Greetings

Christian
