Discussion:
[linux-lvm] Snapshots on clustered LVM
Bram Klein Gunnewiek
2015-08-25 10:09:03 UTC
Currently we are using LVM as backing storage for our DRBD disks in HA
set-ups. We run QEMU instances on our nodes, using (local) DRBD targets
for storage. This enables us to do live migrations between the DRBD
primary/secondary nodes.

We want to support iSCSI targets in our HA environment. We are trying
to see whether we can use (c)LVM for that by creating a volume group on
our iSCSI block devices and using that volume group on all nodes to
create logical volumes. This seems to work fine as long as we handle
locking properly and make sure we only activate the logical volumes on
one node at a time. As long as a volume is active on only one node,
snapshots also seem to work fine.
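
Roughly, the kind of commands involved would look like this (the device,
VG and LV names are hypothetical, and this assumes clvmd is already
running on all nodes):

    pvcreate /dev/sdb                       # iSCSI-backed block device
    vgcreate --clustered y vg_iscsi /dev/sdb
    lvcreate -L 20G -n vm_disk vg_iscsi
    lvchange -aey vg_iscsi/vm_disk          # activate exclusively on this node only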

However, we run into problems when we want to perform a live migration
of a running QEMU instance. In order to do a live migration we have to
start a second, identical QEMU instance on the node we want to migrate
to and then start a QEMU live migration. For that we have to make the
logical volume active on the target node, otherwise we can't start the
QEMU instance. During the live migration QEMU ensures that data is only
written on one node (i.e. during the migration data is written on the
source node; QEMU then pauses the instance for a short while to copy the
last data and continues the instance on the target node).
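
Sketched out, the migration sequence looks something like this (guest and
node names are hypothetical; the libvirt invocation is just one way to
trigger the migration):

    # on the target node: activate the LV so QEMU can open it
    lvchange -ay vg_iscsi/vm_disk

    # start the live migration, e.g. via libvirt
    virsh migrate --live --persistent vm01 qemu+ssh://target-node/system

    # once the guest runs on the target, deactivate on the source node
    lvchange -an vg_iscsi/vm_disk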

This use case works fine with a clustered LVM set-up except for
snapshots. Changes are not saved in the snapshot when the logical volume
is active on both nodes (as expected if the manual is correct:
https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/5/html-single/Logical_Volume_Manager_Administration/#snapshot_volumes).


If we are correct, this means we can use LVM as a clustered "file
system" but can't trust our snapshots to be 100% reliable if a volume
group has been made active on more than one node, e.g. when doing a live
migration of a QEMU instance between two nodes our snapshots become
unreliable.

Are these conclusions correct? Is there a solution for this problem, or
is this simply a known limitation of clustered LVM without a work-around?
--
Met vriendelijke groet / Kind regards,
Bram Klein Gunnewiek | Shock Media B.V.

Tel: +31 (0)546 - 714360
Fax: +31 (0)546 - 714361
Web: https://www.shockmedia.nl/
Zdenek Kabelac
2015-08-26 10:59:40 UTC
Post by Bram Klein Gunnewiek
Are these conclusions correct? Is there a solution for this problem, or is this
simply a known limitation of clustered LVM without a work-around?
Yes - snapshots are supported ONLY for exclusively activated volumes (meaning
the LV with the snapshot is active on only a single node in the cluster).

There is no dm target which would support clustered usage of snapshots.
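
As an illustration (the VG/LV names here are hypothetical), the supported
pattern is roughly:

    lvchange -an  vg_iscsi/vm_disk     # deactivate on every other node first
    lvchange -aey vg_iscsi/vm_disk     # exclusive activation on a single node
    lvcreate -s -L 5G -n vm_disk_snap vg_iscsi/vm_disk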

Zdenek
Bram Klein Gunnewiek
2015-08-26 12:22:09 UTC
Post by Zdenek Kabelac
Yes - snapshots are supported ONLY for exclusively activated volumes
(meaning the LV with the snapshot is active on only a single node in the cluster).
There is no dm target which would support clustered usage of snapshots.
Zdenek
Thanks for the confirmation. It's a pity we can't get this done with
LVM ... we will try to find an alternative.

Out of curiosity, how does a node know the volume is opened on another
node? In our test set-up we don't use CLVM or anything (we are just
testing), so there is no communication between the nodes. Is this done
through metadata in the volume group / logical volume?
Zdenek Kabelac
2015-08-26 12:44:13 UTC
Post by Bram Klein Gunnewiek
Thanks for the confirmation. It's a pity we can't get this done with LVM ...
we will try to find an alternative.
Out of curiosity, how does a node know the volume is opened on another node?
In our test set-up we don't use CLVM or anything (we are just testing), so
there is no communication between the nodes. Is this done through metadata in
the volume group / logical volume?
I've no idea what you are using then - I'm talking only about the lvm2
solution, which is at the moment based on clvmd usage (there is now also
integrated support for another locking manager - sanlock).

If you are using some other locking mechanism, it's then purely up to you to
maintain the integrity of the whole system - i.e. to ensure there are no
concurrent metadata writes from different nodes and to control where and how
the LVs are activated.
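
For reference, the locking manager is selected in lvm.conf; a minimal excerpt
(the exact options available depend on the lvm2 version in use):

    # /etc/lvm/lvm.conf
    global {
        locking_type = 3      # clustered locking through clvmd
        # use_lvmlockd = 1    # alternatively, lvmlockd (sanlock or dlm based)
    }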

Also there are already existing solutions for what you describe, but I assume
you prefer your own home-brewed solution - but it's a long journey ahead of you...

Zdenek
David Teigland
2015-08-26 14:17:35 UTC
Post by Zdenek Kabelac
Also there are already existing solutions for what you describe, but
I assume you prefer your own home-brewed solution - but it's a long
journey ahead of you...
RHEV/ovirt is an existing solution that uses lvm on multiple hosts and
does live migration. They have quite a bit of very specialized lvm code
to do that right -- not typical lvm usage at all.
Digimer
2015-08-26 16:23:27 UTC
Post by Bram Klein Gunnewiek
Out of curiosity, how does a node know the volume is opened on another
node? In our test set-up we don't use CLVM or anything (we are just
testing), so there is no communication between the nodes. Is this done
through metadata in the volume group / logical volume?
Clustered LVM uses DLM. You can see which nodes are using a given lock
space with 'dlm_tool ls'. When a node joins or leaves, it joins or
leaves whatever lock spaces its resources are using.

A node doesn't have to be actively using a resource, but if it's in the
cluster, it needs to coordinate with the other nodes, even if just to
say "I ACK the changes" or "I'm not using the resource" when locks are
coordinated.
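
A couple of commands show this on a given node (the VG name is hypothetical):

    dlm_tool ls                        # lockspaces this node has joined
    lvs -o lv_name,lv_attr vg_iscsi    # 5th lv_attr character shows activation state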
--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?
Digimer
2015-08-26 16:35:39 UTC
Post by Bram Klein Gunnewiek
Currently we are using LVM as backing storage for our DRBD disks in HA
set-ups. We run QEMU instances on our nodes, using (local) DRBD targets
for storage. This enables us to do live migrations between the DRBD
primary/secondary nodes.
We want to support iSCSI targets in our HA environment. We are trying
to see whether we can use (c)LVM for that by creating a volume group on
our iSCSI block devices and using that volume group on all nodes to
create logical volumes. This seems to work fine as long as we handle
locking properly and make sure we only activate the logical volumes on
one node at a time. As long as a volume is active on only one node,
snapshots also seem to work fine.
DRBD, like an iSCSI LUN, is just another block device to LVM, so I see
no reason why clvmd won't work just fine. The main advantage is that you
can scale iSCSI to 3+ nodes, but you lose the replication of your data
unless you have a very nice SAN.

Once the LV is visible on all nodes though, it's up to you to make sure
it's used by applications/filesystems that understand clustering. I use
clustered LVs to back GFS2 and to back VMs (an LV dedicated to each VM,
with the cluster resource manager ensuring that a VM runs on only one
node at a time).
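
For example, with Pacemaker the VM resource could look something like this
(the resource name and config path are hypothetical):

    pcs resource create vm01 ocf:heartbeat:VirtualDomain \
        config=/etc/libvirt/qemu/vm01.xml \
        hypervisor="qemu:///system" migration_transport=ssh \
        meta allow-migrate=true
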
Post by Bram Klein Gunnewiek
However, we run into problems when we want to perform a live migration
of a running QEMU instance. In order to do a live migration we have to
start a second, identical QEMU instance on the node we want to migrate
to and then start a QEMU live migration. For that we have to make the
logical volume active on the target node, otherwise we can't start the
QEMU instance. During the live migration QEMU ensures that data is only
written on one node (i.e. during the migration data is written on the
source node; QEMU then pauses the instance for a short while to copy the
last data and continues the instance on the target node).
If you're using clustered LVM, live migration will work just fine. This
is exactly what I do. The LV will need to be ACTIVE on both nodes though.
Post by Bram Klein Gunnewiek
This use case works fine with a clustered LVM set-up except for
snapshots. Changes are not saved in the snapshot when the logical volume
is active on both nodes (as expected if the manual is correct:
https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/5/html-single/Logical_Volume_Manager_Administration/#snapshot_volumes).
Note that your link is very old, for RHEL 5.

Snapshotting is a problem. As Zdenek said, you have to set the other
nodes to inactive and then set the current host node's LV to
'exclusive'. The catch I found, though, is that you can't mark it as
exclusive while it's already ACTIVE, and you can't make the LV inactive
while it's hosting a VM... So in practical terms, snapshotting clustered
LVs is not feasible.
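
In other words, getting to an exclusive activation would look roughly like
this (names hypothetical), and the deactivation is exactly what fails on the
node where the VM still has the LV open:

    lvchange -an  vg_iscsi/vm_disk     # must succeed everywhere, incl. the VM's host
    lvchange -aey vg_iscsi/vm_disk     # only then can one node re-activate exclusively
    lvcreate -s -L 5G -n vm_disk_snap vg_iscsi/vm_disk
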
Post by Bram Klein Gunnewiek
If we are correct, this means we can use LVM as a clustered "file system"
but can't trust our snapshots to be 100% reliable if a volume group has
been made active on more than one node, e.g. when doing a live migration
of a QEMU instance between two nodes our snapshots become unreliable.
You can never trust a snapshot 100%; it doesn't capture information in
the VM's memory. So at best, using a snapshot to recover is like
recovering from a sudden power loss. It's then up to your applications
and OS to recover, and many databases won't do that cleanly unless
they're carefully configured.

This is the core reason why our company won't support snapshots at all.
It gives people a false sense of having good backups.
Post by Bram Klein Gunnewiek
Are these conclusions correct? Is there a solution for this problem or
is this simply a known limitation of clustered lvm without a work-around?
Clustered LVs over a SAN-backed PV will work perfectly fine for live
migrations. Snapshots are not feasible though, and not recommended in
any case.
--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?