[vdsm] VM recovery now depends on HSM

Hi all,

As part of the new live merge feature, when vdsm starts and has to recover existing VMs, it calls VM._syncVolumeChain to ensure that vdsm's view of the volume chain matches libvirt's. This involves two kinds of operations: 1) sync the VM object, 2) sync the underlying storage metadata via HSM.

This means that HSM must be up (and the storage domain(s) that the VM is using must be accessible). When testing some rather eccentric error flows, I am finding that this is not always the case.

Is there a way to have VM recovery wait on HSM to come up? How should we respond if a required storage domain cannot be accessed? Is there a mechanism in vdsm to schedule an operation to be retried at a later time? Perhaps I could just schedule the sync and have it retried until the required resources are available.

Thanks for your insights.

-- Adam Litke
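As a rough illustration of the kind of deferred retry being asked about here, a minimal sketch (hypothetical helper names, not actual vdsm code) might look like this:

    import threading

    RETRY_INTERVAL = 30  # seconds; arbitrary value for the sketch

    def schedule_volume_chain_sync(vm_id, storage_ready, sync_volume_chain):
        """Keep retrying until storage is available, then sync once.

        storage_ready() and sync_volume_chain() are hypothetical stand-ins:
        the first would check that HSM is up and the VM's storage domains
        are accessible, the second would perform the actual chain sync.
        """
        def _attempt():
            if storage_ready(vm_id):
                sync_volume_chain(vm_id)
            else:
                # Required resources not available yet; try again later.
                timer = threading.Timer(RETRY_INTERVAL, _attempt)
                timer.daemon = True
                timer.start()

        _attempt()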

On Jul 8, 2014, at 22:36 , Adam Litke <alitke@redhat.com> wrote:
Hi all,
As part of the new live merge feature, when vdsm starts and has to recover existing VMs, it calls VM._syncVolumeChain to ensure that vdsm's view of the volume chain matches libvirt's. This involves two kinds of operations: 1) sync VM object, 2) sync underlying storage metadata via HSM.
This means that HSM must be up (and the storage domain(s) that the VM is using must be accessible). When testing some rather eccentric error flows, I am finding that this is not always the case.
Is there a way to have VM recovery wait on HSM to come up? How should we respond if a required storage domain cannot be accessed? Is there a mechanism in vdsm to schedule an operation to be retried at a later time? Perhaps I could just schedule the sync and it could be retried until the required resources are available.
I briefly discussed with Federico some time ago that IMHO syncVolumeChain needs to be changed. It must not be part of the VM create flow, as I expect it to be quite a bottleneck in a big-scale environment (it is now in fact executing not only on recovery but on all 4 create flows!). I don't know how yet, but we need to find a different way. Now you have just added yet another reason.

So… I too ask for more insights :-)

Thanks,
michal
Thanks for your insights.
-- Adam Litke

On 09/07/14 13:11 +0200, Michal Skrivanek wrote:
On Jul 8, 2014, at 22:36 , Adam Litke <alitke@redhat.com> wrote:
Hi all,
As part of the new live merge feature, when vdsm starts and has to recover existing VMs, it calls VM._syncVolumeChain to ensure that vdsm's view of the volume chain matches libvirt's. This involves two kinds of operations: 1) sync VM object, 2) sync underlying storage metadata via HSM.
This means that HSM must be up (and the storage domain(s) that the VM is using must be accessible). When testing some rather eccentric error flows, I am finding that this is not always the case.
Is there a way to have VM recovery wait on HSM to come up? How should we respond if a required storage domain cannot be accessed? Is there a mechanism in vdsm to schedule an operation to be retried at a later time? Perhaps I could just schedule the sync and it could be retried until the required resources are available.
I briefly discussed with Federico some time ago that IMHO syncVolumeChain needs to be changed. It must not be part of the VM create flow, as I expect it to be quite a bottleneck in a big-scale environment (it is now in fact executing not only on recovery but on all 4 create flows!). I don't know how yet, but we need to find a different way. Now you have just added yet another reason.
So…I too ask for more insights:-)
Sure, so... We switched to running syncVolumeChain at all times to cover a very rare scenario:

1. VM is running on host A
2. User initiates Live Merge on VM
3. Host A experiences a catastrophic hardware failure before engine can determine if the merge succeeded or failed
4. VM is restarted on Host B

Since (in this case) the host cannot know if a live merge was in progress on the previous host, it needs to always check.

Some ideas to mitigate:

1. When engine recreates a VM on a new host and a Live Merge was in progress, engine could call a verb to ask the host to synchronize the volume chain. This way, it only happens when engine knows it's needed and engine can be sure that the required resources (storage connections and domains) are present.

2. The syncVolumeChain call runs in the recovery case to ensure that we clean up after any missed block job events from libvirt while vdsm was stopped/restarting. In this case, the block job info is saved in the vm conf, so the recovery flow could be changed to query libvirt for block job status on only those disks where we know about a previous operation. For those found gone, we'd call syncVolumeChain. In this scenario, we still have to deal with the race with HSM initialization and storage connectivity issues. Perhaps engine should drive this case as well?

-- Adam Litke

----- Original Message -----
From: "Adam Litke" <alitke@redhat.com> To: "Michal Skrivanek" <michal.skrivanek@redhat.com> Cc: devel@ovirt.org Sent: Wednesday, July 9, 2014 4:19:09 PM Subject: Re: [ovirt-devel] [vdsm] VM recovery now depends on HSM
On 09/07/14 13:11 +0200, Michal Skrivanek wrote:
On Jul 8, 2014, at 22:36 , Adam Litke <alitke@redhat.com> wrote:
Hi all,
As part of the new live merge feature, when vdsm starts and has to recover existing VMs, it calls VM._syncVolumeChain to ensure that vdsm's view of the volume chain matches libvirt's. This involves two kinds of operations: 1) sync VM object, 2) sync underlying storage metadata via HSM.
This means that HSM must be up (and the storage domain(s) that the VM is using must be accessible). When testing some rather eccentric error flows, I am finding that this is not always the case.
Is there a way to have VM recovery wait on HSM to come up? How should we respond if a required storage domain cannot be accessed? Is there a mechanism in vdsm to schedule an operation to be retried at a later time? Perhaps I could just schedule the sync and it could be retried until the required resources are available.
I briefly discussed with Federico some time ago that IMHO syncVolumeChain needs to be changed. It must not be part of the VM create flow, as I expect it to be quite a bottleneck in a big-scale environment (it is now in fact executing not only on recovery but on all 4 create flows!). I don't know how yet, but we need to find a different way. Now you have just added yet another reason.
So…I too ask for more insights:-)
Sure, so... We switched to running syncVolumeChain at all times to cover a very rare scenario:
1. VM is running on host A
2. User initiates Live Merge on VM
3. Host A experiences a catastrophic hardware failure before engine can determine if the merge succeeded or failed
4. VM is restarted on Host B
Since (in this case) the host cannot know if a live merge was in progress on the previous host, it needs to always check.
Some ideas to mitigate:

1. When engine recreates a VM on a new host and a Live Merge was in progress, engine could call a verb to ask the host to synchronize the volume chain. This way, it only happens when engine knows it's needed and engine can be sure that the required resources (storage connections and domains) are present.
This seems like the right approach.
2. The syncVolumeChain call runs in the recovery case to ensure that we clean up after any missed block job events from libvirt while vdsm was stopped/restarting.
We need this since vdsm recovers running VMs when it starts, before engine is connected. Actually, engine cannot talk with vdsm until it has finished the recovery process.
In this case, the block job info is saved in the vm conf so the recovery flow could be changed to query libvirt for block job status on only those disks where we know about a previous operation. For those found gone, we'd call syncVolumeChain. In this scenario, we still have to deal with the race with HSM initialization and storage connectivity issues. Perhaps engine should drive this case as well?
We don't have a race at this stage, because even if HSM is up, we do not connect to the storage domains until engine asks us to do so, and engine cannot talk to vdsm until the recovery process and HSM initialization have ended.

So we can check with libvirt and have correct info about the VM when vdsm starts, but we cannot fix the volume metadata at this stage.

I think we should fix the volume metadata when engine asks us to do so, based on the state of the live merge.

If we want to do this update without engine control, we can use the domain monitor state event to detect when the domain monitor becomes available and modify the volume metadata then. Currently we use this event to unpause VMs that were paused because of an EIO error. See clientIF.py:126.

Nir
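As a rough sketch of the engine-independent option described above, hooking a pending sync onto the domain monitor state event might look roughly like this (the callback signature and the VM helper names are assumptions, loosely modelled on the EIO-unpause handling mentioned, not the real clientIF code):

    class VolumeChainFixer(object):
        def __init__(self):
            # vmId -> VM object that still needs its volume chain synced
            self._pending = {}

        def add_pending(self, vm):
            self._pending[vm.id] = vm

        def on_domain_state_change(self, sdUUID, valid):
            # Invoked when the domain monitor reports a state change.
            if not valid:
                return
            for vm_id, vm in list(self._pending.items()):
                if sdUUID in vm.storage_domains():   # assumed helper
                    vm.sync_volume_chain()           # fix vm conf + metadata
                    del self._pending[vm_id]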

On Jul 9, 2014, at 15:38 , Nir Soffer <nsoffer@redhat.com> wrote:
----- Original Message -----
From: "Adam Litke" <alitke@redhat.com> To: "Michal Skrivanek" <michal.skrivanek@redhat.com> Cc: devel@ovirt.org Sent: Wednesday, July 9, 2014 4:19:09 PM Subject: Re: [ovirt-devel] [vdsm] VM recovery now depends on HSM
On 09/07/14 13:11 +0200, Michal Skrivanek wrote:
On Jul 8, 2014, at 22:36 , Adam Litke <alitke@redhat.com> wrote:
Hi all,
As part of the new live merge feature, when vdsm starts and has to recover existing VMs, it calls VM._syncVolumeChain to ensure that vdsm's view of the volume chain matches libvirt's. This involves two kinds of operations: 1) sync VM object, 2) sync underlying storage metadata via HSM.
This means that HSM must be up (and the storage domain(s) that the VM is using must be accessible). When testing some rather eccentric error flows, I am finding that this is not always the case.
Is there a way to have VM recovery wait on HSM to come up? How should we respond if a required storage domain cannot be accessed? Is there a mechanism in vdsm to schedule an operation to be retried at a later time? Perhaps I could just schedule the sync and it could be retried until the required resources are available.
I briefly discussed with Federico some time ago that IMHO syncVolumeChain needs to be changed. It must not be part of the VM create flow, as I expect it to be quite a bottleneck in a big-scale environment (it is now in fact executing not only on recovery but on all 4 create flows!). I don't know how yet, but we need to find a different way. Now you have just added yet another reason.
So…I too ask for more insights:-)
Sure, so... We switched to running syncVolumeChain at all times to cover a very rare scenario:
1. VM is running on host A
2. User initiates Live Merge on VM
3. Host A experiences a catastrophic hardware failure before engine can determine if the merge succeeded or failed
4. VM is restarted on Host B
Since (in this case) the host cannot know if a live merge was in progress on the previous host, it needs to always check.
Some ideas to mitigate:

1. When engine recreates a VM on a new host and a Live Merge was in progress, engine could call a verb to ask the host to synchronize the volume chain. This way, it only happens when engine knows it's needed and engine can be sure that the required resources (storage connections and domains) are present.
This seems like the right approach.
+1. I like the "only when needed" approach, since indeed we can assume the scenario is unlikely to happen most of the time (though it is very real indeed).
2. The syncVolumeChain call runs in the recovery case to ensure that we clean up after any missed block job events from libvirt while vdsm was stopped/restarting.
Can we clean up later on? Does it need to happen during recovery, or can it be delayed and requested by engine a little bit later?
We need this since vdsm recovers running VMs when it starts, before engine is connected. Actually, engine cannot talk with vdsm until it has finished the recovery process.
In this case, the block job info is saved in the vm conf so the recovery flow could be changed to query libvirt for block job status on only those disks where we know about a previous operation. For those found gone, we'd call syncVolumeChain. In this scenario, we still have to deal with the race with HSM initialization and storage connectivity issues. Perhaps engine should drive this case as well?
We don't have a race at this stage, because even if HSM is up, we do not connect to the storage domains until engine asks us to do so, and engine cannot talk to vdsm until the recovery process and HSM initialization have ended.
So we can check with libvirt and have correct info about the vm when vdsm starts, but we cannot fix volume metadata at this stage.
I think we should fix the volume metadata when engine asks us to do so, based on the state of the live merge.
If we want to do this update without engine control, we can use the domain monitor state event to detect when domain monitor becomes available, and modify the volume metadata.
Currently we use this event to unpause VMs that were paused because of an EIO error. See clientIF.py:126.
Nir

On 10/07/14 08:40 +0200, Michal Skrivanek wrote:
On Jul 9, 2014, at 15:38 , Nir Soffer <nsoffer@redhat.com> wrote:
----- Original Message -----
From: "Adam Litke" <alitke@redhat.com> To: "Michal Skrivanek" <michal.skrivanek@redhat.com> Cc: devel@ovirt.org Sent: Wednesday, July 9, 2014 4:19:09 PM Subject: Re: [ovirt-devel] [vdsm] VM recovery now depends on HSM
On 09/07/14 13:11 +0200, Michal Skrivanek wrote:
On Jul 8, 2014, at 22:36 , Adam Litke <alitke@redhat.com> wrote:
Hi all,
As part of the new live merge feature, when vdsm starts and has to recover existing VMs, it calls VM._syncVolumeChain to ensure that vdsm's view of the volume chain matches libvirt's. This involves two kinds of operations: 1) sync VM object, 2) sync underlying storage metadata via HSM.
This means that HSM must be up (and the storage domain(s) that the VM is using must be accessible). When testing some rather eccentric error flows, I am finding that this is not always the case.
Is there a way to have VM recovery wait on HSM to come up? How should we respond if a required storage domain cannot be accessed? Is there a mechanism in vdsm to schedule an operation to be retried at a later time? Perhaps I could just schedule the sync and it could be retried until the required resources are available.
I briefly discussed with Federico some time ago that IMHO syncVolumeChain needs to be changed. It must not be part of the VM create flow, as I expect it to be quite a bottleneck in a big-scale environment (it is now in fact executing not only on recovery but on all 4 create flows!). I don't know how yet, but we need to find a different way. Now you have just added yet another reason.
So…I too ask for more insights:-)
Sure, so... We switched to running syncVolumeChain at all times to cover a very rare scenario:
1. VM is running on host A
2. User initiates Live Merge on VM
3. Host A experiences a catastrophic hardware failure before engine can determine if the merge succeeded or failed
4. VM is restarted on Host B
Since (in this case) the host cannot know if a live merge was in progress on the previous host, it needs to always check.
Some ideas to mitigate:

1. When engine recreates a VM on a new host and a Live Merge was in progress, engine could call a verb to ask the host to synchronize the volume chain. This way, it only happens when engine knows it's needed and engine can be sure that the required resources (storage connections and domains) are present.
This seems like the right approach.
+1. I like the "only when needed" approach, since indeed we can assume the scenario is unlikely to happen most of the time (though it is very real indeed).
Ok. I will need to expose a synchronizeDisks virt verb for this. It will be called by engine whenever a VM moves between hosts prior to a block job being resolved.

##
# @VM.synchronizeDisks:
#
# Tell vdsm to synchronize disk metadata with the live VM state
#
# @vmID: The UUID of the VM
#
# Since: 4.16.0
##
{'command': {'class': 'VM', 'name': 'synchronizeDisks'},
 'data': {'vmID': 'UUID'}}

Greg, you can call this after VmStats from the new host indicates that the block job is indeed not there anymore. You want to call it before you fetch the VM definition to check the volume chain. This way you can be sure that the new host has refreshed the config in case it was out of sync.

Federico, I am thinking about how to handle the case where someone would try a cold merge here instead of starting the VM. I guess they cannot, because engine will have the disk locked. Maybe that is good enough for now.
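On the vdsm side, the verb could presumably just delegate to the existing sync. A rough sketch, assuming the usual API.py conventions (APIBase, errCode, doneCode, self._cif, self._UUID) and a no-argument _syncVolumeChain call, none of which is confirmed here:

    class VM(APIBase):
        def synchronizeDisks(self):
            vm = self._cif.vmContainer.get(self._UUID)
            if vm is None:
                return errCode['noVM']
            # Re-read the chain from libvirt and push the result down to
            # the vm conf and the storage metadata (via HSM).
            vm._syncVolumeChain()
            return {'status': doneCode}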
2. The syncVolumeChain call runs in the recovery case to ensure that we clean up after any missed block job events from libvirt while vdsm was stopped/restarting.
Can we clean up later on? Does it need to happen during recovery, or can it be delayed and requested by engine a little bit later?
This question is where I could use some help from the experts :) How serious is a temporary metadata inconsistency? Here is the scenario in question:

1. Live merge starts for a VM on a host
2. vdsm crashes
3. qemu completes the live merge operation and rewrites the qcow chain
4. libvirt emits an event (missed by vdsm, which is not running)
5. vdsm starts and recovers the VM

At this point, the vm conf has an outdated view of the disk. In the case of an active layer merge, the volumeID of the disk will have changed and at least one volume is removed from the chain. For an internal volume merge, one or more volumes can simply be missing from the chain. In addition, the metadata on the storage side is outdated.

As long as engine submits no operations which depend on an accurate picture of the volume chain until it has called synchronizeDisks(), we should be okay. Does vdsm initiate any operations on its own that would be sensitive to this synchronization issue (i.e. disk stats)?
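For reference, the recovery-time check proposed earlier (look only at disks with a recorded block job and see whether libvirt still reports it) might look roughly like this; the conf['_blockJobs'] layout keyed by drive name is an assumption, and dom is the libvirt domain handle for the recovered VM:

    def find_finished_block_jobs(dom, conf):
        """Return drive names whose recorded block job is gone from libvirt."""
        finished = []
        for drive_name in conf.get('_blockJobs', {}):
            info = dom.blockJobInfo(drive_name, 0)
            if not info:
                # libvirt no longer reports a job: it finished (or aborted)
                # while vdsm was down, so this disk needs a chain sync later.
                finished.append(drive_name)
        return finished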
We need this since vdsm recovers running VMs when it starts, before engine is connected. Actually, engine cannot talk with vdsm until it has finished the recovery process.
In this case, the block job info is saved in the vm conf so the recovery flow could be changed to query libvirt for block job status on only those disks where we know about a previous operation. For those found gone, we'd call syncVolumeChain. In this scenario, we still have to deal with the race with HSM initialization and storage connectivity issues. Perhaps engine should drive this case as well?
We don't have a race at this stage, because even if HSM is up, we do not connect to the storage domains until engine asks us to do so, and engine cannot talk to vdsm until the recovery process and HSM initialization have ended.
So we can check with libvirt and have correct info about the vm when vdsm starts, but we cannot fix volume metadata at this stage.
I think we should fix the volume metadata when engine asks us to do so, based on the state of the live merge.
If we want to do this update without engine control, we can use the domain monitor state event to detect when domain monitor becomes available, and modify the volume metadata.
Currently we use this event to unpause VMs that were paused because of an EIO error. See clientIF.py:126.
Nir
-- Adam Litke

Sorry, adding Greg...

-- Adam Litke

----- Original Message -----
From: "Michal Skrivanek" <michal.skrivanek@redhat.com> To: "Adam Litke" <alitke@redhat.com> Cc: devel@ovirt.org, "Nir Soffer" <nsoffer@redhat.com>, "Federico Simoncelli" <fsimonce@redhat.com> Sent: Thursday, July 10, 2014 8:40:58 AM Subject: Re: [ovirt-devel] [vdsm] VM recovery now depends on HSM
On Jul 9, 2014, at 15:38 , Nir Soffer <nsoffer@redhat.com> wrote:
----- Original Message -----
From: "Adam Litke" <alitke@redhat.com> To: "Michal Skrivanek" <michal.skrivanek@redhat.com> Cc: devel@ovirt.org Sent: Wednesday, July 9, 2014 4:19:09 PM Subject: Re: [ovirt-devel] [vdsm] VM recovery now depends on HSM
On 09/07/14 13:11 +0200, Michal Skrivanek wrote:
On Jul 8, 2014, at 22:36 , Adam Litke <alitke@redhat.com> wrote:
Hi all,
As part of the new live merge feature, when vdsm starts and has to recover existing VMs, it calls VM._syncVolumeChain to ensure that vdsm's view of the volume chain matches libvirt's. This involves two kinds of operations: 1) sync VM object, 2) sync underlying storage metadata via HSM.
This means that HSM must be up (and the storage domain(s) that the VM is using must be accessible). When testing some rather eccentric error flows, I am finding that this is not always the case.
Is there a way to have VM recovery wait on HSM to come up? How should we respond if a required storage domain cannot be accessed? Is there a mechanism in vdsm to schedule an operation to be retried at a later time? Perhaps I could just schedule the sync and it could be retried until the required resources are available.
I briefly discussed with Federico some time ago that IMHO syncVolumeChain needs to be changed. It must not be part of the VM create flow, as I expect it to be quite a bottleneck in a big-scale environment (it is now in fact executing not only on recovery but on all 4 create flows!). I don't know how yet, but we need to find a different way. Now you have just added yet another reason.
So…I too ask for more insights:-)
Sure, so... We switched to running syncVolumeChain at all times to cover a very rare scenario:
1. VM is running on host A
2. User initiates Live Merge on VM
3. Host A experiences a catastrophic hardware failure before engine can determine if the merge succeeded or failed
4. VM is restarted on Host B
Since (in this case) the host cannot know if a live merge was in progress on the previous host, it needs to always check.
Some ideas to mitigate:

1. When engine recreates a VM on a new host and a Live Merge was in progress, engine could call a verb to ask the host to synchronize the volume chain. This way, it only happens when engine knows it's needed and engine can be sure that the required resources (storage connections and domains) are present.
This seems like the right approach.
+1. I like the "only when needed" approach, since indeed we can assume the scenario is unlikely to happen most of the time (though it is very real indeed).
I agree on the assumptions, but I disagree on the implementation (a new API). The verb that should be called to trigger the synchronization is the very same live merge command that was called on host A. The operation will either resume (fixing the inconsistency) or just finish right away successfully, as there is nothing to be done.

Let's keep in mind that fixing this discrepancy also needs to be added to the list of things to verify when we import a data storage domain, in order to sanitize the domain. (That list doesn't exist anywhere yet, I know, but it should, because this is not the only thing that requires it.)

-- Federico
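A rough sketch of the idempotent behaviour described here, with all lookup helpers assumed rather than taken from the actual merge code:

    def merge(vm, drive, base_vol, top_vol):
        # Re-issuing the same merge verb on the new host: if the volume we
        # were merging away is already gone from the chain, the merge
        # completed on the previous host and we only need to resynchronize.
        chain = vm.query_volume_chain(drive)        # assumed helper
        if top_vol not in chain:
            vm.sync_volume_chain(drive)             # fix vm conf + metadata
            return {'status': 'Done'}
        # Otherwise start (or effectively resume) the block commit as usual.
        return vm.start_block_commit(drive, base_vol, top_vol)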