Restoring Hosted-Engine from a stale backup

Hi all,
One of our engines has had a DB failure* & it seems there was an unnoticed problem in its backup routine, meaning the last backup I've got is a couple of weeks old. Luckily, VDSM has kept the underlying VMs running without any interruptions, so my objective is to get the HE back online & get the hosts & VMs back under its control with minimal downtime.
So, my questions are the following...
1. What problems can I expect to have with VMs added/modified since the last backup?
2. As it's only the DB that's been affected, can I skip redeploying the Engine & jump straight to restoring the DB & rerunning engine-setup?
3. The original docs I read didn't mention that it's best to leave a host in maintenance mode before running the engine backup, so my plan is to install a new temporary host on a separate server, re-add the old hosts & then once everything's back up, remove the temporary host. Are there any faults in this plan?
4. When it comes to deleting the old HE VM, the docs point to a paywalled guide on redhat.com...?
CentOS 7
oVirt 4.0.4
Gluster 3.8
* Apparently a write somehow cleared fsync, despite not actually having been written to disk?! No idea how that happened...
Many thanks,
-- Doug

Hey guys, just giving this a bump in the hope that someone might be able to advise...
To add a bit more info to 4), I'm referring to the following...
Note: If the Engine database is restored successfully, but the Engine virtual machine appears to be Down and cannot be migrated to another self-hosted engine host, you can enable a new Engine virtual machine and remove the dead Engine virtual machine from the environment by following the steps provided in https://access.redhat.com/solutions/1517683.
Source: http://www.ovirt.org/documentation/self-hosted/chap-Backing_up_and_Restoring...
Cheers, -- Doug

On Tue, Jan 24, 2017 at 1:49 PM, Doug Ingham <dougti@gmail.com> wrote:
1. What problems can I expect to have with VMs added/modified since the last backup?
Modified VMs will be reverted to their previous configuration; VMs added since the backup should show up as external VMs, which you can then import.
2. As it's only the DB that's been affected, can I skip redeploying the Engine & jump straight to restoring the DB & rerunning engine-setup?
Yes, if the engine VM is fine, you could just import the previous backup and run engine-setup again. Please set the global maintenance mode for hosted-engine since engine-backup and engine-setup are going to bring down the engine.
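The order of operations described in this answer can be sketched roughly as follows. This is only an illustration, not a tested procedure: the helper name `restore_plan` and the file names are hypothetical, and depending on the state of the database further steps or additional `engine-backup` options may be needed that are not shown here. The function only builds the command lines; it does not run anything.

```python
from typing import List


def restore_plan(backup_file: str, log_file: str) -> List[List[str]]:
    """Return, in order, the commands for an in-place restore.

    Hypothetical helper: it only builds argument lists and executes nothing.
    The hosted-engine commands are meant for a hosted-engine host,
    the engine-backup and engine-setup commands for the engine VM.
    """
    return [
        # Global maintenance keeps the HA agents from restarting the
        # engine VM while the engine itself is brought down.
        ["hosted-engine", "--set-maintenance", "--mode=global"],
        # Restore the (stale) backup; file and log paths are assumptions.
        ["engine-backup", "--mode=restore",
         f"--file={backup_file}", f"--log={log_file}"],
        # Re-run setup against the restored database.
        ["engine-setup"],
        # Leave global maintenance once the engine is back up.
        ["hosted-engine", "--set-maintenance", "--mode=none"],
    ]


if __name__ == "__main__":
    for argv in restore_plan("engine-backup.tar.bz2", "engine-restore.log"):
        print(" ".join(argv))
```

Keeping the plan as data makes it easy to review the sequence before typing anything on a production host.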
3. The original docs I read didn't mention that it's best to leave a host in maintenance mode before running the engine backup, so my plan is to install a new temporary host on a separate server, re-add the old hosts & then once everything's back up, remove the temporary host. Are there any faults in this plan?
4. When it comes to deleting the old HE VM, the docs point to a paywalled guide on redhat.com...?
To add a bit more info to 4), I'm referring to the following...
Note: If the Engine database is restored successfully, but the Engine virtual machine appears to be Down and cannot be migrated to another self-hosted engine host, you can enable a new Engine virtual machine and remove the dead Engine virtual machine from the environment by following the steps provided in https://access.redhat.com/solutions/1517683.
Source: http://www.ovirt.org/documentation/self-hosted/chap-Backing_up_and_Restoring_an_EL-Based_Self-Hosted_Environment/
If you are restoring the backup in place on the original engine VM, you don't have to. That step only matters if you are migrating to a new engine VM, in which case you have to remove the entry for the previous one so that the auto-import process triggers again.
_______________________________________________ Users mailing list Users@ovirt.org http://lists.ovirt.org/mailman/listinfo/users

Hi All, Simone,
On 24 January 2017 at 10:11, Simone Tiraboschi <stirabos@redhat.com> wrote:
1. What problems can I expect to have with VMs added/modified since the last backup?
Modified VMs will be reverted to their previous configuration; VMs added since the backup should show up as external VMs, which you can then import.
Given VDSM kept the VMs up whilst the HE's been down, how will the running VMs that were present before & after the backup be affected? Many of the VMs that were present during the last backup are now on different hosts, including the HE VM. Will that cause any issues?
2. As it's only the DB that's been affected, can I skip redeploying the Engine & jump straight to restoring the DB & rerunning engine-setup?
Yes, if the engine VM is fine, you could just import the previous backup and run engine-setup again. Please set the global maintenance mode for hosted-engine since engine-backup and engine-setup are going to bring down the engine.
As per above, do I still only need to import the previous backup even if all of the VMs (including the HE VM) are now on different hosts to when the backup was made?
And as for the future, is it going to be necessary to always keep an unused host in the cluster to allow for emergency restores? I'm a bit concerned that if we ever utilised all of our hosts for running VMs, then we'd be completely stuck if the HE ever imploded again.
Cheers, -- Doug

On Mon, Feb 6, 2017 at 1:52 PM, Doug Ingham <dougti@gmail.com> wrote:
Given VDSM kept the VMs up whilst the HE's been down, how will the running VMs that were present before & after the backup be affected?
Many of the VMs that were present during the last backup are now on different hosts, including the HE VM. Will that cause any issues?
For normal VMs I don't expect any issue: the engine will simply update the corresponding record once it finds them on the managed hosts. A serious issue could instead arise with HA VMs. If the engine first finds an HA VM running on a different host, it will simply update its record. The issue is if it first finds that the VM is no longer on its original host, since it will then try to restart it, causing a split brain and probably VM corruption. I opened a bug to track it: https://bugzilla.redhat.com/show_bug.cgi?id=1419649
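To see which VMs would fall into the risky case described here, one could compare the host recorded for each HA VM at backup time against the host it is running on now. The sketch below is purely illustrative: the `VmRecord` type, the `moved_ha_vms` helper and the sample data are hypothetical and do not correspond to any oVirt API; the inputs would have to be gathered by hand.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class VmRecord:
    name: str
    host: str  # host the VM was on when the backup was taken
    ha: bool   # whether the VM is flagged highly available


def moved_ha_vms(backup: List[VmRecord], current_host: Dict[str, str]) -> List[str]:
    """Return the names of HA VMs now running on a different host.

    VMs absent from current_host (not running anywhere) are ignored.
    """
    return sorted(
        vm.name
        for vm in backup
        if vm.ha
        and vm.name in current_host
        and current_host[vm.name] != vm.host
    )


if __name__ == "__main__":
    backup = [
        VmRecord("web01", "host1", ha=True),
        VmRecord("db01", "host2", ha=True),
        VmRecord("test01", "host1", ha=False),
    ]
    current = {"web01": "host3", "db01": "host2", "test01": "host2"}
    print(moved_ha_vms(backup, current))  # prints ['web01']
```

Any VM this returns is one where the restored database and reality disagree about the host, so it is worth checking by hand before the engine comes back up.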
As per above, do I still only need to import the previous backup even if all of the VMs (including the HE VM) are now on different hosts to when the backup was made?
Please take care of the HA VMs.
And as for the future, is it going to be necessary to always keep an unused host in the cluster to allow for emergency restores? I'm a bit concerned that if we ever utilised all of our hosts for running VMs, then we'd be completely stuck if the HE ever imploded again.
Honestly I don't see any special issue there.

On 6 February 2017 at 13:30, Simone Tiraboschi <stirabos@redhat.com> wrote:
Given VDSM kept the VMs up whilst the HE's been down, how will the running VMs that were present before & after the backup be affected?
Many of the VMs that were present during the last backup are now on different hosts, including the HE VM. Will that cause any issues?
For normal VMs I don't expect any issue: the engine will simply update the corresponding record once it finds them on the managed hosts. A serious issue could instead arise with HA VMs. If the engine first finds an HA VM running on a different host, it will simply update its record. The issue is if it first finds that the VM is no longer on its original host, since it will then try to restart it, causing a split brain and probably VM corruption. I opened a bug to track it: https://bugzilla.redhat.com/show_bug.cgi?id=1419649
Ouch. *All* of our VMs are HA by default.
So the simplest current solution would be to shut down the running VMs in VDSM before restoring the backup & running engine-setup?
Cheers, -- Doug

A serious issue could instead arise with HA VMs. If the engine first finds an HA VM running on a different host, it will simply update its record. The issue is if it first finds that the VM is no longer on its original host, since it will then try to restart it, causing a split brain and probably VM corruption.
Uh? Why should this be the case? I think we need Arik to confirm this, but the engine only does HA VM restarts during fencing. And fencing is only activated 5 minutes after the engine start to make sure we already have all host reports.
Martin
On Mon, Feb 6, 2017 at 5:45 PM, Doug Ingham <dougti@gmail.com> wrote:
Ouch. *All* of our VMs are HA by default.
So the simplest current solution would be to shut down the running VMs in VDSM before restoring the backup & running engine-setup?
participants (3): Doug Ingham, Martin Sivak, Simone Tiraboschi