[ovirt-users] Re: DR on hyperconverged deployment

Thursday, 2 April 2020

Hi,
Thank you for your reply. I have already done all of that, but I did not
understand everything and I ran into several problems during the fail-over
and then the fail-back. I am writing this mail in the hope of clarifying
those things.
I will try to express myself clearly and give as many details as I can.

*The LAB :*
My LAB contains two single-host oVirt-HCI platforms, one acting as the *primary
site (the source)* and the other as the *disaster-recovery site (the target)*.
Each HCI site contains one data domain, which sits on a gluster volume
backed by a single brick. The volumes (source and target) have the same
size, and they were created during the HCI deployment.
*At the end of the deployment, I detached then deleted the gluster data
domain on the target site, but I didn't delete the target volume.*

My goal is to test the disaster recovery process (active-passive DR to be
precise) on an HCI implementation, covering both the fail-over and the
fail-back from start to finish.

*Documentation*

I used the RHHI 1.7 guide
Maintaining_Red_Hat_Hyperconverged_Infrastructure_for_Virtualization-en-US
and started my implementation from it.

I prepared all the ansible playbooks.

*The Test procedure:*

*Fail-over*

1 - Create a Windows10 VM on the source volume.

2 - Replicate to the DR site.

3 - Execute the fail-over procedure and test whether the VM is usable in the
target platform.

4 - Detach and Delete the data domain in the target platform without
touching the target volume

5 - Make changes to the Win10 VM on the source volume (creating files and
installing software)

6 - Replicate again to the DR site, then execute another fail-over and check
whether the modifications were synced.

*Fail-back*

1 - Make changes to the Win10 VM on the target volume (deleting files) *and
especially creating a snapshot*

2 - Detach and Delete the data domain in the source platform without
touching the source volume.

3 - Replicate to the source site.

4 - Execute the clean up playbook

5 - Execute the fail-back and check that the VM is usable in the source
platform and that the modifications were synced, especially the snapshot

*Things I need to confirm :*

1 - When creating the geo-replication from the primary site to the target
site, we reach the step "*Scheduling regular backups using
geo-replication*". From my understanding it works like a cron job that
starts the geo-replication at a specific time of day. In my testing, the
geo-replication starts syncing at that precise time, and when its "*CRAWL
STATUS*" reaches "Changelog Crawl" it stops the synchronization. In other
words, it stops once the geo-replication has caught up to the check-point
(the scheduled time).

The smallest interval you can set in the configuration window is 24 hours,
which means that in the event of a disaster, the most recent data you can
recover is from the day before. *Is this correct?*
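
To make the arithmetic behind that question explicit, here is a minimal sketch of the worst-case data-loss window under a simplified model. It assumes that a scheduled run only protects data written before its check-point and that the target copy is usable only once that run has finished. The function name and the 2-hour sync duration are hypothetical, chosen only for illustration.

```python
from datetime import timedelta


def worst_case_rpo(schedule_interval: timedelta, sync_duration: timedelta) -> timedelta:
    """Upper bound on lost data under the simplified model stated above.

    Worst case: the disaster hits just before the current run finishes, so
    the usable copy on the target dates back to the previous check-point.
    """
    if schedule_interval <= timedelta(0) or sync_duration < timedelta(0):
        raise ValueError("interval must be positive and duration non-negative")
    return schedule_interval + sync_duration


# Daily schedule with a hypothetical 2-hour sync window.
rpo = worst_case_rpo(timedelta(hours=24), timedelta(hours=2))
print(rpo)
```

Under this model a 24-hour schedule gives a window slightly longer than one day, because the time the sync itself takes adds to the interval.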

*Problems encountered during the test:*

*Fail-over*

1 - When executing the fail-over the first time (ansible-playbook
dr-rhv-failover.yml --tags "fail_over"), the import of the target data
domain failed with the error : *An exception occurred during task
execution. To see the full traceback, use -vvv. The error was:
ovirtsdk4.Error: Fault reason is "Operation Failed". Fault detail is
"[Error in creating a Storage Domain. The selected storage path is not
empty (probably contains another Storage Domain). Either remove the
existing Storage Domain from this path, or change the Storage path).]".
HTTP response code is 400. *I tried to import the domain manually from
oVirt's admin console and got the same error, so I did the following:

- I deleted the target volume and the brick and the sub-directory of the
brick.

- I recreated the volume from scratch.

- I redid the geo-replication synchronization from the source.

- I executed the fail-over and this time the target data domain was
imported correctly and the Win10 VM was started correctly.

2 - I detached then deleted the target data domain without touching the
target volume, then I made changes to the Win10 VM on the source site, then
I created a new geo-replication schedule, and after the replication I
executed another fail-over.

- The Win10 VM started successfully and the changes made were synced.
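
When the "storage path is not empty" error appears, it may help to look at what is already on the volume before retrying the import. The sketch below lists top-level directories that look like a storage domain. It assumes the gluster volume is mounted locally and that a file-based storage domain is a top-level directory containing a dom_md/metadata file. The function name and the demo directory names are hypothetical, and this is not necessarily the check the engine performs.

```python
import os
import tempfile
from typing import List


def find_storage_domain_dirs(mount_point: str) -> List[str]:
    """Return top-level directories that look like a storage domain.

    Assumption: a storage domain directory contains dom_md/metadata.
    """
    found = []
    for entry in sorted(os.listdir(mount_point)):
        metadata = os.path.join(mount_point, entry, "dom_md", "metadata")
        if os.path.isfile(metadata):
            found.append(entry)
    return found


# Demonstration on a temporary directory standing in for the mounted volume.
with tempfile.TemporaryDirectory() as mount:
    os.makedirs(os.path.join(mount, "example-domain-id", "dom_md"))
    open(os.path.join(mount, "example-domain-id", "dom_md", "metadata"), "w").close()
    os.makedirs(os.path.join(mount, "unrelated-dir"))
    demo_result = find_storage_domain_dirs(mount)
    print(demo_result)
```

On a real system the argument would be the path where the target volume is mounted.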

*Fail-back*
1 - The documentation doesn't explain the fail-back procedure thoroughly.
In particular, it doesn't explain what dr-cleanup.yml does.

2 - When launching the fail-back playbook at some point I get this message :

*TASK [oVirt.disaster-recovery : Failback Replication Sync pause] ****
[oVirt.disaster-recovery : Failback Replication Sync pause]
[Failback Replication Sync] Please press ENTER once the destination storage
domains are ready to be used for the destination setup:*
What does this mean?

3 - I made some changes on the Win10 VM and created a snapshot of that VM.

4.a - To replicate the data from the target site to the primary site I
created a new geo-replication from the target volume to the source volume,
but I got a warning that the source volume was not empty, so I forced the
geo-replication creation. Then:
- I detached and deleted the source data domain without touching the source
volume.
- I started the geo-replication manually (without a schedule) and when it
reached the state of "Changelog Crawl" I stopped it.
- I executed the clean-up playbook, then I executed the fail-back playbook.
- The import of the source data domain failed with the
error : *An exception occurred during task execution. To see the full
traceback, use -vvv. The error was: ovirtsdk4.Error: Fault reason is
"Operation Failed". Fault detail is "[Error in creating a Storage Domain.
The selected storage path is not empty (probably contains another Storage
Domain). Either remove the existing Storage Domain from this path, or
change the Storage path).]". HTTP response code is 400.*

4.b - So I redid the test, but this time:
- I deleted the source volume and its brick, then I created them again.
- I started the geo-replication manually (without a schedule) and when it
reached the state of "Changelog Crawl" I stopped it.
- I executed the clean-up playbook, then I executed the fail-back playbook.
- The import of the source data domain failed again with the
error : *An exception occurred during task execution. To see the full
traceback, use -vvv. The error was: ovirtsdk4.Error: Fault reason is
"Operation Failed". Fault detail is "[Error in creating a Storage Domain.
The selected storage path is not empty (probably contains another Storage
Domain). Either remove the existing Storage Domain from this path, or
change the Storage path).]". HTTP response code is 400.*

4.c - I redid the test once more, but this time:
- I deleted the source volume and its brick, then I created them again.
- I started the geo-replication using a schedule this time.
- I executed the clean-up playbook, then I executed the fail-back playbook.
- *This time the source data domain was imported correctly and the Win10 VM
was started and the modifications were synced.*
- The snapshot was imported, but there was another snapshot with it called
"Win10-TMPDR".
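
In 4.a and 4.b the session was stopped by hand once the crawl status showed "Changelog Crawl", while 4.c relied on the schedule. For comparison, here is a minimal sketch of the gluster CLI calls for a manual sync driven by a check-point, where completion is read from the check-point fields of `status detail` rather than from the crawl status. The volume and host names are hypothetical, the helper only builds the command lines without running them, and whether this sequence avoids the import error has not been tested here.

```python
from typing import List


def checkpoint_sync_commands(master_vol: str, slave_host: str, slave_vol: str) -> List[List[str]]:
    """Build the gluster command lines for one check-point driven sync."""
    session = ["gluster", "volume", "geo-replication",
               master_vol, f"{slave_host}::{slave_vol}"]
    return [
        session + ["start"],                        # start the session
        session + ["config", "checkpoint", "now"],  # mark the current time
        session + ["status", "detail"],             # poll until the check-point completes
        session + ["stop"],                         # stop once it has completed
    ]


# Hypothetical names for the reverse direction (DR volume back to source).
cmds = checkpoint_sync_commands("dr_data", "source-host.example.com", "src_data")
for cmd in cmds:
    print(" ".join(cmd))
```

The commands could be passed to `subprocess.run` on a gluster node, with the status step repeated until the check-point is reported as completed.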

Regards.

On Thu, 2 Apr 2020 at 08:42, Eyal Shenitzky <eshenitz(a)redhat.com> wrote:

...
 If your intention is to use an active-passive disaster recovery solution,
 you can have a look at the following guide:

 https://ovirt.org/documentation/disaster-recovery-guide/active_passive_ov...

 On Wed, 1 Apr 2020 at 16:42, wodel youchi <wodel.youchi(a)gmail.com> wrote:

> Hi,
>
> I am trying to configure and test disaster recovery on ovirt HCI
>
> And to understand how it works
> What is the minimum RPO and its relationship with checkpoint
> And what are the steps to fail back
>
> Regards
>
> On Wed, 1 Apr 2020 at 14:16, Eyal Shenitzky <eshenitz(a)redhat.com> wrote:
>
>> Hi Wodel,
>>
>> Can you please explain what you are trying to do?
>> I am not sure I understand it from your question.
>>
>> On Wed, 1 Apr 2020 at 12:55, wodel youchi <wodel.youchi(a)gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> I re-did the test and it seems that the minimum RPO is one day. If
>>> someone could confirm that, it would be great.
>>>
>>> As for the snapshot, this time it was synced.
>>>
>>> Then I tried to test the fail-back and I found that the documentation
>>> is not clear:
>>> - it is not clear what the purpose of the dr-clear playbook is
>>> - it is not clear what "put the target volume in read-write mode and
>>> the source volume in read-only mode" means
>>> - Do we have to sync back using a new geo-replication link from the DR
>>> volume to the source volume?
>>> I tried to do so. In my first trial I forced the creation of the reverse
>>> geo-replication without deleting the content of the source volume, then I
>>> started the replication manually (I didn't use the checkpoint) and
>>> stopped it once it reached the changelog state, but I couldn't
>>> import the source volume; I got the error: volume is not empty.
>>>
>>> In my second trial I deleted and recreated the source volume from
>>> scratch, then I started the reverse replication manually; at the end I
>>> got the error.
>>>
>>> In my third trial I deleted the source volume and recreated it from
>>> scratch, but I replicated back using the checkpoint method, and this
>>> time the fail-back worked.
>>>
>>> Could someone shed some light on this?
>>>
>>> Thank you
>>> Regards.
>>>
>>> On Sun, 29 Mar 2020 at 19:19, wodel youchi <wodel.youchi(a)gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I need to understand some things about DR on oVirt-HCI
>>>>
>>>>
>>>>    - What does "Scheduling regular backups using geo-replication"
>>>>    (point 3.3.4 of the RHHI 1.7 doc Maintaining RHHI) mean?
>>>>       - Does this mean creating a check-point?
>>>>       - If yes, does this mean that the geo-replication process will
>>>>       sync data up to that check-point, then stop the synchronization,
>>>>       then repeat the same cycle the day after? Does this mean that the
>>>>       minimum RPO is one day?
>>>>    - I created a snapshot of a VM on the source Manager, synced the
>>>>    volume, then executed a DR. The VM was started on the Target Manager,
>>>>    but it didn't have its snapshot. Any idea?
>>>>
>>>>
>>>> Regards, be safe.
>>>>
>>> _______________________________________________
>>> Users mailing list -- users(a)ovirt.org
>>> To unsubscribe send an email to users-leave(a)ovirt.org
>>> Privacy Statement: https://www.ovirt.org/privacy-policy.html
>>> oVirt Code of Conduct:
>>> https://www.ovirt.org/community/about/community-guidelines/
>>> List Archives:
>>> https://lists.ovirt.org/archives/list/users@ovirt.org/message/N2MSZUYT2GE...
>>>
>>
>>
>> --
>> Regards,
>> Eyal Shenitzky
>>
>

 --
 Regards,
 Eyal Shenitzky
