Sorry for the late response. Please see my comments inline.

On Fri, 3 Apr 2020 at 02:30, wodel youchi <wodel.youchi@gmail.com> wrote:

Hi,
Thank you for your reply. I had already done all of that, but I didn't understand everything and ran into several problems while doing the fail-over and then the fail-back. I am writing this mail in the hope of clarifying those things.
I will try to express myself correctly and give as many details as I can.

The LAB :
My LAB contains two single-host oVirt-HCI platforms: one acts as the primary site (the source) and the second as the disaster-recovery site (the target).
Each HCI site contains one data domain; the domain consists of a gluster volume backed by one brick. The volumes (source and target) have the same size, and they were created during the HCI deployment.
At the end of the deployment, I detached then deleted the gluster data domain on the target site, but I didn't delete the target volume.

My goal is to test the disaster recovery process (active-passive DR, to be precise) on an HCI implementation, covering the entire fail-over and fail-back process.

Documentation
I followed the RHHI 1.7 document Maintaining_Red_Hat_Hyperconverged_Infrastructure_for_Virtualization-en-US and started my implementation.
I prepared all the ansible playbooks.

The Test procedure:
Fail-over
1 - Create a Windows10 VM on the source volume.
2 - Replicate to the DR site.
3 - Execute the fail-over procedure and test if the VM is usable in the target platform.
4 - Detach and Delete the data domain in the target platform without touching the target volume
5 - Make changes to the Win10 VM on the source volume (creating files and installing software)
6 - Replicate again to the DR site, then execute another fail-over and see if the modifications were synced.

Fail-back
1 - Make changes to the Win10 VM on the target volume (deleting files) and, in particular, create a snapshot
2 - Detach and Delete the data domain in the source platform without touching the source volume.
3 - Replicate to the source site.
4 - Execute the clean up playbook
5 - Execute the fail-over and check that the VM is usable in the source platform and that the modifications were synced, especially the snapshot

Things I need to confirm :
1 - When creating the geo-replication from the primary site to the target site, we get to the step "Scheduling regular backups using geo-replication". From my understanding, it's like a cron job that starts the geo-replication at a specific time of day. From my testing, the geo-replication starts syncing at that precise time, and when its "CRAWL STATUS" reaches "Changelog Crawl" it stops the synchronization; in other words, it stops once the geo-replication has caught up to the checkpoint (the specified time).
The smallest interval you can set in the configuration window is 24 hours, which means that in the event of a disaster you can at best recover the data from the day before. Is this correct?

If I understand correctly, you are talking about the storage replication that is performed on the storage layer.

This question should be referred to the Gluster team/community.
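
As a rough illustration of the RPO arithmetic being asked about, here is a minimal sketch. It assumes the behaviour described in the question (each scheduled run syncs up to its checkpoint and then stops); the function name and the one-hour sync duration are hypothetical, and the Gluster team would need to confirm the actual behaviour.

```python
from datetime import timedelta


def worst_case_rpo(schedule_interval: timedelta, sync_duration: timedelta) -> timedelta:
    """Upper bound on data loss under the assumed scheduling behaviour.

    Assumption: a run started at time T copies data up to the checkpoint T
    and finishes at T + sync_duration. If the source is lost just before
    the following run finishes, the newest complete copy on the target
    still reflects time T.
    """
    return schedule_interval + sync_duration


# Hypothetical numbers: daily schedule, one hour to complete a sync.
print(worst_case_rpo(timedelta(hours=24), timedelta(hours=1)))
```

With a 24-hour schedule the bound can never drop below one day, whatever the sync duration is.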

Problems encountered during the test:
Fail-over
1 - When executing the fail-over the first time (ansible-playbook dr-rhv-failover.yml --tags "fail_over"), the import of the target data domain failed with the error : An exception occurred during task execution. To see the full traceback, use -vvv. The error was: ovirtsdk4.Error: Fault reason is "Operation Failed". Fault detail is "[Error in creating a Storage Domain. The selected storage path is not empty (probably contains another Storage Domain). Either remove the existing Storage Domain from this path, or change the Storage path).]". HTTP response code is 400. I tried to import the domain manually from oVirt's admin console and I got the same error. So I did the following:
- I deleted the target volume and the brick and the sub-directory of the brick.
- I recreated the volume from scratch.
- I redid the geo-replication synchronization from the source.
- I executed the fail-over and this time the target data domain was imported correctly and the Win10 VM was started correctly.

First, I suggest you use the Python scripts that help you automate the DR process. You can find them under:

../your-dr-folder/files -> please use './ovirt-dr -h' to see the available options.

According to the error, it seems you may not have waited for the sanlock lease to expire. You must wait around 80 seconds before you try to use it.
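
The waiting step can be sketched as a small helper that is called before retrying the import. This is only an illustration: the 80-second figure is the approximate value mentioned above, and the function names are hypothetical, not part of the oVirt DR scripts.

```python
import time

# Approximate wait mentioned above; not an exact sanlock constant.
SANLOCK_LEASE_WAIT_SECONDS = 80.0


def seconds_until_safe(released_at: float, now: float,
                       wait: float = SANLOCK_LEASE_WAIT_SECONDS) -> float:
    """Seconds still to wait after the domain was detached or the engine stopped."""
    return max(0.0, wait - (now - released_at))


def wait_for_lease_expiry(released_at: float, sleep=time.sleep, clock=time.time) -> float:
    """Block until the assumed lease wait has elapsed; return the time slept."""
    remaining = seconds_until_safe(released_at, clock())
    if remaining > 0:
        sleep(remaining)
    return remaining
```

For example, record time.time() right after detaching the domain, then call wait_for_lease_expiry() with that value before starting the fail-over.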

2 - I detached then deleted the target data domain without touching the target volume, then I made changes to the Win10 VM on the source site, created a new geo-replication schedule, and after the replication I executed another fail-over.
- The Win10 VM started successfully and the changes made were synced.

Fail-back
1 - The documentation doesn't explain the fail-back procedure thoroughly. It doesn't explain what dr-cleanup.yml does.

It should remove all the entities from your original/source site so that there are no conflicts when you fail back to that environment.

2 - When launching the fail-back playbook at some point I get this message :
TASK [oVirt.disaster-recovery : Failback Replication Sync pause] ****************************************************************************************************************************
[oVirt.disaster-recovery : Failback Replication Sync pause]
[Failback Replication Sync] Please press ENTER once the destination storage domains are ready to be used for the destination setup:
What does this mean?

You must let sanlock release its leases by setting the domains to maintenance or shutting down the engine, then wait around 80 seconds. Once that is done, you can start the fail-back.

3 - I made some changes on the Win10 VM and created a snapshot of that VM.

4.a - To replicate the data from the target site to the primary site, I created a new geo-replication from the target volume to the source volume, but I got a warning that the source volume was not empty, so I forced the geo-replication creation. Then:
- I detached and deleted the source data domain without touching the source volume.
- I started the geo-replication manually (without a schedule) and when it reached the state of "Changelog Crawl" I stopped it.
- I executed the clean-up playbook, then I executed the fail-back playbook
- I got the error : the import of the source data domain failed with the error : An exception occurred during task execution. To see the full traceback, use -vvv. The error was: ovirtsdk4.Error: Fault reason is "Operation Failed". Fault detail is "[Error in creating a Storage Domain. The selected storage path is not empty (probably contains another Storage Domain). Either remove the existing Storage Domain from this path, or change the Storage path).]". HTTP response code is 400.

4.b - So I redid the test but,
- I deleted the source volume and its brick, then I created them again.
- I started the geo-replication manually (without a schedule) and when it reached the state of "changelog" I stopped it.
- I executed the clean-up playbook, then I executed the fail-back playbook
- I got the error : the import of the source data domain failed with the error : An exception occurred during task execution. To see the full traceback, use -vvv. The error was: ovirtsdk4.Error: Fault reason is "Operation Failed". Fault detail is "[Error in creating a Storage Domain. The selected storage path is not empty (probably contains another Storage Domain). Either remove the existing Storage Domain from this path, or change the Storage path).]". HTTP response code is 400.

4.c - I redid the test but :
- I deleted the source volume and its brick, then I created them again.
- I started the geo-replication using a schedule this time
- I executed the clean-up playbook, then I executed the fail-back playbook
- This time the source data domain was imported correctly and the Win10 VM was started and the modifications were synced.
- The snapshot was imported, but there was another snapshot with it called "Win10-TMPDR".
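
To see what is actually on a volume before retrying an import, one option is to list its top-level entries. The sketch below assumes the gluster volume is mounted locally and that a file-based storage domain shows up as a UUID-named directory containing dom_md/metadata; that layout and the function name are assumptions for illustration, not something taken from the DR playbooks.

```python
import os
import re

# Lower-case UUID, as assumed for storage domain directory names.
UUID_RE = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$"
)


def find_storage_domain_dirs(mount_path: str) -> list:
    """Return UUID-named top-level directories that contain dom_md/metadata."""
    found = []
    for name in sorted(os.listdir(mount_path)):
        metadata = os.path.join(mount_path, name, "dom_md", "metadata")
        if UUID_RE.match(name) and os.path.isfile(metadata):
            found.append(name)
    return found


# Usage (hypothetical mount point):
#   print(find_storage_domain_dirs("/mnt/source-volume"))
```

An empty result on a freshly recreated volume, and exactly one entry after a completed replication, would be consistent with the behaviour seen in 4.c.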

Regards.

On Thu, 2 Apr 2020 at 08:42, Eyal Shenitzky <eshenitz@redhat.com> wrote:
If your intention is to use an active-passive disaster recovery solution, you can have a look at the following guide:
https://ovirt.org/documentation/disaster-recovery-guide/active_passive_overview.html

On Wed, 1 Apr 2020 at 16:42, wodel youchi <wodel.youchi@gmail.com> wrote:
Hi,

I am trying to configure and test disaster recovery on oVirt HCI, and to understand how it works:
- What is the minimum RPO, and how does it relate to the checkpoint?
- What are the steps to fail back?

Regards

On Wed, 1 Apr 2020 at 14:16, Eyal Shenitzky <eshenitz@redhat.com> wrote:
Hi Wodel,

Can you please explain what you are trying to do?
I am not sure I understand it from your question.

On Wed, 1 Apr 2020 at 12:55, wodel youchi <wodel.youchi@gmail.com> wrote:
Hi,

I re-did the test and it seems that the minimum RPO is one day. If someone could confirm that, it would be great.

As for the snapshot, this time it was synced.

Then I tried to test the fail-back and I found that the documentation is not clear:
- it is not clear what the purpose of the dr-clear playbook is
- it is not clear what is meant by: put the target volume in read-write mode and the source volume in read-only mode
- Do we have to sync back using a new geo-replication link from the DR volume to the source volume?
I tried to do so. In my first trial, I forced the creation of the reverse geo-replication without deleting the content of the source volume, then I started the replication manually (I didn't use the checkpoint) and stopped it once it reached the changelog state. But I couldn't import the source volume; I got the error: volume is not empty

In my second trial, I deleted and recreated the source volume from scratch, then I started the reverse replication manually. In the end I got the same error.

In my third trial, I deleted the source volume and recreated it from scratch, but I replicated back using the checkpoint method, and this time the fail-back worked.

Could someone shed some light on this?

Thank you
Regards.

On Sun, 29 Mar 2020 at 19:19, wodel youchi <wodel.youchi@gmail.com> wrote:
Hi,

I need to understand some things about DR on oVirt-HCI.

What does "Scheduling regular backups using geo-replication" mean (point 3.3.4 of the RHHI 1.7 doc, Maintaining RHHI)?
Does this mean creating a checkpoint?
If yes, does this mean that the geo-replication process will sync data up to that checkpoint and then stop the synchronization, then repeat the same cycle the day after? Does this mean that the minimum RPO is one day?
I created a snapshot of a VM on the source Manager, synced the volume, then executed a DR. The VM was started on the Target Manager, but the VM didn't have its snapshot. Any idea?

Regards, be safe.

_______________________________________________
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-leave@ovirt.org
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
List Archives: https://lists.ovirt.org/archives/list/users@ovirt.org/message/N2MSZUYT2GE33IVUKGVYHLAO33ZFMJ7N/

--
Regards,
Eyal Shenitzky