Hi,
Thank you for your reply. I had already done all of that, but I didn't
understand everything, and I ran into several problems while doing the
fail-over and then the fail-back. I am writing this mail in the hope of
clarifying those things.
I will try to express myself clearly and give as many details as I can.
*The LAB :*
My LAB contains two single-host oVirt-HCI platforms, one acting as the *primary
site (the source)* and the other as the *disaster-recovery site (the target)*.
Each HCI site contains one data domain, which consists of a gluster volume
backed by a single brick. The volumes (source and target) have the same size,
and they were created during the HCI deployment.
*At the end of the deployment, I detached then deleted the gluster data
domain on the target site, but I didn't delete the target volume.*
My goal is to test the disaster recovery process (active-passive DR, to be
precise) on an HCI implementation, covering both the fail-over and the
fail-back entirely.
*Documentation*
I used the RHHI 1.7 guide
Maintaining_Red_Hat_Hyperconverged_Infrastructure_for_Virtualization-en-US,
then I started my implementation and prepared all the Ansible playbooks.
*The Test procedure:*
*Fail-over*
1 - Create a Windows10 VM on the source volume.
2 - Replicate to the DR site.
3 - Execute the fail-over procedure and test whether the VM is usable in the
target platform.
4 - Detach and Delete the data domain in the target platform without
touching the target volume
5 - Make changes to the Win10 VM on the source volume (creating files and
installing software)
6 - Replicate again to the DR site, then execute another fail-over and see
whether the modifications were synced.
*Fail-back*
1 - Make changes to the Win10 VM on the target volume (deleting files) *and
especially creating a snapshot*
2 - Detach and Delete the data domain in the source platform without
touching the source volume.
3 - Replicate to the source site.
4 - Execute the clean up playbook
5 - Execute the fail-over and check that the VM is usable in the source
platform and that the modifications were synced, especially the snapshot
*Things I need to confirm :*
1 - When creating the geo-replication from the primary site to the target
site, we reach the step "*Scheduling regular backups using
geo-replication*". From my understanding, it works like a cron job that
starts the geo-replication at a specific time of day. From my testing, the
geo-replication starts syncing at that precise time, and when its "*CRAWL
STATUS*" reaches "Changelog Crawl" it stops the synchronization; in other
words, it stops once the geo-replication has caught up to the check-point
(the scheduled time).
The smallest interval the configuration window allows is 24 hours, which
means that in the event of a disaster you can at best recover the data as it
was the day before. *Is this correct?*
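To make the arithmetic behind this question explicit, here is a minimal sketch in Python. It assumes that a write made just after one check-point only becomes safe on the target once the next scheduled run has finished syncing; the function name and the 2-hour sync duration are purely illustrative and are not taken from the documentation.

```python
from datetime import timedelta


def worst_case_rpo(schedule_interval: timedelta, sync_duration: timedelta) -> timedelta:
    """Upper bound on the age of data lost under a check-point schedule.

    Assumption: a write made right after check-point N is only protected
    once run N+1 has finished syncing up to check-point N+1, so a disaster
    just before that moment loses the whole interval plus the sync time.
    """
    return schedule_interval + sync_duration


if __name__ == "__main__":
    # Hypothetical figures: daily schedule, sync that takes two hours.
    rpo = worst_case_rpo(timedelta(hours=24), timedelta(hours=2))
    print(f"worst-case RPO: {rpo}")
```

Under this assumption, the 24-hour minimum in the configuration window is a lower bound on the achievable RPO, and the time the sync itself takes is added on top of it.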
*Problems encountered during the test:*
*Fail-over*
1 - When executing the fail-over the first time (ansible-playbook
dr-rhv-failover.yml --tags "fail_over"), the import of the target data
domain failed with the error : *An exception occurred during task
execution. To see the full traceback, use -vvv. The error was:
ovirtsdk4.Error: Fault reason is "Operation Failed". Fault detail is
"[Error in creating a Storage Domain. The selected storage path is not
empty (probably contains another Storage Domain). Either remove the
existing Storage Domain from this path, or change the Storage path).]".
HTTP response code is 400.* I tried to import the domain manually from
oVirt's admin console and got the same error, so I did the following:
- I deleted the target volume and the brick and the sub-directory of the
brick.
- I recreated the volume from scratch.
- I redid the geo-replication synchronization from the source.
- I executed the fail-over and this time the target data domain was
imported correctly and the Win10 VM was started correctly.
2 - I detached then deleted the target data domain without touching the
target volume, then I made changes to the Win10 VM on the source site,
created a new geo-replication schedule, and after the replication I
executed another fail-over.
- The Win10 VM started successfully and the changes made were synced.
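Regarding the "storage path is not empty" error in point 1, a small sketch like the following can help to see what actually sits at the top level of a volume before an import is attempted. It assumes the volume is mounted locally and that a file-based storage domain appears as a UUID-named directory containing dom_md/metadata; the function name is hypothetical, and this is not necessarily the check the engine itself performs.

```python
import re
import sys
from pathlib import Path

# A storage domain directory is named after the domain UUID (8-4-4-4-12 hex).
UUID_RE = re.compile(r"^[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}$", re.IGNORECASE)


def inspect_volume(mount_point):
    """Split top-level entries into likely storage domains and everything else."""
    domains, others = [], []
    for entry in sorted(Path(mount_point).iterdir()):
        is_domain = (
            entry.is_dir()
            and UUID_RE.match(entry.name) is not None
            and (entry / "dom_md" / "metadata").is_file()
        )
        (domains if is_domain else others).append(entry.name)
    return domains, others


if __name__ == "__main__":
    # The mount point is passed on the command line; defaults to the current directory.
    mount = sys.argv[1] if len(sys.argv) > 1 else "."
    domains, others = inspect_volume(mount)
    print("likely storage domains:", domains)
    print("other top-level entries:", others)
```

If the volume shows entries but no directory with a metadata file, that would at least be consistent with the engine refusing the path as "not empty" without recognising a domain to import.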
*Fail-back*
1 - The documentation doesn't explain the fail-back procedure thoroughly.
It doesn't explain what dr-cleanup.yml does.
2 - When launching the fail-back playbook at some point I get this message :
*TASK [oVirt.disaster-recovery : Failback Replication Sync pause] ****
[oVirt.disaster-recovery : Failback Replication Sync pause]
[Failback Replication Sync] Please press ENTER once the destination storage
domains are ready to be used for the destination setup:*
What does this mean?
3 - I made some changes on the Win10 VM and created a snapshot of that VM.
4.a - To replicate the data from the target site to the primary site, I
created a new geo-replication from the target volume to the source volume.
I got a warning that the source volume was not empty, so I forced the
geo-replication creation, then:
- I detached and deleted the source data domain without touching the source
volume.
- I started the geo-replication manually (without a schedule) and when it
reached the state of "Changelog Crawl" I stopped it.
- I executed the clean-up playbook, then the fail-back playbook.
- The import of the source data domain failed with the
error : *An exception occurred during task execution. To see the full
traceback, use -vvv. The error was: ovirtsdk4.Error: Fault reason is
"Operation Failed". Fault detail is "[Error in creating a Storage Domain.
The selected storage path is not empty (probably contains another Storage
Domain). Either remove the existing Storage Domain from this path, or
change the Storage path).]". HTTP response code is 400.*
4.b - So I redid the test, but this time:
- I deleted the source volume and its brick, then I created them again.
- I started the geo-replication manually (without a schedule) and when it
reached the state of "Changelog Crawl" I stopped it.
- I executed the clean-up playbook, then the fail-back playbook.
- The import of the source data domain failed again with the
error : *An exception occurred during task execution. To see the full
traceback, use -vvv. The error was: ovirtsdk4.Error: Fault reason is
"Operation Failed". Fault detail is "[Error in creating a Storage Domain.
The selected storage path is not empty (probably contains another Storage
Domain). Either remove the existing Storage Domain from this path, or
change the Storage path).]". HTTP response code is 400.*
4.c - I redid the test once more, but this time:
- I deleted the source volume and its brick, then I created them again.
- I started the geo-replication using a schedule this time.
- I executed the clean-up playbook, then the fail-back playbook.
- *This time the source data domain was imported correctly and the Win10 VM
was started and the modifications were synced.*
- The snapshot was imported, but there was another snapshot with it called
"Win10-TMPDR".
Regards.
On Thu, 2 Apr 2020 at 08:42, Eyal Shenitzky <eshenitz(a)redhat.com> wrote:
If your intention is to use an active-passive disaster recovery solution,
you can have a look at the following guide:
https://ovirt.org/documentation/disaster-recovery-guide/active_passive_ov...
On Wed, 1 Apr 2020 at 16:42, wodel youchi <wodel.youchi(a)gmail.com> wrote:
> Hi,
>
> I am trying to configure and test disaster recovery on oVirt HCI,
>
> and to understand how it works:
> What is the minimum RPO, and how does it relate to the checkpoint?
> And what are the steps to fail back?
>
> Regards
>
> On Wed, 1 Apr 2020 at 14:16, Eyal Shenitzky <eshenitz(a)redhat.com> wrote:
>
>> Hi Wodel,
>>
>> Can you please explain what you are trying to do?
>> I am not sure I understand it from your question.
>>
>> On Wed, 1 Apr 2020 at 12:55, wodel youchi <wodel.youchi(a)gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> I redid the test and it seems that the minimum RPO is one day. If
>>> someone could confirm that, it would be great.
>>>
>>> As for the snapshot this time it was synced
>>>
>>> Then I tried to test the fail-back and I found that the documentation
>>> is not clear:
>>> - it is not clear what the purpose of the dr-clear playbook is
>>> - it is not clear what this means: put the target volume in read-write
>>> mode and the source volume in read-only mode
>>> - Do we have to sync back using a new geo-replication link from the DR
>>> volume to the source volume?
>>> I tried to do so. In my first trial I forced the creation of the reverse
>>> geo-replication without deleting the content of the source volume, then I
>>> started the replication manually (I didn't use the checkpoint) and
>>> stopped it once it reached the changelog state, but I couldn't import
>>> the source volume; I got the error: volume is not empty
>>>
>>> In my second trial I deleted and recreated the source volume from
>>> scratch, then I started the reverse replication manually; at the end I
>>> got the same error
>>>
>>> In my third trial I deleted the source volume and recreated it from
>>> scratch but I replicated back using the check point method and this time
>>> the fail back worked.
>>>
>>> Could someone shed some light on this?
>>>
>>> Thank you
>>> Regards.
>>>
>>> On Sun, 29 Mar 2020 at 19:19, wodel youchi <wodel.youchi(a)gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I need to understand some things about DR on oVirt-HCI
>>>>
>>>>
>>>> - What does "Scheduling regular backups using geo-replication" mean
>>>> (point 3.3.4 RHHI 1.7 Doc Maintaining RHHI)?
>>>> - Does this mean creating a check-point?
>>>> - If yes, does this mean that the geo-replication process will
>>>> sync data up to that check-point and then stop the synchronization,
>>>> then repeat the same cycle the day after? Does this mean that the
>>>> minimum RPO is one day?
>>>> - I created a snapshot of a VM on the source Manager, I synced the
>>>> volume, then I executed a DR. The VM was started on the Target Manager,
>>>> but the VM didn't have its snapshot. Any idea?
>>>>
>>>>
>>>> Regards, be safe.
>>>>
>>
>>
>> --
>> Regards,
>> Eyal Shenitzky
>>
>
--
Regards,
Eyal Shenitzky