Sorry for the late response; please see my comments inline.
On Fri, 3 Apr 2020 at 02:30, wodel youchi <wodel.youchi(a)gmail.com> wrote:
Hi,
Thank you for your reply. I had already done all of that, but I didn't
understand everything, and I ran into several problems while performing the
fail-over and then the fail-back. I am writing this mail in the hope of
clarifying those things.
I will try to express myself clearly and give as many details as I can.
*The LAB:*
My LAB contains two single-host oVirt-HCI platforms, one acting as the *primary
site (the source)*, the other as the *disaster-recovery site (the
target)*.
Each HCI site contains one data domain; the domain consists of a Gluster
volume backed by a single brick. The volumes (source and target) have the
same size and were created as part of the HCI deployment.
*At the end of the deployment, I detached and deleted the Gluster data
domain on the target site, but I didn't delete the target volume.*
My goal is to test the disaster-recovery process (active-passive DR, to be
precise) on an HCI implementation, i.e. to exercise the fail-over and
fail-back procedures end to end.
*Documentation:*
RHHI 1.7,
Maintaining_Red_Hat_Hyperconverged_Infrastructure_for_Virtualization-en-US.
Following it, I started my implementation and prepared all the Ansible
playbooks.
*The Test procedure:*
*Fail-over*
1 - Create a Windows 10 VM on the source volume.
2 - Replicate it to the DR site.
3 - Execute the fail-over procedure and test whether the VM is usable on
the target platform.
4 - Detach and delete the data domain on the target platform without
touching the target volume.
5 - Make changes to the Win10 VM on the source volume (create files and
install software).
6 - Replicate again to the DR site, then execute another fail-over and
check whether the modifications were synced.
*Fail-back*
1 - Make changes to the Win10 VM on the target volume (delete files) *and,
especially, create a snapshot*.
2 - Detach and delete the data domain on the source platform without
touching the source volume.
3 - Replicate to the source site.
4 - Execute the clean-up playbook.
5 - Execute the fail-back and check that the VM is usable on the source
platform and that the modifications were synced, especially the snapshot.
(The playbook invocations I used are sketched below.)
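For reference, a sketch of the playbook invocations behind these two
procedures, as I ran them (the playbook names are the ones generated for my
environment by the oVirt disaster-recovery role; the tags follow the DR
guide and may differ in your setup):

    # fail-over: run against the DR (target) site
    ansible-playbook dr-rhv-failover.yml --tags "fail_over"

    # fail-back: clean the original site first, then fail back to it
    ansible-playbook dr-cleanup.yml      --tags "clean_engine"
    ansible-playbook dr-rhv-failback.yml --tags "fail_back"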
*Things I need to confirm:*
1 - When creating the geo-replication from the primary site to the target
site, we get to a point where we have to set up "*Scheduling regular
backups using geo-replication*". From my understanding it is like a cron
job that starts the geo-replication at a specific time (or time of day).
From my testing, the geo-replication starts syncing at that precise time,
and once its "*CRAWL STATUS*" reaches "Changelog Crawl" it stops the
synchronization; in other words, it stops when the geo-replication catches
up to the checkpoint (the scheduled time).
The smallest interval you can select in the configuration window is 24
hours, which means that in the event of a disaster you can at best recover
the data from the day before. *Is this correct?*
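For context, what the schedule appears to do on the Gluster side is roughly
the following (a sketch; "datavol" and "drhost" stand in for my lab's
volume and target host):

    # set a checkpoint at the current time on the geo-replication session
    gluster volume geo-replication datavol drhost::datavol config checkpoint now

    # start the session; it syncs until the checkpoint is reached
    gluster volume geo-replication datavol drhost::datavol start

    # watch the CRAWL STATUS and the checkpoint completion
    gluster volume geo-replication datavol drhost::datavol status detail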
If I understand correctly, you are talking about the storage replication
that is performed at the storage layer.
This question should be referred to the Gluster team/community.
*Problems encountered during the test:*
*Fail-over*
1 - When executing the fail-over the first time (ansible-playbook
dr-rhv-failover.yml --tags "fail_over"), the import of the target data
domain failed with the error: *An exception occurred during task
execution. To see the full traceback, use -vvv. The error was:
ovirtsdk4.Error: Fault reason is "Operation Failed". Fault detail is
"[Error in creating a Storage Domain. The selected storage path is not
empty (probably contains another Storage Domain). Either remove the
existing Storage Domain from this path, or change the Storage path).]".
HTTP response code is 400.* I then tried to import the domain manually from
oVirt's admin console and got the same error, so I did the following:
- I deleted the target volume, its brick, and the brick's sub-directory.
- I recreated the volume from scratch (roughly the commands sketched
below).
- I redid the geo-replication synchronization from the source.
- I executed the fail-over, and this time the target data domain was
imported correctly and the Win10 VM started correctly.
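For the record, the delete/recreate was roughly the following on the target
host (a sketch; the volume name, host name, and brick path stand in for my
lab's, and "group virt" re-applies the options the HCI deployment sets):

    gluster volume stop datavol
    gluster volume delete datavol

    # remove and recreate the brick directory
    rm -rf /gluster_bricks/datavol/datavol
    mkdir -p /gluster_bricks/datavol/datavol

    # recreate the single-brick volume and re-apply the virt defaults
    gluster volume create datavol drhost:/gluster_bricks/datavol/datavol
    gluster volume set datavol group virt
    gluster volume start datavol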
First, I suggest you use the Python scripts that help you automate the DR
process. You can find them under ../your-dr-folder/files; please run
'./ovirt-dr -h' to see the available options.
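For example (subcommand names as I recall them from the role's README;
'./ovirt-dr -h' is the authoritative reference):

    cd ../your-dr-folder/files
    ./ovirt-dr generate   # generate the storage-mapping var file
    ./ovirt-dr validate   # validate the mapping file
    ./ovirt-dr failover   # run the fail-over to the secondary site
    ./ovirt-dr failback   # run the fail-back to the primary site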
According to the error, it seems you may not have waited for the sanlock
lease to expire; you must wait around 80 seconds before trying to use the
domain again.
2 - I detached and then deleted the target data domain without touching
the target volume, then I made changes to the Win10 VM on the source site,
then I created a new geo-replication schedule, and after the replication I
executed another fail-over.
- The Win10 VM started successfully and the changes were synced.
*Fail-back*
1 - The documentation doesn't explain the fail-back procedure thoroughly.
It doesn't explain what dr-cleanup.yml does.
It should remove all the entities from your original/source site so there
will be no conflicts when you fail back to that environment.
2 - When launching the fail-back playbook, at some point I get this
message:
*TASK [oVirt.disaster-recovery : Failback Replication Sync pause]*
*[Failback Replication Sync] Please press ENTER once the destination
storage domains are ready to be used for the destination setup:*
What does this mean?
You must let sanlock release its leases by putting the domains into
maintenance or shutting down the engine, then wait around 80 seconds; once
the leases are free you can start the fail-back.
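One way to verify this on the host is the standard sanlock CLI (the ~80
seconds corresponds to the default lease expiry):

    # list the leases sanlock currently holds on this host
    sanlock client status

    # after putting the domains into maintenance (or stopping the engine),
    # give the leases time to expire before failing back
    sleep 80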
3 - I made some changes to the Win10 VM and created a snapshot of that VM.
4.a - To replicate the data from the target site back to the primary site,
I created a new geo-replication session from the target volume to the
source volume, but I got a warning that the source volume was not empty,
so I forced the geo-replication creation (see the sketch after the error
below). Then:
- I detached and deleted the source data domain without touching the
source volume.
- I started the geo-replication manually (without a schedule) and stopped
it when it reached the "Changelog Crawl" state.
- I executed the clean-up playbook, then I executed the fail-back
playbook.
- The import of the source data domain failed with the error: *An
exception occurred during task execution. To see the full traceback, use
-vvv. The error was: ovirtsdk4.Error: Fault reason is "Operation Failed".
Fault detail is "[Error in creating a Storage Domain. The selected storage
path is not empty (probably contains another Storage Domain). Either
remove the existing Storage Domain from this path, or change the Storage
path).]". HTTP response code is 400.*
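For clarity, the reverse geo-replication in this trial was created roughly
as follows on the DR host (a sketch; "datavol" and "srchost" are
placeholders for my lab's names, "force" is what I used to override the
non-empty-slave warning, and it assumes the geo-replication ssh keys are
already distributed):

    # the DR volume becomes the master, the original source volume the slave
    gluster volume geo-replication datavol srchost::datavol create push-pem force
    gluster volume geo-replication datavol srchost::datavol start

    # watch the crawl status; stop once it reaches "Changelog Crawl"
    gluster volume geo-replication datavol srchost::datavol status detail
    gluster volume geo-replication datavol srchost::datavol stop

    # trial 4.c below: set a checkpoint before starting, as the scheduled
    # backups do
    gluster volume geo-replication datavol srchost::datavol config checkpoint now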
4.b - So I redid the test, but:
- I deleted the source volume and its brick, then I created them again.
- I started the geo-replication manually (without a schedule) and stopped
it when it reached the "Changelog Crawl" state.
- I executed the clean-up playbook, then I executed the fail-back playbook.
- I got the same error: the import of the source data domain failed with:
*An exception occurred during task execution. To see the full traceback,
use -vvv. The error was: ovirtsdk4.Error: Fault reason is "Operation
Failed". Fault detail is "[Error in creating a Storage Domain. The
selected storage path is not empty (probably contains another Storage
Domain). Either remove the existing Storage Domain from this path, or
change the Storage path).]". HTTP response code is 400.*
4.c - I redid the test, but:
- I deleted the source volume and its brick, then I created them again.
- I started the geo-replication using a schedule (checkpoint) this time,
as sketched above.
- I executed the clean-up playbook, then I executed the fail-back playbook.
- *This time the source data domain was imported correctly, the Win10 VM
started, and the modifications were synced.*
- The snapshot was imported, but there was another snapshot with it,
called "Win10-TMPDR".
Regards.
On Thu, 2 Apr 2020 at 08:42, Eyal Shenitzky <eshenitz(a)redhat.com>
wrote:
> If your intention is to use an active-passive disaster recovery
> solution, you can have a look at the following guide:
>
>
https://ovirt.org/documentation/disaster-recovery-guide/active_passive_ov...
>
> On Wed, 1 Apr 2020 at 16:42, wodel youchi <wodel.youchi(a)gmail.com> wrote:
>
>> Hi,
>>
>> I am trying to configure and test disaster recovery on oVirt HCI,
>> and to understand how it works:
>> what is the minimum RPO and its relationship with the checkpoint,
>> and what are the steps to fail back.
>>
>> Regards
>>
>> On Wed, 1 Apr 2020 at 14:16, Eyal Shenitzky <eshenitz(a)redhat.com>
>> wrote:
>>
>>> Hi Wodel,
>>>
>>> Can you please explain what you are trying to do?
>>> I am not sure I understand it from your question.
>>>
>>> On Wed, 1 Apr 2020 at 12:55, wodel youchi <wodel.youchi(a)gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I re-did the test, and it seems that the minimum RPO is one day; if
>>>> someone could confirm that, it would be great.
>>>>
>>>> As for the snapshot, this time it was synced.
>>>>
>>>> Then I tried to test the fail-back and found that the documentation
>>>> is not clear:
>>>> - it is not clear what the purpose of the dr-cleanup playbook is
>>>> - it is not clear what "put the target volume in read-write mode and
>>>> the source volume in read-only mode" means
>>>> - do we have to sync back using a new geo-replication link from the
>>>> DR volume to the source volume?
>>>> I tried to do so. In my first trial I forced the creation of the
>>>> reverse geo-replication without deleting the content of the source
>>>> volume, then I started the replication manually (I didn't use the
>>>> checkpoint) and stopped it once it reached the changelog state, but I
>>>> couldn't import the source volume; I got the error: volume is not
>>>> empty.
>>>>
>>>> In my second trial I deleted and recreated the source volume from
>>>> scratch, then I started the reverse replication manually; at the end
>>>> I got the same error.
>>>>
>>>> In my third trial I deleted the source volume and recreated it from
>>>> scratch, but I replicated back using the checkpoint method, and this
>>>> time the fail-back worked.
>>>>
>>>> Could someone shed some light on this?
>>>>
>>>> Thank you
>>>> Regards.
>>>>
>>>> On Sun, 29 Mar 2020 at 19:19, wodel youchi <wodel.youchi(a)gmail.com>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I need to understand some things about DR on oVirt HCI.
>>>>>
>>>>>
>>>>> - What does "Scheduling regular backups using geo-replication"
>>>>> mean (point 3.3.4, RHHI 1.7 doc, Maintaining RHHI)?
>>>>> - Does this mean creating a check-point?
>>>>> - If yes, does this mean that the geo-replication process will
>>>>> sync data up to that check-point, then stop the synchronization,
>>>>> then repeat the same cycle the day after? Does this mean that the
>>>>> minimum RPO is one day?
>>>>> - I created a snapshot of a VM on the source Manager, I synced
>>>>> the volume, then I executed a DR. The VM was started on the target
>>>>> Manager, but the VM didn't have its snapshot. Any idea?
>>>>>
>>>>>
>>>>> Regards, be safe.
>>>>>
>>>
>>>
>>> --
>>> Regards,
>>> Eyal Shenitzky
>>>
>>
>
> --
> Regards,
> Eyal Shenitzky
>
--
Regards,
Eyal Shenitzky