On Thu, Jul 25, 2019 at 6:24 PM Gianluca Cecchi <gianluca.cecchi@gmail.com> wrote:

On Thu, Jul 25, 2019 at 2:21 PM Eyal Shenitzky <eshenitz@redhat.com> wrote:
On Thu, Jul 25, 2019 at 3:02 PM Gianluca Cecchi <gianluca.cecchi@gmail.com> wrote:
On Thu, Jul 25, 2019 at 1:54 PM Eyal Shenitzky <eshenitz@redhat.com> wrote:

Please notice that a automation python scripts created in order to facilitate the DR process.
You can find them under - path/to/your/dr/folder/files.

You can use those scripts to generate the mapping, test the generated mapping and start the failover/failback.

I strongly recommend to use it.

Yes, I have used it to create the disaster_recovery_vars.yml mapping file and then populating it with the secondary site information, thanks.
My doubt was about any difference in playbook actions between "failover" (3.3) and "discreet failover test" (B.1), as the executed playbook and tags are the same.

No, the only difference is that you disable the storage replication by yourself, this way you can test the failover while the other "primary" site is still active.

First "discreet failover test" was a success!!! Great.
Storage domain attached, templates imported and the only VM defined at source correctly started (at source I configured link down for the VM, inherited at target, so no collisions).
Elapsed between beginning of ovirt connection, until first template import has been about 6 minutes.
...
Template TOL76 has been successfully imported from the given configuration. 7/25/19 3:26:58 PM
Storage Domain ovsd3910 was attached to Data Center SVIZ3-DR by admin@internal-authz 7/25/19 3:26:46 PM
Storage Domains were attached to Data Center SVIZ3-DR by admin@internal-authz 7/25/19 3:26:46 PM
Storage Domain ovsd3910 (Data Center SVIZ3-DR) was activated by admin@internal-authz 7/25/19 3:26:46 PM
...
Storage Pool Manager runs on Host ovh201. (Address: ovh201.), Data Center SVIZ3-DR. 7/25/19 3:26:36 PM
Data Center is being initialized, please wait for initialization to complete. 7/25/19 3:23:53 PM
Storage Domain ovsd3910 was added by admin@internal-authz 7/25/19 3:20:43 PM
Disk Profile ovsd3910 was successfully added (User: admin@internal-authz). 7/25/19 3:20:42 PM
User admin@internal-authz connecting from '10.4.192.43' using session 'xxx' logged in. 7/25/19 3:20:35 PM

Some notes:

1) iSCSI multipath
my storage domains are iSCSI based and my hosts have two network cards to reach the storage.
I'm using EQL that doesn't support bonding and has one portal that all initiators use.
So in my primary env I configured "iSCSI Multipathing" tab in Compute --> Datacenter --> Datacenter_Name window.
But this tab appears only when you activate the storage.
So during the ansible playbook run the iSCSI connection has been activated through the "default" iscsi interface
I can then:
- configure "iSCSI Multipathing"
- shutdown VM
- put host into maintenance
- remove the default iSCSI session that has not been removed on host
iscsiadm -m session -r 6 -u
- activate host
now I have:
[root@ov201 ~]# iscsiadm -m session
tcp: [10] 10.10.100.8:3260,1 iqn.2001-05.com.equallogic:4-771816-99d82fc59-5bdd77031e05beac-ovsd3910 (non-flash)
tcp: [9] 10.10.100.8:3260,1 iqn.2001-05.com.equallogic:4-771816-99d82fc59-5bdd77031e05beac-ovsd3910 (non-flash)
[root@ov201 ~]#
with
# multipath -l
364817197c52fd899acbe051e0377dd5b dm-29 EQLOGIC ,100E-00
size=1.0T features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=0 status=active
|- 23:0:0:0 sdb 8:16 active undef running
`- 24:0:0:0 sdc 8:32 active undef running
- start vm

The I do a cleanup:
1. Detach the storage domains from the secondary site.
2. Enable storage replication between the primary and secondary storage domains.

The storage domain remains as "unattached" in DR environment

Then I executed the test again and during connection I got this error about 40 seconds after run of playbook

TASK [oVirt.disaster-recovery : Import iSCSI storage domain] ***************************************
An exception occurred during task execution. To see the full traceback, use -vvv. The error was: Error: Fault reason is "Operation Failed". Fault detail is "[]". HTTP response code is 400.
failed: [localhost] (item=iqn.2001-05.com.equallogic:4-771816-99d82fc59-5bdd77031e05beac-ovsd3910) => {"ansible_loop_var": "dr_target", "changed": false, "dr_target": "iqn.2001-05.com.equallogic:4-771816-99d82fc59-5bdd77031e05beac-ovsd3910", "msg": "Fault reason is \"Operation Failed\". Fault detail is \"[]\". HTTP response code is 400."}

In webadmin gui of DR env I see:

VDSM ov201 command CleanStorageDomainMetaDataVDS failed: Cannot obtain lock: "id=56eadc97-5731-40cf-8409-aff58d8ffd11, rc=-243, out=Cannot acquire Lease(name='SDM', path='/dev/56eadc97-5731-40cf-8409-aff58d8ffd11/leases', offset=1048576), err=(-243, 'Sanlock resource not acquired', 'Lease is held by another host')" 7/25/19 4:50:43 PM
What could be the cause of this?

In vdsm.log:

2019-07-25 16:50:43,196+0200 INFO (jsonrpc/1) [vdsm.api] FINISH forcedDetachStorageDomain error=Cannot obtain lock: "id=56eadc97-5731-40cf-8409-aff58d8ffd11, rc=-243, out=Cannot acquire Lease(name='SDM', path='/dev/56eadc97-5731-40cf-8409-aff58d8ffd11/leases', offset=1048576), err=(-243, 'Sanlock resource not acquired', 'Lease is held by another host')" from=::ffff:10.4.192.79,49038, flow_id=4bd330d1, task_id=c0dfac81-5c58-427d-a7d0-e8c695448d27 (api:52)
2019-07-25 16:50:43,196+0200 ERROR (jsonrpc/1) [storage.TaskManager.Task] (Task='c0dfac81-5c58-427d-a7d0-e8c695448d27') Unexpected error (task:875)
Traceback (most recent call last):
File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 882, in _run
return fn(*args, **kargs)
File "<string>", line 2, in forcedDetachStorageDomain
File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 50, in method
ret = func(*args, **kwargs)
File "/usr/lib/python2.7/site-packages/vdsm/storage/hsm.py", line 856, in forcedDetachStorageDomain
self._detachStorageDomainFromOldPools(sdUUID)
File "/usr/lib/python2.7/site-packages/vdsm/storage/hsm.py", line 834, in _detachStorageDomainFromOldPools
dom.acquireClusterLock(host_id)
File "/usr/lib/python2.7/site-packages/vdsm/storage/sd.py", line 910, in acquireClusterLock
self._manifest.acquireDomainLock(hostID)
File "/usr/lib/python2.7/site-packages/vdsm/storage/sd.py", line 527, in acquireDomainLock
self._domainLock.acquire(hostID, self.getDomainLease())
File "/usr/lib/python2.7/site-packages/vdsm/storage/clusterlock.py", line 419, in acquire
"Cannot acquire %s" % (lease,), str(e))
AcquireLockFailure: Cannot obtain lock: "id=56eadc97-5731-40cf-8409-aff58d8ffd11, rc=-243, out=Cannot acquire Lease(name='SDM', path='/dev/56eadc97-5731-40cf-8409-aff58d8ffd11/leases', offset=1048576), err=(-243, 'Sanlock resource not acquired', 'Lease is held by another host')"
2019-07-25 16:50:43,196+0200 INFO (jsonrpc/1) [storage.TaskManager.Task] (Task='c0dfac81-5c58-427d-a7d0-e8c695448d27') aborting: Task is aborted: 'Cannot obtain lock: "id=56eadc97-5731-40cf-8409-aff58d8ffd11, rc=-243, out=Cannot acquire Lease(name=\'SDM\', path=\'/dev/56eadc97-5731-40cf-8409-aff58d8ffd11/leases\', offset=1048576), err=(-243, \'Sanlock resource not acquired\', \'Lease is held by another host\')"' - code 651 (task:1181)
2019-07-25 16:50:43,197+0200 ERROR (jsonrpc/1) [storage.Dispatcher] FINISH forcedDetachStorageDomain error=Cannot obtain lock: "id=56eadc97-5731-40cf-8409-aff58d8ffd11, rc=-243, out=Cannot acquire Lease(name='SDM', path='/dev/56eadc97-5731-40cf-8409-aff58d8ffd11/leases', offset=1048576), err=(-243, 'Sanlock resource not acquired', 'Lease is held by another host')" (dispatcher:83)
2019-07-25 16:50:43,197+0200 INFO (jsonrpc/1) [jsonrpc.JsonRpcServer] RPC call StorageDomain.detach failed (error 651) in 24.12 seconds (__init__:312)
2019-07-25 16:50:44,180+0200 INFO (jsonrpc/6) [api.host] START getStats() from=::ffff:10.4.192.79,49038 (api:48)
2019-07-25 16:50:44,222+0200 INFO (jsonrpc/6) [vdsm.api] START repoStats(domains=()) from=::ffff:10.4.192.79,49038, task_id=8a7a0302-4ee3-49a8-a3f7-f9636a123765 (api:48)
2019-07-25 16:50:44,222+0200 INFO (jsonrpc/6) [vdsm.api] FINISH repoStats return={} from=::ffff:10.4.192.79,49038, task_id=8a7a0302-4ee3-49a8-a3f7-f9636a123765 (api:54)
2019-07-25 16:50:44,223+0200 INFO (jsonrpc/6) [vdsm.api] START multipath_health() from=::ffff:10.4.192.79,49038, task_id=fb09923c-0888-4c3f-9b8a-a7750592da22 (api:48)
2019-07-25 16:50:44,223+0200 INFO (jsonrpc/6) [vdsm.api] FINISH multipath_health return={} from=::ffff:10.4.192.79,49038, task_id=fb09923c-0888-4c3f-9b8a-a7750592da22 (api:54)

After putting host into maintenance + reboot of the host and re-run of playbook, all went well again..

This error is related to Sanlock, it looks like you needed to wait longer for the leases to expire.

2) Mac Address pools
I notice that the imported VM has preserved the link state (down in my case) but not the mac address.
The mac address is the one defined in my target engine, that is different from source engine to avoid overlap of mac adresses
Is this an option I can customize?

You can customize the MAC of the VM in :

compute -> virtual machines -> network interfaces -> create new with custom mac

In the DR, in defaults/main.yaml you can set the following variable to false -

# Indicate whether to reset a mac pool of a VM on register.

dr_reset_mac_pool: "True"

In general a VM could have problems when changing its mac address...

3) Clean up dest DC after "discreet failover test"
The guide says:
1. Detach the storage domains from the secondary site.
2. Enable storage replication between the primary and secondary storage domains.

better to also restart the DR hosts?

No need for that.

4) VM consistency
Can we say that all the imported VMs will be "crash consistent"?

I am not sure what you mean by "crash consistent", is it means "highly available" is oVirt language?

Thanks ,
Gianluca

Regards,

Eyal Shenitzky