On Fri, Apr 23, 2021 at 7:15 PM Nir Soffer <nsoffer@redhat.com> wrote:

>> > 1) Is this the expected behavior?
>>
>> yes, before removing multipath devices, you need to unzone the LUN on the storage
>> server. As oVirt doesn't manage the storage server in the case of iSCSI, this has to
>> be done by the storage server admin, and therefore oVirt cannot manage the whole flow.
>>
> Thank you for the information. Perhaps you can expand, then, on how the volumes are picked up once mapped from the storage system? Traditionally, when mapping storage from an iSCSI or Fibre Channel array, we have to initiate a LIP or an iSCSI login. How is it that oVirt doesn't need to do this?
>
>> > 2) Are we supposed to go to each KVM host and manually remove the
>> > underlying multipath devices?
>>
>> oVirt provides an Ansible playbook for it:
>>
>> https://github.com/oVirt/ovirt-ansible-collection/blob/master/examples/remove_mpath_device.yml
>>
>> Usage is as follows:
>>
>> ansible-playbook --extra-vars "lun=<LUN_ID>" remove_mpath_device.yml
>

I had to decommission one iSCSI-based storage domain, after having added a new iSCSI one (with another portal) and moved all the objects into it (VM disks, template disks, ISO disks, leases).
The environment is based on oVirt 4.4.6, with 3 hosts and an external engine.
So I tried the Ansible playbook way to verify it.

The initial situation is shown below; the storage domain to decommission is ovsd3750, based on the 5 TB LUN.

$ sudo multipath -l
364817197c52f98316900666e8c2b0b2b dm-13 EQLOGIC,100E-00
size=2.0T features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
`-+- policy='round-robin 0' prio=0 status=active
  |- 16:0:0:0 sde 8:64 active undef running
  `- 17:0:0:0 sdf 8:80 active undef running
36090a0d800851c9d2195d5b837c9e328 dm-2 EQLOGIC,100E-00
size=5.0T features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
`-+- policy='round-robin 0' prio=0 status=active
  |- 13:0:0:0 sdb 8:16 active undef running
  `- 14:0:0:0 sdc 8:32 active undef running

Connections use iSCSI multipathing (iscsi1 and iscsi2 in the web admin GUI), so that I have two paths to each LUN:

$ sudo iscsiadm -m node
10.10.100.7:3260,1 iqn.2001-05.com.equallogic:0-8a0906-9d1c8500d-28e3c937b8d59521-ovsd3750
10.10.100.7:3260,1 iqn.2001-05.com.equallogic:0-8a0906-9d1c8500d-28e3c937b8d59521-ovsd3750
10.10.100.9:3260,1 iqn.2001-05.com.equallogic:4-771816-31982fc59-2b0b2b8c6e660069-ovsd3920
10.10.100.9:3260,1 iqn.2001-05.com.equallogic:4-771816-31982fc59-2b0b2b8c6e660069-ovsd3920

$ sudo iscsiadm -m session
tcp: [1] 10.10.100.7:3260,1 iqn.2001-05.com.equallogic:0-8a0906-9d1c8500d-28e3c937b8d59521-ovsd3750 (non-flash)
tcp: [2] 10.10.100.7:3260,1 iqn.2001-05.com.equallogic:0-8a0906-9d1c8500d-28e3c937b8d59521-ovsd3750 (non-flash)
tcp: [4] 10.10.100.9:3260,1 iqn.2001-05.com.equallogic:4-771816-31982fc59-2b0b2b8c6e660069-ovsd3920 (non-flash)
tcp: [5] 10.10.100.9:3260,1 iqn.2001-05.com.equallogic:4-771816-31982fc59-2b0b2b8c6e660069-ovsd3920 (non-flash)

One point that, in my opinion, was not taken into consideration in the previously opened bugs is the deletion of the iSCSI connections and nodes on the host side (probably to be done by the OS admin, but it could be handled by the Ansible playbook; see the sketch after the bug list below).
The bugs I'm referring to are:
Bug 1310330 - [RFE] Provide a way to remove stale LUNs from hypervisors
Bug 1928041 - Stale DM links after block SD removal
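
As a sketch of what I mean (untested; the target IQN would have to be passed in as an extra var, say "target", alongside the existing "lun" one), the playbook could end with tasks along these lines:

- name: Log out of the iSCSI target
  # hypothetical task: "target" is assumed to be passed via --extra-vars
  ansible.builtin.shell: iscsiadm -m node -T {{ target }} -u
  ignore_errors: true

- name: Delete the iSCSI node record so the host does not reconnect at boot
  ansible.builtin.shell: iscsiadm -m node -T {{ target }} -o delete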

Actions done:
put storage domain into maintenance
detach storage domain
remove storage domain
remove access from the Equallogic admin GUI

I have a group named ovirt in my Ansible inventory, composed of my 3 hosts: ov200, ov300 and ov301.
I executed:
$ ansible-playbook -b -l ovirt --extra-vars "lun=36090a0d800851c9d2195d5b837c9e328" remove_mpath_device.yml

It all went OK on ov200 and ov300, but for ov301 I got:

fatal: [ov301]: FAILED! => {"changed": true, "cmd": "multipath -f \"36090a0d800851c9d2195d5b837c9e328\"", "delta": "0:00:00.009003", "end": "2021-07-15 11:17:37.340584", "msg": "non-zero return code", "rc": 1, "start": "2021-07-15 11:17:37.331581", "stderr": "Jul 15 11:17:37 | 36090a0d800851c9d2195d5b837c9e328: map in use", "stderr_lines": ["Jul 15 11:17:37 | 36090a0d800851c9d2195d5b837c9e328: map in use"], "stdout": "", "stdout_lines": []}

The complete output:

$ ansible-playbook -b -l ovirt --extra-vars "lun=36090a0d800851c9d2195d5b837c9e328" remove_mpath_device.yml

PLAY [Cleanly remove unzoned storage devices (LUNs)] *************************************************************

TASK [Gathering Facts] *******************************************************************************************
ok: [ov200]
ok: [ov300]
ok: [ov301]

TASK [Get underlying disks (paths) for a multipath device and turn them into a list.] ****************************
changed: [ov300]
changed: [ov200]
changed: [ov301]

TASK [Remove from multipath device.] *****************************************************************************
changed: [ov200]
changed: [ov300]
fatal: [ov301]: FAILED! => {"changed": true, "cmd": "multipath -f \"36090a0d800851c9d2195d5b837c9e328\"", "delta": "0:00:00.009003", "end": "2021-07-15 11:17:37.340584", "msg": "non-zero return code", "rc": 1, "start": "2021-07-15 11:17:37.331581", "stderr": "Jul 15 11:17:37 | 36090a0d800851c9d2195d5b837c9e328: map in use", "stderr_lines": ["Jul 15 11:17:37 | 36090a0d800851c9d2195d5b837c9e328: map in use"], "stdout": "", "stdout_lines": []}

TASK [Remove each path from the SCSI subsystem.] *****************************************************************
changed: [ov300] => (item=sdc)
changed: [ov300] => (item=sdb)
changed: [ov200] => (item=sdc)
changed: [ov200] => (item=sdb)

PLAY RECAP *******************************************************************************************************
ov200 : ok=4    changed=3    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0  
ov300 : ok=4    changed=3    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0  
ov301 : ok=2    changed=1    unreachable=0    failed=1    skipped=0    rescued=0    ignored=0 

Indeed, going to the server I get:

[root@ov301 ~]# multipath -f 36090a0d800851c9d2195d5b837c9e328
Jul 15 11:24:37 | 36090a0d800851c9d2195d5b837c9e328: map in use
[root@ov301 ~]#

The dm device underlying the multipath one is dm-2:
[root@ov301 ~]# ll /dev/dm-2
brw-rw----. 1 root disk 253, 2 Jul 15 11:28 /dev/dm-2
[root@ov301 ~]#

[root@ov301 ~]# lsof | grep "253,2"

I get no lines for "253,2" itself, only other devices whose minor numbers begin with 2 (e.g. 24, 25, 27...):
. . .
qemu-kvm    10638   10653 vnc_worke            qemu   84u      BLK             253,24       0t0  112027277 /dev/dm-24
qemu-kvm    11479                              qemu   43u      BLK             253,27       0t0  112135384 /dev/dm-27
qemu-kvm    11479                              qemu  110u      BLK             253,25       0t0  112140523 /dev/dm-25

So, nothing shows up for dm-2.
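
To avoid grep's substring matching (253,24 also contains "253,2"), the device node can be queried directly; both of these should print nothing if no process has it open:

lsof /dev/dm-2
fuser -v /dev/dm-2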

What can I do to cross-check what is using the device and preventing the "-f" from completing?
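
A couple of checks that might help (a sketch, assuming the stock device-mapper tools; I have not verified this on the failing host):

dmsetup info 36090a0d800851c9d2195d5b837c9e328   # "Open count" > 0 means something still holds the map
ls /sys/block/dm-2/holders/                      # dm devices stacked on top, e.g. stale LVs

If the holders directory shows stale LVM logical volumes still active on the old LUN, deactivating them (lvchange -an) before retrying "multipath -f" might be what is missing.
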
Now I get

# multipath -l
364817197c52f98316900666e8c2b0b2b dm-14 EQLOGIC,100E-00
size=2.0T features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
`-+- policy='round-robin 0' prio=0 status=active
  |- 16:0:0:0 sde 8:64 active undef running
  `- 17:0:0:0 sdf 8:80 active undef running
36090a0d800851c9d2195d5b837c9e328 dm-2 ##,##
size=5.0T features='0' hwhandler='0' wp=rw

Another thing that could perhaps be improved in the Ansible playbook: when I remove FC or iSCSI LUNs under multipath on a Linux system, after the "multipath -f" command and before the "echo 1 > .../device/delete" one, I also run, for safety:

blockdev --flushbufs /dev/$i
where $i loops over the devices composing the multipath.
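
In shell terms, a minimal sketch (with sdb and sdc being the paths from this example):

for i in sdb sdc; do
    blockdev --flushbufs /dev/$i           # flush any buffered data for the path
    echo 1 > /sys/block/$i/device/delete   # then remove the path from the SCSI subsystem
done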

I see that in the web admin GUI, under Datacenter --> iSCSI Multipath (iscsi1 and iscsi2), the connection to the removed SD is no longer there.
But on the host side nothing changed from the iSCSI point of view.
So I executed:

Log out from the sessions:
[root@ov300 ~]# iscsiadm -m session -r 1 -u
Logging out of session [sid: 1, target: iqn.2001-05.com.equallogic:0-8a0906-9d1c8500d-28e3c937b8d59521-ovsd3750, portal: 10.10.100.7,3260]
Logout of [sid: 1, target: iqn.2001-05.com.equallogic:0-8a0906-9d1c8500d-28e3c937b8d59521-ovsd3750, portal: 10.10.100.7,3260] successful.
[root@ov300 ~]# iscsiadm -m session -r 2 -u
Logging out of session [sid: 2, target: iqn.2001-05.com.equallogic:0-8a0906-9d1c8500d-28e3c937b8d59521-ovsd3750, portal: 10.10.100.7,3260]
Logout of [sid: 2, target: iqn.2001-05.com.equallogic:0-8a0906-9d1c8500d-28e3c937b8d59521-ovsd3750, portal: 10.10.100.7,3260] successful.
[root@ov300 ~]#

and then removed the node:
[root@ov300 ~]# iscsiadm -m node -T iqn.2001-05.com.equallogic:0-8a0906-9d1c8500d-28e3c937b8d59521-ovsd3750 -o delete
[root@ov300 ~]# ll /var/lib/iscsi/nodes/
total 4
drw-------. 3 root root 4096 Jul 13 11:18 iqn.2001-05.com.equallogic:4-771816-31982fc59-2b0b2b8c6e660069-ovsd3920
[root@ov300 ~]#

while previously I had:
[root@ov300 ~]# ll /var/lib/iscsi/nodes/
total 8
drw-------. 3 root root 4096 Jan 12  2021 iqn.2001-05.com.equallogic:0-8a0906-9d1c8500d-28e3c937b8d59521-ovsd3750
drw-------. 3 root root 4096 Jul 13 11:18 iqn.2001-05.com.equallogic:4-771816-31982fc59-2b0b2b8c6e660069-ovsd3920
[root@ov300 ~]#

Otherwise, I think that at reboot the host will try to reconnect to the no-longer-existing portal...
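
By the way, since the hosts are already in the "ovirt" inventory group, the same logout and node deletion could presumably be done on all of them in one shot with an ad-hoc command (untested):

$ ansible ovirt -b -m ansible.builtin.shell -a "iscsiadm -m node -T iqn.2001-05.com.equallogic:0-8a0906-9d1c8500d-28e3c937b8d59521-ovsd3750 -u; iscsiadm -m node -T iqn.2001-05.com.equallogic:0-8a0906-9d1c8500d-28e3c937b8d59521-ovsd3750 -o delete"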

Comments welcome

Gianluca