On Tue, Jan 31, 2017 at 3:23 PM, Nathanaël Blanchet <blanchet@abes.fr> wrote:

exactly the same issue over here with an FC EMC storage domain...



I'm trying to mitigate the problem by inserting a timeout for my SAN devices, but I'm not sure of its effectiveness, as the CentOS 7 behavior of "multipathd -k" followed by "show config" seems different from CentOS 6.x.
In fact, my attempt at multipath.conf is this:


# VDSM REVISION 1.3
# VDSM PRIVATE

defaults {
    polling_interval            5
    no_path_retry               fail
    user_friendly_names         no
    flush_on_last_del           yes
    fast_io_fail_tmo            5
    dev_loss_tmo                30
    max_fds                     4096
}

# Remove the devices entries when the overrides section is available.
devices {
    device {
        # These settings override built-in device settings. They do not apply
        # to devices without built-in settings (those use the settings in the
        # "defaults" section), or to devices defined in the "devices" section.
        # Note: This is not available yet on Fedora 21. For more info see
        # https://bugzilla.redhat.com/1253799
        all_devs                yes
        no_path_retry           fail
    }
    device {
        vendor                  "IBM"
        product                 "^1814"
        product_blacklist       "Universal Xport"
        path_grouping_policy    "group_by_prio"
        path_checker            "rdac"
        features                "0"
        hardware_handler        "1 rdac"
        prio                    "rdac"
        failback                immediate
        rr_weight               "uniform"
        no_path_retry           "12"
    }
}
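
(Aside: when the "overrides" section mentioned in the comment above becomes available, my understanding is that the all_devs device entry could be replaced by something like this - an untested sketch on my part:)

overrides {
    # untested sketch: should apply to every device, with or without built-in settings
    no_path_retry           fail
}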

So I copied exactly the built-in device config for my IBM/1814 array, but with no_path_retry set to 12.

In CentOS 6.x, when you do something like this, "show config" gives you only the modified entry for your device section.
In CentOS 7.3, instead, I seem to get both the built-in entry for IBM/1814 and the customized one at the end of the output...
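
This is roughly how I compare the entries, in case it helps (assuming the single-shot "-k" form behaves the same as the interactive prompt):

# dump the runtime config and look at the IBM/1814 entries;
# on CentOS 7.3 I see the built-in one first and my customized one near the end
multipathd -k"show config" | grep -n -B 2 -A 15 1814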

Two facts:
- Before, I could reproduce the problem if I selected
  Maintenance
  Power Mgmt ---> Restart
  (tried 3 times with the same behavior)

- Instead, if I executed the steps separately:
  Maintenance
  Power Mgmt --> Stop
  wait a moment
  Power Mgmt --> Start
  I didn't get any problems (tried only once...)

With this "new" multipath config (to be confirmed if in effect, how?) I don't get the VM paused problem even with Restart option of Power Mgmt
In the active host's messages I see these lines when the other host reboots:

Jan 31 16:50:01 ovmsrv06 systemd: Started Session 705 of user root.
Jan 31 16:50:01 ovmsrv06 systemd: Starting Session 705 of user root.
Jan 31 16:53:47 ovmsrv06 multipathd: 3600a0b8000299aa80000d08955014098: sde - rdac checker reports path is up
Jan 31 16:53:47 ovmsrv06 multipathd: 8:64: reinstated
Jan 31 16:53:47 ovmsrv06 multipathd: 3600a0b8000299aa80000d08955014098: load table [0 41943040 multipath 1 queue_if_no_path 1 rdac 2 1 service-time 0 2 1 8:224 1 65:0 1 service-time 0 2 1 8:64 1 8:160 1]
Jan 31 16:53:47 ovmsrv06 multipathd: 3600a0b8000299aa80000d08955014098: sdo - rdac checker reports path is ghost
Jan 31 16:53:47 ovmsrv06 multipathd: 8:224: reinstated
Jan 31 16:53:47 ovmsrv06 multipathd: 3600a0b8000299aa80000d08955014098: sdk - rdac checker reports path is up
Jan 31 16:53:47 ovmsrv06 multipathd: 8:160: reinstated
Jan 31 16:53:47 ovmsrv06 kernel: sd 0:0:1:4: rdac: array Z1_DS4700, ctlr 1, queueing MODE_SELECT command
Jan 31 16:53:47 ovmsrv06 multipathd: 3600a0b8000299aa80000d08955014098: sdq - rdac checker reports path is ghost
Jan 31 16:53:47 ovmsrv06 multipathd: 65:0: reinstated
Jan 31 16:53:48 ovmsrv06 kernel: sd 0:0:1:4: rdac: array Z1_DS4700, ctlr 1, MODE_SELECT returned with sense 05/91/36
Jan 31 16:53:48 ovmsrv06 kernel: sd 0:0:1:4: rdac: array Z1_DS4700, ctlr 1, queueing MODE_SELECT command
Jan 31 16:53:49 ovmsrv06 kernel: sd 0:0:1:4: rdac: array Z1_DS4700, ctlr 1, MODE_SELECT returned with sense 05/91/36
Jan 31 16:53:49 ovmsrv06 kernel: sd 0:0:1:4: rdac: array Z1_DS4700, ctlr 1, queueing MODE_SELECT command
Jan 31 16:53:49 ovmsrv06 kernel: sd 0:0:1:4: rdac: array Z1_DS4700, ctlr 1, MODE_SELECT completed
Jan 31 16:53:49 ovmsrv06 kernel: sd 2:0:1:4: rdac: array Z1_DS4700, ctlr 1, queueing MODE_SELECT command
Jan 31 16:53:49 ovmsrv06 kernel: sd 2:0:1:4: rdac: array Z1_DS4700, ctlr 1, MODE_SELECT completed
Jan 31 16:53:52 ovmsrv06 multipathd: 3600a0b8000299aa80000d08955014098: sde - rdac checker reports path is ghost
Jan 31 16:53:52 ovmsrv06 multipathd: 8:64: reinstated
Jan 31 16:53:52 ovmsrv06 multipathd: 3600a0b8000299aa80000d08955014098: load table [0 41943040 multipath 1 queue_if_no_path 1 rdac 2 1 service-time 0 2 1 8:224 1 65:0 1 service-time 0 2 1 8:64 1 8:160 1]
Jan 31 16:53:52 ovmsrv06 multipathd: 3600a0b8000299aa80000d08955014098: sdo - rdac checker reports path is up
Jan 31 16:53:52 ovmsrv06 multipathd: 8:224: reinstated
Jan 31 16:53:52 ovmsrv06 multipathd: 3600a0b8000299aa80000d08955014098: sdk - rdac checker reports path is ghost
Jan 31 16:53:52 ovmsrv06 multipathd: 8:160: reinstated
Jan 31 16:53:52 ovmsrv06 multipathd: 3600a0b8000299aa80000d08955014098: sdq - rdac checker reports path is up
Jan 31 16:53:52 ovmsrv06 multipathd: 65:0: reinstated

But in this case they are not related to the multipath device dedicated to the oVirt storage domain...
What makes me optimistic is the difference in these lines:

Before I got:
Jan 31 10:27:47 ovmsrv06 multipathd: 3600a0b8000299aa80000d08955014098: load table [0 41943040 multipath 0 1 rdac 2 1 service-time 0 2 1 8:224 1 65:0 1 service-time 0 2 1 8:64 1 8:160 1]

Now I get:
Jan 31 16:53:52 ovmsrv06 multipathd: 3600a0b8000299aa80000d08955014098: load table [0 41943040 multipath 1 queue_if_no_path 1 rdac 2 1 service-time 0 2 1 8:224 1 65:0 1 service-time 0 2 1 8:64 1 8:160 1]

multipath 0 1 rdac
vs
multipath 1 queue_if_no_path 1 rdac

If I read the device-mapper table format correctly, the leading "1 queue_if_no_path" means one feature is now set (queue I/O while no path is available), while the previous "0" meant no features at all, i.e. fail immediately, so the new no_path_retry 12 does seem to be in effect.

Any confirmation?
Thanks in advance