On Tue, Jan 31, 2017 at 6:09 PM, Gianluca Cecchi
<gianluca.cecchi(a)gmail.com> wrote:
On Tue, Jan 31, 2017 at 3:23 PM, Nathanaël Blanchet
<blanchet(a)abes.fr> wrote:
>
> exactly the same issue here with an FC EMC storage domain...
>
>
I'm trying to mitigate this by inserting a timeout for my SAN devices,
but I'm not sure of its effectiveness, as the CentOS 7 behavior of
"multipathd -k" followed by "show config" seems different from CentOS 6.x.
In fact, my attempt at multipath.conf is this:
# VDSM REVISION 1.3
# VDSM PRIVATE

defaults {
    polling_interval        5
    no_path_retry           fail
    user_friendly_names     no
    flush_on_last_del       yes
    fast_io_fail_tmo        5
    dev_loss_tmo            30
    max_fds                 4096
}
# Remove devices entries when overrides section is available.
devices {
    device {
        # These settings overrides built-in devices settings. It does not apply
        # to devices without built-in settings (these use the settings in the
        # "defaults" section), or to devices defined in the "devices" section.
        # Note: This is not available yet on Fedora 21. For more info see
        # https://bugzilla.redhat.com/1253799
        all_devs                yes
        no_path_retry           fail
    }
    device {
        vendor                  "IBM"
        product                 "^1814"
        product_blacklist       "Universal Xport"
        path_grouping_policy    "group_by_prio"
        path_checker            "rdac"
        features                "0"
        hardware_handler        "1 rdac"
        prio                    "rdac"
        failback                immediate
        rr_weight               "uniform"
        no_path_retry           "12"

Hi Gianluca,

This should be a number, not a string. Maybe multipath is having
trouble parsing this and ignores your value?

    }
}
So I put in exactly the default device config for my IBM/1814 device,
but with no_path_retry set to 12.

Why 12?
This will do 12 retries, 5 seconds each, when no path is available. That
will block lvm commands for 60 seconds when no path is available, blocking
other things in vdsm. Vdsm is not designed to handle this.

I recommend a value of 4.
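For example, with the polling_interval 5 from your defaults section, the
arithmetic is:

    no_path_retry 12  ->  12 retries * 5s = 60 seconds of blocked I/O
    no_path_retry 4   ->   4 retries * 5s = 20 seconds of blocked I/O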
But note that this is not related to the fact that your devices are not
initialized properly after boot.
In CentOS 6.x, when you do something like this, "show config" gives you
the modified entry only for your device section. In CentOS 7.3, instead,
I seem to get the default entry for IBM/1814 anyway, and also the
customized one at the end of the output....
Maybe your device configuration does not exactly match the builtin config.
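One way to check, as far as I understand the config merging, is to dump
both entries and compare them field by field:

    # show the builtin and the user section for this array, if both exist
    multipathd -k'show config' | grep -B 2 -A 12 '1814'

If the vendor/product strings in your section are not identical to the
builtin ones, multipath adds a new device entry instead of overriding
the builtin one.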
Two facts:

- Before, I could reproduce the problem if I selected
      Maintenance
      Power Mgmt --> Restart
  (tried 3 times with the same behavior)

- Instead, if I executed it in separate steps
      Maintenance
      Power Mgmt --> Stop
      wait a moment
      Power Mgmt --> Start
  I didn't get problems (tried only one time...)

Maybe waiting a moment helps the storage/switches to clean up properly
after a server is shut down?

Does your power management trigger a proper shutdown?
I would avoid using it for normal shutdown.
With this "new" multipath config (to be confirmed that it is actually in
effect; how?) I don't get the VM-paused problem, even with the Restart
option of Power Mgmt.
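One check I can think of, assuming the wwid used in the messages below
is the map name, is to dump the live device-mapper table and look at the
features field:

    dmsetup table 3600a0b8000299aa80000d08955014098

where "0" in the features position means fail fast and
"1 queue_if_no_path" means I/O is queued when no path is available.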
On the active host I see these messages when the other one reboots:
Jan 31 16:50:01 ovmsrv06 systemd: Started Session 705 of user root.
Jan 31 16:50:01 ovmsrv06 systemd: Starting Session 705 of user root.
Jan 31 16:53:47 ovmsrv06 multipathd: 3600a0b8000299aa80000d08955014098: sde - rdac checker reports path is up
Jan 31 16:53:47 ovmsrv06 multipathd: 8:64: reinstated
Jan 31 16:53:47 ovmsrv06 multipathd: 3600a0b8000299aa80000d08955014098: load table [0 41943040 multipath 1 queue_if_no_path 1 rdac 2 1 service-time 0 2 1 8:224 1 65:0 1 service-time 0 2 1 8:64 1 8:160 1]
Jan 31 16:53:47 ovmsrv06 multipathd: 3600a0b8000299aa80000d08955014098: sdo - rdac checker reports path is ghost
Jan 31 16:53:47 ovmsrv06 multipathd: 8:224: reinstated
Jan 31 16:53:47 ovmsrv06 multipathd: 3600a0b8000299aa80000d08955014098: sdk - rdac checker reports path is up
Jan 31 16:53:47 ovmsrv06 multipathd: 8:160: reinstated
Jan 31 16:53:47 ovmsrv06 kernel: sd 0:0:1:4: rdac: array Z1_DS4700, ctlr 1, queueing MODE_SELECT command
Jan 31 16:53:47 ovmsrv06 multipathd: 3600a0b8000299aa80000d08955014098: sdq - rdac checker reports path is ghost
Jan 31 16:53:47 ovmsrv06 multipathd: 65:0: reinstated
Jan 31 16:53:48 ovmsrv06 kernel: sd 0:0:1:4: rdac: array Z1_DS4700, ctlr 1, MODE_SELECT returned with sense 05/91/36
Jan 31 16:53:48 ovmsrv06 kernel: sd 0:0:1:4: rdac: array Z1_DS4700, ctlr 1, queueing MODE_SELECT command
Jan 31 16:53:49 ovmsrv06 kernel: sd 0:0:1:4: rdac: array Z1_DS4700, ctlr 1, MODE_SELECT returned with sense 05/91/36
Jan 31 16:53:49 ovmsrv06 kernel: sd 0:0:1:4: rdac: array Z1_DS4700, ctlr 1, queueing MODE_SELECT command
Jan 31 16:53:49 ovmsrv06 kernel: sd 0:0:1:4: rdac: array Z1_DS4700, ctlr 1, MODE_SELECT completed
Jan 31 16:53:49 ovmsrv06 kernel: sd 2:0:1:4: rdac: array Z1_DS4700, ctlr 1, queueing MODE_SELECT command
Jan 31 16:53:49 ovmsrv06 kernel: sd 2:0:1:4: rdac: array Z1_DS4700, ctlr 1, MODE_SELECT completed
Jan 31 16:53:52 ovmsrv06 multipathd: 3600a0b8000299aa80000d08955014098: sde - rdac checker reports path is ghost
Jan 31 16:53:52 ovmsrv06 multipathd: 8:64: reinstated
Jan 31 16:53:52 ovmsrv06 multipathd: 3600a0b8000299aa80000d08955014098: load table [0 41943040 multipath 1 queue_if_no_path 1 rdac 2 1 service-time 0 2 1 8:224 1 65:0 1 service-time 0 2 1 8:64 1 8:160 1]
Jan 31 16:53:52 ovmsrv06 multipathd: 3600a0b8000299aa80000d08955014098: sdo - rdac checker reports path is up
Jan 31 16:53:52 ovmsrv06 multipathd: 8:224: reinstated
Jan 31 16:53:52 ovmsrv06 multipathd: 3600a0b8000299aa80000d08955014098: sdk - rdac checker reports path is ghost
Jan 31 16:53:52 ovmsrv06 multipathd: 8:160: reinstated
Jan 31 16:53:52 ovmsrv06 multipathd: 3600a0b8000299aa80000d08955014098: sdq - rdac checker reports path is up
Jan 31 16:53:52 ovmsrv06 multipathd: 65:0: reinstated
But they are not related to the multipath device dedicated to the oVirt
storage domain in this case....

What makes me optimistic is the difference in these lines:
Before I got:

Jan 31 10:27:47 ovmsrv06 multipathd: 3600a0b8000299aa80000d08955014098: load table [0 41943040 multipath 0 1 rdac 2 1 service-time 0 2 1 8:224 1 65:0 1 service-time 0 2 1 8:64 1 8:160 1]

Now I get:

Jan 31 16:53:52 ovmsrv06 multipathd: 3600a0b8000299aa80000d08955014098: load table [0 41943040 multipath 1 queue_if_no_path 1 rdac 2 1 service-time 0 2 1 8:224 1 65:0 1 service-time 0 2 1 8:64 1 8:160 1]

That is:

    multipath 0 1 rdac
vs
    multipath 1 queue_if_no_path 1 rdac
This is not expected: multipath is using unlimited queueing, which is the
worst setup for ovirt. Maybe this is the result of using "12" instead
of 12?
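To verify quickly, you can try disabling queueing on the live map
without touching the configuration (if I remember the multipathd
commands correctly):

    # turn off queue_if_no_path for this map only
    multipathd -k'disablequeueing map 3600a0b8000299aa80000d08955014098'
    # the features field should change back from "1 queue_if_no_path" to "0"
    dmsetup table 3600a0b8000299aa80000d08955014098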
Anyway, looking in the multipath source, this is the default
configuration for your device:

    /* DS3950 / DS4200 / DS4700 / DS5020 */
    .vendor        = "IBM",
    .product       = "^1814",
    .bl_product    = "Universal Xport",
    .pgpolicy      = GROUP_BY_PRIO,
    .checker_name  = RDAC,
    .features      = "2 pg_init_retries 50",
    .hwhandler     = "1 rdac",
    .prio_name     = PRIO_RDAC,
    .pgfailback    = -FAILBACK_IMMEDIATE,
    .no_path_retry = 30,
    },
and this is the commit that updated this (and other rdac devices):
http://git.opensvc.com/gitweb.cgi?p=multipath-tools/.git;a=commit;h=c1ed3...
So I would try this configuration:

    device {
        vendor                  "IBM"
        product                 "^1814"
        # Defaults from "multipathd show config"
        product_blacklist       "Universal Xport"
        path_grouping_policy    "group_by_prio"
        path_checker            "rdac"
        hardware_handler        "1 rdac"
        prio                    "rdac"
        failback                immediate
        rr_weight               "uniform"
        # Based on multipath commit c1ed393b91acace284901f16954ba5c1c0d943c9
        features                "2 pg_init_retries 50"
        # The builtin default is 30 retries (150 seconds with a 5 second
        # polling_interval); the ovirt recommended value is 4, to avoid
        # blocking in vdsm. This gives 20 seconds (4 * polling_interval)
        # grace time when no path is available.
        no_path_retry           4
    }
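After updating multipath.conf, something like this should apply the new
settings and confirm they took effect (a sketch; exact command spellings
may vary between multipath versions):

    # re-read multipath.conf and reload the maps
    multipathd -k'reconfigure'
    # confirm the merged device section now shows no_path_retry 4
    multipathd -k'show config' | grep -A 12 '1814'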
Ben, do you have any other ideas on debugging this issue and
improving multipath configuration?
Nir