[ovirt-users] Storage domain experienced a high latency

Nir Soffer nsoffer at redhat.com
Fri Feb 10 20:31:50 UTC 2017


On Fri, Feb 10, 2017 at 9:24 PM, Grundmann, Christian <
Christian.Grundmann at fabasoft.com> wrote:

> Attached
>

I don't see anything unusual.

With only one path, your system is very sensitive to
errors on that single path.

The default oVirt configuration assumes that you have several
paths for each device, and is optimized for fast failover when
a path has an error.

This optimization makes your system more fragile when you
have only one path.

I would use this setting:

    no_path_retry 4

This will retry the I/O 4 times when all paths are faulty before failing it.
If you have a short outage for some reason, this may hide the issue
on the host.
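
For example, a minimal sketch of how this could look in your file,
assuming you keep the stock all_devs device section quoted below
and only change the retry value:

    devices {
        device {
            all_devs                yes
            no_path_retry           4
        }
    }

With polling_interval 5, 4 retries give you roughly 20 seconds
before the I/O fails.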

If you change multipath.conf, you should mark it as private, so that
vdsm will not touch the file on upgrades. To mark the file as
private, the second line of the file must be:

# VDSM PRIVATE
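
For example, assuming you keep the stock revision header from your
current file, the first two lines would be:

    # VDSM REVISION 1.3
    # VDSM PRIVATE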

To check that your file is considered private, you can run:

# vdsm-tool is-configured
Manual override for multipath.conf detected - preserving current
configuration

You should see the 'Manual override' message.
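
If your vdsm-tool version supports the --module option (an assumption
on my side, check vdsm-tool --help), you can limit the check to
multipath:

    # vdsm-tool is-configured --module multipath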

The real question is why the path failed. Did you have anything
in the server logs at that time? Did you have issues on other hosts
at the same time?
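
For example, one way to correlate the path failure with the host logs
(plain grep, adjust to the time of the event):

    grep -E 'multipathd|qla2xxx|blk_update_request|sanlock' /var/log/messages

It is also worth checking the event logs on the storage array and the
FC switch for the same timestamp.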

Nir


>
>
> Thx Christian
>
>
>
> *From:* Nir Soffer [mailto:nsoffer at redhat.com]
> *Sent:* Friday, February 10, 2017 17:43
>
> *To:* Grundmann, Christian <Christian.Grundmann at fabasoft.com>
> *Cc:* users at ovirt.org
> *Subject:* Re: [ovirt-users] Storage domain experienced a high latency
>
>
>
> On Thu, Feb 9, 2017 at 10:03 AM, Grundmann, Christian <
> Christian.Grundmann at fabasoft.com> wrote:
>
> Hi,
>
> @ It can also be a low level issue in the kernel, hba, switch, or server.
> I have the old storage on the same cable, so I don't think it's hba or
> switch related.
> On the same switch I have a few ESXi servers with the same storage setup
> which are working without problems.
>
> @multipath
> I use the stock ng-node multipath configuration
>
> # VDSM REVISION 1.3
>
> defaults {
>     polling_interval            5
>     no_path_retry               fail
>     user_friendly_names         no
>     flush_on_last_del           yes
>     fast_io_fail_tmo            5
>     dev_loss_tmo                30
>     max_fds                     4096
> }
>
> # Remove devices entries when overrides section is available.
> devices {
>     device {
>         # These settings overrides built-in devices settings. It does not
> apply
>         # to devices without built-in settings (these use the settings in
> the
>         # "defaults" section), or to devices defined in the "devices"
> section.
>         # Note: This is not available yet on Fedora 21. For more info see
>         # https://bugzilla.redhat.com/1253799
>         all_devs                yes
>         no_path_retry           fail
>     }
> }
>
> # Enable when this section is available on all supported platforms.
> # Options defined here override device specific options embedded into
> # multipathd.
> #
> # overrides {
> #      no_path_retry           fail
> # }
>
>
> multipath -r v3
> has no output
>
>
>
> My mistake, the correct command is:
>
>
>
> multipath -r -v3
>
>
>
> It creates tons of output, so better redirect it to a file and attach the file:
>
>
>
> multipath -r -v3 > multipath-r-v3.out
>
>
>
>
>
> Thx Christian
>
>
> From: Nir Soffer [mailto:nsoffer at redhat.com]
> Sent: Wednesday, February 8, 2017 20:44
> To: Grundmann, Christian <Christian.Grundmann at fabasoft.com>
> Cc: users at ovirt.org
> Subject: Re: [ovirt-users] Storage domain experienced a high latency
>
>
> On Wed, Feb 8, 2017 at 6:11 PM, Grundmann, Christian <
> Christian.Grundmann at fabasoft.com> wrote:
> Hi,
> got a new FC storage (EMC Unity 300F) which is seen by my hosts in addition
> to my old storage, for migration.
> The new storage has only one path until the migration is done.
> I already have a few VMs running on the new storage without problems.
> But after starting some VMs (I don't really know what the difference is to
> the working ones), the path for the new storage fails.
>
> Engine tells me: Storage Domain <storagedomain> experienced a high latency
> of 22.4875 seconds from host <host>
>
> Where can I start looking?
>
> In /var/log/messages I found:
>
> Feb  8 09:03:53 ovirtnode01 multipathd: 360060160422143002a38935800ae2760:
> sdd - emc_clariion_checker: Active path is healthy.
> Feb  8 09:03:53 ovirtnode01 multipathd: 8:48: reinstated
> Feb  8 09:03:53 ovirtnode01 multipathd: 360060160422143002a38935800ae2760:
> remaining active paths: 1
> Feb  8 09:03:53 ovirtnode01 kernel: blk_update_request: I/O error, dev
> dm-10, sector 8
> Feb  8 09:03:53 ovirtnode01 kernel: blk_update_request: I/O error, dev
> dm-10, sector 5833475
> Feb  8 09:03:53 ovirtnode01 kernel: blk_update_request: I/O error, dev
> dm-10, sector 5833475
> Feb  8 09:03:53 ovirtnode01 kernel: blk_update_request: I/O error, dev
> dm-10, sector 4294967168
> Feb  8 09:03:53 ovirtnode01 kernel: Buffer I/O error on dev dm-207,
> logical block 97, async page read
> Feb  8 09:03:53 ovirtnode01 kernel: blk_update_request: I/O error, dev
> dm-10, sector 4294967168
> Feb  8 09:03:53 ovirtnode01 kernel: blk_update_request: I/O error, dev
> dm-10, sector 4294967280
> Feb  8 09:03:53 ovirtnode01 kernel: blk_update_request: I/O error, dev
> dm-10, sector 4294967280
> Feb  8 09:03:53 ovirtnode01 kernel: blk_update_request: I/O error, dev
> dm-10, sector 0
> Feb  8 09:03:53 ovirtnode01 kernel: blk_update_request: I/O error, dev
> dm-10, sector 0
> Feb  8 09:03:53 ovirtnode01 kernel: blk_update_request: I/O error, dev
> dm-10, sector 4294967168
> Feb  8 09:03:53 ovirtnode01 kernel: device-mapper: multipath: Reinstating
> path 8:48.
> Feb  8 09:03:53 ovirtnode01 kernel: sd 3:0:0:22: alua: port group 01 state
> A preferred supports tolUsNA
> Feb  8 09:03:53 ovirtnode01 sanlock[5192]: 2017-02-08 09:03:53+0100 151809
> [11772]: s59 add_lockspace fail result -202
> Feb  8 09:04:05 ovirtnode01 multipathd: dm-33: remove map (uevent)
> Feb  8 09:04:05 ovirtnode01 multipathd: dm-33: devmap not registered,
> can't remove
> Feb  8 09:04:05 ovirtnode01 multipathd: dm-33: remove map (uevent)
> Feb  8 09:04:06 ovirtnode01 multipathd: dm-34: remove map (uevent)
> Feb  8 09:04:06 ovirtnode01 multipathd: dm-34: devmap not registered,
> can't remove
> Feb  8 09:04:06 ovirtnode01 multipathd: dm-34: remove map (uevent)
> Feb  8 09:04:08 ovirtnode01 multipathd: dm-33: remove map (uevent)
> Feb  8 09:04:08 ovirtnode01 multipathd: dm-33: devmap not registered,
> can't remove
> Feb  8 09:04:08 ovirtnode01 multipathd: dm-33: remove map (uevent)
> Feb  8 09:04:08 ovirtnode01 kernel: dd: sending ioctl 80306d02 to a
> partition!
> Feb  8 09:04:24 ovirtnode01 sanlock[5192]: 2017-02-08 09:04:24+0100 151840
> [15589]: read_sectors delta_leader offset 2560 rv -202
> /dev/f9b70017-0a34-47bc-bf2f-dfc70200a347/ids
> Feb  8 09:04:34 ovirtnode01 sanlock[5192]: 2017-02-08 09:04:34+0100 151850
> [15589]: f9b70017 close_task_aio 0 0x7fd78c0008c0 busy
> Feb  8 09:04:39 ovirtnode01 multipathd: 360060160422143002a38935800ae2760:
> sdd - emc_clariion_checker: Read error for WWN
> 60060160422143002a38935800ae2760.  Sense data are 0x0/0x0/0x0.
> Feb  8 09:04:39 ovirtnode01 multipathd: checker failed path 8:48 in map
> 360060160422143002a38935800ae2760
> Feb  8 09:04:39 ovirtnode01 multipathd: 360060160422143002a38935800ae2760:
> remaining active paths: 0
> Feb  8 09:04:39 ovirtnode01 kernel: qla2xxx [0000:11:00.0]-801c:3: Abort
> command issued nexus=3:0:22 --  1 2002.
> Feb  8 09:04:39 ovirtnode01 kernel: device-mapper: multipath: Failing path
> 8:48.
> Feb  8 09:04:40 ovirtnode01 kernel: qla2xxx [0000:11:00.0]-801c:3: Abort
> command issued nexus=3:0:22 --  1 2002.
> Feb  8 09:04:42 ovirtnode01 kernel: blk_update_request: 8 callbacks
> suppressed
> Feb  8 09:04:42 ovirtnode01 kernel: blk_update_request: I/O error, dev
> dm-10, sector 4294967168
> Feb  8 09:04:42 ovirtnode01 kernel: blk_update_request: I/O error, dev
> dm-10, sector 4294967280
> Feb  8 09:04:42 ovirtnode01 kernel: blk_update_request: I/O error, dev
> dm-10, sector 0
> Feb  8 09:04:42 ovirtnode01 kernel: blk_update_request: I/O error, dev
> dm-10, sector 4294967168
> Feb  8 09:04:42 ovirtnode01 kernel: blk_update_request: I/O error, dev
> dm-10, sector 4294967280
> Feb  8 09:04:42 ovirtnode01 kernel: blk_update_request: I/O error, dev
> dm-10, sector 0
>
> Maybe you should consult the storage vendor about this?
>
> It can also be an incorrect multipath configuration: maybe the multipath
> checker failed, and because you have only one path the device moved to a
> faulty state, and sanlock failed to access the device.
>
> It can also be a low level issue in the kernel, hba, switch, or server.
>
> Let's start by inspecting the multipath configuration. Can you share the
> output of:
>
> cat /etc/multipath.conf
> multipath -r v3
>
> Maybe you can expose one lun for testing, and blacklist this lun in
> multipath.conf. You will not be able to use this lun in oVirt, but it can
> be used to validate the layers below multipath. If a plain lun is ok,
> and the same lun used as a multipath device fails, the problem is likely
> the multipath configuration.
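>
> For example, a blacklist entry for the test lun could look like this
> (the wwid below is only a placeholder, use the wwid of the test lun):
>
>     blacklist {
>         # placeholder wwid - replace with the wwid of the test lun
>         wwid "36006016042214300000000000000abcd"
>     }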
>
> Nir
>
>
>
> multipath -ll output for this Domain
>
> 360060160422143002a38935800ae2760 dm-10 DGC     ,VRAID
> size=2.0T features='1 retain_attached_hw_handler' hwhandler='1 alua' wp=rw
> `-+- policy='service-time 0' prio=50 status=active
>   `- 3:0:0:22 sdd 8:48  active ready  running
>
>
> Thx Christian
>
>
>
>
>