[ovirt-users] oVirt Instability with Dell Compellent via iSCSI/Multipath
Nir Soffer
nsoffer at redhat.com
Wed May 20 19:10:46 EDT 2015
----- Original Message -----
> From: "Chris Jones - BookIt.com Systems Administrator" <chris.jones at bookit.com>
> To: users at ovirt.org
> Sent: Thursday, May 21, 2015 12:49:50 AM
> Subject: Re: [ovirt-users] oVirt Instability with Dell Compellent via iSCSI/Multipath
>
> >> vdsm.log in the node side, will help here too.
>
> https://www.dropbox.com/s/zvnttmylmrd0hyx/vdsm.log.gz?dl=0. This log
> contains only the messages from the point when a host became
> unresponsive due to storage issues.
According to the log, you have a real issue accessing storage from the host:
[nsoffer at thin untitled (master)]$ repostat vdsm.log
domain: 05c8fa9c-fcbf-4a17-a3c6-011696a1b9a2
delay avg: 0.000856 min: 0.000000 max: 0.001168
last check avg: 11.510000 min: 0.300000 max: 64.100000
domain: 64101f40-0f10-471d-9f5f-44591f9e087d
delay avg: 0.008358 min: 0.000000 max: 0.040269
last check avg: 11.863333 min: 0.300000 max: 63.400000
domain: 31e97cc8-6a10-4a45-8f25-95eba88b4dc0
delay avg: 0.007793 min: 0.000819 max: 0.041316
last check avg: 11.466667 min: 0.000000 max: 70.200000
domain: 842edf83-22c6-46cd-acaa-a1f76d61e545
delay avg: 0.000493 min: 0.000374 max: 0.000698
last check avg: 4.860000 min: 0.200000 max: 9.900000
domain: b050c455-5ab1-4107-b055-bfcc811195fc
delay avg: 0.002080 min: 0.000000 max: 0.040142
last check avg: 11.830000 min: 0.000000 max: 63.700000
domain: c46adffc-614a-4fa2-9d2d-954f174f6a39
delay avg: 0.004798 min: 0.000000 max: 0.041006
last check avg: 18.423333 min: 1.400000 max: 102.900000
domain: 0b1d36e4-7992-43c7-8ac0-740f7c2cadb7
delay avg: 0.001002 min: 0.000000 max: 0.001199
last check avg: 11.560000 min: 0.300000 max: 61.700000
domain: 20153412-f77a-4944-b252-ff06a78a1d64
delay avg: 0.003748 min: 0.000000 max: 0.040903
last check avg: 12.180000 min: 0.000000 max: 67.200000
domain: 26929b89-d1ca-4718-90d6-b3a6da585451
delay avg: 0.000963 min: 0.000000 max: 0.001209
last check avg: 10.993333 min: 0.000000 max: 64.300000
domain: 0137183b-ea40-49b1-b617-256f47367280
delay avg: 0.000881 min: 0.000000 max: 0.001227
last check avg: 11.086667 min: 0.100000 max: 63.200000
Note the high last check maximum value (e.g. 102 seconds).
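A quick way to spot such domains is to scan the repostat output for large
"last check" maximums. The following is a minimal sketch that assumes the
output format shown above; the function name and the 30 second limit are
hypothetical, not values taken from vdsm or engine.

```python
import re

# Hypothetical limit in seconds; not a value taken from vdsm or engine.
LAST_CHECK_LIMIT = 30.0


def slow_domains(report, limit=LAST_CHECK_LIMIT):
    """Return {domain_uuid: max_last_check} for domains above the limit."""
    slow = {}
    domain = None
    for line in report.splitlines():
        line = line.strip()
        if line.startswith("domain:"):
            # Remember which domain the following lines belong to.
            domain = line.split(":", 1)[1].strip()
        elif line.startswith("last check") and domain is not None:
            match = re.search(r"max:\s*([0-9.]+)", line)
            if match and float(match.group(1)) > limit:
                slow[domain] = float(match.group(1))
    return slow


# Two of the domains from the report above.
SAMPLE = """\
domain: 842edf83-22c6-46cd-acaa-a1f76d61e545
delay avg: 0.000493 min: 0.000374 max: 0.000698
last check avg: 4.860000 min: 0.200000 max: 9.900000
domain: c46adffc-614a-4fa2-9d2d-954f174f6a39
delay avg: 0.004798 min: 0.000000 max: 0.041006
last check avg: 18.423333 min: 1.400000 max: 102.900000
"""

if __name__ == "__main__":
    for uuid, worst in sorted(slow_domains(SAMPLE).items()):
        print(uuid, worst)
```

With this sample only domain c46adffc-... is reported; 842edf83-... stays
below the limit.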
Vdsm has a monitor thread for each domain, reading from one of the storage
domain's special disks every 10 seconds. A high last check value means that
the monitor thread is stuck reading from the disk.
This is an indicator that vms may have trouble accessing these storage domains.
Engine handles this by making the host non-operational, or, if no host can
access the domain, by making the domain inactive.
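The relation between a stuck read and a growing last check value can be
illustrated with a toy model. This is only a sketch of the idea described
above, not vdsm's actual code; the class and method names are made up for
illustration, and the clock is injectable so the behavior can be tried
without real storage.

```python
import time

CHECK_INTERVAL = 10  # seconds between reads, as described above


class DomainMonitor:
    """Toy model of a per-domain monitor; not vdsm's implementation."""

    def __init__(self, read_func, clock=time.monotonic):
        self._read = read_func        # performs the (possibly blocking) read
        self._clock = clock
        self._last_check = clock()    # time the last read completed

    def check_once(self):
        # If the read blocks on the storage, this call does not return
        # and _last_check is not refreshed in the meantime.
        self._read()
        self._last_check = self._clock()

    def last_check_age(self):
        """Seconds since the last completed read."""
        return self._clock() - self._last_check

    def overdue(self, interval=CHECK_INTERVAL):
        """True if no read completed within the expected interval."""
        return self.last_check_age() > interval
```

While reads complete on time, last_check_age() stays near the check
interval. When a read hangs, the age keeps growing until the read returns,
which is the pattern seen in the report above.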
One known issue that may be related is bad multipath configuration. Some
storage servers have a bad builtin configuration embedded into multipath, in
particular "no_path_retry queue" or "no_path_retry 60". This setting means that
when the SCSI layer fails and multipath does not have any active path, it will
queue io forever (queue), or retry many times (e.g. 60) before failing the io
request. This can lead to a stuck process, doing a read or write that never
fails or takes many minutes to fail. Vdsm is not designed to handle such
delays - a stuck thread may block other unrelated threads.
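As a rough illustration of "many minutes": assuming that each retry takes one
polling_interval, and that polling_interval is 5 seconds (an assumption, check
your multipath.conf), the time io stays queued when all paths are down can be
estimated like this. The function name is hypothetical.

```python
import math


def queue_seconds(no_path_retry, polling_interval=5):
    """Estimate how long io is queued when all paths are down.

    Assumes one retry per polling_interval; the default of 5 seconds
    is an assumption, not a value taken from this thread.
    """
    if no_path_retry == "queue":
        return math.inf   # queue forever, the io request never fails
    if no_path_retry == "fail":
        return 0          # fail the io request immediately
    return int(no_path_retry) * polling_interval


if __name__ == "__main__":
    for value in ("fail", 60, "queue"):
        print(value, queue_seconds(value))
```

Under these assumptions "no_path_retry 60" keeps io queued for about 300
seconds, while "no_path_retry fail" fails the request right away.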
Vdsm includes a special configuration for your storage vendor (COMPELNT), but
it may not match the product string of your devices (vdsm expects
"Compellent Vol").
See https://github.com/oVirt/vdsm/blob/master/lib/vdsm/tool/configurators/multipath.py#L57
device {
vendor "COMPELNT"
product "Compellent Vol"
no_path_retry fail
}
Another issue may be that the settings for COMPELNT/Compellent Vol are wrong;
the configuration we ship is missing a lot of settings that exist in the builtin
configuration, and this may have a bad effect. If your devices match this, I
would try this multipath configuration instead of the one vdsm configures.
device {
vendor "COMPELNT"
product "Compellent Vol"
path_grouping_policy "multibus"
path_checker "tur"
features "0"
hardware_handler "0"
prio "const"
failback "immediate"
rr_weight "uniform"
no_path_retry fail
}
To verify that your devices match this, check the devices' vendor and product
strings in the output of "multipath -ll". I would like to see the output of this
command.
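As a cross-check in addition to "multipath -ll", the vendor and model strings
of SCSI disks can also be read from sysfs. This sketch assumes the usual
/sys/block/sdX/device/vendor and /sys/block/sdX/device/model files, and
assumes that the sysfs model string corresponds to the multipath "product"
string; the function name is hypothetical.

```python
import glob
import os


def scsi_identities(sys_block="/sys/block"):
    """Return {device: (vendor, model)} for sd* devices exposing both."""
    found = {}
    for dev_dir in sorted(glob.glob(os.path.join(sys_block, "sd*"))):
        try:
            with open(os.path.join(dev_dir, "device", "vendor")) as f:
                vendor = f.read().strip()
            with open(os.path.join(dev_dir, "device", "model")) as f:
                model = f.read().strip()
        except OSError:
            # Skip devices that do not expose vendor/model attributes.
            continue
        found[os.path.basename(dev_dir)] = (vendor, model)
    return found


if __name__ == "__main__":
    for dev, (vendor, model) in scsi_identities().items():
        print(dev, repr(vendor), repr(model))
```

If the printed strings differ from "COMPELNT" and "Compellent Vol", the device
section above would not match your devices.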
Another platform issue is the bad default value of the iSCSI setting
node.session.timeo.replacement_timeout, which is 120 seconds. This setting means
that the SCSI layer will wait 120 seconds for io to complete on one path before
failing the io request. So one bad path can cause a 120 second delay, while the
request could have completed using another path.
Multipath tries to set this value to 5 seconds, but the value reverts to the
default 120 seconds after a device has trouble. There is an open bug about
this, which we hope to get fixed in rhel/centos 7.2.
https://bugzilla.redhat.com/1139038
This issue together with "no_path_retry queue" is a very bad mix for ovirt.
You can fix this timeout by setting:
# /etc/iscsi/iscsid.conf
node.session.timeo.replacement_timeout = 5
Then restart the iscsid service.
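To confirm what the configuration file currently says, a small check like the
following can help. It is a sketch that assumes the simple "name = value"
format with "#" comments used by /etc/iscsi/iscsid.conf; the function name is
hypothetical.

```python
KEY = "node.session.timeo.replacement_timeout"


def replacement_timeout(conf_text):
    """Return the configured replacement timeout in seconds, or None."""
    value = None
    for line in conf_text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments
        if "=" not in line:
            continue
        name, _, raw = line.partition("=")
        if name.strip() == KEY:
            value = int(raw.strip())           # the last setting wins
    return value


if __name__ == "__main__":
    with open("/etc/iscsi/iscsid.conf") as f:
        print(KEY, "=", replacement_timeout(f.read()))
```

A result of 120 means the default is still in place; None means the setting
does not appear in the file at all.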
With these tweaks, the issue may be resolved.
I hope it helps.
Nir
>
> >> # rpm -qa | grep -i vdsm
> >> might help too.
>
> vdsm-cli-4.16.14-0.el7.noarch
> vdsm-reg-4.16.14-0.el7.noarch
> ovirt-node-plugin-vdsm-0.2.2-5.el7.noarch
> vdsm-python-zombiereaper-4.16.14-0.el7.noarch
> vdsm-xmlrpc-4.16.14-0.el7.noarch
> vdsm-yajsonrpc-4.16.14-0.el7.noarch
> vdsm-4.16.14-0.el7.x86_64
> vdsm-gluster-4.16.14-0.el7.noarch
> vdsm-hook-ethtool-options-4.16.14-0.el7.noarch
> vdsm-python-4.16.14-0.el7.noarch
> vdsm-jsonrpc-4.16.14-0.el7.noarch
>
> >
> > Hey Chris,
> >
> > please open a bug [1] for this, then we can track it and we can help to
> > identify the issue.
>
> I will do so.
>
>
> _______________________________________________
> Users mailing list
> Users at ovirt.org
> http://lists.ovirt.org/mailman/listinfo/users
>