[ovirt-users] oVirt Instability with Dell Compellent via iSCSI/Multipath
Douglas Schilling Landgraf
dougsland at redhat.com
Wed May 20 20:31:49 EDT 2015
On 05/20/2015 07:10 PM, Nir Soffer wrote:
> ----- Original Message -----
>> From: "Chris Jones - BookIt.com Systems Administrator" <chris.jones at bookit.com>
>> To: users at ovirt.org
>> Sent: Thursday, May 21, 2015 12:49:50 AM
>> Subject: Re: [ovirt-users] oVirt Instability with Dell Compellent via iSCSI/Multipath
>>
>>>> vdsm.log in the node side, will help here too.
>>
>> https://www.dropbox.com/s/zvnttmylmrd0hyx/vdsm.log.gz?dl=0. This log
>> contains only the messages from the point at which a host became
>> unresponsive due to storage issues.
>
> According to the log, you have a real issue accessing storage from the host:
>
> [nsoffer at thin untitled (master)]$ repostat vdsm.log
> domain: 05c8fa9c-fcbf-4a17-a3c6-011696a1b9a2
> delay avg: 0.000856 min: 0.000000 max: 0.001168
> last check avg: 11.510000 min: 0.300000 max: 64.100000
> domain: 64101f40-0f10-471d-9f5f-44591f9e087d
> delay avg: 0.008358 min: 0.000000 max: 0.040269
> last check avg: 11.863333 min: 0.300000 max: 63.400000
> domain: 31e97cc8-6a10-4a45-8f25-95eba88b4dc0
> delay avg: 0.007793 min: 0.000819 max: 0.041316
> last check avg: 11.466667 min: 0.000000 max: 70.200000
> domain: 842edf83-22c6-46cd-acaa-a1f76d61e545
> delay avg: 0.000493 min: 0.000374 max: 0.000698
> last check avg: 4.860000 min: 0.200000 max: 9.900000
> domain: b050c455-5ab1-4107-b055-bfcc811195fc
> delay avg: 0.002080 min: 0.000000 max: 0.040142
> last check avg: 11.830000 min: 0.000000 max: 63.700000
> domain: c46adffc-614a-4fa2-9d2d-954f174f6a39
> delay avg: 0.004798 min: 0.000000 max: 0.041006
> last check avg: 18.423333 min: 1.400000 max: 102.900000
> domain: 0b1d36e4-7992-43c7-8ac0-740f7c2cadb7
> delay avg: 0.001002 min: 0.000000 max: 0.001199
> last check avg: 11.560000 min: 0.300000 max: 61.700000
> domain: 20153412-f77a-4944-b252-ff06a78a1d64
> delay avg: 0.003748 min: 0.000000 max: 0.040903
> last check avg: 12.180000 min: 0.000000 max: 67.200000
> domain: 26929b89-d1ca-4718-90d6-b3a6da585451
> delay avg: 0.000963 min: 0.000000 max: 0.001209
> last check avg: 10.993333 min: 0.000000 max: 64.300000
> domain: 0137183b-ea40-49b1-b617-256f47367280
> delay avg: 0.000881 min: 0.000000 max: 0.001227
> last check avg: 11.086667 min: 0.100000 max: 63.200000
>
> Note the high last check maximum value (e.g. 102 seconds).
>
> Vdsm has a monitor thread for each domain, doing a read from one of the storage
> domain's special disks every 10 seconds. When we see a high last check value, it
> means that the monitor thread is stuck reading from the disk.
>
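
For anyone who wants to pick out the affected domains automatically, here is a
rough Python sketch that reads a summary shaped like the one quoted above and
reports the domains whose "last check" maximum is above a limit. The function
name slow_domains and the 30 second limit are made up for illustration; they
are not part of vdsm or of the repostat script.

```python
import re

# Line shapes taken from the summary quoted above:
#   domain: <uuid>
#   last check avg: 11.51 min: 0.30 max: 64.10
DOMAIN_RE = re.compile(r"^\s*domain:\s*(\S+)")
LAST_CHECK_RE = re.compile(
    r"^\s*last check\s+avg:\s*([\d.]+)\s+min:\s*([\d.]+)\s+max:\s*([\d.]+)"
)


def slow_domains(summary, max_seconds=30.0):
    """Return {domain: max last check} for domains above max_seconds.

    max_seconds is an illustrative threshold, not a vdsm setting.
    """
    slow = {}
    domain = None
    for line in summary.splitlines():
        match = DOMAIN_RE.match(line)
        if match:
            domain = match.group(1)
            continue
        match = LAST_CHECK_RE.match(line)
        if match and domain is not None:
            worst = float(match.group(3))
            if worst > max_seconds:
                slow[domain] = worst
    return slow


# Two of the domains from the summary above, without the quote markers.
sample = """\
domain: 842edf83-22c6-46cd-acaa-a1f76d61e545
delay avg: 0.000493 min: 0.000374 max: 0.000698
last check avg: 4.860000 min: 0.200000 max: 9.900000
domain: c46adffc-614a-4fa2-9d2d-954f174f6a39
delay avg: 0.004798 min: 0.000000 max: 0.041006
last check avg: 18.423333 min: 1.400000 max: 102.900000
"""

print(slow_domains(sample))
```

With the sample above only the c46adffc domain is reported, since its monitor
thread went 102.9 seconds without completing a check.
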
> This is an indicator that vms may have trouble accessing these storage domains,
> and engine handles this by making the host non-operational, or, if all hosts
> cannot access the domain, making the domain inactive.
>
> One of the known issues that can be related is a bad multipath configuration. Some
> storage servers have a bad builtin configuration embedded into multipath, in particular
> one using "no_path_retry queue" or "no_path_retry 60". This setting means that when
> the SCSI layer fails and multipath does not have any active path, it will queue
> io forever (queue), or retry many times (e.g. 60) before failing the io request.
>
> This can lead to a stuck process, doing a read or write that never fails or takes
> many minutes to fail. Vdsm is not designed to handle such delays - a stuck thread
> may block other unrelated threads.
>
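
A quick way to spot such settings is to scan the multipath configuration text
for device sections whose no_path_retry is not "fail". The Python sketch below
does that under some assumptions: it only understands simple
"device { key value }" sections, the helper names (parse_settings,
risky_devices) are invented here, and the max_retries default of 0 just
mirrors the "no_path_retry fail" recommendation in this thread. The vendor and
product strings in the sample are placeholders, not real builtin entries.

```python
import re

# Matches "device { ... }" sections but not the enclosing "devices { ... }".
DEVICE_RE = re.compile(r"\bdevice\s*\{([^{}]*)\}")


def parse_settings(body):
    """Parse 'key value' lines of one device section into a dict."""
    settings = {}
    for line in body.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        if len(parts) == 2:
            settings[parts[0]] = parts[1].strip().strip('"')
    return settings


def risky_devices(conf_text, max_retries=0):
    """Return (vendor, product, no_path_retry) for sections that queue io.

    A section is reported when no_path_retry is "queue" or a retry count
    above max_retries (an illustrative limit, not a multipath default).
    """
    findings = []
    for body in DEVICE_RE.findall(conf_text):
        settings = parse_settings(body)
        value = settings.get("no_path_retry")
        if value is None:
            continue
        if value == "queue" or (value.isdigit() and int(value) > max_retries):
            findings.append(
                (settings.get("vendor"), settings.get("product"), value)
            )
    return findings


# Made-up configuration text for illustration only.
sample = """
devices {
    device {
        vendor "EXAMPLE_A"
        product "Disk A"
        no_path_retry queue
    }
    device {
        vendor "EXAMPLE_B"
        product "Disk B"
        no_path_retry fail
    }
}
"""

for vendor, product, value in risky_devices(sample):
    print(vendor, product, "no_path_retry", value)
```

The text to feed in could come from /etc/multipath.conf; the builtin defaults
are not in that file, so they would have to be dumped from multipath itself.
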
> Vdsm includes special configuration for your storage vendor (COMPELNT), but maybe
> it does not match the product (Compellent Vol).
> See https://github.com/oVirt/vdsm/blob/master/lib/vdsm/tool/configurators/multipath.py#L57
>
> device {
>     vendor "COMPELNT"
>     product "Compellent Vol"
>     no_path_retry fail
> }
>
> Another issue may be that the settings for COMPELNT/Compellent Vol are wrong;
> the configuration we ship is missing a lot of settings that exist in the builtin
> configuration, and this may have a bad effect. If your devices match this, I would
> try this multipath configuration instead of the one vdsm configures.
>
> device {
>     vendor "COMPELNT"
>     product "Compellent Vol"
>     path_grouping_policy "multibus"
>     path_checker "tur"
>     features "0"
>     hardware_handler "0"
>     prio "const"
>     failback "immediate"
>     rr_weight "uniform"
>     no_path_retry fail
> }
>
> To verify that your devices match this, you can check the devices' vendor and product
> strings in the output of "multipath -ll". I would like to see the output of this
> command.
>
> Another platform issue is the bad default iSCSI node.session.timeo.replacement_timeout
> value, which is set to 120 seconds. This setting means that the SCSI layer will
> wait 120 seconds for io to complete on one path, before failing the io request.
> So you may have one bad path, causing 120 second delay, while you could complete
> the request using another path.
>
> Multipath is trying to set this value to 5 seconds, but this value is reverting
> to the default 120 seconds after a device has trouble. There is an open bug about
> this which we hope to get fixed in rhel/centos 7.2.
> https://bugzilla.redhat.com/1139038
>
> This issue together with "no_path_retry queue" is a very bad mix for ovirt.
>
> You can fix this timeout by setting:
>
> # /etc/iscsi/iscsid.conf
> node.session.timeo.replacement_timeout = 5
>
> And restarting the iscsid service.
Chris, as you are using ovirt-node, after applying Nir's suggestions please
also run the command below to keep the settings change across reboots:
# persist /etc/iscsi/iscsid.conf
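
If you want to double-check what the file actually contains before persisting
it, a small Python sketch like this one could help. It is only an
illustration: the function name replacement_timeouts is made up, and it does
nothing more than read uncommented "key = value" lines from the text you pass
in.

```python
KEY = "node.session.timeo.replacement_timeout"


def replacement_timeouts(conf_text):
    """Return every uncommented replacement_timeout value found, as ints."""
    values = []
    for line in conf_text.splitlines():
        # Drop comments, then look for "key = value".
        line = line.split("#", 1)[0].strip()
        if "=" not in line:
            continue
        name, _, raw = line.partition("=")
        if name.strip() == KEY:
            values.append(int(raw.strip()))
    return values


# Sample text for illustration; on a host you would read
# /etc/iscsi/iscsid.conf instead, for example:
#     with open("/etc/iscsi/iscsid.conf") as f:
#         print(replacement_timeouts(f.read()))
sample = """\
# Old default, commented out:
# node.session.timeo.replacement_timeout = 120
node.session.timeo.replacement_timeout = 5
"""

print(replacement_timeouts(sample))
```

A single entry with the value 5 means the file matches Nir's suggestion; an
empty list means the setting is missing or commented out.
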
>
> With these tweaks, the issue may be resolved.
>
>
> I hope it helps.
>
> Nir
>
>>
>>>> # rpm -qa | grep -i vdsm
>>>> might help too.
>>
>> vdsm-cli-4.16.14-0.el7.noarch
>> vdsm-reg-4.16.14-0.el7.noarch
>> ovirt-node-plugin-vdsm-0.2.2-5.el7.noarch
>> vdsm-python-zombiereaper-4.16.14-0.el7.noarch
>> vdsm-xmlrpc-4.16.14-0.el7.noarch
>> vdsm-yajsonrpc-4.16.14-0.el7.noarch
>> vdsm-4.16.14-0.el7.x86_64
>> vdsm-gluster-4.16.14-0.el7.noarch
>> vdsm-hook-ethtool-options-4.16.14-0.el7.noarch
>> vdsm-python-4.16.14-0.el7.noarch
>> vdsm-jsonrpc-4.16.14-0.el7.noarch
>>
>>>
>>> Hey Chris,
>>>
>>> please open a bug [1] for this, then we can track it and we can help to
>>> identify the issue.
>>
>> I will do so.
>>
>>
>> _______________________________________________
>> Users mailing list
>> Users at ovirt.org
>> http://lists.ovirt.org/mailman/listinfo/users
>>
>
--
Cheers
Douglas