Another issue may be that the settings for COMPELNT/Compellent Vol are
wrong; the settings we ship are missing a lot of settings that exist in the
builtin configuration, and this may have a bad effect. If your devices match
this, I would try this multipath configuration instead of the one vdsm
configures.
device {
    vendor "COMPELNT"
    product "Compellent Vol"
    path_grouping_policy "multibus"
    path_checker "tur"
    features "0"
    hardware_handler "0"
    prio "const"
    failback "immediate"
    rr_weight "uniform"
    no_path_retry fail
}
I wish I could. We're using the CentOS 7 ovirt-node-iso. The
multipath.conf is less than ideal, but when I tried updating it, oVirt
instantly overwrote it. To be clear, yes, I know changes do not survive
reboots, and yes, I know about persist, but oVirt changes the file while
the node is running. Live! Persist won't help there.
I also tried building a CentOS 7 "thick client" where I set up CentOS 7
first, added the oVirt repo, then let the engine provision it. Same
problem with multipath.conf being overwritten with the default oVirt setup.
So I tried to be slick about it. I made the multipath.conf immutable.
That prevented the engine from activating the node. It would fail on a
vds command that gets the node's capabilities; part of what that command
does is read and then overwrite multipath.conf.
How do I safely update multipath.conf?
To verify that your devices match this, you can check the devices' vendor
and product strings in the output of "multipath -ll". I would like to see
the output of this command.
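As a rough sketch of that check, the helper below pulls the unique
"VENDOR,PRODUCT" strings out of "multipath -ll" output. The function name
list_vendor_product is made up for illustration, and it assumes each map
header line ends with "dm-N VENDOR,PRODUCT", which may differ between
multipath versions.

```shell
# Hypothetical helper: print the unique "VENDOR,PRODUCT" strings found in
# `multipath -ll` output read from stdin.
# Assumption: map header lines look like
#   mpatha (WWID) dm-2 COMPELNT,Compellent Vol
# Path lines and "size=..." lines contain no " dm-N " token and are skipped.
list_vendor_product() {
    sed -n 's/^.* dm-[0-9][0-9]*  *\(.*,.*\)$/\1/p' | sort -u
}

# Usage on a host (needs root):
#   multipath -ll | list_vendor_product
```

If the result includes "COMPELNT,Compellent Vol", the device section above
should apply to those devices.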
The output of multipath -ll (default setup) can be seen here:
http://paste.linux-help.org/view/430c7538
Another platform issue is the bad default value of the iSCSI setting
node.session.timeo.replacement_timeout,
which is set to 120 seconds. This setting means that the SCSI layer will
wait 120 seconds for I/O to complete on one path before failing the I/O request.
So you may have one bad path causing a 120 second delay, while you could complete
the request using another path.
Multipath tries to set this value to 5 seconds, but the value reverts
to the default 120 seconds after a device has trouble. There is an open bug about
this, which we hope to get fixed in RHEL/CentOS 7.2:
https://bugzilla.redhat.com/1139038
This issue together with "no_path_retry queue" is a very bad mix for oVirt.
You can fix this timeout by setting:
# /etc/iscsi/iscsid.conf
node.session.timeo.replacement_timeout = 5
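A minimal sketch of making that edit from a script, in case it helps. The
function name set_replacement_timeout is hypothetical, and this only
rewrites the file; whether the change persists on ovirt-node is a separate
question.

```shell
# Hypothetical helper: set node.session.timeo.replacement_timeout in an
# iscsid.conf-style file.
# Usage: set_replacement_timeout FILE SECONDS
set_replacement_timeout() {
    conf=$1
    secs=$2
    key='node.session.timeo.replacement_timeout'
    if grep -q "^[[:space:]]*$key[[:space:]]*=" "$conf"; then
        # Rewrite the existing, uncommented setting via a temporary file.
        sed "s/^[[:space:]]*$key[[:space:]]*=.*/$key = $secs/" "$conf" \
            > "$conf.tmp" && mv "$conf.tmp" "$conf"
    else
        # No active setting found; append one.
        printf '%s = %s\n' "$key" "$secs" >> "$conf"
    fi
}

# Usage (needs root):
#   set_replacement_timeout /etc/iscsi/iscsid.conf 5
# iscsid.conf is only used as the default for newly created node records.
# Existing records can be updated with:
#   iscsiadm -m node -o update -n node.session.timeo.replacement_timeout -v 5
# The current per-session value is visible in
#   /sys/class/iscsi_session/session*/recovery_tmo
```

Already logged-in sessions should pick up the new value on the next login.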
I'll see if that's possible with persist. Will this change survive node
upgrades?
Thanks for the reply and the suggestions.