<div dir="ltr"><div dir="auto"><div dir="ltr"><div class="gmail_extra"><div class="gmail_quote">On Wed, Feb 1, 2017 at 8:39 AM, Nir Soffer <span dir="ltr"><<a href="mailto:nsoffer@redhat.com" target="_blank">nsoffer@redhat.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div class="m_-1949629816991207438m_-2008036110790628506m_-1524918528077666794gmail-HOEnZb"><div class="m_-1949629816991207438m_-2008036110790628506m_-1524918528077666794gmail-h5"><br>
<br>
</div></div>Hi Gianluca,<br>
<br>
This should be a number, not a string, maybe multipath is having trouble<br>
parsing this and it ignores your value?<br></blockquote><div><br></div><div>I don't think so, also because of what the DM Multipath guide says at<br><a href="https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/DM_Multipath/multipath_config_confirm.html" target="_blank">https://access.redhat.com/docu<wbr>mentation/en-US/Red_Hat_Enterp<wbr>rise_Linux/7/html/DM_Multipath<wbr>/multipath_config_confirm.html</a><br></div><div>It seems that in RHEL 7.3 the "show config" command has this behaviour:<br><br>"<br>For example, the following command sequence displays the multipath configuration, including the defaults, before exiting the console.<br><br># multipathd -k<br>> > show config<br>> > CTRL-D<br>"<br><br></div><div>So the output has to include the defaults too. Anyway, I changed it; see below.<br><br></div><div><<<<< BEGIN OF PARENTHESIS <br></div><div>In theory it should be the same on RHEL 6.8 <br></div><div>(see <a href="https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/DM_Multipath/multipath_config_confirm.html" target="_blank">https://access.redhat.com/docu<wbr>mentation/en-US/Red_Hat_Enterp<wbr>rise_Linux/6/html/DM_Multipath<wbr>/multipath_config_confirm.html</a> )<br></div><div>but it is not so for me on a system that is on 6.5, with device-mapper-multipath-0.4.9-<wbr>93.el6.x86_64 and connected to NetApp storage.<br><br>In /usr/share/doc/device-mapper-m<wbr>ultipath-0.4.9/multipath.conf.<wbr>defaults<br><br><br># vendor "NETAPP"<br># product "LUN.*"<br># path_grouping_policy group_by_prio<br># getuid_callout "/lib/udev/scsi_id --whitelisted --device=/dev/%n"<br># path_selector "round-robin 0"<br># path_checker tur<br># features "3 queue_if_no_path pg_init_retries 50"<br># hardware_handler "0"<br># prio ontap<br># failback immediate<br># rr_weight uniform<br># rr_min_io 128<br># rr_min_io_rq 1<br># flush_on_last_del yes<br># fast_io_fail_tmo 5<br># dev_loss_tmo infinity<br># retain_attached_hw_handler yes<br># detect_prio yes<br>
# reload_readwrite yes<br># }<br><br><br><br>My customization in multipath.conf, based on NetApp guidelines and my NetApp storage array setup:<br><br>devices {<br> device {<br> vendor "NETAPP"<br> product "LUN.*"<br> getuid_callout "/lib/udev/scsi_id -g -u -d /dev/%n"<br> hardware_handler "1 alua"<br> prio alua<br> }<br>}<br><br></div><div>If I run "multipathd show config" I see only one entry for the NETAPP/LUN vendor/product, and it is a merge of the defaults and my customization.<br></div><div><br> device {<br> vendor "NETAPP"<br> product "LUN.*"<br> path_grouping_policy group_by_prio<br> getuid_callout "/lib/udev/scsi_id -g -u -d /dev/%n"<br> path_selector "round-robin 0"<br> path_checker tur<br> features "3 queue_if_no_path pg_init_retries 50"<br> hardware_handler "1 alua"<br> prio alua<br> failback immediate<br> rr_weight uniform<br> rr_min_io 128<br> rr_min_io_rq 1<br> flush_on_last_del yes<br> fast_io_fail_tmo 5<br> dev_loss_tmo infinity<br> retain_attached_hw_handler yes<br> detect_prio yes<br> reload_readwrite yes<br> }<br><br></div><div>So this difference confused me when configuring multipath in CentOS 7.3. I will have to see whether this changes when I update from 6.5 to 6.8.<br></div><div><br><<<<< END OF PARENTHESIS <br></div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<span class="m_-1949629816991207438m_-2008036110790628506m_-1524918528077666794gmail-"><br>
> }<br>
> }<br>
><br>
> So I put exactly the default device config for my IBM/1814 device but<br>
> no_path_retry set to 12.<br>
<br>
</span>Why 12?<br>
<br>
This will do 12 retries, 5 seconds each when no path is available. This will<br>
block lvm commands for 60 seconds when no path is available, blocking<br>
other stuff in vdsm. Vdsm is not designed to handle this.<br>
<br>
I recommend a value of 4.<br></blockquote><div><br></div><div>OK.<br></div><br><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<br>
But note that this is not related to the fact that your devices are not<br>
initialized properly after boot.<br></blockquote><div><br></div><div>In fact it could also be an overall DS4700 problem. The LUNs are configured as LNX CLUSTER type, which should be OK in theory, even if this kind of storage was never that well supported on Linux.<br></div><div>Initially one had to use proprietary IBM kernel modules/drivers.<br></div><div>I will check consistency and robustness through testing. <br></div><div>I have to do a POC, this is the hardware I have, and I should at least try to get a working solution for it.<br></div><div> <br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<span class="m_-1949629816991207438m_-2008036110790628506m_-1524918528077666794gmail-"><br>
> In CentOS 6.x when you do something like this, "show config" gives you the<br>
> modified entry only for your device section.<br>
> Instead in CentOS 7.3 it seems I get anyway the default one for IBM/1814 and<br>
> also the customized one at the end of the output....<br>
<br>
</span>Maybe your device configuration does not match exactly the builtin config.<br></blockquote><div><br></div><div>I think it is the different behaviour as outlined above. I think you can confirm in another system where some customization has been done too...<br> <br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<span class="m_-1949629816991207438m_-2008036110790628506m_-1524918528077666794gmail-"><br>
<br>
</span>Maybe waiting a moment helps the storage/switches to clean up<br>
properly after a server is shut down?<br></blockquote><div><br></div><div>I think so too. If errors recur with the new config, I'll try to do a stop/start instead of a restart when possible.<br> <br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<br>
Does your power management trigger a proper shutdown?<br>
I would avoid using it for normal shutdown.<br></blockquote><div><br></div><div>I don't understand exactly what you mean here. Can you elaborate?<br></div><div>Suppose I have to power off one hypervisor (yum update, patching, firmware update, planned server room maintenance, ...). My workflow is the following, all from inside the web admin GUI:<br><br></div><div>Migrate the VMs running on the host (or leave that to the following step)<br></div><div>Put the host into maintenance<br></div><div>Power Mgmt --> Stop<br><br></div><div>When the planned maintenance has finished<br><br></div><div>Power Mgmt --> Start<br></div><div>I should see the host in maintenance<br></div><div dir="auto">Activate</div><div dir="auto"><br></div><div dir="auto">Or do you mean I should do everything from the host itself and not the GUI?</div><div><br></div><div></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><br><div><div class="m_-1949629816991207438m_-2008036110790628506m_-1524918528077666794gmail-h5">
><br>
> multipath 0 1 rdac<br>
> vs<br>
> multipath 1 queue_if_no_path 1 rdac<br>
<br>
</div></div>This is not expected, multipath is using unlimited queueing, which is the worst<br>
setup for ovirt.<br>
<br>
Maybe this is the result of using "12" instead of 12?<br>
<br>
Anyway, looking in multipath source, this is the default configuration for<br>
your device:<br>
<br>
/* DS3950 / DS4200 / DS4700 / DS5020 */<br>
.vendor = "IBM",<br>
.product = "^1814",<br>
.bl_product = "Universal Xport",<br>
.pgpolicy = GROUP_BY_PRIO,<br>
.checker_name = RDAC,<br>
.features = "2 pg_init_retries 50",<br>
.hwhandler = "1 rdac",<br>
.prio_name = PRIO_RDAC,<br>
.pgfailback = -FAILBACK_IMMEDIATE,<br>
.no_path_retry = 30,<br>
},<br>
<br>
and this is the commit that updated this (and other rdac devices):<br>
<a href="http://git.opensvc.com/gitweb.cgi?p=multipath-tools/.git;a=commit;h=c1ed393b91acace284901f16954ba5c1c0d943c9" rel="noreferrer" target="_blank">http://git.opensvc.com/gitweb.<wbr>cgi?p=multipath-tools/.git;a=c<wbr>ommit;h=c1ed393b91acace284901f<wbr>16954ba5c1c0d943c9</a><br>
<br>
So I would try this configuration:<br>
<span class="m_-1949629816991207438m_-2008036110790628506m_-1524918528077666794gmail-"><br>
device {<br>
vendor "IBM"<br>
product "^1814"<br>
<br>
</span> # defaults from multipathd show config<br>
<span class="m_-1949629816991207438m_-2008036110790628506m_-1524918528077666794gmail-"> product_blacklist "Universal Xport"<br>
path_grouping_policy "group_by_prio"<br>
path_checker "rdac"<br>
</span><span class="m_-1949629816991207438m_-2008036110790628506m_-1524918528077666794gmail-"> hardware_handler "1 rdac"<br>
prio "rdac"<br>
failback immediate<br>
rr_weight "uniform"<br>
<br>
</span> # Based on multipath commit c1ed393b91acace284901f16954ba5c1c0d943c9<br>
features "2 pg_init_retries 50"<br>
<br>
# Default is 30 retries, ovirt recommended value is 4 to avoid<br>
# blocking in vdsm. This gives 20 seconds (4 * polling_interval)<br>
# grace time when no path is available.<br>
no_path_retry 4<br>
}<br>
<br>
Ben, do you have any other ideas on debugging this issue and<br>
improving multipath configuration?<br>
<span class="m_-1949629816991207438m_-2008036110790628506m_-1524918528077666794gmail-HOEnZb"><font color="#888888"><br>
Nir<br>
</font></span></blockquote></div><br></div><div class="gmail_extra">OK. In the meantime I have applied your suggested config and restarted the 2 nodes.<br></div><div class="gmail_extra">Let me test and see if I find any problems, also running some I/O tests.<br></div><div class="gmail_extra">Thanks in the meantime,<br></div><div class="gmail_extra">Gianluca<br></div></div></div>
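<div class="gmail_extra"><br>For reference, the timing discussed in this thread can be sketched in a few lines of Python. This is only an illustration, assuming (as stated above) that the queueing time is roughly no_path_retry multiplied by polling_interval, with polling_interval at 5 seconds. The function name queueing_seconds is made up for this example and is not part of multipath or vdsm.<br></div>
<pre>
```python
# Sketch only: estimates how long multipath keeps queueing I/O once all
# paths are down. Assumption taken from this thread:
#   queueing time = no_path_retry * polling_interval
POLLING_INTERVAL = 5  # seconds; the value assumed in this thread


def queueing_seconds(no_path_retry, polling_interval=POLLING_INTERVAL):
    """Return the approximate queueing time in seconds, or None if unlimited."""
    if no_path_retry == "queue":
        return None  # queue forever, like the queue_if_no_path feature
    if no_path_retry == "fail":
        return 0  # fail I/O immediately, no queueing
    return int(no_path_retry) * polling_interval


if __name__ == "__main__":
    # 4 is the value recommended here, 12 was the first attempt,
    # 30 is the builtin value for IBM/1814 quoted from the source.
    for value in (4, 12, 30, "queue", "fail"):
        print(value, queueing_seconds(value))
```
</pre>
<div class="gmail_extra">Under these assumptions, 12 retries give 60 seconds and 4 retries give 20 seconds, matching the figures quoted above.<br></div>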
</div>