<div dir="ltr"><div dir="auto"><div dir="ltr"><div class="gmail_extra"><div class="gmail_quote">On Wed, Feb 1, 2017 at 8:39 AM, Nir Soffer <span dir="ltr">&lt;<a href="mailto:nsoffer@redhat.com" target="_blank">nsoffer@redhat.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div class="m_-1949629816991207438m_-2008036110790628506m_-1524918528077666794gmail-HOEnZb"><div class="m_-1949629816991207438m_-2008036110790628506m_-1524918528077666794gmail-h5"><br>

<br>

</div></div>Hi Gianluca,<br>

<br>

This should be a number, not a string, maybe multipath is having trouble<br>

parsing this and it ignores your value?<br></blockquote><div><br></div><div>I don&#39;t think so. Also because reading dm multipath guide at<br><a href="https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/DM_Multipath/multipath_config_confirm.html" target="_blank">https://access.redhat.com/docu<wbr>mentation/en-US/Red_Hat_Enterp<wbr>rise_Linux/7/html/DM_Multipath<wbr>/multipath_config_confirm.html</a><br></div><div>It seems that in RH EL 7.3 the &quot;show config&quot; command has this behaviour:<br><br>&quot;<br>For example, the following command sequence displays the multipath configuration, including the defaults, before exiting the console.<br><br># multipathd -k<br>&gt; &gt; show config<br>&gt; &gt; CTRL-D<br>&quot;<br><br></div><div>So the output has to include the default too. Anyway I changed it, see below<br><br></div><div>&lt;&lt;&lt;&lt;&lt; BEGIN OF PARENTHESIS <br></div><div>In theory it should be the same on RH EL 6.8 <br></div><div>(see <a href="https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/DM_Multipath/multipath_config_confirm.html" target="_blank">https://access.redhat.com/docu<wbr>mentation/en-US/Red_Hat_Enterp<wbr>rise_Linux/6/html/DM_Multipath<wbr>/multipath_config_confirm.html</a> )<br></div><div>but it is not so for me on a system that is on 6.5, with device-mapper-multipath-0.4.9-<wbr>93.el6.x86_64 and connected to Netapp<br><br>In /usr/share/doc/device-mapper-m<wbr>ultipath-0.4.9/multipath.conf.<wbr>defaults<br><br><br>#               vendor &quot;NETAPP&quot;<br>#               product &quot;LUN.*&quot;<br>#               path_grouping_policy group_by_prio<br>#               getuid_callout &quot;/lib/udev/scsi_id --whitelisted --device=/dev/%n&quot;<br>#               path_selector &quot;round-robin 0&quot;<br>#               path_checker tur<br>#               features &quot;3 queue_if_no_path pg_init_retries 50&quot;<br>#               hardware_handler &quot;0&quot;<br>#            
   prio ontap<br>#               failback immediate<br>#               rr_weight uniform<br>#               rr_min_io 128<br>#               rr_min_io_rq 1<br>#               flush_on_last_del yes<br>#               fast_io_fail_tmo 5<br>#               dev_loss_tmo infinity<br>#               retain_attached_hw_handler yes<br>#               detect_prio yes<br>#               reload_readwrite yes<br>#       }<br><br><br><br>My customization in multipath.conf, based on Netapp guidelines and my Netapp storage array setup:<br><br>devices {<br>       device {<br>               vendor &quot;NETAPP&quot;<br>               product &quot;LUN.*&quot;<br>               getuid_callout &quot;/lib/udev/scsi_id -g -u -d /dev/%n&quot;<br>               hardware_handler &quot;1 alua&quot;<br>               prio alua<br>        }<br><br></div><div>If I run &quot;multipathd show config&quot; I see only 1 entry for NETAPP/LUN vendor/product and it is a merge of default and my custom.<br></div><div><br>        device {<br>                vendor &quot;NETAPP&quot;<br>                product &quot;LUN.*&quot;<br>                path_grouping_policy group_by_prio<br>                getuid_callout &quot;/lib/udev/scsi_id -g -u -d /dev/%n&quot;<br>                path_selector &quot;round-robin 0&quot;<br>                path_checker tur<br>                features &quot;3 queue_if_no_path pg_init_retries 50&quot;<br>                hardware_handler &quot;1 alua&quot;<br>                prio alua<br>                failback immediate<br>                rr_weight uniform<br>                rr_min_io 128<br>                rr_min_io_rq 1<br>                flush_on_last_del yes<br>                fast_io_fail_tmo 5<br>                dev_loss_tmo infinity<br>                retain_attached_hw_handler yes<br>                detect_prio yes<br>                reload_readwrite yes<br>        }<br><br></div><div>So this difference confused me when configuring multipath in CentOS 7.3. 
I will check whether this changes when I update from 6.5 to 6.8.<br></div><div><br>&lt;&lt;&lt;&lt;&lt; END OF PARENTHESIS <br></div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

<span class="m_-1949629816991207438m_-2008036110790628506m_-1524918528077666794gmail-"><br>

&gt;         }<br>

&gt; }<br>

&gt;<br>

&gt; So I put exactly the default device config for my IBM/1814 device but<br>

&gt; no_path_retry set to 12.<br>

<br>

</span>Why 12?<br>

<br>

This will do 12 retries, 5 seconds each when no path is available. This will<br>

block lvm commands for 60 seconds when no path is available, blocking<br>

other stuff in vdsm. Vdsm is not designed to handle this.<br>

<br>

I recommend a value of 4.<br></blockquote><div><br></div><div>OK.<br></div><br><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

<br>

But note that this is not related to the fact that your devices are not<br>

initialized properly after boot.<br></blockquote><div><br></div><div>In fact, it could also be a general problem with the DS4700. The LUNs are configured as LNX CLUSTER type, which should be OK in theory, even if this kind of storage was never well supported on Linux.<br></div><div>Initially one had to use proprietary IBM kernel modules/drivers.<br></div><div>I will assess consistency and robustness through testing. <br></div><div>I have to do a POC, this is the hardware I have, and I should at least try to get a working solution for it.<br></div><div> <br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

<span class="m_-1949629816991207438m_-2008036110790628506m_-1524918528077666794gmail-"><br>

&gt; In CentOS 6.x when you do something like this, &quot;show config&quot; gives you the<br>

&gt; modified entry only for your device section.<br>

&gt; Instead in CentOS 7.3 it seems I get anyway the default one for IBM/1814 and<br>

&gt; also the customized one at the end of the output....<br>

<br>

</span>Maybe your device configuration does not match exactly the builtin config.<br></blockquote><div><br></div><div>I think it is the different behaviour as outlined above. I think you can confirm in another system where some customization has been done too...<br> <br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

<span class="m_-1949629816991207438m_-2008036110790628506m_-1524918528077666794gmail-"><br>

<br>

</span>Maybe waiting a moment helps the storage/switches to clean up<br>

properly after a server is shut down?<br></blockquote><div><br></div><div>I think so too. If errors recur with the new config, I&#39;ll try, when possible, to do a stop/start instead of a restart.<br> <br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

<br>

Does your power management trigger a proper shutdown?<br>

I would avoid using it for normal shutdown.<br></blockquote><div><br></div><div>I have not understood exactly what you mean here... Can you elaborate?<br></div><div>Suppose I have to power off one hypervisor (yum update, pathing, fw update or planned server room maintenance, ...). My workflow is the following, all from inside the web admin GUI:<br><br></div><div>Move the running VMs off the host (or leave that to the following step)<br></div><div>Put host into maintenance<br></div><div>Power Mgmt --&gt; Stop<br><br></div><div>When planned maintenance has finished<br><br></div><div>Power Mgmt --&gt; Start<br></div><div>I should see the host in maintenance<br></div><div dir="auto">Activate</div><div dir="auto"><br></div><div dir="auto">Or do you mean I should do everything from the host itself and not the GUI?</div><div><br></div><div></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><br><div><div class="m_-1949629816991207438m_-2008036110790628506m_-1524918528077666794gmail-h5">

&gt;<br>

&gt; multipath 0 1 rdac<br>

&gt; vs<br>

&gt; multipath 1 queue_if_no_path 1 rdac<br>

<br>

</div></div>This is not expected, multipath is using unlimited queueing, which is the worst<br>

setup for ovirt.<br>

<br>

Maybe this is the result of using &quot;12&quot; instead of 12?<br>
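<br>
As a rough illustration of the difference between the two table lines quoted above, here is a minimal Python sketch that checks whether the feature list contains queue_if_no_path. The helper name has_queue_if_no_path is hypothetical, and the sketch assumes the input starts at the word &quot;multipath&quot;, followed by the feature count and then the feature words. It only reports whether the flag is present in the table line; it does not talk to multipathd or dmsetup.<br>
<pre>
```python
def has_queue_if_no_path(table_line):
    """Return True if a dm 'multipath' table fragment lists queue_if_no_path.

    Expects text starting at the word 'multipath', for example
    'multipath 1 queue_if_no_path 1 rdac'. The token right after
    'multipath' is the number of feature words that follow it.
    """
    tokens = table_line.split()
    start = tokens.index("multipath")
    feature_count = int(tokens[start + 1])
    # Slice out exactly feature_count words after the count itself.
    features = tokens[start + 2:start + 2 + feature_count]
    return "queue_if_no_path" in features


# The two fragments compared in this thread.
print(has_queue_if_no_path("multipath 0 1 rdac"))
print(has_queue_if_no_path("multipath 1 queue_if_no_path 1 rdac"))
```
</pre>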

<br>

Anyway, looking in multipath source, this is the default configuration for<br>

your device:<br>

<br>

        /* DS3950 / DS4200 / DS4700 / DS5020 */<br>

        .vendor        = &quot;IBM&quot;,<br>

        .product       = &quot;^1814&quot;,<br>

        .bl_product    = &quot;Universal Xport&quot;,<br>

        .pgpolicy      = GROUP_BY_PRIO,<br>

        .checker_name  = RDAC,<br>

        .features      = &quot;2 pg_init_retries 50&quot;,<br>

        .hwhandler     = &quot;1 rdac&quot;,<br>

        .prio_name     = PRIO_RDAC,<br>

        .pgfailback    = -FAILBACK_IMMEDIATE,<br>

        .no_path_retry = 30,<br>

    },<br>

<br>

and this is the commit that updated this (and other rdac devices):<br>

<a href="http://git.opensvc.com/gitweb.cgi?p=multipath-tools/.git;a=commit;h=c1ed393b91acace284901f16954ba5c1c0d943c9" rel="noreferrer" target="_blank">http://git.opensvc.com/gitweb.<wbr>cgi?p=multipath-tools/.git;a=c<wbr>ommit;h=c1ed393b91acace284901f<wbr>16954ba5c1c0d943c9</a><br>

<br>

So I would try this configuration:<br>

<span class="m_-1949629816991207438m_-2008036110790628506m_-1524918528077666794gmail-"><br>

device {<br>

                vendor &quot;IBM&quot;<br>

                product &quot;^1814&quot;<br>

<br>

</span>                # defaults from multipathd show config<br>

<span class="m_-1949629816991207438m_-2008036110790628506m_-1524918528077666794gmail-">                product_blacklist &quot;Universal Xport&quot;<br>

                path_grouping_policy &quot;group_by_prio&quot;<br>

                path_checker &quot;rdac&quot;<br>

</span><span class="m_-1949629816991207438m_-2008036110790628506m_-1524918528077666794gmail-">                hardware_handler &quot;1 rdac&quot;<br>

                prio &quot;rdac&quot;<br>

                failback immediate<br>

                rr_weight &quot;uniform&quot;<br>

<br>

</span>                # Based on multipath commit c1ed393b91acace284901f16954ba5<wbr>c1c0d943c9<br>

                features &quot;2 pg_init_retries 50&quot;<br>

<br>

                # Default is 30 retries; the oVirt recommended value is 4 to avoid<br>

                # blocking in vdsm. This gives 20 seconds (4 * polling_interval)<br>

                # of grace time when no path is available.<br>

                no_path_retry 4<br>

        }<br>
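<br>
The grace-time arithmetic in the comments above can be sketched in Python as follows. This is only an illustration: the function name queue_grace_seconds is hypothetical, and the 5-second polling_interval is the value assumed in this thread, not something read from a running system. It also assumes one retry per polling interval, with the interval staying constant.<br>
<pre>
```python
def queue_grace_seconds(no_path_retry, polling_interval=5):
    """Seconds that I/O stays queued after the last path fails.

    Simple model used in this thread: one retry per polling_interval,
    so the grace time is the product of the two settings.
    """
    return no_path_retry * polling_interval


# Values discussed in the thread: recommended, tried, and built-in default.
for retries in (4, 12, 30):
    print(retries, "retries:", queue_grace_seconds(retries), "seconds")
```
</pre>
With these assumptions, 4 retries give 20 seconds, 12 give 60 seconds, and the built-in 30 give 150 seconds.<br>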

<br>

Ben, do you have any other ideas on debugging this issue and<br>

improving multipath configuration?<br>

<span class="m_-1949629816991207438m_-2008036110790628506m_-1524918528077666794gmail-HOEnZb"><font color="#888888"><br>

Nir<br>

</font></span></blockquote></div><br></div><div class="gmail_extra">OK. In the meantime I have applied your suggested config and restarted the 2 nodes.<br></div><div class="gmail_extra">Let me test and see if I find any problems, also running some I/O tests.<br></div><div class="gmail_extra">Thanks in the meantime,<br></div><div class="gmail_extra">Gianluca<br></div></div></div>

</div>