<div dir="ltr">Hello,<div>My test environment consists of 2 old HP BL685c G1 blades (ovmsrv05 and ovmsrv06), connected through a SAN with FC switches to an old IBM DS4700 storage array.</div><div>Apart from being old, they all seem OK from a hardware point of view.</div><div>I have configured oVirt 4.0.6 and an FCP storage domain.</div><div>The hosts are plain CentOS 7.3 servers, fully updated.</div><div>It is not a hosted engine environment: the manager is a VM outside of the cluster.</div><div>I have configured power management on both hosts and it works well.</div><div><br></div><div>At the moment I have only one VM for testing and it is doing almost nothing.<br></div><div><br></div><div>Starting point: ovmsrv05 has been in maintenance for about 2 days and the VM is running on ovmsrv06.</div><div>I update the qemu-kvm package on ovmsrv05 and then restart it from the web admin GUI:</div><div>Power Mgmt --&gt; Restart</div><div><br></div><div>Sequence of events in the pane, with the problem in the subject:</div><div><div>Jan 31, 2017 10:29:43 AM Host ovmsrv05 power management was verified successfully.</div><div><span class="gmail-Apple-tab-span" style="white-space:pre">        </span></div><div>Jan 31, 2017 10:29:43 AM Status of host ovmsrv05 was set to Up.</div><div><span class="gmail-Apple-tab-span" style="white-space:pre">        </span></div><div>Jan 31, 2017 10:29:38 AM Executing power management status on Host ovmsrv05 using Proxy Host ovmsrv06 and Fence Agent ilo:10.4.192.212.</div><div><span class="gmail-Apple-tab-span" style="white-space:pre">        </span></div><div>Jan 31, 2017 10:29:29 AM Activation of host ovmsrv05 initiated by admin@internal-authz.</div><div><span class="gmail-Apple-tab-span" style="white-space:pre">        </span></div><div>Jan 31, 2017 10:28:05 AM VM ol65 has recovered from paused back to up.</div><div><span class="gmail-Apple-tab-span" style="white-space:pre">        </span></div><div>Jan 31, 2017 10:27:55 AM VM ol65 has been paused due to storage I/O 
problem.</div><div><span class="gmail-Apple-tab-span" style="white-space:pre">        </span></div><div>Jan 31, 2017 10:27:55 AM VM ol65 has been paused.</div><div><span class="gmail-Apple-tab-span" style="white-space:pre">        </span></div><div>Jan 31, 2017 10:25:52 AM Host ovmsrv05 was restarted by admin@internal-authz.</div><div><span class="gmail-Apple-tab-span" style="white-space:pre">        </span></div><div>Jan 31, 2017 10:25:52 AM Host ovmsrv05 was started by admin@internal-authz.</div><div><span class="gmail-Apple-tab-span" style="white-space:pre">        </span></div><div>Jan 31, 2017 10:25:52 AM Power management start of Host ovmsrv05 succeeded.</div><div><span class="gmail-Apple-tab-span" style="white-space:pre">        </span></div><div>Jan 31, 2017 10:25:50 AM Executing power management status on Host ovmsrv05 using Proxy Host ovmsrv06 and Fence Agent ilo:10.4.192.212.</div><div><span class="gmail-Apple-tab-span" style="white-space:pre">        </span></div><div>Jan 31, 2017 10:25:37 AM Executing power management start on Host ovmsrv05 using Proxy Host ovmsrv06 and Fence Agent ilo:10.4.192.212.</div><div><span class="gmail-Apple-tab-span" style="white-space:pre">        </span></div><div>Jan 31, 2017 10:25:37 AM Power management start of Host ovmsrv05 initiated.</div><div><span class="gmail-Apple-tab-span" style="white-space:pre">        </span></div><div>Jan 31, 2017 10:25:37 AM Auto fence for host ovmsrv05 was started.</div><div><span class="gmail-Apple-tab-span" style="white-space:pre">        </span></div><div>Jan 31, 2017 10:25:37 AM All VMs&#39; status on Non Responsive Host ovmsrv05 were changed to &#39;Down&#39; by admin@internal-authz</div><div><span class="gmail-Apple-tab-span" style="white-space:pre">        </span></div><div>Jan 31, 2017 10:25:36 AM Host ovmsrv05 was stopped by admin@internal-authz.</div><div><span class="gmail-Apple-tab-span" style="white-space:pre">        </span></div><div>Jan 31, 2017 10:25:36 AM Power management 
stop of Host ovmsrv05 succeeded.</div><div><span class="gmail-Apple-tab-span" style="white-space:pre">        </span></div><div>Jan 31, 2017 10:25:34 AM Executing power management status on Host ovmsrv05 using Proxy Host ovmsrv06 and Fence Agent ilo:10.4.192.212.</div><div><span class="gmail-Apple-tab-span" style="white-space:pre">        </span></div><div>Jan 31, 2017 10:25:15 AM Executing power management stop on Host ovmsrv05 using Proxy Host ovmsrv06 and Fence Agent ilo:10.4.192.212.</div><div><span class="gmail-Apple-tab-span" style="white-space:pre">        </span></div><div>Jan 31, 2017 10:25:15 AM Power management stop of Host ovmsrv05 initiated.</div><div><span class="gmail-Apple-tab-span" style="white-space:pre">        </span></div><div>Jan 31, 2017 10:25:12 AM Executing power management status on Host ovmsrv05 using Proxy Host ovmsrv06 and Fence Agent ilo:10.4.192.212.</div></div><div><br></div><div>Judging by the timestamps, the culprit seems to be the boot of ovmsrv05, during which it detects some LUNs in owned state and others in unowned state.</div><div>Full messages of both hosts here:</div><div><a href="https://drive.google.com/file/d/0BwoPbcrMv8mvekZQT1pjc0NMRlU/view?usp=sharing">https://drive.google.com/file/d/0BwoPbcrMv8mvekZQT1pjc0NMRlU/view?usp=sharing</a><br></div><div>and</div><div><a href="https://drive.google.com/file/d/0BwoPbcrMv8mvcjBCYVdFZWdXTms/view?usp=sharing">https://drive.google.com/file/d/0BwoPbcrMv8mvcjBCYVdFZWdXTms/view?usp=sharing</a><br></div><div><br></div><div>At this time there are 4 LUNs seen by both hosts, but only 1 of them is currently configured as the only storage domain in the oVirt cluster.</div><div><br></div><div><div>[root@ovmsrv05 ~]# multipath -l | grep ^36</div><div>3600a0b8000299aa80000d08b55014119 dm-5 IBM     ,1814      FAStT </div><div>3600a0b80002999020000cd3c5501458f dm-3 IBM     ,1814      FAStT </div><div>3600a0b80002999020000ccf855011198 dm-2 IBM     ,1814      FAStT 
</div><div>3600a0b8000299aa80000d08955014098 dm-4 IBM     ,1814      FAStT </div></div><div><br></div><div>The configured one:</div><div><div>[root@ovmsrv05 ~]# multipath -l 3600a0b8000299aa80000d08b55014119</div><div>3600a0b8000299aa80000d08b55014119 dm-5 IBM     ,1814      FAStT </div><div>size=4.0T features=&#39;0&#39; hwhandler=&#39;1 rdac&#39; wp=rw</div><div>|-+- policy=&#39;service-time 0&#39; prio=0 status=active</div><div>| |- 0:0:1:3 sdl 8:176 active undef running</div><div>| `- 2:0:1:3 sdp 8:240 active undef running</div><div>`-+- policy=&#39;service-time 0&#39; prio=0 status=enabled</div><div>  |- 0:0:0:3 sdd 8:48  active undef running</div><div>  `- 2:0:0:3 sdi 8:128 active undef running</div></div><div><br></div><div>In /var/log/messages of the booting node, around the time of the problem registered by the storage:</div><div><div>[root@ovmsrv05 ~]# grep owned /var/log/messages</div><div>Jan 31 10:27:38 ovmsrv05 kernel: scsi 0:0:0:1: rdac: LUN 1 (RDAC) (owned)</div><div>Jan 31 10:27:38 ovmsrv05 kernel: scsi 0:0:0:2: rdac: LUN 2 (RDAC) (owned)</div><div>Jan 31 10:27:38 ovmsrv05 kernel: scsi 0:0:0:3: rdac: LUN 3 (RDAC) (unowned)</div><div>Jan 31 10:27:38 ovmsrv05 kernel: scsi 2:0:0:1: rdac: LUN 1 (RDAC) (owned)</div><div>Jan 31 10:27:38 ovmsrv05 kernel: scsi 0:0:0:4: rdac: LUN 4 (RDAC) (unowned)</div><div>Jan 31 10:27:38 ovmsrv05 kernel: scsi 2:0:0:2: rdac: LUN 2 (RDAC) (owned)</div><div>Jan 31 10:27:38 ovmsrv05 kernel: scsi 0:0:1:1: rdac: LUN 1 (RDAC) (unowned)</div><div>Jan 31 10:27:38 ovmsrv05 kernel: scsi 2:0:0:3: rdac: LUN 3 (RDAC) (unowned)</div><div>Jan 31 10:27:38 ovmsrv05 kernel: scsi 2:0:0:4: rdac: LUN 4 (RDAC) (unowned)</div><div>Jan 31 10:27:38 ovmsrv05 kernel: scsi 0:0:1:2: rdac: LUN 2 (RDAC) (unowned)</div><div>Jan 31 10:27:38 ovmsrv05 kernel: scsi 2:0:1:1: rdac: LUN 1 (RDAC) (unowned)</div><div>Jan 31 10:27:38 ovmsrv05 kernel: scsi 0:0:1:3: rdac: LUN 3 (RDAC) (owned)</div><div>Jan 31 10:27:38 ovmsrv05 kernel: scsi 2:0:1:2: rdac: LUN 2 (RDAC) (unowned)</div><div>Jan 
31 10:27:38 ovmsrv05 kernel: scsi 0:0:1:4: rdac: LUN 4 (RDAC) (owned)</div><div>Jan 31 10:27:38 ovmsrv05 kernel: scsi 2:0:1:3: rdac: LUN 3 (RDAC) (owned)</div><div>Jan 31 10:27:39 ovmsrv05 kernel: scsi 2:0:1:4: rdac: LUN 4 (RDAC) (owned)</div></div><div><br></div><div>I don&#39;t know exactly what owned/unowned means in the output above.</div><div>Possibly it detects the 0:0:1:3 and 2:0:1:3 paths (those of the active group) as &quot;owned&quot;, and this could have caused problems on the active node?</div><div><br></div><div>On the active node, strangely, I don&#39;t lose all the paths, but the VM was paused anyway:</div><div><br></div><div><div>[root@ovmsrv06 log]# grep &quot;remaining active path&quot; /var/log/messages </div><div>Jan 31 10:27:48 ovmsrv06 multipathd: 3600a0b8000299aa80000d08b55014119: remaining active paths: 3</div><div>Jan 31 10:27:49 ovmsrv06 multipathd: 3600a0b8000299aa80000d08b55014119: remaining active paths: 2</div><div>Jan 31 10:27:56 ovmsrv06 multipathd: 3600a0b8000299aa80000d08b55014119: remaining active paths: 3</div><div>Jan 31 10:27:56 ovmsrv06 multipathd: 3600a0b8000299aa80000d08b55014119: remaining active paths: 2</div><div>Jan 31 10:27:56 ovmsrv06 multipathd: 3600a0b8000299aa80000d08b55014119: remaining active paths: 1</div><div>Jan 31 10:27:57 ovmsrv06 multipathd: 3600a0b8000299aa80000d08b55014119: remaining active paths: 2</div><div>Jan 31 10:28:01 ovmsrv06 multipathd: 3600a0b8000299aa80000d08b55014119: remaining active paths: 3</div><div>Jan 31 10:28:01 ovmsrv06 multipathd: 3600a0b8000299aa80000d08b55014119: remaining active paths: 4</div></div><div><br></div><div>I&#39;m not an expert on this storage array in particular, or on the rdac hardware handler in general.</div><div><br></div><div>This is the multipath.conf on both nodes:</div><div><br></div><div><div># VDSM REVISION 1.3</div><div><br></div><div>defaults {</div><div>    polling_interval            5</div><div>    no_path_retry               fail</div><div> 
   user_friendly_names         no</div><div>    flush_on_last_del           yes</div><div>    fast_io_fail_tmo            5</div><div>    dev_loss_tmo                30</div><div>    max_fds                     4096</div><div>}</div><div><br></div><div><br></div><div>devices {</div><div>    device {</div><div>        # These settings overrides built-in devices settings. It does not apply</div><div>        # to devices without built-in settings (these use the settings in the</div><div>        # &quot;defaults&quot; section), or to devices defined in the &quot;devices&quot; section.</div><div>        # Note: This is not available yet on Fedora 21. For more info see</div><div>        # <a href="https://bugzilla.redhat.com/1253799">https://bugzilla.redhat.com/1253799</a></div><div>        all_devs                yes</div><div>        no_path_retry           fail</div><div>    }</div><div>}</div></div><div><br></div><div><br></div><div><div>Beginning of /proc/scsi/scsi:</div><div><br></div><div>[root@ovmsrv06 ~]# cat /proc/scsi/scsi </div><div>Attached devices:</div><div>Host: scsi1 Channel: 01 Id: 00 Lun: 00</div><div>  Vendor: HP       Model: LOGICAL VOLUME   Rev: 1.86</div><div>  Type:   Direct-Access                    ANSI  SCSI revision: 05</div><div>Host: scsi0 Channel: 00 Id: 00 Lun: 01</div><div>  Vendor: IBM      Model: 1814      FAStT  Rev: 0916</div><div>  Type:   Direct-Access                    ANSI  SCSI revision: 05</div></div><div>...</div><div><br></div><div>To get the default config applied to this storage:</div><div><div><br></div><div>multipathd -k</div><div>&gt; show config</div><div><br></div><div>I can see:</div><div><br></div><div>        device {</div><div>                vendor &quot;IBM&quot;</div><div>                product &quot;^1814&quot;</div><div>                product_blacklist &quot;Universal Xport&quot;</div><div>                path_grouping_policy &quot;group_by_prio&quot;</div><div>                path_checker 
&quot;rdac&quot;</div><div>                features &quot;0&quot;</div><div>                hardware_handler &quot;1 rdac&quot;</div><div>                prio &quot;rdac&quot;</div><div>                failback immediate</div><div>                rr_weight &quot;uniform&quot;</div><div>                no_path_retry &quot;fail&quot;</div><div>        }</div><div><br></div><div><br></div><div>and</div><div><br></div><div>defaults {</div><div>        verbosity 2</div><div>        polling_interval 5</div><div>        max_polling_interval 20</div><div>        reassign_maps &quot;yes&quot;</div><div>        multipath_dir &quot;/lib64/multipath&quot;</div><div>        path_selector &quot;service-time 0&quot;</div><div>        path_grouping_policy &quot;failover&quot;</div><div>        uid_attribute &quot;ID_SERIAL&quot;</div><div>        prio &quot;const&quot;</div><div>        prio_args &quot;&quot;</div><div>        features &quot;0&quot;</div><div>        path_checker &quot;directio&quot;</div><div>        alias_prefix &quot;mpath&quot;</div><div>        failback &quot;manual&quot;</div><div>        rr_min_io 1000</div><div>        rr_min_io_rq 1</div><div>        max_fds 4096</div><div>        rr_weight &quot;uniform&quot;</div><div>        no_path_retry &quot;fail&quot;</div><div>        queue_without_daemon &quot;no&quot;</div><div>        flush_on_last_del &quot;yes&quot;</div><div>        user_friendly_names &quot;no&quot;</div><div>        fast_io_fail_tmo 5</div><div>        dev_loss_tmo 30</div><div>        bindings_file &quot;/etc/multipath/bindings&quot;</div><div>        wwids_file /etc/multipath/wwids</div><div>        log_checker_err always</div><div>        find_multipaths no</div><div>        retain_attached_hw_handler no</div><div>        detect_prio no</div><div>        hw_str_match no</div><div>        force_sync no</div><div>        deferred_remove no</div><div>        ignore_new_boot_devs no</div><div>        skip_kpartx no</div><div>        
config_dir &quot;/etc/multipath/conf.d&quot;</div><div>        delay_watch_checks no</div><div>        delay_wait_checks no</div><div>        retrigger_tries 3</div><div>        retrigger_delay 10</div><div>        missing_uev_wait_timeout 30</div><div>        new_bindings_in_boot no</div><div>}</div><div><br></div></div><div>Any hints on how to tune multipath.conf so that a server powering on doesn&#39;t cause problems for running VMs?</div><div><br></div><div>Thanks in advance,</div><div>Gianluca</div></div>