Re: [ovirt-users] VM has been paused due to storage I/O problem

1 Feb 2017

I'm also seeing this error using a Dell MD3800i array.  The multipath
errors in our logs are different, however.

Feb  1 15:11:58 ovirt-node-production2 kernel: dd: sending ioctl 80306d02 to a partition!
Feb  1 15:21:01 ovirt-node-production2 multipathd: dm-31: remove map (uevent)
Feb  1 15:21:01 ovirt-node-production2 multipathd: dm-31: devmap not registered, can't remove
Feb  1 15:21:01 ovirt-node-production2 multipathd: dm-31: remove map (uevent)

The dd error seems to happen every time that SPM runs a test.

On 01/31/2017 09:23 AM, Nathanaël Blanchet wrote:
...
Exactly the same issue here with an FC EMC storage domain...
On 31/01/2017 at 15:20, Gianluca Cecchi wrote:
...
Hello,
my test environment is composed of 2 old HP BL685c G1 blades
(ovmsrv05 and ovmsrv06), connected through FC switches to an old
IBM DS4700 storage array on the SAN.
Apart from being old, they all seem OK from a hardware point of view.
I have configured oVirt 4.0.6 and an FCP storage domain.
The hosts are plain CentOS 7.3 servers, fully updated.
It is not a hosted engine environment: the manager is a VM outside
of the cluster.
I have configured power management on both hosts and it works well.
At the moment I have only one test VM, and it is doing almost nothing.
Starting point: ovmsrv05 has been in maintenance for about 2 days and
the VM is running on ovmsrv06.
I updated the qemu-kvm package on ovmsrv05 and then restarted it from
the web admin GUI:
Power Mgmt --> Restart
Sequence of events in the events pane (newest first), including the
problem in the subject:
Jan 31, 2017 10:29:43 AM Host ovmsrv05 power management was verified successfully.
Jan 31, 2017 10:29:43 AM Status of host ovmsrv05 was set to Up.
Jan 31, 2017 10:29:38 AM Executing power management status on Host ovmsrv05 using Proxy Host ovmsrv06 and Fence Agent ilo:10.4.192.212.
Jan 31, 2017 10:29:29 AM Activation of host ovmsrv05 initiated by admin@internal-authz.
Jan 31, 2017 10:28:05 AM VM ol65 has recovered from paused back to up.
Jan 31, 2017 10:27:55 AM VM ol65 has been paused due to storage I/O problem.
Jan 31, 2017 10:27:55 AM VM ol65 has been paused.
Jan 31, 2017 10:25:52 AM Host ovmsrv05 was restarted by admin@internal-authz.
Jan 31, 2017 10:25:52 AM Host ovmsrv05 was started by admin@internal-authz.
Jan 31, 2017 10:25:52 AM Power management start of Host ovmsrv05 succeeded.
Jan 31, 2017 10:25:50 AM Executing power management status on Host ovmsrv05 using Proxy Host ovmsrv06 and Fence Agent ilo:10.4.192.212.
Jan 31, 2017 10:25:37 AM Executing power management start on Host ovmsrv05 using Proxy Host ovmsrv06 and Fence Agent ilo:10.4.192.212.
Jan 31, 2017 10:25:37 AM Power management start of Host ovmsrv05 initiated.
Jan 31, 2017 10:25:37 AM Auto fence for host ovmsrv05 was started.
Jan 31, 2017 10:25:37 AM All VMs' status on Non Responsive Host ovmsrv05 were changed to 'Down' by admin@internal-authz
Jan 31, 2017 10:25:36 AM Host ovmsrv05 was stopped by admin@internal-authz.
Jan 31, 2017 10:25:36 AM Power management stop of Host ovmsrv05 succeeded.
Jan 31, 2017 10:25:34 AM Executing power management status on Host ovmsrv05 using Proxy Host ovmsrv06 and Fence Agent ilo:10.4.192.212.
Jan 31, 2017 10:25:15 AM Executing power management stop on Host ovmsrv05 using Proxy Host ovmsrv06 and Fence Agent ilo:10.4.192.212.
Jan 31, 2017 10:25:15 AM Power management stop of Host ovmsrv05 initiated.
Jan 31, 2017 10:25:12 AM Executing power management status on Host ovmsrv05 using Proxy Host ovmsrv06 and Fence Agent ilo:10.4.192.212.
Looking at the timestamps, the culprit seems to be the boot of
ovmsrv05, which detects some LUNs in owned state and others in unowned
state.
Full messages of both hosts here:
https://drive.google.com/file/d/0BwoPbcrMv8mvekZQT1pjc0NMRlU/view?usp=sharing
and
https://drive.google.com/file/d/0BwoPbcrMv8mvcjBCYVdFZWdXTms/view?usp=sharing
At this time the two hosts see 4 LUNs in total, but only 1 of them is
currently configured in the oVirt cluster, as its only storage domain.
[root@ovmsrv05 ~]# multipath -l | grep ^36
3600a0b8000299aa80000d08b55014119 dm-5 IBM     ,1814      FAStT 
3600a0b80002999020000cd3c5501458f dm-3 IBM     ,1814      FAStT 
3600a0b80002999020000ccf855011198 dm-2 IBM     ,1814      FAStT 
3600a0b8000299aa80000d08955014098 dm-4 IBM     ,1814      FAStT
the configured one:
[root@ovmsrv05 ~]# multipath -l 3600a0b8000299aa80000d08b55014119
3600a0b8000299aa80000d08b55014119 dm-5 IBM     ,1814      FAStT 
size=4.0T features='0' hwhandler='1 rdac' wp=rw
|-+- policy='service-time 0' prio=0 status=active
| |- 0:0:1:3 sdl 8:176 active undef running
| `- 2:0:1:3 sdp 8:240 active undef running
`-+- policy='service-time 0' prio=0 status=enabled
  |- 0:0:0:3 sdd 8:48  active undef running
  `- 2:0:0:3 sdi 8:128 active undef running
In the messages of the booting node, around the time of the problem
registered by the storage:
[root@ovmsrv05 ~]# grep owned /var/log/messages
Jan 31 10:27:38 ovmsrv05 kernel: scsi 0:0:0:1: rdac: LUN 1 (RDAC) (owned)
Jan 31 10:27:38 ovmsrv05 kernel: scsi 0:0:0:2: rdac: LUN 2 (RDAC) (owned)
Jan 31 10:27:38 ovmsrv05 kernel: scsi 0:0:0:3: rdac: LUN 3 (RDAC) (unowned)
Jan 31 10:27:38 ovmsrv05 kernel: scsi 2:0:0:1: rdac: LUN 1 (RDAC) (owned)
Jan 31 10:27:38 ovmsrv05 kernel: scsi 0:0:0:4: rdac: LUN 4 (RDAC) (unowned)
Jan 31 10:27:38 ovmsrv05 kernel: scsi 2:0:0:2: rdac: LUN 2 (RDAC) (owned)
Jan 31 10:27:38 ovmsrv05 kernel: scsi 0:0:1:1: rdac: LUN 1 (RDAC) (unowned)
Jan 31 10:27:38 ovmsrv05 kernel: scsi 2:0:0:3: rdac: LUN 3 (RDAC) (unowned)
Jan 31 10:27:38 ovmsrv05 kernel: scsi 2:0:0:4: rdac: LUN 4 (RDAC) (unowned)
Jan 31 10:27:38 ovmsrv05 kernel: scsi 0:0:1:2: rdac: LUN 2 (RDAC) (unowned)
Jan 31 10:27:38 ovmsrv05 kernel: scsi 2:0:1:1: rdac: LUN 1 (RDAC) (unowned)
Jan 31 10:27:38 ovmsrv05 kernel: scsi 0:0:1:3: rdac: LUN 3 (RDAC) (owned)
Jan 31 10:27:38 ovmsrv05 kernel: scsi 2:0:1:2: rdac: LUN 2 (RDAC) (unowned)
Jan 31 10:27:38 ovmsrv05 kernel: scsi 0:0:1:4: rdac: LUN 4 (RDAC) (owned)
Jan 31 10:27:38 ovmsrv05 kernel: scsi 2:0:1:3: rdac: LUN 3 (RDAC) (owned)
Jan 31 10:27:39 ovmsrv05 kernel: scsi 2:0:1:4: rdac: LUN 4 (RDAC) (owned)
I don't know the exact meaning of owned/unowned in the output above.
Possibly it detects the 0:0:1:3 and 2:0:1:3 paths (those of the
active group) as "owned", and this could have created problems for
the active node?
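To make the owned/unowned pattern easier to read, here is a minimal Python sketch that groups the "owned" paths per LUN. It assumes the kernel log format quoted above; the function name owners_by_lun is only illustrative. As far as I understand the rdac handler, "owned" marks a path that reaches the controller currently owning that LUN, but treat that reading as an assumption.

```python
import re
from collections import defaultdict

# Pattern assumed from the kernel lines quoted above:
#   scsi H:C:T:L: rdac: LUN n (RDAC) (owned|unowned)
# \s+ tolerates log lines that were wrapped by the mail client.
RDAC_LINE = re.compile(
    r"scsi (\d+):(\d+):(\d+):(\d+): rdac: LUN (\d+) \(RDAC\)\s+\((owned|unowned)\)"
)

def owners_by_lun(log_text):
    """Map LUN number -> sorted list of H:C:T:L paths reported as owned."""
    owned = defaultdict(list)
    for m in RDAC_LINE.finditer(log_text):
        host, chan, target, lun_id, lun, state = m.groups()
        if state == "owned":
            owned[int(lun)].append(f"{host}:{chan}:{target}:{lun_id}")
    return {lun: sorted(paths) for lun, paths in owned.items()}

# Four lines taken from the grep output above (LUN 3 only).
sample = """\
Jan 31 10:27:38 ovmsrv05 kernel: scsi 0:0:0:3: rdac: LUN 3 (RDAC) (unowned)
Jan 31 10:27:38 ovmsrv05 kernel: scsi 2:0:0:3: rdac: LUN 3 (RDAC) (unowned)
Jan 31 10:27:38 ovmsrv05 kernel: scsi 0:0:1:3: rdac: LUN 3 (RDAC) (owned)
Jan 31 10:27:38 ovmsrv05 kernel: scsi 2:0:1:3: rdac: LUN 3 (RDAC) (owned)
"""
print(owners_by_lun(sample))
```

For LUN 3 the owned paths are 0:0:1:3 and 2:0:1:3, the same two paths that multipath -l shows in the active group.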
Strangely, on the active node I don't lose all the paths, but the VM
was paused anyway:
[root@ovmsrv06 log]# grep "remaining active path" /var/log/messages 
Jan 31 10:27:48 ovmsrv06 multipathd: 3600a0b8000299aa80000d08b55014119: remaining active paths: 3
Jan 31 10:27:49 ovmsrv06 multipathd: 3600a0b8000299aa80000d08b55014119: remaining active paths: 2
Jan 31 10:27:56 ovmsrv06 multipathd: 3600a0b8000299aa80000d08b55014119: remaining active paths: 3
Jan 31 10:27:56 ovmsrv06 multipathd: 3600a0b8000299aa80000d08b55014119: remaining active paths: 2
Jan 31 10:27:56 ovmsrv06 multipathd: 3600a0b8000299aa80000d08b55014119: remaining active paths: 1
Jan 31 10:27:57 ovmsrv06 multipathd: 3600a0b8000299aa80000d08b55014119: remaining active paths: 2
Jan 31 10:28:01 ovmsrv06 multipathd: 3600a0b8000299aa80000d08b55014119: remaining active paths: 3
Jan 31 10:28:01 ovmsrv06 multipathd: 3600a0b8000299aa80000d08b55014119: remaining active paths: 4
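The lowest path count per device can be pulled out of these messages with a short script. This is a minimal Python sketch, assuming the multipathd log format shown above; the name min_active_paths is illustrative.

```python
import re

# Format assumed from the multipathd lines quoted above:
#   <host> multipathd: <wwid>: remaining active paths: <n>
# \s+ tolerates lines that were wrapped after "multipathd:".
PATHS_LINE = re.compile(r"multipathd:\s+(\w+): remaining active paths: (\d+)")

def min_active_paths(log_text):
    """Return {wwid: lowest 'remaining active paths' count seen in the log}."""
    lowest = {}
    for m in PATHS_LINE.finditer(log_text):
        wwid, count = m.group(1), int(m.group(2))
        lowest[wwid] = min(count, lowest.get(wwid, count))
    return lowest

# A few lines taken from the grep output above.
sample = """\
Jan 31 10:27:56 ovmsrv06 multipathd: 3600a0b8000299aa80000d08b55014119: remaining active paths: 2
Jan 31 10:27:56 ovmsrv06 multipathd: 3600a0b8000299aa80000d08b55014119: remaining active paths: 1
Jan 31 10:27:57 ovmsrv06 multipathd: 3600a0b8000299aa80000d08b55014119: remaining active paths: 2
"""
print(min_active_paths(sample))
```

In the log above the count bottoms out at 1 at 10:27:56 and never reaches 0, yet the VM is reported paused at 10:27:55.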
I'm not an expert on this storage array in particular, or on the
rdac hardware handler in general.
This is the multipath.conf on both nodes:
# VDSM REVISION 1.3
defaults {
    polling_interval            5
    no_path_retry               fail
    user_friendly_names         no
    flush_on_last_del           yes
    fast_io_fail_tmo            5
    dev_loss_tmo                30
    max_fds                     4096
}
devices {
    device {
        # These settings overrides built-in devices settings. It does not apply
        # to devices without built-in settings (these use the settings in the
        # "defaults" section), or to devices defined in the "devices" section.
        # Note: This is not available yet on Fedora 21. For more info see
        # https://bugzilla.redhat.com/1253799
        all_devs                yes
        no_path_retry           fail
    }
}
Beginning of /proc/scsi/scsi:
[root@ovmsrv06 ~]# cat /proc/scsi/scsi 
Attached devices:
Host: scsi1 Channel: 01 Id: 00 Lun: 00
  Vendor: HP       Model: LOGICAL VOLUME   Rev: 1.86
  Type:   Direct-Access                    ANSI  SCSI revision: 05
Host: scsi0 Channel: 00 Id: 00 Lun: 01
  Vendor: IBM      Model: 1814      FAStT  Rev: 0916
  Type:   Direct-Access                    ANSI  SCSI revision: 05
...
To get the default config that multipathd picks up for this storage:
multipathd -k
> show config
I can see:
device {
                vendor "IBM"
                product "^1814"
                product_blacklist "Universal Xport"
                path_grouping_policy "group_by_prio"
                path_checker "rdac"
                features "0"
                hardware_handler "1 rdac"
                prio "rdac"
                failback immediate
                rr_weight "uniform"
                no_path_retry "fail"
        }
and
defaults {
        verbosity 2
        polling_interval 5
        max_polling_interval 20
        reassign_maps "yes"
        multipath_dir "/lib64/multipath"
        path_selector "service-time 0"
        path_grouping_policy "failover"
        uid_attribute "ID_SERIAL"
        prio "const"
        prio_args ""
        features "0"
        path_checker "directio"
        alias_prefix "mpath"
        failback "manual"
        rr_min_io 1000
        rr_min_io_rq 1
        max_fds 4096
        rr_weight "uniform"
        no_path_retry "fail"
        queue_without_daemon "no"
        flush_on_last_del "yes"
        user_friendly_names "no"
        fast_io_fail_tmo 5
        dev_loss_tmo 30
        bindings_file "/etc/multipath/bindings"
        wwids_file /etc/multipath/wwids
        log_checker_err always
        find_multipaths no
        retain_attached_hw_handler no
        detect_prio no
        hw_str_match no
        force_sync no
        deferred_remove no
        ignore_new_boot_devs no
        skip_kpartx no
        config_dir "/etc/multipath/conf.d"
        delay_watch_checks no
        delay_wait_checks no
        retrigger_tries 3
        retrigger_delay 10
        missing_uev_wait_timeout 30
        new_bindings_in_boot no
}
Any hint on how to tune multipath.conf so that a server powering on
doesn't create problems for running VMs?
Thanks in advance,
Gianluca
_______________________________________________
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users
-- 
Nathanaël Blanchet
Supervision réseau
Pôle Infrastrutures Informatiques
227 avenue Professeur-Jean-Louis-Viala
34193 MONTPELLIER CEDEX 5 	
Tél. 33 (0)4 67 54 84 55
Fax  33 (0)4 67 54 84 14
blanchet@abes.fr

Re: [ovirt-users] VM has been paused due to storage I/O problem

Michael Watters