[ovirt-users] VM has been paused due to storage I/O problem

Wed Feb 1 20:33:01 UTC 2017

I'm also seeing this error with a Dell MD3800i array.  However, the multipath
errors in our logs are different:

Feb  1 15:11:58 ovirt-node-production2 kernel: dd: sending ioctl 80306d02 to a partition!
Feb  1 15:21:01 ovirt-node-production2 multipathd: dm-31: remove map (uevent)
Feb  1 15:21:01 ovirt-node-production2 multipathd: dm-31: devmap not registered, can't remove
Feb  1 15:21:01 ovirt-node-production2 multipathd: dm-31: remove map (uevent)

The dd error seems to happen every time that SPM runs a test.

On 01/31/2017 09:23 AM, Nathanaël Blanchet wrote:
>
> Exactly the same issue here with an FC EMC storage domain...
>
>
> On 31/01/2017 at 15:20, Gianluca Cecchi wrote:
>> Hello,
>> my test environment is composed of 2 old HP BL685c G1 blades
>> (ovmsrv05 and ovmsrv06), connected through FC switches in a SAN
>> to an old IBM DS4700 storage array.
>> Apart from being old, they all seem OK from a hardware point of view.
>> I have configured oVirt 4.0.6 and an FCP storage domain.
>> The hosts are plain CentOS 7.3 servers, fully updated.
>> It is not a hosted engine environment: the manager is a VM outside
>> of the cluster.
>> I have configured power management on both hosts and it works well.
>>
>> At the moment I have only one test VM, and it is doing almost nothing.
>>
>> Starting point: ovmsrv05 has been in maintenance for about 2 days and
>> the VM is running on ovmsrv06.
>> I updated the qemu-kvm package on ovmsrv05 and then restarted it from
>> the web admin GUI:
>> Power Mgmt --> Restart
>>
>> Sequence of events in the events pane (newest first), including the
>> problem in the subject:
>> Jan 31, 2017 10:29:43 AM Host ovmsrv05 power management was verified successfully.
>> Jan 31, 2017 10:29:43 AM Status of host ovmsrv05 was set to Up.
>> Jan 31, 2017 10:29:38 AM Executing power management status on Host ovmsrv05 using Proxy Host ovmsrv06 and Fence Agent ilo:10.4.192.212.
>> Jan 31, 2017 10:29:29 AM Activation of host ovmsrv05 initiated by admin@internal-authz.
>> Jan 31, 2017 10:28:05 AM VM ol65 has recovered from paused back to up.
>> Jan 31, 2017 10:27:55 AM VM ol65 has been paused due to storage I/O problem.
>> Jan 31, 2017 10:27:55 AM VM ol65 has been paused.
>> Jan 31, 2017 10:25:52 AM Host ovmsrv05 was restarted by admin@internal-authz.
>> Jan 31, 2017 10:25:52 AM Host ovmsrv05 was started by admin@internal-authz.
>> Jan 31, 2017 10:25:52 AM Power management start of Host ovmsrv05 succeeded.
>> Jan 31, 2017 10:25:50 AM Executing power management status on Host ovmsrv05 using Proxy Host ovmsrv06 and Fence Agent ilo:10.4.192.212.
>> Jan 31, 2017 10:25:37 AM Executing power management start on Host ovmsrv05 using Proxy Host ovmsrv06 and Fence Agent ilo:10.4.192.212.
>> Jan 31, 2017 10:25:37 AM Power management start of Host ovmsrv05 initiated.
>> Jan 31, 2017 10:25:37 AM Auto fence for host ovmsrv05 was started.
>> Jan 31, 2017 10:25:37 AM All VMs' status on Non Responsive Host ovmsrv05 were changed to 'Down' by admin@internal-authz
>> Jan 31, 2017 10:25:36 AM Host ovmsrv05 was stopped by admin@internal-authz.
>> Jan 31, 2017 10:25:36 AM Power management stop of Host ovmsrv05 succeeded.
>> Jan 31, 2017 10:25:34 AM Executing power management status on Host ovmsrv05 using Proxy Host ovmsrv06 and Fence Agent ilo:10.4.192.212.
>> Jan 31, 2017 10:25:15 AM Executing power management stop on Host ovmsrv05 using Proxy Host ovmsrv06 and Fence Agent ilo:10.4.192.212.
>> Jan 31, 2017 10:25:15 AM Power management stop of Host ovmsrv05 initiated.
>> Jan 31, 2017 10:25:12 AM Executing power management status on Host ovmsrv05 using Proxy Host ovmsrv06 and Fence Agent ilo:10.4.192.212.
>>
>> Looking at the timestamps, the culprit seems to be the reboot of
>> ovmsrv05, which detects some LUNs in the owned state and others in
>> the unowned state.
>> Full messages of both hosts are here:
>> https://drive.google.com/file/d/0BwoPbcrMv8mvekZQT1pjc0NMRlU/view?usp=sharing
>> and
>> https://drive.google.com/file/d/0BwoPbcrMv8mvcjBCYVdFZWdXTms/view?usp=sharing
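
To put numbers on the timing, here is a minimal Python sketch (my own, not part of oVirt or vdsm) that measures the gaps between three of the engine events quoted above. The event strings are copied from the pane; the helper name `seconds_between` is just an illustration, and the format string assumes the default C locale for month and AM/PM names.

```python
from datetime import datetime

# Three events copied from the engine event pane quoted above.
POWER_ON = "Jan 31, 2017 10:25:52 AM"   # Power management start of Host ovmsrv05 succeeded.
PAUSED = "Jan 31, 2017 10:27:55 AM"     # VM ol65 has been paused due to storage I/O problem.
RECOVERED = "Jan 31, 2017 10:28:05 AM"  # VM ol65 has recovered from paused back to up.

# Timestamp layout used by the event pane, e.g. "Jan 31, 2017 10:27:55 AM".
FMT = "%b %d, %Y %I:%M:%S %p"


def seconds_between(earlier, later):
    """Return the number of whole seconds from `earlier` to `later`."""
    delta = datetime.strptime(later, FMT) - datetime.strptime(earlier, FMT)
    return int(delta.total_seconds())


boot_to_pause = seconds_between(POWER_ON, PAUSED)
pause_length = seconds_between(PAUSED, RECOVERED)

print(f"power-on of ovmsrv05 to VM pause: {boot_to_pause} s")
print(f"VM stayed paused for: {pause_length} s")
```

So the pause comes roughly two minutes after ovmsrv05 is powered on, which fits the time a host needs to boot and scan its HBAs, and the VM recovers about ten seconds later.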
>>
>> At this time the two hosts see 4 LUNs in total, but only 1 of them
>> is configured as a storage domain (the only one) in the oVirt
>> cluster.
>>
>> [root@ovmsrv05 ~]# multipath -l | grep ^36
>> 3600a0b8000299aa80000d08b55014119 dm-5 IBM     ,1814      FAStT 
>> 3600a0b80002999020000cd3c5501458f dm-3 IBM     ,1814      FAStT 
>> 3600a0b80002999020000ccf855011198 dm-2 IBM     ,1814      FAStT 
>> 3600a0b8000299aa80000d08955014098 dm-4 IBM     ,1814      FAStT 
>>
>> The configured one:
>> [root@ovmsrv05 ~]# multipath -l 3600a0b8000299aa80000d08b55014119
>> 3600a0b8000299aa80000d08b55014119 dm-5 IBM     ,1814      FAStT 
>> size=4.0T features='0' hwhandler='1 rdac' wp=rw
>> |-+- policy='service-time 0' prio=0 status=active
>> | |- 0:0:1:3 sdl 8:176 active undef running
>> | `- 2:0:1:3 sdp 8:240 active undef running
>> `-+- policy='service-time 0' prio=0 status=enabled
>>   |- 0:0:0:3 sdd 8:48  active undef running
>>   `- 2:0:0:3 sdi 8:128 active undef running
>>
>> In the messages of the booting node, around the time of the problem
>> registered by the storage:
>> [root@ovmsrv05 ~]# grep owned /var/log/messages
>> Jan 31 10:27:38 ovmsrv05 kernel: scsi 0:0:0:1: rdac: LUN 1 (RDAC) (owned)
>> Jan 31 10:27:38 ovmsrv05 kernel: scsi 0:0:0:2: rdac: LUN 2 (RDAC) (owned)
>> Jan 31 10:27:38 ovmsrv05 kernel: scsi 0:0:0:3: rdac: LUN 3 (RDAC) (unowned)
>> Jan 31 10:27:38 ovmsrv05 kernel: scsi 2:0:0:1: rdac: LUN 1 (RDAC) (owned)
>> Jan 31 10:27:38 ovmsrv05 kernel: scsi 0:0:0:4: rdac: LUN 4 (RDAC) (unowned)
>> Jan 31 10:27:38 ovmsrv05 kernel: scsi 2:0:0:2: rdac: LUN 2 (RDAC) (owned)
>> Jan 31 10:27:38 ovmsrv05 kernel: scsi 0:0:1:1: rdac: LUN 1 (RDAC) (unowned)
>> Jan 31 10:27:38 ovmsrv05 kernel: scsi 2:0:0:3: rdac: LUN 3 (RDAC) (unowned)
>> Jan 31 10:27:38 ovmsrv05 kernel: scsi 2:0:0:4: rdac: LUN 4 (RDAC) (unowned)
>> Jan 31 10:27:38 ovmsrv05 kernel: scsi 0:0:1:2: rdac: LUN 2 (RDAC) (unowned)
>> Jan 31 10:27:38 ovmsrv05 kernel: scsi 2:0:1:1: rdac: LUN 1 (RDAC) (unowned)
>> Jan 31 10:27:38 ovmsrv05 kernel: scsi 0:0:1:3: rdac: LUN 3 (RDAC) (owned)
>> Jan 31 10:27:38 ovmsrv05 kernel: scsi 2:0:1:2: rdac: LUN 2 (RDAC) (unowned)
>> Jan 31 10:27:38 ovmsrv05 kernel: scsi 0:0:1:4: rdac: LUN 4 (RDAC) (owned)
>> Jan 31 10:27:38 ovmsrv05 kernel: scsi 2:0:1:3: rdac: LUN 3 (RDAC) (owned)
>> Jan 31 10:27:39 ovmsrv05 kernel: scsi 2:0:1:4: rdac: LUN 4 (RDAC) (owned)
>>
>> I don't know exactly what owned/unowned means in the output above.
>> Possibly it detects the 0:0:1:3 and 2:0:1:3 paths (those of the
>> active group) as "owned" and this could have created problems for
>> the active node?
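
As far as I understand the rdac handler, "owned" means the path leads to the controller that currently owns that LUN, so on an active/passive array you would expect each LUN to be owned on one target and unowned on the other. The following is a rough Python sketch (my own, with the log lines copied from the grep output above; the names `RDAC_RE` and `owned_paths` are illustrative) that groups the owned paths per LUN to check this.

```python
import re
from collections import defaultdict

# Lines copied from "grep owned /var/log/messages" on ovmsrv05 (unwrapped).
MESSAGES = """\
Jan 31 10:27:38 ovmsrv05 kernel: scsi 0:0:0:1: rdac: LUN 1 (RDAC) (owned)
Jan 31 10:27:38 ovmsrv05 kernel: scsi 0:0:0:2: rdac: LUN 2 (RDAC) (owned)
Jan 31 10:27:38 ovmsrv05 kernel: scsi 0:0:0:3: rdac: LUN 3 (RDAC) (unowned)
Jan 31 10:27:38 ovmsrv05 kernel: scsi 2:0:0:1: rdac: LUN 1 (RDAC) (owned)
Jan 31 10:27:38 ovmsrv05 kernel: scsi 0:0:0:4: rdac: LUN 4 (RDAC) (unowned)
Jan 31 10:27:38 ovmsrv05 kernel: scsi 2:0:0:2: rdac: LUN 2 (RDAC) (owned)
Jan 31 10:27:38 ovmsrv05 kernel: scsi 0:0:1:1: rdac: LUN 1 (RDAC) (unowned)
Jan 31 10:27:38 ovmsrv05 kernel: scsi 2:0:0:3: rdac: LUN 3 (RDAC) (unowned)
Jan 31 10:27:38 ovmsrv05 kernel: scsi 2:0:0:4: rdac: LUN 4 (RDAC) (unowned)
Jan 31 10:27:38 ovmsrv05 kernel: scsi 0:0:1:2: rdac: LUN 2 (RDAC) (unowned)
Jan 31 10:27:38 ovmsrv05 kernel: scsi 2:0:1:1: rdac: LUN 1 (RDAC) (unowned)
Jan 31 10:27:38 ovmsrv05 kernel: scsi 0:0:1:3: rdac: LUN 3 (RDAC) (owned)
Jan 31 10:27:38 ovmsrv05 kernel: scsi 2:0:1:2: rdac: LUN 2 (RDAC) (unowned)
Jan 31 10:27:38 ovmsrv05 kernel: scsi 0:0:1:4: rdac: LUN 4 (RDAC) (owned)
Jan 31 10:27:38 ovmsrv05 kernel: scsi 2:0:1:3: rdac: LUN 3 (RDAC) (owned)
Jan 31 10:27:39 ovmsrv05 kernel: scsi 2:0:1:4: rdac: LUN 4 (RDAC) (owned)
"""

# Captures the H:C:T:L address, the LUN number and the ownership state.
RDAC_RE = re.compile(
    r"scsi (\d+:\d+:\d+:\d+): rdac: LUN (\d+) \(RDAC\) \((owned|unowned)\)"
)


def owned_paths(text):
    """Map each LUN number to the H:C:T:L paths reported as owned."""
    result = defaultdict(list)
    for line in text.splitlines():
        match = RDAC_RE.search(line)
        if match and match.group(3) == "owned":
            result[int(match.group(2))].append(match.group(1))
    return dict(result)


owned = owned_paths(MESSAGES)
for lun in sorted(owned):
    print(f"LUN {lun}: owned via {', '.join(owned[lun])}")
```

For LUN 3 this gives 0:0:1:3 and 2:0:1:3, the same two paths that `multipath -l` shows in the active group, while LUNs 1 and 2 are owned through target 0. On its own that looks like a normal ownership split rather than an error, but I can't confirm that from these logs alone.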
>>
>> Strangely, on the active node I don't lose all the paths, but the VM
>> was paused anyway.
>>
>> [root@ovmsrv06 log]# grep "remaining active path" /var/log/messages
>> Jan 31 10:27:48 ovmsrv06 multipathd: 3600a0b8000299aa80000d08b55014119: remaining active paths: 3
>> Jan 31 10:27:49 ovmsrv06 multipathd: 3600a0b8000299aa80000d08b55014119: remaining active paths: 2
>> Jan 31 10:27:56 ovmsrv06 multipathd: 3600a0b8000299aa80000d08b55014119: remaining active paths: 3
>> Jan 31 10:27:56 ovmsrv06 multipathd: 3600a0b8000299aa80000d08b55014119: remaining active paths: 2
>> Jan 31 10:27:56 ovmsrv06 multipathd: 3600a0b8000299aa80000d08b55014119: remaining active paths: 1
>> Jan 31 10:27:57 ovmsrv06 multipathd: 3600a0b8000299aa80000d08b55014119: remaining active paths: 2
>> Jan 31 10:28:01 ovmsrv06 multipathd: 3600a0b8000299aa80000d08b55014119: remaining active paths: 3
>> Jan 31 10:28:01 ovmsrv06 multipathd: 3600a0b8000299aa80000d08b55014119: remaining active paths: 4
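
A quick way to check the "I don't lose all the paths" observation is to pull the counts out of these lines and look at the minimum. This is only a small Python sketch of my own, using the multipathd lines quoted above as input; `path_counts` and `PATHS_RE` are made-up names.

```python
import re

# multipathd lines copied from /var/log/messages on ovmsrv06 (unwrapped).
LOG = """\
Jan 31 10:27:48 ovmsrv06 multipathd: 3600a0b8000299aa80000d08b55014119: remaining active paths: 3
Jan 31 10:27:49 ovmsrv06 multipathd: 3600a0b8000299aa80000d08b55014119: remaining active paths: 2
Jan 31 10:27:56 ovmsrv06 multipathd: 3600a0b8000299aa80000d08b55014119: remaining active paths: 3
Jan 31 10:27:56 ovmsrv06 multipathd: 3600a0b8000299aa80000d08b55014119: remaining active paths: 2
Jan 31 10:27:56 ovmsrv06 multipathd: 3600a0b8000299aa80000d08b55014119: remaining active paths: 1
Jan 31 10:27:57 ovmsrv06 multipathd: 3600a0b8000299aa80000d08b55014119: remaining active paths: 2
Jan 31 10:28:01 ovmsrv06 multipathd: 3600a0b8000299aa80000d08b55014119: remaining active paths: 3
Jan 31 10:28:01 ovmsrv06 multipathd: 3600a0b8000299aa80000d08b55014119: remaining active paths: 4
"""

# Captures the time of day, the map WWID and the reported path count.
PATHS_RE = re.compile(
    r"(\d\d:\d\d:\d\d) \S+ multipathd: (\w+): remaining active paths: (\d+)"
)


def path_counts(text):
    """Return (time, count) tuples in log order for every matching line."""
    counts = []
    for line in text.splitlines():
        match = PATHS_RE.search(line)
        if match:
            counts.append((match.group(1), int(match.group(3))))
    return counts


counts = path_counts(LOG)
# min() keeps the first entry with the lowest count.
low_time, low_count = min(counts, key=lambda entry: entry[1])
print(f"lowest count: {low_count} at {low_time}")
print("map ever reported 0 paths:", any(n == 0 for _, n in counts))
```

In this excerpt the count bottoms out at 1 at 10:27:56 and never reaches 0, yet the engine reports the pause at 10:27:55, so the logged path count by itself does not explain the I/O error.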
>>
>> I'm not an expert on this storage array in particular, or on the
>> rdac hardware handler in general.
>>
>> This is the multipath.conf on both nodes:
>>
>> # VDSM REVISION 1.3
>>
>> defaults {
>>     polling_interval            5
>>     no_path_retry               fail
>>     user_friendly_names         no
>>     flush_on_last_del           yes
>>     fast_io_fail_tmo            5
>>     dev_loss_tmo                30
>>     max_fds                     4096
>> }
>>
>>
>> devices {
>>     device {
>>         # These settings overrides built-in devices settings. It does not apply
>>         # to devices without built-in settings (these use the settings in the
>>         # "defaults" section), or to devices defined in the "devices" section.
>>         # Note: This is not available yet on Fedora 21. For more info see
>>         # https://bugzilla.redhat.com/1253799
>>         all_devs                yes
>>         no_path_retry           fail
>>     }
>> }
>>
>>
>> Beginning of /proc/scsi/scsi:
>>
>> [root@ovmsrv06 ~]# cat /proc/scsi/scsi
>> Attached devices:
>> Host: scsi1 Channel: 01 Id: 00 Lun: 00
>>   Vendor: HP       Model: LOGICAL VOLUME   Rev: 1.86
>>   Type:   Direct-Access                    ANSI  SCSI revision: 05
>> Host: scsi0 Channel: 00 Id: 00 Lun: 01
>>   Vendor: IBM      Model: 1814      FAStT  Rev: 0916
>>   Type:   Direct-Access                    ANSI  SCSI revision: 05
>> ...
>>
>> To get the configuration applied by default to this storage:
>>
>> multipathd -k
>> > show config
>>
>> I can see:
>>
>>         device {
>>                 vendor "IBM"
>>                 product "^1814"
>>                 product_blacklist "Universal Xport"
>>                 path_grouping_policy "group_by_prio"
>>                 path_checker "rdac"
>>                 features "0"
>>                 hardware_handler "1 rdac"
>>                 prio "rdac"
>>                 failback immediate
>>                 rr_weight "uniform"
>>                 no_path_retry "fail"
>>         }
>>
>>
>> and
>>
>> defaults {
>>         verbosity 2
>>         polling_interval 5
>>         max_polling_interval 20
>>         reassign_maps "yes"
>>         multipath_dir "/lib64/multipath"
>>         path_selector "service-time 0"
>>         path_grouping_policy "failover"
>>         uid_attribute "ID_SERIAL"
>>         prio "const"
>>         prio_args ""
>>         features "0"
>>         path_checker "directio"
>>         alias_prefix "mpath"
>>         failback "manual"
>>         rr_min_io 1000
>>         rr_min_io_rq 1
>>         max_fds 4096
>>         rr_weight "uniform"
>>         no_path_retry "fail"
>>         queue_without_daemon "no"
>>         flush_on_last_del "yes"
>>         user_friendly_names "no"
>>         fast_io_fail_tmo 5
>>         dev_loss_tmo 30
>>         bindings_file "/etc/multipath/bindings"
>>         wwids_file /etc/multipath/wwids
>>         log_checker_err always
>>         find_multipaths no
>>         retain_attached_hw_handler no
>>         detect_prio no
>>         hw_str_match no
>>         force_sync no
>>         deferred_remove no
>>         ignore_new_boot_devs no
>>         skip_kpartx no
>>         config_dir "/etc/multipath/conf.d"
>>         delay_watch_checks no
>>         delay_wait_checks no
>>         retrigger_tries 3
>>         retrigger_delay 10
>>         missing_uev_wait_timeout 30
>>         new_bindings_in_boot no
>> }
>>
>> Any hints on how to tune multipath.conf so that a server powering on
>> doesn't cause problems for running VMs?
>>
>> Thanks in advance,
>> Gianluca
>>
>>
>> _______________________________________________
>> Users mailing list
>> Users at ovirt.org
>> http://lists.ovirt.org/mailman/listinfo/users
>
> -- 
> Nathanaël Blanchet
>
> Supervision réseau
> Pôle Infrastrutures Informatiques
> 227 avenue Professeur-Jean-Louis-Viala
> 34193 MONTPELLIER CEDEX 5 	
> Tél. 33 (0)4 67 54 84 55
> Fax  33 (0)4 67 54 84 14
> blanchet at abes.fr 
