
exactly the same issue here with FC EMC domain storage...

On 31/01/2017 at 15:20, Gianluca Cecchi wrote:
Hello,
my test environment is composed of two old HP BL685c G1 blades (ovmsrv05 and ovmsrv06), connected in a SAN through FC switches to an old IBM DS4700 storage array.
Apart from being old, they all seem OK from a hardware point of view.
I have configured oVirt 4.0.6 and an FCP storage domain.
The hosts are plain CentOS 7.3 servers, fully updated.
It is not a hosted engine environment: the manager is a VM outside of the cluster.
I have configured power management on both hosts and it works well.
At the moment I have only one VM for testing, and it is doing pretty much nothing.
Starting point: ovmsrv05 is in maintenance (since about 2 days) and the VM is running on ovmsrv06.
I update the qemu-kvm package on ovmsrv05 and then I restart it from the web admin GUI:
Power Mgmt --> Restart
Sequence of events in the events pane, and the problem from the subject:

Jan 31, 2017 10:29:43 AM Host ovmsrv05 power management was verified successfully.
Jan 31, 2017 10:29:43 AM Status of host ovmsrv05 was set to Up.
Jan 31, 2017 10:29:38 AM Executing power management status on Host ovmsrv05 using Proxy Host ovmsrv06 and Fence Agent ilo:10.4.192.212.
Jan 31, 2017 10:29:29 AM Activation of host ovmsrv05 initiated by admin@internal-authz.
Jan 31, 2017 10:28:05 AM VM ol65 has recovered from paused back to up.
Jan 31, 2017 10:27:55 AM VM ol65 has been paused due to storage I/O problem.
Jan 31, 2017 10:27:55 AM VM ol65 has been paused.
Jan 31, 2017 10:25:52 AM Host ovmsrv05 was restarted by admin@internal-authz.
Jan 31, 2017 10:25:52 AM Host ovmsrv05 was started by admin@internal-authz.
Jan 31, 2017 10:25:52 AM Power management start of Host ovmsrv05 succeeded.
Jan 31, 2017 10:25:50 AM Executing power management status on Host ovmsrv05 using Proxy Host ovmsrv06 and Fence Agent ilo:10.4.192.212.
Jan 31, 2017 10:25:37 AM Executing power management start on Host ovmsrv05 using Proxy Host ovmsrv06 and Fence Agent ilo:10.4.192.212.
Jan 31, 2017 10:25:37 AM Power management start of Host ovmsrv05 initiated.
Jan 31, 2017 10:25:37 AM Auto fence for host ovmsrv05 was started.
Jan 31, 2017 10:25:37 AM All VMs' status on Non Responsive Host ovmsrv05 were changed to 'Down' by admin@internal-authz.
Jan 31, 2017 10:25:36 AM Host ovmsrv05 was stopped by admin@internal-authz.
Jan 31, 2017 10:25:36 AM Power management stop of Host ovmsrv05 succeeded.
Jan 31, 2017 10:25:34 AM Executing power management status on Host ovmsrv05 using Proxy Host ovmsrv06 and Fence Agent ilo:10.4.192.212.
Jan 31, 2017 10:25:15 AM Executing power management stop on Host ovmsrv05 using Proxy Host ovmsrv06 and Fence Agent ilo:10.4.192.212.
Jan 31, 2017 10:25:15 AM Power management stop of Host ovmsrv05 initiated.
Jan 31, 2017 10:25:12 AM Executing power management status on Host ovmsrv05 using Proxy Host ovmsrv06 and Fence Agent ilo:10.4.192.212.
Watching the timestamps, the culprit seems to be the reboot of ovmsrv05, which detects some LUNs in "owned" state and others as "unowned".
Full messages of both hosts here:
https://drive.google.com/file/d/0BwoPbcrMv8mvekZQT1pjc0NMRlU/view?usp=sharing
and
https://drive.google.com/file/d/0BwoPbcrMv8mvcjBCYVdFZWdXTms/view?usp=sharing
At this time there are 4 LUNs globally seen by the two hosts, but only 1 of them is currently configured as the only storage domain in the oVirt cluster.
[root@ovmsrv05 ~]# multipath -l | grep ^36
3600a0b8000299aa80000d08b55014119 dm-5 IBM     ,1814      FAStT
3600a0b80002999020000cd3c5501458f dm-3 IBM     ,1814      FAStT
3600a0b80002999020000ccf855011198 dm-2 IBM     ,1814      FAStT
3600a0b8000299aa80000d08955014098 dm-4 IBM     ,1814      FAStT
the configured one:

[root@ovmsrv05 ~]# multipath -l 3600a0b8000299aa80000d08b55014119
3600a0b8000299aa80000d08b55014119 dm-5 IBM     ,1814      FAStT
size=4.0T features='0' hwhandler='1 rdac' wp=rw
|-+- policy='service-time 0' prio=0 status=active
| |- 0:0:1:3 sdl 8:176 active undef running
| `- 2:0:1:3 sdp 8:240 active undef running
`-+- policy='service-time 0' prio=0 status=enabled
  |- 0:0:0:3 sdd 8:48  active undef running
  `- 2:0:0:3 sdi 8:128 active undef running
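(Side note: if I'm reading the multipath man page right, "multipath -l" only reports cached information from sysfs and device-mapper, which is why prio shows as 0 and the path state as "undef"; "multipath -ll" also runs the path checker and the rdac prio callout, so something like the following should show the real priorities. Just a suggestion, output not included here:

# same map, but with path checker and prio callout actually run
[root@ovmsrv05 ~]# multipath -ll 3600a0b8000299aa80000d08b55014119
)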
In messages of the booting node, around the time of the problem registered by the storage:

[root@ovmsrv05 ~]# grep owned /var/log/messages
Jan 31 10:27:38 ovmsrv05 kernel: scsi 0:0:0:1: rdac: LUN 1 (RDAC) (owned)
Jan 31 10:27:38 ovmsrv05 kernel: scsi 0:0:0:2: rdac: LUN 2 (RDAC) (owned)
Jan 31 10:27:38 ovmsrv05 kernel: scsi 0:0:0:3: rdac: LUN 3 (RDAC) (unowned)
Jan 31 10:27:38 ovmsrv05 kernel: scsi 2:0:0:1: rdac: LUN 1 (RDAC) (owned)
Jan 31 10:27:38 ovmsrv05 kernel: scsi 0:0:0:4: rdac: LUN 4 (RDAC) (unowned)
Jan 31 10:27:38 ovmsrv05 kernel: scsi 2:0:0:2: rdac: LUN 2 (RDAC) (owned)
Jan 31 10:27:38 ovmsrv05 kernel: scsi 0:0:1:1: rdac: LUN 1 (RDAC) (unowned)
Jan 31 10:27:38 ovmsrv05 kernel: scsi 2:0:0:3: rdac: LUN 3 (RDAC) (unowned)
Jan 31 10:27:38 ovmsrv05 kernel: scsi 2:0:0:4: rdac: LUN 4 (RDAC) (unowned)
Jan 31 10:27:38 ovmsrv05 kernel: scsi 0:0:1:2: rdac: LUN 2 (RDAC) (unowned)
Jan 31 10:27:38 ovmsrv05 kernel: scsi 2:0:1:1: rdac: LUN 1 (RDAC) (unowned)
Jan 31 10:27:38 ovmsrv05 kernel: scsi 0:0:1:3: rdac: LUN 3 (RDAC) (owned)
Jan 31 10:27:38 ovmsrv05 kernel: scsi 2:0:1:2: rdac: LUN 2 (RDAC) (unowned)
Jan 31 10:27:38 ovmsrv05 kernel: scsi 0:0:1:4: rdac: LUN 4 (RDAC) (owned)
Jan 31 10:27:38 ovmsrv05 kernel: scsi 2:0:1:3: rdac: LUN 3 (RDAC) (owned)
Jan 31 10:27:39 ovmsrv05 kernel: scsi 2:0:1:4: rdac: LUN 4 (RDAC) (owned)
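(To map the H:C:T:L tuples in these kernel messages to the corresponding sd devices and LUNs, something like lsscsi can help; it is not installed by default on CentOS 7, so this is just an illustration:

# show the two paths of the active group of the storage-domain LUN
[root@ovmsrv05 ~]# lsscsi 0:0:1:3
[root@ovmsrv05 ~]# lsscsi 2:0:1:3
)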
I don't know exactly the meaning of owned/unowned in the output above.
Possibly it detects the 0:0:1:3 and 2:0:1:3 paths (those of the active group) as "owned", and this could have created problems for the active node?
Strangely, on the active node I don't lose all the paths, but the VM has been paused anyway:
[root@ovmsrv06 log]# grep "remaining active path" /var/log/messages
Jan 31 10:27:48 ovmsrv06 multipathd: 3600a0b8000299aa80000d08b55014119: remaining active paths: 3
Jan 31 10:27:49 ovmsrv06 multipathd: 3600a0b8000299aa80000d08b55014119: remaining active paths: 2
Jan 31 10:27:56 ovmsrv06 multipathd: 3600a0b8000299aa80000d08b55014119: remaining active paths: 3
Jan 31 10:27:56 ovmsrv06 multipathd: 3600a0b8000299aa80000d08b55014119: remaining active paths: 2
Jan 31 10:27:56 ovmsrv06 multipathd: 3600a0b8000299aa80000d08b55014119: remaining active paths: 1
Jan 31 10:27:57 ovmsrv06 multipathd: 3600a0b8000299aa80000d08b55014119: remaining active paths: 2
Jan 31 10:28:01 ovmsrv06 multipathd: 3600a0b8000299aa80000d08b55014119: remaining active paths: 3
Jan 31 10:28:01 ovmsrv06 multipathd: 3600a0b8000299aa80000d08b55014119: remaining active paths: 4
I'm not an expert on this storage array in particular, nor on the rdac hardware handler in general.
What I see is that multipath.conf on both nodes is:
# VDSM REVISION 1.3
defaults {
    polling_interval            5
    no_path_retry               fail
    user_friendly_names         no
    flush_on_last_del           yes
    fast_io_fail_tmo            5
    dev_loss_tmo                30
    max_fds                     4096
}
devices {
    device {
        # These settings overrides built-in devices settings. It does not apply
        # to devices without built-in settings (these use the settings in the
        # "defaults" section), or to devices defined in the "devices" section.
        # Note: This is not available yet on Fedora 21. For more info see
        # https://bugzilla.redhat.com/1253799
        all_devs                yes
        no_path_retry           fail
    }
}
beginning of /proc/scsi/scsi
[root@ovmsrv06 ~]# cat /proc/scsi/scsi
Attached devices:
Host: scsi1 Channel: 01 Id: 00 Lun: 00
  Vendor: HP       Model: LOGICAL VOLUME   Rev: 1.86
  Type:   Direct-Access                    ANSI  SCSI revision: 05
Host: scsi0 Channel: 00 Id: 00 Lun: 01
  Vendor: IBM      Model: 1814      FAStT  Rev: 0916
  Type:   Direct-Access                    ANSI  SCSI revision: 05
...
To get the default acquired config for this storage:
multipathd -k
> show config
I can see:
        device {
                vendor "IBM"
                product "^1814"
                product_blacklist "Universal Xport"
                path_grouping_policy "group_by_prio"
                path_checker "rdac"
                features "0"
                hardware_handler "1 rdac"
                prio "rdac"
                failback immediate
                rr_weight "uniform"
                no_path_retry "fail"
        }
and
defaults {
        verbosity 2
        polling_interval 5
        max_polling_interval 20
        reassign_maps "yes"
        multipath_dir "/lib64/multipath"
        path_selector "service-time 0"
        path_grouping_policy "failover"
        uid_attribute "ID_SERIAL"
        prio "const"
        prio_args ""
        features "0"
        path_checker "directio"
        alias_prefix "mpath"
        failback "manual"
        rr_min_io 1000
        rr_min_io_rq 1
        max_fds 4096
        rr_weight "uniform"
        no_path_retry "fail"
        queue_without_daemon "no"
        flush_on_last_del "yes"
        user_friendly_names "no"
        fast_io_fail_tmo 5
        dev_loss_tmo 30
        bindings_file "/etc/multipath/bindings"
        wwids_file /etc/multipath/wwids
        log_checker_err always
        find_multipaths no
        retain_attached_hw_handler no
        detect_prio no
        hw_str_match no
        force_sync no
        deferred_remove no
        ignore_new_boot_devs no
        skip_kpartx no
        config_dir "/etc/multipath/conf.d"
        delay_watch_checks no
        delay_wait_checks no
        retrigger_tries 3
        retrigger_delay 10
        missing_uev_wait_timeout 30
        new_bindings_in_boot no
}
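(For the record, the same running configuration can also be captured non-interactively by piping the command into the interactive CLI; this should work with device-mapper-multipath on CentOS 7, but consider it a sketch:

# dump the merged runtime config to a file for later comparison
[root@ovmsrv06 ~]# echo 'show config' | multipathd -k > /tmp/multipath-running-config.txt
)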
Any hint on how to tune multipath.conf so that a powering-on server doesn't create problems for running VMs?
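For example, would something along these lines make sense? It is just a sketch with illustrative values (no_path_retry 12 together with polling_interval 5 should mean queueing I/O for roughly 60 seconds before failing). I'm not sure how a vendor/product-specific entry interacts with the all_devs overlay above, and I would put it under config_dir so that a VDSM rewrite of /etc/multipath.conf would not wipe it:

# /etc/multipath/conf.d/ibm-ds4700.conf  (hypothetical drop-in, values illustrative)
devices {
    device {
        vendor                  "IBM"
        product                 "^1814"
        # queue I/O for up to 12 polling intervals (~60s with polling_interval 5)
        # instead of failing immediately when all paths are down
        no_path_retry           12
    }
}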
Thanks in advance,
Gianluca
_______________________________________________
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users
-- 
Nathanaël Blanchet
Supervision réseau
Pôle Infrastructures Informatiques
227 avenue Professeur-Jean-Louis-Viala
34193 MONTPELLIER CEDEX 5
Tél. 33 (0)4 67 54 84 55
Fax  33 (0)4 67 54 84 14
blanchet@abes.fr