Hello,
similar questions and discussions have already taken place in the past,
but I think this point is worth digging into further, if possible.
I focus on block-based storage; my environments are iSCSI-based with
multipath, connected to Equallogic.
Currently I have no_path_retry set to 4, so I have a 20-second timeout
(polling_interval=5).
Sometimes a planned network activity (outside my domain) that should have
no impact "doesn't go so well", and I can see longer delays (even 60-70
seconds). oVirt then reacts with soft fencing and with VMs going into
paused or "question mark" state...
On vSphere, by comparison, the same events apparently don't cause anything
(and I see the same lost-path events in the datastore and ESXi host
Monitoring --> Events pane).
This basically depends on the default APD timeout, which seems to be 140
seconds.
This difference in behavior makes vSphere appear to offer a better SLA
during these short outages, and it is something of a show stopper for
extending the oVirt implementation...
Interesting vSphere KBs here:
Storage device has entered the All Paths Down state (2032934)
https://kb.vmware.com/s/article/2032934
Containing:
"Note: By default, the APD timeout is set to 140 seconds."
All Paths Down timeout for a storage device has expired (2032940)
https://kb.vmware.com/s/article/2032940
Path redundancy to the storage device is degraded (1009555)
https://kb.vmware.com/s/article/1009555
Storage device has recovered from the APD state (2032945)
https://kb.vmware.com/s/article/2032945
So the question is: what real risks do I run if I simulate that behavior
and set a no_path_retry value such that polling_interval x
no_path_retry = 140?
(With the default polling_interval=5, that would mean no_path_retry = 28.)
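
As a quick sanity check on the arithmetic, here is a minimal Python sketch.
It assumes, as above, that the time I/O stays queued after all paths are
lost is roughly polling_interval x no_path_retry; the function names are
only illustrative and are not part of multipath or oVirt.

```python
import math


def queue_timeout(no_path_retry, polling_interval=5):
    """Approximate seconds of queueing once all paths are down.

    Assumption: timeout ~= polling_interval * no_path_retry.
    """
    return no_path_retry * polling_interval


def retries_for_timeout(target_seconds, polling_interval=5):
    """Smallest no_path_retry whose approximate timeout covers target_seconds."""
    if target_seconds <= 0 or polling_interval <= 0:
        raise ValueError("target_seconds and polling_interval must be positive")
    return math.ceil(target_seconds / polling_interval)


if __name__ == "__main__":
    print(queue_timeout(4))          # current setting: 20
    print(retries_for_timeout(140))  # value matching the vSphere APD default: 28
```

If the target is not a multiple of polling_interval, the result is rounded
up, so the resulting timeout is never shorter than the one requested.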
BTW: I also have an environment based on RHV 4.3.5 and iSCSI, and in
parallel I opened a case (02452597) to ask for clarification on whether
this would remain a supported configuration. So, for logs and so on, the
case could help Red Hat developers look into this.
Thanks,
Gianluca