Hello,

Similar questions and discussions have already come up in the past, but I think this is a good point to dig into further, if possible.

I am focusing on block-based storage; my environments use iSCSI with multipath, connected to Equallogic.

Currently I have set no_path_retry to 4, so I have a 20-second timeout (polling_interval=5).

Sometimes a planned network activity (outside my control) that should have no impact "doesn't go so well", and I get longer delays (even 60-70 seconds). oVirt then reacts: I see soft fencing, VMs going into paused state or "question mark" state...

Compared to vSphere, the same events apparently don't cause any problems there (and I do see the same lost-path events in the datastore and ESXi host monitoring --> Events pane).

This basically depends on the default APD (All Paths Down) timeout, which seems to be 140 seconds.

This difference in behavior makes vSphere appear to offer a better SLA during these short outages, and it is something of a show stopper for extending the oVirt implementation.

Interesting vSphere KBs here:

Storage device has entered the All Paths Down state (2032934)
https://kb.vmware.com/s/article/2032934
It contains:
"Note: By default, the APD timeout is set to 140 seconds."

All Paths Down timeout for a storage device has expired (2032940)
https://kb.vmware.com/s/article/2032940

Path redundancy to the storage device is degraded (1009555)
https://kb.vmware.com/s/article/1009555

Storage device has recovered from the APD state (2032945)
https://kb.vmware.com/s/article/2032945

So the question is: what real risks would I run if I mimicked that behavior and set no_path_retry so that polling_interval x no_path_retry = 140?

(With the default polling_interval=5, that would mean no_path_retry = 28.)
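
To make the arithmetic explicit, here is a minimal Python sketch. It assumes the queueing time is simply polling_interval x no_path_retry, which is the approximation I am using in this mail and not necessarily exactly how multipathd counts retries. The function names queue_timeout and retries_for are made up for illustration only.

```python
def queue_timeout(polling_interval: int, no_path_retry: int) -> int:
    """Approximate seconds that I/O stays queued after all paths are lost.

    Assumption: timeout = polling_interval * no_path_retry.
    """
    return polling_interval * no_path_retry


def retries_for(target_seconds: int, polling_interval: int) -> int:
    """Smallest no_path_retry whose approximate timeout covers target_seconds."""
    # Integer ceiling division, so a target that is not a multiple of the
    # polling interval is rounded up rather than down.
    return -(-target_seconds // polling_interval)


current = queue_timeout(polling_interval=5, no_path_retry=4)
needed = retries_for(target_seconds=140, polling_interval=5)

print(f"current timeout: {current} s")          # 5 x 4
print(f"no_path_retry for 140 s: {needed}")     # 140 / 5
```

With my current values this gives the 20 seconds mentioned above, and 28 retries to match the 140-second APD default.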

BTW: I also have an environment based on RHV 4.3.5 with iSCSI, and in parallel I opened a case (02452597) to ask for clarification and about the chances of remaining in a supported configuration. With its logs and so on, that case could help Red Hat developers look into this.

Thanks,

Gianluca