On Mon, Feb 1, 2021 at 5:23 PM Gianluca Cecchi
<gianluca.cecchi(a)gmail.com> wrote:
> On Mon, Feb 1, 2021 at 4:09 PM Nir Soffer <nsoffer(a)redhat.com> wrote:
> ...
> > For 120 seconds, you likely need
> >
> > sanlock:io_timeout=20
> > no_path_retry=32
>
> Shouldn't the above values be for a 160 second timeout? I need 120
120 seconds for sanlock means that sanlock will expire the lease
exactly 120 seconds after the last successful lease renewal. Sanlock
cannot exceed this deadline, since other hosts assume that timeout
when acquiring a lease from a "dead" host.

When using a 15 second io timeout, sanlock renews the lease every
30 seconds.
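(In general, reading off the numbers in these flows, both values scale
with the io timeout:

    renewal interval = 2 * io_timeout    e.g. 15s -> 30s,  20s -> 40s
    lease expiry     = 8 * io_timeout    e.g. 15s -> 120s, 20s -> 160s

measured from the last successful renewal.)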
The best case flow is:
00 sanlock renewal succeeds
01 storage fails
30 sanlock tries to renew lease 1/3 (timeout=15)
45 sanlock renewal timeout
60 sanlock tries to renew lease 2/3 (timeout=15)
75 sanlock renewal timeout
90 sanlock tries to renew lease 3/3 (timeout=15)
105 sanlock renewal timeout
120 sanlock expires the lease, kills the vm/vdsm
121 storage is back
If you use a 20 second io timeout, sanlock checks every 40 seconds.
The best case flow is:
00 sanlock renewal succeeds
01 storage fails
40 sanlock tries to renew lease 1/3 (timeout=20)
60 sanlock renewal timeout
80 sanlock tries to renew lease 2/3 (timeout=20)
100 sanlock renewal timeout
120 sanlock tries to renew lease 3/3 (timeout=20)
121 storage is back
122 sanlock renewal succeeds
But we also need to consider the worst case flow:
00 sanlock renewal succeeds
39 storage fails
40 sanlock tries to renew lease 1/3 (timeout=20)
60 sanlock renewal timeout
80 sanlock tries to renew lease 2/3 (timeout=20)
100 sanlock renewal timeout
120 sanlock tries to renew lease 3/3 (timeout=20)
140 sanlock renewal timeout
159 storage is back
160 sanlock expires the lease, kills vm/vdsm etc.
So even with a 20 second io timeout, we may not survive a 120 second
outage.
In practice we can assume that the storage outage starts somewhere in
the middle between two sanlock renewals, so the flow would be:
00 sanlock renewal succeeds
20 storage fails
40 sanlock tries to renew lease 1/3 (timeout=20)
60 sanlock renewal timeout
80 sanlock tries to renew lease 2/3 (timeout=20)
100 sanlock renewal timeout
120 sanlock tries to renew lease 3/3 (timeout=20)
140 storage is back
140 sanlock renewal succeeds
160 sanlock would have expired the lease, killed vm/vdsm etc.
So I would start with a 20 second io timeout, and increase it if needed.
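To make it easy to check other combinations, here is a rough sketch
(just a toy helper, not part of vdsm or sanlock) that prints timelines
like the ones above, assuming sanlock renews every 2 * io_timeout and
expires the lease 8 * io_timeout after the last successful renewal:

def timeline(io_timeout, storage_fails_at, storage_back_at):
    # Assumed from the flows above: renewal every 2*io_timeout, lease
    # expiry 8*io_timeout after the last successful renewal at t=0.
    renewal_interval = 2 * io_timeout
    expire_at = 8 * io_timeout
    print(f"{0:3d} sanlock renewal succeeds")
    print(f"{storage_fails_at:3d} storage fails")
    t = renewal_interval
    while t < expire_at:
        print(f"{t:3d} sanlock tries to renew lease (timeout={io_timeout})")
        if storage_back_at <= t + io_timeout:
            print(f"{max(t, storage_back_at):3d} sanlock renewal succeeds")
            return
        print(f"{t + io_timeout:3d} sanlock renewal timeout")
        t += renewal_interval
    print(f"{expire_at:3d} sanlock expires the lease, kills vm/vdsm")

# Reproduce the worst case flow above:
timeline(io_timeout=20, storage_fails_at=39, storage_back_at=159)

Running it with the numbers from the worst case above reproduces the
same timeline, and other io timeouts and outage lengths are easy to try.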
These flows assume that the multipath timeout is configured properly.
If multipath is using a timeout that is too short, it will fail the
sanlock renewal immediately instead of queuing the I/O.
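For reference, a minimal sketch of where the two settings quoted at the
top could live on a host (the drop-in file names are just examples):

# /etc/vdsm/vdsm.conf.d/99-local.conf (example drop-in)
[sanlock]
io_timeout = 20

# /etc/multipath/conf.d/99-local.conf (example drop-in; vdsm manages
# /etc/multipath.conf itself, so a drop-in is safer)
overrides {
    # 32 retries * 5 second polling_interval ~= 160 seconds of queuing,
    # matching the 160 second sanlock expiry with io_timeout=20
    no_path_retry 32
}

After editing, "multipathd reconfigure" reloads the multipath
configuration.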
I also did not add the time to detect that storage is available again.
multipath checks paths every 5 seconds (polling_interval), so this
may add up to 5 seconds of delay from the time the storage is up until
multipath detects it and sends the queued I/O.
I think the current way sanlock works is not helpful for dealing
with long outages on the storage side. If we could keep the
io_timeout constant (e.g. 10 seconds), and change the number
of retries instead, the behavior would be better and easier to predict.
Assuming we could use:
io_timeout = 10
renewal_retries = 8
The worst case would be:
00 sanlock renewal succeeds
19 storage fails
20 sanlock tries to renew lease 1/7 (timeout=10)
30 sanlock renewal timeout
40 sanlock tries to renew lease 2/7 (timeout=10)
50 sanlock renewal timeout
60 sanlock tries to renew lease 3/7 (timeout=10)
70 sanlock renewal timeout
80 sanlock tries to renew lease 4/7 (timeout=10)
90 sanlock renewal timeout
100 sanlock tries to renew lease 5/7 (timeout=10)
110 sanlock renewal timeout
120 sanlock tries to renew lease 6/7 (timeout=10)
130 sanlock renewal timeout
139 storage is back
140 sanlock tries to renew lease 7/7 (timeout=10)
140 sanlock renewal succeeds
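(If I got the math right, the renewal interval would stay
2 * io_timeout = 20 seconds, so the total budget is still about
renewal_retries * 20 = 160 seconds, same as io_timeout=20 today, but
each single renewal I/O gives up after 10 seconds instead of 20.)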
David, what do you think?
...
> On another host with the same config (other luns on the same storage),
> if I run:
>
> multipath reconfigure -v4 > /tmp/multipath_reconfigure_v4.txt 2>&1
>
> I get this:
> https://drive.google.com/file/d/1VkezFkT9IwsrYD8LoIp4-Q-j2X1dN_qR/view?us...
>
> anything important inside, concerning path retry settings?
I don't see anything about no_path_retry there; maybe the logging was
changed, or these are not the right flags to see all the info during
reconfiguration.
I think "multipathd show config" is the canonical way to look at the current
configuration. It shows the actual values multipath will use during
runtime, after
local configuration was applied on top of the built configuration.
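For example, to check the effective values of the options discussed here:

multipathd show config | grep -E 'no_path_retry|polling_interval'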
Nir