Direct LUNs and VM Pauses

Hello All,

We recently stood up a new oVirt install backed by an iSCSI SAN and it has been working great, but there are a few quirks I am trying to iron out.

We have run into an issue where when we fail over our SAN (for maintenance, or otherwise) any VM with a Direct LUN gets paused and doesn’t resume. VMs without a Direct LUN never paused. Digging through posts on this list and reading some bug reports, it seems like this is a known quirk with how oVirt handles Direct LUNs (it doesn’t monitor the LUNs and so it won’t resume the VM). To get the VMs to automatically restart I have attached VM leases to them and that seems to work fine; not as nice as a pause and resume, but it minimizes downtime.

What I’m trying to understand is why the VMs with Direct LUNs paused, and ones without didn’t. My only speculation is that since the non-Direct disks are using LVM on top of iSCSI, LVM is adding its own layer of timeouts that masks the outage?

My other question is, how can I keep my VMs with Direct LUNs from pausing during short outages? Can I put configuration in my multipath.conf for just the wwids of my Direct LUNs to increase ‘no_path_retry’ and prevent the VMs from pausing in the first place? I know in general you don’t want to increase ‘no_path_retry’ because it can cause timeout issues with VDSM and SPM operations (LVM changes, etc.). But in the case of a Direct LUN would it cause any problems?

Thank you,
Ryan

On Mon, Jul 23, 2018 at 9:35 PM Ryan Bullock <rrb3942@gmail.com> wrote:
Hello All,
We recently stood up a new oVirt install backed by an iSCSI SAN and it has been working great, but there are a few quirks I am trying to iron out.
We have run into an issue where when we fail over our SAN (for maintenance, or otherwise) any VM with a Direct LUN gets paused and doesn’t resume. VMs without a Direct LUN never paused.
I guess the other VMs did get paused, but they were resumed automatically by the system, so from your point of view, they did not "pause". You can check the vdsm log to see if the other VMs did pause and resume. I'm not sure the engine UI reports all pause and resume events.
Digging through posts on this list and reading some bug reports, it seems like this is a known quirk with how oVirt handles Direct LUNs (it doesn’t monitor the LUNs and so it won’t resume the VM).
Right.

Can you file a bug for supporting this?

Vdsm does monitor multipath events for all LUNs, but they are used only for reporting purposes, see:
https://ovirt.org/develop/release-management/features/storage/multipath-events/

We could use the events for resuming VMs using the multipath devices that became available. This functionality will be even more important in the next version, since we plan to move to a LUN-per-disk model.
To get the VMs to automatically restart I have attached VM leases to them and that seems to work fine, not as nice as a pause and resume, but it minimizes downtime.
Cool!
What I’m trying to understand is why the VMs with Direct LUNs paused, and ones without didn’t. My only speculation is that since the non-Direct disks are using LVM on top of iSCSI, LVM is adding its own layer of timeouts that masks the outage?
I don't know of any additional retry mechanism in the data path for LVM-based disks. I think we use the same multipath failover behavior.
My other question is, how can I keep my VMs with Direct LUNs from pausing during short outages? Can I put configurations in my multipath.conf for just the wwids of my Direct LUNs to increase the ‘no_path_retry’ to prevent the VMs from pausing in the first place? I know in general you don’t want to increase the ‘no_path_retry’ because it can cause timeout issues with VDSM and SPM operations (LVM changes, etc). But in the case of a Direct LUN would it cause any problems?
You can add a drop-in multipath configuration that will change no_path_retry for a specific device or multipath.

Increasing no_path_retry will cause larger delays when vdsm tries to access the LUNs via lvm commands, but the delay should only be on the first access when a LUN is not available.

Here is an example drop-in file:

# cat /etc/multipath/conf.d/my.conf
devices {
    device {
        vendor "my-vendor"
        product "my-product"
        # based on a 5 second monitor interval, queue I/O for
        # 60 seconds when no path is available, before failing.
        no_path_retry 12
    }
}

multipaths {
    multipath {
        wwid "my-wwid"
        no_path_retry 12
    }
}

See "man multipath.conf" for more info.

Nir
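A minimal sketch of how one might apply and verify such a drop-in, assuming a standard multipath-tools setup (these commands are not part of the mail above):

# reload multipathd so it reads files under /etc/multipath/conf.d/
multipathd reconfigure
# dump the merged runtime configuration and confirm the new no_path_retry value
multipathd show config | grep -B 5 no_path_retry
# list the current maps and their paths
multipath -ll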

On Tue, Jul 24, 2018 at 5:51 AM, Nir Soffer <nsoffer@redhat.com> wrote:
On Mon, Jul 23, 2018 at 9:35 PM Ryan Bullock <rrb3942@gmail.com> wrote:
Hello All,
We recently stood up a new oVirt install backed by an iSCSI SAN and it has been working great, but there are a few quirks I am trying to iron out.
We have run into an issue where when we fail over our SAN (for maintenance, or otherwise) any VM with a Direct LUN gets paused and doesn’t resume. VMs without a Direct LUN never paused.
I guess the other VMs did get paused, but they were resumed automatically by the system, so from your point of view, they did not "pause".
You can check the vdsm log to see if the other VMs did pause and resume. I'm not sure the engine UI reports all pause and resume events.
Ah, OK. That would make sense. I had checked the events via the UI and it didn't show any pauses, but I had not checked the actual VDSM logs on the hosts. Unfortunately my logs for the period have rolled off. I had noticed this behaviour during our first firmware upgrade on our SAN about a month ago. Since VM leases allowed us to maintain HA, I just put it in my list of things to follow up on. Going forward I will make sure to double check the VDSM logs to see what is happening in the background.
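As a rough illustration, something like the following could be used to look for pause/resume activity in the vdsm log on a host; the exact log messages differ between vdsm versions, so the grep patterns and rotated-log names here are only assumptions:

# on the host that was running the VMs during the outage
grep -iE 'pause|resume|abnormal' /var/log/vdsm/vdsm.log
# rotated logs, if they are kept xz-compressed next to the active log
xzgrep -iE 'pause|resume|abnormal' /var/log/vdsm/vdsm.log.*.xz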
Digging through posts on this list and reading some bug reports, it seems
like this is a known quirk with how oVirt handles Direct LUNs (it doesn’t monitor the LUNs and so it won’t resume the VM).
Right.
Can you file a bug for supporting this?
Vdsm does monitor multipath events for all LUNs, but they are used only for reporting purposes, see: https://ovirt.org/develop/release-management/features/storage/multipath-events/
We could use the events for resuming VMs using the multipath devices that became available. This functionality will be even more important in the next version, since we plan to move to a LUN-per-disk model.
I will look at doing this. At the very least, I feel that differences/limitations between storage back-ends/methods should be documented, just so users don't run into any surprises.
To get the VMs to automatically restart I have attached VM leases to them
and that seems to work fine, not as nice as a pause and resume, but it minimizes downtime.
Cool!
What I’m trying to understand is why the VMs with Direct LUNs paused, and ones without didn’t. My only speculation is that since the non-Direct disks are using LVM on top of iSCSI, LVM is adding its own layer of timeouts that masks the outage?
I don't know of any additional retry mechanism in the data path for LVM-based disks. I think we use the same multipath failover behavior.
My other question is, how can I keep my VMs with Direct LUNs from pausing during short outages? Can I put configurations in my multipath.conf for just the wwids of my Direct LUNs to increase the ‘no_path_retry’ to prevent the VMs from pausing in the first place? I know in general you don’t want to increase the ‘no_path_retry’ because it can cause timeout issues with VDSM and SPM operations (LVM changes, etc). But in the case of a Direct LUN would it cause any problems?
You can add a drop-in multipath configuration that will change no_path_retry for a specific device or multipath.
Increasing no_path_retry will cause larger delays when vdsm tries to access the LUNs via lvm commands, but the delay should only be on the first access when a LUN is not available.
Would that increased delay cause any sort of issues for oVirt (e.g. thinking a node is offline/unresponsive) if set globally in multipath.conf? Since a Direct LUN doesn't use LVM, would this even be a consideration if the increased delay was limited to the Direct LUN only?

Here is an example drop-in file:
# cat /etc/multipath/conf.d/my.conf
devices {
    device {
        vendor "my-vendor"
        product "my-product"
        # based on a 5 second monitor interval, queue I/O for
        # 60 seconds when no path is available, before failing.
        no_path_retry 12
    }
}

multipaths {
    multipath {
        wwid "my-wwid"
        no_path_retry 12
    }
}
Yep, this was my plan.

See "man multipath.conf" for more info.
Nir
Thanks, Ryan
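For the per-wwid approach, a quick sketch of how one might find the WWID of a Direct LUN to use in the multipaths section; the /dev/sdX device name is only a placeholder:

# list all multipath maps; the WWID is the long identifier shown in
# parentheses after each map name on the first line of its output
multipath -ll
# or query a single underlying path device directly
/usr/lib/udev/scsi_id --whitelisted --replace-whitespace --device=/dev/sdX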

On Tue, Jul 24, 2018 at 8:30 PM Ryan Bullock <rrb3942@gmail.com> wrote: ...
Vdsm does monitor multipath events for all LUNs, but they are used only
for reporting purposes, see:
https://ovirt.org/develop/release-management/features/storage/multipath-events/
We could use the events for resuming VMs using the multipath devices that became available. This functionality will be even more important in the next version, since we plan to move to a LUN-per-disk model.
I will look at doing this. At the very least, I feel that differences/limitations between storage back-ends/methods should be documented, just so users don't run into any surprises.
You can file a bug for documenting this issue. ...
My other question is, how can I keep my VMs with Direct LUNs from pausing
during short outages? Can I put configurations in my multipath.conf for just the wwids of my Direct LUNs to increase the ‘no_path_retry’ to prevent the VMs from pausing in the first place? I know in general you don’t want to increase the ‘no_path_retry’ because it can cause timeout issues with VDSM and SPM operations (LVM changes, etc). But in the case of a Direct LUN would it cause any problems?
You can add a drop-in multipath configuration that will change no_path_retry for a specific device or multipath.
Increasing no_path_retry will cause larger delays when vdsm tries to access the LUNs via lvm commands, but the delay should only be on the first access when a LUN is not available.
Would that increased delay cause any sort of issues for oVirt (e.g. thinking a node is offline/unresponsive) if set globally in multipath.conf? Since a Direct LUN doesn't use LVM, would this even be a consideration if the increased delay was limited to the Direct LUN only?
Vdsm scans all LUNs to discover oVirt volumes, so it will be affected by multipath configuration applied only to direct LUNs.

Increasing no_path_retry for any LUN will increase the chance of delaying some vdsm flows that access LUNs (e.g. updating the lvm cache, scsi rescan, listing devices). But the delay happens once, when the multipath device loses all paths. The benefit is a smaller chance that a VM will pause or restart because of a short outage.

Nir
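To make the timeout arithmetic explicit (the 5-second figure comes from the comment in the example drop-in above; the real value depends on polling_interval in the local multipath configuration):

queue time ≈ no_path_retry × polling_interval
no_path_retry 12 × 5 s ≈ 60 s of queued I/O before the paths are failed
so to ride out an outage of roughly N seconds, set no_path_retry ≈ N / polling_interval, rounded up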

Sorry for the slow reply, I was out sick at the end of last week.

Thank you Nir! You have been very helpful in getting a grasp on this issue. I have gone ahead and opened an RFE for resuming on a Direct LUN: https://bugzilla.redhat.com/show_bug.cgi?id=1610459

Thanks again!

Regards,
Ryan

On Tue, Jul 24, 2018 at 12:30 PM, Nir Soffer <nsoffer@redhat.com> wrote:
On Tue, Jul 24, 2018 at 8:30 PM Ryan Bullock <rrb3942@gmail.com> wrote: ...
Vdsm does monitor multipath events for all LUNs, but they are used only
for reporting purposes, see: https://ovirt.org/develop/release-management/features/storage/multipath-events/
We could use the events for resuming VMs using the multipath devices that became available. This functionality will be even more important in the next version, since we plan to move to a LUN-per-disk model.
I will look at doing this. At the very least, I feel that differences/limitations between storage back-ends/methods should be documented, just so users don't run into any surprises.
You can file a bug for documenting this issue.
...
My other question is, how can I keep my VMs with Direct LUNs from pausing
during short outages? Can I put configurations in my multipath.conf for just the wwids of my Direct LUNs to increase the ‘no_path_retry’ to prevent the VMs from pausing in the first place? I know in general you don’t want to increase the ‘no_path_retry’ because it can cause timeout issues with VDSM and SPM operations (LVM changes, etc). But in the case of a Direct LUN would it cause any problems?
You can add a drop-in multipath configuration that will change no_path_retry for a specific device or multipath.
Increasing no_path_retry will cause larger delays when vdsm tries to access the LUNs via lvm commands, but the delay should only be on the first access when a LUN is not available.
Would that increased delay cause any sort of issues for oVirt (e.g. thinking a node is offline/unresponsive) if set globally in multipath.conf? Since a Direct LUN doesn't use LVM, would this even be a consideration if the increased delay was limited to the Direct LUN only?
Vdsm scans all LUNs to discover oVirt volumes, so it will be affected by multipath configuration applied only to direct LUNs.
Increasing no_path_retry for any LUN will increase the chance of delaying some vdsm flows that access LUNs (e.g. updating the lvm cache, scsi rescan, listing devices). But the delay happens once, when the multipath device loses all paths. The benefit is a smaller chance that a VM will pause or restart because of a short outage.
Nir
participants (2)
- Nir Soffer
- Ryan Bullock