Hello All,
We recently stood up a new oVirt install backed by an iSCSI SAN, and it has been working great, but there are a few quirks I am trying to iron out.
We have run into an issue where, when we fail over our SAN (for maintenance or otherwise), any VM with a Direct LUN gets paused and doesn't resume. VMs without a Direct LUN never paused. From digging through posts on this list and reading some bug reports, it seems this is a known quirk of how oVirt handles Direct LUNs (it doesn't monitor the LUNs, so it won't resume the VM). To get the VMs to restart automatically, I have attached VM leases to them, and that seems to work fine. It's not as nice as a pause and resume, but it minimizes downtime.
What I'm trying to understand is why the VMs with Direct LUNs paused and the ones without didn't. My only guess is that, because the non-Direct-LUN disks sit on LVM on top of iSCSI, LVM adds its own layer of timeouts that masks the outage. Is that what is happening?
My other question is: how can I keep my VMs with Direct LUNs from pausing during short outages? Can I add settings to my multipath.conf for just the WWIDs of my Direct LUNs to increase 'no_path_retry' and prevent the VMs from pausing in the first place? I know that in general you don't want to increase 'no_path_retry' because it can cause timeout issues with VDSM and SPM operations (LVM changes, etc.). But in the case of a Direct LUN, would it cause any problems?
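To make the question concrete, here is a rough sketch of the kind of per-LUN override I have in mind. The WWID is a placeholder, the retry count is only an illustrative number and not a tested recommendation, and the drop-in file name is my own assumption. I am also assuming a separate file under /etc/multipath/conf.d/ is the better place for this than editing /etc/multipath.conf directly, since VDSM manages that file.

```
# Hypothetical drop-in file, e.g. /etc/multipath/conf.d/direct-luns.conf
# (the file name is an assumption; only the conf.d directory is standard)
multipaths {
    multipath {
        # Placeholder WWID: replace with the real WWID of the Direct LUN
        wwid           "360000000000000000000000000000000"
        # Illustrative value only: number of path-check retries during
        # which I/O is queued before being failed up to the VM
        no_path_retry  30
    }
}
```

As I understand it, the time I/O stays queued is roughly no_path_retry multiplied by polling_interval, so the value would need to cover the length of a SAN fail-over. Whether that is safe alongside VDSM is exactly what I am unsure about.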
Thank you,
Ryan