Hello All,
We recently stood up a new oVirt install backed by an iSCSI SAN and it has
been working great, but there are a few quirks I am trying to iron out.
We have run into an issue where, when we fail over our SAN (for maintenance
or otherwise), any VM with a Direct LUN gets paused and doesn’t resume. VMs
without a Direct LUN never pause. Digging through posts on this list and
reading some bug reports, it seems like this is a known quirk of how oVirt
handles Direct LUNs (it doesn't monitor the LUNs, so it won't resume the
VM). To get the VMs to restart automatically I have attached VM leases to
them, and that seems to work fine. It is not as nice as a pause and
resume, but it minimizes downtime.
What I’m trying to understand is why the VMs with Direct LUNs paused and
the ones without didn’t. My only speculation is that since the non-Direct
disks are using LVM on top of iSCSI, LVM is adding its own layer of
timeouts that masks the outage?
My other question is, how can I keep my VMs with Direct LUNs from pausing
during short outages? Can I put configurations in my multipath.conf for
just the wwids of my Direct LUNs to increase the ‘no_path_retry’ to prevent
the VMs from pausing in the first place? I know in general you don’t want
to increase the ‘no_path_retry’ because it can cause timeout issues with
VDSM and SPM operations (LVM changes, etc.). But in the case of a Direct LUN,
would it cause any problems?
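To make the question concrete, this is roughly the kind of per-LUN override
I have in mind. It is only a sketch and untested: the WWID is a placeholder
for one of my Direct LUNs, and the ‘no_path_retry’ value is an arbitrary
example, not a recommendation.

```
multipaths {
    multipath {
        # Placeholder: replace with the WWID of a single Direct LUN
        wwid            <WWID-of-direct-LUN>
        # Example value only: number of path-checker retries during which
        # I/O is queued before being failed once all paths are down
        no_path_retry   12
    }
}
```

The idea is that only the listed WWID would get the longer retry window,
while every other device keeps whatever setting it already has.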
Thank you,
Ryan