Here are the ausearch results from that host. Looks like more than one
issue. (openvswitch is also in there.)
I'll see about opening the bug. Should I file it on oVirt's GitHub or
the Red Hat Bugzilla?
-Patrick Hibbs
On Thu, 2022-06-02 at 22:08 +0300, Nir Soffer wrote:
On Thu, Jun 2, 2022 at 9:52 PM Patrick Hibbs
<hibbsncc1701(a)gmail.com>
wrote:
>
> The attached logs are from the cluster hosts that were running the
> HA VMs during the failures.
>
> I've finally got all of my HA VMs up again. The last one didn't
> start again until after I freed up more space in the storage domain
> than what was originally available when the VM was running
> previously. (It now has over 150GB of free space, which should be
> more than enough, but it didn't boot with 140GB available....)
>
> Side note:
> I just found this in the logs on the original host that the HA VMs
> were running on:
> ---snip---
> Jun 02 10:33:29 ryuki.codenet sanlock[1054]: 2022-06-02 10:33:29 674607 [1054]: s1 check_our_lease warning 71 last_success 674536
> # semanage fcontext -a -t virt_image_t '1055'
> ***** Plugin catchall (2.13 confidence) suggests **************************
> Then you should report this as a bug.
> You can generate a local policy module to allow this access.
> Do
It's not clear what the SELinux issue is. If you run:
ausearch -m avc
the output should make it clearer.
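For example, something like this (assuming auditd is running on the
host) lists the recent denials and a suggested local policy:

# ausearch -m avc -ts today
# ausearch -m avc -ts today --raw | audit2allow

The generated policy is only a hint; the raw denials are more useful
for understanding what is actually being denied.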
> Jun 02 10:33:45 ryuki.codenet sanlock[1054]: 2022-06-02 10:33:45 674623 [1054]: s1 kill 3441 sig 15 count 8
> Jun 02 10:33:45 ryuki.codenet sanlock[1054]: 2022-06-02 10:33:45 674623 [1054]: s1 kill 4337 sig 15 count 8
> Jun 02 10:33:46 ryuki.codenet sanlock[1054]: 2022-06-02 10:33:46 674624 [1054]: s1 kill 3206 sig 15 count 9
This means that the host could not access the storage for 80 seconds,
and the leases expired. When leases expire, sanlock must kill the
processes holding the leases. Here we see that sanlock sent SIGTERM to
3 processes. If these are VMs, they will pause and libvirt will
release their leases.
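If you want to see the current sanlock state on the host, something
like this (run as root) shows the lockspaces and the leases held, and
which processes hold them:

# sanlock client status

The "s" lines are lockspaces and the "r" lines are resources (leases);
if the storage domain lockspace is missing here, the host cannot
acquire any lease on that domain.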
I can take a deeper look at the logs next week.
Nir
> Jun 02 10:33:47 ryuki.codenet kernel: ovirtmgmt: port 4(vnet2) entered disabled state
> ---snip---
>
> That looks like some SELinux failure.
>
> -Patrick Hibbs
>
> On Thu, 2022-06-02 at 19:44 +0300, Nir Soffer wrote:
> > On Thu, Jun 2, 2022 at 7:14 PM Patrick Hibbs
> > <hibbsncc1701(a)gmail.com>
> > wrote:
> > >
> > > OK, so the data storage domain on a cluster filled up to the
> > > point that the OS refused to allocate any more space.
> > >
> > > This happened because I tried to create a new preallocated disk
> > > from the Admin WebUI. The disk creation claims to have completed
> > > successfully (I've not tried to use that disk yet), but due to a
> > > timeout with the storage domain in question the engine began
> > > trying to fence all of the HA VMs.
> > > The fencing failed for all of the HA VMs, leaving them in a
> > > powered-off state, despite all of the HA VMs being up at the
> > > time, so no reallocation of the leases should have been
> > > necessary.
> >
> > Leases are not reallocated during fencing; not sure why you expect
> > this to happen.
> >
> > > Attempting to restart them manually from the Admin WebUI failed,
> > > with the original host they were running on complaining about
> > > "no space left on device", and the other hosts claiming that the
> > > original host still held the VM lease.
> >
> > "No space left on device" may be an unfortunate error from sanlock,
> > meaning that there is no lockspace. This means the host had trouble
> > adding the lockspace, or adding it has not completed yet.
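> > One way to check is the sanlock log on that host (assuming the
> > default log location):
> >
> > # grep add_lockspace /var/log/sanlock.log
> >
> > If adding the lockspace for this storage domain failed or has not
> > finished, acquiring a VM lease on that host fails with this
> > confusing error.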
> >
> > > After cleaning up some old snapshots, the HA VMs would still not
> > > boot. Toggling the High Availability setting for each one and
> > > allowing the lease to be removed from the storage domain was
> > > required to get the VMs to start again.
> >
> > If you know that the VM is not running, disabling the lease
> > temporarily is a good way to work around the issue.
> >
> > > Re-enabling the High Availability setting thereafter fixed the
> > > lease issue. But now some, not all, of the HA VMs are still
> > > throwing "no space left on device" errors when attempting to
> > > start them. The others are working just fine even with their HA
> > > lease enabled.
> >
> > Do all the errors come from the same host(s), or can some VMs not
> > start while others can on the same host?
> >
> > > My questions are:
> > >
> > > 1. Why does oVirt claim to have a constantly allocated HA VM
> > > lease on the storage domain when it's clearly only done while
> > > the VM is running?
> >
> > Leases are allocated when a VM is created. This allocates the lease
> > space (1MiB) in the external leases special volume, and binds it to
> > the VM ID.
> >
> > When the VM starts, it acquires the lease for its VM ID. If sanlock
> > is not connected to the lockspace on this host, this may fail with
> > the confusing "No space left on device" error.
> >
> > > 2. Why does oVirt deallocate the HA VM lease when performing a
> > > fencing operation?
> >
> > It does not. oVirt does not actually "fence" the VM. If the host
> > running the VM cannot access storage and update the lease, the host
> > loses all leases on that storage. The result is pausing all the VMs
> > holding a lease on that storage.
> >
> > oVirt will try to start the VM on another host, which will try to
> > acquire the lease again on the new host. If enough time has passed
> > since the original host lost access to storage, the lease can be
> > acquired on the new host. If not, this will happen in the next
> > retries.
> >
> > If the original host did not lose access to storage, and it is
> > still updating the lease, you cannot acquire the lease from another
> > host. This protects the VM from a split-brain that would corrupt
> > the VM disk.
> >
> > > 3. Why can't oVirt clear the old HA VM lease when the VM is down
> > > and the storage pool has space available? (How much space is even
> > > needed? The leases section of the storage domain in the Admin
> > > WebUI doesn't contain any useful info beyond the fact that a
> > > lease should exist for a VM even when it's off.)
> >
> > Acquiring the lease is possible only if the lease is not held on
> > another host.
> >
> > oVirt does not support acquiring a held lease by killing the
> > process holding it on another host, but sanlock provides such a
> > capability.
> >
> > > 4. Is there a better way to force start a HA VM when the lease is
> > > old and the VM is powered off?
> >
> > If the original VM has been powered off for enough time (2-3
> > minutes), the lease expires and starting the VM on another host
> > should succeed.
> >
> > > 5. Should I file a bug on the whole HA VM failing to reacquire a
> > > lease on a full storage pool?
> >
> > The external lease volume is not fully allocated. If you use thin
> > provisioned storage, and there is really no storage space, it is
> > possible that creating a new lease will fail, but starting and
> > stopping VMs that have leases should not be affected. But if you
> > reach the point where you don't have enough storage space, you have
> > much bigger trouble and you should fix it urgently.
> >
> > Do you really have an issue with available space? What does the
> > engine report about the storage domain? What does the underlying
> > storage report?
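> > For example, on a host you can check a file based domain's mount
> > and a block domain's VG (the paths and names here are only
> > illustrative):
> >
> > # df -h /rhev/data-center/mnt/<server>:_<path>
> > # vgs <sd-id>
> >
> > and compare that with the free space the engine shows for the
> > domain.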
> >
> > Nir
> >
>