When I stopped the NFS service, I was connected to a VM over ssh.
I was also connected to one of the physical hosts over ssh, and was running top.
I observed that server load continued to increase over time on the physical host.
Several of the VMs (perhaps all?), including the one I was connected to, went down due to
an underlying storage issue.
It appears to me that HA VMs were restarted automatically. For example, I see the
following in the oVirt Manager Event Log (domain names changed / redacted):
Jun 4, 2021, 4:25:42 AM  Highly Available VM server2.example.com failed. It will be restarted automatically.
Jun 4, 2021, 4:25:42 AM  Highly Available VM mail.example.com failed. It will be restarted automatically.
Jun 4, 2021, 4:25:42 AM  Highly Available VM core1.mgt.example.com failed. It will be restarted automatically.
Jun 4, 2021, 4:25:42 AM  VM cha1-shared.example.com has been paused due to unknown storage error.
Jun 4, 2021, 4:25:42 AM  VM server.example.org has been paused due to storage I/O problem.
Jun 4, 2021, 4:25:42 AM  VM server.example.com has been paused.
Jun 4, 2021, 4:25:42 AM  VM server.example.org has been paused.
Jun 4, 2021, 4:25:41 AM  VM server.example.org has been paused due to unknown storage error.
Jun 4, 2021, 4:25:41 AM  VM HostedEngine has been paused due to storage I/O problem.
During this outage, I also noticed that customer websites were not working.
So I clearly took an outage.
> If you have a good way to reproduce the issue please file a bug with
> all the logs, we try to improve this situation.
I don't have a separate lab environment, but if I'm able to reproduce the issue
off hours, I may try to do so.
What logs would be helpful?
> NFS storage domain will always affect other storage domains, but if you mount
> your NFS storage outside of ovirt, the mount will not affect the system.
If I'm understanding you correctly, it sounds like you're suggesting that I just
connect 1 (or multiple) hosts to the NFS mount manually, and don't use the oVirt
manager to build the backup domain. Then just run this script on a cron or something - is
that correct?
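
For illustration, here is a minimal sketch of what such a cron-driven wrapper might look like if the NFS export is mounted by hand on a host. Everything named here is an assumption and not something from oVirt or from this thread: the mount point /mnt/backup-nfs, the helper names, and the 10-second timeout are all hypothetical. The idea is only to probe the mount in a child process with a timeout, so that a dead NFS server makes the job skip the backup instead of hanging.

```python
import subprocess
import sys

# Hypothetical mount point for the NFS export, mounted by hand (or via
# /etc/fstab) on the host rather than added as an oVirt storage domain.
BACKUP_MOUNT = "/mnt/backup-nfs"


def path_responds(path, timeout=10.0):
    """Return True if stat() on path succeeds within timeout seconds.

    The stat() runs in a child process because a stat() on a dead NFS
    mount can block for a long time; the parent only waits `timeout`.
    """
    probe = "import os, sys; os.stat(sys.argv[1])"
    try:
        result = subprocess.run(
            [sys.executable, "-c", probe, path],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0


def run_backup_if_available(mount_point, backup, timeout=10.0):
    """Call backup(mount_point) only if the mount point answers in time.

    Returns True if the backup callable was run, False if it was skipped.
    """
    if not path_responds(mount_point, timeout):
        return False
    backup(mount_point)
    return True
```

A cron job could call run_backup_if_available(BACKUP_MOUNT, my_backup), where my_backup is whatever function actually copies the data. Note that this only checks that the path answers; it does not verify that the export is really mounted there, so an extra os.path.ismount() check may be worth adding.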
‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Friday, June 4, 2021 12:29 PM, Nir Soffer <nsoffer(a)redhat.com> wrote:
On Fri, Jun 4, 2021 at 12:11 PM David White via Users users(a)ovirt.org wrote:
> I'm trying to figure out how to keep a "broken" NFS mount point from
> causing the entire HCI cluster to crash.
> HCI is working beautifully.
> Last night, I finished adding some NFS storage to the cluster - this is
> storage that I don't necessarily need to be HA, and I was hoping to store
> some backups and less-important VMs on, since my Gluster (SSD) storage
> availability is pretty limited.
> But as a test, after I got everything set up, I stopped the nfs-server.
> This caused the entire cluster to go down, and several VMs - that are not
> stored on the NFS storage - went belly up.
Please explain in more detail "went belly up".
In general, vms not using the nfs storage domain should not be affected, but
due to the unfortunate design of vdsm, all storage domains share the same
global lock, and when one storage domain has trouble, it can cause delays in
operations on other domains. This may lead to timeouts and vms reported as
non-responsive, but the actual vms should not be affected.
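
As a rough illustration of the effect described above, here is a toy model in which every "storage domain" operation takes one shared lock. This is not vdsm code and does not reflect how vdsm is actually implemented; the names and timings are made up. It only shows that when a slow operation holds a shared lock, a fast operation on an otherwise healthy domain is delayed by about the same amount.

```python
import threading
import time

# Toy model only: one lock shared by every "storage domain".
# This is not vdsm code; names and timings are made up for illustration.
global_lock = threading.Lock()


def domain_operation(duration, started=None):
    """Hold the shared lock for `duration` seconds, like a slow storage call."""
    with global_lock:
        if started is not None:
            started.set()  # signal that the lock is now held
        time.sleep(duration)


def measure_delay(slow_duration):
    """Return how long a fast operation waits behind a slow one."""
    started = threading.Event()
    slow = threading.Thread(
        target=domain_operation, args=(slow_duration, started)
    )
    slow.start()
    started.wait()  # the slow "NFS" operation now holds the lock

    begin = time.monotonic()
    domain_operation(0.0)  # a fast operation on a healthy domain
    waited = time.monotonic() - begin

    slow.join()
    return waited
```

Calling measure_delay(0.3) should return a value close to 0.3 seconds, even though the second operation itself does no work: all of that time is spent waiting for the lock.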
If you have a good way to reproduce the issue please file a bug with
all the logs, we try to improve this situation.
> Once I started the NFS server process again, HCI did what it was
> supposed to do, and was able to automatically recover.
> My concern is that NFS is a single point of failure, and if VMs that
> don't even rely on that storage are affected if the NFS storage goes
> away, then I don't want anything to do with it.
You need to understand the actual effect on the vms before you reject NFS.
> On the other hand, I'm still struggling to come up with a good way to run
> on-site backups and snapshots without using up more gluster space on my
> (more expensive) SSD storage.
NFS is useful for this purpose. You don't need synchronous replication, and
you want the backups outside of your cluster so in case of disaster you can
restore the backups on another system.
Snapshots are always on the same storage, so they will not help.
> Is there any way to set up NFS storage for a Backup Domain - as well as a
> Data domain (for less important VMs) - such that, if the NFS server
> crashed, all of my non-NFS stuff would be unaffected?
NFS storage domain will always affect other storage domains, but if you mount
your NFS storage outside of ovirt, the mount will not affect the system.
Or use one of the backup solutions; none of them use a storage domain for
keeping the backups, so the mount should not affect the system.
Nir