Hi Doron,
On Mon, May 26, 2014 at 4:38 PM, Doron Fediuck <dfediuck(a)redhat.com> wrote:
----- Original Message -----
> From: "Andrew Lau" <andrew(a)andrewklau.com>
> To: "Bob Doolittle" <bob(a)doolittle.us.com>
> Cc: "users" <users(a)ovirt.org>
> Sent: Monday, May 26, 2014 7:30:41 AM
> Subject: Re: [ovirt-users] Can HA Agent control NFS Mount?
>
> On Mon, May 26, 2014 at 5:10 AM, Bob Doolittle <bob(a)doolittle.us.com> wrote:
> >
> > On 05/25/2014 02:51 PM, Joop wrote:
> >>
> >> On 25-5-2014 19:38, Bob Doolittle wrote:
> >>>
> >>>
> >>> Also curious is that when I say "poweroff" it actually reboots and
> >>> comes up again. Could that be due to the timeouts on the way down?
> >>>
> >> Ah, that's something my F19 host does too. Some more info: if engine
> >> hasn't been started on the host, then I can shut it down and it will
> >> power off. If engine has been run on it, then it will reboot.
> >> It's not vdsm (I think), because my shutdown sequence is (on my f19 host):
> >> service ovirt-agent-ha stop
> >> service ovirt-agent-broker stop
> >> service vdsmd stop
> >> ssh root@engine01 "init 0"
> >> init 0
> >>
> >> I don't use maintenance mode because when I power on my host (= my
> >> desktop) I want engine to power on automatically, which it does most
> >> of the time within 10 min.
> >
> >
> > For comparison, I see this issue and I *do* use maintenance mode
> > (because presumably that's the 'blessed' way to shut things down, and
> > I'm scared to mess this complex system up by straying off the beaten
> > path ;). My process is:
> >
> > ssh root@engine "init 0"
> > (wait for "vdsClient -s 0 list | grep Status:" to show the vm as down)
> > hosted-engine --set-maintenance --mode=global
> > poweroff
> >
> > And then on startup:
> > hosted-engine --set-maintenance --mode=none
> > hosted-engine --vm-start
> >
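The wait step above can be scripted so the host only enters global maintenance once the engine VM is actually down. A minimal sketch; the STATUS_CMD / MAX_TRIES / POLL_INTERVAL knobs are stand-ins I've added so the loop can be exercised without a live vdsClient, they are not part of oVirt:

```shell
# Poll the hosted-engine VM status until it reports Down, giving up after
# MAX_TRIES polls. STATUS_CMD defaults to the real vdsClient invocation
# but can be overridden for testing (hypothetical knob, not an oVirt option).
STATUS_CMD=${STATUS_CMD:-"vdsClient -s 0 list"}

wait_for_vm_down() {
    tries=0
    max=${MAX_TRIES:-60}
    while [ "$tries" -lt "$max" ]; do
        if eval "$STATUS_CMD" | grep -q "Status: Down"; then
            return 0    # VM is down; safe to enter maintenance
        fi
        tries=$((tries + 1))
        sleep "${POLL_INTERVAL:-5}"
    done
    return 1            # gave up; VM never reported Down
}
```

Usage would then be roughly: ssh root@engine "init 0" && wait_for_vm_down && hosted-engine --set-maintenance --mode=global && poweroff.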
> > There are two issues here. I am not sure if they are related or not.
> > 1. The NFS timeout during shutdown (Joop do you see this also? Or just #2?)
> > 2. The system reboot instead of poweroff (which messes up remote machine
> > management)
> >
>
> For 1, I was wondering if we could have an option to specify the mount
> options. If I understand correctly, applying a soft mount instead of a
> hard mount would prevent this from happening. I'm not sure, however,
> what implications that would have for data integrity.
>
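For illustration, a soft mount would look roughly like this; the timeo/retrans values and the server/export/path below are made-up placeholders, not oVirt defaults, and the data-integrity trade-off is real (I/O errors surface to qemu instead of retrying forever):

```shell
# Hypothetical soft-mount options for an NFS storage domain:
#   soft    -> I/O fails with an error after the retries instead of hanging
#   timeo   -> tenths of a second to wait before each retry (600 = 60s)
#   retrans -> number of retries before the soft mount returns an error
NFS_OPTS="soft,timeo=600,retrans=6"
# Server, export, and mount point are examples only:
echo mount -t nfs -o "$NFS_OPTS" storage:/export/engine /rhev/mnt/engine
```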
> I would really like to see this happen in the ha-agent; since it's the
> one that connects/mounts the storage, it should also unmount it on
> shutdown. However, its stability is flaky at best. I've noticed that if
> `df` hangs because another NFS mount has timed out, the agent will die.
> That's not a good sign; that's what actually caused my hosted-engine to
> run twice in one case.
>
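One way an agent (or any monitoring loop) could avoid that failure mode is to bound the check instead of running a bare `df`, which touches every mount and blocks indefinitely on an unreachable hard mount. A rough sketch using coreutils `timeout` and `stat -f`; the 5-second limit is an arbitrary choice, and I'm not claiming this is how ovirt-ha-agent works today:

```shell
# Probe a single mount point with a bounded timeout instead of `df`,
# so an unresponsive NFS server makes the check fail rather than hang.
check_mount() {
    timeout 5 stat -f "$1" >/dev/null 2>&1
}

check_mount / && echo "/ is responsive"
```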
> > Thanks,
> > Bob
> >
> >
> >> I think wdmd or sanlock are causing the reboot instead of poweroff
> >>
> >> Joop
> >>
> Great to have your feedback, guys!
> So, just to clarify some of the issues you mentioned:
> Hosted engine wasn't designed for a 'single node' use case, as we do
> want it to be highly available. This is why it's being restarted
> elsewhere, or even on the same server if there's no better alternative.
> Having said that, it is possible to set global maintenance mode as a
> first step (in the UI: right-click the engine VM and choose
> ha-maintenance). Then you can ssh into the engine VM and run init 0.
> After a short while, the qemu process should end gracefully and release
> its sanlock lease, as well as any other resources, which means you can
> reboot your hypervisor peacefully.
Sadly no, I've only been able to reboot my hypervisors if one of two
conditions is met:
- Lazy unmount of /rhev/mnt/hosted-engine etc.
- killall -9 sanlock wdmd
I've noticed sanlock and wdmd can't be stopped with "service wdmd stop;
service sanlock stop". These seem to fail during the shutdown/reboot
process, which prevents the unmount and a graceful reboot.
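For the record, the two workarounds amount to something like the following (destructive and last-resort; the mount paths are examples, and killing sanlock out from under live leases is exactly what it sounds like):

```shell
# Last-resort cleanup when sanlock/wdmd refuse to stop at shutdown:
# lazily detach the storage mounts, then force-kill the daemons so the
# host can actually power off instead of hanging or rebooting.
force_release_storage() {
    for m in /rhev/data-center/mnt/* /rhev/mnt/*; do
        [ -d "$m" ] && umount -l "$m" 2>/dev/null
    done
    killall -9 sanlock wdmd 2>/dev/null
    return 0
}
```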
Are there any logs I can look at to debug those failed shutdowns?
> Doron