[ovirt-users] Can HA Agent control NFS Mount?

Andrew Lau andrew at andrewklau.com
Thu Jun 5 22:43:48 EDT 2014


Hi Doron,

On Mon, May 26, 2014 at 4:38 PM, Doron Fediuck <dfediuck at redhat.com> wrote:
>
>
> ----- Original Message -----
>> From: "Andrew Lau" <andrew at andrewklau.com>
>> To: "Bob Doolittle" <bob at doolittle.us.com>
>> Cc: "users" <users at ovirt.org>
>> Sent: Monday, May 26, 2014 7:30:41 AM
>> Subject: Re: [ovirt-users] Can HA Agent control NFS Mount?
>>
>> On Mon, May 26, 2014 at 5:10 AM, Bob Doolittle <bob at doolittle.us.com> wrote:
>> >
>> > On 05/25/2014 02:51 PM, Joop wrote:
>> >>
>> >> On 25-5-2014 19:38, Bob Doolittle wrote:
>> >>>
>> >>>
>> >>> Also curious is that when I say "poweroff" it actually reboots and comes
>> >>> up again. Could that be due to the timeouts on the way down?
>> >>>
>> >> Ah, that's something my F19 host does too. Some more info: if the engine
>> >> hasn't been started on the host then I can shut it down and it will
>> >> power off.
>> >> If the engine has been run on it then it will reboot.
>> >> It's not vdsm (I think) because my shutdown sequence is (on my f19 host):
>> >>  service ovirt-agent-ha stop
>> >>  service ovirt-agent-broker stop
>> >>  service vdsmd stop
>> >>  ssh root at engine01 "init 0"
>> >> init 0
>> >>
>> >> I don't use maintenance mode because when I power on my host (= my desktop)
>> >> I want the engine to power on automatically, which it does most of the time
>> >> within 10 min.
>> >
>> >
>> > For comparison, I see this issue and I *do* use maintenance mode (because
>> > presumably that's the 'blessed' way to shut things down and I'm scared to
>> > mess this complex system up by straying off the beaten path ;). My process
>> > is:
>> >
>> > ssh root at engine "init 0"
>> > (wait for "vdsClient -s 0 list | grep Status:" to show the vm as down)
>> > hosted-engine --set-maintenance --mode=global
>> > poweroff
>> >
>> > And then on startup:
>> > hosted-engine --set-maintenance --mode=none
>> > hosted-engine --vm-start
>> >
>> > There are two issues here. I am not sure if they are related or not.
>> > 1. The NFS timeout during shutdown (Joop do you see this also? Or just #2?)
>> > 2. The system reboot instead of poweroff (which messes up remote machine
>> > management)
>> >
>>
>> For 1, I was wondering if perhaps we could have an option to specify
>> the mount options. If I understand correctly, applying a soft mount
>> instead of a hard mount would prevent this from happening. I'm however
>> not sure of the implications this would have on data integrity.
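>>
>> Something like this is roughly what I had in mind (server, mount point
>> and timeout values below are only placeholders, not tested):
>>
>>  mount -t nfs -o soft,timeo=60,retrans=2 nfs-server:/export/he /mnt/he
>>
>> With soft, requests to an unreachable server error out after the retries
>> instead of hanging the shutdown forever; data integrity is the trade-off.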
>>
>> I would really like to see it happen in the ha-agent: as it's the one
>> which connects/mounts the storage, it should also unmount it on shutdown.
>> However, its stability is flaky at best. I've noticed that if `df`
>> hangs because another NFS mount has timed out, the agent will
>> die. That's not a good sign.. this was what actually caused my
>> hosted-engine to run twice in one case.
>>
>> > Thanks,
>> >      Bob
>> >
>> >
>> >> I think wdmd or sanlock are causing the reboot instead of poweroff
>> >>
>> >> Joop
>> >>
>
> Great to have your feedback guys!
>
> So just to clarify some of the issues you mentioned;
>
> Hosted engine wasn't designed for a 'single node' use case, as we do
> want it to be highly available. This is why it's being restarted
> elsewhere, or even on the same server if there is no better alternative.
>
> Having said that, it is possible to set global maintenance mode
> as a first step (in the UI: right click engine vm and choose ha-maintenance).
> Then you can ssh into the engine vm and init 0.
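>
> From the command line that would be roughly the same commands Bob
> listed above, just with maintenance mode set first:
>
>  hosted-engine --set-maintenance --mode=global
>  ssh root at engine "init 0"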
>
> After a short while, the qemu process should gracefully end and release
> its sanlock lease as well as any other resource, which means you can
> reboot your hypervisor peacefully.

Sadly no, I've only been able to reboot my hypervisors if one of the
following two conditions is met:

- Lazy unmount of /rhev/mnt/hosted-engine etc.
- killall -9 sanlock wdmd
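
In concrete terms that was roughly (the exact mount path will differ
per setup, the one below is just a placeholder):

 umount -l /rhev/mnt/<hosted-engine storage mount>  # lazy unmount: detaches now, cleans up later
 killall -9 sanlock wdmd                            # last resort, not a fix

Only after one of those does the host actually reboot cleanly.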

I notice sanlock and wdmd can't be stopped with `service wdmd stop;
service sanlock stop`.
These seem to fail during the shutdown/reboot process, which prevents
the unmount and the graceful reboot.

Are there any logs I can look at to debug those failed shutdowns?

>
> Doron

