[ovirt-users] Can HA Agent control NFS Mount?

Bob Doolittle bob at doolittle.us.com
Fri Jun 13 23:29:54 UTC 2014


It turns out I was wrong before. I don't have to start up Engine to get 
into this situation.

I did the following:

  * Turn on Global Maintenance
  * Engine init 0
  * Reboot node
  * Wait a few minutes
  * poweroff
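
For reference, that sequence is roughly the following (the reboot in the middle obviously splits it into two sessions; "engine" is just the hostname of my Engine VM, same as in my earlier mail quoted below):

  hosted-engine --set-maintenance --mode=global
  ssh root@engine "init 0"     # shut the Engine VM down from inside
  reboot                       # reboot the node
  # ...wait a few minutes after the node comes back up...
  poweroff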


I'll get the timeouts and hangs during shutdown again, and a reset
instead of a poweroff.

It's possible that somehow the system is coming out of Global 
Maintenance mode during shutdown, and the Engine VM is starting up and 
causing this issue.

I did the following:
1. hosted-engine --set-maintenance --mode=global
You can see the attached output from 'hosted-engine --vm-status'
(hosted-engine.out) at this point, indicating that the system is in
Global Maintenance.

2. Waited 60 seconds and checked sanlock
You can see the attached output of 'sanlock client status'
(sanlock-status.out) at this point, showing the Engine VM locks being held.

3. Stopped the vdsmd service (note that the first time I tried I got
"Job for vdsmd.service canceled" and had to re-issue the stop).
You can see the attached transcript of 'sanlock client status' and the
subsequent commands, together with their output.
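
To make the sequence explicit (this is just what the attached transcript shows, collapsed into one place):

  hosted-engine --set-maintenance --mode=global
  hosted-engine --vm-status        # saved as hosted-engine.out
  sleep 60
  sanlock client status            # saved as sanlock-status.out
  systemctl stop vdsmd             # first attempt: "Job for vdsmd.service canceled"
  systemctl stop vdsmd
  sanlock client status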

What's interesting, and what I didn't notice right away, is that after I
stopped vdsmd the sanlock status started changing, as if the locks were
being manipulated.
After I had stopped vdsmd, the HA services, and libvirtd, and waited 60
seconds, I noticed the locks seemed to be changing state and that
HostedEngine was listed as a lease holder. At that point I got suspicious
and started vdsmd again so that I could recheck Global Maintenance mode,
and I found that the system was no longer *in* maintenance and that the
Engine VM was running.

So I think this partly explains the situation. Somehow the act of 
stopping vdsmd is making the system look like it is *out* of Global 
Maintenance mode, and the Engine VM starts up while the system is 
shutting down. This creates new sanlock leases on the Engine VM storage, 
which prevents the system from shutting down cleanly. Oddly, Global
Maintenance is preserved across the reboot.
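
A quick sanity check along those lines (just the greps I've been using, keyed to the output formats shown in the attachments; nothing oVirt-specific beyond that):

  # before touching any services: is global maintenance really set?
  hosted-engine --vm-status | grep -i maintenance
  # after stopping vdsmd and the HA services: did anything re-acquire a
  # lease, or did the HostedEngine VM get started behind our back?
  sanlock client status | grep -E '^r |HostedEngine'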

But there may be more going on. Even if I stop vdsmd, the HA services, 
and libvirtd, and sleep 60 seconds, I still see a lock held on the 
Engine VM storage:

daemon 6f3af037-d05e-4ad8-a53c-61627e0c2464.xion2.smar
p -1 helper
p -1 listener
p -1 status
s 003510e8-966a-47e6-a5eb-3b5c8a6070a9:1:/rhev/data-center/mnt/xion2.smartcity.net\:_export_VM__NewDataDomain/003510e8-966a-47e6-a5eb-3b5c8a6070a9/dom_md/ids:0
s hosted-engine:1:/rhev/data-center/mnt/xion2\:_export_vm_he1/18eeab54-e482-497f-b096-11f8a43f94f4/ha_agent/hosted-engine.lockspace:0


It stays in this state, however, and HostedEngine doesn't grab a lease again.
In any case, no matter what I do, it's impossible to shut the system down
cleanly.
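
The one thing I haven't tried yet is forcing sanlock down the way the bugzilla Andrew found (quoted below) suggests; presumably something like this right before the poweroff:

  # per https://bugzilla.redhat.com/show_bug.cgi?id=888197: kill any PIDs
  # holding leases, release the leases, and exit, so wdmd doesn't fire
  sanlock client shutdown -f 1
  poweroff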

-Bob

On 06/13/2014 08:33 AM, Doron Fediuck wrote:
> ----- Original Message -----
>> From: "Andrew Lau"<andrew at andrewklau.com>
>> To: "Bob Doolittle"<bob at doolittle.us.com>
>> Cc: "users"<users at ovirt.org>
>> Sent: Friday, June 6, 2014 6:14:18 AM
>> Subject: Re: [ovirt-users] Can HA Agent control NFS Mount?
>>
>> On Fri, Jun 6, 2014 at 1:09 PM, Bob Doolittle<bob at doolittle.us.com>  wrote:
>>> Thanks Andrew, I'll try this workaround tomorrow for sure. But reading
>>> though that bug report (closed not a bug) it states that the problem should
>>> only arise if something is not releasing a sanlock lease. So if we've
>>> entered Global Maintenance and shut down Engine, the question is what's
>>> holding the lease?
>>>
>>> How can that be debugged?
>> For me it's wdmd and sanlock itself failing to shutdown properly. I
>> also noticed even when in global maintenance and the engine VM powered
>> off there is still a sanlock lease for the
>> /rhev/mnt/....hosted-engine/? lease file or something along those
>> lines. So the global maintenance may not actually be releasing that
>> lock.
>>
>> I'm not too familiar with sanlock etc. So it's like stabbing in the dark :(
>>
> Sounds like a bug since once the VM is off there should not
> be a lease taken.
>
> Please check if after a minute you still have a lease taken
> according to: http://www.ovirt.org/SANLock#sanlock_timeouts
>
> In this case try to stop vdsm and libvirt just so we'll know
> who still keeps the lease.
>
>>> -Bob
>>>
>>> On Jun 5, 2014 10:56 PM, "Andrew Lau"<andrew at andrewklau.com>  wrote:
>>>> On Mon, May 26, 2014 at 5:10 AM, Bob Doolittle<bob at doolittle.us.com>
>>>> wrote:
>>>>> On 05/25/2014 02:51 PM, Joop wrote:
>>>>>> On 25-5-2014 19:38, Bob Doolittle wrote:
>>>>>>> Also curious is that when I say "poweroff" it actually reboots and
>>>>>>> comes
>>>>>>> up again. Could that be due to the timeouts on the way down?
>>>>>>>
>>>>>> Ah, that's something my F19 host does too. Some more info: if engine
>>>>>> hasn't been started on the host then I can shut it down and it will
>>>>>> power off.
>>>>>> If engine has been run on it then it will reboot.
>>>>>> It's not vdsm (I think) because my shutdown sequence is (on my F19
>>>>>> host):
>>>>>>   service ovirt-agent-ha stop
>>>>>>   service ovirt-agent-broker stop
>>>>>>   service vdsmd stop
>>>>>>   ssh root@engine01 "init 0"
>>>>>> init 0
>>>>>>
>>>>>> I don't use maintenance mode because when I power on my host (= my
>>>>>> desktop)
>>>>>> I want engine to power on automatically, which it does most of the time
>>>>>> within 10 min.
>>>>> For comparison, I see this issue and I *do* use maintenance mode
>>>>> (because
>>>>> presumably that's the 'blessed' way to shut things down and I'm scared
>>>>> to
>>>>> mess this complex system up by straying off the beaten path ;). My
>>>>> process
>>>>> is:
>>>>>
>>>>> ssh root@engine "init 0"
>>>>> (wait for "vdsClient -s 0 list | grep Status:" to show the vm as down)
>>>>> hosted-engine --set-maintenance --mode=global
>>>>> poweroff
>>>>>
>>>>> And then on startup:
>>>>> hosted-engine --set-maintenance --mode=none
>>>>> hosted-engine --vm-start
>>>>>
>>>>> There are two issues here. I am not sure if they are related or not.
>>>>> 1. The NFS timeout during shutdown (Joop do you see this also? Or just
>>>>> #2?)
>>>>> 2. The system reboot instead of poweroff (which messes up remote machine
>>>>> management)
>>>>>
>>>>> Thanks,
>>>>>       Bob
>>>>>
>>>>>
>>>>>> I think wdmd or sanlock are causing the reboot instead of poweroff
>>>> While searching for my issue of wdmd/sanlock not shutting down, I
>>>> found this which may interest you both:
>>>> https://bugzilla.redhat.com/show_bug.cgi?id=888197
>>>>
>>>> Specifically:
>>>> "To shut down sanlock without causing a wdmd reboot, you can run the
>>>> following command: "sanlock client shutdown -f 1"
>>>>
>>>> This will cause sanlock to kill any pid's that are holding leases,
>>>> release those leases, and then exit.
>>>> "
>>>>
>>>>>> Joop
>>>>>>

-------------- next part --------------


!! Cluster is in GLOBAL MAINTENANCE mode !!



--== Host 1 status ==--

Status up-to-date                  : False
Hostname                           : 172.16.0.58
Host ID                            : 1
Engine status                      : unknown stale-data
Score                              : 2400
Local maintenance                  : False
Host timestamp                     : 1402697097
Extra metadata (valid at timestamp):
	metadata_parse_version=1
	metadata_feature_version=1
	timestamp=1402697097 (Fri Jun 13 18:04:57 2014)
	host-id=1
	score=2400
	maintenance=False
	state=GlobalMaintenance
-------------- next part --------------
daemon 6f3af037-d05e-4ad8-a53c-61627e0c2464.xion2.smar
p -1 helper
p -1 listener
p -1 status
s hosted-engine:1:/rhev/data-center/mnt/xion2\:_export_vm_he1/18eeab54-e482-497f-b096-11f8a43f94f4/ha_agent/hosted-engine.lockspace:0
s 18eeab54-e482-497f-b096-11f8a43f94f4:1:/rhev/data-center/mnt/xion2\:_export_vm_he1/18eeab54-e482-497f-b096-11f8a43f94f4/dom_md/ids:0
-------------- next part --------------
[root at xion2 Desktop]# sanlock client status
daemon 6f3af037-d05e-4ad8-a53c-61627e0c2464.xion2.smar
p -1 helper
p -1 listener
p -1 status
s 18eeab54-e482-497f-b096-11f8a43f94f4:1:/rhev/data-center/mnt/xion2\:_export_vm_he1/18eeab54-e482-497f-b096-11f8a43f94f4/dom_md/ids:0
s hosted-engine:1:/rhev/data-center/mnt/xion2\:_export_vm_he1/18eeab54-e482-497f-b096-11f8a43f94f4/ha_agent/hosted-engine.lockspace:0
[root at xion2 Desktop]# systemctl vdsmd stop
Unknown operation 'vdsmd'.
[root at xion2 Desktop]# systemctl stop vdsmd
Job for vdsmd.service canceled.
[root at xion2 Desktop]# systemctl stop vdsmd
[root at xion2 Desktop]# sanlock client status
daemon 6f3af037-d05e-4ad8-a53c-61627e0c2464.xion2.smar
p -1 helper
p -1 listener
p -1 status
s hosted-engine:1:/rhev/data-center/mnt/xion2\:_export_vm_he1/18eeab54-e482-497f-b096-11f8a43f94f4/ha_agent/hosted-engine.lockspace:0
[root at xion2 Desktop]# sleep 60
sanlock client status
[root at xion2 Desktop]# sanlock client status
daemon 6f3af037-d05e-4ad8-a53c-61627e0c2464.xion2.smar
p -1 helper
p -1 listener
p -1 status
s 18eeab54-e482-497f-b096-11f8a43f94f4:1:/rhev/data-center/mnt/xion2\:_export_vm_he1/18eeab54-e482-497f-b096-11f8a43f94f4/dom_md/ids:0
s hosted-engine:1:/rhev/data-center/mnt/xion2\:_export_vm_he1/18eeab54-e482-497f-b096-11f8a43f94f4/ha_agent/hosted-engine.lockspace:0
[root at xion2 Desktop]# sanlock client status
daemon 6f3af037-d05e-4ad8-a53c-61627e0c2464.xion2.smar
p -1 helper
p -1 listener
p 3614 HostedEngine
p -1 status
s 18eeab54-e482-497f-b096-11f8a43f94f4:1:/rhev/data-center/mnt/xion2\:_export_vm_he1/18eeab54-e482-497f-b096-11f8a43f94f4/dom_md/ids:0
s hosted-engine:1:/rhev/data-center/mnt/xion2\:_export_vm_he1/18eeab54-e482-497f-b096-11f8a43f94f4/ha_agent/hosted-engine.lockspace:0
r 18eeab54-e482-497f-b096-11f8a43f94f4:951b1e6d-f708-4a5f-a226-fea80dcf2e30:/rhev/data-center/mnt/xion2\:_export_vm_he1/18eeab54-e482-497f-b096-11f8a43f94f4/images/de1ec17e-fa76-46d3-91c6-8aec3fb545a2/951b1e6d-f708-4a5f-a226-fea80dcf2e30.lease:0:31 p 3614
[root at xion2 Desktop]# systemctl stop libvirtd
[root at xion2 Desktop]# sanlock client status
daemon 6f3af037-d05e-4ad8-a53c-61627e0c2464.xion2.smar
p -1 helper
p -1 listener
p 3614 HostedEngine
p -1 status
s 18eeab54-e482-497f-b096-11f8a43f94f4:1:/rhev/data-center/mnt/xion2\:_export_vm_he1/18eeab54-e482-497f-b096-11f8a43f94f4/dom_md/ids:0
s hosted-engine:1:/rhev/data-center/mnt/xion2\:_export_vm_he1/18eeab54-e482-497f-b096-11f8a43f94f4/ha_agent/hosted-engine.lockspace:0
r 18eeab54-e482-497f-b096-11f8a43f94f4:951b1e6d-f708-4a5f-a226-fea80dcf2e30:/rhev/data-center/mnt/xion2\:_export_vm_he1/18eeab54-e482-497f-b096-11f8a43f94f4/images/de1ec17e-fa76-46d3-91c6-8aec3fb545a2/951b1e6d-f708-4a5f-a226-fea80dcf2e30.lease:0:31 p 3614
[root at xion2 Desktop]# sanlock client status
daemon 6f3af037-d05e-4ad8-a53c-61627e0c2464.xion2.smar
p -1 helper
p -1 listener
p 3614 HostedEngine
p -1 status
s 18eeab54-e482-497f-b096-11f8a43f94f4:1:/rhev/data-center/mnt/xion2\:_export_vm_he1/18eeab54-e482-497f-b096-11f8a43f94f4/dom_md/ids:0
s hosted-engine:1:/rhev/data-center/mnt/xion2\:_export_vm_he1/18eeab54-e482-497f-b096-11f8a43f94f4/ha_agent/hosted-engine.lockspace:0
r 18eeab54-e482-497f-b096-11f8a43f94f4:951b1e6d-f708-4a5f-a226-fea80dcf2e30:/rhev/data-center/mnt/xion2\:_export_vm_he1/18eeab54-e482-497f-b096-11f8a43f94f4/images/de1ec17e-fa76-46d3-91c6-8aec3fb545a2/951b1e6d-f708-4a5f-a226-fea80dcf2e30.lease:0:31 p 3614
[root at xion2 Desktop]# systemctl stop ovirt-ha-agent ovirt-ha-broker
[root at xion2 Desktop]# sanlock client status
daemon 6f3af037-d05e-4ad8-a53c-61627e0c2464.xion2.smar
p -1 helper
p -1 listener
p 3614 HostedEngine
p -1 status
s 18eeab54-e482-497f-b096-11f8a43f94f4:1:/rhev/data-center/mnt/xion2\:_export_vm_he1/18eeab54-e482-497f-b096-11f8a43f94f4/dom_md/ids:0
s hosted-engine:1:/rhev/data-center/mnt/xion2\:_export_vm_he1/18eeab54-e482-497f-b096-11f8a43f94f4/ha_agent/hosted-engine.lockspace:0
r 18eeab54-e482-497f-b096-11f8a43f94f4:951b1e6d-f708-4a5f-a226-fea80dcf2e30:/rhev/data-center/mnt/xion2\:_export_vm_he1/18eeab54-e482-497f-b096-11f8a43f94f4/images/de1ec17e-fa76-46d3-91c6-8aec3fb545a2/951b1e6d-f708-4a5f-a226-fea80dcf2e30.lease:0:31 p 3614
[root at xion2 Desktop]# hosted-engine --vm-status
Cannot connect to the HA daemon, please check the logs.
Cannot connect to the HA daemon, please check the logs.
[root at xion2 Desktop]# sanlock client status
daemon 6f3af037-d05e-4ad8-a53c-61627e0c2464.xion2.smar
p -1 helper
p -1 listener
p 3614 HostedEngine
p 5993 
p -1 status
s 003510e8-966a-47e6-a5eb-3b5c8a6070a9:1:/rhev/data-center/mnt/xion2.smartcity.net\:_export_VM__NewDataDomain/003510e8-966a-47e6-a5eb-3b5c8a6070a9/dom_md/ids:0
s 18eeab54-e482-497f-b096-11f8a43f94f4:1:/rhev/data-center/mnt/xion2\:_export_vm_he1/18eeab54-e482-497f-b096-11f8a43f94f4/dom_md/ids:0
s hosted-engine:1:/rhev/data-center/mnt/xion2\:_export_vm_he1/18eeab54-e482-497f-b096-11f8a43f94f4/ha_agent/hosted-engine.lockspace:0
r 003510e8-966a-47e6-a5eb-3b5c8a6070a9:SDM:/rhev/data-center/mnt/xion2.smartcity.net\:_export_VM__NewDataDomain/003510e8-966a-47e6-a5eb-3b5c8a6070a9/dom_md/leases:1048576:23 p 5993
r 18eeab54-e482-497f-b096-11f8a43f94f4:951b1e6d-f708-4a5f-a226-fea80dcf2e30:/rhev/data-center/mnt/xion2\:_export_vm_he1/18eeab54-e482-497f-b096-11f8a43f94f4/images/de1ec17e-fa76-46d3-91c6-8aec3fb545a2/951b1e6d-f708-4a5f-a226-fea80dcf2e30.lease:0:31 p 3614

