[ovirt-users] oVirt 4.0.3 (Hosted Engine) - High Availability VM not restart after auto-fencing of host.

Michal Skrivanek michal.skrivanek at redhat.com
Fri Sep 16 14:37:48 UTC 2016


> On 16 Sep 2016, at 16:34, aleksey.maksimov at it-kb.ru wrote:
> 
> Tested.
> 
> If I run 'shutdown -h now' on a host with a running HA VM (not the HostedEngine VM)...
> 
> the following event appears in the oVirt web console:
> 
> Sep 16, 2016 5:13:18 PM VM KOM-AD01-PBX02 is down. Exit message: User shut down from within the guest

that would be another bug. It should be recognized properly as a “kill”. Can you please share host logs from this attempt as well?

> 
> The HA VM is turned off and does not start on another host.
> 
> This is the journald log from the HA VM guest OS:
> 
> ...
> Sep 16 17:06:48 KOM-AD01-PBX02 python[2637]: [100B blob data]
> Sep 16 17:06:53 KOM-AD01-PBX02 systemd-timesyncd[1739]: Timed out waiting for reply from 91.189.91.157:123 (ntp.ubuntu.com).
> Sep 16 17:07:03 KOM-AD01-PBX02 systemd-timesyncd[1739]: Timed out waiting for reply from 91.189.89.199:123 (ntp.ubuntu.com).
> Sep 16 17:07:13 KOM-AD01-PBX02 systemd-timesyncd[1739]: Timed out waiting for reply from 91.189.89.198:123 (ntp.ubuntu.com).
> Sep 16 17:07:23 KOM-AD01-PBX02 systemd-timesyncd[1739]: Timed out waiting for reply from 91.189.94.4:123 (ntp.ubuntu.com).
> Sep 16 17:08:48 KOM-AD01-PBX02 python[2637]: [90B blob data]
> Sep 16 17:08:49 KOM-AD01-PBX02 python[2637]: [155B blob data]
> Sep 16 17:08:49 KOM-AD01-PBX02 python[2637]: [100B blob data]
> Sep 16 17:10:49 KOM-AD01-PBX02 python[2637]: [90B blob data]
> Sep 16 17:10:50 KOM-AD01-PBX02 python[2637]: [155B blob data]
> Sep 16 17:10:50 KOM-AD01-PBX02 python[2637]: [100B blob data]
> -- Reboot --
> ...
> 
> The log shows no termination procedures before the shutdown.
> It looks like a hard power-off of the VM.

yep, that is expected. But it should be properly detected as such and the HA VM should restart. Somehow vdsm misidentifies the reason for the shutdown.
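
To make the expected behaviour concrete, here is a minimal sketch of the restart decision as it is described in this thread. The names (ExitReason, VmDownEvent, should_restart) are hypothetical and are not taken from vdsm or the engine; the sketch only models the rule "restart an HA VM in all cases other than a user-initiated shutdown".

```python
from dataclasses import dataclass
from enum import Enum


class ExitReason(Enum):
    """Hypothetical exit reasons for illustration; not vdsm's real constants."""
    USER_SHUTDOWN = "shut down from within the guest"
    KILLED = "QEMU process terminated, e.g. during host shutdown"
    CRASHED = "QEMU process crashed on an otherwise-healthy host"
    HOST_NON_RESPONSIVE = "host became non-responsive and was fenced"


@dataclass(frozen=True)
class VmDownEvent:
    vm_name: str
    highly_available: bool  # the "Highly Available" checkbox in Edit VM
    reason: ExitReason


def should_restart(event: VmDownEvent) -> bool:
    """Return True if the VM should be restarted on another host."""
    if not event.highly_available:
        return False
    # Only a user-initiated shutdown is treated as intentional.
    return event.reason is not ExitReason.USER_SHUTDOWN


if __name__ == "__main__":
    for reason in ExitReason:
        event = VmDownEvent("KOM-AD01-PBX02", True, reason)
        print(f"{reason.name}: restart={should_restart(event)}")
```

In this model, the case reported above corresponds to a kill being recorded as USER_SHUTDOWN, which is why the HA VM stays down.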

> 
> 16.09.2016, 17:08, "Simone Tiraboschi" <stirabos at redhat.com>:
>> On Fri, Sep 16, 2016 at 4:02 PM, <aleksey.maksimov at it-kb.ru> wrote:
>>> So, colleagues.
>>> I tested fencing again, and now I think that my host server's power button (pressed physically or through iLO) sends a kill command to the host OS (and, as a result, to the VM).
>>> This is the journald log in my guest OS when I press the power button on the host:
>>> 
>>> ...
>>> Sep 16 16:19:27 KOM-AD01-PBX02 systemd[1]: Stopping ACPI event daemon...
>>> Sep 16 16:19:27 KOM-AD01-PBX02 systemd[1]: Stopping User Manager for UID 1000...
>>> Sep 16 16:19:27 KOM-AD01-PBX02 systemd[1]: Starting Unattended Upgrades Shutdown...
>>> Sep 16 16:19:27 KOM-AD01-PBX02 snapd[2583]: 2016/09/16 16:19:27.289063 main.go:67: Exiting on terminated signal.
>>> Sep 16 16:19:27 KOM-AD01-PBX02 sshd[2940]: pam_unix(sshd:session): session closed for user user
>>> Sep 16 16:19:27 KOM-AD01-PBX02 su[3015]: pam_unix(su:session): session closed for user root
>>> Sep 16 16:19:27 KOM-AD01-PBX02 spice-vdagentd[2638]: vdagentd quiting, returning status 0
>>> Sep 16 16:19:27 KOM-AD01-PBX02 sudo[3014]: pam_unix(sudo:session): session closed for user root
>>> Sep 16 16:19:27 KOM-AD01-PBX02 /usr/lib/snapd/snapd[2583]: main.go:67: Exiting on terminated signal.
>>> Sep 16 16:19:27 KOM-AD01-PBX02 sshd[2812]: Received signal 15; terminating.
>>> ...
>>> Sep 16 16:19:28 KOM-AD01-PBX02 systemd[1]: Reached target Unmount All Filesystems.
>>> Sep 16 16:19:28 KOM-AD01-PBX02 systemd[1]: Stopped target Local File Systems (Pre).
>>> Sep 16 16:19:28 KOM-AD01-PBX02 systemd[1]: Stopping Monitoring of LVM2 mirrors, snapshots etc. using dmeventd or progress polling...
>>> Sep 16 16:19:28 KOM-AD01-PBX02 systemd[1]: Stopped Remount Root and Kernel File Systems.
>>> Sep 16 16:19:28 KOM-AD01-PBX02 systemd[1]: Stopped Create Static Device Nodes in /dev.
>>> Sep 16 16:19:28 KOM-AD01-PBX02 systemd[1]: Reached target Shutdown.
>>> Sep 16 16:19:28 KOM-AD01-PBX02 systemd[1]: Reached target Final Step.
>>> Sep 16 16:19:28 KOM-AD01-PBX02 systemd[1]: Starting Reboot...
>>> Sep 16 16:19:28 KOM-AD01-PBX02 systemd[1]: Stopped Monitoring of LVM2 mirrors, snapshots etc. using dmeventd or progress polling.
>>> Sep 16 16:19:28 KOM-AD01-PBX02 systemd[1]: Shutting down.
>>> Sep 16 16:19:28 KOM-AD01-PBX02 kernel: [drm:qxl_enc_commit [qxl]] *ERROR* head number too large or missing monitors config: ffffc9000084a000, 0systemd-shutdown[1]: Sending SIGTERM to remaining processes...
>>> Sep 16 16:19:28 KOM-AD01-PBX02 systemd-journald[3342]: Journal stopped
>>> -- Reboot --
>>> 
>>> Perhaps this is a feature of the HP ProLiant DL 360 G5. I don't know.
>>> 
>>> If I test host unavailability in other ways, everything works well.
>>> 
>>> I described my experience testing fencing, with practical examples, on my blog (in Russian) for everyone.
>>> https://blog.it-kb.ru/2016/09/16/install-ovirt-4-0-part-4-about-ssh-soft-fencing-and-hard-fencing-over-hp-proliant-ilo2-power-managment-agent-and-test-of-high-availability/
>>> 
>>> Thank you all very much for your participation and support.
>>> 
>>> Michal, what kind of scenario are you talking about?
>> 
>> Basically what you just did,
>> the question is what happens when you run 'shutdown -h now' (or press the physical button if configured to trigger a soft shutdown): is it going to somehow propagate the shutdown action to the VMs, or brutally kill them?
>> 
>> In the first case the VMs will not restart regardless of their HA flags.
>> 
>>> PS: Excuse me for my bad English :)
>>> 
>>> 16.09.2016, 16:37, "Simone Tiraboschi" <stirabos at redhat.com>:
>>>> On Fri, Sep 16, 2016 at 3:34 PM, Michal Skrivanek <michal.skrivanek at redhat.com> wrote:
>>>>>> On 16 Sep 2016, at 15:31, aleksey.maksimov at it-kb.ru wrote:
>>>>>> 
>>>>>> Hi Simone.
>>>>>> Exactly.
>>>>>> Now I'll set up journald on the guest and try to understand how the guest gets shut down.
>>>>> 
>>>>> great. thanks
>>>>> 
>>>>>> 16.09.2016, 16:25, "Simone Tiraboschi" <stirabos at redhat.com>:
>>>>>>> On Fri, Sep 16, 2016 at 3:13 PM, Michal Skrivanek <michal.skrivanek at redhat.com> wrote:
>>>>>>>>> On 16 Sep 2016, at 15:05, Gianluca Cecchi <gianluca.cecchi at gmail.com> wrote:
>>>>>>>>> 
>>>>>>>>> On Fri, Sep 16, 2016 at 2:50 PM, Michal Skrivanek <michal.skrivanek at redhat.com> wrote:
>>>>>>>>>> no, that’s not how HA works today. When you log into a guest and issue “shutdown” we do not restart the VM under your hands. We can argue how it should or may work, but this is the defined behavior since the dawn of oVirt.
>>>>>>>>>> 
>>>>>>>>>>> AFAIK that's correct, we need to be able to shut down an HA VM without it being immediately restarted on a different host. We want to restart an HA VM only if the host where the HA VM is running is non-responsive.
>>>>>>>>>> 
>>>>>>>>>> we try to restart it in all cases other than a user-initiated shutdown, e.g. a QEMU process crash on an otherwise-healthy host
>>>>>>>>> Hi, just another question in case HA is not configured at all.
>>>>>>>> 
>>>>>>>> by “HA configured” I expect you’re referring to the “Highly Available” checkbox in Edit VM dialog.
>>>>>>>> 
>>>>>>>>> If I run the "shutdown -h now" command on a host where some VMs are running, what is the expected behavior?
>>>>>>>>> Clean VM shutdown (with or without timeout in case it doesn't complete?) or crash of their related QEMU processes?
>>>>>>>> 
>>>>>>>> expectation is that you won’t do that. That’s why there is the Maintenance host state.
>>>>>>>> But if you do that regardless, with VMs running, all the processes will be terminated in a regular system way, i.e. all QEMU processes get SIGTERM. From the perspective of each guest this is not a clean shutdown and it would just get killed
>>>>>>> 
>>>>>>> Aleksey is reporting that he started a shutdown on his host via power management, and the VM processes weren't roughly killed but were shut down smoothly, so they weren't restarted regardless of their HA flag; hence this thread.
>>>>> 
>>>>> Gianluca talks about “shutdown -h now”, you talk about power management action, those are two different things. The current idea is that systemd or some other component just propagates the action to the guest and if that guest is configured to handle it as a shutdown it starts it itself as well so it looks like a user-initiated one. Even though this mostly makes sense it is not ok for current HA logic
>>>> 
>>>> Aleksey, can you please also test this scenario?
>>>>>>>> Thanks,
>>>>>>>> michal
>>>>>>>>> Thanks,
>>>>>>>>> Gianluca
>>>>>>>>> _______________________________________________
>>>>>>>>> Users mailing list
>>>>>>>>> Users at ovirt.org
>>>>>>>>> http://lists.ovirt.org/mailman/listinfo/users



