[ovirt-users] Power failure recovery

Wed Jun 7 16:17:58 UTC 2017

Hi Anton,

Thanks for the suggestions; our engine has the same default values as
you posted. However, it seems our engine tried to start each VM exactly 3
times: once on each host in the cluster, all within about 15 seconds,
and never tried again.
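
As a rough sanity check on those defaults, here is a small sketch of the
retry arithmetic. It assumes (I have not verified this against the engine
source) that attempts are evenly spaced, so the retry window is roughly
tries * interval. The helper names retry_window and tries_needed are
made up for illustration; the 600 s figure is the ~10 minutes our master
storage domain took to come back. The engine-config line is only printed,
not executed.

```shell
#!/bin/sh
# Approximate time (seconds) the engine keeps retrying an HA VM,
# ASSUMING evenly spaced attempts: tries * interval.
retry_window() {
    tries=$1
    interval=$2
    echo $((tries * interval))
}

# Smallest number of tries whose window covers an outage of the given
# length (seconds), rounding up.
tries_needed() {
    outage=$1
    interval=$2
    echo $(( (outage + interval - 1) / interval ))
}

# Defaults quoted in this thread: 10 tries, 30 s apart.
echo "default retry window: $(retry_window 10 30) s"

# Storage took roughly 10 minutes (600 s) to become active.
echo "tries needed to cover 600 s: $(tries_needed 600 30)"

# Print (do not run) the command that would raise the limit accordingly.
echo "engine-config -s MaxNumOfTriesToRunFailedAutoStartVm=$(tries_needed 600 30)"
```

Under that assumption the defaults would cover only about 5 minutes, so
even working retries would have given up before our storage was ready;
something above 20 tries would leave some margin. As far as I know the
engine service has to be restarted for engine-config changes to apply.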

The engine logs don't appear to shed any useful light on this in my
opinion, but I could send them to you (offlist?) if that's any use.

Best regards,
Chris

On 07/06/17 14:56, Artyom Lukianov wrote:
> Under engine-config, I can see two variables that are connected to the
> restart of HA VMs:
> MaxNumOfTriesToRunFailedAutoStartVm: "Number of attempts to restart
> highly available VM that went down unexpectedly" (Value Type: Integer)
> RetryToRunAutoStartVmIntervalInSeconds: "How often to try to restart
> highly available VM that went down unexpectedly (in seconds)" (Value
> Type: Integer)
> And their default parameters are:
> # engine-config -g MaxNumOfTriesToRunFailedAutoStartVm
> MaxNumOfTriesToRunFailedAutoStartVm: 10 version: general
> # engine-config -g RetryToRunAutoStartVmIntervalInSeconds
> RetryToRunAutoStartVmIntervalInSeconds: 30 version: general
> 
> So check the engine.log: if you do not see that the engine restarts the
> HA VMs ten times, it is definitely a bug; otherwise, you can just play
> with these parameters to adapt them to your case.
> Best Regards
> 
> On Wed, Jun 7, 2017 at 12:52 PM, Chris Boot <bootc at bootc.net
> <mailto:bootc at bootc.net>> wrote:
> 
>     Hi all,
> 
>     We've got a three-node "hyper-converged" oVirt 4.1.2 + GlusterFS cluster
>     on brand new hardware. It's not quite in production yet but, as these
>     things always go, we already have some important VMs on it.
> 
>     Last night the servers (which aren't yet on UPS) suffered a brief power
>     failure. They all booted up cleanly and the hosted engine started up ~10
>     minutes afterwards (presumably once the engine GlusterFS volume was
>     sufficiently healed and the HA stack realised). So far so good.
> 
>     As soon as the HostedEngine started up it tried to start all our Highly
>     Available VMs. Unfortunately our master storage domain was as yet
>     inactive as GlusterFS was presumably still trying to get it healed.
>     About 10 minutes later the master domain was activated and
>     "reconstructed" and an SPM was selected, but oVirt had tried and failed
>     to start all the HA VMs already and didn't bother trying again.
> 
>     All the VMs started just fine this morning when we realised what had
>     happened and logged in to oVirt to start them.
> 
>     Is this known and/or expected behaviour? Can we do anything to delay
>     starting HA VMs until the storage domains are there? Can we get oVirt to
>     keep trying to start HA VMs when they fail to start?
> 
>     Is there a bug for this already or should I be raising one?
> 
>     Thanks,
>     Chris
> 
>     --
>     Chris Boot
>     bootc at bootc.net <mailto:bootc at bootc.net>
>     _______________________________________________
>     Users mailing list
>     Users at ovirt.org <mailto:Users at ovirt.org>
>     http://lists.ovirt.org/mailman/listinfo/users
> 
> 

-- 
Chris Boot
bootc at bootc.net