[ovirt-users] power outage: HA vms not restarted

Eli Mesika emesika at redhat.com
Mon May 19 10:36:32 UTC 2014



----- Original Message -----
> From: "Yuriy Demchenko" <demchenko.ya at gmail.com>
> To: "Eli Mesika" <emesika at redhat.com>
> Cc: users at ovirt.org
> Sent: Monday, May 19, 2014 1:09:42 PM
> Subject: Re: [ovirt-users] power outage: HA vms not restarted
> 
> On 05/19/2014 01:27 PM, Eli Mesika wrote:
> >
> > ----- Original Message -----
> >> From: "Yuriy Demchenko" <demchenko.ya at gmail.com>
> >> To: users at ovirt.org
> >> Sent: Monday, May 19, 2014 11:34:15 AM
> >> Subject: [ovirt-users] power outage: HA vms not restarted
> >>
> >> Hi,
> >>
> >> I'm running ovirt-3.2.2-el6 on 18 el6 hosts with FC SAN storage and 46 HA
> >> VMs in 2 datacenters (3 hosts use different storage with no
> >> connectivity to the first storage, hence the second DC).
> >> Recently (2014-05-17) I had a double power outage: the first blackout at
> >> 00:16, power back at ~00:19; the second blackout at 00:26, power back at
> >> 10:06. When everything finally came up (after approx. 10:16), only 2 of
> >> the 46 VMs were restarted.
> >> From browsing the engine log I saw failed restart attempts of almost all
> >> VMs after the first blackout with the error 'Failed with error ENGINE and
> >> code 5001', but after the second blackout I saw no attempts to restart the
> >> VMs, and the only error was 'connect timeout' (probably to srv5 - that
> >> host physically died after the blackouts).
> >> And I can't figure out why the HA VMs were not restarted. Please advise.
> >>
> >> engine and (supposedly) SPM host logs attached.
> > Hi Yuriy
> >
> > What I see is that the log for 2014-05-17 starts at 2014-05-17 00:23:03,
> > so I cannot track the first interval you mentioned (00:19 to 00:26)
> 00:23 is the time when the engine booted up after the first outage; that's
> why the logs start at 00:23:03
> > However, I can clearly see that at 2014-05-17 00:23:03 the engine was
> > restarted, and at 2014-05-17 00:23:09,423 we started to get connection
> > errors.
> > We tried to solve the problem by doing soft-fencing (actually a vdsmd
> > service restart) on the problematic hosts, but ssh to the hosts failed, so
> > we tried to hard-fence the hosts (restart/reboot). This, however, fell
> > within the configurable "quiet time" in which fencing operations are
> > prevented after an engine restart, which is set by default to 5 min (the
> > DisableFenceAtStartupInSec key in engine-config), and therefore we skipped
> > the fencing operation...
> my hosts usually boot slower than the engine - long BIOS checks plus a
> random power-on delay (120s) - that's why at first the engine reports
> connect errors.
> however, in the logs I see that some of them were already up and the engine
> successfully contacted them:

So maybe you should set DisableFenceAtStartupInSec to a higher value instead of the default 5 min...
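
A minimal sketch using the engine-config tool (the 600-second value is just an
illustration; run this on the engine machine, and note that the engine must be
restarted for the change to take effect):

  # show the current value (default is 300 seconds, i.e. 5 min)
  engine-config -g DisableFenceAtStartupInSec
  # raise it to e.g. 10 minutes to cover slow host boots
  engine-config -s DisableFenceAtStartupInSec=600
  # the new value is picked up on engine restart
  service ovirt-engine restart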

> 2014-05-17 00:23:10,450 INFO
> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
> (DefaultQuartzScheduler_Worker-18) Correlation ID: null, Call Stack:
> null, Custom Event ID: -1, Message: State was set to Up for host srv11.
> 2014-05-17 00:23:10,456 INFO
> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
> (DefaultQuartzScheduler_Worker-4) Correlation ID: null, Call Stack:
> null, Custom Event ID: -1, Message: State was set to Up for host srv4.
> 2014-05-17 00:23:10,458 INFO
> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
> (DefaultQuartzScheduler_Worker-11) Correlation ID: null, Call Stack:
> null, Custom Event ID: -1, Message: State was set to Up for host srv7.
> 2014-05-17 00:23:10,460 INFO
> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
> (DefaultQuartzScheduler_Worker-20) Correlation ID: null, Call Stack:
> null, Custom Event ID: -1, Message: State was set to Up for host srv9.
> 
> and after 00:23:11 I saw no fencing-related messages, only VM restart
> attempts that failed with strange errors like:
> 'Failed with error ENGINE and code 5001'
> 'Candidate host srv1 (2a89e565-aa4e-4a19-82e3-e72e4edee111) was filtered
> out by VAR__FILTERTYPE__INTERNAL filter Memory'
> 'CanDoAction of action RunVm failed.
> Reasons:VAR__ACTION__RUN,VAR__TYPE__VM,ACTION_TYPE_FAILED_VDS_VM_MEMORY,VAR__ACTION__RUN,VAR__TYPE__VM,ACTION_TYPE_FAILED_VDS_VM_MEMORY,SCHEDULING_ALL_HOSTS_FILTERED_OUT,VAR__FILTERTYPE__INTERNAL,$hostName
> srv1,$filterName Memory,SCHEDULING_HOST_FILTERED_REASON'

Strange that you didn't get the message itself (this is only the message key).
The original message is:
Cannot Run VM. There are no available running Hosts with sufficient memory in VM's Cluster.
So it failed on the RunVm command validation, where no host with enough memory to run the VM was found.
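
For reference, those keys are resolved to user-visible text via the engine's
AppErrors.properties file; the entry would look roughly like the excerpt below
(illustrative only - the placeholder syntax and exact wording may differ by
version):

  # AppErrors.properties (illustrative excerpt)
  ACTION_TYPE_FAILED_VDS_VM_MEMORY=Cannot ${action} ${type}. There are no available running Hosts with sufficient memory in ${type}'s Cluster.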


> 
> > For the first period, as I said, I can only guess that one of your hosts'
> > fencing attempts came after those 5 minutes, and therefore that host was
> > rebooted and its HA VMs were freed to run on another host.
> > For the second period, for which I have logs, the host fencing failed due
> > to the required "quiet time", and in this situation the only thing you can
> > do in order to have the HA VMs running again is to
> > right-click on each host and press "Confirm that host has been rebooted"
> but I see in the logs that after the second period, at 10:14+, all hosts but
> one (srv5) were up and power management was verified successfully; shouldn't
> that be enough for the engine to verify all HA VMs are down and restart them?

How should the engine know that the host was rebooted?
The fact that power management was verified successfully is not enough in order to run the VMs on another host.
As I see it, the fence commands that were intended to reboot the hosts holding the VMs failed ...

> 2014-05-17 10:11:56,946 INFO
> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
> (DefaultQuartzScheduler_Worker-75) [dbc315d] Correlation ID: 73469dd6,
> Call Stack: null, Custom Event ID: -1, Message: Host
> srv17 power management was verified successfully.
> 
> in fact, at ~17:55 my colleague restarted the engine, forced SPM selection
> and started all the VMs - all started without errors, and he didn't have
> to click 'Confirm host has been rebooted'

But here you restarted the VMs on the original host they had been running on, after that host was UP; this will work.
The problem is running an HA VM on another host: to do so, we must ensure that the host that was running the VMs until now is not running them anymore.
If the fence operation fails, we can say nothing about the host's status, and then, as I said, you have to manually confirm that the host was rebooted in order
to prevent the VMs from running on multiple hosts, which would lead to data corruption.

> >
> > Regards
> > Eli
> >
> >
> >> --
> >> Yuriy Demchenko
> >>
> >>
> >> _______________________________________________
> >> Users mailing list
> >> Users at ovirt.org
> >> http://lists.ovirt.org/mailman/listinfo/users
> >>
> 
> 


