[ovirt-users] power outage: HA vms not restarted

Eli Mesika emesika at redhat.com
Mon May 19 12:56:57 UTC 2014



----- Original Message -----
> From: "Yuriy Demchenko" <demchenko.ya at gmail.com>
> To: "Eli Mesika" <emesika at redhat.com>
> Cc: users at ovirt.org
> Sent: Monday, May 19, 2014 2:13:02 PM
> Subject: Re: [ovirt-users] power outage: HA vms not restarted
> 
> On 05/19/2014 02:36 PM, Eli Mesika wrote:
> >
> >
> > Hi,
> >
> > I'm running ovirt-3.2.2-el6 on 18 el6 hosts with FC SAN storage and 46 HA
> > VMs in 2 datacenters (3 hosts use different storage with no
> > connectivity to the first storage, hence the second DC).
> > Recently (2014-05-17) I had a double power outage: the first blackout was at
> > 00:16 with power back at ~00:19, the second at 00:26 with power back at 10:06.
> > When everything finally came up (after approx. 10:16), only 2 VMs out of 46
> > were restarted.
> > From browsing the engine log I saw failed restart attempts of almost all VMs
> > after the first blackout with the error 'Failed with error ENGINE and code
> > 5001', but after the second blackout I saw no attempts to restart VMs, and
> > the only error was 'connect timeout' (probably to srv5 - that host
> > physically died after the blackouts).
> > I can't figure out why the HA VMs were not restarted. Please advise.
> >
> > Engine and (supposedly) SPM host logs are attached.
> >>> Hi Yuriy
> >>>
> >>> What I see is that the log for 2014-05-17 starts at 2014-05-17
> >>> 00:23:03,
> >>> so I cannot track the first interval you mentioned (00:19 to 00:26)
> >> 00:23 is the time when the engine booted up after the first outage;
> >> that's why the logs start at 00:23:03
> >>> However, I can clearly see that at 2014-05-17 00:23:03 the engine was
> >>> restarted and at 2014-05-17 00:23:09,423 we started to get connection
> >>> errors.
> >>> We tried to solve the problem by doing soft-fencing (actually a vdsmd
> >>> service restart) on the problematic hosts, but SSH to the hosts failed,
> >>> so we tried
> >>> to hard-fence the hosts (restart/reboot); however, this fell within the
> >>> configurable
> >>> "quiet time" during which we prevent fencing operations after an engine
> >>> restart, which
> >>> is set by default to 5 min (the DisableFenceAtStartupInSec key in
> >>> engine-config), and therefore we skipped the fencing operation...
> >> my hosts usually boot slower than the engine (long BIOS checks + a random
> >> power-on delay of 120s), that's why at first the engine reports connect
> >> errors; however, in the logs I see that some of them were already up and
> >> the engine successfully contacted them:
> > So maybe you should set DisableFenceAtStartupInSec to a higher value
> > instead of the default 5 min...
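> > For example, on the engine machine (a rough sketch; 900 seconds is just an
> > illustrative value, pick something that covers your hosts' slow boot time):
> >
> >     # engine-config -g DisableFenceAtStartupInSec    # show the current value
> >     # engine-config -s DisableFenceAtStartupInSec=900
> >     # service ovirt-engine restart    # the key is read at engine startup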
> >
> >> 2014-05-17 00:23:10,450 INFO
> >> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
> >> (DefaultQuartzScheduler_Worker-18) Correlation ID: null, Call Stack:
> >> null, Custom Event ID: -1, Message: State was set to Up for host srv11.
> >> 2014-05-17 00:23:10,456 INFO
> >> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
> >> (DefaultQuartzScheduler_Worker-4) Correlation ID: null, Call Stack:
> >> null, Custom Event ID: -1, Message: State was set to Up for host srv4.
> >> 2014-05-17 00:23:10,458 INFO
> >> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
> >> (DefaultQuartzScheduler_Worker-11) Correlation ID: null, Call Stack:
> >> null, Custom Event ID: -1, Message: State was set to Up for host srv7.
> >> 2014-05-17 00:23:10,460 INFO
> >> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
> >> (DefaultQuartzScheduler_Worker-20) Correlation ID: null, Call Stack:
> >> null, Custom Event ID: -1, Message: State was set to Up for host srv9.
> >>
> >> and after 00:23:11 I saw no fencing-related messages, only VM restart
> >> attempts that failed with strange errors like:
> >> 'Failed with error ENGINE and code 5001'
> >> 'Candidate host srv1 (2a89e565-aa4e-4a19-82e3-e72e4edee111) was filtered
> >> out by VAR__FILTERTYPE__INTERNAL filter Memory'
> >> 'CanDoAction of action RunVm failed.
> >> Reasons:VAR__ACTION__RUN,VAR__TYPE__VM,ACTION_TYPE_FAILED_VDS_VM_MEMORY,VAR__ACTION__RUN,VAR__TYPE__VM,ACTION_TYPE_FAILED_VDS_VM_MEMORY,SCHEDULING_ALL_HOSTS_FILTERED_OUT,VAR__FILTERTYPE__INTERNAL,$hostName
> >> srv1,$filterName Memory,SCHEDULING_HOST_FILTERED_REASON'
> > Strange that you didn't get the message itself (this is only the message
> > key).
> > The original message is:
> > "Cannot Run VM. There are no available running Hosts with sufficient memory
> > in VM's Cluster."
> > So, it failed in the RunVm command validation, where no host with enough
> > memory to run the VM was found.
> yeah, but how can that be? All my hosts have 256 GB of RAM and only 2 VMs are
> RAM-hungry (128 GB and 240 GB), the others use much less (2-64 GB)
> >
> >>> For the first period, as I said, I can only guess that one of your hosts'
> >>> fencing attempts came after those 5 minutes; it was therefore rebooted
> >>> and the HA VMs were freed to run on another host.
> >>> For the second period, for which I have logs, the host fencing failed due
> >>> to
> >>> the required "quiet time", and in this situation the only thing you can do
> >>> in order to have the HA VMs running again is to
> >>> right-click on each host and press "Confirm that host has been rebooted"
> >> but I see in the logs that after the second period, at 10:14+, all hosts
> >> but one (srv5) were up and power management was verified successfully;
> >> shouldn't that be enough for the engine to verify that all HA VMs are down
> >> and restart them?
> > How should the engine know that the host was rebooted??
> > The fact that power management was verified successfully is not enough in
> > order to run the VMs on another host.
> > As I see it, the fence commands that were meant to reboot the hosts holding
> > the VMs failed ...
> but about 30 seconds after the engine set the holding hosts to non-responsive
> they all went up - 'State was set to Up for host'
> so the engine can contact those hosts, query VM status and restart VMs if
> needed, shouldn't it?
> 
> 2014-05-17 10:09:23,978 INFO
> [org.ovirt.engine.core.vdsbroker.VdsManager]
> (DefaultQuartzScheduler_Worker-48) Server failed to respond, vds_id =
> 7bf30c8b-7eb2-4b1e-ba0e-13093bca57bf, vds_name = srv9, vm_count = 0,
> spm_status = None, non-responsive_timeout (seconds) = 60, error =
> java.net.ConnectException: Connection refused
> 2014-05-17 10:09:40,143 INFO
> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
> (DefaultQuartzScheduler_Worker-42) Correlation ID: null, Call Stack:
> null, Custom Event ID: -1, Message: State was set to Up for host srv9.
> 
> 2014-05-17 10:09:23,980 INFO
> [org.ovirt.engine.core.vdsbroker.VdsManager]
> (DefaultQuartzScheduler_Worker-51) Server failed to respond, vds_id =
> 57f58a56-a7ad-4917-b32d-f5dfaf326419, vds_name = srv10, vm_count = 0,
> spm_status = None, non-responsive_timeout (seconds) = 60, error =
> java.net.ConnectException: Connection refused
> 2014-05-17 10:09:43,214 INFO
> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
> (DefaultQuartzScheduler_Worker-64) Correlation ID: null, Call Stack:
> null, Custom Event ID: -1, Message: State was set to Up for host srv10.
> 
> 2014-05-17 10:09:24,029 INFO
> [org.ovirt.engine.core.vdsbroker.VdsManager]
> (DefaultQuartzScheduler_Worker-55) Server failed to respond, vds_id =
> bd07bc6d-37e8-4b55-810c-f41c78f7a8ad, vds_name = srv11, vm_count = 0,
> spm_status = None, non-responsive_timeout (seconds) = 60, error =
> java.net.ConnectException: Connection refused
> 2014-05-17 10:09:43,239 INFO
> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
> (DefaultQuartzScheduler_Worker-63) Correlation ID: null, Call Stack:
> null, Custom Event ID: -1, Message: State was set to Up for host srv11.
> 
> 2014-05-17 10:09:23,991 INFO
> [org.ovirt.engine.core.vdsbroker.VdsManager]
> (DefaultQuartzScheduler_Worker-50) Server failed to respond, vds_id =
> eac8d084-1be0-4e8d-a589-385f71fe3afa, vds_name = srv12, vm_count = 5,
> spm_status = Contending, non-responsive_timeout (seconds) = 82, error =
> java.net.ConnectException: Connection refused
> 2014-05-17 10:09:43,320 INFO
> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
> (DefaultQuartzScheduler_Worker-60) Correlation ID: null, Call Stack:
> null, Custom Event ID: -1, Message: State was set to Up for host srv12.
> 
> 2014-05-17 10:09:23,987 INFO
> [org.ovirt.engine.core.vdsbroker.VdsManager]
> (DefaultQuartzScheduler_Worker-49) Server failed to respond, vds_id =
> 712e822b-d3c6-4cb7-a6f1-58da5039bda1, vds_name = srv13, vm_count = 4,
> spm_status = None, non-responsive_timeout (seconds) = 62, error =
> java.net.ConnectException: Connection refused
> 2014-05-17 10:09:44,934 INFO
> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
> (DefaultQuartzScheduler_Worker-58) [77463af9] Correlation ID: null, Call
> Stack: null, Custom Event ID: -1, Message: State was set to Up for host
> srv13.
> 
> 2014-05-17 10:09:24,016 INFO
> [org.ovirt.engine.core.vdsbroker.VdsManager]
> (DefaultQuartzScheduler_Worker-53) Server failed to respond, vds_id =
> 9d99c8eb-754e-4446-8a0f-af7327d19989, vds_name = srv14, vm_count = 0,
> spm_status = None, non-responsive_timeout (seconds) = 60, error =
> java.net.ConnectException: Connection refused
> 2014-05-17 10:09:43,930 INFO
> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
> (DefaultQuartzScheduler_Worker-61) Correlation ID: null, Call Stack:
> null, Custom Event ID: -1, Message: State was set to Up for host srv14.
> 
> 2014-05-17 10:09:24,019 INFO
> [org.ovirt.engine.core.vdsbroker.VdsManager]
> (DefaultQuartzScheduler_Worker-54) Server failed to respond, vds_id =
> 3ca82f5b-1343-451f-81d0-d2ed111b5f51, vds_name = srv15, vm_count = 0,
> spm_status = None, non-responsive_timeout (seconds) = 60, error =
> java.net.ConnectException: Connection refused
> 2014-05-17 10:09:40,285 INFO
> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
> (DefaultQuartzScheduler_Worker-50) Correlation ID: null, Call Stack:
> null, Custom Event ID: -1, Message: State was set to Up for host srv15.
> 
> 2014-05-17 10:09:23,996 INFO
> [org.ovirt.engine.core.vdsbroker.VdsManager]
> (DefaultQuartzScheduler_Worker-45) Server failed to respond, vds_id =
> c9d9735b-d75f-4026-afcd-94491ee66a4f, vds_name = srv16, vm_count = 0,
> spm_status = None, non-responsive_timeout (seconds) = 60, error =
> java.net.ConnectException: Connection refused
> 2014-05-17 10:09:40,188 INFO
> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
> (DefaultQuartzScheduler_Worker-47) Correlation ID: null, Call Stack:
> null, Custom Event ID: -1, Message: State was set to Up for host srv16
> 
> 2014-05-17 10:09:23,963 INFO
> [org.ovirt.engine.core.vdsbroker.VdsManager]
> (DefaultQuartzScheduler_Worker-47) Server failed to respond, vds_id =
> 8daa3751-f347-41eb-b6a6-a27f4fda2def, vds_name = srv17, vm_count = 0,
> spm_status = None, non-responsive_timeout (seconds) = 60, error =
> java.net.ConnectException: Connection refused
> 2014-05-17 10:09:46,358 INFO
> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
> (DefaultQuartzScheduler_Worker-75) Correlation ID: null, Call Stack:
> null, Custom Event ID: -1, Message: State was set to Up for host srv17.
> 
> 2014-05-17 10:09:23,949 INFO
> [org.ovirt.engine.core.vdsbroker.VdsManager]
> (DefaultQuartzScheduler_Worker-46) Server failed to respond, vds_id =
> 2e833c27-fd9f-4750-ab84-2fb629b66671, vds_name = srv18, vm_count = 0,
> spm_status = None, non-responsive_timeout (seconds) = 60, error =
> java.net.ConnectException: Connection refused
> 2014-05-17 10:10:25,643 INFO
> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
> (DefaultQuartzScheduler_Worker-7) Correlation ID: null, Call Stack:
> null, Custom Event ID: -1, Message: State was set to Up for host srv18.
> 
> 2014-05-17 10:09:23,991 INFO
> [org.ovirt.engine.core.vdsbroker.VdsManager]
> (DefaultQuartzScheduler_Worker-52) Server failed to respond, vds_id =
> e7cedec5-9335-45bf-ae58-0744a9b93f41, vds_name = srv19, vm_count = 3,
> spm_status = None, non-responsive_timeout (seconds) = 61, error =
> java.net.ConnectException: Connection refused
> 2014-05-17 10:09:40,260 INFO
> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
> (DefaultQuartzScheduler_Worker-54) Correlation ID: null, Call Stack:
> null, Custom Event ID: -1, Message: State was set to Up for host srv19.
> 
> >> 2014-05-17 10:11:56,946 INFO
> >> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
> >> (DefaultQuartzScheduler_Worker-75) [dbc315d] Correlation
> >> ID: 73469dd6, Call Stack: null, Custom Event ID: -1, Message: Host
> >> srv17 power management was verified successfully.
> >>
> >> in fact, at ~17:55 my colleague restarted the engine, forced SPM selection
> >> and started all VMs - all started without errors, and he didn't have to
> >> click 'Confirm host has been rebooted'
> > But here you restarted the VMs on the original host on which they had been
> > running, after that host was UP; this will work.
> > The problem is running an HA VM on another host: to do so, we must ensure
> > that the host that was running the VMs until now is not running them anymore.
> > If the fence operation fails, we can say nothing about the host's status, and
> > then, as I said, you should manually confirm that the host was rebooted in
> > order
> > to prevent VMs from running on multiple hosts, which would lead to data
> > corruption.
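> > (For what it's worth, that confirmation can also be scripted against the
> > REST API; this is an untested sketch, the engine URL, host id and
> > credentials are placeholders, and I believe the "manual" fence type maps
> > to the webadmin's "Confirm 'Host has been Rebooted'" action, but verify
> > that against the 3.2 REST API documentation first:
> >
> >     curl -k -u admin@internal:PASSWORD \
> >          -H "Content-Type: application/xml" \
> >          -d "<action><fence_type>manual</fence_type></action>" \
> >          https://your-engine/api/hosts/<host-id>/fence
> >
> > Only do this when you are certain the host is really down, for exactly the
> > data-corruption reason above.)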
> but shouldn't the engine restart the corresponding VMs after the holding host
> came up? (without a manual fence)
> because they are up - so the engine can query them about running/not-running
> VMs and get the actual state of the VMs - running or not
> the only host that was down at that point is srv5, which held only 1 VM -
> and it was correctly put in the 'unknown' state; the other VMs were just
> 'down' until we manually started them

Are you sure that those VMs are defined as Highly Available VMs ???
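
The quickest check is the High Availability section of the Edit VM dialog in
the webadmin. To check all 46 VMs at once you can also query the engine
database - a sketch, assuming the default "engine" database name and that the
HA flag is still the auto_startup column of vm_static (that is my recollection
of the 3.x schema, so verify before trusting it):

    # su - postgres -c "psql engine -c 'select vm_name, auto_startup from vm_static order by vm_name;'"

Any VM with auto_startup = f is not marked Highly Available and will never be
restarted automatically.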

> 


