Re: [ovirt-users] power outage: HA vms not restarted

19 May 2014

----- Original Message -----
> From: "Yuriy Demchenko" <demchenko.ya@gmail.com>
> To: "Eli Mesika" <emesika@redhat.com>
> Cc: users@ovirt.org
> Sent: Monday, May 19, 2014 3:56:41 PM
> Subject: Re: [ovirt-users] power outage: HA vms not restarted
> 
> saw another strange moment in logs:
> 
> for example, vm 'prod.gui', according to logs engine tried to restart it
> on host srv19 at 00:23:11 and it failed with error 'Failed to run VM
> prod.gui: VdcBLLException: java.lang.NullPointerException (Failed with
> error ENGINE and code 5001)'

This is a bug and you should open a BZ for that and attach related engine & vdsm logs 

> 
> in vdsm.log on srv19 i can see this error:
> StoragePoolUnknown: Unknown pool id, pool not connected:
> ('5849b030-626e-47cb-ad90-3ce782d831b3',)
> Thread-18::DEBUG::2014-05-17
> 00:23:13,022::task::869::TaskManager.Task::(_run)
> Task=`27a323de-ac42-4d68-992f-eb5c84d8baea`::Task._run:
> 27a323de-ac42-4d68-992f-eb5c84d8baea
> ('5849b030-626e-47cb-ad90-3ce782d831b3',) {} failed - stopping task
> Thread-18::DEBUG::2014-05-17
> 00:23:13,022::task::1194::TaskManager.Task::(stop)
> Task=`27a323de-ac42-4d68-992f-eb5c84d8baea`::stopping in state preparing
> (force False)
> Thread-18::DEBUG::2014-05-17
> 00:23:13,023::task::974::TaskManager.Task::(_decref)
> Task=`27a323de-ac42-4d68-992f-eb5c84d8baea`::ref 1 aborting True
> Thread-18::INFO::2014-05-17
> 00:23:13,023::task::1151::TaskManager.Task::(prepare)
> Task=`27a323de-ac42-4d68-992f-eb5c84d8baea`::aborting: Task is aborted:
> 'Unknown pool id, pool not connected' - code 309
> Thread-18::DEBUG::2014-05-17
> 00:23:13,024::task::1156::TaskManager.Task::(prepare)
> Task=`27a323de-ac42-4d68-992f-eb5c84d8baea`::Prepare: aborted: Unknown
> pool id, pool not connected
> Thread-18::DEBUG::2014-05-17
> 00:23:13,024::task::974::TaskManager.Task::(_decref)
> Task=`27a323de-ac42-4d68-992f-eb5c84d8baea`::ref 0 aborting True
> 
> according to engine database - storage pool
> '5849b030-626e-47cb-ad90-3ce782d831b3' is datacenter 'CODD'.
> 
> but - according to engine logs SPM start for datacenter CODD was run
> only at 00:23:27, i.e. _after_ vm restart attemts

How did you concluded that ?

> 2014-05-17 00:23:27,585 INFO
> [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand]
> (DefaultQuartzScheduler_Worker-43) [1bb8f3c0] starting spm on vds srv12,
> storage pool CODD, prevId 9, LVER 11
> 2014-05-17 00:23:27,610 INFO
> [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand]
> (DefaultQuartzScheduler_Worker-43) [1bb8f3c0] START,
> SpmStartVDSCommand(HostName = srv12, HostId =
> eac8d084-1be0-4e8d-a589-385f71fe3afa, storagePoolId =
> 5849b030-626e-47cb-ad90-3ce782d831b3, prevId=9, prevLVER=11,
> storagePoolFormatType=V3, recoveryMode=Manual, SCSIFencing=false), log
> id: 1844d391
> 
> i thought no VM can be started until SPM host selected, why in my case
> vm restart occurred before that?
> and what 'pool not connected' means? no spm running, storage unavailable
> from some host or something else?

In this case, storage unavailable from some host....

> 
> 
> Yuriy Demchenko
> 
> On 05/19/2014 03:13 PM, Yuriy Demchenko wrote:
> > On 05/19/2014 02:36 PM, Eli Mesika wrote:
> >>
> >>
> >> Hi,
> >>
> >> i'm running ovirt-3.2.2-el6 on 18 el6 hosts with FC san storage, 46 HA
> >> vms in 2 datacenters (3 hosts uses different storage with no
> >> connectivity to first storage, that's why second DC)
> >> Recently (2014-05-17) i had a double power outage: first blackout at
> >> 00:16, went back at ~00:19, second blackout at 00:26, went back at 10:06
> >> When finally all went up (after approx. 10:16) - only 2 vms were
> >> restarted from 46.
> >>    From browsing engine log i saw failed restart attemts of almost
> >> all vms
> >> after first blackout with error 'Failed with error ENGINE and code
> >> 5001', but after second blackout i saw no attempts to restart vms, and
> >> only error was 'connect timeout' (probably to srv5 - that host
> >> physically died after blackouts).
> >> And i cant figure why HA vms were not restarted? Please advice
> >>
> >> engine and (supposedly) spm host logs in attach.
> >>>> Hi Yuriy
> >>>>
> >>>> What I see is that the log for 2014-05-17 is started at 2014-05-17
> >>>> 00:23:03
> >>>> so I can not track the first interval you had mentioned (00:19 to
> >>>> 00:26)
> >>> 00:23 is the time when engine booted up after first outage, that's why
> >>> logs started at 00:23:03
> >>>> However, I can clearly see that at 2014-05-17 00:23:03 the engine was
> >>>> restarted and at 2014-05-17 00:23:09,423 we had started to get
> >>>> connection
> >>>> errors.
> >>>> We had tried to solve the problem by doing Soft-Fencing (actually
> >>>> vdsmd
> >>>> service restart) on the problematic hosts, but ssh to the host
> >>>> failed so
> >>>> we had tried
> >>>> to hard-fence the host (restart/reboot), but this was due the
> >>>> configurable
> >>>> "quite time" in which we are preventing fencing operation after an
> >>>> engine
> >>>> restart which
> >>>> is set by default to 5 min (DisableFenceAtStartupInSec key in
> >>>> engine-config) and therefor we had skipped the fencing operation...
> >>> my hosts usually boots slower than engine, long bios checks + random
> >>> power on delay (120s), thats why at first engine reports connect errors
> >>> however, in logs i see, that some of them were allready up and engine
> >>> successfully contacted them:
> >> So maybe you should put DisableFenceAtStartupInSec to a higher value
> >> instead of the default 5 min...
> >>
> >>> 2014-05-17 00:23:10,450 INFO
> >>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
> >>> (DefaultQuartzScheduler_Worker-18) Correlation ID: null, Call Stack:
> >>> null, Custom Event ID: -1, Message: State was set to Up for host srv11.
> >>> 2014-05-17 00:23:10,456 INFO
> >>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
> >>> (DefaultQuartzScheduler_Worker-4) Correlation ID: null, Call Stack:
> >>> null, Custom Event ID: -1, Message: State was set to Up for host srv4.
> >>> 2014-05-17 00:23:10,458 INFO
> >>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
> >>> (DefaultQuartzScheduler_Worker-11) Correlation ID: null, Call Stack:
> >>> null, Custom Event ID: -1, Message: State was set to Up for host srv7.
> >>> 2014-05-17 00:23:10,460 INFO
> >>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
> >>> (DefaultQuartzScheduler_Worker-20) Correlation ID: null, Call Stack:
> >>> null, Custom Event ID: -1, Message: State was set to Up for host srv9.
> >>>
> >>> and after 00:23:11 i saw no fencing-related messages, only vm restart
> >>> attemts that failed with strange errors like:
> >>> 'Failed with error ENGINE and code 5001'
> >>> 'Candidate host srv1 (2a89e565-aa4e-4a19-82e3-e72e4edee111) was
> >>> filtered
> >>> out by VAR__FILTERTYPE__INTERNAL filter Memory'
> >>> 'CanDoAction of action RunVm failed.
> >>> Reasons:VAR__ACTION__RUN,VAR__TYPE__VM,ACTION_TYPE_FAILED_VDS_VM_MEMORY,VAR__ACTION__RUN,VAR__TYPE__VM,ACTION_TYPE_FAILED_VDS_VM_MEMORY,SCHEDULING_ALL_HOSTS_FILTERED_OUT,VAR__FILTERTYPE__INTERNAL,$hostName
> >>>
> >>> srv1,$filterName Memory,SCHEDULING_HOST_FILTERED_REASON'
> >> Strange that you didn't got the message itself (this is only the
> >> message key)
> >> The original message is :
> >> Cannot Run VM. There are no available running Hosts with sufficient
> >> memory in VM's Cluster .
> >> So, it failed on the RunVM command validation where no host with
> >> enough memory to run the VM was found
> > yea, but how that can be? all my hosts have 256gb ram and only 2 vms
> > are ram-hungry (128gb and 240gb), other uses much less (2-64gb)
> >>
> >>>> For the first period as I said I only can guess that one of your hosts
> >>>> fencing attempts was after those 5 minutes and therefor it was
> >>>> rebooted
> >>>> and the HA VMs were freed to run on other host.
> >>>> For the second period on which I have logs, the host fencing failed
> >>>> due to
> >>>> the required "quite time" and in this situation the only hing you
> >>>> can do
> >>>> in order to have the HA VMs running again is to
> >>>> right-click on each host and press "Confirm that host has been
> >>>> rebooted"
> >>> but i see in logs - after second period, at 10:14+ all hosts but one
> >>> (srv5) were up and power management verified successfully, isn't that
> >>> should be enough for engine to verify all ha vms down and restart them?
> >> How should engine know that the host was rebooted??
> >> The fact that power management verified successfully is not enough in
> >> order to run the VMs on another host.
> >> As I see the fence commands that intend to reboot the host holding
> >> the VMs were failed ...
> > but in like 30secs after engine set holding hosts to non-responsive
> > they all went up - 'state was set to up for host'
> > so, engine can contact that hosts and query vm status and restart if
> > needed, shouldn't it?
> >
> > 2014-05-17 10:09:23,978 INFO
> > [org.ovirt.engine.core.vdsbroker.VdsManager]
> > (DefaultQuartzScheduler_Worker-48) Server failed to respond, vds_id =
> > 7bf30c8b-7eb2-4b1e-ba0e-13093bca57bf, vds_name = srv9, vm_count = 0,
> > spm_status = None, non-responsive_timeout (seconds) = 60, error =
> > java.net.ConnectException: Connection refused
> > 2014-05-17 10:09:40,143 INFO
> > [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
> > (DefaultQuartzScheduler_Worker-42) Correlation ID: null, Call Stack:
> > null, Custom Event ID: -1, Message: State was set to Up for host srv9.
> >
> > 2014-05-17 10:09:23,980 INFO
> > [org.ovirt.engine.core.vdsbroker.VdsManager]
> > (DefaultQuartzScheduler_Worker-51) Server failed to respond, vds_id =
> > 57f58a56-a7ad-4917-b32d-f5dfaf326419, vds_name = srv10, vm_count = 0,
> > spm_status = None, non-responsive_timeout (seconds) = 60, error =
> > java.net.ConnectException: Connection refused
> > 2014-05-17 10:09:43,214 INFO
> > [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
> > (DefaultQuartzScheduler_Worker-64) Correlation ID: null, Call Stack:
> > null, Custom Event ID: -1, Message: State was set to Up for host srv10.
> >
> > 2014-05-17 10:09:24,029 INFO
> > [org.ovirt.engine.core.vdsbroker.VdsManager]
> > (DefaultQuartzScheduler_Worker-55) Server failed to respond, vds_id =
> > bd07bc6d-37e8-4b55-810c-f41c78f7a8ad, vds_name = srv11, vm_count = 0,
> > spm_status = None, non-responsive_timeout (seconds) = 60, error =
> > java.net.ConnectException: Connection refused
> > 2014-05-17 10:09:43,239 INFO
> > [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
> > (DefaultQuartzScheduler_Worker-63) Correlation ID: null, Call Stack:
> > null, Custom Event ID: -1, Message: State was set to Up for host srv11.
> >
> > 2014-05-17 10:09:23,991 INFO
> > [org.ovirt.engine.core.vdsbroker.VdsManager]
> > (DefaultQuartzScheduler_Worker-50) Server failed to respond, vds_id =
> > eac8d084-1be0-4e8d-a589-385f71fe3afa, vds_name = srv12, vm_count = 5,
> > spm_status = Contending, non-responsive_timeout (seconds) = 82, error
> > = java.net.ConnectException: Connection refused
> > 2014-05-17 10:09:43,320 INFO
> > [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
> > (DefaultQuartzScheduler_Worker-60) Correlation ID: null, Call Stack:
> > null, Custom Event ID: -1, Message: State was set to Up for host srv12.
> >
> > 2014-05-17 10:09:23,987 INFO
> > [org.ovirt.engine.core.vdsbroker.VdsManager]
> > (DefaultQuartzScheduler_Worker-49) Server failed to respond, vds_id =
> > 712e822b-d3c6-4cb7-a6f1-58da5039bda1, vds_name = srv13, vm_count = 4,
> > spm_status = None, non-responsive_timeout (seconds) = 62, error =
> > java.net.ConnectException: Connection refused
> > 2014-05-17 10:09:44,934 INFO
> > [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
> > (DefaultQuartzScheduler_Worker-58) [77463af9] Correlation ID: null,
> > Call Stack: null, Custom Event ID: -1, Message: State was set to Up
> > for host srv13.
> >
> > 2014-05-17 10:09:24,016 INFO
> > [org.ovirt.engine.core.vdsbroker.VdsManager]
> > (DefaultQuartzScheduler_Worker-53) Server failed to respond, vds_id =
> > 9d99c8eb-754e-4446-8a0f-af7327d19989, vds_name = srv14, vm_count = 0,
> > spm_status = None, non-responsive_timeout (seconds) = 60, error =
> > java.net.ConnectException: Connection refused
> > 2014-05-17 10:09:43,930 INFO
> > [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
> > (DefaultQuartzScheduler_Worker-61) Correlation ID: null, Call Stack:
> > null, Custom Event ID: -1, Message: State was set to Up for host srv14.
> >
> > 2014-05-17 10:09:24,019 INFO
> > [org.ovirt.engine.core.vdsbroker.VdsManager]
> > (DefaultQuartzScheduler_Worker-54) Server failed to respond, vds_id =
> > 3ca82f5b-1343-451f-81d0-d2ed111b5f51, vds_name = srv15, vm_count = 0,
> > spm_status = None, non-responsive_timeout (seconds) = 60, error =
> > java.net.ConnectException: Connection refused
> > 2014-05-17 10:09:40,285 INFO
> > [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
> > (DefaultQuartzScheduler_Worker-50) Correlation ID: null, Call Stack:
> > null, Custom Event ID: -1, Message: State was set to Up for host srv15.
> >
> > 2014-05-17 10:09:23,996 INFO
> > [org.ovirt.engine.core.vdsbroker.VdsManager]
> > (DefaultQuartzScheduler_Worker-45) Server failed to respond, vds_id =
> > c9d9735b-d75f-4026-afcd-94491ee66a4f, vds_name = srv16, vm_count = 0,
> > spm_status = None, non-responsive_timeout (seconds) = 60, error =
> > java.net.ConnectException: Connection refused
> > 2014-05-17 10:09:40,188 INFO
> > [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
> > (DefaultQuartzScheduler_Worker-47) Correlation ID: null, Call Stack:
> > null, Custom Event ID: -1, Message: State was set to Up for host srv16
> >
> > 2014-05-17 10:09:23,963 INFO
> > [org.ovirt.engine.core.vdsbroker.VdsManager]
> > (DefaultQuartzScheduler_Worker-47) Server failed to respond, vds_id =
> > 8daa3751-f347-41eb-b6a6-a27f4fda2def, vds_name = srv17, vm_count = 0,
> > spm_status = None, non-responsive_timeout (seconds) = 60, error =
> > java.net.ConnectException: Connection refused
> > 2014-05-17 10:09:46,358 INFO
> > [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
> > (DefaultQuartzScheduler_Worker-75) Correlation ID: null, Call Stack:
> > null, Custom Event ID: -1, Message: State was set to Up for host srv17.
> >
> > 2014-05-17 10:09:23,949 INFO
> > [org.ovirt.engine.core.vdsbroker.VdsManager]
> > (DefaultQuartzScheduler_Worker-46) Server failed to respond, vds_id =
> > 2e833c27-fd9f-4750-ab84-2fb629b66671, vds_name = srv18, vm_count = 0,
> > spm_status = None, non-responsive_timeout (seconds) = 60, error =
> > java.net.ConnectException: Connection refused
> > 2014-05-17 10:10:25,643 INFO
> > [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
> > (DefaultQuartzScheduler_Worker-7) Correlation ID: null, Call Stack:
> > null, Custom Event ID: -1, Message: State was set to Up for host srv18.
> >
> > 2014-05-17 10:09:23,991 INFO
> > [org.ovirt.engine.core.vdsbroker.VdsManager]
> > (DefaultQuartzScheduler_Worker-52) Server failed to respond, vds_id =
> > e7cedec5-9335-45bf-ae58-0744a9b93f41, vds_name = srv19, vm_count = 3,
> > spm_status = None, non-responsive_timeout (seconds) = 61, error =
> > java.net.ConnectException: Connection refused
> > 2014-05-17 10:09:40,260 INFO
> > [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
> > (DefaultQuartzScheduler_Worker-54) Correlation ID: null, Call Stack:
> > null, Custom Event ID: -1, Message: State was set to Up for host srv19.
> >
> >>> 2014-05-17 10:11:56,946 INFO
> >>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
> >>> (DefaultQuartzScheduler_Worker-75) [dbc315d] Correla
> >>> tion ID: 73469dd6, Call Stack: null, Custom Event ID: -1, Message: Host
> >>> srv17 power management was verified successfully.
> >>>
> >>> in fact, at ~17:55 my colleague restarted engine, forced spm selection
> >>> and started all vms - all started without errors, and he didn't had to
> >>> click 'confirm host has been rebooted'
> >> But here you had restarted the VMs on the original Host in which they
> >> were running after the host is UP, this will work
> >> The problem is to run a HA VM on another host, to do so, we must
> >> insure that the Host that run the VMs until now is not running then
> >> anymore
> >> If the fence operation is failed, we can say nothing on the Host
> >> status and then as I said you should manually approve that the host
> >> was rebooted in order
> >> to prevent VMs to run on multiple hosts which will lead to data
> >> corruption
> > but shouldn't engine restart corresponded vms after holding host came
> > up? (without manual fence)
> > because they up - so engine can query them about running/not running
> > vms and get actual state of vms - running or not
> > the only host were down at that point is srv5, which holded only 1 vm
> > - and it were correctly put in 'unknown' state, other vms were just
> > 'down' until we manually started them
> 
>