
19 May 2014, 5:25 p.m.
On 05/19/2014 05:10 PM, Eli Mesika wrote:
>
> ----- Original Message -----
>> From: "Yuriy Demchenko" <demchenko.ya@gmail.com>
>> To: "Eli Mesika" <emesika@redhat.com>
>> Cc: users@ovirt.org
>> Sent: Monday, May 19, 2014 3:56:41 PM
>> Subject: Re: [ovirt-users] power outage: HA vms not restarted
>>
>> I saw another strange moment in the logs:
>>
>> For example, according to the logs the engine tried to restart the VM
>> 'prod.gui' on host srv19 at 00:23:11, and the attempt failed with the
>> error 'Failed to run VM prod.gui: VdcBLLException:
>> java.lang.NullPointerException (Failed with error ENGINE and code 5001)'.
>
> This is a bug and you should open a BZ for it and attach the related
> engine & vdsm logs.

Done: https://bugzilla.redhat.com/show_bug.cgi?id=1099134

>> In vdsm.log on srv19 I can see this error:
>>
>> StoragePoolUnknown: Unknown pool id, pool not connected: ('5849b030-626e-47cb-ad90-3ce782d831b3',)
>> Thread-18::DEBUG::2014-05-17 00:23:13,022::task::869::TaskManager.Task::(_run) Task=`27a323de-ac42-4d68-992f-eb5c84d8baea`::Task._run: 27a323de-ac42-4d68-992f-eb5c84d8baea ('5849b030-626e-47cb-ad90-3ce782d831b3',) {} failed - stopping task
>> Thread-18::DEBUG::2014-05-17 00:23:13,022::task::1194::TaskManager.Task::(stop) Task=`27a323de-ac42-4d68-992f-eb5c84d8baea`::stopping in state preparing (force False)
>> Thread-18::DEBUG::2014-05-17 00:23:13,023::task::974::TaskManager.Task::(_decref) Task=`27a323de-ac42-4d68-992f-eb5c84d8baea`::ref 1 aborting True
>> Thread-18::INFO::2014-05-17 00:23:13,023::task::1151::TaskManager.Task::(prepare) Task=`27a323de-ac42-4d68-992f-eb5c84d8baea`::aborting: Task is aborted: 'Unknown pool id, pool not connected' - code 309
>> Thread-18::DEBUG::2014-05-17 00:23:13,024::task::1156::TaskManager.Task::(prepare) Task=`27a323de-ac42-4d68-992f-eb5c84d8baea`::Prepare: aborted: Unknown pool id, pool not connected
>> Thread-18::DEBUG::2014-05-17 00:23:13,024::task::974::TaskManager.Task::(_decref) Task=`27a323de-ac42-4d68-992f-eb5c84d8baea`::ref 0 aborting True
>>
>> According to the engine database, storage pool
>> '5849b030-626e-47cb-ad90-3ce782d831b3' is datacenter 'CODD'.
>>
>> But according to the engine logs, SPM start for datacenter CODD was run
>> only at 00:23:27, i.e. _after_ the VM restart attempts.
>
> How did you conclude that?

Erm, am I wrong here? The restart attempt, per the logs above, was at
00:23:11; the SPM start, per the logs below, was at 00:23:27. Or am I
wrong in assuming that an SPM is needed for a VM to start?

>> 2014-05-17 00:23:27,585 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (DefaultQuartzScheduler_Worker-43) [1bb8f3c0] starting spm on vds srv12, storage pool CODD, prevId 9, LVER 11
>> 2014-05-17 00:23:27,610 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand] (DefaultQuartzScheduler_Worker-43) [1bb8f3c0] START, SpmStartVDSCommand(HostName = srv12, HostId = eac8d084-1be0-4e8d-a589-385f71fe3afa, storagePoolId = 5849b030-626e-47cb-ad90-3ce782d831b3, prevId=9, prevLVER=11, storagePoolFormatType=V3, recoveryMode=Manual, SCSIFencing=false), log id: 1844d391
>>
>> I thought no VM could be started until an SPM host was selected; why,
>> in my case, did the VM restart attempts happen before that?
>> And what does 'pool not connected' mean? No SPM running, storage
>> unavailable from some host, or something else?
>
> In this case, storage unavailable from some host....

Shouldn't that host then be set into the 'non-operational' state and the
restart attempt be carried out by another host?
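For reference, the pool-connection state can be checked directly on a
host with vdsClient. A minimal sketch, assuming a 3.x-era vdsm running
with SSL (verify the verb names against your vdsm version):

    # On the suspect host (e.g. srv19): list the storage pools this
    # vdsm instance is currently connected to. An empty list while the
    # engine is already scheduling VMs on the host matches the
    # StoragePoolUnknown ('pool not connected') error quoted above.
    vdsClient -s 0 getConnectedStoragePoolsList

    # Cross-check which storage domains the host can actually see.
    vdsClient -s 0 getStorageDomainsList

If pool id 5849b030-626e-47cb-ad90-3ce782d831b3 is missing from the
first list, the host was simply not yet connected to the pool when the
engine tried to run the VM there.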
>
>> Yuriy Demchenko
>>
>> On 05/19/2014 03:13 PM, Yuriy Demchenko wrote:
>>> On 05/19/2014 02:36 PM, Eli Mesika wrote:
>>>>
>>>> Hi,
>>>>
>>>> I'm running ovirt-3.2.2-el6 on 18 el6 hosts with FC SAN storage and
>>>> 46 HA VMs in 2 datacenters (3 hosts use different storage with no
>>>> connectivity to the first storage, hence the second DC).
>>>> Recently (2014-05-17) I had a double power outage: the first
>>>> blackout at 00:16, power back at ~00:19; the second blackout at
>>>> 00:26, power back at 10:06.
>>>> When finally everything went up (after approx. 10:16), only 2 VMs
>>>> out of 46 were restarted.
>>>> From browsing the engine log I saw failed restart attempts of
>>>> almost all VMs after the first blackout, with the error 'Failed
>>>> with error ENGINE and code 5001'; but after the second blackout I
>>>> saw no attempts to restart VMs, and the only error was 'connect
>>>> timeout' (probably to srv5 - that host physically died after the
>>>> blackouts).
>>>> I can't figure out why the HA VMs were not restarted. Please advise.
>>>>
>>>> Engine and (supposedly) SPM host logs attached.
>>>>>> Hi Yuriy
>>>>>>
>>>>>> What I see is that the log for 2014-05-17 starts at 2014-05-17
>>>>>> 00:23:03, so I cannot track the first interval you mentioned
>>>>>> (00:19 to 00:26).
>>>>> 00:23 is the time when the engine booted up after the first
>>>>> outage; that's why the logs start at 00:23:03.
>>>>>> However, I can clearly see that at 2014-05-17 00:23:03 the engine
>>>>>> was restarted, and at 2014-05-17 00:23:09,423 we started to get
>>>>>> connection errors.
>>>>>> We tried to solve the problem by soft-fencing (actually a vdsmd
>>>>>> service restart) the problematic hosts, but ssh to the host
>>>>>> failed, so we tried to hard-fence the host (restart/reboot); this
>>>>>> fell within the configurable "quiet time", during which fencing
>>>>>> operations after an engine restart are prevented and which is set
>>>>>> by default to 5 min (the DisableFenceAtStartupInSec key in
>>>>>> engine-config), and therefore we skipped the fencing operation...
>>>>> My hosts usually boot slower than the engine - long BIOS checks
>>>>> plus a random power-on delay (120 s) - that's why the engine
>>>>> reports connect errors at first. However, in the logs I see that
>>>>> some of them were already up and the engine successfully contacted
>>>>> them:
>>>> So maybe you should set DisableFenceAtStartupInSec to a higher
>>>> value instead of the default 5 min...
>>>>
>>>>> 2014-05-17 00:23:10,450 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-18) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: State was set to Up for host srv11.
>>>>> 2014-05-17 00:23:10,456 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-4) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: State was set to Up for host srv4.
>>>>> 2014-05-17 00:23:10,458 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-11) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: State was set to Up for host srv7.
>>>>> 2014-05-17 00:23:10,460 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-20) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: State was set to Up for host srv9.
>>>>>
>>>>> And after 00:23:11 I saw no fencing-related messages, only VM
>>>>> restart attempts that failed with strange errors like:
>>>>> 'Failed with error ENGINE and code 5001'
>>>>> 'Candidate host srv1 (2a89e565-aa4e-4a19-82e3-e72e4edee111) was
>>>>> filtered out by VAR__FILTERTYPE__INTERNAL filter Memory'
>>>>> 'CanDoAction of action RunVm failed. Reasons:VAR__ACTION__RUN,VAR__TYPE__VM,ACTION_TYPE_FAILED_VDS_VM_MEMORY,VAR__ACTION__RUN,VAR__TYPE__VM,ACTION_TYPE_FAILED_VDS_VM_MEMORY,SCHEDULING_ALL_HOSTS_FILTERED_OUT,VAR__FILTERTYPE__INTERNAL,$hostName srv1,$filterName Memory,SCHEDULING_HOST_FILTERED_REASON'
>>>> Strange that you didn't get the message itself (this is only the
>>>> message key). The original message is:
>>>> Cannot Run VM. There are no available running Hosts with sufficient
>>>> memory in VM's Cluster.
>>>> So it failed in the RunVm command validation, where no host with
>>>> enough memory to run the VM was found.
>>> Yeah, but how can that be? All my hosts have 256 GB RAM and only 2
>>> VMs are RAM-hungry (128 GB and 240 GB); the others use much less
>>> (2-64 GB).
>>>>>> For the first period, as I said, I can only guess that one of the
>>>>>> fencing attempts on your hosts happened after those 5 minutes, so
>>>>>> that host was rebooted and its HA VMs were freed to run on another
>>>>>> host.
>>>>>> For the second period, for which I have logs, the host fencing
>>>>>> failed due to the required "quiet time", and in this situation the
>>>>>> only thing you can do in order to have the HA VMs running again is
>>>>>> to right-click on each host and press "Confirm that host has been
>>>>>> rebooted".
>>>>> But I see in the logs that after the second period, at 10:14+, all
>>>>> hosts but one (srv5) were up and power management was verified
>>>>> successfully; shouldn't that be enough for the engine to verify
>>>>> that all HA VMs are down and restart them?
>>>> How should the engine know that the host was rebooted??
>>>> The fact that power management was verified successfully is not
>>>> enough in order to run the VMs on another host.
>>>> As I see it, the fence commands intended to reboot the hosts holding
>>>> the VMs failed...
>>> But about 30 seconds after the engine set the holding hosts to
>>> non-responsive, they all went up ('State was set to Up for host ...').
>>> So the engine can contact those hosts, query the VM status, and
>>> restart VMs if needed, shouldn't it?
>>>
>>> 2014-05-17 10:09:23,978 INFO [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-48) Server failed to respond, vds_id = 7bf30c8b-7eb2-4b1e-ba0e-13093bca57bf, vds_name = srv9, vm_count = 0, spm_status = None, non-responsive_timeout (seconds) = 60, error = java.net.ConnectException: Connection refused
>>> 2014-05-17 10:09:40,143 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-42) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: State was set to Up for host srv9.
>>>
>>> 2014-05-17 10:09:23,980 INFO [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-51) Server failed to respond, vds_id = 57f58a56-a7ad-4917-b32d-f5dfaf326419, vds_name = srv10, vm_count = 0, spm_status = None, non-responsive_timeout (seconds) = 60, error = java.net.ConnectException: Connection refused
>>> 2014-05-17 10:09:43,214 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-64) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: State was set to Up for host srv10.
>>>
>>> 2014-05-17 10:09:24,029 INFO [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-55) Server failed to respond, vds_id = bd07bc6d-37e8-4b55-810c-f41c78f7a8ad, vds_name = srv11, vm_count = 0, spm_status = None, non-responsive_timeout (seconds) = 60, error = java.net.ConnectException: Connection refused
>>> 2014-05-17 10:09:43,239 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-63) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: State was set to Up for host srv11.
>>>
>>> 2014-05-17 10:09:23,991 INFO [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-50) Server failed to respond, vds_id = eac8d084-1be0-4e8d-a589-385f71fe3afa, vds_name = srv12, vm_count = 5, spm_status = Contending, non-responsive_timeout (seconds) = 82, error = java.net.ConnectException: Connection refused
>>> 2014-05-17 10:09:43,320 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-60) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: State was set to Up for host srv12.
>>>
>>> 2014-05-17 10:09:23,987 INFO [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-49) Server failed to respond, vds_id = 712e822b-d3c6-4cb7-a6f1-58da5039bda1, vds_name = srv13, vm_count = 4, spm_status = None, non-responsive_timeout (seconds) = 62, error = java.net.ConnectException: Connection refused
>>> 2014-05-17 10:09:44,934 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-58) [77463af9] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: State was set to Up for host srv13.
>>>
>>> 2014-05-17 10:09:24,016 INFO [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-53) Server failed to respond, vds_id = 9d99c8eb-754e-4446-8a0f-af7327d19989, vds_name = srv14, vm_count = 0, spm_status = None, non-responsive_timeout (seconds) = 60, error = java.net.ConnectException: Connection refused
>>> 2014-05-17 10:09:43,930 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-61) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: State was set to Up for host srv14.
>>>
>>> 2014-05-17 10:09:24,019 INFO [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-54) Server failed to respond, vds_id = 3ca82f5b-1343-451f-81d0-d2ed111b5f51, vds_name = srv15, vm_count = 0, spm_status = None, non-responsive_timeout (seconds) = 60, error = java.net.ConnectException: Connection refused
>>> 2014-05-17 10:09:40,285 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-50) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: State was set to Up for host srv15.
>>>
>>> 2014-05-17 10:09:23,996 INFO [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-45) Server failed to respond, vds_id = c9d9735b-d75f-4026-afcd-94491ee66a4f, vds_name = srv16, vm_count = 0, spm_status = None, non-responsive_timeout (seconds) = 60, error = java.net.ConnectException: Connection refused
>>> 2014-05-17 10:09:40,188 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-47) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: State was set to Up for host srv16
>>>
>>> 2014-05-17 10:09:23,963 INFO [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-47) Server failed to respond, vds_id = 8daa3751-f347-41eb-b6a6-a27f4fda2def, vds_name = srv17, vm_count = 0, spm_status = None, non-responsive_timeout (seconds) = 60, error = java.net.ConnectException: Connection refused
>>> 2014-05-17 10:09:46,358 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-75) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: State was set to Up for host srv17.
>>>
>>> 2014-05-17 10:09:23,949 INFO [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-46) Server failed to respond, vds_id = 2e833c27-fd9f-4750-ab84-2fb629b66671, vds_name = srv18, vm_count = 0, spm_status = None, non-responsive_timeout (seconds) = 60, error = java.net.ConnectException: Connection refused
>>> 2014-05-17 10:10:25,643 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-7) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: State was set to Up for host srv18.
>>>
>>> 2014-05-17 10:09:23,991 INFO [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-52) Server failed to respond, vds_id = e7cedec5-9335-45bf-ae58-0744a9b93f41, vds_name = srv19, vm_count = 3, spm_status = None, non-responsive_timeout (seconds) = 61, error = java.net.ConnectException: Connection refused
>>> 2014-05-17 10:09:40,260 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-54) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: State was set to Up for host srv19.
>>>
>>>>> 2014-05-17 10:11:56,946 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-75) [dbc315d] Correlation ID: 73469dd6, Call Stack: null, Custom Event ID: -1, Message: Host srv17 power management was verified successfully.
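Regarding the 'filter Memory' failures quoted above: one quick way to
compare VM memory demands against host memory is to query the engine
database, as was already done earlier for the storage pool id. A
minimal sketch, assuming the default 'engine' database and a 3.2-era
schema (double-check the column names against your version):

    # On the engine machine: VM memory sizes, biggest first.
    sudo -u postgres psql engine -c \
        "SELECT vm_name, mem_size_mb FROM vm_static ORDER BY mem_size_mb DESC;"

    # Physical memory of each host, from the vds view.
    sudo -u postgres psql engine -c \
        "SELECT vds_name, physical_mem_mb FROM vds ORDER BY physical_mem_mb DESC;"

A 240 GB VM leaves little headroom on a 256 GB host once the other VMs
being restarted in the same scheduling pass are counted against it, so
the Memory filter can reject every candidate host even though the
hardware looks ample.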
>>>>>
>>>>> In fact, at ~17:55 my colleague restarted the engine, forced SPM
>>>>> selection, and started all the VMs - all started without errors,
>>>>> and he didn't have to click 'Confirm host has been rebooted'.
>>>> But here you restarted the VMs on the original host they had been
>>>> running on, after that host was up; this will work.
>>>> The problem is running an HA VM on another host; to do so, we must
>>>> ensure that the host that was running the VMs until now is not
>>>> running them anymore.
>>>> If the fence operation failed, we can say nothing about the host's
>>>> status, and then, as I said, you should manually approve that the
>>>> host was rebooted, in order to prevent VMs from running on multiple
>>>> hosts, which would lead to data corruption.
>>> But shouldn't the engine restart the corresponding VMs after the
>>> holding host comes up (without a manual fence)? Because the hosts
>>> are up, the engine can query them about running/not-running VMs and
>>> get the actual state of the VMs.
>>> The only host that was down at that point was srv5, which held only
>>> 1 VM - and it was correctly put into the 'Unknown' state; the other
>>> VMs were just 'Down' until we started them manually.
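For the record, the quiet-time workaround suggested earlier in the
thread is a one-line change. A minimal sketch, assuming it is run on
the engine machine of a 3.x setup (600 seconds is an illustrative
value; pick one longer than your hosts' worst-case boot time, BIOS
checks and power-on delay included):

    # Show the current value (the default is 300 seconds, i.e. 5 min).
    engine-config -g DisableFenceAtStartupInSec

    # Raise the post-startup fencing quiet time to 10 minutes.
    engine-config -s DisableFenceAtStartupInSec=600

    # The engine must be restarted for the new value to take effect.
    service ovirt-engine restart

With a quiet time that covers host boot-up, hosts that stay down past
it can be fenced normally, freeing their HA VMs to be restarted on
other hosts instead of the fence operation being skipped as happened
here.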