On 05/19/2014 05:10 PM, Eli Mesika wrote:
----- Original Message -----
> From: "Yuriy Demchenko" <demchenko.ya(a)gmail.com>
> To: "Eli Mesika" <emesika(a)redhat.com>
> Cc: users(a)ovirt.org
> Sent: Monday, May 19, 2014 3:56:41 PM
> Subject: Re: [ovirt-users] power outage: HA vms not restarted
>
> saw another strange moment in logs:
>
> for example, vm 'prod.gui': according to the logs, the engine tried to restart it
> on host srv19 at 00:23:11 and it failed with the error 'Failed to run VM
> prod.gui: VdcBLLException: java.lang.NullPointerException (Failed with
> error ENGINE and code 5001)'
This is a bug and you should open a BZ for it and attach the related engine & vdsm logs.
done
https://bugzilla.redhat.com/show_bug.cgi?id=1099134
> in vdsm.log on srv19 i can see this error:
> StoragePoolUnknown: Unknown pool id, pool not connected:
> ('5849b030-626e-47cb-ad90-3ce782d831b3',)
> Thread-18::DEBUG::2014-05-17
> 00:23:13,022::task::869::TaskManager.Task::(_run)
> Task=`27a323de-ac42-4d68-992f-eb5c84d8baea`::Task._run:
> 27a323de-ac42-4d68-992f-eb5c84d8baea
> ('5849b030-626e-47cb-ad90-3ce782d831b3',) {} failed - stopping task
> Thread-18::DEBUG::2014-05-17
> 00:23:13,022::task::1194::TaskManager.Task::(stop)
> Task=`27a323de-ac42-4d68-992f-eb5c84d8baea`::stopping in state preparing
> (force False)
> Thread-18::DEBUG::2014-05-17
> 00:23:13,023::task::974::TaskManager.Task::(_decref)
> Task=`27a323de-ac42-4d68-992f-eb5c84d8baea`::ref 1 aborting True
> Thread-18::INFO::2014-05-17
> 00:23:13,023::task::1151::TaskManager.Task::(prepare)
> Task=`27a323de-ac42-4d68-992f-eb5c84d8baea`::aborting: Task is aborted:
> 'Unknown pool id, pool not connected' - code 309
> Thread-18::DEBUG::2014-05-17
> 00:23:13,024::task::1156::TaskManager.Task::(prepare)
> Task=`27a323de-ac42-4d68-992f-eb5c84d8baea`::Prepare: aborted: Unknown
> pool id, pool not connected
> Thread-18::DEBUG::2014-05-17
> 00:23:13,024::task::974::TaskManager.Task::(_decref)
> Task=`27a323de-ac42-4d68-992f-eb5c84d8baea`::ref 0 aborting True
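> 
> For reference, these "code 309" aborts can be pulled out of vdsm.log
> mechanically. This is a hypothetical helper, not part of vdsm; the regex
> assumes the Thread::LEVEL::timestamp::... line layout shown above:

```python
import re

# Hypothetical helper (not part of vdsm): extract "Task is aborted" events,
# with their error codes, from vdsm.log lines like the ones quoted above.
ABORT_RE = re.compile(
    r"(?P<thread>\S+?)::INFO::(?P<ts>\d{4}-\d{2}-\d{2} [\d:,]+?)::"
    r".*Task=`(?P<task>[0-9a-f-]+)`::aborting: Task is aborted: "
    r"'(?P<reason>[^']+)' - code (?P<code>\d+)"
)

def find_aborts(lines):
    """Return (timestamp, task id, reason, code) for each aborted-task line."""
    hits = []
    for line in lines:
        m = ABORT_RE.search(line)
        if m:
            hits.append((m.group("ts"), m.group("task"),
                         m.group("reason"), int(m.group("code"))))
    return hits
```

> The regex layout is an assumption based only on the lines quoted in this
> thread; adjust it if your vdsm version formats log lines differently.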
>
> according to the engine database, storage pool
> '5849b030-626e-47cb-ad90-3ce782d831b3' belongs to datacenter 'CODD'.
>
> but - according to the engine logs, SPM start for datacenter CODD ran
> only at 00:23:27, i.e. _after_ the vm restart attempts
How did you conclude that?
erm, am i wrong here?
the restart attempt, according to the logs above, was at 00:23:11
SPM start, according to the logs below, was at 00:23:27
or am i wrong in my assumption that SPM is needed for VM start?
> 2014-05-17 00:23:27,585 INFO
> [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand]
> (DefaultQuartzScheduler_Worker-43) [1bb8f3c0] starting spm on vds srv12,
> storage pool CODD, prevId 9, LVER 11
> 2014-05-17 00:23:27,610 INFO
> [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand]
> (DefaultQuartzScheduler_Worker-43) [1bb8f3c0] START,
> SpmStartVDSCommand(HostName = srv12, HostId =
> eac8d084-1be0-4e8d-a589-385f71fe3afa, storagePoolId =
> 5849b030-626e-47cb-ad90-3ce782d831b3, prevId=9, prevLVER=11,
> storagePoolFormatType=V3, recoveryMode=Manual, SCSIFencing=false), log
> id: 1844d391
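> 
> The ordering can be checked directly from the two timestamps quoted above
> (the failed restart attempt vs. the SpmStart command); a trivial sketch:

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# Both timestamps are taken from the engine logs quoted in this thread.
restart_attempt = datetime.strptime("2014-05-17 00:23:11", FMT)
spm_start = datetime.strptime("2014-05-17 00:23:27", FMT)

gap = (spm_start - restart_attempt).total_seconds()
print(gap)  # 16.0 - the restart attempt did precede SPM start
```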
>
> i thought no VM can be started until an SPM host is selected, so why, in my
> case, did the vm restart occur before that?
> and what does 'pool not connected' mean? no SPM running, storage unavailable
> from some host, or something else?
In this case, storage was unavailable from some host...
shouldn't that host then
be set to the 'non-operational' state and the restart
attempt be carried out by another host?
>
> Yuriy Demchenko
>
> On 05/19/2014 03:13 PM, Yuriy Demchenko wrote:
>> On 05/19/2014 02:36 PM, Eli Mesika wrote:
>>>
>>> Hi,
>>>
>>> i'm running ovirt-3.2.2-el6 on 18 el6 hosts with FC SAN storage and 46 HA
>>> vms in 2 datacenters (3 hosts use a different storage with no
>>> connectivity to the first storage; that's why there is a second DC).
>>> Recently (2014-05-17) i had a double power outage: the first blackout at
>>> 00:16, power back at ~00:19; the second blackout at 00:26, power back at 10:06.
>>> When everything finally came up (after approx. 10:16), only 2 vms out of
>>> 46 were restarted.
>>> From browsing the engine log i saw failed restart attempts of almost
>>> all vms
>>> after the first blackout with the error 'Failed with error ENGINE and code
>>> 5001', but after the second blackout i saw no attempts to restart vms, and
>>> the only error was 'connect timeout' (probably to srv5 - that host
>>> physically died after the blackouts).
>>> And i can't figure out why the HA vms were not restarted. Please advise.
>>>
>>> engine and (supposedly) spm host logs in attach.
>>>>> Hi Yuriy
>>>>>
>>>>> What I see is that the log for 2014-05-17 starts at 2014-05-17
>>>>> 00:23:03,
>>>>> so I cannot track the first interval you had mentioned (00:19 to
>>>>> 00:26)
>>>> 00:23 is the time when the engine booted up after the first outage; that's
>>>> why the logs start at 00:23:03
>>>>> However, I can clearly see that at 2014-05-17 00:23:03 the engine was
>>>>> restarted, and at 2014-05-17 00:23:09,423 we started to get connection
>>>>> errors.
>>>>> We had tried to solve the problem by doing soft fencing (actually a vdsmd
>>>>> service restart) on the problematic hosts, but ssh to the host failed, so
>>>>> we tried to hard-fence the host (restart/reboot). This was blocked by the
>>>>> configurable "quiet time" in which we prevent fencing operations after an
>>>>> engine restart, which is set by default to 5 min (the
>>>>> DisableFenceAtStartupInSec key in engine-config), and therefore we
>>>>> skipped the fencing operation...
>>>> my hosts usually boot slower than the engine: long BIOS checks + a random
>>>> power-on delay (120s); that's why the engine reports connect errors at first.
>>>> however, in the logs i see that some of them were already up and the engine
>>>> successfully contacted them:
>>> So maybe you should set DisableFenceAtStartupInSec to a higher value
>>> instead of the default 5 min...
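>>> 
>>> Whether the default 5-minute window even outlasts a slow host boot can be
>>> sanity-checked with back-of-the-envelope numbers (the BIOS/OS figure below
>>> is an assumed placeholder - measure your own hosts):

```python
# Back-of-the-envelope check: does the fencing "quiet time" window outlast
# the slowest host boot? The 120s power-on delay comes from this thread;
# the BIOS/OS boot figure is an assumed placeholder, not a measurement.
power_on_delay = 120      # max random power-on delay (seconds)
bios_and_os_boot = 240    # assumed BIOS checks + OS boot (seconds)
quiet_time = 300          # DisableFenceAtStartupInSec default: 5 min

worst_case = power_on_delay + bios_and_os_boot
print(worst_case > quiet_time)  # True - worth raising DisableFenceAtStartupInSec
```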
>>>
>>>> 2014-05-17 00:23:10,450 INFO
>>>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
>>>> (DefaultQuartzScheduler_Worker-18) Correlation ID: null, Call Stack:
>>>> null, Custom Event ID: -1, Message: State was set to Up for host srv11.
>>>> 2014-05-17 00:23:10,456 INFO
>>>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
>>>> (DefaultQuartzScheduler_Worker-4) Correlation ID: null, Call Stack:
>>>> null, Custom Event ID: -1, Message: State was set to Up for host srv4.
>>>> 2014-05-17 00:23:10,458 INFO
>>>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
>>>> (DefaultQuartzScheduler_Worker-11) Correlation ID: null, Call Stack:
>>>> null, Custom Event ID: -1, Message: State was set to Up for host srv7.
>>>> 2014-05-17 00:23:10,460 INFO
>>>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
>>>> (DefaultQuartzScheduler_Worker-20) Correlation ID: null, Call Stack:
>>>> null, Custom Event ID: -1, Message: State was set to Up for host srv9.
>>>>
>>>> and after 00:23:11 i saw no fencing-related messages, only vm restart
>>>> attempts that failed with strange errors like:
>>>> 'Failed with error ENGINE and code 5001'
>>>> 'Candidate host srv1 (2a89e565-aa4e-4a19-82e3-e72e4edee111) was
>>>> filtered
>>>> out by VAR__FILTERTYPE__INTERNAL filter Memory'
>>>> 'CanDoAction of action RunVm failed.
>>>>
Reasons:VAR__ACTION__RUN,VAR__TYPE__VM,ACTION_TYPE_FAILED_VDS_VM_MEMORY,VAR__ACTION__RUN,VAR__TYPE__VM,ACTION_TYPE_FAILED_VDS_VM_MEMORY,SCHEDULING_ALL_HOSTS_FILTERED_OUT,VAR__FILTERTYPE__INTERNAL,$hostName
>>>>
>>>> srv1,$filterName Memory,SCHEDULING_HOST_FILTERED_REASON'
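>>>> 
>>>> As an aside, that raw "Reasons" string is just a comma-separated list of
>>>> message keys plus $variable substitutions; a hypothetical sketch of
>>>> decoding it (not engine code):

```python
# Hypothetical sketch (not engine code): split the raw CanDoAction "Reasons"
# string into its message keys and $variable substitutions to see which
# scheduling filter fired and for which host.
reasons = ("VAR__ACTION__RUN,VAR__TYPE__VM,ACTION_TYPE_FAILED_VDS_VM_MEMORY,"
           "VAR__ACTION__RUN,VAR__TYPE__VM,ACTION_TYPE_FAILED_VDS_VM_MEMORY,"
           "SCHEDULING_ALL_HOSTS_FILTERED_OUT,VAR__FILTERTYPE__INTERNAL,"
           "$hostName srv1,$filterName Memory,SCHEDULING_HOST_FILTERED_REASON")

tokens = reasons.split(",")
keys = [t for t in tokens if not t.startswith(("VAR__", "$"))]
subs = dict(t[1:].split(" ", 1) for t in tokens if t.startswith("$"))
print(keys)
print(subs)  # {'hostName': 'srv1', 'filterName': 'Memory'}
```

>>>> which makes the Memory-filter rejection of srv1 visible at a glance.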
>>> Strange that you didn't get the message itself (this is only the
>>> message key).
>>> The original message is:
>>> Cannot Run VM. There are no available running Hosts with sufficient
>>> memory in VM's Cluster.
>>> So, it failed on the RunVm command validation, where no host with
>>> enough memory to run the VM was found.
>> yeah, but how can that be? all my hosts have 256gb ram and only 2 vms
>> are ram-hungry (128gb and 240gb); the others use much less (2-64gb)
>>>>> For the first period, as I said, I can only guess that one of your hosts'
>>>>> fencing attempts came after those 5 minutes, and therefore it was rebooted
>>>>> and the HA VMs were freed to run on another host.
>>>>> For the second period, for which I have logs, the host fencing failed
>>>>> due to the required "quiet time", and in this situation the only thing
>>>>> you can do in order to have the HA VMs running again is to
>>>>> right-click on each host and press "Confirm that host has been
>>>>> rebooted"
>>>> but i see in the logs that after the second period, at 10:14+, all hosts
>>>> but one (srv5) were up and power management was verified successfully;
>>>> isn't that enough for the engine to verify all HA vms are down and
>>>> restart them?
>>> How should the engine know that the host was rebooted??
>>> The fact that power management was verified successfully is not enough
>>> to run the VMs on another host.
>>> As I see it, the fence commands that were intended to reboot the hosts
>>> holding the VMs failed...
>> but about 30 seconds after the engine set the holding hosts to
>> non-responsive, they all came up - 'state was set to up for host'
>> so the engine can contact those hosts and query vm status and restart if
>> needed, shouldn't it?
>>
>> 2014-05-17 10:09:23,978 INFO
>> [org.ovirt.engine.core.vdsbroker.VdsManager]
>> (DefaultQuartzScheduler_Worker-48) Server failed to respond, vds_id =
>> 7bf30c8b-7eb2-4b1e-ba0e-13093bca57bf, vds_name = srv9, vm_count = 0,
>> spm_status = None, non-responsive_timeout (seconds) = 60, error =
>> java.net.ConnectException: Connection refused
>> 2014-05-17 10:09:40,143 INFO
>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
>> (DefaultQuartzScheduler_Worker-42) Correlation ID: null, Call Stack:
>> null, Custom Event ID: -1, Message: State was set to Up for host srv9.
>>
>> 2014-05-17 10:09:23,980 INFO
>> [org.ovirt.engine.core.vdsbroker.VdsManager]
>> (DefaultQuartzScheduler_Worker-51) Server failed to respond, vds_id =
>> 57f58a56-a7ad-4917-b32d-f5dfaf326419, vds_name = srv10, vm_count = 0,
>> spm_status = None, non-responsive_timeout (seconds) = 60, error =
>> java.net.ConnectException: Connection refused
>> 2014-05-17 10:09:43,214 INFO
>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
>> (DefaultQuartzScheduler_Worker-64) Correlation ID: null, Call Stack:
>> null, Custom Event ID: -1, Message: State was set to Up for host srv10.
>>
>> 2014-05-17 10:09:24,029 INFO
>> [org.ovirt.engine.core.vdsbroker.VdsManager]
>> (DefaultQuartzScheduler_Worker-55) Server failed to respond, vds_id =
>> bd07bc6d-37e8-4b55-810c-f41c78f7a8ad, vds_name = srv11, vm_count = 0,
>> spm_status = None, non-responsive_timeout (seconds) = 60, error =
>> java.net.ConnectException: Connection refused
>> 2014-05-17 10:09:43,239 INFO
>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
>> (DefaultQuartzScheduler_Worker-63) Correlation ID: null, Call Stack:
>> null, Custom Event ID: -1, Message: State was set to Up for host srv11.
>>
>> 2014-05-17 10:09:23,991 INFO
>> [org.ovirt.engine.core.vdsbroker.VdsManager]
>> (DefaultQuartzScheduler_Worker-50) Server failed to respond, vds_id =
>> eac8d084-1be0-4e8d-a589-385f71fe3afa, vds_name = srv12, vm_count = 5,
>> spm_status = Contending, non-responsive_timeout (seconds) = 82, error
>> = java.net.ConnectException: Connection refused
>> 2014-05-17 10:09:43,320 INFO
>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
>> (DefaultQuartzScheduler_Worker-60) Correlation ID: null, Call Stack:
>> null, Custom Event ID: -1, Message: State was set to Up for host srv12.
>>
>> 2014-05-17 10:09:23,987 INFO
>> [org.ovirt.engine.core.vdsbroker.VdsManager]
>> (DefaultQuartzScheduler_Worker-49) Server failed to respond, vds_id =
>> 712e822b-d3c6-4cb7-a6f1-58da5039bda1, vds_name = srv13, vm_count = 4,
>> spm_status = None, non-responsive_timeout (seconds) = 62, error =
>> java.net.ConnectException: Connection refused
>> 2014-05-17 10:09:44,934 INFO
>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
>> (DefaultQuartzScheduler_Worker-58) [77463af9] Correlation ID: null,
>> Call Stack: null, Custom Event ID: -1, Message: State was set to Up
>> for host srv13.
>>
>> 2014-05-17 10:09:24,016 INFO
>> [org.ovirt.engine.core.vdsbroker.VdsManager]
>> (DefaultQuartzScheduler_Worker-53) Server failed to respond, vds_id =
>> 9d99c8eb-754e-4446-8a0f-af7327d19989, vds_name = srv14, vm_count = 0,
>> spm_status = None, non-responsive_timeout (seconds) = 60, error =
>> java.net.ConnectException: Connection refused
>> 2014-05-17 10:09:43,930 INFO
>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
>> (DefaultQuartzScheduler_Worker-61) Correlation ID: null, Call Stack:
>> null, Custom Event ID: -1, Message: State was set to Up for host srv14.
>>
>> 2014-05-17 10:09:24,019 INFO
>> [org.ovirt.engine.core.vdsbroker.VdsManager]
>> (DefaultQuartzScheduler_Worker-54) Server failed to respond, vds_id =
>> 3ca82f5b-1343-451f-81d0-d2ed111b5f51, vds_name = srv15, vm_count = 0,
>> spm_status = None, non-responsive_timeout (seconds) = 60, error =
>> java.net.ConnectException: Connection refused
>> 2014-05-17 10:09:40,285 INFO
>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
>> (DefaultQuartzScheduler_Worker-50) Correlation ID: null, Call Stack:
>> null, Custom Event ID: -1, Message: State was set to Up for host srv15.
>>
>> 2014-05-17 10:09:23,996 INFO
>> [org.ovirt.engine.core.vdsbroker.VdsManager]
>> (DefaultQuartzScheduler_Worker-45) Server failed to respond, vds_id =
>> c9d9735b-d75f-4026-afcd-94491ee66a4f, vds_name = srv16, vm_count = 0,
>> spm_status = None, non-responsive_timeout (seconds) = 60, error =
>> java.net.ConnectException: Connection refused
>> 2014-05-17 10:09:40,188 INFO
>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
>> (DefaultQuartzScheduler_Worker-47) Correlation ID: null, Call Stack:
>> null, Custom Event ID: -1, Message: State was set to Up for host srv16
>>
>> 2014-05-17 10:09:23,963 INFO
>> [org.ovirt.engine.core.vdsbroker.VdsManager]
>> (DefaultQuartzScheduler_Worker-47) Server failed to respond, vds_id =
>> 8daa3751-f347-41eb-b6a6-a27f4fda2def, vds_name = srv17, vm_count = 0,
>> spm_status = None, non-responsive_timeout (seconds) = 60, error =
>> java.net.ConnectException: Connection refused
>> 2014-05-17 10:09:46,358 INFO
>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
>> (DefaultQuartzScheduler_Worker-75) Correlation ID: null, Call Stack:
>> null, Custom Event ID: -1, Message: State was set to Up for host srv17.
>>
>> 2014-05-17 10:09:23,949 INFO
>> [org.ovirt.engine.core.vdsbroker.VdsManager]
>> (DefaultQuartzScheduler_Worker-46) Server failed to respond, vds_id =
>> 2e833c27-fd9f-4750-ab84-2fb629b66671, vds_name = srv18, vm_count = 0,
>> spm_status = None, non-responsive_timeout (seconds) = 60, error =
>> java.net.ConnectException: Connection refused
>> 2014-05-17 10:10:25,643 INFO
>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
>> (DefaultQuartzScheduler_Worker-7) Correlation ID: null, Call Stack:
>> null, Custom Event ID: -1, Message: State was set to Up for host srv18.
>>
>> 2014-05-17 10:09:23,991 INFO
>> [org.ovirt.engine.core.vdsbroker.VdsManager]
>> (DefaultQuartzScheduler_Worker-52) Server failed to respond, vds_id =
>> e7cedec5-9335-45bf-ae58-0744a9b93f41, vds_name = srv19, vm_count = 3,
>> spm_status = None, non-responsive_timeout (seconds) = 61, error =
>> java.net.ConnectException: Connection refused
>> 2014-05-17 10:09:40,260 INFO
>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
>> (DefaultQuartzScheduler_Worker-54) Correlation ID: null, Call Stack:
>> null, Custom Event ID: -1, Message: State was set to Up for host srv19.
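>> 
>> To put numbers on "about 30 seconds": the gap between "Server failed to
>> respond" and "State was set to Up" can be computed per host from the
>> excerpts above (three hosts shown as a sketch):

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S,%f"

# (failed-to-respond, set-to-Up) timestamp pairs taken from the log
# excerpts quoted above; the other hosts can be added the same way.
events = {
    "srv9": ("2014-05-17 10:09:23,978", "2014-05-17 10:09:40,143"),
    "srv12": ("2014-05-17 10:09:23,991", "2014-05-17 10:09:43,320"),
    "srv18": ("2014-05-17 10:09:23,949", "2014-05-17 10:10:25,643"),
}

for host, (down, up) in events.items():
    gap = (datetime.strptime(up, FMT) - datetime.strptime(down, FMT)).total_seconds()
    print(f"{host}: back up after {gap:.0f}s")
```

>> so every host but srv5 was reachable again within roughly a minute.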
>>
>>>> 2014-05-17 10:11:56,946 INFO
>>>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
>>>> (DefaultQuartzScheduler_Worker-75) [dbc315d] Correlation ID: 73469dd6,
>>>> Call Stack: null, Custom Event ID: -1, Message: Host
>>>> srv17 power management was verified successfully.
>>>>
>>>> in fact, at ~17:55 my colleague restarted the engine, forced SPM selection
>>>> and started all vms - all started without errors, and he didn't have to
>>>> click 'confirm host has been rebooted'
>>> But here you restarted the VMs on the original host on which they
>>> were running, after the host was UP; this will work.
>>> The problem is running an HA VM on another host; to do so, we must
>>> ensure that the host that ran the VMs until now is not running them
>>> anymore.
>>> If the fence operation failed, we can say nothing about the host
>>> status, and then, as I said, you should manually approve that the host
>>> was rebooted in order
>>> to prevent the VMs from running on multiple hosts, which would lead to
>>> data corruption.
>> but shouldn't the engine restart the corresponding vms after the holding
>> host came up? (without a manual fence)
>> because they are up - so the engine can query them about running/not
>> running vms and get the actual state of the vms - running or not
>> the only host down at that point was srv5, which held only 1 vm
>> - and it was correctly put in the 'unknown' state; the other vms were just
>> 'down' until we manually started them
>