
19 May 2014, 4:10 p.m.
----- Original Message -----
> From: "Yuriy Demchenko" <demchenko.ya@gmail.com>
> To: "Eli Mesika" <emesika@redhat.com>
> Cc: users@ovirt.org
> Sent: Monday, May 19, 2014 3:56:41 PM
> Subject: Re: [ovirt-users] power outage: HA vms not restarted
>
> saw another strange moment in logs:
>
> for example, vm 'prod.gui': according to the logs, the engine tried to restart it
> on host srv19 at 00:23:11 and it failed with the error 'Failed to run VM
> prod.gui: VdcBLLException: java.lang.NullPointerException (Failed with
> error ENGINE and code 5001)'

This is a bug and you should open a BZ for it, attaching the related engine & vdsm logs.

> in vdsm.log on srv19 i can see this error:
> StoragePoolUnknown: Unknown pool id, pool not connected:
> ('5849b030-626e-47cb-ad90-3ce782d831b3',)
> Thread-18::DEBUG::2014-05-17 00:23:13,022::task::869::TaskManager.Task::(_run)
> Task=`27a323de-ac42-4d68-992f-eb5c84d8baea`::Task._run:
> 27a323de-ac42-4d68-992f-eb5c84d8baea
> ('5849b030-626e-47cb-ad90-3ce782d831b3',) {} failed - stopping task
> Thread-18::DEBUG::2014-05-17 00:23:13,022::task::1194::TaskManager.Task::(stop)
> Task=`27a323de-ac42-4d68-992f-eb5c84d8baea`::stopping in state preparing
> (force False)
> Thread-18::DEBUG::2014-05-17 00:23:13,023::task::974::TaskManager.Task::(_decref)
> Task=`27a323de-ac42-4d68-992f-eb5c84d8baea`::ref 1 aborting True
> Thread-18::INFO::2014-05-17 00:23:13,023::task::1151::TaskManager.Task::(prepare)
> Task=`27a323de-ac42-4d68-992f-eb5c84d8baea`::aborting: Task is aborted:
> 'Unknown pool id, pool not connected' - code 309
> Thread-18::DEBUG::2014-05-17 00:23:13,024::task::1156::TaskManager.Task::(prepare)
> Task=`27a323de-ac42-4d68-992f-eb5c84d8baea`::Prepare: aborted: Unknown
> pool id, pool not connected
> Thread-18::DEBUG::2014-05-17 00:23:13,024::task::974::TaskManager.Task::(_decref)
> Task=`27a323de-ac42-4d68-992f-eb5c84d8baea`::ref 0 aborting True
>
> according to the engine database, storage pool
> '5849b030-626e-47cb-ad90-3ce782d831b3' is
> datacenter 'CODD'.
>
> but - according to the engine logs, SPM start for datacenter CODD was run
> only at 00:23:27, i.e. _after_ the vm restart attempts

How did you conclude that?

> 2014-05-17 00:23:27,585 INFO
> [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand]
> (DefaultQuartzScheduler_Worker-43) [1bb8f3c0] starting spm on vds srv12,
> storage pool CODD, prevId 9, LVER 11
> 2014-05-17 00:23:27,610 INFO
> [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand]
> (DefaultQuartzScheduler_Worker-43) [1bb8f3c0] START,
> SpmStartVDSCommand(HostName = srv12, HostId =
> eac8d084-1be0-4e8d-a589-385f71fe3afa, storagePoolId =
> 5849b030-626e-47cb-ad90-3ce782d831b3, prevId=9, prevLVER=11,
> storagePoolFormatType=V3, recoveryMode=Manual, SCSIFencing=false), log
> id: 1844d391
>
> i thought no VM could be started until an SPM host is selected - why, in my case,
> did the vm restart attempts occur before that?
> and what does 'pool not connected' mean? no spm running, storage unavailable
> from some host, or something else?

In this case, storage unavailable from some host.

> Yuriy Demchenko
>
> On 05/19/2014 03:13 PM, Yuriy Demchenko wrote:
> > On 05/19/2014 02:36 PM, Eli Mesika wrote:
> >>
> >> Hi,
> >>
> >> i'm running ovirt-3.2.2-el6 on 18 el6 hosts with FC san storage and 46 HA
> >> vms in 2 datacenters (3 hosts use a different storage with no
> >> connectivity to the first storage - that's why there is a second DC).
> >> Recently (2014-05-17) i had a double power outage: first blackout at
> >> 00:16, power back at ~00:19; second blackout at 00:26, power back at 10:06.
> >> When everything finally came up (after approx. 10:16), only 2 vms out of
> >> 46 were restarted.
> >> From browsing the engine log i saw failed restart attempts of almost all vms
> >> after the first blackout, with the error 'Failed with error ENGINE and code
> >> 5001', but after the second blackout i saw no attempts to restart vms, and
> >> the only error was 'connect timeout' (probably to srv5 - that host
> >> physically died after the blackouts).
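As an aside on the pool-id lookup mentioned above: the mapping from a storage pool UUID to a datacenter name can be read straight out of the engine database. This is only a sketch - the `engine` database name and the `storage_pool` table/column names are assumed from oVirt 3.x's schema and may differ between versions:

```shell
# Hypothetical lookup, run on the engine host as root.
# In the oVirt schema a "storage pool" row corresponds to a datacenter,
# so this maps the UUID from the vdsm error back to its DC name.
su - postgres -c "psql engine -c \
  \"SELECT id, name, status FROM storage_pool \
    WHERE id = '5849b030-626e-47cb-ad90-3ce782d831b3';\""
```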
> >> And i can't figure out why the HA vms were not restarted? Please advise
> >>
> >> engine and (supposedly) spm host logs attached.
> >>>> Hi Yuriy
> >>>>
> >>>> What I see is that the log for 2014-05-17 starts at 2014-05-17 00:23:03,
> >>>> so I cannot track the first interval you had mentioned (00:19 to 00:26)
> >>> 00:23 is the time when the engine booted up after the first outage, that's
> >>> why the logs start at 00:23:03
> >>>> However, I can clearly see that at 2014-05-17 00:23:03 the engine was
> >>>> restarted and at 2014-05-17 00:23:09,423 we started to get connection
> >>>> errors.
> >>>> We tried to solve the problem by soft-fencing (actually a vdsmd
> >>>> service restart) the problematic hosts, but ssh to the host failed,
> >>>> so we tried to hard-fence the host (restart/reboot). However, this fell
> >>>> within the configurable "quiet time" in which fencing operations are
> >>>> prevented after an engine restart, which is set by default to 5 min
> >>>> (the DisableFenceAtStartupInSec key in engine-config), and therefore
> >>>> the fencing operation was skipped...
> >>> my hosts usually boot slower than the engine - long bios checks + a random
> >>> power-on delay (120s) - that's why the engine reports connect errors at
> >>> first. however, in the logs i see that some of them were already up and the
> >>> engine successfully contacted them:
> >> So maybe you should set DisableFenceAtStartupInSec to a higher value
> >> instead of the default 5 min...
> >>
> >>> 2014-05-17 00:23:10,450 INFO
> >>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
> >>> (DefaultQuartzScheduler_Worker-18) Correlation ID: null, Call Stack:
> >>> null, Custom Event ID: -1, Message: State was set to Up for host srv11.
> >>> 2014-05-17 00:23:10,456 INFO
> >>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
> >>> (DefaultQuartzScheduler_Worker-4) Correlation ID: null, Call Stack:
> >>> null, Custom Event ID: -1, Message: State was set to Up for host srv4.
> >>> 2014-05-17 00:23:10,458 INFO
> >>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
> >>> (DefaultQuartzScheduler_Worker-11) Correlation ID: null, Call Stack:
> >>> null, Custom Event ID: -1, Message: State was set to Up for host srv7.
> >>> 2014-05-17 00:23:10,460 INFO
> >>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
> >>> (DefaultQuartzScheduler_Worker-20) Correlation ID: null, Call Stack:
> >>> null, Custom Event ID: -1, Message: State was set to Up for host srv9.
> >>>
> >>> and after 00:23:11 i saw no fencing-related messages, only vm restart
> >>> attempts that failed with strange errors like:
> >>> 'Failed with error ENGINE and code 5001'
> >>> 'Candidate host srv1 (2a89e565-aa4e-4a19-82e3-e72e4edee111) was filtered
> >>> out by VAR__FILTERTYPE__INTERNAL filter Memory'
> >>> 'CanDoAction of action RunVm failed.
> >>> Reasons:VAR__ACTION__RUN,VAR__TYPE__VM,ACTION_TYPE_FAILED_VDS_VM_MEMORY,VAR__ACTION__RUN,VAR__TYPE__VM,ACTION_TYPE_FAILED_VDS_VM_MEMORY,SCHEDULING_ALL_HOSTS_FILTERED_OUT,VAR__FILTERTYPE__INTERNAL,$hostName
> >>> srv1,$filterName Memory,SCHEDULING_HOST_FILTERED_REASON'
> >> Strange that you didn't get the message itself (this is only the
> >> message key)
> >> The original message is:
> >> Cannot Run VM. There are no available running Hosts with sufficient
> >> memory in VM's Cluster.
> >> So, it failed on the RunVM command validation, where no host with
> >> enough memory to run the VM was found
> > yeah, but how can that be? all my hosts have 256gb ram and only 2 vms
> > are ram-hungry (128gb and 240gb); the others use much less (2-64gb)
> >>
> >>>> For the first period, as I said, I can only guess that one of your hosts'
> >>>> fencing attempts was after those 5 minutes, and therefore it was rebooted
> >>>> and the HA VMs were freed to run on another host.
> >>>> For the second period, for which I have logs, the host fencing failed
> >>>> due to the required "quiet time", and in this situation the only thing
> >>>> you can do in order to have the HA VMs running again is to
> >>>> right-click on each host and press "Confirm that host has been
> >>>> rebooted"
> >>> but i see in the logs that after the second period, at 10:14+, all hosts
> >>> but one (srv5) were up and power management was verified successfully -
> >>> shouldn't that be enough for the engine to verify all ha vms are down
> >>> and restart them?
> >> How should the engine know that the host was rebooted??
> >> The fact that power management verified successfully is not enough in
> >> order to run the VMs on another host.
> >> As I see it, the fence commands that were intended to reboot the host
> >> holding the VMs failed...
> > but within about 30secs after the engine set the holding hosts to
> > non-responsive, they all went up - 'state was set to up for host'
> > so the engine can contact those hosts, query vm status and restart if
> > needed, shouldn't it?
> >
> > 2014-05-17 10:09:23,978 INFO
> > [org.ovirt.engine.core.vdsbroker.VdsManager]
> > (DefaultQuartzScheduler_Worker-48) Server failed to respond, vds_id =
> > 7bf30c8b-7eb2-4b1e-ba0e-13093bca57bf, vds_name = srv9, vm_count = 0,
> > spm_status = None, non-responsive_timeout (seconds) = 60, error =
> > java.net.ConnectException: Connection refused
> > 2014-05-17 10:09:40,143 INFO
> > [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
> > (DefaultQuartzScheduler_Worker-42) Correlation ID: null, Call Stack:
> > null, Custom Event ID: -1, Message: State was set to Up for host srv9.
> > 2014-05-17 10:09:23,980 INFO
> > [org.ovirt.engine.core.vdsbroker.VdsManager]
> > (DefaultQuartzScheduler_Worker-51) Server failed to respond, vds_id =
> > 57f58a56-a7ad-4917-b32d-f5dfaf326419, vds_name = srv10, vm_count = 0,
> > spm_status = None, non-responsive_timeout (seconds) = 60, error =
> > java.net.ConnectException: Connection refused
> > 2014-05-17 10:09:43,214 INFO
> > [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
> > (DefaultQuartzScheduler_Worker-64) Correlation ID: null, Call Stack:
> > null, Custom Event ID: -1, Message: State was set to Up for host srv10.
> >
> > 2014-05-17 10:09:24,029 INFO
> > [org.ovirt.engine.core.vdsbroker.VdsManager]
> > (DefaultQuartzScheduler_Worker-55) Server failed to respond, vds_id =
> > bd07bc6d-37e8-4b55-810c-f41c78f7a8ad, vds_name = srv11, vm_count = 0,
> > spm_status = None, non-responsive_timeout (seconds) = 60, error =
> > java.net.ConnectException: Connection refused
> > 2014-05-17 10:09:43,239 INFO
> > [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
> > (DefaultQuartzScheduler_Worker-63) Correlation ID: null, Call Stack:
> > null, Custom Event ID: -1, Message: State was set to Up for host srv11.
> >
> > 2014-05-17 10:09:23,991 INFO
> > [org.ovirt.engine.core.vdsbroker.VdsManager]
> > (DefaultQuartzScheduler_Worker-50) Server failed to respond, vds_id =
> > eac8d084-1be0-4e8d-a589-385f71fe3afa, vds_name = srv12, vm_count = 5,
> > spm_status = Contending, non-responsive_timeout (seconds) = 82, error =
> > java.net.ConnectException: Connection refused
> > 2014-05-17 10:09:43,320 INFO
> > [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
> > (DefaultQuartzScheduler_Worker-60) Correlation ID: null, Call Stack:
> > null, Custom Event ID: -1, Message: State was set to Up for host srv12.
> > 2014-05-17 10:09:23,987 INFO
> > [org.ovirt.engine.core.vdsbroker.VdsManager]
> > (DefaultQuartzScheduler_Worker-49) Server failed to respond, vds_id =
> > 712e822b-d3c6-4cb7-a6f1-58da5039bda1, vds_name = srv13, vm_count = 4,
> > spm_status = None, non-responsive_timeout (seconds) = 62, error =
> > java.net.ConnectException: Connection refused
> > 2014-05-17 10:09:44,934 INFO
> > [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
> > (DefaultQuartzScheduler_Worker-58) [77463af9] Correlation ID: null,
> > Call Stack: null, Custom Event ID: -1, Message: State was set to Up
> > for host srv13.
> >
> > 2014-05-17 10:09:24,016 INFO
> > [org.ovirt.engine.core.vdsbroker.VdsManager]
> > (DefaultQuartzScheduler_Worker-53) Server failed to respond, vds_id =
> > 9d99c8eb-754e-4446-8a0f-af7327d19989, vds_name = srv14, vm_count = 0,
> > spm_status = None, non-responsive_timeout (seconds) = 60, error =
> > java.net.ConnectException: Connection refused
> > 2014-05-17 10:09:43,930 INFO
> > [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
> > (DefaultQuartzScheduler_Worker-61) Correlation ID: null, Call Stack:
> > null, Custom Event ID: -1, Message: State was set to Up for host srv14.
> >
> > 2014-05-17 10:09:24,019 INFO
> > [org.ovirt.engine.core.vdsbroker.VdsManager]
> > (DefaultQuartzScheduler_Worker-54) Server failed to respond, vds_id =
> > 3ca82f5b-1343-451f-81d0-d2ed111b5f51, vds_name = srv15, vm_count = 0,
> > spm_status = None, non-responsive_timeout (seconds) = 60, error =
> > java.net.ConnectException: Connection refused
> > 2014-05-17 10:09:40,285 INFO
> > [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
> > (DefaultQuartzScheduler_Worker-50) Correlation ID: null, Call Stack:
> > null, Custom Event ID: -1, Message: State was set to Up for host srv15.
> > 2014-05-17 10:09:23,996 INFO
> > [org.ovirt.engine.core.vdsbroker.VdsManager]
> > (DefaultQuartzScheduler_Worker-45) Server failed to respond, vds_id =
> > c9d9735b-d75f-4026-afcd-94491ee66a4f, vds_name = srv16, vm_count = 0,
> > spm_status = None, non-responsive_timeout (seconds) = 60, error =
> > java.net.ConnectException: Connection refused
> > 2014-05-17 10:09:40,188 INFO
> > [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
> > (DefaultQuartzScheduler_Worker-47) Correlation ID: null, Call Stack:
> > null, Custom Event ID: -1, Message: State was set to Up for host srv16.
> >
> > 2014-05-17 10:09:23,963 INFO
> > [org.ovirt.engine.core.vdsbroker.VdsManager]
> > (DefaultQuartzScheduler_Worker-47) Server failed to respond, vds_id =
> > 8daa3751-f347-41eb-b6a6-a27f4fda2def, vds_name = srv17, vm_count = 0,
> > spm_status = None, non-responsive_timeout (seconds) = 60, error =
> > java.net.ConnectException: Connection refused
> > 2014-05-17 10:09:46,358 INFO
> > [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
> > (DefaultQuartzScheduler_Worker-75) Correlation ID: null, Call Stack:
> > null, Custom Event ID: -1, Message: State was set to Up for host srv17.
> >
> > 2014-05-17 10:09:23,949 INFO
> > [org.ovirt.engine.core.vdsbroker.VdsManager]
> > (DefaultQuartzScheduler_Worker-46) Server failed to respond, vds_id =
> > 2e833c27-fd9f-4750-ab84-2fb629b66671, vds_name = srv18, vm_count = 0,
> > spm_status = None, non-responsive_timeout (seconds) = 60, error =
> > java.net.ConnectException: Connection refused
> > 2014-05-17 10:10:25,643 INFO
> > [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
> > (DefaultQuartzScheduler_Worker-7) Correlation ID: null, Call Stack:
> > null, Custom Event ID: -1, Message: State was set to Up for host srv18.
> > 2014-05-17 10:09:23,991 INFO
> > [org.ovirt.engine.core.vdsbroker.VdsManager]
> > (DefaultQuartzScheduler_Worker-52) Server failed to respond, vds_id =
> > e7cedec5-9335-45bf-ae58-0744a9b93f41, vds_name = srv19, vm_count = 3,
> > spm_status = None, non-responsive_timeout (seconds) = 61, error =
> > java.net.ConnectException: Connection refused
> > 2014-05-17 10:09:40,260 INFO
> > [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
> > (DefaultQuartzScheduler_Worker-54) Correlation ID: null, Call Stack:
> > null, Custom Event ID: -1, Message: State was set to Up for host srv19.
> >
> >>> 2014-05-17 10:11:56,946 INFO
> >>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
> >>> (DefaultQuartzScheduler_Worker-75) [dbc315d] Correlation ID: 73469dd6,
> >>> Call Stack: null, Custom Event ID: -1, Message: Host
> >>> srv17 power management was verified successfully.
> >>>
> >>> in fact, at ~17:55 my colleague restarted the engine, forced spm selection
> >>> and started all the vms - all started without errors, and he didn't have
> >>> to click 'confirm host has been rebooted'
> >> But there you restarted the VMs on the original host on which they
> >> were running, after that host was UP - this will work.
> >> The problem is running an HA VM on another host: to do so, we must
> >> ensure that the host that ran the VMs until now is not running them
> >> anymore.
> >> If the fence operation failed, we can say nothing about the host
> >> status, and then, as I said, you should manually approve that the host
> >> was rebooted, in order to prevent the VMs from running on multiple
> >> hosts, which would lead to data corruption
> > but shouldn't the engine restart the corresponding vms after the holding
> > host came up? (without a manual fence)
> > because they are up - so the engine can query them about running/not-running
> > vms and get the actual state of the vms - running or not
> > the only host down at that point was srv5, which held only 1 vm
> > - and it was correctly put in the 'unknown' state; the other vms were just
> > 'down' until we manually started them
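For reference, the post-startup fencing "quiet time" discussed earlier can be inspected and raised with engine-config on the engine host. A sketch only - the key name comes from the thread, while the 600-second value and the restart step are illustrative assumptions:

```shell
# Show the current quiet time (default is 300 s, i.e. 5 min)
engine-config -g DisableFenceAtStartupInSec

# Raise it to 10 minutes so slow-booting hosts are not fenced prematurely;
# the engine must be restarted for the new value to take effect
engine-config -s DisableFenceAtStartupInSec=600
service ovirt-engine restart
```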