
19 May 2014, 5:25 p.m.
On 05/19/2014 05:10 PM, Eli Mesika wrote:
>
> ----- Original Message -----
>> From: "Yuriy Demchenko" <demchenko.ya@gmail.com>
>> To: "Eli Mesika" <emesika@redhat.com>
>> Cc: users@ovirt.org
>> Sent: Monday, May 19, 2014 3:56:41 PM
>> Subject: Re: [ovirt-users] power outage: HA vms not restarted
>>
>> I saw another strange moment in the logs:
>>
>> For example, according to the logs the engine tried to restart the VM
>> 'prod.gui' on host srv19 at 00:23:11, and the attempt failed with the
>> error 'Failed to run VM prod.gui: VdcBLLException:
>> java.lang.NullPointerException (Failed with error ENGINE and code 5001)'.
>
> This is a bug and you should open a BZ for it and attach the related
> engine & vdsm logs.

Done: https://bugzilla.redhat.com/show_bug.cgi?id=1099134

>> In vdsm.log on srv19 I can see this error:
>>
>> StoragePoolUnknown: Unknown pool id, pool not connected: ('5849b030-626e-47cb-ad90-3ce782d831b3',)
>> Thread-18::DEBUG::2014-05-17 00:23:13,022::task::869::TaskManager.Task::(_run) Task=`27a323de-ac42-4d68-992f-eb5c84d8baea`::Task._run: 27a323de-ac42-4d68-992f-eb5c84d8baea ('5849b030-626e-47cb-ad90-3ce782d831b3',) {} failed - stopping task
>> Thread-18::DEBUG::2014-05-17 00:23:13,022::task::1194::TaskManager.Task::(stop) Task=`27a323de-ac42-4d68-992f-eb5c84d8baea`::stopping in state preparing (force False)
>> Thread-18::DEBUG::2014-05-17 00:23:13,023::task::974::TaskManager.Task::(_decref) Task=`27a323de-ac42-4d68-992f-eb5c84d8baea`::ref 1 aborting True
>> Thread-18::INFO::2014-05-17 00:23:13,023::task::1151::TaskManager.Task::(prepare) Task=`27a323de-ac42-4d68-992f-eb5c84d8baea`::aborting: Task is aborted: 'Unknown pool id, pool not connected' - code 309
>> Thread-18::DEBUG::2014-05-17 00:23:13,024::task::1156::TaskManager.Task::(prepare) Task=`27a323de-ac42-4d68-992f-eb5c84d8baea`::Prepare: aborted: Unknown pool id, pool not connected
>> Thread-18::DEBUG::2014-05-17 00:23:13,024::task::974::TaskManager.Task::(_decref) Task=`27a323de-ac42-4d68-992f-eb5c84d8baea`::ref 0 aborting True
>>
>> According to the engine database, storage pool
>> '5849b030-626e-47cb-ad90-3ce782d831b3' is datacenter 'CODD'.
>>
>> But according to the engine logs, SPM start for datacenter CODD was run
>> only at 00:23:27, i.e. _after_ the VM restart attempts.
>
> How did you conclude that?

Erm, am I wrong here? The restart attempt, per the logs above, was at
00:23:11; the SPM start, per the logs below, was at 00:23:27. Or am I
wrong in assuming that an SPM is needed for a VM to start?

>> 2014-05-17 00:23:27,585 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (DefaultQuartzScheduler_Worker-43) [1bb8f3c0] starting spm on vds srv12, storage pool CODD, prevId 9, LVER 11
>> 2014-05-17 00:23:27,610 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand] (DefaultQuartzScheduler_Worker-43) [1bb8f3c0] START, SpmStartVDSCommand(HostName = srv12, HostId = eac8d084-1be0-4e8d-a589-385f71fe3afa, storagePoolId = 5849b030-626e-47cb-ad90-3ce782d831b3, prevId=9, prevLVER=11, storagePoolFormatType=V3, recoveryMode=Manual, SCSIFencing=false), log id: 1844d391
>>
>> I thought no VM could be started until an SPM host was selected; why,
>> in my case, did the VM restart attempts happen before that?
>> And what does 'pool not connected' mean? No SPM running, storage
>> unavailable from some host, or something else?
>
> In this case, storage unavailable from some host....

Shouldn't that host then be set into the 'non-operational' state and the
restart attempt be carried out by another host?
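For reference, the pool-connection state can be checked directly on a
host with vdsClient. A minimal sketch, assuming a 3.x-era vdsm running
with SSL (verify the verb names against your vdsm version):

    # On the suspect host (e.g. srv19): list the storage pools this
    # vdsm instance is currently connected to. An empty list while the
    # engine is already scheduling VMs on the host matches the
    # StoragePoolUnknown ('pool not connected') error quoted above.
    vdsClient -s 0 getConnectedStoragePoolsList

    # Cross-check which storage domains the host can actually see.
    vdsClient -s 0 getStorageDomainsList

If pool id 5849b030-626e-47cb-ad90-3ce782d831b3 is missing from the
first list, the host was simply not yet connected to the pool when the
engine tried to run the VM there.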
>
>> Yuriy Demchenko
>>
>> On 05/19/2014 03:13 PM, Yuriy Demchenko wrote:
>>> On 05/19/2014 02:36 PM, Eli Mesika wrote:
>>>>
>>>> Hi,
>>>>
>>>> I'm running ovirt-3.2.2-el6 on 18 el6 hosts with FC SAN storage and
>>>> 46 HA VMs in 2 datacenters (3 hosts use different storage with no
>>>> connectivity to the first storage, hence the second DC).
>>>> Recently (2014-05-17) I had a double power outage: the first
>>>> blackout at 00:16, power back at ~00:19; the second blackout at
>>>> 00:26, power back at 10:06.
>>>> When finally everything went up (after approx. 10:16), only 2 VMs
>>>> out of 46 were restarted.
>>>> From browsing the engine log I saw failed restart attempts of
>>>> almost all VMs after the first blackout, with the error 'Failed
>>>> with error ENGINE and code 5001'; but after the second blackout I
>>>> saw no attempts to restart VMs, and the only error was 'connect
>>>> timeout' (probably to srv5 - that host physically died after the
>>>> blackouts).
>>>> I can't figure out why the HA VMs were not restarted. Please advise.
>>>>
>>>> Engine and (supposedly) SPM host logs attached.
>>>>>> Hi Yuriy
>>>>>>
>>>>>> What I see is that the log for 2014-05-17 starts at 2014-05-17
>>>>>> 00:23:03, so I cannot track the first interval you mentioned
>>>>>> (00:19 to 00:26).
>>>>> 00:23 is the time when the engine booted up after the first
>>>>> outage; that's why the logs start at 00:23:03.
>>>>>> However, I can clearly see that at 2014-05-17 00:23:03 the engine
>>>>>> was restarted, and at 2014-05-17 00:23:09,423 we started to get
>>>>>> connection errors.
>>>>>> We tried to solve the problem by soft-fencing (actually a vdsmd
>>>>>> service restart) the problematic hosts, but ssh to the host
>>>>>> failed, so we tried to hard-fence the host (restart/reboot); this
>>>>>> fell within the configurable "quiet time", during which fencing
>>>>>> operations after an engine restart are prevented and which is set
>>>>>> by default to 5 min (the DisableFenceAtStartupInSec key in
>>>>>> engine-config), and therefore we skipped the fencing operation...
>>>>> My hosts usually boot slower than the engine - long BIOS checks
>>>>> plus a random power-on delay (120 s) - that's why the engine
>>>>> reports connect errors at first. However, in the logs I see that
>>>>> some of them were already up and the engine successfully contacted
>>>>> them:
>>>> So maybe you should set DisableFenceAtStartupInSec to a higher
>>>> value instead of the default 5 min...
>>>>
>>>>> 2014-05-17 00:23:10,450 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-18) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: State was set to Up for host srv11.
>>>>> 2014-05-17 00:23:10,456 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-4) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: State was set to Up for host srv4.
>>>>> 2014-05-17 00:23:10,458 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-11) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: State was set to Up for host srv7.
>>>>> 2014-05-17 00:23:10,460 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-20) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: State was set to Up for host srv9.
>>>>>
>>>>> And after 00:23:11 I saw no fencing-related messages, only VM
>>>>> restart attempts that failed with strange errors like:
>>>>> 'Failed with error ENGINE and code 5001'
>>>>> 'Candidate host srv1 (2a89e565-aa4e-4a19-82e3-e72e4edee111) was
>>>>> filtered out by VAR__FILTERTYPE__INTERNAL filter Memory'
>>>>> 'CanDoAction of action RunVm failed. Reasons:VAR__ACTION__RUN,VAR__TYPE__VM,ACTION_TYPE_FAILED_VDS_VM_MEMORY,VAR__ACTION__RUN,VAR__TYPE__VM,ACTION_TYPE_FAILED_VDS_VM_MEMORY,SCHEDULING_ALL_HOSTS_FILTERED_OUT,VAR__FILTERTYPE__INTERNAL,$hostName srv1,$filterName Memory,SCHEDULING_HOST_FILTERED_REASON'
>>>> Strange that you didn't get the message itself (this is only the
>>>> message key). The original message is:
>>>> Cannot Run VM. There are no available running Hosts with sufficient
>>>> memory in VM's Cluster.
>>>> So it failed in the RunVm command validation, where no host with
>>>> enough memory to run the VM was found.
>>> Yeah, but how can that be? All my hosts have 256 GB RAM and only 2
>>> VMs are RAM-hungry (128 GB and 240 GB); the others use much less
>>> (2-64 GB).
>>>>>> For the first period, as I said, I can only guess that one of the
>>>>>> fencing attempts on your hosts happened after those 5 minutes, so
>>>>>> that host was rebooted and its HA VMs were freed to run on another
>>>>>> host.
>>>>>> For the second period, for which I have logs, the host fencing
>>>>>> failed due to the required "quiet time", and in this situation the
>>>>>> only thing you can do in order to have the HA VMs running again is
>>>>>> to right-click on each host and press "Confirm that host has been
>>>>>> rebooted".
>>>>> But I see in the logs that after the second period, at 10:14+, all
>>>>> hosts but one (srv5) were up and power management was verified
>>>>> successfully; shouldn't that be enough for the engine to verify
>>>>> that all HA VMs are down and restart them?
>>>> How should the engine know that the host was rebooted??
>>>> The fact that power management was verified successfully is not
>>>> enough in order to run the VMs on another host.
>>>> As I see it, the fence commands intended to reboot the hosts holding
>>>> the VMs failed...
>>> But about 30 seconds after the engine set the holding hosts to
>>> non-responsive, they all went up ('State was set to Up for host ...').
>>> So the engine can contact those hosts, query the VM status, and
>>> restart VMs if needed, shouldn't it?
>>>
>>> 2014-05-17 10:09:23,978 INFO [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-48) Server failed to respond, vds_id = 7bf30c8b-7eb2-4b1e-ba0e-13093bca57bf, vds_name = srv9, vm_count = 0, spm_status = None, non-responsive_timeout (seconds) = 60, error = java.net.ConnectException: Connection refused
>>> 2014-05-17 10:09:40,143 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-42) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: State was set to Up for host srv9.
>>>
>>> 2014-05-17 10:09:23,980 INFO [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-51) Server failed to respond, vds_id = 57f58a56-a7ad-4917-b32d-f5dfaf326419, vds_name = srv10, vm_count = 0, spm_status = None, non-responsive_timeout (seconds) = 60, error = java.net.ConnectException: Connection refused
>>> 2014-05-17 10:09:43,214 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-64) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: State was set to Up for host srv10.
>>>
>>> 2014-05-17 10:09:24,029 INFO [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-55) Server failed to respond, vds_id = bd07bc6d-37e8-4b55-810c-f41c78f7a8ad, vds_name = srv11, vm_count = 0, spm_status = None, non-responsive_timeout (seconds) = 60, error = java.net.ConnectException: Connection refused
>>> 2014-05-17 10:09:43,239 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-63) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: State was set to Up for host srv11.
>>>
>>> 2014-05-17 10:09:23,991 INFO [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-50) Server failed to respond, vds_id = eac8d084-1be0-4e8d-a589-385f71fe3afa, vds_name = srv12, vm_count = 5, spm_status = Contending, non-responsive_timeout (seconds) = 82, error = java.net.ConnectException: Connection refused
>>> 2014-05-17 10:09:43,320 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-60) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: State was set to Up for host srv12.
>>>
>>> 2014-05-17 10:09:23,987 INFO [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-49) Server failed to respond, vds_id = 712e822b-d3c6-4cb7-a6f1-58da5039bda1, vds_name = srv13, vm_count = 4, spm_status = None, non-responsive_timeout (seconds) = 62, error = java.net.ConnectException: Connection refused
>>> 2014-05-17 10:09:44,934 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-58) [77463af9] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: State was set to Up for host srv13.
>>>
>>> 2014-05-17 10:09:24,016 INFO [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-53) Server failed to respond, vds_id = 9d99c8eb-754e-4446-8a0f-af7327d19989, vds_name = srv14, vm_count = 0, spm_status = None, non-responsive_timeout (seconds) = 60, error = java.net.ConnectException: Connection refused
>>> 2014-05-17 10:09:43,930 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-61) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: State was set to Up for host srv14.
>>>
>>> 2014-05-17 10:09:24,019 INFO [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-54) Server failed to respond, vds_id = 3ca82f5b-1343-451f-81d0-d2ed111b5f51, vds_name = srv15, vm_count = 0, spm_status = None, non-responsive_timeout (seconds) = 60, error = java.net.ConnectException: Connection refused
>>> 2014-05-17 10:09:40,285 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-50) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: State was set to Up for host srv15.
>>>
>>> 2014-05-17 10:09:23,996 INFO [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-45) Server failed to respond, vds_id = c9d9735b-d75f-4026-afcd-94491ee66a4f, vds_name = srv16, vm_count = 0, spm_status = None, non-responsive_timeout (seconds) = 60, error = java.net.ConnectException: Connection refused
>>> 2014-05-17 10:09:40,188 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-47) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: State was set to Up for host srv16
>>>
>>> 2014-05-17 10:09:23,963 INFO [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-47) Server failed to respond, vds_id = 8daa3751-f347-41eb-b6a6-a27f4fda2def, vds_name = srv17, vm_count = 0, spm_status = None, non-responsive_timeout (seconds) = 60, error = java.net.ConnectException: Connection refused
>>> 2014-05-17 10:09:46,358 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-75) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: State was set to Up for host srv17.
>>>
>>> 2014-05-17 10:09:23,949 INFO [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-46) Server failed to respond, vds_id = 2e833c27-fd9f-4750-ab84-2fb629b66671, vds_name = srv18, vm_count = 0, spm_status = None, non-responsive_timeout (seconds) = 60, error = java.net.ConnectException: Connection refused
>>> 2014-05-17 10:10:25,643 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-7) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: State was set to Up for host srv18.
>>>
>>> 2014-05-17 10:09:23,991 INFO [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-52) Server failed to respond, vds_id = e7cedec5-9335-45bf-ae58-0744a9b93f41, vds_name = srv19, vm_count = 3, spm_status = None, non-responsive_timeout (seconds) = 61, error = java.net.ConnectException: Connection refused
>>> 2014-05-17 10:09:40,260 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-54) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: State was set to Up for host srv19.
>>>
>>>>> 2014-05-17 10:11:56,946 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-75) [dbc315d] Correlation ID: 73469dd6, Call Stack: null, Custom Event ID: -1, Message: Host srv17 power management was verified successfully.
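Regarding the 'filter Memory' failures quoted above: one quick way to
compare VM memory demands against host memory is to query the engine
database, as was already done earlier for the storage pool id. A
minimal sketch, assuming the default 'engine' database and a 3.2-era
schema (double-check the column names against your version):

    # On the engine machine: VM memory sizes, biggest first.
    sudo -u postgres psql engine -c \
        "SELECT vm_name, mem_size_mb FROM vm_static ORDER BY mem_size_mb DESC;"

    # Physical memory of each host, from the vds view.
    sudo -u postgres psql engine -c \
        "SELECT vds_name, physical_mem_mb FROM vds ORDER BY physical_mem_mb DESC;"

A 240 GB VM leaves little headroom on a 256 GB host once the other VMs
being restarted in the same scheduling pass are counted against it, so
the Memory filter can reject every candidate host even though the
hardware looks ample.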
>>>>>
>>>>> In fact, at ~17:55 my colleague restarted the engine, forced SPM
>>>>> selection, and started all the VMs - all started without errors,
>>>>> and he didn't have to click 'Confirm host has been rebooted'.
>>>> But here you restarted the VMs on the original host they had been
>>>> running on, after that host was up; this will work.
>>>> The problem is running an HA VM on another host; to do so, we must
>>>> ensure that the host that was running the VMs until now is not
>>>> running them anymore.
>>>> If the fence operation failed, we can say nothing about the host's
>>>> status, and then, as I said, you should manually approve that the
>>>> host was rebooted, in order to prevent VMs from running on multiple
>>>> hosts, which would lead to data corruption.
>>> But shouldn't the engine restart the corresponding VMs after the
>>> holding host comes up (without a manual fence)? Because the hosts
>>> are up, the engine can query them about running/not-running VMs and
>>> get the actual state of the VMs.
>>> The only host that was down at that point was srv5, which held only
>>> 1 VM - and it was correctly put into the 'Unknown' state; the other
>>> VMs were just 'Down' until we started them manually.
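For the record, the quiet-time workaround suggested earlier in the
thread is a one-line change. A minimal sketch, assuming it is run on
the engine machine of a 3.x setup (600 seconds is an illustrative
value; pick one longer than your hosts' worst-case boot time, BIOS
checks and power-on delay included):

    # Show the current value (the default is 300 seconds, i.e. 5 min).
    engine-config -g DisableFenceAtStartupInSec

    # Raise the post-startup fencing quiet time to 10 minutes.
    engine-config -s DisableFenceAtStartupInSec=600

    # The engine must be restarted for the new value to take effect.
    service ovirt-engine restart

With a quiet time that covers host boot-up, hosts that stay down past
it can be fenced normally, freeing their HA VMs to be restarted on
other hosts instead of the fence operation being skipped as happened
here.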