power outage: HA vms not restarted

Hi,

I'm running ovirt-3.2.2-el6 on 18 el6 hosts with FC SAN storage and 46 HA VMs in 2 datacenters (3 hosts use different storage with no connectivity to the first storage - that's why there is a second DC).

Recently (2014-05-17) I had a double power outage: the first blackout at 00:16, power back at ~00:19; the second blackout at 00:26, power back at 10:06. When everything finally came up (after approx. 10:16), only 2 of the 46 VMs were restarted. Browsing the engine log, I saw failed restart attempts for almost all VMs after the first blackout with the error 'Failed with error ENGINE and code 5001', but after the second blackout I saw no attempts to restart the VMs at all; the only error was 'connect timeout' (probably to srv5 - that host physically died after the blackouts). I can't figure out why the HA VMs were not restarted. Please advise.

Engine and (supposedly) SPM host logs attached.

-- Yuriy Demchenko

----- Original Message -----
From: "Yuriy Demchenko" <demchenko.ya@gmail.com>
To: users@ovirt.org
Sent: Monday, May 19, 2014 11:34:15 AM
Subject: [ovirt-users] power outage: HA vms not restarted
Hi Yuriy,

What I see is that the log for 2014-05-17 starts at 2014-05-17 00:23:03, so I cannot track the first interval you mentioned (00:19 to 00:26).

However, I can clearly see that at 2014-05-17 00:23:03 the engine was restarted, and at 2014-05-17 00:23:09,423 we started to get connection errors. We tried to solve the problem by soft-fencing (actually a vdsmd service restart) the problematic hosts, but SSH to the hosts failed, so we then tried to hard-fence the hosts (restart/reboot). This, however, fell within the configurable "quiet time" in which we prevent fencing operations after an engine restart, which is set by default to 5 min (the DisableFenceAtStartupInSec key in engine-config), and therefore we skipped the fencing operation...

For the first period, as I said, I can only guess that one of your hosts' fencing attempts happened after those 5 minutes, so it was rebooted and its HA VMs were freed to run on another host. For the second period, for which I have logs, the host fencing failed due to the required "quiet time", and in this situation the only thing you can do in order to have the HA VMs running again is to right-click on each host and press "Confirm that host has been rebooted".

Regards
Eli
_______________________________________________ Users mailing list Users@ovirt.org http://lists.ovirt.org/mailman/listinfo/users

On 05/19/2014 01:27 PM, Eli Mesika wrote:
> Hi Yuriy,
> What I see is that the log for 2014-05-17 starts at 2014-05-17 00:23:03, so I cannot track the first interval you mentioned (00:19 to 00:26)

00:23 is the time when the engine booted up after the first outage, that's why the logs start at 00:23:03.

> However, I can clearly see that at 2014-05-17 00:23:03 the engine was restarted, and at 2014-05-17 00:23:09,423 we started to get connection errors. We tried to solve the problem by soft-fencing (actually a vdsmd service restart) the problematic hosts, but SSH to the hosts failed, so we then tried to hard-fence the hosts (restart/reboot). This, however, fell within the configurable "quiet time" in which we prevent fencing operations after an engine restart, set by default to 5 min (the DisableFenceAtStartupInSec key in engine-config), and therefore we skipped the fencing operation...

My hosts usually boot slower than the engine - long BIOS checks plus a random power-on delay (120 s) - that's why the engine reports connect errors at first. However, in the logs I see that some of them were already up and the engine contacted them successfully:

2014-05-17 00:23:10,450 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-18) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: State was set to Up for host srv11.
2014-05-17 00:23:10,456 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-4) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: State was set to Up for host srv4.
2014-05-17 00:23:10,458 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-11) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: State was set to Up for host srv7.
2014-05-17 00:23:10,460 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-20) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: State was set to Up for host srv9.

And after 00:23:11 I saw no fencing-related messages, only VM restart attempts that failed with strange errors like:
'Failed with error ENGINE and code 5001'
'Candidate host srv1 (2a89e565-aa4e-4a19-82e3-e72e4edee111) was filtered out by VAR__FILTERTYPE__INTERNAL filter Memory'
'CanDoAction of action RunVm failed. Reasons:VAR__ACTION__RUN,VAR__TYPE__VM,ACTION_TYPE_FAILED_VDS_VM_MEMORY,VAR__ACTION__RUN,VAR__TYPE__VM,ACTION_TYPE_FAILED_VDS_VM_MEMORY,SCHEDULING_ALL_HOSTS_FILTERED_OUT,VAR__FILTERTYPE__INTERNAL,$hostName srv1,$filterName Memory,SCHEDULING_HOST_FILTERED_REASON'
> For the first period, as I said, I can only guess that one of your hosts' fencing attempts happened after those 5 minutes, so it was rebooted and its HA VMs were freed to run on another host. For the second period, for which I have logs, the host fencing failed due to the required "quiet time", and in this situation the only thing you can do in order to have the HA VMs running again is to right-click on each host and press "Confirm that host has been rebooted"

But I see in the logs that after the second period, at 10:14+, all hosts but one (srv5) were up and power management was verified successfully - shouldn't that be enough for the engine to verify that all HA VMs are down and restart them?

2014-05-17 10:11:56,946 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-75) [dbc315d] Correlation ID: 73469dd6, Call Stack: null, Custom Event ID: -1, Message: Host srv17 power management was verified successfully.
In fact, at ~17:55 my colleague restarted the engine, forced SPM selection and started all the VMs - all started without errors, and he didn't have to click 'Confirm host has been rebooted'.
-- Yuriy Demchenko

----- Original Message -----
From: "Yuriy Demchenko" <demchenko.ya@gmail.com>
To: "Eli Mesika" <emesika@redhat.com>
Cc: users@ovirt.org
Sent: Monday, May 19, 2014 1:09:42 PM
Subject: Re: [ovirt-users] power outage: HA vms not restarted
> On 05/19/2014 01:27 PM, Eli Mesika wrote:
>> Hi Yuriy,
>>
>> What I see is that the log for 2014-05-17 starts at 2014-05-17 00:23:03, so I cannot track the first interval you mentioned (00:19 to 00:26)
>
> 00:23 is the time when the engine booted up after the first outage, that's why the logs start at 00:23:03
>
>> However, I can clearly see that at 2014-05-17 00:23:03 the engine was restarted, and at 2014-05-17 00:23:09,423 we started to get connection errors. We tried to solve the problem by soft-fencing (actually a vdsmd service restart) the problematic hosts, but SSH to the hosts failed, so we then tried to hard-fence the hosts (restart/reboot). This, however, fell within the configurable "quiet time" in which we prevent fencing operations after an engine restart, set by default to 5 min (the DisableFenceAtStartupInSec key in engine-config), and therefore we skipped the fencing operation...
>
> My hosts usually boot slower than the engine - long BIOS checks plus a random power-on delay (120 s) - that's why the engine reports connect errors at first. However, in the logs I see that some of them were already up and the engine contacted them successfully:
So maybe you should set DisableFenceAtStartupInSec to a higher value than the default 5 min...
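For reference, the current quiet-time value can be checked and raised with engine-config on the engine machine (a sketch; the 600 s below is only an example - pick something longer than your hosts' worst-case boot time):

```shell
# Show the current post-startup fencing "quiet time" (seconds)
engine-config -g DisableFenceAtStartupInSec

# Raise it, e.g. to 10 minutes, if hosts boot slower than the engine
engine-config -s DisableFenceAtStartupInSec=600

# engine-config changes are picked up on the next engine restart (el6 init script)
service ovirt-engine restart
```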
> 2014-05-17 00:23:10,450 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-18) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: State was set to Up for host srv11.
> 2014-05-17 00:23:10,456 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-4) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: State was set to Up for host srv4.
> 2014-05-17 00:23:10,458 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-11) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: State was set to Up for host srv7.
> 2014-05-17 00:23:10,460 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-20) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: State was set to Up for host srv9.
> And after 00:23:11 I saw no fencing-related messages, only VM restart attempts that failed with strange errors like:
> 'Failed with error ENGINE and code 5001'
> 'Candidate host srv1 (2a89e565-aa4e-4a19-82e3-e72e4edee111) was filtered out by VAR__FILTERTYPE__INTERNAL filter Memory'
> 'CanDoAction of action RunVm failed. Reasons:VAR__ACTION__RUN,VAR__TYPE__VM,ACTION_TYPE_FAILED_VDS_VM_MEMORY,VAR__ACTION__RUN,VAR__TYPE__VM,ACTION_TYPE_FAILED_VDS_VM_MEMORY,SCHEDULING_ALL_HOSTS_FILTERED_OUT,VAR__FILTERTYPE__INTERNAL,$hostName srv1,$filterName Memory,SCHEDULING_HOST_FILTERED_REASON'
Strange that you didn't get the message itself (that is only the message key). The original message is: "Cannot Run VM. There are no available running Hosts with sufficient memory in VM's Cluster." So it failed in the RunVM command validation, where no host with enough memory to run the VM was found.
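A plausible reason for the Memory filter firing during a mass HA restart is that the scheduler also counts memory already committed to VMs still powering up on each host. A toy sketch of that check (made-up numbers, not the engine's actual accounting, which additionally handles overcommit and guest overhead):

```shell
#!/bin/sh
# Toy model of the scheduler's Memory filter: a candidate host is
# filtered out when the VM's memory no longer fits into what is left
# after subtracting memory committed to VMs already starting there.
host_total_mb=262144      # a 256 GB host
committed_mb=260096       # already committed to VMs restarting on it
vm_mem_mb=4096            # the next HA VM to place

free_mb=$((host_total_mb - committed_mb))
if [ "$vm_mem_mb" -le "$free_mb" ]; then
    echo "host passes Memory filter"
else
    echo "host filtered out by Memory filter"
fi
```

With 46 VMs restarting at once, even 256 GB hosts can transiently look full to this check, which would match the filter messages above.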
>> For the first period, as I said, I can only guess that one of your hosts' fencing attempts happened after those 5 minutes, so it was rebooted and its HA VMs were freed to run on another host. For the second period, for which I have logs, the host fencing failed due to the required "quiet time", and in this situation the only thing you can do in order to have the HA VMs running again is to right-click on each host and press "Confirm that host has been rebooted"
>
> But I see in the logs that after the second period, at 10:14+, all hosts but one (srv5) were up and power management was verified successfully - shouldn't that be enough for the engine to verify that all HA VMs are down and restart them?
How should the engine know that the host was rebooted? The fact that power management was verified successfully is not enough in order to run the VMs on another host. As I see it, the fence commands that were meant to reboot the hosts holding the VMs failed...
> 2014-05-17 10:11:56,946 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-75) [dbc315d] Correlation ID: 73469dd6, Call Stack: null, Custom Event ID: -1, Message: Host srv17 power management was verified successfully.
> In fact, at ~17:55 my colleague restarted the engine, forced SPM selection and started all the VMs - all started without errors, and he didn't have to click 'Confirm host has been rebooted'.
But here you restarted the VMs on the original hosts they had been running on, after those hosts were UP - this will work. The problem is running an HA VM on another host: to do so, we must ensure that the host that was running the VMs until now is not running them anymore. If the fence operation failed, we can say nothing about the host's status, and then, as I said, you should manually confirm that the host was rebooted, in order to prevent the VMs from running on multiple hosts, which would lead to data corruption.
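That safety rule can be condensed into a small decision sketch (illustrative only, not actual engine code): an HA VM may be started elsewhere only when the engine positively knows its previous host is down.

```shell
#!/bin/sh
# Toy sketch of the HA-restart safety rule: move an HA VM to another
# host only if the old host was successfully fenced, or an admin pressed
# "Confirm that host has been rebooted". Anything less risks the same VM
# running on two hosts at once and corrupting its disks.
may_restart_elsewhere() {
    fence_result=$1      # succeeded | failed | skipped (quiet time)
    admin_confirmed=$2   # yes | no
    if [ "$fence_result" = "succeeded" ] || [ "$admin_confirmed" = "yes" ]; then
        echo "restart HA VM on another host"
    else
        echo "leave VM alone: old host state unknown"
    fi
}

may_restart_elsewhere skipped no    # fencing skipped by quiet time
may_restart_elsewhere failed yes    # admin confirmed the reboot
```

In the outage described here, fencing was skipped by the quiet time and nobody confirmed the reboots, so the engine fell into the "leave VM alone" branch for every host.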
Regards Eli

On 05/19/2014 02:36 PM, Eli Mesika wrote:
> Strange that you didn't get the message itself (this is only the message key). The original message is: "Cannot Run VM. There are no available running Hosts with sufficient memory in VM's Cluster." So it failed in the RunVM command validation, where no host with enough memory to run the VM was found
Yeah, but how can that be? All my hosts have 256 GB RAM and only 2 VMs are RAM-hungry (128 GB and 240 GB); the others use much less (2-64 GB).
> How should the engine know that the host was rebooted? The fact that power management was verified successfully is not enough in order to run the VMs on another host. As I see it, the fence commands that were meant to reboot the hosts holding the VMs failed...
>
> But here you restarted the VMs on the original hosts they had been running on, after those hosts were UP - this will work. The problem is running an HA VM on another host: to do so, we must ensure that the host that was running the VMs until now is not running them anymore. If the fence operation failed, we can say nothing about the host's status, and then, as I said, you should manually confirm that the host was rebooted, in order to prevent the VMs from running on multiple hosts, which would lead to data corruption

But shouldn't the engine restart the corresponding VMs once the holding host itself came back up, without a manual fence? Since the hosts are up, the engine can query them and get the actual state of their VMs - running or not.
But within about 30 seconds of the engine setting the holding hosts to non-responsive, they all went up ('State was set to Up for host ...'). So the engine could contact those hosts, query VM status and restart VMs if needed, couldn't it?

2014-05-17 10:09:23,978 INFO [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-48) Server failed to respond, vds_id = 7bf30c8b-7eb2-4b1e-ba0e-13093bca57bf, vds_name = srv9, vm_count = 0, spm_status = None, non-responsive_timeout (seconds) = 60, error = java.net.ConnectException: Connection refused
2014-05-17 10:09:40,143 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-42) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: State was set to Up for host srv9.
2014-05-17 10:09:23,980 INFO [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-51) Server failed to respond, vds_id = 57f58a56-a7ad-4917-b32d-f5dfaf326419, vds_name = srv10, vm_count = 0, spm_status = None, non-responsive_timeout (seconds) = 60, error = java.net.ConnectException: Connection refused
2014-05-17 10:09:43,214 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-64) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: State was set to Up for host srv10.
2014-05-17 10:09:24,029 INFO [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-55) Server failed to respond, vds_id = bd07bc6d-37e8-4b55-810c-f41c78f7a8ad, vds_name = srv11, vm_count = 0, spm_status = None, non-responsive_timeout (seconds) = 60, error = java.net.ConnectException: Connection refused
2014-05-17 10:09:43,239 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-63) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: State was set to Up for host srv11.
2014-05-17 10:09:23,991 INFO [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-50) Server failed to respond, vds_id = eac8d084-1be0-4e8d-a589-385f71fe3afa, vds_name = srv12, vm_count = 5, spm_status = Contending, non-responsive_timeout (seconds) = 82, error = java.net.ConnectException: Connection refused
2014-05-17 10:09:43,320 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-60) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: State was set to Up for host srv12.
2014-05-17 10:09:23,987 INFO [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-49) Server failed to respond, vds_id = 712e822b-d3c6-4cb7-a6f1-58da5039bda1, vds_name = srv13, vm_count = 4, spm_status = None, non-responsive_timeout (seconds) = 62, error = java.net.ConnectException: Connection refused
2014-05-17 10:09:44,934 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-58) [77463af9] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: State was set to Up for host srv13.
2014-05-17 10:09:24,016 INFO [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-53) Server failed to respond, vds_id = 9d99c8eb-754e-4446-8a0f-af7327d19989, vds_name = srv14, vm_count = 0, spm_status = None, non-responsive_timeout (seconds) = 60, error = java.net.ConnectException: Connection refused
2014-05-17 10:09:43,930 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-61) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: State was set to Up for host srv14.
2014-05-17 10:09:24,019 INFO [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-54) Server failed to respond, vds_id = 3ca82f5b-1343-451f-81d0-d2ed111b5f51, vds_name = srv15, vm_count = 0, spm_status = None, non-responsive_timeout (seconds) = 60, error = java.net.ConnectException: Connection refused
2014-05-17 10:09:40,285 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-50) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: State was set to Up for host srv15.
2014-05-17 10:09:23,996 INFO [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-45) Server failed to respond, vds_id = c9d9735b-d75f-4026-afcd-94491ee66a4f, vds_name = srv16, vm_count = 0, spm_status = None, non-responsive_timeout (seconds) = 60, error = java.net.ConnectException: Connection refused
2014-05-17 10:09:40,188 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-47) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: State was set to Up for host srv16
2014-05-17 10:09:23,963 INFO [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-47) Server failed to respond, vds_id = 8daa3751-f347-41eb-b6a6-a27f4fda2def, vds_name = srv17, vm_count = 0, spm_status = None, non-responsive_timeout (seconds) = 60, error = java.net.ConnectException: Connection refused
2014-05-17 10:09:46,358 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-75) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: State was set to Up for host srv17.
2014-05-17 10:09:23,949 INFO [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-46) Server failed to respond, vds_id = 2e833c27-fd9f-4750-ab84-2fb629b66671, vds_name = srv18, vm_count = 0, spm_status = None, non-responsive_timeout (seconds) = 60, error = java.net.ConnectException: Connection refused
2014-05-17 10:10:25,643 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-7) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: State was set to Up for host srv18.
2014-05-17 10:09:23,991 INFO [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-52) Server failed to respond, vds_id = e7cedec5-9335-45bf-ae58-0744a9b93f41, vds_name = srv19, vm_count = 3, spm_status = None, non-responsive_timeout (seconds) = 61, error = java.net.ConnectException: Connection refused
2014-05-17 10:09:40,260 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-54) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: State was set to Up for host srv19.

The only host that was down at that point was srv5, which held only 1 VM - and it was correctly put into the 'unknown' state; the other VMs were just 'down' until we started them manually.

Saw another strange moment in the logs. For example, VM 'prod.gui': according to the logs, the engine tried to restart it on host srv19 at 00:23:11 and failed with the error 'Failed to run VM prod.gui: VdcBLLException: java.lang.NullPointerException (Failed with error ENGINE and code 5001)'.

In vdsm.log on srv19 I can see this error:

StoragePoolUnknown: Unknown pool id, pool not connected: ('5849b030-626e-47cb-ad90-3ce782d831b3',)
Thread-18::DEBUG::2014-05-17 00:23:13,022::task::869::TaskManager.Task::(_run) Task=`27a323de-ac42-4d68-992f-eb5c84d8baea`::Task._run: 27a323de-ac42-4d68-992f-eb5c84d8baea ('5849b030-626e-47cb-ad90-3ce782d831b3',) {} failed - stopping task
Thread-18::DEBUG::2014-05-17 00:23:13,022::task::1194::TaskManager.Task::(stop) Task=`27a323de-ac42-4d68-992f-eb5c84d8baea`::stopping in state preparing (force False)
Thread-18::DEBUG::2014-05-17 00:23:13,023::task::974::TaskManager.Task::(_decref) Task=`27a323de-ac42-4d68-992f-eb5c84d8baea`::ref 1 aborting True
Thread-18::INFO::2014-05-17 00:23:13,023::task::1151::TaskManager.Task::(prepare) Task=`27a323de-ac42-4d68-992f-eb5c84d8baea`::aborting: Task is aborted: 'Unknown pool id, pool not connected' - code 309
Thread-18::DEBUG::2014-05-17 00:23:13,024::task::1156::TaskManager.Task::(prepare) Task=`27a323de-ac42-4d68-992f-eb5c84d8baea`::Prepare: aborted: Unknown pool id, pool not connected
Thread-18::DEBUG::2014-05-17 00:23:13,024::task::974::TaskManager.Task::(_decref) Task=`27a323de-ac42-4d68-992f-eb5c84d8baea`::ref 0 aborting True

According to the engine database, storage pool '5849b030-626e-47cb-ad90-3ce782d831b3' is datacenter 'CODD'. But according to the engine logs, SPM start for datacenter CODD ran only at 00:23:27, i.e. _after_ the VM restart attempts:

2014-05-17 00:23:27,585 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (DefaultQuartzScheduler_Worker-43) [1bb8f3c0] starting spm on vds srv12, storage pool CODD, prevId 9, LVER 11
2014-05-17 00:23:27,610 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand] (DefaultQuartzScheduler_Worker-43) [1bb8f3c0] START, SpmStartVDSCommand(HostName = srv12, HostId = eac8d084-1be0-4e8d-a589-385f71fe3afa, storagePoolId = 5849b030-626e-47cb-ad90-3ce782d831b3, prevId=9, prevLVER=11, storagePoolFormatType=V3, recoveryMode=Manual, SCSIFencing=false), log id: 1844d391

I thought no VM could be started until an SPM host was selected - why did the VM restarts occur before that in my case? And what does 'pool not connected' mean - no SPM running, storage unavailable from some host, or something else?

Yuriy Demchenko
>>>> Hi Yuriy >>>> >>>> What I see is that the log for 2014-05-17 is started at 2014-05-17 >>>> 00:23:03 >>>> so I can not track the first interval you had mentioned (00:19 to >>>> 00:26) >>> 00:23 is the time when engine booted up after first outage, that's why >>> logs started at 00:23:03 >>>> However, I can clearly see that at 2014-05-17 00:23:03 the engine was >>>> restarted and at 2014-05-17 00:23:09,423 we had started to get >>>> connection >>>> errors. >>>> We had tried to solve the problem by doing Soft-Fencing (actually >>>> vdsmd >>>> service restart) on the problematic hosts, but ssh to the host >>>> failed so >>>> we had tried >>>> to hard-fence the host (restart/reboot), but this was due the >>>> configurable >>>> "quite time" in which we are preventing fencing operation after an >>>> engine >>>> restart which >>>> is set by default to 5 min (DisableFenceAtStartupInSec key in >>>> engine-config) and therefor we had skipped the fencing operation... >>> my hosts usually boots slower than engine, long bios checks + random >>> power on delay (120s), thats why at first engine reports connect errors >>> however, in logs i see, that some of them were allready up and engine >>> successfully contacted them: >> So maybe you should put DisableFenceAtStartupInSec to a higher value >> instead of the default 5 min... >> >>> 2014-05-17 00:23:10,450 INFO >>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] >>> (DefaultQuartzScheduler_Worker-18) Correlation ID: null, Call Stack: >>> null, Custom Event ID: -1, Message: State was set to Up for host srv11. >>> 2014-05-17 00:23:10,456 INFO >>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] >>> (DefaultQuartzScheduler_Worker-4) Correlation ID: null, Call Stack: >>> null, Custom Event ID: -1, Message: State was set to Up for host srv4. 
>>> 2014-05-17 00:23:10,458 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-11) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: State was set to Up for host srv7.
>>> 2014-05-17 00:23:10,460 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-20) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: State was set to Up for host srv9.
>>>
>>> and after 00:23:11 I saw no fencing-related messages, only VM restart attempts that failed with strange errors like:
>>> 'Failed with error ENGINE and code 5001'
>>> 'Candidate host srv1 (2a89e565-aa4e-4a19-82e3-e72e4edee111) was filtered out by VAR__FILTERTYPE__INTERNAL filter Memory'
>>> 'CanDoAction of action RunVm failed. Reasons:VAR__ACTION__RUN,VAR__TYPE__VM,ACTION_TYPE_FAILED_VDS_VM_MEMORY,VAR__ACTION__RUN,VAR__TYPE__VM,ACTION_TYPE_FAILED_VDS_VM_MEMORY,SCHEDULING_ALL_HOSTS_FILTERED_OUT,VAR__FILTERTYPE__INTERNAL,$hostName srv1,$filterName Memory,SCHEDULING_HOST_FILTERED_REASON'
>> Strange that you didn't get the message itself (this is only the message key).
>> The original message is:
>> Cannot Run VM. There are no available running Hosts with sufficient memory in VM's Cluster.
>> So it failed on the RunVM command validation, where no host with enough memory to run the VM was found
> yeah, but how can that be? all my hosts have 256gb ram and only 2 vms are ram-hungry (128gb and 240gb), the others use much less (2-64gb)
>>
>>>> For the first period, as I said, I can only guess that one of your hosts' fencing attempts came after those 5 minutes, so it was rebooted and the HA VMs were freed to run on another host.
>>>> For the second period, on which I have logs, the host fencing failed due to the required "quiet time", and in this situation the only thing you can do in order to have the HA VMs running again is to right-click on each host and press "Confirm that host has been rebooted"
>>> but I see in the logs that after the second period, at 10:14+, all hosts but one (srv5) were up and power management was verified successfully; shouldn't that be enough for the engine to verify all HA VMs are down and restart them?
>> How should the engine know that the host was rebooted??
>> The fact that power management was verified successfully is not enough in order to run the VMs on another host.
>> As I see it, the fence commands that were intended to reboot the hosts holding the VMs failed...
> but about 30 seconds after the engine set the holding hosts to non-responsive, they all went up - 'state was set to up for host'
> so the engine can contact those hosts and query VM status, and restart if needed, shouldn't it?
>
> 2014-05-17 10:09:23,978 INFO [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-48) Server failed to respond, vds_id = 7bf30c8b-7eb2-4b1e-ba0e-13093bca57bf, vds_name = srv9, vm_count = 0, spm_status = None, non-responsive_timeout (seconds) = 60, error = java.net.ConnectException: Connection refused
> 2014-05-17 10:09:40,143 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-42) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: State was set to Up for host srv9.
>
> 2014-05-17 10:09:23,980 INFO [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-51) Server failed to respond, vds_id = 57f58a56-a7ad-4917-b32d-f5dfaf326419, vds_name = srv10, vm_count = 0, spm_status = None, non-responsive_timeout (seconds) = 60, error = java.net.ConnectException: Connection refused
> 2014-05-17 10:09:43,214 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-64) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: State was set to Up for host srv10.
>
> 2014-05-17 10:09:24,029 INFO [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-55) Server failed to respond, vds_id = bd07bc6d-37e8-4b55-810c-f41c78f7a8ad, vds_name = srv11, vm_count = 0, spm_status = None, non-responsive_timeout (seconds) = 60, error = java.net.ConnectException: Connection refused
> 2014-05-17 10:09:43,239 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-63) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: State was set to Up for host srv11.
>
> 2014-05-17 10:09:23,991 INFO [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-50) Server failed to respond, vds_id = eac8d084-1be0-4e8d-a589-385f71fe3afa, vds_name = srv12, vm_count = 5, spm_status = Contending, non-responsive_timeout (seconds) = 82, error = java.net.ConnectException: Connection refused
> 2014-05-17 10:09:43,320 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-60) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: State was set to Up for host srv12.
>
> 2014-05-17 10:09:23,987 INFO [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-49) Server failed to respond, vds_id = 712e822b-d3c6-4cb7-a6f1-58da5039bda1, vds_name = srv13, vm_count = 4, spm_status = None, non-responsive_timeout (seconds) = 62, error = java.net.ConnectException: Connection refused
> 2014-05-17 10:09:44,934 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-58) [77463af9] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: State was set to Up for host srv13.
>
> 2014-05-17 10:09:24,016 INFO [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-53) Server failed to respond, vds_id = 9d99c8eb-754e-4446-8a0f-af7327d19989, vds_name = srv14, vm_count = 0, spm_status = None, non-responsive_timeout (seconds) = 60, error = java.net.ConnectException: Connection refused
> 2014-05-17 10:09:43,930 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-61) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: State was set to Up for host srv14.
>
> 2014-05-17 10:09:24,019 INFO [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-54) Server failed to respond, vds_id = 3ca82f5b-1343-451f-81d0-d2ed111b5f51, vds_name = srv15, vm_count = 0, spm_status = None, non-responsive_timeout (seconds) = 60, error = java.net.ConnectException: Connection refused
> 2014-05-17 10:09:40,285 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-50) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: State was set to Up for host srv15.
>
> 2014-05-17 10:09:23,996 INFO [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-45) Server failed to respond, vds_id = c9d9735b-d75f-4026-afcd-94491ee66a4f, vds_name = srv16, vm_count = 0, spm_status = None, non-responsive_timeout (seconds) = 60, error = java.net.ConnectException: Connection refused
> 2014-05-17 10:09:40,188 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-47) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: State was set to Up for host srv16.
>
> 2014-05-17 10:09:23,963 INFO [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-47) Server failed to respond, vds_id = 8daa3751-f347-41eb-b6a6-a27f4fda2def, vds_name = srv17, vm_count = 0, spm_status = None, non-responsive_timeout (seconds) = 60, error = java.net.ConnectException: Connection refused
> 2014-05-17 10:09:46,358 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-75) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: State was set to Up for host srv17.
>
> 2014-05-17 10:09:23,949 INFO [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-46) Server failed to respond, vds_id = 2e833c27-fd9f-4750-ab84-2fb629b66671, vds_name = srv18, vm_count = 0, spm_status = None, non-responsive_timeout (seconds) = 60, error = java.net.ConnectException: Connection refused
> 2014-05-17 10:10:25,643 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-7) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: State was set to Up for host srv18.
>
> 2014-05-17 10:09:23,991 INFO [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-52) Server failed to respond, vds_id = e7cedec5-9335-45bf-ae58-0744a9b93f41, vds_name = srv19, vm_count = 3, spm_status = None, non-responsive_timeout (seconds) = 61, error = java.net.ConnectException: Connection refused
> 2014-05-17 10:09:40,260 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-54) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: State was set to Up for host srv19.
>
>>> 2014-05-17 10:11:56,946 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-75) [dbc315d] Correlation ID: 73469dd6, Call Stack: null, Custom Event ID: -1, Message: Host srv17 power management was verified successfully.
>>>
>>> in fact, at ~17:55 my colleague restarted the engine, forced SPM selection and started all VMs - all started without errors, and he didn't have to click 'confirm host has been rebooted'
>> But there you restarted the VMs on the original host on which they had been running, after that host was UP; this will work.
>> The problem is running an HA VM on another host: to do so, we must ensure that the host that ran the VMs until now is not running them anymore.
>> If the fence operation failed, we can say nothing about the host's status, and then, as I said, you should manually approve that the host was rebooted, in order to prevent VMs from running on multiple hosts, which would lead to data corruption
> but shouldn't the engine restart the corresponding VMs after the holding host came up?
(without manual fence)
> because they are up - so the engine can query them about running/not-running VMs and get the actual state of the VMs - running or not
> the only host that was down at that point is srv5, which held only 1 VM - and it was correctly put into the 'unknown' state; the other VMs were just 'down' until we manually started them
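Eli's "quiet time" explanation above can be made concrete with a small sketch. This is illustrative Python, not the engine's actual (Java) code; the only real name in it is the DisableFenceAtStartupInSec engine-config key:

```python
import time

# Illustrative model, not oVirt's implementation: after an engine restart,
# fencing of non-responsive hosts is skipped until DisableFenceAtStartupInSec
# seconds have passed (default 300, i.e. the 5 min mentioned above).

DEFAULT_QUIET_TIME_SEC = 300  # engine-config key: DisableFenceAtStartupInSec

def fencing_allowed(engine_start, now=None, quiet_time=DEFAULT_QUIET_TIME_SEC):
    """Return True once the post-startup quiet window has elapsed."""
    if now is None:
        now = time.time()
    return (now - engine_start) >= quiet_time

# Engine restarted at t=0 (as at 00:23:03); a host goes non-responsive at t=120:
assert not fencing_allowed(engine_start=0, now=120)  # inside quiet window: fence skipped
assert fencing_allowed(engine_start=0, now=400)      # window elapsed: fence may proceed
```

If host boots routinely take longer than the engine's, raising the value as Eli suggests (`engine-config -s DisableFenceAtStartupInSec=<seconds>`, then restarting the engine service) changes when that window closes; `engine-config -g DisableFenceAtStartupInSec` shows the current setting.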

----- Original Message -----
> From: "Yuriy Demchenko" <demchenko.ya@gmail.com>
> To: "Eli Mesika" <emesika@redhat.com>
> Cc: users@ovirt.org
> Sent: Monday, May 19, 2014 3:56:41 PM
> Subject: Re: [ovirt-users] power outage: HA vms not restarted
>
> saw another strange moment in logs:
>
> for example, vm 'prod.gui': according to the logs the engine tried to restart it on host srv19 at 00:23:11 and it failed with the error 'Failed to run VM prod.gui: VdcBLLException: java.lang.NullPointerException (Failed with error ENGINE and code 5001)'
This is a bug and you should open a BZ for that and attach the related engine & vdsm logs
>
> in vdsm.log on srv19 I can see this error:
> StoragePoolUnknown: Unknown pool id, pool not connected: ('5849b030-626e-47cb-ad90-3ce782d831b3',)
> Thread-18::DEBUG::2014-05-17 00:23:13,022::task::869::TaskManager.Task::(_run) Task=`27a323de-ac42-4d68-992f-eb5c84d8baea`::Task._run: 27a323de-ac42-4d68-992f-eb5c84d8baea ('5849b030-626e-47cb-ad90-3ce782d831b3',) {} failed - stopping task
> Thread-18::DEBUG::2014-05-17 00:23:13,022::task::1194::TaskManager.Task::(stop) Task=`27a323de-ac42-4d68-992f-eb5c84d8baea`::stopping in state preparing (force False)
> Thread-18::DEBUG::2014-05-17 00:23:13,023::task::974::TaskManager.Task::(_decref) Task=`27a323de-ac42-4d68-992f-eb5c84d8baea`::ref 1 aborting True
> Thread-18::INFO::2014-05-17 00:23:13,023::task::1151::TaskManager.Task::(prepare) Task=`27a323de-ac42-4d68-992f-eb5c84d8baea`::aborting: Task is aborted: 'Unknown pool id, pool not connected' - code 309
> Thread-18::DEBUG::2014-05-17 00:23:13,024::task::1156::TaskManager.Task::(prepare) Task=`27a323de-ac42-4d68-992f-eb5c84d8baea`::Prepare: aborted: Unknown pool id, pool not connected
> Thread-18::DEBUG::2014-05-17 00:23:13,024::task::974::TaskManager.Task::(_decref) Task=`27a323de-ac42-4d68-992f-eb5c84d8baea`::ref 0 aborting True
>
> according to the engine database, storage pool '5849b030-626e-47cb-ad90-3ce782d831b3' is
datacenter 'CODD'.
>
> but - according to the engine logs, SPM start for datacenter CODD was run only at 00:23:27, i.e. _after_ the VM restart attempts
How did you conclude that?
> 2014-05-17 00:23:27,585 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (DefaultQuartzScheduler_Worker-43) [1bb8f3c0] starting spm on vds srv12, storage pool CODD, prevId 9, LVER 11
> 2014-05-17 00:23:27,610 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand] (DefaultQuartzScheduler_Worker-43) [1bb8f3c0] START, SpmStartVDSCommand(HostName = srv12, HostId = eac8d084-1be0-4e8d-a589-385f71fe3afa, storagePoolId = 5849b030-626e-47cb-ad90-3ce782d831b3, prevId=9, prevLVER=11, storagePoolFormatType=V3, recoveryMode=Manual, SCSIFencing=false), log id: 1844d391
>
> I thought no VM could be started until an SPM host was selected, so why in my case did the VM restarts occur before that?
> and what does 'pool not connected' mean? no SPM running, storage unavailable from some host, or something else?
In this case, storage unavailable from some host....
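The 'Unknown pool id, pool not connected' error discussed above can be pictured with a toy model. The class and method names below are hypothetical, not vdsm's real API: the idea is simply that a host only knows about storage pools it has explicitly connected, so a VM start that arrives before the host has (re)connected the pool after the outage fails immediately:

```python
# Toy model of vdsm's pool lookup -- hypothetical names, not the real vdsm API.
# A host tracks the pools it has connected; asking it to start a VM on an
# unconnected pool fails at once with StoragePoolUnknown, matching the
# "Unknown pool id, pool not connected" seen in vdsm.log on srv19.

class StoragePoolUnknown(Exception):
    pass

class Host:
    def __init__(self):
        self.connected_pools = set()

    def connect_pool(self, pool_id):
        self.connected_pools.add(pool_id)

    def start_vm(self, pool_id, vm_name):
        if pool_id not in self.connected_pools:
            raise StoragePoolUnknown(
                "Unknown pool id, pool not connected: (%r,)" % pool_id)
        return "%s started" % vm_name

host = Host()
pool = "5849b030-626e-47cb-ad90-3ce782d831b3"

try:
    host.start_vm(pool, "prod.gui")   # before the pool is connected: fails
except StoragePoolUnknown:
    pass

host.connect_pool(pool)               # after the pool is (re)connected
assert host.start_vm(pool, "prod.gui") == "prod.gui started"
```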

On 05/19/2014 05:10 PM, Eli Mesika wrote:
> ----- Original Message -----
>> From: "Yuriy Demchenko" <demchenko.ya@gmail.com>
>> To: "Eli Mesika" <emesika@redhat.com>
>> Cc: users@ovirt.org
>> Sent: Monday, May 19, 2014 3:56:41 PM
>> Subject: Re: [ovirt-users] power outage: HA vms not restarted
>>
>> saw another strange moment in logs:
>>
>> for example, vm 'prod.gui': according to the logs the engine tried to restart it on host srv19 at 00:23:11 and it failed with the error 'Failed to run VM prod.gui: VdcBLLException: java.lang.NullPointerException (Failed with error ENGINE and code 5001)'
> This is a bug and you should open a BZ for that and attach the related engine & vdsm logs
done: https://bugzilla.redhat.com/show_bug.cgi?id=1099134
>> in vdsm.log on srv19 I can see this error:
>> StoragePoolUnknown: Unknown pool id, pool not connected: ('5849b030-626e-47cb-ad90-3ce782d831b3',)
>>
>> according to the engine database, storage pool '5849b030-626e-47cb-ad90-3ce782d831b3' is datacenter 'CODD'.
>>
>> but - according to the engine logs, SPM start for datacenter CODD was run only at 00:23:27, i.e. _after_ the VM restart attempts
> How did you conclude that?
erm, am I wrong here? the restart attempt, per the logs, was at 00:23:11 and the SPM start at 00:23:27 - or am I wrong in the assumption that an SPM is needed for VM start?
>> I thought no VM could be started until an SPM host was selected, so why in my case did the VM restarts occur before that?
>> and what does 'pool not connected' mean? no SPM running, storage unavailable from some host, or something else?
> In this case, storage unavailable from some host....
shouldn't that host then be set into the 'non-operational' state and the restart attempt carried out by another host?
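On the Memory-filter rejections earlier in the thread ('filtered out by ... filter Memory' despite 256 GB hosts): a scheduling hard filter only passes hosts that can currently demonstrate enough free memory, and one plausible reading of the 00:23 failures is that the engine had no fresh memory statistics yet for hosts that had only just come up. A rough sketch, with hypothetical names rather than oVirt's actual scheduler code:

```python
# Rough sketch of a "Memory" hard filter -- hypothetical, not oVirt's code.
# free_mem_mb is None for a host whose stats have not been reported yet; such
# a host cannot prove it has room for the VM, so it is filtered out. If that
# holds for every candidate, scheduling fails the way the
# SCHEDULING_ALL_HOSTS_FILTERED_OUT message describes.

def memory_filter(hosts, vm_mem_mb):
    return [name for name, free_mem_mb in hosts
            if free_mem_mb is not None and free_mem_mb >= vm_mem_mb]

# Just after the blackout the engine may hold no fresh stats at all:
stale = [("srv1", None), ("srv19", None)]
assert memory_filter(stale, vm_mem_mb=4096) == []

# Once stats arrive, hosts with enough free memory pass the filter:
fresh = [("srv1", 250000), ("srv19", 1024)]
assert memory_filter(fresh, vm_mem_mb=4096) == ["srv1"]
```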

----- Original Message -----
From: "Yuriy Demchenko" <demchenko.ya@gmail.com>
To: "Eli Mesika" <emesika@redhat.com>
Cc: users@ovirt.org
Sent: Monday, May 19, 2014 2:13:02 PM
Subject: Re: [ovirt-users] power outage: HA vms not restarted
On 05/19/2014 02:36 PM, Eli Mesika wrote:
>> Hi Yuriy
>>
>> What I see is that the log for 2014-05-17 starts at 2014-05-17 00:23:03,
>> so I can not track the first interval you had mentioned (00:19 to 00:26)
> 00:23 is the time when the engine booted up after the first outage, that's
> why the logs start at 00:23:03
>> However, I can clearly see that at 2014-05-17 00:23:03 the engine was
>> restarted, and at 2014-05-17 00:23:09,423 we started to get connection
>> errors. We tried to solve the problem by doing soft-fencing (actually a
>> vdsmd service restart) on the problematic hosts, but ssh to the hosts
>> failed, so we tried to hard-fence the hosts (restart/reboot); however,
>> this fell within the configurable "quiet time" in which we prevent
>> fencing operations after an engine restart, which is set by default to
>> 5 min (the DisableFenceAtStartupInSec key in engine-config), and
>> therefore we skipped the fencing operation...
> my hosts usually boot slower than the engine (long bios checks + a random
> power-on delay of 120s), that's why the engine reports connect errors at
> first. however, in the logs i see that some of them were already up and
> the engine successfully contacted them:
So maybe you should put DisableFenceAtStartupInSec to a higher value
instead of the default 5 min...
> 2014-05-17 00:23:10,450 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-18) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: State was set to Up for host srv11.
> 2014-05-17 00:23:10,456 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-4) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: State was set to Up for host srv4.
> 2014-05-17 00:23:10,458 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-11) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: State was set to Up for host srv7.
> 2014-05-17 00:23:10,460 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-20) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: State was set to Up for host srv9.
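The "quiet time" Eli describes is easy to model. A minimal sketch, assuming a hypothetical helper `fencing_allowed` (not actual engine code): fencing requests inside the post-restart window are skipped, which is why hosts that boot slower than the engine may call for a larger value, set with something like `engine-config -s DisableFenceAtStartupInSec=600`:

```python
from datetime import datetime, timedelta

# Sketch of the "quiet time" described above: after an engine restart,
# fencing is skipped until DisableFenceAtStartupInSec seconds have passed,
# so hosts that merely boot slowly are not power-cycled. Hypothetical
# helper, not actual engine code.
DISABLE_FENCE_AT_STARTUP_SEC = 300  # default quoted in the thread: 5 minutes

def fencing_allowed(engine_start, now, quiet_sec=DISABLE_FENCE_AT_STARTUP_SEC):
    return now - engine_start >= timedelta(seconds=quiet_sec)

engine_start = datetime(2014, 5, 17, 0, 23, 3)  # engine restart, from the log
early_ok = fencing_allowed(engine_start, datetime(2014, 5, 17, 0, 25, 0))
late_ok = fencing_allowed(engine_start, datetime(2014, 5, 17, 0, 29, 0))
print(early_ok, late_ok)  # False True: fencing is skipped inside the window
```

Raising the window trades slower automatic HA recovery against fewer spurious reboots of hosts that are simply still in their BIOS checks.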
> and after 00:23:11 i saw no fencing-related messages, only vm restart
> attempts that failed with strange errors like:
> 'Failed with error ENGINE and code 5001'
> 'Candidate host srv1 (2a89e565-aa4e-4a19-82e3-e72e4edee111) was filtered
> out by VAR__FILTERTYPE__INTERNAL filter Memory'
> 'CanDoAction of action RunVm failed.
> Reasons:VAR__ACTION__RUN,VAR__TYPE__VM,ACTION_TYPE_FAILED_VDS_VM_MEMORY,VAR__ACTION__RUN,VAR__TYPE__VM,ACTION_TYPE_FAILED_VDS_VM_MEMORY,SCHEDULING_ALL_HOSTS_FILTERED_OUT,VAR__FILTERTYPE__INTERNAL,$hostName
> srv1,$filterName Memory,SCHEDULING_HOST_FILTERED_REASON'
Strange that you didn't get the message itself (this is only the message
key). The original message is:
Cannot Run VM. There are no available running Hosts with sufficient memory
in VM's Cluster.
So, it failed on the RunVm command validation, where no host with enough
memory to run the VM was found
> yea, but how can that be? all my hosts have 256gb ram, and only 2 vms are
> ram-hungry (128gb and 240gb); the others use much less (2-64gb)
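The Memory-filter failures quoted above amount to the scheduler filtering every candidate host out of the run list. A rough sketch of that filtering step - field names and numbers are invented for illustration, and the real engine also models guaranteed memory and overcommit:

```python
# Rough sketch of a scheduler memory filter like the one that rejected
# srv1 above: a host is filtered out when the VM's memory does not fit in
# the host's reported free memory. Field names and numbers are invented;
# the real engine also accounts for guaranteed memory and overcommit.
def memory_filter(hosts, vm_mem_mb):
    return [h for h in hosts if h["free_mem_mb"] >= vm_mem_mb]

hosts = [
    {"name": "srv1", "free_mem_mb": 1024},    # e.g. stale/low report after the outage
    {"name": "srv2", "free_mem_mb": 250000},  # a 256 GB host, mostly free
]
print([h["name"] for h in memory_filter(hosts, vm_mem_mb=128 * 1024)])  # ['srv2']
```

If hosts report stale or near-zero free memory right after a blackout, the filtered list comes back empty and RunVm fails validation with SCHEDULING_ALL_HOSTS_FILTERED_OUT - one plausible reading of these errors despite every host physically having 256 GB.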
>> For the first period, as I said, I can only guess that one of your hosts'
>> fencing attempts came after those 5 minutes, and therefore it was
>> rebooted and the HA VMs were freed to run on another host.
>> For the second period, on which I have logs, the host fencing failed due
>> to the required "quiet time", and in this situation the only thing you
>> can do in order to have the HA VMs running again is to right-click on
>> each host and press "Confirm that host has been rebooted"
> but i see in the logs that after the second period, at 10:14+, all hosts
> but one (srv5) were up and power management was verified successfully -
> isn't that enough for the engine to verify all ha vms are down and restart
> them?
How should the engine know that the host was rebooted??
The fact that power management was verified successfully is not enough in
order to run the VMs on another host.
As I see it, the fence commands that were intended to reboot the hosts
holding the VMs failed...
> but in like 30 secs after the engine set the holding hosts to
> non-responsive, they all went up - 'state was set to up for host' - so the
> engine can contact those hosts, query vm status and restart if needed,
> shouldn't it?
> 2014-05-17 10:09:23,978 INFO [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-48) Server failed to respond, vds_id = 7bf30c8b-7eb2-4b1e-ba0e-13093bca57bf, vds_name = srv9, vm_count = 0, spm_status = None, non-responsive_timeout (seconds) = 60, error = java.net.ConnectException: Connection refused
> 2014-05-17 10:09:40,143 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-42) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: State was set to Up for host srv9.

> 2014-05-17 10:09:23,980 INFO [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-51) Server failed to respond, vds_id = 57f58a56-a7ad-4917-b32d-f5dfaf326419, vds_name = srv10, vm_count = 0, spm_status = None, non-responsive_timeout (seconds) = 60, error = java.net.ConnectException: Connection refused
> 2014-05-17 10:09:43,214 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-64) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: State was set to Up for host srv10.

> 2014-05-17 10:09:24,029 INFO [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-55) Server failed to respond, vds_id = bd07bc6d-37e8-4b55-810c-f41c78f7a8ad, vds_name = srv11, vm_count = 0, spm_status = None, non-responsive_timeout (seconds) = 60, error = java.net.ConnectException: Connection refused
> 2014-05-17 10:09:43,239 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-63) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: State was set to Up for host srv11.

> 2014-05-17 10:09:23,991 INFO [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-50) Server failed to respond, vds_id = eac8d084-1be0-4e8d-a589-385f71fe3afa, vds_name = srv12, vm_count = 5, spm_status = Contending, non-responsive_timeout (seconds) = 82, error = java.net.ConnectException: Connection refused
> 2014-05-17 10:09:43,320 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-60) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: State was set to Up for host srv12.

> 2014-05-17 10:09:23,987 INFO [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-49) Server failed to respond, vds_id = 712e822b-d3c6-4cb7-a6f1-58da5039bda1, vds_name = srv13, vm_count = 4, spm_status = None, non-responsive_timeout (seconds) = 62, error = java.net.ConnectException: Connection refused
> 2014-05-17 10:09:44,934 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-58) [77463af9] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: State was set to Up for host srv13.

> 2014-05-17 10:09:24,016 INFO [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-53) Server failed to respond, vds_id = 9d99c8eb-754e-4446-8a0f-af7327d19989, vds_name = srv14, vm_count = 0, spm_status = None, non-responsive_timeout (seconds) = 60, error = java.net.ConnectException: Connection refused
> 2014-05-17 10:09:43,930 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-61) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: State was set to Up for host srv14.

> 2014-05-17 10:09:24,019 INFO [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-54) Server failed to respond, vds_id = 3ca82f5b-1343-451f-81d0-d2ed111b5f51, vds_name = srv15, vm_count = 0, spm_status = None, non-responsive_timeout (seconds) = 60, error = java.net.ConnectException: Connection refused
> 2014-05-17 10:09:40,285 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-50) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: State was set to Up for host srv15.

> 2014-05-17 10:09:23,996 INFO [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-45) Server failed to respond, vds_id = c9d9735b-d75f-4026-afcd-94491ee66a4f, vds_name = srv16, vm_count = 0, spm_status = None, non-responsive_timeout (seconds) = 60, error = java.net.ConnectException: Connection refused
> 2014-05-17 10:09:40,188 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-47) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: State was set to Up for host srv16.

> 2014-05-17 10:09:23,963 INFO [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-47) Server failed to respond, vds_id = 8daa3751-f347-41eb-b6a6-a27f4fda2def, vds_name = srv17, vm_count = 0, spm_status = None, non-responsive_timeout (seconds) = 60, error = java.net.ConnectException: Connection refused
> 2014-05-17 10:09:46,358 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-75) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: State was set to Up for host srv17.

> 2014-05-17 10:09:23,949 INFO [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-46) Server failed to respond, vds_id = 2e833c27-fd9f-4750-ab84-2fb629b66671, vds_name = srv18, vm_count = 0, spm_status = None, non-responsive_timeout (seconds) = 60, error = java.net.ConnectException: Connection refused
> 2014-05-17 10:10:25,643 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-7) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: State was set to Up for host srv18.

> 2014-05-17 10:09:23,991 INFO [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-52) Server failed to respond, vds_id = e7cedec5-9335-45bf-ae58-0744a9b93f41, vds_name = srv19, vm_count = 3, spm_status = None, non-responsive_timeout (seconds) = 61, error = java.net.ConnectException: Connection refused
> 2014-05-17 10:09:40,260 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-54) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: State was set to Up for host srv19.

> 2014-05-17 10:11:56,946 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-75) [dbc315d] Correlation ID: 73469dd6, Call Stack: null, Custom Event ID: -1, Message: Host srv17 power management was verified successfully.
> in fact, at ~17:55 my colleague restarted the engine, forced spm selection
> and started all vms - all started without errors, and he didn't have to
> click 'confirm host has been rebooted'
But here you restarted the VMs on the original host on which they had been
running, after the host was UP - this will work.
The problem is running a HA VM on another host; to do so, we must ensure
that the host that ran the VMs until now is not running them anymore.
If the fence operation failed, we can say nothing about the host status,
and then, as I said, you should manually approve that the host was rebooted,
in order to prevent VMs running on multiple hosts, which would lead to data
corruption.
> but shouldn't the engine restart the corresponding vms after the holding
> host came up (without manual fence)? because they are up - so the engine
> can query them about running/not-running vms and get the actual state of
> the vms - running or not.
> the only host down at that point was srv5, which held only 1 vm - and it
> was correctly put in the 'unknown' state; the other vms were just 'down'
> until we manually started them
Are you sure that those VMs are defined as Highly Available VMs ???
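Eli's argument boils down to one safety invariant: never start an HA VM on a second host until the engine knows for certain the first host is no longer running it - via a successful fence or the admin's explicit "Confirm that host has been rebooted". A toy model of that rule (function names are hypothetical):

```python
# Toy model of the safety invariant Eli describes: an HA VM may be started
# on another host only once the engine is certain the previous host is not
# still running it - a successful fence, or the admin's explicit
# "Confirm that host has been rebooted". Seeing the host as unreachable is
# not enough, since the VM could still be writing to the shared FC LUNs.
def may_restart_elsewhere(fence_succeeded, manually_confirmed):
    return fence_succeeded or manually_confirmed

print(may_restart_elsewhere(False, False))  # False: risk of the VM running twice
print(may_restart_elsewhere(False, True))   # True: admin confirmed the reboot
```

This is why a verified power-management agent alone is not sufficient: it proves the engine can talk to the fence device, not that the host was actually power-cycled.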

On 05/19/2014 04:56 PM, Eli Mesika wrote:
> Are you sure that those VMs are defined as Highly Available VMs ???
yes, i'm sure. double-checked in the web interface, plus log entries like:

2014-05-17 00:23:10,565 INFO [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-14) vm prod.gui running in db and not running in vds - add to rerun treatment. vds srv19
2014-05-17 00:23:10,909 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-14) [2989840c] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Highly Available VM prod.gui failed. It will be restarted automatically.
2014-05-17 00:23:10,911 INFO [org.ovirt.engine.core.bll.VdsEventListener] (DefaultQuartzScheduler_Worker-14) [2989840c] Highly Available VM went down. Attempting to restart. VM Name: prod.gui, VM Id:bbb7a605-d511-461d-99d2-c5a5bf8d9958

----- Original Message -----
From: "Yuriy Demchenko" <demchenko.ya@gmail.com>
To: "Eli Mesika" <emesika@redhat.com>
Cc: users@ovirt.org
Sent: Monday, May 19, 2014 4:01:04 PM
Subject: Re: [ovirt-users] power outage: HA vms not restarted
> yes, i'm sure. double checked in webinterface, plus log entries like:
May this be related? I think that in your case the host came up very fast while the fencing operation had already started .... https://bugzilla.redhat.com/show_bug.cgi?id=1064860

On 05/19/2014 05:13 PM, Eli Mesika wrote:
> May this be related? I think that in your case the host came up very fast while the fencing operation had already started .... https://bugzilla.redhat.com/show_bug.cgi?id=1064860
doesn't seem so, as the vm wasn't put into the 'unknown' state and srv19 was already up when the engine booted, so no fence attempt was ever made for it

----- Original Message -----
From: "Yuriy Demchenko" <demchenko.ya@gmail.com>
To: "Eli Mesika" <emesika@redhat.com>
Cc: users@ovirt.org
Sent: Monday, May 19, 2014 4:29:54 PM
Subject: Re: [ovirt-users] power outage: HA vms not restarted
> doesn't seem so, as the vm wasn't put into the 'unknown' state and srv19
> was already up when the engine booted, so no fence attempt was ever made
> for it
Please open a BZ for that with all relevant logs and info attached. Also, the NPE is worth a separate BZ. Thanks

Bug already fixed in 3.3 https://bugzilla.redhat.com/show_bug.cgi?id=1074478 and 3.4 https://bugzilla.redhat.com/show_bug.cgi?id=1078553
Thanks.
_______________________________________________
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users

Hi,
sorry for the delay, guess I'll plan to upgrade to 3.4 soon.
Eli, Artyom, Omer - big thanks for your valuable help, it was important to me to understand what went wrong in that incident.
Yuriy Demchenko
On 05/19/2014 06:26 PM, Artyom Lukianov wrote:
Bug already fixed in 3.3 https://bugzilla.redhat.com/show_bug.cgi?id=1074478 and 3.4 https://bugzilla.redhat.com/show_bug.cgi?id=1078553 Thanks.
participants (3)
- Artyom Lukianov
- Eli Mesika
- Yuriy Demchenko