----- Original Message -----
From: "Yuriy Demchenko" <demchenko.ya(a)gmail.com>
To: users(a)ovirt.org
Sent: Monday, May 19, 2014 11:34:15 AM
Subject: [ovirt-users] power outage: HA vms not restarted
Hi,
I'm running ovirt-3.2.2-el6 on 18 el6 hosts with FC SAN storage and 46 HA
VMs in 2 datacenters (3 hosts use different storage with no
connectivity to the first storage, which is why there is a second DC).
Recently (2014-05-17) I had a double power outage: the first blackout was at
00:16 and power came back at ~00:19; the second blackout was at 00:26 and
power came back at 10:06.
When everything was finally up again (after approx. 10:16), only 2 of the
46 VMs had been restarted.
Browsing the engine log, I saw failed restart attempts for almost all VMs
after the first blackout, with the error 'Failed with error ENGINE and code
5001', but after the second blackout I saw no attempts to restart VMs, and
the only error was 'connect timeout' (probably to srv5; that host
physically died after the blackouts).
I can't figure out why the HA VMs were not restarted. Please advise.
Engine and (supposedly) SPM host logs are attached.
Hi Yuriy
What I see is that the log for 2014-05-17 starts at 2014-05-17 00:23:03, so I cannot
track the first interval you mentioned (00:19 to 00:26).
However, I can clearly see that at 2014-05-17 00:23:03 the engine was restarted and at
2014-05-17 00:23:09,423 we started to get connection errors.
We tried to solve the problem by soft-fencing the problematic hosts (actually a vdsmd
service restart), but ssh to the hosts failed, so we tried to hard-fence them
(restart/reboot). That attempt fell within the configurable "quiet time" during which
we prevent fencing operations after an engine restart, which is set by default to
5 min (DisableFenceAtStartupInSec key in engine-config), and therefore
we skipped the fencing operation...
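To make the timing concrete, here is a minimal sketch of the quiet-time check, using the timestamps from this thread. This is not the engine's actual code: the function name fencing_allowed and the exact boundary handling (>=) are assumptions for illustration only; the 300-second value is just the 5 min default mentioned above.

```python
from datetime import datetime, timedelta

# Default quiet time (DisableFenceAtStartupInSec = 5 min, per the text above).
QUIET_TIME = timedelta(seconds=300)


def fencing_allowed(engine_start, attempt_time, quiet_time=QUIET_TIME):
    """Hypothetical helper: True if a fence attempt is outside the quiet time.

    The >= comparison at the boundary is an assumption, not engine behaviour.
    """
    return attempt_time - engine_start >= quiet_time


engine_start = datetime(2014, 5, 17, 0, 23, 3)     # engine restart seen in the log
first_errors = datetime(2014, 5, 17, 0, 23, 9)     # connection errors begin
second_blackout = datetime(2014, 5, 17, 0, 26, 0)  # second outage, as reported

print(fencing_allowed(engine_start, first_errors))     # False
print(fencing_allowed(engine_start, second_blackout))  # False
print(engine_start + QUIET_TIME)                       # 2014-05-17 00:28:03
```

Under these assumptions the quiet time would have lasted until 00:28:03, which is after the second blackout reported at 00:26, so any fence attempt made in that window would have been skipped.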
For the first period, as I said, I can only guess that the fencing attempt for one of
your hosts came after those 5 minutes, and therefore it was rebooted and the HA VMs
were freed to run on another host.
For the second period, for which I have logs, the host fencing failed due to the required
"quiet time", and in this situation the only thing you can do to get the
HA VMs running again is to
right-click on each host and press "Confirm that host has been rebooted".
Regards
Eli
--
Yuriy Demchenko