Hello all,
we are currently in the process of evaluating oVirt as a basis for our
new virtualization environment. As far as our evaluation has progressed,
it seems to be the way to go, but when testing the high availability
features I ran into a serious problem:
Our testing setup looks like this: 2 hosts on Dell R210 and R210II machines,
a separate machine running the management application in JBoss and providing
storage space through NFS. Under normal conditions everything works fine:
I can migrate machines between the two nodes, I can add a third node,
access everything by VNC, monitor the VMs really nicely, and the power
management feature of the R210s works just fine.
Then, when simulating the loss of a host by pulling the plug on the machine
(yes, that is kind of a crude check), some things seem to go terribly wrong:
the system detects that the host is unresponsive and assumes it is down. But
the host happens to be the SPM, and the other host does not take over this
function. This leaves the whole cluster in an unresponsive state and my
datacenter is gone. I tracked down the problem in the log files to the point
where the engine tries to migrate the SPM to another node:
2012-09-20 07:54:40,836 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand]
(QuartzScheduler_Worker-60) SPM selection - vds seems as spm node03
2012-09-20 07:54:40,837 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand]
(QuartzScheduler_Worker-60) spm vds is non responsive, stopping spm selection.
2012-09-20 07:54:44,344 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand]
(QuartzScheduler_Worker-51) XML RPC error in command GetCapabilitiesVDS ( Vds: node03 ),
the error was: java.util.concurrent.ExecutionException:
java.lang.reflect.InvocationTargetException, NoRouteToHostException: Keine Route zum
Zielrechner [No route to host]
2012-09-20 07:54:47,345 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand]
(QuartzScheduler_Worker-47) XML RPC error in command GetCapabilitiesVDS ( Vds: node03 ),
the error was: java.util.concurrent.ExecutionException:
java.lang.reflect.InvocationTargetException, NoRouteToHostException: Keine Route zum
Zielrechner [No route to host]
2012-09-20 07:54:50,869 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand]
(QuartzScheduler_Worker-69) hostFromVds::selectedVds - node04, spmStatus Free, storage
pool ingenit
2012-09-20 07:54:50,892 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand]
(QuartzScheduler_Worker-69) SPM Init: could not find reported vds or not up -
pool:ingenit vds_spm_id: 2
2012-09-20 07:54:50,905 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand]
(QuartzScheduler_Worker-69) SPM selection - vds seems as spm node03
As far as I understand these logs, the engine detects node03 not being
responsive, starts electing a new SPM but does not find node04. That is
strange as the host is online, pingable and worked just fine as part of
the cluster.
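In case anyone wants to follow the SPM election in a longer log, here is a
minimal sketch of how the relevant entries could be filtered out. It assumes
entries start with a timestamp in the format shown above and that lines
without one are wrapped continuations (as in my paste); the function name
spm_entries is just something I made up for illustration.

```python
import re

# Assumption: each log entry starts with a timestamp such as
# "2012-09-20 07:54:40,836"; lines without one continue the previous entry.
ENTRY_START = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3} ")


def spm_entries(lines):
    """Return the IrsBrokerCommand entries, with wrapped lines joined."""
    entries = []
    for line in lines:
        line = line.rstrip("\n")
        if ENTRY_START.match(line):
            # A new entry begins here.
            entries.append(line)
        elif entries:
            # Continuation of the previous entry (wrapped line).
            entries[-1] += " " + line.strip()
    # Keep only the entries logged by the SPM selection code.
    return [entry for entry in entries if "IrsBrokerCommand" in entry]
```

Passing an open file object for the engine log to spm_entries() and printing
the result leaves only the SPM selection messages, without the repeated
GetCapabilitiesVDS errors.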
What I can do to remedy the situation is use the management interface to
set "Confirm Host has been rebooted" and switch the host into maintenance
mode after that. Then the responsive node takes over and the VMs are
migrated, too.
Has anyone experienced a similar problem? Is this by design, so that killing
off the SPM is a bad coincidence that always requires manual intervention?
I would hope not :-)
I tried to google some answers, but aside from a thread in May that did
not help I came up empty.
Thanks in advance for all the help...
Kind regards from Germany,
Marc
--
________________________________________________________________________
Dipl.-Inform. Marc-Christian Schröer schroeer(a)ingenit.com
Geschäftsführer / CEO
----------------------------------------------------------------------
ingenit GmbH & Co. KG Tel. +49 (0)231 58 698-120
Emil-Figge-Strasse 76-80 Fax. +49 (0)231 58 698-121
D-44227 Dortmund
www.ingenit.com
Registergericht: Amtsgericht Dortmund, HRA 13 914
Gesellschafter : Thomas Klute, Marc-Christian Schröer
________________________________________________________________________