[Users] SPM not selected after host failed

Itamar Heim iheim at redhat.com
Thu Sep 20 13:34:13 UTC 2012


On 09/20/2012 09:02 AM, "Marc-Christian Schröer | ingenit GmbH & Co. KG" 
wrote:
> Hello all,
>
> we are currently in the process of evaluating oVirt as a basis for our
> new virtualization environment. As far as our evaluation has progressed
> it seems to be the way to go, but when testing the high availability
> features I ran into a serious problem:
>
> Our testing setup looks like this: 2 hosts on Dell R210 and R210II machines,
> a separate machine running the managing application in JBoss and providing
> storage space through NFS. Under normal conditions everything works fine:
> I can migrate machines between the two nodes, I can add a third node,
> access everything by VNC, monitor the VMs really nicely, the power management
> feature of the R210s works just fine.
>
> Then, when simulating the loss of a host by pulling the plug on the machine,
> (yes, that is kind of a crude check) some things seem to go terribly wrong:
> the system detects the host being unresponsive and assumes it is down. But
> the host happens to be the SPM and the other does not take over this function.
> This leaves the whole cluster in an unresponsive state and my datacenter
> is gone. I tracked down the problem in the log files to the point where
> the engine tries to migrate the SPM to another node:
>
> 2012-09-20 07:54:40,836 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-60) SPM selection - vds seems as spm node03
> 2012-09-20 07:54:40,837 WARN  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-60) spm vds is non responsive, stopping spm selection.
> 2012-09-20 07:54:44,344 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand] (QuartzScheduler_Worker-51) XML RPC error in command GetCapabilitiesVDS ( Vds: node03 ),
> the error was: java.util.concurrent.ExecutionException: java.lang.reflect.InvocationTargetException, NoRouteToHostException: Keine Route zum Zielrechner
> 2012-09-20 07:54:47,345 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand] (QuartzScheduler_Worker-47) XML RPC error in command GetCapabilitiesVDS ( Vds: node03 ),
> the error was: java.util.concurrent.ExecutionException: java.lang.reflect.InvocationTargetException, NoRouteToHostException: Keine Route zum Zielrechner
> 2012-09-20 07:54:50,869 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-69) hostFromVds::selectedVds - node04, spmStatus Free, storage
> pool ingenit
> 2012-09-20 07:54:50,892 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-69) SPM Init: could not find reported vds or not up -
> pool:ingenit vds_spm_id: 2
> 2012-09-20 07:54:50,905 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-69) SPM selection - vds seems as spm node03
>
> As far as I understand these logs, the engine detects node03 not being
> responsive, starts electing a new SPM but does not find node04. That is
> strange as the host is online, pingable and worked just fine as part of
> the cluster.
>
> What I can do to remedy the situation is to use the management interface to
> set "Confirm Host has been rebooted" and switch the host into maintenance
> mode after that. Then the responsive node takes over and the VMs are
> being migrated, too.
>
> Has anyone experienced a similar problem? Is this by design, so that killing
> off the SPM is a bad coincidence that always requires manual intervention?
> I would hope not :-)
>
> I tried to google some answers, but aside from a thread in May that did
> not help I came up empty.
>
> Thanks in advance for all the help...
>
> Kind regards from Germany,
>    Marc
>

is power management configured on both hosts?
since the non-responsive node happened to be the SPM, it must be fenced.
engine should do this automatically (this is what you did manually
by 'confirm host has been rebooted').
but engine can only do this automatically if power management is 
configured on both hosts.
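
to make the rule above concrete, here is a minimal sketch of the decision in Python. this is not engine code: the Host class, its fields and next_step_for_spm are hypothetical names made up for illustration, and the check is simplified to "power management configured on every host in the list", as described above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Host:
    # Hypothetical model of a host; these fields are illustrative only
    # and do not mirror the engine's real data structures.
    name: str
    responsive: bool
    is_spm: bool
    power_management_configured: bool


def next_step_for_spm(hosts: List[Host]) -> str:
    """Describe what has to happen before a new SPM can be selected."""
    spm = next((h for h in hosts if h.is_spm), None)
    if spm is None:
        # No host currently holds the SPM role, so selection can proceed.
        return "select a new SPM"
    if spm.responsive:
        return "nothing to do"
    # The SPM is non-responsive: it must be fenced before the role can move.
    if all(h.power_management_configured for h in hosts):
        return f"fence {spm.name} automatically, then select a new SPM"
    # Without power management, someone has to confirm the host is really down.
    return f"wait for manual 'Confirm Host has been rebooted' on {spm.name}"


if __name__ == "__main__":
    # Host names taken from the logs in this thread; the flags are assumed.
    hosts = [
        Host("node03", responsive=False, is_spm=True,
             power_management_configured=False),
        Host("node04", responsive=True, is_spm=False,
             power_management_configured=True),
    ]
    print(next_step_for_spm(hosts))
```

with power management missing on one host, as assumed in the example, the sketch ends in the manual branch, which would match the behaviour you described.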



