[ovirt-users] host status "Non Operational" - how to diagnose & fix?

Wed Jan 6 12:59:20 UTC 2016

Hi Will,

The engine relies on the status reported by VDSM for the management network
'ovirtmgmt' and for its underlying nics/vlans.

To see how the 'ovirtmgmt' network is configured, please run the following
command on the host and paste its output:
vdsClient -s 0 getVdsCaps

In addition, to see the reported status of the networks, please run the
following on the host and paste its output:
vdsClient -s 0 getVdsStats

That should indicate which nic vdsm reports as down for ovirtmgmt.
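
If it is easier to grab both outputs in one go, here is a minimal sketch
that runs the two commands above and joins their output for pasting. It
assumes Python 3.5+ is available on the host (for subprocess.run); the
names collect and format_report are only illustrative, and nothing is
assumed about the format of what vdsClient prints.

```python
import subprocess

# The two commands quoted in this message; nothing else is assumed about vdsClient.
COMMANDS = {
    "getVdsCaps": ["vdsClient", "-s", "0", "getVdsCaps"],
    "getVdsStats": ["vdsClient", "-s", "0", "getVdsStats"],
}


def collect(commands=COMMANDS, run=subprocess.run):
    """Run each command and return {name: combined stdout/stderr text}."""
    report = {}
    for name, argv in commands.items():
        try:
            done = run(argv, stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
                       universal_newlines=True)
            report[name] = done.stdout
        except OSError as exc:  # e.g. vdsClient is not installed on this machine
            report[name] = "could not run {}: {}".format(" ".join(argv), exc)
    return report


def format_report(report):
    """Join the outputs under simple headers so they can be pasted into a reply."""
    return "\n".join("==== {} ====\n{}".format(name, out)
                     for name, out in report.items())


if __name__ == "__main__":
    print(format_report(collect()))
```

Run it as root on the affected host and paste whatever it prints.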

On Wed, Jan 6, 2016 at 11:15 AM, Eliraz Levi <elevi at redhat.com> wrote:

> Hi Will, how are you?
> The log first points to certificate issues:
> 2016-01-04 00:02:11,259 ERROR
> [org.ovirt.engine.core.vdsbroker.jsonrpc.JsonRpcVdsServer]
> (DefaultQuartzScheduler_Worker-81) [] Failed to get peer certification for
> host 'ovirt-node-02': SSL session is invalid
> 2016-01-04 00:02:11,259 ERROR
> [org.ovirt.engine.core.bll.CertificationValidityChecker]
> (DefaultQuartzScheduler_Worker-81) [] Failed to retrieve peer
> certifications for host 'ovirt-node-02'
>
> So the first thing we should do is try to solve this problem.
> Please try to reinstall the host.
> Thanks.
> Eliraz :)
>
> ----- Original Message -----
> From: "Will Dennis" <wdennis at nec-labs.com>
> To: "Eliraz Levi" <elevi at redhat.com>, "users" <users at ovirt.org>
> Sent: Tuesday, 5 January, 2016 5:46:23 AM
> Subject: Re: [ovirt-users] host status "Non Operational" - how to diagnose
> & fix?
>
> I must admit I’m getting a bit weary of fighting oVirt problems at this
> point… Before I move on to deploying any VMs onto my new infra, I’d like to
> get the base infra working…
>
> I’m still experiencing a “Non Operational” problem on my “ovirt-node-02”
> host:
>
> http://s1096.photobucket.com/user/willdennis/media/ovirt-node-02_problem.png.html
>
> I have pored thru the logs (all the engine logs, plus the syslogs from the
> engine VM and my three hypervisor/storage hosts) and I can’t pin down why
> the one node is having a problem… Of course with how voluminous all these
> logs are, it’s kind of like looking for a needle in a haystack, and I’m not
> even sure what the needle looks like, or if it’s even a needle :-/
>
> I have also rebooted this host in the past few days; this did not fix the
> problem either.
>
> Note that in the screenshot I posted above, the webadmin hosts screen
> says that -node-01 has one VM running, and the others 0… You’d think that
> would be the HE VM running on there, but it’s actually on -node-02:
>
> $ ansible istgroup-ovirt -f 1 -i prod -u root -m shell -a "hosted-engine
> --vm-status | grep -e '^Hostname' -e '^Engine'"
> ovirt-node-01 | success | rc=0 >>
> Hostname                           : ovirt-node-01
> Engine status                      : {"reason": "bad vm status", "health":
> "bad", "vm": "down", "detail": "down"}
> Hostname                           : ovirt-node-02
> Engine status                      : {"health": "good", "vm": "up",
> "detail": "up"}
> Hostname                           : ovirt-node-03
> Engine status                      : {"reason": "vm not running on this
> host", "health": "bad", "vm": "down", "detail": "unknown"}
>
> ovirt-node-02 | success | rc=0 >>
> Hostname                           : ovirt-node-01
> Engine status                      : {"reason": "bad vm status", "health":
> "bad", "vm": "down", "detail": "down"}
> Hostname                           : ovirt-node-02
> Engine status                      : {"health": "good", "vm": "up",
> "detail": "up"}
> Hostname                           : ovirt-node-03
> Engine status                      : {"reason": "vm not running on this
> host", "health": "bad", "vm": "down", "detail": "unknown"}
>
> ovirt-node-03 | success | rc=0 >>
> Hostname                           : ovirt-node-01
> Engine status                      : {"reason": "bad vm status", "health":
> "bad", "vm": "down", "detail": "down"}
> Hostname                           : ovirt-node-02
> Engine status                      : {"health": "good", "vm": "up",
> "detail": "up"}
> Hostname                           : ovirt-node-03
> Engine status                      : {"reason": "vm not running on this
> host", "health": "bad", "vm": "down", "detail": "unknown"}
>
> So it looks like the webadmin UI is wrong as well…
>
> It would be awesome if the UI would give a reason for the “Non
> Operational” status somehow… Or if there was a troubleshooter that could be
> used to analyze the problem… As it is, being so new to all of this, I am
> completely at the list’s mercy to figure this out.
>
> This software has such promise, so I’ll keep working thru these issues,
> but it sure hasn’t been a smooth ride so far…
>
>
> On Jan 4, 2016, at 7:54 AM, Will Dennis <wdennis at nec-labs.com> wrote:
>
> I put all of the engine logs up there now… Try engine.log-20160103.gz
> http://i1096.photobucket.com/albums/g330/willdennis/ovirt-node-02_problem.png
>
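
Regarding the hosted-engine --vm-status output quoted above: a small sketch
like the following can pick out which host reports the engine VM as up,
independent of what the webadmin shows. It assumes each "Engine status"
value is on a single line in the real output (the wrapping in the quote
looks like a mail artifact); the function name engine_vm_hosts is
hypothetical, and the sample text is copied from the quoted output.

```python
import json


def engine_vm_hosts(text):
    """Return the hostnames whose 'Engine status' reports "vm": "up"."""
    up_hosts = []
    host = None
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a "key : value" line
        key = key.strip()
        if key == "Hostname":
            host = value.strip()
        elif key == "Engine status" and host is not None:
            status = json.loads(value)  # the value is a JSON object
            if status.get("vm") == "up":
                up_hosts.append(host)
    return up_hosts


SAMPLE = """\
Hostname                           : ovirt-node-01
Engine status                      : {"reason": "bad vm status", "health": "bad", "vm": "down", "detail": "down"}
Hostname                           : ovirt-node-02
Engine status                      : {"health": "good", "vm": "up", "detail": "up"}
Hostname                           : ovirt-node-03
Engine status                      : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
"""

print(engine_vm_hosts(SAMPLE))  # ['ovirt-node-02']
```

On a live host the text would come from the same grep pipeline Will used,
rather than from the embedded sample.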

-- 
Regards,
Moti