Re: [ovirt-users] host status "Non Operational" - how to diagnose & fix?

I put all of the engine logs up there now… Try engine.log-20160103.gz
On Jan 4, 2016, at 7:48 AM, Eliraz Levi <elevi@redhat.com> wrote:
OK, thanks :) It looks like you didn't press refresh capabilities; I can't learn a lot from this log. Can you refresh the host's capabilities and then send the log? Thanks :) Eliraz.
----- Original Message ----- From: "Will Dennis" <wdennis@nec-labs.com> To: "Eliraz Levi" <elevi@redhat.com> Sent: Monday, 4 January, 2016 2:22:03 PM Subject: RE: [ovirt-users] host status "Non Operational" - how to diagnose & fix?
If you try it again, should work now... Damn hackers...
-----Original Message----- From: Eliraz Levi [elevi@redhat.com] Sent: Monday, January 04, 2016 07:17 AM Eastern Standard Time To: Will Dennis Subject: Re: [ovirt-users] host status "Non Operational" - how to diagnose & fix?
Hi Will :) The link is broken. Can you please send a valid one to the list? thanks :) Eliraz.
----- Original Message ----- From: "Will Dennis" <wdennis@nec-labs.com> To: "Eliraz Levi" <elevi@redhat.com> Sent: Sunday, 3 January, 2016 8:23:59 PM Subject: Re: [ovirt-users] host status "Non Operational" - how to diagnose & fix?
Digital Ocean droplet and Python SimpleHTTPServer FTW ;) http://c7-01.thiscant.fail
On Jan 3, 2016, at 9:33 AM, Eliraz Levi <elevi@redhat.com> wrote:
vdsClient output: http://fpaste.org/306666/82858714/
The engine.log is very large (4496 lines) so cannot fpaste… Is there a file upload service that can be used to share these sorts of things with you?
Hi Will, how are you? Perhaps you can upload the log to some sort of cloud storage, say Google, and share the URL? I think it will be the fastest way around. Thanks :) Please send the URL to the mailing list so everybody will be able to follow. Cheers! Eliraz :)
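A note on the ad-hoc sharing setup mentioned above (a Digital Ocean droplet plus Python's SimpleHTTPServer): it amounts to a one-liner. A minimal sketch, assuming Python 2 on the droplet and that the logs have been copied into the directory being served:

  # serve the current directory over HTTP (Python 2's built-in module)
  cd /srv/ovirt-logs               # hypothetical directory holding the uploaded logs
  python -m SimpleHTTPServer 80    # port 80 needs root; any high port works too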

I must admit I’m getting a bit weary of fighting oVirt problems at this point… Before I move on to deploying any VMs onto my new infra, I’d like to get the base infra working…

I’m still experiencing a “Non Operational” problem on my “ovirt-node-02” host: http://s1096.photobucket.com/user/willdennis/media/ovirt-node-02_problem.png...

I have pored thru the logs (all the engine logs, plus the syslogs from the engine VM and my three hypervisor/storage hosts) and I can’t pin down why the one node is having a problem… Of course, with how voluminous all these logs are, it’s kind of like looking for a needle in a haystack, and I’m not even sure what the needle looks like, or if it’s even a needle :-/

I have also rebooted this host in past days; this also did not fix the problem.

Note that on the screenshot I posted above, the webadmin hosts screen says that -node-01 has one VM running, and the others 0… You’d think that would be the HE VM running on there, but it’s actually on -node-02:

$ ansible istgroup-ovirt -f 1 -i prod -u root -m shell -a "hosted-engine --vm-status | grep -e '^Hostname' -e '^Engine'"
ovirt-node-01 | success | rc=0 >>
Hostname      : ovirt-node-01
Engine status : {"reason": "bad vm status", "health": "bad", "vm": "down", "detail": "down"}
Hostname      : ovirt-node-02
Engine status : {"health": "good", "vm": "up", "detail": "up"}
Hostname      : ovirt-node-03
Engine status : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}

ovirt-node-02 | success | rc=0 >>
Hostname      : ovirt-node-01
Engine status : {"reason": "bad vm status", "health": "bad", "vm": "down", "detail": "down"}
Hostname      : ovirt-node-02
Engine status : {"health": "good", "vm": "up", "detail": "up"}
Hostname      : ovirt-node-03
Engine status : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}

ovirt-node-03 | success | rc=0 >>
Hostname      : ovirt-node-01
Engine status : {"reason": "bad vm status", "health": "bad", "vm": "down", "detail": "down"}
Hostname      : ovirt-node-02
Engine status : {"health": "good", "vm": "up", "detail": "up"}
Hostname      : ovirt-node-03
Engine status : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}

So it looks like the webadmin UI is wrong as well… It would be awesome if the UI would give a reason for the “Non Operational” status somehow… Or if there was a troubleshooter that could be used to analyze the problem… As it is, being so new to all of this, I am completely at the list’s mercy to figure this out. This software has such promise, so I’ll keep working thru these issues, but it sure hasn’t been a smooth ride so far…

On Jan 4, 2016, at 7:54 AM, Will Dennis <wdennis@nec-labs.com> wrote:
I put all of the engine logs up there now… Try engine.log-20160103.gz
http://i1096.photobucket.com/albums/g330/willdennis/ovirt-node-02_problem.pn...

On 5-1-2016 4:46, Will Dennis wrote:
I must admit I’m getting a bit weary of fighting oVirt problems at this point… Before I move on to deploying any VMs onto my new infra, I’d like to get the base infra working…
I’m still experiencing a “Non Operational” problem on my “ovirt-node-02” host: http://s1096.photobucket.com/user/willdennis/media/ovirt-node-02_problem.png...
What you can do to get the logs down to a minimum is to do the following:
- put node-02 in maintenance
- on the host which is down (node-02): cd /var/log/vdsm; :>vdsm.log   (this truncates the log)
- activate node-02
- if it goes to non-operational: cp vdsm.log vdsm.log.error
This will give you a much smaller log and maybe the error will be more visible.

You could do this on the engine too:
- cd /var/log/ovirt-engine
- :>engine.log
- activate
- cp engine.log engine.log.error

Regards,
Joop
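The same procedure, collected into one place as a rough sketch (assuming node-02 is the affected host and the default oVirt log locations; the host part runs as root on node-02, the engine part on the hosted-engine VM):

  # on node-02, while the host is in maintenance
  cd /var/log/vdsm
  :> vdsm.log                     # truncate the current log
  # activate node-02 from the webadmin UI and wait for it to go Non Operational, then:
  cp vdsm.log vdsm.log.error      # keep just the activation attempt

  # on the engine VM, the same idea for the engine side
  cd /var/log/ovirt-engine
  :> engine.log
  # activate the host again, then:
  cp engine.log engine.log.error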

Feel like I’m in a bit of an echo chamber here… Is anyone out there? ;) Or have I worn out the oVirt crew?

Anyhow, not sure if this is a cause, or an effect, but I noticed tonight that the data storage domain (which I’m using Gluster for in a hyperconverged way) is not mounted on the problem hypervisor host…

$ ansible istgroup-ovirt -f 1 -i prod -u root -m shell -a "df -h | grep ':'"
ovirt-node-01 | success | rc=0 >>
localhost:/engine                   1.9T  3.0G  1.9T  1% /rhev/data-center/mnt/glusterSD/localhost:_engine
ovirt-node-01.nec-labs.com:/vmdata  3.7T   70M  3.7T  1% /rhev/data-center/mnt/glusterSD/ovirt-node-01.nec-labs.com:_vmdata

ovirt-node-02 | success | rc=0 >>
localhost:/engine                   1.9T  3.0G  1.9T  1% /rhev/data-center/mnt/glusterSD/localhost:_engine

ovirt-node-03 | success | rc=0 >>
localhost:/engine                   1.9T  3.0G  1.9T  1% /rhev/data-center/mnt/glusterSD/localhost:_engine
ovirt-node-01.nec-labs.com:/vmdata  3.7T   70M  3.7T  1% /rhev/data-center/mnt/glusterSD/ovirt-node-01.nec-labs.com:_vmdata

What causes this mount to occur, and is there a way to trigger the mount manually?

On Jan 4, 2016, at 10:47 PM, Will Dennis <wdennis@nec-labs.com> wrote:
I must admit I’m getting a bit weary of fighting oVirt problems at this point… Before I move on to deploying any VMs onto my new infra, I’d like to get the base infra working… I’m still experiencing a “Non Operational” problem on my “ovirt-node-02” host: http://s1096.photobucket.com/user/willdennis/media/ovirt-node-02_problem.png...

On 01/06/2016 07:45 AM, Will Dennis wrote:
Feel like I’m in a bit of an echo chamber here… Is anyone out there? ;) Or have I worn out the oVirt crew?
Anyhow, not sure if this is a cause, or an effect, but I noticed tonight that the data storage domain (which I’m using Gluster for in a hyperconverged way) is not mounted on the problem hypervisor host…
$ ansible istgroup-ovirt -f 1 -i prod -u root -m shell -a "df -h | grep ':'"
ovirt-node-01 | success | rc=0 >>
localhost:/engine                   1.9T  3.0G  1.9T  1% /rhev/data-center/mnt/glusterSD/localhost:_engine
ovirt-node-01.nec-labs.com:/vmdata  3.7T   70M  3.7T  1% /rhev/data-center/mnt/glusterSD/ovirt-node-01.nec-labs.com:_vmdata

ovirt-node-02 | success | rc=0 >>
localhost:/engine                   1.9T  3.0G  1.9T  1% /rhev/data-center/mnt/glusterSD/localhost:_engine

ovirt-node-03 | success | rc=0 >>
localhost:/engine                   1.9T  3.0G  1.9T  1% /rhev/data-center/mnt/glusterSD/localhost:_engine
ovirt-node-01.nec-labs.com:/vmdata  3.7T   70M  3.7T  1% /rhev/data-center/mnt/glusterSD/ovirt-node-01.nec-labs.com:_vmdata
What causes this mount to occur, and is there a way to trigger the mount manually?
Activating the host from maintenance mode should ensure that the storage domain is mounted on the host, AFAIK. The reason why the host is Non-operational is usually in the General sub-tab for the host. Were you able to trim the logs (empty log, and activate host) like Joop suggested?
On Jan 4, 2016, at 10:47 PM, Will Dennis <wdennis@nec-labs.com> wrote:
I must admit I’m getting a bit weary of fighting oVirt problems at this point… Before I move on to deploying any VMs onto my new infra, I’d like to get the base infra working…
I’m still experiencing a “Non Operational” problem on my “ovirt-node-02” host: http://s1096.photobucket.com/user/willdennis/media/ovirt-node-02_problem.png...
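For reference, a rough sketch of what mounting the Gluster data domain by hand would look like (server and volume names taken from the df output earlier in this thread); VDSM creates and manages this mount itself when the host is activated, so this is only useful as a diagnostic, not as a fix:

  # on the non-operational host, as root (diagnostic only)
  mkdir -p /rhev/data-center/mnt/glusterSD/ovirt-node-01.nec-labs.com:_vmdata
  mount -t glusterfs ovirt-node-01.nec-labs.com:/vmdata /rhev/data-center/mnt/glusterSD/ovirt-node-01.nec-labs.com:_vmdata
  # if this fails, the matching log under /var/log/glusterfs/ usually has the real error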

On Jan 6, 2016, at 1:39 AM, Sahina Bose <sabose@redhat.com> wrote:
The reason why the host is Non-operational is usually in the General sub-tab for the host.

Ah, did not know that… It does say at the bottom of that pane:
“Host failed to attach one of the Storage Domains attached to it.”
As previously reported to the list last evening, this is true - it has not mounted the data SD (which is a Gluster SD.)
Any way to troubleshoot why?

On 01/06/2016 06:42 PM, Will Dennis wrote:
On Jan 6, 2016, at 1:39 AM, Sahina Bose <sabose@redhat.com> wrote:
The reason why the host is Non-operational is usually in the General sub-tab for the host.
Ah, did not know that… It does say at the bottom of that pane:
“Host failed to attach one of the Storage Domains attached to it.”
As previously reported to the list last evening, this is true - it has not mounted the data SD (which is a Gluster SD.)
Any way to troubleshoot why?
The vdsm log from the non-operational host should have some information regarding this. Can you also check if there are errors in the gluster mount logs, at /var/log/glusterfs/rhev-data-center-mnt-glusterSD*? Also, it's worth checking that the glusterd ports are open on the gluster hosts (we had an issue where the HE install overrode the glusterd ports and the gluster volume was inaccessible).
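A quick sketch of those checks on the non-operational host; the port numbers below are the Gluster defaults (24007 for glusterd, 49152 and up for bricks), so adjust them if your setup uses different ones:

  # recent errors in the vdsm log
  grep -i error /var/log/vdsm/vdsm.log | tail -n 50
  # client-side gluster mount logs for the storage domains
  ls -l /var/log/glusterfs/rhev-data-center-mnt-glusterSD*
  # confirm which brick ports are actually in use, then check the firewall allows them
  gluster volume status
  iptables -L -n | grep -E '24007|49152'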

I actually had the opposite problem - I (not knowing any better) elected to say “Yes” (the default answer) to “iptables was detected on your computer, do you wish setup to configure it?”, which then put in the oVirt iptables rules, which assume the standard Gluster TCP ports… Since I am running hyperconverged and had followed the instructions found at http://www.ovirt.org/Features/Self_Hosted_Engine_Hyper_Converged_Gluster_Sup... (which ends up changing the Gluster ports), I experienced a fault with Gluster where it lost quorum and went read-only, since the firewall on the hosts was blocking Gluster communications...

On Jan 6, 2016, at 11:10 AM, Sahina Bose <sabose@redhat.com> wrote:
Also, it's worth checking that the glusterd ports are open on the gluster hosts (we had an issue where the HE install overrode the glusterd ports and the gluster volume was inaccessible).
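To make the port clash concrete, here is a hedged example of what the host firewall would need to allow for Gluster, assuming the stock port layout (24007-24008 for glusterd, 49152 and up for bricks); if the hyperconverged guide changed the brick base port in glusterd.vol, substitute that range instead:

  iptables -I INPUT -p tcp --dport 24007:24008 -j ACCEPT   # glusterd / management
  iptables -I INPUT -p tcp --dport 49152:49251 -j ACCEPT   # brick ports (default base; adjust if changed)
  service iptables save   # persist; assumes the iptables-services style setup that oVirt configures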

Hi Will, how are you? The log first points to certification issues:

2016-01-04 00:02:11,259 ERROR [org.ovirt.engine.core.vdsbroker.jsonrpc.JsonRpcVdsServer] (DefaultQuartzScheduler_Worker-81) [] Failed to get peer certification for host 'ovirt-node-02': SSL session is invalid
2016-01-04 00:02:11,259 ERROR [org.ovirt.engine.core.bll.CertificationValidityChecker] (DefaultQuartzScheduler_Worker-81) [] Failed to retrieve peer certifications for host 'ovirt-node-02'

So the first thing we should do is try to solve this problem. Please try to re-install the host. Thanks. Eliraz :)

----- Original Message ----- From: "Will Dennis" <wdennis@nec-labs.com> To: "Eliraz Levi" <elevi@redhat.com>, "users" <users@ovirt.org> Sent: Tuesday, 5 January, 2016 5:46:23 AM Subject: Re: [ovirt-users] host status "Non Operational" - how to diagnose & fix?
I must admit I’m getting a bit weary of fighting oVirt problems at this point… Before I move on to deploying any VMs onto my new infra, I’d like to get the base infra working… I’m still experiencing a “Non Operational” problem on my “ovirt-node-02” host: http://s1096.photobucket.com/user/willdennis/media/ovirt-node-02_problem.png... [...]
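Before a full host reinstall, one hedged way to sanity-check the certificate angle, assuming the default VDSM certificate location and port (this is only a diagnostic sketch, not what the engine itself does):

  # on ovirt-node-02: check the VDSM certificate's validity dates and subject
  openssl x509 -in /etc/pki/vdsm/certs/vdsmcert.pem -noout -dates -subject
  # from the engine VM: try to open an SSL session to the host's VDSM port (54321)
  openssl s_client -connect ovirt-node-02:54321 </dev/null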

Define “reinstall the host” - do you just mean 'yum remove ovirt* vdsm*' then 'yum install ovirt* vdsm*', or completely reinstall the OS, re-set-up Gluster, etc.?

On Jan 6, 2016, at 4:15 AM, Eliraz Levi <elevi@redhat.com> wrote:
Hi Will, how are you? The log first points to certification issues:
2016-01-04 00:02:11,259 ERROR [org.ovirt.engine.core.vdsbroker.jsonrpc.JsonRpcVdsServer] (DefaultQuartzScheduler_Worker-81) [] Failed to get peer certification for host 'ovirt-node-02': SSL session is invalid
2016-01-04 00:02:11,259 ERROR [org.ovirt.engine.core.bll.CertificationValidityChecker] (DefaultQuartzScheduler_Worker-81) [] Failed to retrieve peer certifications for host 'ovirt-node-02'
So the first thing we should do is try to solve this problem. Please try to re-install the host. Thanks. Eliraz :)

----- Original Message ----- From: "Will Dennis" <wdennis@nec-labs.com> To: "Eliraz Levi" <elevi@redhat.com>, "users" <users@ovirt.org> Sent: Tuesday, 5 January, 2016 5:46:23 AM Subject: Re: [ovirt-users] host status "Non Operational" - how to diagnose & fix?
I must admit I’m getting a bit weary of fighting oVirt problems at this point… Before I move on to deploying any VMs onto my new infra, I’d like to get the base infra working… [...]

Hi Will,

The engine relies on the status reported by VDSM for the management network 'ovirtmgmt' and for its underlying nics/vlans.

In order to see the configuration of the 'ovirtmgmt' network, please paste the output of the following command, executed on the host:
vdsClient -s 0 getVdsCaps

In addition, in order to see the reported status of the networks, run on the host and paste:
vdsClient -s 0 getVdsStats

That should give an indication of which nic is reported as down for ovirtmgmt by vdsm.

On Wed, Jan 6, 2016 at 11:15 AM, Eliraz Levi <elevi@redhat.com> wrote:
Hi Will, how are you? The log first points to certification issues:
2016-01-04 00:02:11,259 ERROR [org.ovirt.engine.core.vdsbroker.jsonrpc.JsonRpcVdsServer] (DefaultQuartzScheduler_Worker-81) [] Failed to get peer certification for host 'ovirt-node-02': SSL session is invalid
2016-01-04 00:02:11,259 ERROR [org.ovirt.engine.core.bll.CertificationValidityChecker] (DefaultQuartzScheduler_Worker-81) [] Failed to retrieve peer certifications for host 'ovirt-node-02'
So the first thing we should do is try to solve this problem. Please try to re-install the host. Thanks. Eliraz :)
----- Original Message ----- From: "Will Dennis" <wdennis@nec-labs.com> To: "Eliraz Levi" <elevi@redhat.com>, "users" <users@ovirt.org> Sent: Tuesday, 5 January, 2016 5:46:23 AM Subject: Re: [ovirt-users] host status "Non Operational" - how to diagnose & fix?
I must admit I’m getting a bit weary of fighting oVirt problems at this point… Before I move on to deploying any VMs onto my new infra, I’d like to get the base infra working…
I’m still experiencing a “Non Operational” problem on my “ovirt-node-02” host:
http://s1096.photobucket.com/user/willdennis/media/ovirt-node-02_problem.png...
I have pored thru the logs (all the engine logs, plus the syslogs from the engine VM and my three hypervisor/storage hosts) and I can’t pin down why the one node is having a problem… Of course, with how voluminous all these logs are, it’s kind of like looking for a needle in a haystack, and I’m not even sure what the needle looks like, or if it’s even a needle :-/
I have also rebooted this host in past days, this also did not fix the problem.
Note that on the screenshot I posted above, the webadmin hosts screen says that -node-01 has one VM running, and the others 0… You’d think that would be the HE VM running on there, but it’s actually on -node-02:
$ ansible istgroup-ovirt -f 1 -i prod -u root -m shell -a "hosted-engine --vm-status | grep -e '^Hostname' -e '^Engine'"
ovirt-node-01 | success | rc=0 >>
Hostname      : ovirt-node-01
Engine status : {"reason": "bad vm status", "health": "bad", "vm": "down", "detail": "down"}
Hostname      : ovirt-node-02
Engine status : {"health": "good", "vm": "up", "detail": "up"}
Hostname      : ovirt-node-03
Engine status : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}

ovirt-node-02 | success | rc=0 >>
Hostname      : ovirt-node-01
Engine status : {"reason": "bad vm status", "health": "bad", "vm": "down", "detail": "down"}
Hostname      : ovirt-node-02
Engine status : {"health": "good", "vm": "up", "detail": "up"}
Hostname      : ovirt-node-03
Engine status : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}

ovirt-node-03 | success | rc=0 >>
Hostname      : ovirt-node-01
Engine status : {"reason": "bad vm status", "health": "bad", "vm": "down", "detail": "down"}
Hostname      : ovirt-node-02
Engine status : {"health": "good", "vm": "up", "detail": "up"}
Hostname      : ovirt-node-03
Engine status : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
So it looks like the webadmin UI is wrong as well…
It would be awesome if the UI would give a reason for the “Non Operational” status somehow… Or if there was a troubleshooter that could be used to analyze the problem… As it is, being so new to all of this, I am completely at the list’s mercy to figure this out.
This software has such promise, so I’ll keep working thru these issues, but it sure hasn’t been a smooth ride so far…
On Jan 4, 2016, at 7:54 AM, Will Dennis <wdennis@nec-labs.com> wrote:
I put all of the engine logs up there now… Try engine.log-20160103.gz
http://i1096.photobucket.com/albums/g330/willdennis/ovirt-node-02_problem.png
-- Regards, Moti
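Since the getVdsCaps/getVdsStats output is long, a small sketch for pulling out just the network-related parts on the host (plain grep over the text output; the exact field names may vary by vdsm version):

  vdsClient -s 0 getVdsCaps  | grep -i -A 5 ovirtmgmt
  vdsClient -s 0 getVdsStats | grep -i -A 3 network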

On Jan 6, 2016, at 7:59 AM, Moti Asayag <masayag@redhat.com> wrote:
In order to see the configuration of 'ovirtmgmt' network please paste the output of the following command to be executed on the host: vdsClient -s 0 getVdsCaps

http://fpaste.org/307742/20853451/

In addition, in order to see the reported status of the networks run and paste on the host: vdsClient -s 0 getVdsStats

http://fpaste.org/307744/45208555/

I did what Joop suggested (put -node-02 into maint, clear vdsm.log on -node-02, clear engine.log on HE, then activate -node-02) and, what do you know, the node came up into an operational state! It was able to successfully mount the data SD this time:

$ ansible istgroup-ovirt -f 1 -i prod -u root -m shell -a "df -h | grep ':'"
ovirt-node-01 | success | rc=0 >>
localhost:/engine                   1.9T  3.0G  1.9T  1% /rhev/data-center/mnt/glusterSD/localhost:_engine
ovirt-node-01.nec-labs.com:/vmdata  3.7T   70M  3.7T  1% /rhev/data-center/mnt/glusterSD/ovirt-node-01.nec-labs.com:_vmdata

ovirt-node-02 | success | rc=0 >>
localhost:/engine                   1.9T  3.0G  1.9T  1% /rhev/data-center/mnt/glusterSD/localhost:_engine
ovirt-node-01.nec-labs.com:/vmdata  3.7T   70M  3.7T  1% /rhev/data-center/mnt/glusterSD/ovirt-node-01.nec-labs.com:_vmdata

ovirt-node-03 | success | rc=0 >>
localhost:/engine                   1.9T  3.0G  1.9T  1% /rhev/data-center/mnt/glusterSD/localhost:_engine
ovirt-node-01.nec-labs.com:/vmdata  3.7T   70M  3.7T  1% /rhev/data-center/mnt/glusterSD/ovirt-node-01.nec-labs.com:_vmdata

Wonder what the magic was? ;) I’ll take the result anyways :)

To follow up on this, after the migrations as a result of the troubleshooting, the webadmin UI of the hosts in my datacenter now has each host with “1” VM running… https://drive.google.com/file/d/0B88nnCy4LpFMYklDVDhFUV96Y00/view?usp=sharin...

However, the only VM that is running currently is the hosted engine, which is currently running on host “ovirt-node-03”:

$ ansible istgroup-ovirt -f 1 -i prod -u root -m shell -a "hosted-engine --vm-status | grep -e '^Hostname' -e '^Engine'"
ovirt-node-01 | success | rc=0 >>
Hostname      : ovirt-node-01
Engine status : {"reason": "bad vm status", "health": "bad", "vm": "down", "detail": "down"}
Hostname      : ovirt-node-02
Engine status : {"reason": "bad vm status", "health": "bad", "vm": "down", "detail": "down"}
Hostname      : ovirt-node-03
Engine status : {"health": "good", "vm": "up", "detail": "up"}

ovirt-node-02 | success | rc=0 >>
Hostname      : ovirt-node-01
Engine status : {"reason": "bad vm status", "health": "bad", "vm": "down", "detail": "down"}
Hostname      : ovirt-node-02
Engine status : {"reason": "bad vm status", "health": "bad", "vm": "down", "detail": "down"}
Hostname      : ovirt-node-03
Engine status : {"health": "good", "vm": "up", "detail": "up"}

ovirt-node-03 | success | rc=0 >>
Hostname      : ovirt-node-01
Engine status : {"reason": "bad vm status", "health": "bad", "vm": "down", "detail": "down"}
Hostname      : ovirt-node-02
Engine status : {"reason": "bad vm status", "health": "bad", "vm": "down", "detail": "down"}
Hostname      : ovirt-node-03
Engine status : {"health": "good", "vm": "up", "detail": "up"}

Is this a UI bug of some sort?

On Jan 4, 2016, at 10:47 PM, Will Dennis <wdennis@nec-labs.com> wrote:
Note that on the screenshot I posted above, the webadmin hosts screen says that -node-01 has one VM running, and the others 0… You’d think that would be the HE VM running on there, but it’s actually on -node-02: [...] So it looks like the webadmin UI is wrong as well…

On 06 Jan 2016, at 15:31, Will Dennis <wdennis@nec-labs.com> wrote:
To follow up on this, after the migrations as a result of the troubleshooting, the webadmin UI of the hosts in my datacenter now has each host with “1” VM running… https://drive.google.com/file/d/0B88nnCy4LpFMYklDVDhFUV96Y00/view?usp=sharin...
However, the only VM that is running currently is the hosted engine, which is currently running on host “ovirt-node-03”:
$ ansible istgroup-ovirt -f 1 -i prod -u root -m shell -a "hosted-engine --vm-status | grep -e '^Hostname' -e '^Engine'"
ovirt-node-01 | success | rc=0 >>
Hostname      : ovirt-node-01
Engine status : {"reason": "bad vm status", "health": "bad", "vm": "down", "detail": "down"}
Hostname      : ovirt-node-02
Engine status : {"reason": "bad vm status", "health": "bad", "vm": "down", "detail": "down"}
Hostname      : ovirt-node-03
Engine status : {"health": "good", "vm": "up", "detail": "up"}

ovirt-node-02 | success | rc=0 >>
Hostname      : ovirt-node-01
Engine status : {"reason": "bad vm status", "health": "bad", "vm": "down", "detail": "down"}
Hostname      : ovirt-node-02
Engine status : {"reason": "bad vm status", "health": "bad", "vm": "down", "detail": "down"}
Hostname      : ovirt-node-03
Engine status : {"health": "good", "vm": "up", "detail": "up"}

ovirt-node-03 | success | rc=0 >>
Hostname      : ovirt-node-01
Engine status : {"reason": "bad vm status", "health": "bad", "vm": "down", "detail": "down"}
Hostname      : ovirt-node-02
Engine status : {"reason": "bad vm status", "health": "bad", "vm": "down", "detail": "down"}
Hostname      : ovirt-node-03
Engine status : {"health": "good", "vm": "up", "detail": "up"}
Is this a UI bug of some sort?
Might be, but I would doubt it. It merely reflects what the hosts are reporting. Are there other VMs? Migrations going on?
On Jan 4, 2016, at 10:47 PM, Will Dennis <wdennis@nec-labs.com> wrote:
Note that on the screenshot I posted above, that the webadmin hosts screen says that -node-01 has one VM running, and the others 0… You’d think that would be the HE VM running on there, but it’s actually on -node-02:
$ ansible istgroup-ovirt -f 1 -i prod -u root -m shell -a "hosted-engine --vm-status | grep -e '^Hostname' -e '^Engine'"
ovirt-node-01 | success | rc=0 >>
Hostname      : ovirt-node-01
Engine status : {"reason": "bad vm status", "health": "bad", "vm": "down", "detail": "down"}
Hostname      : ovirt-node-02
Engine status : {"health": "good", "vm": "up", "detail": "up"}
Hostname      : ovirt-node-03
Engine status : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}

ovirt-node-02 | success | rc=0 >>
Hostname      : ovirt-node-01
Engine status : {"reason": "bad vm status", "health": "bad", "vm": "down", "detail": "down"}
Hostname      : ovirt-node-02
Engine status : {"health": "good", "vm": "up", "detail": "up"}
Hostname      : ovirt-node-03
Engine status : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}

ovirt-node-03 | success | rc=0 >>
Hostname      : ovirt-node-01
Engine status : {"reason": "bad vm status", "health": "bad", "vm": "down", "detail": "down"}
Hostname      : ovirt-node-02
Engine status : {"health": "good", "vm": "up", "detail": "up"}
Hostname      : ovirt-node-03
Engine status : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
So it looks like the webadmin UI is wrong as well…

No, it was definitely wrong at the time - there were no other VMs other than the hosted engine, and no migrations… However, when I was able to finally re-import/activate the hosted_storage SD, and the HE VM correctly showed up in the UI, then it corrected itself, and has remained correct since… As shown in my post, it also did not agree at the time with the output of “hosted-engine --vm-status”, but since the re-import of the SD, it now is correctly in sync with that as well.
On Jan 7, 2016, at 4:42 AM, Michal Skrivanek <michal.skrivanek@redhat.com> wrote:
Might be, but I would doubt it. It merely reflects what the hosts are reporting. Are there other VMs? Migrations going on?

On 07 Jan 2016, at 18:32, Will Dennis <wdennis@nec-labs.com> wrote:
No, it was definitely wrong at the time - there were no other VMs other than the hosted engine, and no migrations… However, when I was able to finally re-import/activate the hosted_storage SD, and the HE VM correctly showed up in the UI, then it corrected itself, and has remained correct since… As shown in my post, it also did not agree at the time with the output of “hosted-engine --vm-status”, but since the re-import of the SD, it now is correctly in sync with that as well.
Those reported values do rely on the status of the host, which in turn is derived from its storage health. So it might be “wrong” when hosts are not happy. You would need to correlate it with other events, though… there are various timings in play, so it’s a bit tricky to know for sure until/unless it is stuck in a wrong way for a long time. Thanks, michal
On Jan 7, 2016, at 4:42 AM, Michal Skrivanek <michal.skrivanek@redhat.com> wrote:
Might be, but I would doubt it. It merely reflects what the hosts are reporting. Are there other VMs? Migrations going on?
participants (6)
- Eliraz Levi
- Joop
- Michal Skrivanek
- Moti Asayag
- Sahina Bose
- Will Dennis