On Thu, Aug 6, 2015 at 3:24 PM, Nicolas Ecarnot <nicolas@ecarnot.net> wrote:
Hi Vered,

Thanks for answering.

On 06/08/2015 11:08, Vered Volansky wrote:

But from time to time, a severe hiccup seems to appear, which I
have great difficulty diagnosing.
The messages in the web GUI are not very precise, and not consistent:
- some report a host having network issues, but I can ping it
from every place it needs to be reached from (especially from the SPM and the
manager)
Ping doesn't say much as the ssh protocol is the one being used.
Please try this and report.

Try what?
ssh instead of ping.
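
A minimal sketch of what "ssh instead of ping" could look like as a scripted check: it opens a TCP connection to the SSH port and reads the identification string the server sends, which ICMP ping cannot tell you anything about. The helper name `ssh_banner`, the default port 22 and the 5-second timeout are assumptions made for illustration; none of this is part of oVirt.

```python
import socket


def ssh_banner(host, port=22, timeout=5.0):
    """Return the SSH identification string sent by host:port, or None.

    None means the TCP connection failed, timed out, or the peer did not
    send a line starting with "SSH-" in its first reply.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            data = sock.recv(256)  # a single read is enough for a quick check
    except OSError:  # refused, unreachable, DNS failure, timeout
        return None
    # A server may send other lines before its identification string,
    # so scan every received line for the "SSH-" prefix.
    for raw in data.splitlines():
        line = raw.strip()
        if line.startswith(b"SSH-"):
            return line.decode("ascii", errors="replace")
    return None
```

Running `ssh_banner("serv-vm-al01")` from the manager and from the SPM host would show whether the SSH port answers from each of those places. It only proves that the port is reachable, not that authentication works, so an actual `ssh` login is still the stronger test.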

Please attach logs (engine+vdsm). Log snippets would be helpful (but more important are full logs).

I guess what will be most useful is to provide logs from at or around the precise moment s**t is hitting the fan.
But this is very difficult to forecast:
There are times when I'm trying hard to break it (see the dumb user tests previously described) and oVirt copes well with these situations.
Conversely, there are times when not a single VM is running, and I see the DC appear as non-operational for a few minutes.
So I'll send logs the next time I see such a situation.
You can send logs and just point us to the time your problems occurred. They are rotated, so unless you removed them they should be available to you at any time. Just make sure they have the time in question and we'll dig in.
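
If it helps to narrow things down before sending, here is a rough sketch for pulling out only the lines around the time in question. It assumes each log entry contains a timestamp of the form `YYYY-MM-DD HH:MM:SS` somewhere in the line; the function name `lines_around` and the 5-minute default window are made up for this example. Full logs remain more useful than snippets.

```python
import re
from datetime import datetime, timedelta

# Assumed timestamp shape; it may appear anywhere in the line.
TS = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")


def lines_around(lines, when, minutes=5):
    """Yield the lines whose timestamp is within `minutes` of `when`.

    Lines without a timestamp (for example stack trace continuations)
    follow the decision made for the last timestamped line before them.
    """
    window = timedelta(minutes=minutes)
    keep = False
    for line in lines:
        match = TS.search(line)
        if match:
            stamp = datetime.strptime(match.group(0), "%Y-%m-%d %H:%M:%S")
            keep = abs(stamp - when) <= window
        if keep:
            yield line
```

For instance, passing an open log file and `datetime(2015, 8, 6, 11, 8)` would keep the entries logged between 11:03 and 11:13 on that day.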


In general it smells like an ssh/firewall issue.

On this test setup, I disabled the firewall on my hosts.
And you're right, it appears I forgot to disable it on one of the three hosts.
On the one I forgot, a brief look at the iptables rules showed them to be consistent with what I'm used to seeing as managed by oVirt, nothing weird. Anyway, it is now completely disabled.
Good :)


"On host serv-vm-al01, Error: Network error during communication with
the Host"

This host had no firewall activated...

--
Nicolas ECARNOT