On 22-03-2019 12:04, Arif Ali wrote:

On 21-03-2019 17:47, Simone Tiraboschi wrote:

 

On Thu, Mar 21, 2019 at 3:47 PM Arif Ali <mail@arif-ali.co.uk> wrote:
Hi all,

Recently deployed oVirt version 4.3.1

It's in a self-hosted engine environment

I used the steps via Cockpit to install the engine, and was able to add
the rest of the oVirt nodes without any specific problems

We tested the HA of the hosted engine without a problem, and then at one
point turned off the machine that was hosting the engine, to mimic a
failure and see how it goes; the VM was able to move over successfully,
but some of the oVirt hosts started to go into Unassigned. From a total
of 6 oVirt hosts, I have 4 of them in this state.

Clicking on a host, I see the following message in the events. I can
reach the hosts from the engine and ping them, so I'm not sure why they
are no longer working:

VDSM <snip> command Get Host Capabilities failed: Message timeout which
can be caused by communication issues

Mind you, I have been trying to resolve this issue since Monday, and
have tried various things, like rebooting and re-installing the oVirt
hosts, without having much luck

So any assistance on this would be appreciated; maybe I've missed
something really simple and am overlooking it
 
Can you please check that VDSM is correctly running on those nodes?
Are you able to correctly reach those nodes from the engine VM?
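
For example, something like this minimal sketch would verify both checks
(it is only a sketch, not an official oVirt tool; node1.example.com is a
placeholder for one of your node hostnames):

# Sketch: check that the vdsmd service is active on a node and that the
# VDSM management port (54321/tcp, the port the engine talks to) is
# reachable from the engine VM. "node1.example.com" is a placeholder.
import socket
import subprocess

NODE = "node1.example.com"

# Run this part on the node itself: exit code 0 means vdsmd is active.
rc = subprocess.call(["systemctl", "is-active", "--quiet", "vdsmd"])
print("vdsmd active" if rc == 0 else "vdsmd NOT active (rc=%d)" % rc)

# Run this part from the engine VM: try to open a TCP connection to VDSM.
try:
    sock = socket.create_connection((NODE, 54321), timeout=5)
    sock.close()
    print("TCP 54321 reachable on %s" % NODE)
except (socket.error, socket.timeout) as exc:
    print("cannot reach %s on 54321: %s" % (NODE, exc))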
 

So, I have gone back and re-installed the whole solution, this time with 4.3.2, and I again have the same issue

Checking the vdsm logs, I see the issue below. The host is stuck in either Unassigned or Connecting; I don't have the option to Activate it or put it into Maintenance mode. I have tried rebooting the node with no luck

Mar 22 10:53:27 scvirt02 vdsm[32481]: WARN Worker blocked: <Worker name=periodic/2 running <Task <Operation action=<vdsm.virt.sampling.HostMonitor object at 0x7efed4180610> at 0x7efed4180650> timeout=15, duration=30.00 at 0x7efed4180810> task#=2 at 0x7efef41987d0>, traceback:
  File: "/usr/lib64/python2.7/threading.py", line 785, in __bootstrap
    self.__bootstrap_inner()
  File: "/usr/lib64/python2.7/threading.py", line 812, in __bootstrap_inner
    self.run()
  File: "/usr/lib64/python2.7/threading.py", line 765, in run
    self.__target(*self.__args, **self.__kwargs)
  File: "/usr/lib/python2.7/site-packages/vdsm/common/concurrent.py", line 195, in run
    ret = func(*args, **kwargs)
  File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 301, in _run
    self._execute_task()
  File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 315, in _execute_task
    task()
  File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 391, in __call__
    self._callable()
  File: "/usr/lib/python2.7/site-packages/vdsm/virt/periodic.py", line 186, in __call__
    self._func()
  File: "/usr/lib/python2.7/site-packages/vdsm/virt/sampling.py", line 481, in __call__
    stats = hostapi.get_stats(self._cif, self._samples.stats())
  File: "/usr/lib/python2.7/site-packages/vdsm/host/api.py", line 79, in get_stats
    ret['haStats'] = _getHaInfo()
  File: "/usr/lib/python2.7/site-packages/vdsm/host/api.py", line 177, in _getHaInfo
    stats = instance.get_all_stats()
  File: "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 94, in get_all_stats
    stats = broker.get_stats_from_storage()
  File: "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 143, in get_stats_from_storage
    result = self._proxy.get_stats()
  File: "/usr/lib64/python2.7/xmlrpclib.py", line 1233, in __call__
    return self.__send(self.__name, args)
  File: "/usr/lib64/python2.7/xmlrpclib.py", line 1591, in __request
    verbose=self.__verbose
  File: "/usr/lib64/python2.7/xmlrpclib.py", line 1273, in request
    return self.single_request(host, handler, request_body, verbose)
  File: "/usr/lib64/python2.7/xmlrpclib.py", line 1303, in single_request
    response = h.getresponse(buffering=True)
  File: "/usr/lib64/python2.7/httplib.py", line 1113, in getresponse
    response.begin()
  File: "/usr/lib64/python2.7/httplib.py", line 444, in begin
    version, status, reason = self._read_status()
  File: "/usr/lib64/python2.7/httplib.py", line 400, in _read_status
    line = self.fp.readline(_MAXLINE + 1)
  File: "/usr/lib64/python2.7/socket.py", line 476, in readline
    data = self._sock.recv(self._rbufsize)
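
For reference, the call that the worker is blocking on can be reproduced
outside of vdsm with a small sketch like the one below, run directly on an
affected node. It goes through the same ovirt_hosted_engine_ha client path
as the traceback above (HAClient is assumed to be the public entry point
of that client module); if it hangs or fails too, the problem is between
the node and its local ovirt-ha-broker / the hosted-engine storage rather
than in the engine itself:

# Sketch: query hosted-engine HA stats the same way vdsm's host/api.py
# does (instance.get_all_stats() in the traceback above).
from ovirt_hosted_engine_ha.client import client

ha_client = client.HAClient()  # assumed public entry point of client.py
try:
    stats = ha_client.get_all_stats()
    for host_id, host_stats in stats.items():
        print("host %s: %s" % (host_id, host_stats))
except Exception as exc:
    # A hang or an error here points at the broker/storage side.
    print("broker query failed: %s" % exc)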

On the engine host, I also continuously get the following messages:

Mar 22 11:02:32 <snip> ovsdb-server[4724]: ovs|01900|jsonrpc|WARN|Dropped 3 log messages in last 14 seconds (most recently, 7 seconds ago) due to excessive rate
Mar 22 11:02:32 <snip> ovsdb-server[4724]: ovs|01901|jsonrpc|WARN|ssl:[::ffff:192.168.203.205]:55658: send error: Protocol error
Mar 22 11:02:32 <snip> ovsdb-server[4724]: ovs|01902|reconnect|WARN|ssl:[::ffff:192.168.203.205]:55658: connection dropped (Protocol error)
Mar 22 11:02:34 <snip> ovsdb-server[4724]: ovs|01903|stream_ssl|WARN|SSL_accept: unexpected SSL connection close
Mar 22 11:02:34 <snip> ovsdb-server[4724]: ovs|01904|reconnect|WARN|ssl:[::ffff:192.168.203.202]:49504: connection dropped (Protocol error)
Mar 22 11:02:40 <snip> ovsdb-server[4724]: ovs|01905|stream_ssl|WARN|SSL_accept: unexpected SSL connection close
Mar 22 11:02:40 <snip> ovsdb-server[4724]: ovs|01906|jsonrpc|WARN|Dropped 1 log messages in last 5 seconds (most recently, 5 seconds ago) due to excessive rate
Mar 22 11:02:40 <snip> ovsdb-server[4724]: ovs|01907|jsonrpc|WARN|ssl:[::ffff:192.168.203.203]:34114: send error: Protocol error
Mar 22 11:02:40 <snip> ovsdb-server[4724]: ovs|01908|reconnect|WARN|ssl:[::ffff:192.168.203.203]:34114: connection dropped (Protocol error)
Mar 22 11:02:41 <snip> ovsdb-server[4724]: ovs|01909|reconnect|WARN|ssl:[::ffff:192.168.203.204]:52034: connection dropped (Protocol error)
Mar 22 11:02:48 <snip> ovsdb-server[4724]: ovs|01910|stream_ssl|WARN|Dropped 1 log messages in last 7 seconds (most recently, 7 seconds ago) due to excessive rate
Mar 22 11:02:48 <snip> ovsdb-server[4724]: ovs|01911|stream_ssl|WARN|SSL_accept: unexpected SSL connection close
Mar 22 11:02:48 <snip> ovsdb-server[4724]: ovs|01912|reconnect|WARN|ssl:[::ffff:192.168.203.205]:55660: connection dropped (Protocol error)
Mar 22 11:02:50 <snip> ovsdb-server[4724]: ovs|01913|stream_ssl|WARN|SSL_accept: unexpected SSL connection close
Mar 22 11:02:50 <snip> ovsdb-server[4724]: ovs|01914|jsonrpc|WARN|Dropped 2 log messages in last 9 seconds (most recently, 2 seconds ago) due to excessive rate
Mar 22 11:02:50 <snip> ovsdb-server[4724]: ovs|01915|jsonrpc|WARN|ssl:[::ffff:192.168.203.202]:49506: send error: Protocol error
Mar 22 11:02:50 <snip> ovsdb-server[4724]: ovs|01916|reconnect|WARN|ssl:[::ffff:192.168.203.202]:49506: connection dropped (Protocol error)
Mar 22 11:02:56 <snip> ovsdb-server[4724]: ovs|01917|stream_ssl|WARN|SSL_accept: unexpected SSL connection close
Mar 22 11:02:56 <snip> ovsdb-server[4724]: ovs|01918|reconnect|WARN|ssl:[::ffff:192.168.203.203]:34116: connection dropped (Protocol error)
Mar 22 11:02:57 <snip> ovsdb-server[4724]: ovs|01919|reconnect|WARN|ssl:[::ffff:192.168.203.204]:52036: connection dropped (Protocol error)
Mar 22 11:03:04 <snip> ovsdb-server[4724]: ovs|01920|stream_ssl|WARN|Dropped 1 log messages in last 7 seconds (most recently, 7 seconds ago) due to excessive rate
Mar 22 11:03:04 <snip> ovsdb-server[4724]: ovs|01921|stream_ssl|WARN|SSL_accept: unexpected SSL connection close
Mar 22 11:03:04 <snip> ovsdb-server[4724]: ovs|01922|jsonrpc|WARN|Dropped 2 log messages in last 9 seconds (most recently, 7 seconds ago) due to excessive rate
Mar 22 11:03:04 <snip> ovsdb-server[4724]: ovs|01923|jsonrpc|WARN|ssl:[::ffff:192.168.203.205]:55662: send error: Protocol error
Mar 22 11:03:04 <snip> ovsdb-server[4724]: ovs|01924|reconnect|WARN|ssl:[::ffff:192.168.203.205]:55662: connection dropped (Protocol error)
Mar 22 11:03:06 <snip> ovsdb-server[4724]: ovs|01925|reconnect|WARN|ssl:[::ffff:192.168.203.202]:49508: connection dropped (Protocol error)
Mar 22 11:03:12 <snip> ovsdb-server[4724]: ovs|01926|stream_ssl|WARN|Dropped 1 log messages in last 5 seconds (most recently, 5 seconds ago) due to excessive rate
Mar 22 11:03:12 <snip> ovsdb-server[4724]: ovs|01927|stream_ssl|WARN|SSL_accept: unexpected SSL connection close
Mar 22 11:03:12 <snip> ovsdb-server[4724]: ovs|01928|reconnect|WARN|ssl:[::ffff:192.168.203.203]:34118: connection dropped (Protocol error)
Mar 22 11:03:13 <snip> ovsdb-server[4724]: ovs|01929|reconnect|WARN|ssl:[::ffff:192.168.203.204]:52038: connection dropped (Protocol error)

 

I found my issue and managed to resolve it; there was nothing wrong with oVirt

The ovirtmgmt network is 10G, and by default I set the MTU to 9000, as I normally would for this type of network. I found out later that the network team at this site does not support 9000, so I went back to 1500 and everything worked without a problem
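
For anyone hitting something similar, a quick way to test whether the path
actually carries jumbo frames is to send non-fragmentable ICMP packets of
the matching size. A rough sketch (peer.example.com is a placeholder for
another host on the ovirtmgmt network, not a name from this thread):

# Sketch: check whether the ovirtmgmt path carries 9000-byte frames.
# 8972 = 9000-byte MTU minus 20 bytes IPv4 header and 8 bytes ICMP header.
import subprocess

PEER = "peer.example.com"

# Linux ping: -M do forbids fragmentation, -s sets the ICMP payload size.
rc = subprocess.call(["ping", "-M", "do", "-c", "3", "-s", "8972", PEER])
if rc == 0:
    print("jumbo frames (MTU 9000) work end to end to %s" % PEER)
else:
    print("9000-byte frames do not get through to %s; stay at MTU 1500" % PEER)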

Thanks to everyone for their assistance

--
regards,

Arif Ali