Hi,
I sent this a while back and never got a response. We've since upgraded to 4.3 and the issue persists.
2021-03-24 10:53:48,934+0000 ERROR (periodic/2) [virt.periodic.Operation] <vdsm.virt.sampling.HostMonitor object at 0x7f5964398350> operation failed (periodic:188)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/periodic.py", line 186, in __call__
    self._func()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/sampling.py", line 481, in __call__
    stats = hostapi.get_stats(self._cif, self._samples.stats())
  File "/usr/lib/python2.7/site-packages/vdsm/host/api.py", line 50, in get_stats
    decStats = stats.produce(first_sample, last_sample)
  File "/usr/lib/python2.7/site-packages/vdsm/host/stats.py", line 72, in produce
    stats.update(get_interfaces_stats())
  File "/usr/lib/python2.7/site-packages/vdsm/host/stats.py", line 154, in get_interfaces_stats
    return net_api.network_stats()
  File "/usr/lib/python2.7/site-packages/vdsm/network/api.py", line 63, in network_stats
    return netstats.report()
  File "/usr/lib/python2.7/site-packages/vdsm/network/netstats.py", line 32, in report
    stats = link_stats.report()
  File "/usr/lib/python2.7/site-packages/vdsm/network/link/stats.py", line 34, in report
    for iface_properties in iface.list():
  File "/usr/lib/python2.7/site-packages/vdsm/network/link/iface.py", line 257, in list
    for properties in itertools.chain(link.iter_links(), dpdk_links):
  File "/usr/lib/python2.7/site-packages/vdsm/network/netlink/link.py", line 47, in iter_links
    with _nl_link_cache(sock) as cache:
  File "/usr/lib64/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/usr/lib/python2.7/site-packages/vdsm/network/netlink/__init__.py", line 108, in _cache_manager
    cache = cache_allocator(sock)
  File "/usr/lib/python2.7/site-packages/vdsm/network/netlink/link.py", line 157, in _rtnl_link_alloc_cache
    return libnl.rtnl_link_alloc_cache(socket, AF_UNSPEC)
  File "/usr/lib/python2.7/site-packages/vdsm/network/netlink/libnl.py", line 578, in rtnl_link_alloc_cache
    raise IOError(-err, nl_geterror(err))
IOError: [Errno 16] Message sequence number mismatch
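For context, the check that's failing is libnl matching each netlink reply against the sequence number of the request that produced it. Here's a minimal standalone sketch of that handshake to show what errno 16 means (my own illustration, not vdsm code; constants copied from <linux/netlink.h> and <linux/rtnetlink.h>):

import socket
import struct

# Constants from <linux/netlink.h> / <linux/rtnetlink.h>
NETLINK_ROUTE = 0
RTM_GETLINK = 18
NLM_F_REQUEST = 0x1
NLM_F_DUMP = 0x300  # NLM_F_ROOT | NLM_F_MATCH

sock = socket.socket(socket.AF_NETLINK, socket.SOCK_RAW, NETLINK_ROUTE)
sock.bind((0, 0))  # let the kernel assign the port id

seq = 1
# struct nlmsghdr (len, type, flags, seq, pid) + struct ifinfomsg (16 bytes)
msg = struct.pack("=LHHLL", 32, RTM_GETLINK, NLM_F_REQUEST | NLM_F_DUMP, seq, 0)
msg += struct.pack("=BxHiII", socket.AF_UNSPEC, 0, 0, 0, 0)
sock.send(msg)

reply = sock.recv(65536)
_, _, _, reply_seq, _ = struct.unpack_from("=LHHLL", reply)
# libnl performs this same check internally when filling the link cache.
# A stale reply still queued on the socket from an earlier, interrupted
# dump carries the wrong seq, which libnl turns into NLE_SEQ_MISMATCH --
# surfacing in vdsm as the IOError with errno 16 above.
assert reply_seq == seq, "Message sequence number mismatch"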
This occurs on both nodes in the cluster. A restart of vdsm/supervdsm will sort it for a while, but within 24 hours it occurs again. We run a number of clusters and it only occurs on this one, so we must be triggering some specific corner case.
I can find almost no information on this. The best I could find was this: https://linuxlizard.com/2020/10/18/message-sequence-number-mismatch-in-li... which details a sequence number issue. I'm guessing I'm hitting the same problem: the netlink sequence numbers are getting out of sync, and closing/re-opening the netlink socket (i.e. restarting vdsm) is the only way to resolve it.
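As a stopgap I'm considering a crude watchdog that bounces vdsmd whenever the error starts appearing, rather than waiting for it to be noticed. Rough sketch only; the log path, error string and systemd unit name are assumptions about a stock oVirt node:

import subprocess
import time

LOG = "/var/log/vdsm/vdsm.log"
ERR = "Message sequence number mismatch"

def log_has_error():
    # Only inspect the recent tail so a single old occurrence
    # doesn't trigger endless restarts.
    tail = subprocess.check_output(["tail", "-n", "200", LOG])
    return ERR in tail.decode("utf-8", "replace")

while True:
    if log_has_error():
        subprocess.check_call(["systemctl", "restart", "vdsmd"])
        time.sleep(600)  # let vdsm settle and the log roll past the hits
    time.sleep(60)

Obviously that's treating the symptom, not the cause, which is why I'd still like to understand what desynchronises the socket in the first place.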
I've completely hit a brick wall with it. We've had to disable fencing on both nodes, as sometimes they get erroneously fenced when vdsm stops functioning correctly. At this point I'm thinking about replacing the servers with different models, in case it's something in the NIC drivers...
Alan
---- On Mon, 06 Jan 2020 10:54:52 +0000 Alan G <alan+ovirt@griff.me.uk> wrote ----
Hi,
I have issues with one host where supervdsm is failing in network_caps.
I see the following trace in the log.
MainProcess|jsonrpc/1::ERROR::2020-01-06 03:01:05,558::supervdsm_server::100::SuperVdsm.ServerCallback::(wrapper) Error in network_caps
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/supervdsm_server.py", line 98, in wrapper
    res = func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/network/api.py", line 56, in network_caps
    return netswitch.configurator.netcaps(compatibility=30600)
  File "/usr/lib/python2.7/site-packages/vdsm/network/netswitch/configurator.py", line 317, in netcaps
    net_caps = netinfo(compatibility=compatibility)
  File "/usr/lib/python2.7/site-packages/vdsm/network/netswitch/configurator.py", line 325, in netinfo
    _netinfo = netinfo_get(vdsmnets, compatibility)
  File "/usr/lib/python2.7/site-packages/vdsm/network/netinfo/cache.py", line 150, in get
    return _stringify_mtus(_get(vdsmnets))
  File "/usr/lib/python2.7/site-packages/vdsm/network/netinfo/cache.py", line 59, in _get
    ipaddrs = getIpAddrs()
  File "/usr/lib/python2.7/site-packages/vdsm/network/netinfo/addresses.py", line 72, in getIpAddrs
    for addr in nl_addr.iter_addrs():
  File "/usr/lib/python2.7/site-packages/vdsm/network/netlink/addr.py", line 33, in iter_addrs
    with _nl_addr_cache(sock) as addr_cache:
  File "/usr/lib64/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/usr/lib/python2.7/site-packages/vdsm/network/netlink/__init__.py", line 92, in _cache_manager
    cache = cache_allocator(sock)
  File "/usr/lib/python2.7/site-packages/vdsm/network/netlink/libnl.py", line 469, in rtnl_addr_alloc_cache
    raise IOError(-err, nl_geterror(err))
IOError: [Errno 16] Message sequence number mismatch
A restart of supervdsm will resolve the issue for a period, maybe 24 hours, then it occurs again. So I'm thinking it's resource exhaustion or a leak of some kind?
We're running 4.2.8.2 with VDSM at 4.20.46.
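To test the leak theory I was going to sample supervdsm's open descriptors over time and see whether the socket count climbs between restarts. Something like the below (the process name and /proc layout are assumptions based on our EL7 hosts):

import os
import subprocess

# Resolve the supervdsm daemon's pid, then walk its fd table in /proc.
pid = subprocess.check_output(["pidof", "supervdsmd"]).decode().split()[0]
fd_dir = "/proc/%s/fd" % pid
links = [os.readlink(os.path.join(fd_dir, fd)) for fd in os.listdir(fd_dir)]
# Socket fds show up as "socket:[inode]" symlinks.
sockets = [l for l in links if l.startswith("socket:")]
print("open fds: %d, of which sockets: %d" % (len(links), len(sockets)))

Run from cron every few minutes, a steadily growing count would point at a leak; a flat count would suggest the socket is being corrupted in place instead.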
I've had a look through Bugzilla and can't find an exact match; the closest was this one, https://bugzilla.redhat.com/show_bug.cgi?id=1666123 which seems to be an RHV-only fix.
Thanks,
Alan