Hi,

I have issues with one host where supervdsm is failing in network_caps.

I see the following trace in the log.

MainProcess|jsonrpc/1::ERROR::2020-01-06 03:01:05,558::supervdsm_server::100::SuperVdsm.ServerCallback::(wrapper) Error in network_caps
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/supervdsm_server.py", line 98, in wrapper
    res = func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/network/api.py", line 56, in network_caps
    return netswitch.configurator.netcaps(compatibility=30600)
  File "/usr/lib/python2.7/site-packages/vdsm/network/netswitch/configurator.py", line 317, in netcaps
    net_caps = netinfo(compatibility=compatibility)
  File "/usr/lib/python2.7/site-packages/vdsm/network/netswitch/configurator.py", line 325, in netinfo
    _netinfo = netinfo_get(vdsmnets, compatibility)
  File "/usr/lib/python2.7/site-packages/vdsm/network/netinfo/cache.py", line 150, in get
    return _stringify_mtus(_get(vdsmnets))
  File "/usr/lib/python2.7/site-packages/vdsm/network/netinfo/cache.py", line 59, in _get
    ipaddrs = getIpAddrs()
  File "/usr/lib/python2.7/site-packages/vdsm/network/netinfo/addresses.py", line 72, in getIpAddrs
    for addr in nl_addr.iter_addrs():
  File "/usr/lib/python2.7/site-packages/vdsm/network/netlink/addr.py", line 33, in iter_addrs
    with _nl_addr_cache(sock) as addr_cache:
  File "/usr/lib64/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/usr/lib/python2.7/site-packages/vdsm/network/netlink/__init__.py", line 92, in _cache_manager
    cache = cache_allocator(sock)
  File "/usr/lib/python2.7/site-packages/vdsm/network/netlink/libnl.py", line 469, in rtnl_addr_alloc_cache
    raise IOError(-err, nl_geterror(err))
IOError: [Errno 16] Message sequence number mismatch
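
To see how the failures are spaced after a restart, the timestamps of these ERROR lines can be pulled out of the log. The following is only a sketch: the regular expression is inferred from the single ERROR line above, and the function name network_caps_error_times is made up for illustration.

```python
import re
from datetime import datetime

# Layout inferred from the one ERROR line quoted above (an assumption):
#   <process>|<thread>::ERROR::<date> <time>,<millis>::...Error in network_caps
ERROR_RE = re.compile(
    r"::ERROR::(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),(\d{3})::.*Error in network_caps"
)


def network_caps_error_times(lines):
    """Return a datetime for every 'Error in network_caps' ERROR line."""
    times = []
    for line in lines:
        match = ERROR_RE.search(line)
        if match:
            stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
            # The log prints milliseconds after the comma.
            times.append(stamp.replace(microsecond=int(match.group(2)) * 1000))
    return times
```

Feeding it the lines of the supervdsm log (for example via open(path) on wherever that log lives on the host) gives the list of failure times, which can then be compared with the time of the last supervdsm restart.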

Restarting supervdsm resolves the issue for a while, maybe 24 hours, and then it occurs again, so I suspect resource exhaustion or a leak of some kind.
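
One way to check the leak theory would be to sample the number of open file descriptors (and how many of them are sockets) held by the supervdsm process over time. This is a rough sketch, not anything from VDSM itself: count_fds and its proc_root parameter are hypothetical names, it assumes a Linux /proc filesystem, and it has to run as root to read another process's descriptors. It avoids anything Python 3 specific so that it should also run on the host's Python 2.7.

```python
import os


def count_fds(pid, proc_root="/proc"):
    """Return (total_fds, socket_fds) for a process, read from <proc_root>/<pid>/fd."""
    fd_dir = os.path.join(proc_root, str(pid), "fd")
    total = 0
    sockets = 0
    for name in os.listdir(fd_dir):
        total += 1
        try:
            # Each entry is a symlink; sockets point at "socket:[<inode>]".
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The descriptor was closed between listdir() and readlink().
            continue
        if target.startswith("socket:"):
            sockets += 1
    return total, sockets
```

Calling count_fds(pid) with the PID of the supervdsm process every few minutes and logging the result would show whether either number climbs steadily between a restart and the next failure. A flat count would point away from a descriptor leak.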

Running 4.2.8.2 with VDSM at 4.20.46.

I've looked through Bugzilla and can't find an exact match. The closest was https://bugzilla.redhat.com/show_bug.cgi?id=1666123, which seems to be an RHV-only fix.

Thanks,

Alan