supervdsm failing during network_caps

Hi,

I have issues with one host where supervdsm is failing in network_caps. I see the following trace in the log:

MainProcess|jsonrpc/1::ERROR::2020-01-06 03:01:05,558::supervdsm_server::100::SuperVdsm.ServerCallback::(wrapper) Error in network_caps
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/supervdsm_server.py", line 98, in wrapper
    res = func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/network/api.py", line 56, in network_caps
    return netswitch.configurator.netcaps(compatibility=30600)
  File "/usr/lib/python2.7/site-packages/vdsm/network/netswitch/configurator.py", line 317, in netcaps
    net_caps = netinfo(compatibility=compatibility)
  File "/usr/lib/python2.7/site-packages/vdsm/network/netswitch/configurator.py", line 325, in netinfo
    _netinfo = netinfo_get(vdsmnets, compatibility)
  File "/usr/lib/python2.7/site-packages/vdsm/network/netinfo/cache.py", line 150, in get
    return _stringify_mtus(_get(vdsmnets))
  File "/usr/lib/python2.7/site-packages/vdsm/network/netinfo/cache.py", line 59, in _get
    ipaddrs = getIpAddrs()
  File "/usr/lib/python2.7/site-packages/vdsm/network/netinfo/addresses.py", line 72, in getIpAddrs
    for addr in nl_addr.iter_addrs():
  File "/usr/lib/python2.7/site-packages/vdsm/network/netlink/addr.py", line 33, in iter_addrs
    with _nl_addr_cache(sock) as addr_cache:
  File "/usr/lib64/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/usr/lib/python2.7/site-packages/vdsm/network/netlink/__init__.py", line 92, in _cache_manager
    cache = cache_allocator(sock)
  File "/usr/lib/python2.7/site-packages/vdsm/network/netlink/libnl.py", line 469, in rtnl_addr_alloc_cache
    raise IOError(-err, nl_geterror(err))
IOError: [Errno 16] Message sequence number mismatch

A restart of supervdsm will resolve the issue for a period, maybe 24 hours, then it will occur again. So I'm thinking it's resource exhaustion or a leak of some kind?

Running 4.2.8.2 with VDSM at 4.20.46.

I've had a look through Bugzilla and can't find an exact match; the closest was https://bugzilla.redhat.com/show_bug.cgi?id=1666123, which seems to be an RHV-only fix.

Thanks,

Alan
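P.S. For context, the shape of the failing path in the traceback is a reused netlink socket with a per-call cache allocated over it. A rough paraphrase of the pattern (my reading of the frames above, not the actual vdsm code):

    from contextlib import contextmanager

    @contextmanager
    def _cache_manager(cache_allocator, sock):
        # cache_allocator is e.g. rtnl_addr_alloc_cache(): it sends a dump
        # request on `sock` and reads the replies. This is the point where
        # libnl raises IOError(16, 'Message sequence number mismatch') --
        # and since `sock` is reused across calls, once it is out of sync
        # every subsequent call through it fails the same way.
        cache = cache_allocator(sock)
        try:
            yield cache
        finally:
            cache.free()  # placeholder for the real libnl cache cleanup

That would at least explain why a supervdsm restart (i.e. a fresh socket) clears it for a while.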

Hi,

I sent this a while back and never got a response. We've since upgraded to 4.3 and the issue persists.

2021-03-24 10:53:48,934+0000 ERROR (periodic/2) [virt.periodic.Operation] <vdsm.virt.sampling.HostMonitor object at 0x7f5964398350> operation failed (periodic:188)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/periodic.py", line 186, in __call__
    self._func()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/sampling.py", line 481, in __call__
    stats = hostapi.get_stats(self._cif, self._samples.stats())
  File "/usr/lib/python2.7/site-packages/vdsm/host/api.py", line 50, in get_stats
    decStats = stats.produce(first_sample, last_sample)
  File "/usr/lib/python2.7/site-packages/vdsm/host/stats.py", line 72, in produce
    stats.update(get_interfaces_stats())
  File "/usr/lib/python2.7/site-packages/vdsm/host/stats.py", line 154, in get_interfaces_stats
    return net_api.network_stats()
  File "/usr/lib/python2.7/site-packages/vdsm/network/api.py", line 63, in network_stats
    return netstats.report()
  File "/usr/lib/python2.7/site-packages/vdsm/network/netstats.py", line 32, in report
    stats = link_stats.report()
  File "/usr/lib/python2.7/site-packages/vdsm/network/link/stats.py", line 34, in report
    for iface_properties in iface.list():
  File "/usr/lib/python2.7/site-packages/vdsm/network/link/iface.py", line 257, in list
    for properties in itertools.chain(link.iter_links(), dpdk_links):
  File "/usr/lib/python2.7/site-packages/vdsm/network/netlink/link.py", line 47, in iter_links
    with _nl_link_cache(sock) as cache:
  File "/usr/lib64/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/usr/lib/python2.7/site-packages/vdsm/network/netlink/__init__.py", line 108, in _cache_manager
    cache = cache_allocator(sock)
  File "/usr/lib/python2.7/site-packages/vdsm/network/netlink/link.py", line 157, in _rtnl_link_alloc_cache
    return libnl.rtnl_link_alloc_cache(socket, AF_UNSPEC)
  File "/usr/lib/python2.7/site-packages/vdsm/network/netlink/libnl.py", line 578, in rtnl_link_alloc_cache
    raise IOError(-err, nl_geterror(err))
IOError: [Errno 16] Message sequence number mismatch

This occurs on both nodes in the cluster. A restart of vdsm/supervdsm will sort it for a while, but within 24 hours it occurs again. We run a number of clusters and it only occurs on one, so it must be some specific corner case we're triggering.

I can find almost no information on this. The best I could find was https://linuxlizard.com/2020/10/18/message-sequence-number-mismatch-in-libnl... which details a sequence number issue. I'm guessing I'm experiencing the same thing: the netlink sequence numbers are getting out of sync, and closing/re-opening the netlink socket (i.e. restarting vdsm) is the only way to resolve it.

I've completely hit a brick wall with it. We've had to disable fencing on both nodes, as sometimes they get erroneously fenced when vdsm stops functioning correctly. At this point I'm thinking about replacing the servers with different models, in case it's something in the NIC drivers...

Alan
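P.S. A minimal illustration of the mechanism I think we're hitting, using a raw AF_NETLINK socket with no libnl involved (purely illustrative, not vdsm code). The kernel echoes back the sequence number of the request it is answering; if an earlier request's replies are never drained (say, after an error mid-dump), the next caller reads a stale reply whose sequence number does not match, which is what libnl reports as error 16:

    import socket
    import struct

    NLMSGHDR = 'IHHII'   # nlmsg_len, nlmsg_type, nlmsg_flags, nlmsg_seq, nlmsg_pid
    RTM_GETLINK = 18     # from linux/rtnetlink.h
    NLM_F_REQUEST, NLM_F_DUMP = 0x01, 0x300

    sock = socket.socket(socket.AF_NETLINK, socket.SOCK_RAW, socket.NETLINK_ROUTE)
    sock.bind((0, 0))

    def send_link_dump(seq):
        # 16-byte ifinfomsg payload (family, type, index, flags, change), zeroed
        payload = struct.pack('BxHiII', socket.AF_UNSPEC, 0, 0, 0, 0)
        header = struct.pack(NLMSGHDR, 16 + len(payload), RTM_GETLINK,
                             NLM_F_REQUEST | NLM_F_DUMP, seq, 0)
        sock.send(header + payload)

    send_link_dump(1)   # pretend this request's replies get abandoned
    send_link_dump(2)   # next request, sent without draining the first
    reply_seq = struct.unpack(NLMSGHDR, sock.recv(65536)[:16])[3]
    print('sent seq=2, reply carries seq=%d' % reply_seq)   # prints 1

A strict caller like libnl refuses the stale reply and keeps failing until the socket is closed and reopened -- which, for vdsm's internal socket, means restarting the daemon.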

Looking back in the logs, the first error we actually get is "Out of memory", so it seems we're hitting https://bugzilla.redhat.com/show_bug.cgi?id=1623851

It's not clear from the ticket: is there an explicit fix for this in 4.4, or did the problem just kind of go away?
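In case it's useful to anyone else checking for the same pattern, this is just the first ERROR line in supervdsm.log after a restart (default log path assumed; adjust for your setup):

    # Print the first ERROR after a supervdsm restart; on our hosts it is
    # the "Out of memory" failure, and the sequence-number mismatches only
    # start after it.
    with open('/var/log/vdsm/supervdsm.log') as log:
        for line in log:
            if '::ERROR::' in line:
                print(line.rstrip())
                break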

On Wed, Mar 24, 2021 at 1:24 PM Alan G <alan+ovirt@griff.me.uk> wrote:
Looking back in the logs, the first error we actually get is "Out of memory", so it seems we're hitting https://bugzilla.redhat.com/show_bug.cgi?id=1623851
It's not clear from the ticket: is there an explicit fix for this in 4.4, or did the problem just kind of go away?
If it is the described issue, the problem seems to go away in 4.4. The reason might be a newer kernel and libnl3.
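If you want to compare, the two things to check between the affected 4.3 hosts and a 4.4 host would be the kernel and libnl3 versions, e.g.:

    # Collect the kernel and libnl3 versions for comparison; plain
    # uname/rpm queries, nothing vdsm-specific.
    import subprocess

    for cmd in (['uname', '-r'], ['rpm', '-q', 'libnl3']):
        print(subprocess.check_output(cmd).decode().strip())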
--
Ales Musil
Software Engineer - RHV Network
Red Hat EMEA <https://www.redhat.com>
amusil@redhat.com
IM: amusil <https://red.ht/sig>
participants (2)
- Alan G
- Ales Musil