VM rebooted during OST test_hotplug_cpu

Hi all,

I am trying to verify fixes for ovirt-engine-rename, specifically for OVN. The engine top patch is [1], the OST patch is [2]. I ran the manual job on these [3].

In previous patches, OST failed in earlier tests. Now it passed these tests, so I hope that my patches are enough for what I am trying to do. However, [3] did fail later, during test_hotplug_cpu: it set the number of CPUs, then tried to connect to the VM, and timed out.

The logs imply that right after it changed the number of CPUs, the VM was rebooted, apparently by libvirtd. Relevant log snippets:

vdsm [4]:

2020-06-29 10:21:10,889-0400 DEBUG (jsonrpc/1) [virt.vm] (vmId='7474280d-4501-4355-9425-63898757682b') Setting number of cpus to : 2 (vm:3089)
2020-06-29 10:21:10,952-0400 INFO (jsonrpc/1) [api.virt] FINISH setNumberOfCpus return={'status': {'code': 0, 'message': 'Done'}, 'vmList': {}} from=::ffff:192.168.201.4,54576, flow_id=7f9503ed, vmId=7474280d-4501-4355-9425-63898757682b (api:54)
2020-06-29 10:21:11,111-0400 DEBUG (periodic/0) [virt.sampling.VMBulkstatsMonitor] sampled timestamp 2925.602824355 elapsed 0.160 acquired True domains all (sampling:451)
2020-06-29 10:21:11,430-0400 DEBUG (jsonrpc/1) [jsonrpc.JsonRpcServer] Return 'VM.setNumberOfCpus' in bridge with {} (__init__:356)
2020-06-29 10:21:11,432-0400 INFO (jsonrpc/1) [jsonrpc.JsonRpcServer] RPC call VM.setNumberOfCpus succeeded in 0.56 seconds (__init__:312)
2020-06-29 10:21:12,228-0400 INFO (libvirt/events) [virt.vm] (vmId='7474280d-4501-4355-9425-63898757682b') reboot event (vm:1033)

qemu [5]:

2020-06-29T14:21:12.260303Z qemu-kvm: terminating on signal 15 from pid 42224 (<unknown process>)
2020-06-29 14:21:12.462+0000: shutting down, reason=destroyed

libvirtd [6] itself does not log anything relevant AFAICT, but it at least shows that the above "unknown process" is libvirtd itself:

2020-06-29 14:18:16.212+0000: 42224: error : qemuMonitorIO:620 : internal error: End of file from qemu monitor

(Note that the above line is from 3 minutes before the reboot and is the only place in that log with '42224'. No other log has 42224, other than these and audit.log.)

Any idea? Is this a bug in libvirt? In vdsm? I'd at least expect something in the log for such a severe step.

[1] https://gerrit.ovirt.org/109961
[2] https://gerrit.ovirt.org/109734
[3] https://jenkins.ovirt.org/view/oVirt%20system%20tests/job/ovirt-system-tests...
[4] https://jenkins.ovirt.org/view/oVirt%20system%20tests/job/ovirt-system-tests...
[5] https://jenkins.ovirt.org/view/oVirt%20system%20tests/job/ovirt-system-tests...
[6] https://jenkins.ovirt.org/view/oVirt%20system%20tests/job/ovirt-system-tests...

-- Didi
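(A minimal sketch of the cross-check described above — linking qemu's "terminating on signal 15 from pid 42224" line to libvirtd — assuming the job's exported logs were downloaded locally. The artifact paths below are placeholders, not the actual OST layout.)

```python
import re
import sys
from pathlib import Path

# Hypothetical local copies of the logs collected by the OST job;
# adjust these paths to wherever the artifacts were downloaded.
LOGS = [
    Path("host-0/var/log/libvirt/libvirtd.log"),
    Path("host-0/var/log/vdsm/vdsm.log"),
    Path("host-0/var/log/messages"),
    Path("host-0/var/log/audit/audit.log"),
]


def find_pid(pid: str) -> None:
    """Print every line, with its source file, that mentions the given pid."""
    pattern = re.compile(r"\b%s\b" % re.escape(pid))
    for log in LOGS:
        if not log.exists():
            continue
        with log.open(errors="replace") as f:
            for lineno, line in enumerate(f, 1):
                if pattern.search(line):
                    print(f"{log}:{lineno}: {line.rstrip()}")


if __name__ == "__main__":
    # The pid reported by qemu: "terminating on signal 15 from pid 42224".
    find_pid(sys.argv[1] if len(sys.argv) > 1 else "42224")
```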

On 30 Jun 2020, at 08:30, Yedidyah Bar David <didi@redhat.com> wrote:
Hi all,
I am trying to verify fixes for ovirt-engine-rename, specifically for OVN. Engine top patch is [1], OST patch [2]. Ran the manual job on these [3].
In previous patches, OST failed in earlier tests. Now, it passed these tests, so I hope that my patches are enough for what I am trying to do. However, [3] did fail later, during test_hotplug_cpu - it set the number of CPUs, then tried to connect to the VM, and timed out.
The logs imply that right after it changed the number of CPUs, the VM was rebooted, apparently by libvirtd. Relevant log snippets:
vdsm [4]:
2020-06-29 10:21:10,889-0400 DEBUG (jsonrpc/1) [virt.vm] (vmId='7474280d-4501-4355-9425-63898757682b') Setting number of cpus to : 2 (vm:3089)
2020-06-29 10:21:10,952-0400 INFO (jsonrpc/1) [api.virt] FINISH setNumberOfCpus return={'status': {'code': 0, 'message': 'Done'}, 'vmList': {}} from=::ffff:192.168.201.4,54576, flow_id=7f9503ed, vmId=7474280d-4501-4355-9425-63898757682b (api:54)
2020-06-29 10:21:11,111-0400 DEBUG (periodic/0) [virt.sampling.VMBulkstatsMonitor] sampled timestamp 2925.602824355 elapsed 0.160 acquired True domains all (sampling:451)
2020-06-29 10:21:11,430-0400 DEBUG (jsonrpc/1) [jsonrpc.JsonRpcServer] Return 'VM.setNumberOfCpus' in bridge with {} (__init__:356)
2020-06-29 10:21:11,432-0400 INFO (jsonrpc/1) [jsonrpc.JsonRpcServer] RPC call VM.setNumberOfCpus succeeded in 0.56 seconds (__init__:312)
2020-06-29 10:21:12,228-0400 INFO (libvirt/events) [virt.vm] (vmId='7474280d-4501-4355-9425-63898757682b') reboot event (vm:1033)
qemu [5]:
2020-06-29T14:21:12.260303Z qemu-kvm: terminating on signal 15 from pid 42224 (<unknown process>)
2020-06-29 14:21:12.462+0000: shutting down, reason=destroyed
libvirtd [6] itself does not log anything relevant AFAICT, but at least it shows that the above unknown process is itself:
2020-06-29 14:18:16.212+0000: 42224: error : qemuMonitorIO:620 : internal error: End of file from qemu monitor
(Note that above line is from 3 minutes before the reboot, and the only place in the log with '42224'. No other log there has 42224, other than these and audit.log).
Any idea? Is this a bug in libvirt? vdsm? I'd at least expect something in the log for such a severe step.
I’d suggest to rerun. I don’t trust the CI env at all. Could be any reason. It’s highly unlikely to be caused by your patch, and I can see on my infra that OST is running well on both CentOS and Stream.
[1] https://gerrit.ovirt.org/109961
[2] https://gerrit.ovirt.org/109734
[3] https://jenkins.ovirt.org/view/oVirt%20system%20tests/job/ovirt-system-tests...
[4] https://jenkins.ovirt.org/view/oVirt%20system%20tests/job/ovirt-system-tests...
[5] https://jenkins.ovirt.org/view/oVirt%20system%20tests/job/ovirt-system-tests...
[6] https://jenkins.ovirt.org/view/oVirt%20system%20tests/job/ovirt-system-tests...

-- Didi

I’d suggest to rerun. I don’t trust the CI env at all. Could be any reason. It’s highly unlikely to be caused by your patch, and I can see on my infra that OST is running well on both CentOS and Stream.
If we don't trust the CI environment (I also don't really trust it that much), why not change it to something we can trust and rely on?

-- Sandro Bonazzola
Manager, Software Engineering, EMEA R&D RHV
Red Hat EMEA <https://www.redhat.com/>
sbonazzo@redhat.com

On Tue, Jun 30, 2020 at 9:37 AM Michal Skrivanek <michal.skrivanek@redhat.com> wrote:
On 30 Jun 2020, at 08:30, Yedidyah Bar David <didi@redhat.com> wrote:
Hi all,
I am trying to verify fixes for ovirt-engine-rename, specifically for OVN. Engine top patch is [1], OST patch [2]. Ran the manual job on these [3].
In previous patches, OST failed in earlier tests. Now, it passed these tests, so I hope that my patches are enough for what I am trying to do. However, [3] did fail later, during test_hotplug_cpu - it set the number of CPUs, then tried to connect to the VM, and timed out.
The logs imply that right after it changed the number of CPUs, the VM was rebooted, apparently by libvirtd. Relevant log snippets:
vdsm [4]:
2020-06-29 10:21:10,889-0400 DEBUG (jsonrpc/1) [virt.vm] (vmId='7474280d-4501-4355-9425-63898757682b') Setting number of cpus to : 2 (vm:3089)
2020-06-29 10:21:10,952-0400 INFO (jsonrpc/1) [api.virt] FINISH setNumberOfCpus return={'status': {'code': 0, 'message': 'Done'}, 'vmList': {}} from=::ffff:192.168.201.4,54576, flow_id=7f9503ed, vmId=7474280d-4501-4355-9425-63898757682b (api:54)
2020-06-29 10:21:11,111-0400 DEBUG (periodic/0) [virt.sampling.VMBulkstatsMonitor] sampled timestamp 2925.602824355 elapsed 0.160 acquired True domains all (sampling:451)
2020-06-29 10:21:11,430-0400 DEBUG (jsonrpc/1) [jsonrpc.JsonRpcServer] Return 'VM.setNumberOfCpus' in bridge with {} (__init__:356)
2020-06-29 10:21:11,432-0400 INFO (jsonrpc/1) [jsonrpc.JsonRpcServer] RPC call VM.setNumberOfCpus succeeded in 0.56 seconds (__init__:312)
2020-06-29 10:21:12,228-0400 INFO (libvirt/events) [virt.vm] (vmId='7474280d-4501-4355-9425-63898757682b') reboot event (vm:1033)
qemu [5]:
2020-06-29T14:21:12.260303Z qemu-kvm: terminating on signal 15 from pid 42224 (<unknown process>)
2020-06-29 14:21:12.462+0000: shutting down, reason=destroyed
libvirtd [6] itself does not log anything relevant AFAICT, but at least it shows that the above unknown process is itself:
2020-06-29 14:18:16.212+0000: 42224: error : qemuMonitorIO:620 : internal error: End of file from qemu monitor
(Note that above line is from 3 minutes before the reboot, and the only place in the log with '42224'. No other log there has 42224, other than these and audit.log).
Any idea? Is this a bug in libvirt? vdsm? I'd at least expect something in the log for such a severe step.
I’d suggest to rerun.
ok
I don’t trust the CI env at all. Could be any reason.
If the qemu process had simply died, I'd agree it's likely an infra issue. But if libvirt indeed killed it, I'd say it's either a bug there or some weird flow (which requires more logging, at least).
It’s highly unlikely to be caused by your patch, and I can see on my infra that OST is running well on both CentOS and Stream.
Well, basic-suite has also been passing for quite some time now without any failures:

https://jenkins.ovirt.org/job/ovirt-system-tests_basic-suite-master_nightly/

Thanks,
[1] https://gerrit.ovirt.org/109961
[2] https://gerrit.ovirt.org/109734
[3] https://jenkins.ovirt.org/view/oVirt%20system%20tests/job/ovirt-system-tests...
[4] https://jenkins.ovirt.org/view/oVirt%20system%20tests/job/ovirt-system-tests...
[5] https://jenkins.ovirt.org/view/oVirt%20system%20tests/job/ovirt-system-tests...
[6] https://jenkins.ovirt.org/view/oVirt%20system%20tests/job/ovirt-system-tests...

-- Didi
-- Didi

On Tue, Jun 30, 2020 at 10:37 AM Yedidyah Bar David <didi@redhat.com> wrote:
On Tue, Jun 30, 2020 at 9:37 AM Michal Skrivanek <michal.skrivanek@redhat.com> wrote:
On 30 Jun 2020, at 08:30, Yedidyah Bar David <didi@redhat.com> wrote:
Hi all,
I am trying to verify fixes for ovirt-engine-rename, specifically for OVN. Engine top patch is [1], OST patch [2]. Ran the manual job on these [3].
In previous patches, OST failed in earlier tests. Now, it passed these tests, so I hope that my patches are enough for what I am trying to do. However, [3] did fail later, during test_hotplug_cpu - it set the number of CPUs, then tried to connect to the VM, and timed out.
The logs imply that right after it changed the number of CPUs, the VM was rebooted, apparently by libvirtd. Relevant log snippets:
vdsm [4]:
2020-06-29 10:21:10,889-0400 DEBUG (jsonrpc/1) [virt.vm] (vmId='7474280d-4501-4355-9425-63898757682b') Setting number of cpus to : 2 (vm:3089)
2020-06-29 10:21:10,952-0400 INFO (jsonrpc/1) [api.virt] FINISH setNumberOfCpus return={'status': {'code': 0, 'message': 'Done'}, 'vmList': {}} from=::ffff:192.168.201.4,54576, flow_id=7f9503ed, vmId=7474280d-4501-4355-9425-63898757682b (api:54)
2020-06-29 10:21:11,111-0400 DEBUG (periodic/0) [virt.sampling.VMBulkstatsMonitor] sampled timestamp 2925.602824355 elapsed 0.160 acquired True domains all (sampling:451)
2020-06-29 10:21:11,430-0400 DEBUG (jsonrpc/1) [jsonrpc.JsonRpcServer] Return 'VM.setNumberOfCpus' in bridge with {} (__init__:356)
2020-06-29 10:21:11,432-0400 INFO (jsonrpc/1) [jsonrpc.JsonRpcServer] RPC call VM.setNumberOfCpus succeeded in 0.56 seconds (__init__:312)
2020-06-29 10:21:12,228-0400 INFO (libvirt/events) [virt.vm] (vmId='7474280d-4501-4355-9425-63898757682b') reboot event (vm:1033)
qemu [5]:
2020-06-29T14:21:12.260303Z qemu-kvm: terminating on signal 15 from pid 42224 (<unknown process>)
2020-06-29 14:21:12.462+0000: shutting down, reason=destroyed
libvirtd [6] itself does not log anything relevant AFAICT, but at least it shows that the above unknown process is itself:
2020-06-29 14:18:16.212+0000: 42224: error : qemuMonitorIO:620 : internal error: End of file from qemu monitor
Is this from the libvirt log? Why would the libvirt log contain the libvirtd pid?
(Note that above line is from 3 minutes before the reboot, and the only place in the log with '42224'. No other log there has 42224, other than these and audit.log).
Any idea? Is this a bug in libvirt? vdsm? I'd at least expect something in the log for such a severe step.
Is this the hosted engine vm? If we had trouble with storage, maybe sanlock killed the vm because it could not renew the lease. What do we have in /var/log/sanlock.log? Also, do we have any WARN logs in vdsm? If there was an issue with storage, you would see warnings about blocked checkers every 10 seconds.
I’d suggest to rerun.
ok
I don’t trust the CI env at all. Could be any reason.
If the qemu process had simply died, I'd agree it's likely an infra issue. But if libvirt indeed killed it, I'd say it's either a bug there or some weird flow (which requires more logging, at least).
It’s highly unlikely to be caused by your patch, and I can see on my infra that OST is running well on both CentOS and Stream.
Well, basic-suite has also been passing for quite some time now without any failures:
https://jenkins.ovirt.org/job/ovirt-system-tests_basic-suite-master_nightly/
Thanks,
[1] https://gerrit.ovirt.org/109961
[2] https://gerrit.ovirt.org/109734
[3] https://jenkins.ovirt.org/view/oVirt%20system%20tests/job/ovirt-system-tests...
[4] https://jenkins.ovirt.org/view/oVirt%20system%20tests/job/ovirt-system-tests...
[5] https://jenkins.ovirt.org/view/oVirt%20system%20tests/job/ovirt-system-tests...
[6] https://jenkins.ovirt.org/view/oVirt%20system%20tests/job/ovirt-system-tests...

-- Didi
-- Didi

On Tue, Jun 30, 2020 at 9:30 PM Nir Soffer <nsoffer@redhat.com> wrote:
On Tue, Jun 30, 2020 at 10:37 AM Yedidyah Bar David <didi@redhat.com> wrote:
On Tue, Jun 30, 2020 at 9:37 AM Michal Skrivanek <michal.skrivanek@redhat.com> wrote:
On 30 Jun 2020, at 08:30, Yedidyah Bar David <didi@redhat.com> wrote:
Hi all,
I am trying to verify fixes for ovirt-engine-rename, specifically for OVN. Engine top patch is [1], OST patch [2]. Ran the manual job on these [3].
In previous patches, OST failed in earlier tests. Now, it passed these tests, so I hope that my patches are enough for what I am trying to do. However, [3] did fail later, during test_hotplug_cpu - it set the number of CPUs, then tried to connect to the VM, and timed out.
The logs imply that right after it changed the number of CPUs, the VM was rebooted, apparently by libvirtd. Relevant log snippets:
vdsm [4]:
2020-06-29 10:21:10,889-0400 DEBUG (jsonrpc/1) [virt.vm] (vmId='7474280d-4501-4355-9425-63898757682b') Setting number of cpus to : 2 (vm:3089)
2020-06-29 10:21:10,952-0400 INFO (jsonrpc/1) [api.virt] FINISH setNumberOfCpus return={'status': {'code': 0, 'message': 'Done'}, 'vmList': {}} from=::ffff:192.168.201.4,54576, flow_id=7f9503ed, vmId=7474280d-4501-4355-9425-63898757682b (api:54)
2020-06-29 10:21:11,111-0400 DEBUG (periodic/0) [virt.sampling.VMBulkstatsMonitor] sampled timestamp 2925.602824355 elapsed 0.160 acquired True domains all (sampling:451)
2020-06-29 10:21:11,430-0400 DEBUG (jsonrpc/1) [jsonrpc.JsonRpcServer] Return 'VM.setNumberOfCpus' in bridge with {} (__init__:356)
2020-06-29 10:21:11,432-0400 INFO (jsonrpc/1) [jsonrpc.JsonRpcServer] RPC call VM.setNumberOfCpus succeeded in 0.56 seconds (__init__:312)
2020-06-29 10:21:12,228-0400 INFO (libvirt/events) [virt.vm] (vmId='7474280d-4501-4355-9425-63898757682b') reboot event (vm:1033)
qemu [5]:
2020-06-29T14:21:12.260303Z qemu-kvm: terminating on signal 15 from pid 42224 (<unknown process>)
2020-06-29 14:21:12.462+0000: shutting down, reason=destroyed
libvirtd [6] itself does not log anything relevant AFAICT, but at least it shows that the above unknown process is itself:
2020-06-29 14:18:16.212+0000: 42224: error : qemuMonitorIO:620 : internal error: End of file from qemu monitor
Is this from libvirt log?
Yes
Why would the libvirt log contain the libvirtd pid?
Perhaps because it forks several children, and thus also includes the pid in its logging format string? That's standard practice.
(Note that above line is from 3 minutes before the reboot, and the only place in the log with '42224'. No other log there has 42224, other than these and audit.log).
Any idea? Is this a bug in libvirt? vdsm? I'd at least expect something in the log for such a severe step.
Is this the hosted engine vm?
No, it's "vm0". (It's the basic suite, not hosted-engine).
If we had trouble with storage, maybe sanlock killed the vm because it could not renew the lease.
What do we have in /var/log/sanlock.log?
https://jenkins.ovirt.org/view/oVirt%20system%20tests/job/ovirt-system-tests...

Nothing that seems relevant - last line is:

2020-06-29 10:14:29 2523 [44292]: s4:r6 resource bdabb997-2075-4bb6-8217-b3a99d1bd599:da464550-1f81-443a-907d-39b763f13751:/rhev/data-center/mnt/192.168.200.4:_exports_nfs_share2/bdabb997-2075-4bb6-8217-b3a99d1bd599/dom_md/xleases:3145728
Also, do we have any WARN logs in vdsm? If there was an issue with storage, you would see warnings about blocked checkers every 10 seconds.
20 WARN lines. Last one before the problem:

2020-06-29 10:18:33,569-0400 WARN (vm/da464550) [virt.vm] (vmId='da464550-1f81-443a-907d-39b763f13751') Cannot find device alias for _conf:{'type': 'lease', 'device': 'lease', 'lease_id': 'da464550-1f81-443a-907d-39b763f13751', 'sd_id': 'bdabb997-2075-4bb6-8217-b3a99d1bd599', 'path': '/rhev/data-center/mnt/192.168.200.4:_exports_nfs_share2/bdabb997-2075-4bb6-8217-b3a99d1bd599/dom_md/xleases', 'offset': '3145728'} _deviceXML:None alias:None config:<bound method Base.config of <lease.Device sd_id=bdabb997-2075-4bb6-8217-b3a99d1bd599, lease_id=da464550-1f81-443a-907d-39b763f13751, path=/rhev/data-center/mnt/192.168.200.4:_exports_nfs_share2/bdabb997-2075-4bb6-8217-b3a99d1bd599/dom_md/xleases, offset=3145728 at 0x7f18ec1c9d58>> createXmlElem:<bound method Device.createXmlElem of <lease.Device sd_id=bdabb997-2075-4bb6-8217-b3a99d1bd599, lease_id=da464550-1f81-443a-907d-39b763f13751, path=/rhev/data-center/mnt/192.168.200.4:_exports_nfs_share2/bdabb997-2075-4bb6-8217-b3a99d1bd599/dom_md/xleases, offset=3145728 at 0x7f18ec1c9d58>> custom:{} device:lease deviceId:None deviceType:None from_xml_tree:<bound method Device.from_xml_tree of <class 'vdsm.virt.vmdevices.lease.Device'>> getXML:<bound method Device.getXML of <lease.Device sd_id=bdabb997-2075-4bb6-8217-b3a99d1bd599, lease_id=da464550-1f81-443a-907d-39b763f13751, path=/rhev/data-center/mnt/192.168.200.4:_exports_nfs_share2/bdabb997-2075-4bb6-8217-b3a99d1bd599/dom_md/xleases, offset=3145728 at 0x7f18ec1c9d58>> get_extra_xmls:<bound method Base.get_extra_xmls of <lease.Device sd_id=bdabb997-2075-4bb6-8217-b3a99d1bd599, lease_id=da464550-1f81-443a-907d-39b763f13751, path=/rhev/data-center/mnt/192.168.200.4:_exports_nfs_share2/bdabb997-2075-4bb6-8217-b3a99d1bd599/dom_md/xleases, offset=3145728 at 0x7f18ec1c9d58>> get_identifying_attrs:<bound method Device.get_identifying_attrs of <class 'vdsm.virt.vmdevices.lease.Device'>> get_metadata:<bound method Base.get_metadata of <lease.Device sd_id=bdabb997-2075-4bb6-8217-b3a99d1bd599, lease_id=da464550-1f81-443a-907d-39b763f13751, path=/rhev/data-center/mnt/192.168.200.4:_exports_nfs_share2/bdabb997-2075-4bb6-8217-b3a99d1bd599/dom_md/xleases, offset=3145728 at 0x7f18ec1c9d58>> hotunplug_event:<threading.Event object at 0x7f18c24ce780> is_hostdevice:False lease_id:da464550-1f81-443a-907d-39b763f13751 log:<SimpleLogAdapter virt.vm (DEBUG)> offset:3145728 path:/rhev/data-center/mnt/192.168.200.4:_exports_nfs_share2/bdabb997-2075-4bb6-8217-b3a99d1bd599/dom_md/xleases sd_id:bdabb997-2075-4bb6-8217-b3a99d1bd599 setup:<bound method Base.setup of <lease.Device sd_id=bdabb997-2075-4bb6-8217-b3a99d1bd599, lease_id=da464550-1f81-443a-907d-39b763f13751, path=/rhev/data-center/mnt/192.168.200.4:_exports_nfs_share2/bdabb997-2075-4bb6-8217-b3a99d1bd599/dom_md/xleases, offset=3145728 at 0x7f18ec1c9d58>> specParams:{} teardown:<bound method Base.teardown of <lease.Device sd_id=bdabb997-2075-4bb6-8217-b3a99d1bd599, lease_id=da464550-1f81-443a-907d-39b763f13751, path=/rhev/data-center/mnt/192.168.200.4:_exports_nfs_share2/bdabb997-2075-4bb6-8217-b3a99d1bd599/dom_md/xleases, offset=3145728 at 0x7f18ec1c9d58>> type:lease update_device_info:<bound method Device.update_device_info of <class 'vdsm.virt.vmdevices.lease.Device'>> vmid:None (core:329)

Next one after it:

2020-06-29 10:21:16,643-0400 WARN (libvirt/events) [root] Attempting to remove a non existing network: ovirtmgmt/7474280d-4501-4355-9425-63898757682b (libvirtnetwork:200)
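(For completeness, the kind of WARN check described above can be scripted roughly like this. A sketch only: the log path is a placeholder, and the time window is just the span around the reboot event; vdsm log lines are assumed to start with a 'YYYY-MM-DD HH:MM:SS' prefix, as in the snippets above.)

```python
from pathlib import Path

# Hypothetical local copy of the vdsm log collected by the job.
VDSM_LOG = Path("host-0/var/log/vdsm/vdsm.log")

# Window around the unexpected reboot event (timestamps in this run are
# local time, -0400); adjust as needed.
START = "2020-06-29 10:15:00"
END = "2020-06-29 10:22:00"


def warnings_in_window(path: Path, start: str, end: str):
    """Yield WARN lines whose 'YYYY-MM-DD HH:MM:SS' prefix falls in [start, end]."""
    with path.open(errors="replace") as f:
        for line in f:
            stamp = line[:19]
            if start <= stamp <= end and " WARN " in line:
                yield line.rstrip()


for entry in warnings_in_window(VDSM_LOG, START, END):
    print(entry)
```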
I’d suggest to rerun.
ok
It passed:

https://jenkins.ovirt.org/view/oVirt%20system%20tests/job/ovirt-system-tests...

Feel free to have a look and compare with the failed one.

Thanks and best regards,
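(For anyone who wants to exercise the same step outside OST, the hot-plug flow is roughly the following via the oVirt Python SDK. This is a sketch only, not the actual test code; the engine URL, credentials and the VM name are placeholders.)

```python
import ovirtsdk4 as sdk
import ovirtsdk4.types as types

# Placeholder engine URL and credentials; use the ones from your own setup.
connection = sdk.Connection(
    url="https://engine.example.com/ovirt-engine/api",
    username="admin@internal",
    password="secret",
    insecure=True,
)
try:
    vms_service = connection.system_service().vms_service()
    vm = vms_service.list(search="name=vm0")[0]  # 'vm0' as in the basic suite
    vm_service = vms_service.vm_service(vm.id)

    # Hot-plug CPUs on the running VM by updating its CPU topology;
    # on the host side this shows up as VM.setNumberOfCpus in the vdsm log.
    vm_service.update(
        types.Vm(
            cpu=types.Cpu(
                topology=types.CpuTopology(sockets=2, cores=1, threads=1),
            ),
        ),
    )
finally:
    connection.close()
```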
I don’t trust the CI env at all. Could be any reason.
If the qemu process had simply died, I'd agree it's likely an infra issue. But if libvirt indeed killed it, I'd say it's either a bug there or some weird flow (which requires more logging, at least).
It’s highly unlikely to be caused by your patch, and I can see on my infra that OST is running well on both CentOS and Stream.
Well, basic-suite has also been passing for quite some time now without any failures:
https://jenkins.ovirt.org/job/ovirt-system-tests_basic-suite-master_nightly/
Thanks,
[1] https://gerrit.ovirt.org/109961
[2] https://gerrit.ovirt.org/109734
[3] https://jenkins.ovirt.org/view/oVirt%20system%20tests/job/ovirt-system-tests...
[4] https://jenkins.ovirt.org/view/oVirt%20system%20tests/job/ovirt-system-tests...
[5] https://jenkins.ovirt.org/view/oVirt%20system%20tests/job/ovirt-system-tests...
[6] https://jenkins.ovirt.org/view/oVirt%20system%20tests/job/ovirt-system-tests...

-- Didi
-- Didi
-- Didi

On Wed, Jul 1, 2020 at 8:47 AM Yedidyah Bar David <didi@redhat.com> wrote:
On Tue, Jun 30, 2020 at 9:30 PM Nir Soffer <nsoffer@redhat.com> wrote:
On Tue, Jun 30, 2020 at 10:37 AM Yedidyah Bar David <didi@redhat.com> wrote:
On Tue, Jun 30, 2020 at 9:37 AM Michal Skrivanek <michal.skrivanek@redhat.com> wrote:
On 30 Jun 2020, at 08:30, Yedidyah Bar David <didi@redhat.com> wrote:
Hi all,
I am trying to verify fixes for ovirt-engine-rename, specifically for OVN. Engine top patch is [1], OST patch [2]. Ran the manual job on these [3].
In previous patches, OST failed in earlier tests. Now, it passed these tests, so I hope that my patches are enough for what I am trying to do. However, [3] did fail later, during test_hotplug_cpu - it set the number of CPUs, then tried to connect to the VM, and timed out.
The logs imply that right after it changed the number of CPUs, the VM was rebooted, apparently by libvirtd. Relevant log snippets:
vdsm [4]:
2020-06-29 10:21:10,889-0400 DEBUG (jsonrpc/1) [virt.vm] (vmId='7474280d-4501-4355-9425-63898757682b') Setting number of cpus to : 2 (vm:3089)
2020-06-29 10:21:10,952-0400 INFO (jsonrpc/1) [api.virt] FINISH setNumberOfCpus return={'status': {'code': 0, 'message': 'Done'}, 'vmList': {}} from=::ffff:192.168.201.4,54576, flow_id=7f9503ed, vmId=7474280d-4501-4355-9425-63898757682b (api:54)
2020-06-29 10:21:11,111-0400 DEBUG (periodic/0) [virt.sampling.VMBulkstatsMonitor] sampled timestamp 2925.602824355 elapsed 0.160 acquired True domains all (sampling:451)
2020-06-29 10:21:11,430-0400 DEBUG (jsonrpc/1) [jsonrpc.JsonRpcServer] Return 'VM.setNumberOfCpus' in bridge with {} (__init__:356)
2020-06-29 10:21:11,432-0400 INFO (jsonrpc/1) [jsonrpc.JsonRpcServer] RPC call VM.setNumberOfCpus succeeded in 0.56 seconds (__init__:312)
2020-06-29 10:21:12,228-0400 INFO (libvirt/events) [virt.vm] (vmId='7474280d-4501-4355-9425-63898757682b') reboot event (vm:1033)
qemu [5]:
2020-06-29T14:21:12.260303Z qemu-kvm: terminating on signal 15 from pid 42224 (<unknown process>)
2020-06-29 14:21:12.462+0000: shutting down, reason=destroyed
libvirtd [6] itself does not log anything relevant AFAICT, but at least it shows that the above unknown process is itself:
2020-06-29 14:18:16.212+0000: 42224: error : qemuMonitorIO:620 : internal error: End of file from qemu monitor
Is this from libvirt log?
Yes
Why would the libvirt log contain the libvirtd pid?
Perhaps because it forks several children, and thus also includes the pid in its logging format string? That's standard practice.
Maybe.
(Note that above line is from 3 minutes before the reboot, and the only place in the log with '42224'. No other log there has 42224, other than these and audit.log).
Any idea? Is this a bug in libvirt? vdsm? I'd at least expect something in the log for such a severe step.
Is this the hosted engine vm?
No, it's "vm0". (It's the basic suite, not hosted-engine).
Maybe some tests add a vm lease to this vm...
If we had trouble with storage, maybe sanlock killed the vm because it could not renew the lease.
What do we have in /var/log/sanlock.log?
https://jenkins.ovirt.org/view/oVirt%20system%20tests/job/ovirt-system-tests...
Nothing that seems relevant - last line is:
2020-06-29 10:14:29 2523 [44292]: s4:r6 resource bdabb997-2075-4bb6-8217-b3a99d1bd599:da464550-1f81-443a-907d-39b763f13751:/rhev/data-center/mnt/192.168.200.4:_exports_nfs_share2/bdabb997-2075-4bb6-8217-b3a99d1bd599/dom_md/xleases:3145728
This is a vm lease for vm da464550-1f81-443a-907d-39b763f13751, so theoretically this vm could have been terminated by sanlock if sanlock could not renew the lease...
Also, do we have any WARN logs in vdsm? If there was an issue with storage, you would see warnings about blocked checkers every 10 seconds.
20 WARN lines. Last one before the problem: ...
But none of these warnings is related, and there is nothing in the sanlock log about renewal issues or about killing pids, so this is not the issue. This could also be vdsm asking libvirt to terminate the vm, for example if the vm was paused for a long time, based on the restart policy.
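(A rough way to rule sanlock in or out is to scan its log for renewal or kill messages. A sketch only: the log path is a placeholder, and the exact message wording differs between sanlock versions, so the patterns below are indicative rather than exhaustive.)

```python
import re
from pathlib import Path

# Hypothetical local copy of the host's sanlock log.
SANLOCK_LOG = Path("host-0/var/log/sanlock.log")

# Indicative patterns for lease-renewal trouble or sanlock killing a lease
# holder; adjust to the wording of your sanlock version.
SUSPICIOUS = re.compile(
    r"renewal error|delta_renew|check_our_lease|kill \d+ sig",
    re.IGNORECASE,
)

with SANLOCK_LOG.open(errors="replace") as f:
    for line in f:
        if SUSPICIOUS.search(line):
            print(line.rstrip())
```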
I’d suggest to rerun.
ok
It passed:
https://jenkins.ovirt.org/view/oVirt%20system%20tests/job/ovirt-system-tests...
Feel free to have a look and compare with the failed one.
Thanks and best regards,
I don’t trust the CI env at all. Could be any reason.
If the qemu process had simply died, I'd agree it's likely an infra issue. But if libvirt indeed killed it, I'd say it's either a bug there or some weird flow (which requires more logging, at least).
It’s highly unlikely to be caused by your patch, and I can see on my infra that OST is running well on both CentOS and Stream.
Well, basic-suite has also been passing for quite some time now without any failures:
https://jenkins.ovirt.org/job/ovirt-system-tests_basic-suite-master_nightly/
Thanks,
[1] https://gerrit.ovirt.org/109961
[2] https://gerrit.ovirt.org/109734
[3] https://jenkins.ovirt.org/view/oVirt%20system%20tests/job/ovirt-system-tests...
[4] https://jenkins.ovirt.org/view/oVirt%20system%20tests/job/ovirt-system-tests...
[5] https://jenkins.ovirt.org/view/oVirt%20system%20tests/job/ovirt-system-tests...
[6] https://jenkins.ovirt.org/view/oVirt%20system%20tests/job/ovirt-system-tests...

-- Didi
-- Didi
-- Didi