On Thu, Nov 29, 2018 at 11:21 AM Edward Haas <edwardh@redhat.com> wrote:


On Thu, Nov 29, 2018 at 10:41 AM Edward Haas <edwardh@redhat.com> wrote:


On Wed, Nov 28, 2018 at 8:12 PM Nir Soffer <nsoffer@redhat.com> wrote:
We have this failure that pops randomly:

1. All tests pass

00:13:13.284 ___________________________________ summary ____________________________________
00:13:13.285   tests: commands succeeded
00:13:13.286   storage-py27: commands succeeded
00:13:13.286   storage-py36: commands succeeded
00:13:13.286   lib-py27: commands succeeded
00:13:13.287   lib-py36: commands succeeded
00:13:13.288   network-py27: commands succeeded
00:13:13.290   network-py36: commands succeeded
00:13:13.291   virt-py27: commands succeeded
00:13:13.292   virt-py36: commands succeeded
00:13:13.293   congratulations :)

2. But we fail to collect logs at the end

00:14:35.992 ##########################################################
00:14:35.995 ## Wed Nov 28 17:39:50 UTC 2018 Finished env: fc28:fedora-28-x86_64
00:14:35.996 ##      took 764 seconds
00:14:35.997 ##      rc = 1
00:14:35.997 ##########################################################
00:14:36.009 ##! ERROR vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
00:14:36.010 ##! Last 20 log entries: /tmp/mock_logs.Lcop4ZOq/script/stdout_stderr.log
00:14:36.011 ##!
00:14:36.012 journal/b087148aba6d49b9bbef488e52a48752/system.journal
00:14:36.013 tar: journal/b087148aba6d49b9bbef488e52a48752/system.journal: file changed as we read it
00:14:36.014 journal/b087148aba6d49b9bbef488e52a48752/user-1000.journal
00:14:36.015 lastlog
00:14:36.015 libvirt/
00:14:36.015 libvirt/lxc/
00:14:36.015 libvirt/libxl/
00:14:36.016 libvirt/qemu/
00:14:36.016 libvirt/qemu/LiveOS-f920001d-be4e-47ea-ac26-72480fd5be87.log
00:14:36.017 libvirt/uml/
00:14:36.017 ovirt-guest-agent/
00:14:36.017 ovirt-guest-agent/ovirt-guest-agent.log
00:14:36.017 README
00:14:36.018 samba/
00:14:36.018 samba/old/
00:14:36.018 sssd/
00:14:36.018 tallylog
00:14:36.018 wtmp
00:14:36.018 Took 678 seconds
00:14:36.018 ===================================
00:14:36.019 ##!
00:14:36.019 ##! ERROR ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
00:14:36.019 ##!########################################################

This looks like an issue with vdsm check-patch.sh:

function collect_logs {
    res=$?
    [ "$res" -ne 0 ] && echo "*** err: $res"
    cd /var/log
    tar -cvzf "$EXPORT_DIR/mock_varlogs.tar.gz" *
    cd /var/host_log
    tar -cvzf "$EXPORT_DIR/host_varlogs.tar.gz" *
}

trap collect_logs EXIT

Seems that tar fail to collect log if the log is modified while copied, which makes sense.

We can ignore errors in tar, since log collection should not fail the build, but I think
a better solution is to avoid collecting any logs since vdsm writes its own logs during
tests - all the info must be in vdsm log.

Here is the list of collected logs:

00:13:47.280 + tar -cvzf /home/jenkins/workspace/vdsm_master_check-patch-fc28-x86_64/vdsm/exported-artifacts/mock_varlogs.tar.gz btmp dnf.librepo.log dnf.log dnf.rpm.log faillog glusterfs hawkey.log journal lastlog libvirt openvswitch README tallylog vdsm_tests.log wtmp yum.log
00:13:47.285 btmp
00:13:47.285 dnf.librepo.log
00:13:47.299 dnf.log
00:13:47.309 dnf.rpm.log
00:13:47.310 faillog
00:13:47.311 glusterfs/
00:13:47.312 hawkey.log
00:13:47.313 journal/
00:13:47.313 lastlog
00:13:47.315 libvirt/
00:13:47.315 libvirt/qemu/
00:13:47.316 openvswitch/
00:13:47.317 openvswitch/ovs-vswitchd.log
00:13:47.318 openvswitch/ovsdb-server.log
00:13:47.319 README
00:13:47.320 tallylog
00:13:47.321 vdsm_tests.log
00:13:47.342 wtmp
00:13:47.343 yum.log
00:13:47.349 + cd /var/host_log
00:13:47.350 + tar -cvzf /home/jenkins/workspace/vdsm_master_check-patch-fc28-x86_64/vdsm/exported-artifacts/host_varlogs.tar.gz anaconda audit boot.log btmp chrony cloud-init.log cloud-init-output.log cron dnf.librepo.log dnf.log dnf.rpm.log firewalld glusterfs hawkey.log journal lastlog libvirt ovirt-guest-agent README samba sssd tallylog wtmp
00:13:47.356 anaconda/
00:13:47.356 anaconda/ifcfg.log
00:13:47.357 anaconda/ks-script-l5qnynnj.log
00:13:47.358 anaconda/storage.log
00:13:47.359 anaconda/program.log
00:13:47.395 anaconda/ks-script-b5_08tmo.log
00:13:47.396 anaconda/ks-script-6uks8bp3.log
00:13:47.397 anaconda/hawkey.log
00:13:47.398 anaconda/syslog
00:13:47.406 anaconda/journal.log
00:13:47.449 anaconda/dnf.librepo.log
00:13:47.458 anaconda/packaging.log
00:13:47.465 anaconda/dbus.log
00:13:47.466 anaconda/anaconda.log
00:13:47.467 anaconda/ks-script-slrcz39_.log
00:13:47.503 audit/
00:13:47.504 audit/audit.log.3
00:13:47.657 audit/audit.log.2
00:13:47.814 audit/audit.log.1
00:13:47.981 audit/audit.log
00:13:48.008 audit/audit.log.4
00:13:48.155 boot.log
00:13:48.156 btmp
00:13:48.157 chrony/
00:13:48.159 cloud-init.log
00:13:48.159 cloud-init-output.log
00:13:48.161 cron
00:13:48.162 dnf.librepo.log
00:13:49.930 dnf.log
00:13:51.335 dnf.rpm.log
00:13:51.421 firewalld
00:13:51.423 glusterfs/
00:13:51.424 hawkey.log
00:13:51.704 journal/
00:13:51.708 journal/b087148aba6d49b9bbef488e52a48752/
00:13:51.709 journal/b087148aba6d49b9bbef488e52a48752/system.journal
00:13:55.817 tar: journal/b087148aba6d49b9bbef488e52a48752/system.journal: file changed as we read it
00:13:55.819 journal/b087148aba6d49b9bbef488e52a48752/user-1000.journal
00:13:55.915 lastlog
00:13:55.923 libvirt/
00:13:55.924 libvirt/lxc/
00:13:55.926 libvirt/libxl/
00:13:55.927 libvirt/qemu/
00:13:55.928 libvirt/qemu/LiveOS-f920001d-be4e-47ea-ac26-72480fd5be87.log
00:13:55.929 libvirt/uml/
00:13:55.930 ovirt-guest-agent/
00:13:55.930 ovirt-guest-agent/ovirt-guest-agent.log
00:13:55.932 README
00:13:55.933 samba/
00:13:55.933 samba/old/
00:13:55.935 sssd/
00:13:55.935 tallylog
00:13:55.935 wtmp

Most if not all are lot relevant to vdsm tests, and should not be collected.

This was added in:

commit 9c9c17297433e5a5a49aa19cde10b206e7db61e9
Author: Edward Haas <edwardh@redhat.com>
Date:   Tue Apr 17 10:53:11 2018 +0300

    automation: Collect logs even when check-patch fails
    
    Change-Id: Idfe07ce6fc55473b1db1d7f16754f559cc5c345a
    Signed-off-by: Edward Haas <edwardh@redhat.com>

Reviewed in:

Edward, can you explain why do we need to collect logs during check-patch,
and why do we need to collect all the logs in the system?

check-patch are running unit and integrations tests.
The integration tests are touching the OS and other packages (like openvswitch).
It was added so we can debug why tests failed.

I guess we can now separate the unit and integration tests, but it will not solve
the problem presented here.
Failing to collect the logs silently sounds a good enough solution to me.

Barak suggested to just exclude the journal: https://gerrit.ovirt.org/#/c/95850/

This is fixed now, thanks!

Any reason why we exclude journal only for /var/log, and do collect the binary journal
from /var/host_log? I guess it can fail in the same way.

If we collect logs for integration tests, the most important log is the journal, and now
we skip it.

We can get the journal in a reliable way like this:

    journalctl --since build-date > /tmp/journal.log

Nir