On Tue, Mar 20, 2018 at 11:24 AM, Gal Ben Haim <gbenhaim(a)redhat.com> wrote:
The failure happened again on "ovirt-srv04".
The suite wasn't run from "/dev/shm" since it was full of stale lago
environments from "hc-basic-suite-4.1" and "he-basic-iscsi-suite-4.2".
The stale envs are there because Jenkins raised a timeout (the suites
were stuck for 6 hours), so OST's cleanup was never called.
I'm going to add an internal timeout to OST.
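
Roughly something like this (just a sketch - it assumes the
run_suite.sh entry point, and the 4h limit and the cleanup hook name
are made up):

    # Give the suite a hard limit shorter than Jenkins' 6-hour timeout,
    # so OST's own cleanup still gets a chance to run.
    timeout 4h ./run_suite.sh "$SUITE"
    rc=$?
    cleanup_stale_envs   # placeholder for the existing cleanup step
    exit $rc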
On Tue, Mar 20, 2018 at 11:03 AM, Yedidyah Bar David <didi(a)redhat.com>
wrote:
>
> On Tue, Mar 20, 2018 at 10:57 AM, Barak Korren <bkorren(a)redhat.com> wrote:
> > On 20 March 2018 at 10:53, Yedidyah Bar David <didi(a)redhat.com> wrote:
> >> On Tue, Mar 20, 2018 at 10:11 AM, Barak Korren <bkorren(a)redhat.com>
> >> wrote:
> >>> On 20 March 2018 at 09:17, Yedidyah Bar David <didi(a)redhat.com> wrote:
> >>>> On Mon, Mar 19, 2018 at 6:56 PM, Dominik Holler <dholler(a)redhat.com> wrote:
> >>>>> Thanks Gal, I expect the problem is fixed until something eats
> >>>>> all the space in /dev/shm again.
> >>>>> But the usage of /dev/shm is logged in the output, so we would
> >>>>> be able to detect the problem instantly next time.
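> >>>>>
> >>>>> (E.g. a single line near the start of the job output is enough:
> >>>>>
> >>>>>     df -h /dev/shm
> >>>>>
> >>>>> so the next failure should be obvious at a glance.)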
> >>>>>
> >>>>> From my point of view it would be good to know why /dev/shm was
> >>>>> full, to prevent this situation in the future.
> >>>>
> >>>> Gal already wrote below - it was because some build failed to
> >>>> clean up after itself.
> >>>>
> >>>> I don't know about this specific case, but I was told that I am
> >>>> personally causing such issues by using the 'cancel' button, so I
> >>>> sadly stopped. Sadly, because our CI system is quite loaded, and
> >>>> when I know that some build is useless, I wish to kill it and
> >>>> save some load...
> >>>>
> >>>> Back to your point, perhaps we should make jobs check /dev/shm
> >>>> when they _start_, and either alert/fail/whatever if it's not
> >>>> almost free, or, if we know what we are doing, just remove stuff
> >>>> there? That might be much easier than fixing things to clean up
> >>>> at the end, and/or debugging why this cleaning failed.
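> >>>>
> >>>> Something like this, maybe (untested sketch, the threshold is
> >>>> arbitrary):
> >>>>
> >>>>     # fail early if /dev/shm is already mostly used at job start
> >>>>     used=$(df --output=pcent /dev/shm | tail -1 | tr -dc '0-9')
> >>>>     if [ "$used" -gt 10 ]; then
> >>>>         echo "/dev/shm is ${used}% full - stale envs?" >&2
> >>>>         df -h /dev/shm
> >>>>         exit 1
> >>>>     fi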
> >>>
> >>> Sure thing, patches to:
> >>>
> >>> [jenkins repo]/jobs/confs/shell-scripts/cleanup_slave.sh
> >>>
> >>> are welcome; we often find interesting stuff to add there...
> >>>
> >>> If you're pressed for time, please turn this comment into an
> >>> orderly RFE in Jira...
> >>
> >> I searched for '/dev/shm' and found way too many places to analyze
> >> them all and add something to cleanup_slave.sh to cover everything.
> >
> > Where did you search?
>
> ovirt-system-tests, lago, lago-ost-plugin.
> ovirt-system-tests has 83 occurrences. I realize almost all are in
> lago guests, but looking still takes time...
>
> In theory I can patch cleanup_slave.sh as you suggested, removing
> _everything_ there.
> Not sure this is safe.
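>
> A safer middle ground might be to only drop old entries, e.g.
> (untested, the 12-hour threshold is a guess):
>
>     # remove top-level /dev/shm entries untouched for 12+ hours;
>     # assumes running jobs keep touching their own files
>     find /dev/shm -mindepth 1 -maxdepth 1 -mmin +720 \
>         -exec rm -rf {} + || true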
Well, pushed now:
https://gerrit.ovirt.org/89225
>
> >
> >>
> >> Pushed this for now:
> >>
> >> https://gerrit.ovirt.org/89215
> >>
> >>>
> >>> --
> >>> Barak Korren
> >>> RHV DevOps team, RHCE, RHCi
> >>> Red Hat EMEA
> >>>
> >>> redhat.com | TRIED. TESTED. TRUSTED. | redhat.com/trusted
> >>
> >>
> >>
> >> --
> >> Didi
> >
> >
> >
> > --
> > Barak Korren
> > RHV DevOps team, RHCE, RHCi
> > Red Hat EMEA
> >
> > redhat.com | TRIED. TESTED. TRUSTED. | redhat.com/trusted
>
>
>
> --
> Didi
> _______________________________________________
> Infra mailing list
> Infra(a)ovirt.org
> http://lists.ovirt.org/mailman/listinfo/infra
--
GAL bEN HAIM
RHV DEVOPS
--
Didi