The failure happened again on "ovirt-srv04".
The suite wasn't run from "/dev/shm" because it was full of stale lago environments left over from "hc-basic-suite-4.1" and "he-basic-iscsi-suite-4.2".
The envs were stale because Jenkins aborted those jobs on timeout (the suites had been stuck for 6 hours), so OST's cleanup was never called.
I'm going to add an internal timeout to OST.
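A minimal sketch of what such an internal timeout could look like, assuming the suite is launched as a subprocess and cleanup is passed in as a callable. `run_suite_with_timeout` and its parameters are hypothetical, not existing OST code. The idea is that the timeout fires inside OST, well before Jenkins' own limit, so cleanup still gets a chance to run:

```python
import subprocess
import sys
from typing import Callable, Sequence


def run_suite_with_timeout(
    cmd: Sequence[str],
    timeout_s: float,
    cleanup: Callable[[], None],
) -> bool:
    """Run a suite command under our own timeout (hypothetical helper).

    Returns True if the command exited with code 0 within timeout_s,
    False otherwise. The cleanup callback runs in every case.
    """
    try:
        result = subprocess.run(cmd, timeout=timeout_s)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        # subprocess.run() kills the direct child before re-raising.
        print(f"suite exceeded {timeout_s}s, aborting", file=sys.stderr)
        return False
    finally:
        # Unlike an abort from Jenkins, this path always reaches cleanup.
        cleanup()
```

Note that `subprocess.run()` only kills the direct child on timeout; if the suite spawns its own long-lived processes (e.g. VMs), the cleanup callback would still have to take care of those.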


On Tue, Mar 20, 2018 at 11:03 AM, Yedidyah Bar David <didi@redhat.com> wrote:
On Tue, Mar 20, 2018 at 10:57 AM, Barak Korren <bkorren@redhat.com> wrote:
> On 20 March 2018 at 10:53, Yedidyah Bar David <didi@redhat.com> wrote:
>> On Tue, Mar 20, 2018 at 10:11 AM, Barak Korren <bkorren@redhat.com> wrote:
>>> On 20 March 2018 at 09:17, Yedidyah Bar David <didi@redhat.com> wrote:
>>>> On Mon, Mar 19, 2018 at 6:56 PM, Dominik Holler <dholler@redhat.com> wrote:
>>>>> Thanks Gal, I expect the problem to stay fixed until something eats
>>>>> up all the space in /dev/shm again.
>>>>> But the usage of /dev/shm is logged in the output, so next time we
>>>>> would be able to detect the problem instantly.
>>>>>
>>>>> From my point of view it would be good to know why /dev/shm was full,
>>>>> to prevent this situation in the future.
>>>>
>>>> Gal already wrote below - it was because some build failed to clean up
>>>> after itself.
>>>>
>>>> I don't know about this specific case, but I was told that I am
>>>> personally causing such issues by using the 'cancel' button, so I
>>>> sadly stopped. Sadly, because our CI system is quite loaded and when I
>>>> know that some build is useless, I wish to kill it and save some
>>>> load...
>>>>
>>>> Back to your point, perhaps we should make jobs check /dev/shm when
>>>> they _start_, and either alert/fail/whatever if it's not almost free,
>>>> or, if we know what we are doing, just remove stuff there? That might
>>>> be much easier than fixing things to clean up in end, and/or debugging
>>>> why this cleaning failed.
>>>
>>> Sure thing, patches to:
>>>
>>>     [jenkins repo]/jobs/confs/shell-scripts/cleanup_slave.sh
>>>
>>> are welcome; we often find interesting stuff to add there...
>>>
>>> If constrained for time, please turn this comment into an orderly RFE in Jira...
>>
>> Searched for '/dev/shm' and found way too many places to analyze them
>> all and add something to cleanup_slave.sh that covers them all.
>
> Where did you search?

ovirt-system-tests, lago, lago-ost-plugin.
ovirt-system-tests has 83 occurrences. I realize almost all are in
lago guests, but looking still takes time...

In theory I can patch cleanup_slave.sh as you suggested, removing
_everything_ there.
Not sure this is safe.
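A less drastic alternative to removing everything could be to only check /dev/shm when a job starts and fail early if it is not almost free. A minimal sketch in Python, assuming a hypothetical 10% threshold; `shm_used_fraction`, `check_shm` and `DEFAULT_MAX_USED_FRACTION` are made-up names for illustration:

```python
import shutil
import sys

# Hypothetical threshold: treat /dev/shm as "almost free" if under 10% is used.
DEFAULT_MAX_USED_FRACTION = 0.10


def shm_used_fraction(path: str = "/dev/shm") -> float:
    """Return the fraction of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0  # nothing measurable, treat as empty
    return usage.used / usage.total


def check_shm(path: str = "/dev/shm",
              max_used_fraction: float = DEFAULT_MAX_USED_FRACTION) -> bool:
    """Return True if `path` looks free enough to start a suite from it."""
    used = shm_used_fraction(path)
    if used > max_used_fraction:
        # Alert instead of deleting anything, since we don't know who owns it.
        print(f"{path} is {used:.0%} full; stale environments from an "
              f"earlier job?", file=sys.stderr)
        return False
    return True
```

A job could call `check_shm()` before starting the suite and exit non-zero when it returns False, which surfaces the problem without guessing which files in /dev/shm are safe to remove.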

>
>>
>> Pushed this for now:
>>
>> https://gerrit.ovirt.org/89215
>>
>>>
>>> --
>>> Barak Korren
>>> RHV DevOps team , RHCE, RHCi
>>> Red Hat EMEA
>>> redhat.com | TRIED. TESTED. TRUSTED. | redhat.com/trusted
>>
>>
>>
>> --
>> Didi
>
>
>
> --
> Barak Korren
> RHV DevOps team , RHCE, RHCi
> Red Hat EMEA
> redhat.com | TRIED. TESTED. TRUSTED. | redhat.com/trusted



--
Didi
_______________________________________________
Infra mailing list
Infra@ovirt.org
http://lists.ovirt.org/mailman/listinfo/infra



--
GAL BEN HAIM
RHV DEVOPS