OST Network suite is failing on "OSError: [Errno 28] No space left on device"

Yedidyah Bar David didi at redhat.com
Tue Mar 20 09:27:48 UTC 2018


On Tue, Mar 20, 2018 at 11:24 AM, Gal Ben Haim <gbenhaim at redhat.com> wrote:
> The failure happened again on "ovirt-srv04".
> The suite wasn't run from "/dev/shm" since it was full of stale lago
> environments of "hc-basic-suite-4.1" and "he-basic-iscsi-suite-4.2".
> The reason for the stale envs is a timeout raised by Jenkins (the
> suites were stuck for 6 hours), so OST's cleanup was never called.
> I'm going to add an internal timeout to OST.
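> As a rough illustration only (run_suite.sh, $SUITE_NAME and the cleanup
> step below are stand-ins, not necessarily the actual change), the idea is
> to cap the run below Jenkins' 6-hour limit so OST can still clean up:
>
>     # Kill the suite well before Jenkins' 6-hour timeout, so that a stuck
>     # run does not leave a stale lago environment behind in /dev/shm.
>     if ! timeout 5h ./run_suite.sh "$SUITE_NAME"; then
>         echo "Suite failed or timed out, removing its lago environment"
>         rm -rf "$LAGO_ENV_DIR"   # stand-in for the real cleanup step
>     fi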
>
>
> On Tue, Mar 20, 2018 at 11:03 AM, Yedidyah Bar David <didi at redhat.com>
> wrote:
>>
>> On Tue, Mar 20, 2018 at 10:57 AM, Barak Korren <bkorren at redhat.com> wrote:
>> > On 20 March 2018 at 10:53, Yedidyah Bar David <didi at redhat.com> wrote:
>> >> On Tue, Mar 20, 2018 at 10:11 AM, Barak Korren <bkorren at redhat.com>
>> >> wrote:
>> >>> On 20 March 2018 at 09:17, Yedidyah Bar David <didi at redhat.com> wrote:
>> >>>> On Mon, Mar 19, 2018 at 6:56 PM, Dominik Holler <dholler at redhat.com>
>> >>>> wrote:
>> >>>>> Thanks Gal, I expect the problem is fixed, until something eats
>> >>>>> all the space in /dev/shm again.
>> >>>>> But the usage of /dev/shm is logged in the output, so next time we
>> >>>>> will be able to detect the problem instantly.
>> >>>>>
>> >>>>> From my point of view it would be good to know why /dev/shm was
>> >>>>> full, to prevent this situation in the future.
>> >>>>
>> >>>> Gal already wrote below - it was because some build failed to clean
>> >>>> up after itself.
>> >>>>
>> >>>> I don't know about this specific case, but I was told that I am
>> >>>> personally causing such issues by using the 'cancel' button, so I
>> >>>> sadly stopped. Sadly, because our CI system is quite loaded, and when
>> >>>> I know that some build is useless, I wish to kill it and save some
>> >>>> load...
>> >>>>
>> >>>> Back to your point, perhaps we should make jobs check /dev/shm when
>> >>>> they _start_, and either alert/fail/whatever if it's not almost free,
>> >>>> or, if we know what we are doing, just remove stuff there? That might
>> >>>> be much easier than fixing things to clean up at the end, and/or
>> >>>> debugging why this cleanup failed.
>> >>>
>> >>> Sure thing, patches to:
>> >>>
>> >>>     [jenkins repo]/jobs/confs/shell-scripts/cleanup_slave.sh
>> >>>
>> >>> Are welcome, we often find interesting stuff to add there...
>> >>>
>> >>> If constrained for time, please turn this comment into an orderly RFE
>> >>> in Jira...
>> >>
>> >> Searched for '/dev/shm' and found way too many places to analyze them
>> >> all and add something to cleanup_slave.sh to cover them all.
>> >
>> > Where did you search?
>>
>> ovirt-system-tests, lago, lago-ost-plugin.
>> ovirt-system-tests has 83 occurrences. I realize almost all are in
>> lago guests, but going through them still takes time...
>>
>> In theory I can patch cleanup_slave.sh as you suggested, removing
>> _everything_ there.
>> I'm not sure this is safe.

Well, pushed now:

https://gerrit.ovirt.org/89225
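
For context, the kind of startup check discussed above could look roughly
like this (a sketch only; the threshold, and whether to fail or to remove
stale lago environments, are assumptions, and the actual patch may differ):

    #!/bin/bash
    # Bail out (or clean up) early if /dev/shm is almost full when the job
    # starts, instead of failing later with ENOSPC mid-suite.
    min_free_pct=20
    used_pct=$(df --output=pcent /dev/shm | tail -n1 | tr -dc '0-9')
    free_pct=$((100 - used_pct))
    if [ "$free_pct" -lt "$min_free_pct" ]; then
        echo "WARNING: /dev/shm is ${used_pct}% used, only ${free_pct}% free"
        # Either fail fast here, or, if we know what we are doing, remove
        # leftover environments instead, e.g.: rm -rf /dev/shm/*
        exit 1
    fi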

>>
>> >
>> >>
>> >> Pushed this for now:
>> >>
>> >> https://gerrit.ovirt.org/89215
>> >>
>> >>>
>> >>> --
>> >>> Barak Korren
>> >>> RHV DevOps team , RHCE, RHCi
>> >>> Red Hat EMEA
>> >>> redhat.com | TRIED. TESTED. TRUSTED. | redhat.com/trusted
>> >>
>> >>
>> >>
>> >> --
>> >> Didi
>> >
>> >
>> >
>> > --
>> > Barak Korren
>> > RHV DevOps team , RHCE, RHCi
>> > Red Hat EMEA
>> > redhat.com | TRIED. TESTED. TRUSTED. | redhat.com/trusted
>>
>>
>>
>> --
>> Didi
>> _______________________________________________
>> Infra mailing list
>> Infra at ovirt.org
>> http://lists.ovirt.org/mailman/listinfo/infra
>
>
>
>
> --
> GAL bEN HAIM
> RHV DEVOPS



-- 
Didi

