On Tue, Mar 20, 2018 at 11:24 AM, Gal Ben Haim <gbenhaim(a)redhat.com> wrote:
The failure happened again on "ovirt-srv04".
The suite wasn't run from "/dev/shm" since it was full of stale lago
environments from "hc-basic-suite-4.1" and "he-basic-iscsi-suite-4.2".
The stale envs are there because Jenkins raised a timeout (the suites
were stuck for 6 hours), so OST's cleanup was never called.
I'm going to add an internal timeout to OST.
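
Roughly something like this (just a sketch - it assumes the
run_suite.sh entry point, and the 4h limit and the cleanup hook name
are made up):

    # Give the suite a hard limit shorter than Jenkins' 6-hour timeout,
    # so OST's own cleanup still gets a chance to run.
    timeout 4h ./run_suite.sh "$SUITE"
    rc=$?
    cleanup_stale_envs   # placeholder for the existing cleanup step
    exit $rc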
On Tue, Mar 20, 2018 at 11:03 AM, Yedidyah Bar David <didi(a)redhat.com>
wrote:
>
> On Tue, Mar 20, 2018 at 10:57 AM, Barak Korren <bkorren(a)redhat.com> wrote:
> > On 20 March 2018 at 10:53, Yedidyah Bar David <didi(a)redhat.com> wrote:
> >> On Tue, Mar 20, 2018 at 10:11 AM, Barak Korren <bkorren(a)redhat.com>
> >> wrote:
> >>> On 20 March 2018 at 09:17, Yedidyah Bar David <didi(a)redhat.com> wrote:
> >>>> On Mon, Mar 19, 2018 at 6:56 PM, Dominik Holler <dholler(a)redhat.com> wrote:
> >>>>> Thanks Gal, I expect the problem is fixed until something eats
> >>>>> all the space in /dev/shm again.
> >>>>> But the usage of /dev/shm is logged in the output, so we would
> >>>>> be able to detect the problem instantly next time.
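> >>>>>
> >>>>> (E.g. a single line near the start of the job output is enough:
> >>>>>
> >>>>>     df -h /dev/shm
> >>>>>
> >>>>> so the next failure should be obvious at a glance.)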
> >>>>>
> >>>>> From my point of view it would be good to know why /dev/shm was
> >>>>> full, to prevent this situation in the future.
> >>>>
> >>>> Gal already wrote below - it was because some build failed to
> >>>> clean up after itself.
> >>>>
> >>>> I don't know about this specific case, but I was told that I am
> >>>> personally causing such issues by using the 'cancel' button, so I
> >>>> sadly stopped. Sadly, because our CI system is quite loaded, and
> >>>> when I know that some build is useless, I wish to kill it and
> >>>> save some load...
> >>>>
> >>>> Back to your point, perhaps we should make jobs check /dev/shm
> >>>> when they _start_, and either alert/fail/whatever if it's not
> >>>> almost free, or, if we know what we are doing, just remove stuff
> >>>> there? That might be much easier than fixing things to clean up
> >>>> at the end, and/or debugging why this cleaning failed.
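> >>>>
> >>>> Something like this, maybe (untested sketch, the threshold is
> >>>> arbitrary):
> >>>>
> >>>>     # fail early if /dev/shm is already mostly used at job start
> >>>>     used=$(df --output=pcent /dev/shm | tail -1 | tr -dc '0-9')
> >>>>     if [ "$used" -gt 10 ]; then
> >>>>         echo "/dev/shm is ${used}% full - stale envs?" >&2
> >>>>         df -h /dev/shm
> >>>>         exit 1
> >>>>     fi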
> >>>
> >>> Sure thing, patches to:
> >>>
> >>> [jenkins repo]/jobs/confs/shell-scripts/cleanup_slave.sh
> >>>
> >>> are welcome; we often find interesting stuff to add there...
> >>>
> >>> If you're pressed for time, please turn this comment into an
> >>> orderly RFE in Jira...
> >>
> >> I searched for '/dev/shm' and found way too many places to analyze
> >> them all and add something to cleanup_slave.sh to cover everything.
> >
> > Where did you search?
>
> ovirt-system-tests, lago, lago-ost-plugin.
> ovirt-system-tests has 83 occurrences. I realize almost all are in
> lago guests, but looking still takes time...
>
> In theory I can patch cleanup_slave.sh as you suggested, removing
> _everything_ there.
> Not sure this is safe.
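>
> A safer middle ground might be to only drop old entries, e.g.
> (untested, the 12-hour threshold is a guess):
>
>     # remove top-level /dev/shm entries untouched for 12+ hours;
>     # assumes running jobs keep touching their own files
>     find /dev/shm -mindepth 1 -maxdepth 1 -mmin +720 \
>         -exec rm -rf {} + || true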
Well, pushed now:
https://gerrit.ovirt.org/89225
>
> >
> >>
> >> Pushed this for now:
> >>
> >> https://gerrit.ovirt.org/89215
> >>
> >>>
> >>> --
> >>> Barak Korren
> >>> RHV DevOps team, RHCE, RHCi
> >>> Red Hat EMEA
> >>>
> >>> redhat.com | TRIED. TESTED. TRUSTED. | redhat.com/trusted
> >>
> >>
> >>
> >> --
> >> Didi
> >
> >
> >
> > --
> > Barak Korren
> > RHV DevOps team, RHCE, RHCi
> > Red Hat EMEA
> >
> > redhat.com | TRIED. TESTED. TRUSTED. | redhat.com/trusted
>
>
>
> --
> Didi
> _______________________________________________
> Infra mailing list
> Infra(a)ovirt.org
> http://lists.ovirt.org/mailman/listinfo/infra
--
GAL bEN HAIM
RHV DEVOPS
--
Didi