On Mon, Mar 19, 2018 at 6:56 PM, Dominik Holler <dholler(a)redhat.com> wrote:
> Thanks Gal, I expect the problem is fixed until something eats
> all space in /dev/shm again.
> But the usage of /dev/shm is logged in the output, so we will be able
> to detect the problem instantly next time.
> From my point of view it would be good to know why /dev/shm was full,
> to prevent this situation in the future.
Gal already wrote below - it was because some build failed to clean up
after itself.
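For completeness, the reason a full /dev/shm shows up as that semaphore
error: python's multiprocessing backs every Lock()/Queue() with a POSIX
semaphore, and POSIX semaphores live on the tmpfs mounted at /dev/shm,
so the very first lock creation fails once that tmpfs is full. A rough
sketch of my own to illustrate the failure mode (not code from any
suite):

    # Illustration only: Lock() goes through _multiprocessing.SemLock,
    # i.e. sem_open(), which needs space under /dev/shm. With /dev/shm
    # full it fails exactly like the traceback quoted below.
    import multiprocessing

    try:
        lock = multiprocessing.Lock()
    except OSError as err:
        print(err)  # on a full /dev/shm: [Errno 28] No space left on device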
I don't know about this specific case, but I was told that I am
personally causing such issues by using the 'cancel' button, so I
sadly stopped. Sadly, because our CI system is quite loaded and when I
know that some build is useless, I wish to kill it and save some
load...
Back to your point, perhaps we should make jobs check /dev/shm when
they _start_, and either alert/fail if it is not nearly empty, or, if
we know what we are doing, just remove whatever is left there
(something like the sketch below)? That might be much easier than
fixing every job to clean up at the end, and/or debugging why that
cleanup failed.
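A rough sketch of such a start-of-job check (my own illustration; the
90% threshold and the warn-vs-fail decision are arbitrary):

    # Start-of-job sanity check for /dev/shm (illustrative sketch only).
    import os
    import sys

    MOUNT = '/dev/shm'
    MAX_USED_FRACTION = 0.9  # arbitrary threshold

    st = os.statvfs(MOUNT)
    used = 1.0 - float(st.f_bavail) / st.f_blocks

    if used > MAX_USED_FRACTION:
        # Log what occupies the tmpfs so we can tell whose leftovers they are.
        print('%s is %.0f%% used: %s' % (MOUNT, used * 100, os.listdir(MOUNT)))
        # Either fail the job here so somebody looks at the slave, or,
        # if we know what we are doing, remove the leftovers instead.
        sys.exit(1)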
On Mon, 19 Mar 2018 18:44:54 +0200, Gal Ben Haim <gbenhaim(a)redhat.com> wrote:
> I see that this failure happens a lot on "ovirt-srv19.phx.ovirt.org
> <http://jenkins.ovirt.org/computer/ovirt-srv19.phx.ovirt.org>", and by
> different projects that use ansible.
> Not sure it relates, but I've found (and removed) a stale lago
> environment in "/dev/shm" that was created by
> ovirt-system-tests_he-basic-iscsi-suite-master
> <http://jenkins.ovirt.org/view/oVirt%20system%20tests/job/ovirt-system-tes...>.
> The stale environment caused the suite to not run in "/dev/shm".
> The maximum number of semaphore arrays on both ovirt-srv19.phx.ovirt.org
> <http://jenkins.ovirt.org/computer/ovirt-srv19.phx.ovirt.org> and
> ovirt-srv23.phx.ovirt.org
> <http://jenkins.ovirt.org/computer/ovirt-srv19.phx.ovirt.org> (which
> run the ansible suite successfully) is 128.
>
> On Mon, Mar 19, 2018 at 3:37 PM, Yedidyah Bar David <didi(a)redhat.com>
> wrote:
>
> > Failed also here:
> >
> > http://jenkins.ovirt.org/job/ovirt-system-tests_master_check-patch-el7-x86_64/4540/
> >
> > The patch triggering this affects many suites, and the job failed
> > during ansible-suite-master.
> >
> > On Mon, Mar 19, 2018 at 3:10 PM, Eyal Edri <eedri(a)redhat.com> wrote:
> >
> >> Gal and Daniel are looking into it; strange that it's not affecting
> >> all suites.
> >>
> >> On Mon, Mar 19, 2018 at 2:11 PM, Dominik Holler
> >> <dholler(a)redhat.com> wrote:
> >>
> >>> Looks like /dev/shm has run out of space.
> >>>
> >>> On Mon, 19 Mar 2018 13:33:28 +0200
> >>> Leon Goldberg <lgoldber(a)redhat.com> wrote:
> >>>
> >>> > Hey, any updates?
> >>> >
> >>> > On Sun, Mar 18, 2018 at 10:44 AM, Edward Haas <ehaas(a)redhat.com>
> >>> > wrote:
> >>> >
> >>> > > We are doing nothing special there, just executing ansible
> >>> > > through its API.
> >>> > >
> >>> > > On Sun, Mar 18, 2018 at 10:42 AM, Daniel Belenky
> >>> > > <dbelenky(a)redhat.com> wrote:
> >>> > >
> >>> > >> It's not a space issue. Other suites ran successfully on that
> >>> > >> slave after your suite.
> >>> > >> I think that the problem is the setting for max semaphores,
> >>> > >> though I don't know what you're doing to reach that limit.
> >>> > >>
> >>> > >> [dbelenky@ovirt-srv18 ~]$ ipcs -ls
> >>> > >>
> >>> > >> ------ Semaphore Limits --------
> >>> > >> max number of arrays = 128
> >>> > >> max semaphores per array = 250
> >>> > >> max semaphores system wide = 32000
> >>> > >> max ops per semop call = 32
> >>> > >> semaphore max value = 32767
> >>> > >>
> >>> > >>
> >>> > >> On Sun, Mar 18, 2018 at 10:31 AM, Edward Haas
> >>> > >> <ehaas(a)redhat.com> wrote:
> >>> > >>> http://jenkins.ovirt.org/job/ovirt-system-tests_network-suite-master/
> >>> > >>>
> >>> > >>> On Sun, Mar 18, 2018 at 10:24 AM, Daniel Belenky
> >>> > >>> <dbelenky(a)redhat.com> wrote:
> >>> > >>>
> >>> > >>>> Hi Edi,
> >>> > >>>>
> >>> > >>>> Are there any logs? Where are you running the suite? May I
> >>> > >>>> have a link?
> >>> > >>>>
> >>> > >>>> On Sun, Mar 18, 2018 at 8:20 AM, Edward Haas
> >>> > >>>> <ehaas(a)redhat.com> wrote:
> >>> > >>>>> Good morning,
> >>> > >>>>>
> >>> > >>>>> In the OST network suite we are running a test module with
> >>> > >>>>> Ansible, and over the weekend it started failing with
> >>> > >>>>> "OSError: [Errno 28] No space left on device" when
> >>> > >>>>> attempting to take a lock in the multiprocessing python
> >>> > >>>>> module.
> >>> > >>>>>
> >>> > >>>>> It smells like a slave resource problem; could someone
> >>> > >>>>> help investigate this?
> >>> > >>>>>
> >>> > >>>>> Thanks,
> >>> > >>>>> Edy.
> >>> > >>>>>
> >>> > >>>>> =================================== FAILURES ===================================
> >>> > >>>>> ______________________ test_ovn_provider_create_scenario _______________________
> >>> > >>>>>
> >>> > >>>>> os_client_config = None
> >>> > >>>>>
> >>> > >>>>>     def test_ovn_provider_create_scenario(os_client_config):
> >>> > >>>>> >       _test_ovn_provider('create_scenario.yml')
> >>> > >>>>>
> >>> > >>>>> network-suite-master/tests/test_ovn_provider.py:68:
> >>> > >>>>> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> >>> > >>>>> network-suite-master/tests/test_ovn_provider.py:78: in _test_ovn_provider
> >>> > >>>>>     playbook.run()
> >>> > >>>>> network-suite-master/lib/ansiblelib.py:127: in run
> >>> > >>>>>     self._run_playbook_executor()
> >>> > >>>>> network-suite-master/lib/ansiblelib.py:138: in _run_playbook_executor
> >>> > >>>>>     pbex = PlaybookExecutor(**self._pbex_args)
> >>> > >>>>> /usr/lib/python2.7/site-packages/ansible/executor/playbook_executor.py:60: in __init__
> >>> > >>>>>     self._tqm = TaskQueueManager(inventory=inventory,
> >>> > >>>>>                                  variable_manager=variable_manager, loader=loader,
> >>> > >>>>>                                  options=options, passwords=self.passwords)
> >>> > >>>>> /usr/lib/python2.7/site-packages/ansible/executor/task_queue_manager.py:104: in __init__
> >>> > >>>>>     self._final_q = multiprocessing.Queue()
> >>> > >>>>> /usr/lib64/python2.7/multiprocessing/__init__.py:218: in Queue
> >>> > >>>>>     return Queue(maxsize)
> >>> > >>>>> /usr/lib64/python2.7/multiprocessing/queues.py:63: in __init__
> >>> > >>>>>     self._rlock = Lock()
> >>> > >>>>> /usr/lib64/python2.7/multiprocessing/synchronize.py:147: in __init__
> >>> > >>>>>     SemLock.__init__(self, SEMAPHORE, 1, 1)
> >>> > >>>>> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> >>> > >>>>>
> >>> > >>>>> self = <Lock(owner=unknown)>, kind = 1, value = 1, maxvalue = 1
> >>> > >>>>>
> >>> > >>>>>     def __init__(self, kind, value, maxvalue):
> >>> > >>>>> >       sl = self._semlock = _multiprocessing.SemLock(kind, value, maxvalue)
> >>> > >>>>> E       OSError: [Errno 28] No space left on device
> >>> > >>>>>
> >>> > >>>>> /usr/lib64/python2.7/multiprocessing/synchronize.py:75: OSError
> >>> > >>>>>
> >>> > >>>>>
> >>> > >>>>
> >>> > >>>>
> >>> > >>>> --
> >>> > >>>>
> >>> > >>>> DANIEL BELENKY
> >>> > >>>>
> >>> > >>>> RHV DEVOPS
> >>> > >>>>
> >>> > >>>
> >>> > >>>
> >>> > >>
> >>> > >>
> >>> > >> --
> >>> > >>
> >>> > >> DANIEL BELENKY
> >>> > >>
> >>> > >> RHV DEVOPS
> >>> > >>
> >>> > >
> >>> > >
> >>>
> >>>
> >>
> >>
> >>
> >> --
> >>
> >> Eyal edri
> >>
> >>
> >> MANAGER
> >>
> >> RHV DevOps
> >>
> >> EMEA VIRTUALIZATION R&D
> >>
> >>
> >> Red Hat EMEA <https://www.redhat.com/>
> >> <https://red.ht/sig> TRIED. TESTED. TRUSTED. <https://redhat.com/trusted>
> >> phone: +972-9-7692018 <+972%209-769-2018>
> >> irc: eedri (on #tlv #rhev-dev #rhev-integ)
> >>
> >>
> >>
> >
> >
> > --
> > Didi
> >
>
>
>