OST Network suite is failing on "OSError: [Errno 28] No space left on device"

Yedidyah Bar David didi at redhat.com
Tue Mar 20 07:17:13 UTC 2018


On Mon, Mar 19, 2018 at 6:56 PM, Dominik Holler <dholler at redhat.com> wrote:
> Thanks Gal, I expect the problem is fixed, at least until something eats
> all the space in /dev/shm again.
> But the usage of /dev/shm is logged in the output, so next time we would
> be able to detect the problem instantly.
>
> From my point of view it would be good to know why /dev/shm was full,
> to prevent this situation in the future.

Gal already wrote below - it was because some build failed to clean up
after itself.

I don't know about this specific case, but I was told that I am
personally causing such issues by using the 'cancel' button, so I
sadly stopped. Sadly, because our CI system is quite loaded and when I
know that some build is useless, I wish to kill it and save some
load...

Back to your point, perhaps we should make jobs check /dev/shm when
they _start_, and either alert/fail/whatever if it's not almost free,
or, if we know what we are doing, just remove stuff there? Something
like the sketch below. That might be much easier than fixing things to
clean up at the end, and/or debugging why that cleanup failed.
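
Just to illustrate the idea (a rough sketch in python; the threshold, and
whether to fail or clean up, are made up and would have to be agreed on -
this is not something we have in the jobs today):

import os
import sys

SHM = '/dev/shm'
MIN_FREE_RATIO = 0.9  # "almost free": require at least 90% of /dev/shm free

def shm_free_ratio(path=SHM):
    st = os.statvfs(path)
    return float(st.f_bavail) / st.f_blocks

if shm_free_ratio() < MIN_FREE_RATIO:
    # Alternatively, if we know what we are doing, remove leftovers here
    # (e.g. stale lago environments) instead of failing the job.
    sys.stderr.write('%s is not almost free, refusing to start the job\n' % SHM)
    sys.exit(1)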

>
>
> On Mon, 19 Mar 2018 18:44:54 +0200, Gal Ben Haim <gbenhaim at redhat.com> wrote:
>
>> I see that this failure happens a lot on "ovirt-srv19.phx.ovirt.org
>> <http://jenkins.ovirt.org/computer/ovirt-srv19.phx.ovirt.org>", and in
>> different projects that use ansible.
>> Not sure it's related, but I've found (and removed) a stale lago
>> environment in "/dev/shm" that was created by
>> ovirt-system-tests_he-basic-iscsi-suite-master
>> <http://jenkins.ovirt.org/view/oVirt%20system%20tests/job/ovirt-system-tests_he-basic-iscsi-suite-master/>.
>> The stale environment prevented the suite from running in "/dev/shm".
>> The maximum number of semaphore arrays on both ovirt-srv19.phx.ovirt.org
>> <http://jenkins.ovirt.org/computer/ovirt-srv19.phx.ovirt.org> and
>> ovirt-srv23.phx.ovirt.org
>> <http://jenkins.ovirt.org/computer/ovirt-srv23.phx.ovirt.org> (which
>> run the ansible suite successfully) is 128.
>>
>> On Mon, Mar 19, 2018 at 3:37 PM, Yedidyah Bar David <didi at redhat.com>
>> wrote:
>>
>> > It also failed here:
>> >
>> > http://jenkins.ovirt.org/job/ovirt-system-tests_master_check-patch-el7-x86_64/4540/
>> >
>> > The patch triggering this affects many suites, and the job failed
>> > during ansible-suite-master.
>> >
>> > On Mon, Mar 19, 2018 at 3:10 PM, Eyal Edri <eedri at redhat.com> wrote:
>> >
>> >> Gal and Daniel are looking into it; strange that it's not affecting all
>> >> suites.
>> >>
>> >> On Mon, Mar 19, 2018 at 2:11 PM, Dominik Holler
>> >> <dholler at redhat.com> wrote:
>> >>
>> >>> Looks like /dev/shm has run out of space.
>> >>>
>> >>> On Mon, 19 Mar 2018 13:33:28 +0200
>> >>> Leon Goldberg <lgoldber at redhat.com> wrote:
>> >>>
>> >>> > Hey, any updates?
>> >>> >
>> >>> > On Sun, Mar 18, 2018 at 10:44 AM, Edward Haas <ehaas at redhat.com>
>> >>> > wrote:
>> >>> >
>> >>> > > We are doing nothing special there, just executing Ansible
>> >>> > > through its API.
>> >>> > >
>> >>> > > On Sun, Mar 18, 2018 at 10:42 AM, Daniel Belenky
>> >>> > > <dbelenky at redhat.com> wrote:
>> >>> > >
>> >>> > >> It's not a space issue. Other suites ran successfully on that
>> >>> > >> slave after yours.
>> >>> > >> I think that the problem is the setting for max semaphores,
>> >>> > >> though I don't know what you're doing to reach that limit.
>> >>> > >>
>> >>> > >> [dbelenky at ovirt-srv18 ~]$ ipcs -ls
>> >>> > >>
>> >>> > >> ------ Semaphore Limits --------
>> >>> > >> max number of arrays = 128
>> >>> > >> max semaphores per array = 250
>> >>> > >> max semaphores system wide = 32000
>> >>> > >> max ops per semop call = 32
>> >>> > >> semaphore max value = 32767
>> >>> > >>
>> >>> > >>
>> >>> > >> On Sun, Mar 18, 2018 at 10:31 AM, Edward Haas
>> >>> > >> <ehaas at redhat.com> wrote:
>> >>> > >>> http://jenkins.ovirt.org/job/ovirt-system-tests_network-suite-master/
>> >>> > >>>
>> >>> > >>> On Sun, Mar 18, 2018 at 10:24 AM, Daniel Belenky
>> >>> > >>> <dbelenky at redhat.com> wrote:
>> >>> > >>>
>> >>> > >>>> Hi Edi,
>> >>> > >>>>
>> >>> > >>>> Are there any logs? Where are you running the suite? May I
>> >>> > >>>> have a link?
>> >>> > >>>>
>> >>> > >>>> On Sun, Mar 18, 2018 at 8:20 AM, Edward Haas
>> >>> > >>>> <ehaas at redhat.com> wrote:
>> >>> > >>>>> Good morning,
>> >>> > >>>>>
>> >>> > >>>>> In the OST network suite we are running a test module with
>> >>> > >>>>> Ansible, and over the weekend it started failing with
>> >>> > >>>>> "OSError: [Errno 28] No space left on device" when
>> >>> > >>>>> attempting to take a lock in the Python multiprocessing
>> >>> > >>>>> module.
>> >>> > >>>>>
>> >>> > >>>>> It smells like a slave resource problem; could someone
>> >>> > >>>>> help investigate?
>> >>> > >>>>>
>> >>> > >>>>> Thanks,
>> >>> > >>>>> Edy.
>> >>> > >>>>>
>> >>> > >>>>> =================================== FAILURES ===================================
>> >>> > >>>>> ______________________ test_ovn_provider_create_scenario _______________________
>> >>> > >>>>>
>> >>> > >>>>> os_client_config = None
>> >>> > >>>>>
>> >>> > >>>>>     def test_ovn_provider_create_scenario(os_client_config):
>> >>> > >>>>> >       _test_ovn_provider('create_scenario.yml')
>> >>> > >>>>>
>> >>> > >>>>> network-suite-master/tests/test_ovn_provider.py:68:
>> >>> > >>>>> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
>> >>> > >>>>> network-suite-master/tests/test_ovn_provider.py:78: in _test_ovn_provider
>> >>> > >>>>>     playbook.run()
>> >>> > >>>>> network-suite-master/lib/ansiblelib.py:127: in run
>> >>> > >>>>>     self._run_playbook_executor()
>> >>> > >>>>> network-suite-master/lib/ansiblelib.py:138: in _run_playbook_executor
>> >>> > >>>>>     pbex = PlaybookExecutor(**self._pbex_args)
>> >>> > >>>>> /usr/lib/python2.7/site-packages/ansible/executor/playbook_executor.py:60: in __init__
>> >>> > >>>>>     self._tqm = TaskQueueManager(inventory=inventory, variable_manager=variable_manager, loader=loader, options=options, passwords=self.passwords)
>> >>> > >>>>> /usr/lib/python2.7/site-packages/ansible/executor/task_queue_manager.py:104: in __init__
>> >>> > >>>>>     self._final_q = multiprocessing.Queue()
>> >>> > >>>>> /usr/lib64/python2.7/multiprocessing/__init__.py:218: in Queue
>> >>> > >>>>>     return Queue(maxsize)
>> >>> > >>>>> /usr/lib64/python2.7/multiprocessing/queues.py:63: in __init__
>> >>> > >>>>>     self._rlock = Lock()
>> >>> > >>>>> /usr/lib64/python2.7/multiprocessing/synchronize.py:147: in __init__
>> >>> > >>>>>     SemLock.__init__(self, SEMAPHORE, 1, 1)
>> >>> > >>>>> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
>> >>> > >>>>>
>> >>> > >>>>> self = <Lock(owner=unknown)>, kind = 1, value = 1, maxvalue = 1
>> >>> > >>>>>
>> >>> > >>>>>     def __init__(self, kind, value, maxvalue):
>> >>> > >>>>> >       sl = self._semlock = _multiprocessing.SemLock(kind, value, maxvalue)
>> >>> > >>>>> E       OSError: [Errno 28] No space left on device
>> >>> > >>>>>
>> >>> > >>>>> /usr/lib64/python2.7/multiprocessing/synchronize.py:75: OSError
>> >>> > >>>>>
>> >>> > >>>>>
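
For the record, my understanding - which should be verified - is that
multiprocessing.Lock() is backed by a POSIX semaphore, and glibc keeps those
on the /dev/shm tmpfs, so once /dev/shm fills up the SemLock constructor
fails with ENOSPC even if the root filesystem has plenty of free space.
A small sketch one could run on a suspect slave to see the correlation;
nothing here is part of the suite itself:

import multiprocessing
import os

def shm_usage(path='/dev/shm'):
    # Return (used, total) bytes for the tmpfs backing /dev/shm.
    st = os.statvfs(path)
    used = (st.f_blocks - st.f_bavail) * st.f_frsize
    total = st.f_blocks * st.f_frsize
    return used, total

print('/dev/shm before Lock(): %d / %d bytes used' % shm_usage())
lock = multiprocessing.Lock()  # the call that raised ENOSPC in the job above
print('/dev/shm after Lock():  %d / %d bytes used' % shm_usage())

If /dev/shm is full, the Lock() call above should fail with the same
"[Errno 28] No space left on device".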
>> >>> > >>>>
>> >>> > >>>>
>> >>> > >>>> --
>> >>> > >>>>
>> >>> > >>>> DANIEL BELENKY
>> >>> > >>>>
>> >>> > >>>> RHV DEVOPS
>> >>> > >>>>
>> >>> > >>>
>> >>> > >>>
>> >>> > >>
>> >>> > >>
>> >>> > >> --
>> >>> > >>
>> >>> > >> DANIEL BELENKY
>> >>> > >>
>> >>> > >> RHV DEVOPS
>> >>> > >>
>> >>> > >
>> >>> > >
>> >>>
>> >>>
>> >>
>> >>
>> >>
>> >> --
>> >>
>> >> Eyal edri
>> >>
>> >>
>> >> MANAGER
>> >>
>> >> RHV DevOps
>> >>
>> >> EMEA VIRTUALIZATION R&D
>> >>
>> >>
>> >> Red Hat EMEA <https://www.redhat.com/>
>> >> <https://red.ht/sig> TRIED. TESTED. TRUSTED.
>> >> <https://redhat.com/trusted> phone: +972-9-7692018
>> >> <+972%209-769-2018> irc: eedri (on #tlv #rhev-dev #rhev-integ)
>> >>
>> >>
>> >>
>> >
>> >
>> > --
>> > Didi
>> >
>>
>>
>>
>



-- 
Didi

