OST Network suite is failing on "OSError: [Errno 28] No space left on device"

Good morning,

We are running a test module with Ansible in the OST network suite, and over the weekend it started failing with "OSError: [Errno 28] No space left on device" when attempting to take a lock in the Python multiprocessing module.

It smells like a slave resource problem; could someone help investigate this?

Thanks,
Edy.

=================================== FAILURES ===================================
______________________ test_ovn_provider_create_scenario _______________________

os_client_config = None

    def test_ovn_provider_create_scenario(os_client_config):
>       _test_ovn_provider('create_scenario.yml')

network-suite-master/tests/test_ovn_provider.py:68:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
network-suite-master/tests/test_ovn_provider.py:78: in _test_ovn_provider
    playbook.run()
network-suite-master/lib/ansiblelib.py:127: in run
    self._run_playbook_executor()
network-suite-master/lib/ansiblelib.py:138: in _run_playbook_executor
    pbex = PlaybookExecutor(**self._pbex_args)
/usr/lib/python2.7/site-packages/ansible/executor/playbook_executor.py:60: in __init__
    self._tqm = TaskQueueManager(inventory=inventory, variable_manager=variable_manager, loader=loader, options=options, passwords=self.passwords)
/usr/lib/python2.7/site-packages/ansible/executor/task_queue_manager.py:104: in __init__
    self._final_q = multiprocessing.Queue()
/usr/lib64/python2.7/multiprocessing/__init__.py:218: in Queue
    return Queue(maxsize)
/usr/lib64/python2.7/multiprocessing/queues.py:63: in __init__
    self._rlock = Lock()
/usr/lib64/python2.7/multiprocessing/synchronize.py:147: in __init__
    SemLock.__init__(self, SEMAPHORE, 1, 1)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <Lock(owner=unknown)>, kind = 1, value = 1, maxvalue = 1

    def __init__(self, kind, value, maxvalue):
>       sl = self._semlock = _multiprocessing.SemLock(kind, value, maxvalue)
E   OSError: [Errno 28] No space left on device

/usr/lib64/python2.7/multiprocessing/synchronize.py:75: OSError

Hi Edi,

Are there any logs? Where are you running the suite? May I have a link?

On Sun, Mar 18, 2018 at 8:20 AM, Edward Haas <ehaas@redhat.com> wrote:
--
DANIEL BELENKY
RHV DEVOPS

http://jenkins.ovirt.org/job/ovirt-system-tests_network-suite-master/

On Sun, Mar 18, 2018 at 10:24 AM, Daniel Belenky <dbelenky@redhat.com> wrote:

It's not a disk space issue; other suites ran successfully on that slave after yours. I think the problem is the limit on the number of semaphores, though I don't know what you are doing to reach that limit (see the note below).

[dbelenky@ovirt-srv18 ~]$ ipcs -ls

------ Semaphore Limits --------
max number of arrays = 128
max semaphores per array = 250
max semaphores system wide = 32000
max ops per semop call = 32
semaphore max value = 32767

On Sun, Mar 18, 2018 at 10:31 AM, Edward Haas <ehaas@redhat.com> wrote:
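Note that the locks Python's multiprocessing module creates are POSIX named semaphores (sem.* files on the /dev/shm tmpfs), not SysV semaphores, so the ipcs limits above may not be the relevant ones. A quick way to count them, just as a sketch and nothing OST-specific:

    import glob

    # Sketch only: multiprocessing.Lock() is backed by a POSIX named semaphore,
    # which glibc keeps as a sem.* file on the /dev/shm tmpfs. Counting those
    # files shows how many such semaphores currently exist on the slave.
    print('%d POSIX semaphores under /dev/shm' % len(glob.glob('/dev/shm/sem.*')))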

We are doing nothing special there, just executing Ansible through its API.

On Sun, Mar 18, 2018 at 10:42 AM, Daniel Belenky <dbelenky@redhat.com> wrote:

Hey, any updates?

On Sun, Mar 18, 2018 at 10:44 AM, Edward Haas <ehaas@redhat.com> wrote:

Looks like /dev/shm has run out of space; that would explain the errno 28 from SemLock (see the snippet below).

On Mon, 19 Mar 2018 13:33:28 +0200, Leon Goldberg <lgoldber@redhat.com> wrote:
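The semaphore behind multiprocessing.Lock() is created on the /dev/shm tmpfs, so a full /dev/shm surfaces as errno 28 even when the root filesystem has plenty of room. A minimal way to check from Python, just a sketch and not part of the suite:

    import multiprocessing
    import os

    # Sketch only: report how full the /dev/shm tmpfs is, then try to create a
    # lock. With /dev/shm exhausted, Lock() fails with
    # OSError: [Errno 28] No space left on device, as in the traceback above.
    stat = os.statvfs('/dev/shm')
    print('/dev/shm free: %d MiB' % (stat.f_bavail * stat.f_frsize // (1024 * 1024)))
    lock = multiprocessing.Lock()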

Gal and Daniel are looking into it; strange that it's not affecting all suites.

On Mon, Mar 19, 2018 at 2:11 PM, Dominik Holler <dholler@redhat.com> wrote:
--
Eyal Edri
MANAGER, RHV DevOps
EMEA Virtualization R&D, Red Hat EMEA
phone: +972-9-7692018 | irc: eedri (on #tlv #rhev-dev #rhev-integ)

Failed also here:

http://jenkins.ovirt.org/job/ovirt-system-tests_master_check-patch-el7-x86_64/4540/

The patch triggering this affects many suites, and the job failed during ansible-suite-master.

On Mon, Mar 19, 2018 at 3:10 PM, Eyal Edri <eedri@redhat.com> wrote:
-- Didi

I see that this failure happens a lot on ovirt-srv19.phx.ovirt.org, and in different projects that use Ansible.

Not sure it relates, but I've found (and removed) a stale lago environment in /dev/shm that was created by ovirt-system-tests_he-basic-iscsi-suite-master
(http://jenkins.ovirt.org/view/oVirt%20system%20tests/job/ovirt-system-tests_he-basic-iscsi-suite-master/).
The stale environment caused the suite to not run in /dev/shm.

The maximum number of semaphore arrays on both ovirt-srv19.phx.ovirt.org and ovirt-srv23.phx.ovirt.org (which runs the ansible suite successfully) is 128.

On Mon, Mar 19, 2018 at 3:37 PM, Yedidyah Bar David <didi@redhat.com> wrote:
--
Gal Ben Haim
RHV DevOps

Thanks Gal. I expect the problem is fixed until something eats all the space in /dev/shm again. But the usage of /dev/shm is logged in the output, so we would be able to detect the problem instantly next time.

From my point of view it would be good to know why /dev/shm was full, to prevent this situation in the future.

On Mon, 19 Mar 2018 18:44:54 +0200, Gal Ben Haim <gbenhaim@redhat.com> wrote:

On Mon, Mar 19, 2018 at 6:56 PM, Dominik Holler <dholler@redhat.com> wrote:
> Thanks Gal. I expect the problem is fixed until something eats all the space in /dev/shm again. But the usage of /dev/shm is logged in the output, so we would be able to detect the problem instantly next time.
> From my point of view it would be good to know why /dev/shm was full, to prevent this situation in the future.
Gal already wrote below - it was because some build failed to clean up after itself.

I don't know about this specific case, but I was told that I am personally causing such issues by using the 'cancel' button, so I sadly stopped. Sadly, because our CI system is quite loaded, and when I know that some build is useless I wish to kill it and save some load...

Back to your point: perhaps we should make jobs check /dev/shm when they _start_, and either alert/fail/whatever if it's not almost free, or, if we know what we are doing, just remove stuff there (see the sketch below)? That might be much easier than fixing things to clean up at the end, and/or debugging why that cleaning failed.
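Something along these lines, purely as a sketch of the idea - the threshold and the "remove anything with lago in its name" part are made up here, not something that exists in the jenkins repo:

    import os
    import shutil
    import sys

    SHM = '/dev/shm'
    MIN_FREE = 512 * 1024 * 1024  # made-up threshold, would need tuning per slave

    stat = os.statvfs(SHM)
    free = stat.f_bavail * stat.f_frsize
    if free < MIN_FREE:
        sys.stderr.write('/dev/shm has only %d bytes free, entries: %s\n'
                         % (free, os.listdir(SHM)))
        # Either fail the job here, or - if we know what we are doing -
        # remove leftover lago prefixes before the suite starts.
        for name in os.listdir(SHM):
            if 'lago' in name:
                shutil.rmtree(os.path.join(SHM, name), ignore_errors=True)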
-- Didi

On 20 March 2018 at 09:17, Yedidyah Bar David <didi@redhat.com> wrote:
> Back to your point: perhaps we should make jobs check /dev/shm when they _start_, and either alert/fail/whatever if it's not almost free, or, if we know what we are doing, just remove stuff there? That might be much easier than fixing things to clean up at the end, and/or debugging why that cleaning failed.
Sure thing, patches to:

[jenkins repo]/jobs/confs/shell-scripts/cleanup_slave.sh

are welcome; we often find interesting stuff to add there... If constrained for time, please turn this comment into an orderly RFE in Jira...

--
Barak Korren
RHV DevOps team, RHCE, RHCi
Red Hat EMEA

On Tue, Mar 20, 2018 at 10:11 AM, Barak Korren <bkorren@redhat.com> wrote:
Searched for '/dev/shm' and found way too many places to analyze them all and add something to cleanup_slave to cover all of them.

Pushed this for now: https://gerrit.ovirt.org/89215
-- Didi

On 20 March 2018 at 10:53, Yedidyah Bar David <didi@redhat.com> wrote:
Where did you search?

On Tue, Mar 20, 2018 at 10:57 AM, Barak Korren <bkorren@redhat.com> wrote:
On 20 March 2018 at 10:53, Yedidyah Bar David <didi@redhat.com> wrote:
On Tue, Mar 20, 2018 at 10:11 AM, Barak Korren <bkorren@redhat.com> wrote:
On 20 March 2018 at 09:17, Yedidyah Bar David <didi@redhat.com> wrote:
On Mon, Mar 19, 2018 at 6:56 PM, Dominik Holler <dholler@redhat.com> wrote:
Thanks Gal, I expect the problem is fixed until something eats all space in /dev/shm. But the usage of /dev/shm is logged in the output, so we would be able to detect the problem next time instantly.
From my point of view it would be good to know why /dev/shm was full, to prevent this situation in future.
Gal already wrote below - it was because some build failed to clean up after itself.
I don't know about this specific case, but I was told that I am personally causing such issues by using the 'cancel' button, so I sadly stopped. Sadly, because our CI system is quite loaded and when I know that some build is useless, I wish to kill it and save some load...
Back to your point, perhaps we should make jobs check /dev/shm when they _start_, and either alert/fail/whatever if it's not almost free, or, if we know what we are doing, just remove stuff there? That might be much easier than fixing things to clean up in end, and/or debugging why this cleaning failed.
Sure thing, patches to:
[jenkins repo]/jobs/confs/shell-scripts/cleanup_slave.sh
Are welcome, we often find interesting stuff to add there...
If constrained for time, please turn this comment into an orderly RFE in Jira...
Searched for '/dev/shm' and found way too many places to analyze them all and add something to cleanup_slave to cover all.
Where did you search?
In ovirt-system-tests, lago, and lago-ost-plugin. ovirt-system-tests alone has 83 occurrences. I realize almost all are in lago guests, but looking through them still takes time...

In theory I can patch cleanup_slave.sh as you suggested, removing _everything_ there. Not sure this is safe.
-- Didi

The failure happened again, on "ovirt-srv04". The suite wasn't run from /dev/shm since it was full of stale lago environments from "hc-basic-suite-4.1" and "he-basic-iscsi-suite-4.2". The reason for the stale envs is a timeout raised by Jenkins (the suites were stuck for 6 hours), so OST's cleanup was never called. I'm going to add an internal timeout to OST; see the sketch below.

On Tue, Mar 20, 2018 at 11:03 AM, Yedidyah Bar David <didi@redhat.com> wrote:
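Roughly the internal-timeout idea, as a sketch only - the run_suite.sh arguments and the cleanup entry point named here are assumptions for illustration, not the actual OST interface, and it relies on Python 3's subprocess timeout:

    import subprocess

    # Sketch only: bound the suite run to less than the Jenkins job timeout,
    # so the cleanup step still executes on a hang instead of Jenkins killing
    # the whole job and leaving stale lago environments in /dev/shm.
    SUITE_TIMEOUT = 5 * 60 * 60  # assumed: 5 hours, below the 6-hour Jenkins limit

    try:
        subprocess.check_call(['./run_suite.sh', 'he-basic-iscsi-suite-4.2'],
                              timeout=SUITE_TIMEOUT)
    finally:
        # Hypothetical cleanup entry point; runs even if the suite timed out.
        subprocess.call(['./cleanup_suite.sh', 'he-basic-iscsi-suite-4.2'])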

On Tue, Mar 20, 2018 at 11:24 AM, Gal Ben Haim <gbenhaim@redhat.com> wrote:
> In theory I can patch cleanup_slave.sh as you suggested, removing _everything_ there. Not sure this is safe.
Well, pushed now: https://gerrit.ovirt.org/89225
-- Didi
participants (8): Barak Korren, Daniel Belenky, Dominik Holler, Edward Haas, Eyal Edri, Gal Ben Haim, Leon Goldberg, Yedidyah Bar David