Three tests passed and two failed on:
004_basic_sanity.vm_run
The error is:
detail: Cannot run VM. There is no host that satisfies current scheduling constraints. See below for details: The host lago-he-basic-ansible-suite-master-host-0 did not satisfy internal filter Memory because its available memory is too low (656 MB) to run the VM.
It looks like there is not enough memory in the test environment, and that may
be causing intermittent errors in the other tests as well.
Is it possible to increase the memory for the CI environment so that we can
ensure the errors are not due to lack of resources?
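In case it helps, the host memory for the suite should be adjustable in its Lago
init file. A minimal sketch of the kind of change I mean, assuming the memory
comes from the suite's LagoInitFile (the file name, key and value here are my
assumptions, not the current settings):

    # LagoInitFile fragment (YAML) - hypothetical values
    domains:
      lago-he-basic-ansible-suite-master-host-0:
        memory: 8192   # MiB; raised so the HE VM and the test VMs can be scheduled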
With Best Regards.
Steven Rosenberg.
On Mon, Jul 16, 2018 at 12:03 PM Andrej Krejcir <akrejcir(a)redhat.com> wrote:
Hi,
The Jenkins job is no longer available, so I will run the test manually and
see why it fails.
Maybe this patch will fix the problem with the test:
https://gerrit.ovirt.org/#/c/92734/
It ensures that all services are stopped before starting them again.
The purpose of the test is to try restarting the HE VM.
When the HE VM is run for the first time, the engine generates an OVF file
describing the HE VM. Then when the VM is started again, the OVF is used.
We wanted to test that the generated OVF is correct and the VM can be
started. The test was intended to cleanly shut down and restart vdsm.
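For reference, the idea behind the patch is roughly the following. This is only
a sketch: host.ssh() stands in for the Lago SSH helper the suite already uses,
and the actual code in the patch may differ:

    SERVICES = ['vdsmd', 'ovirt-ha-broker', 'ovirt-ha-agent']

    def _restart_services(host):
        # Stop every service first, so none of them is restarted while a
        # dependent one is still shutting down...
        host.ssh(['systemctl', 'stop'] + SERVICES)
        # ...and only then start them all again.
        host.ssh(['systemctl', 'start'] + SERVICES)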
Andrej
On Fri, 6 Jul 2018 at 11:38, Sandro Bonazzola <sbonazzo(a)redhat.com> wrote:
> Failing job is:
> https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-iscsi-suite-mas...
> My findings:
>
> 2018-07-06 03:28:05,461::008_restart_he_vm.py::_shutdown_he_vm::123::root::INFO:: * VM is down.
> 2018-07-06 03:28:05,461::008_restart_he_vm.py::_restart_services::127::root::INFO:: * Restarting services...
>
> 2018-07-06 03:28:05,729::log_utils.py::__exit__::611::lago.ssh::DEBUG::end task:f7db7960-b541-4d18-9a73-45b3d0677f03:Get ssh client for lago-he-basic-iscsi-suite-master-host-0:
> 2018-07-06 03:28:06,031::ssh.py::ssh::58::lago.ssh::DEBUG::Running 98d0de9e on lago-he-basic-iscsi-suite-master-host-0: systemctl restart vdsmd ovirt-ha-broker ovirt-ha-agent
> 2018-07-06 03:28:22,887::ssh.py::ssh::81::lago.ssh::DEBUG::Command 98d0de9e on lago-he-basic-iscsi-suite-master-host-0 returned with 0
>
>
> It then waits for the engine to come up again, until it gives up 10 minutes later:
>
> 2018-07-06 03:38:20,979::log_utils.py::__enter__::600::lago.ssh::DEBUG::start task:f44bb656-dfe3-4f57-a6c8-e7e208863054:Get ssh client for lago-he-basic-iscsi-suite-master-host-0:
> 2018-07-06 03:38:21,290::log_utils.py::__exit__::611::lago.ssh::DEBUG::end task:f44bb656-dfe3-4f57-a6c8-e7e208863054:Get ssh client for lago-he-basic-iscsi-suite-master-host-0:
> 2018-07-06 03:38:21,591::ssh.py::ssh::58::lago.ssh::DEBUG::Running 07b7f08a on lago-he-basic-iscsi-suite-master-host-0: hosted-engine --vm-status
> 2018-07-06 03:38:21,637::ssh.py::ssh::81::lago.ssh::DEBUG::Command 07b7f08a on lago-he-basic-iscsi-suite-master-host-0 returned with 1
> 2018-07-06 03:38:21,637::ssh.py::ssh::89::lago.ssh::DEBUG::Command 07b7f08a on lago-he-basic-iscsi-suite-master-host-0 output:
> The hosted engine configuration has not been retrieved from shared storage. Please ensure that ovirt-ha-agent is running and the storage server is reachable.
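> The wait here is essentially a poll of `hosted-engine --vm-status` until it
> returns 0. A minimal sketch of that loop (host.ssh() stands in for the Lago
> SSH helper and is an assumption, not the suite's exact code):
>
>     import time
>
>     def wait_for_engine_vm(host, timeout=600):
>         # Poll the HE status until it reports success or we give up,
>         # matching the ~10 minute wait seen in the log above.
>         deadline = time.time() + timeout
>         while time.time() < deadline:
>             res = host.ssh(['hosted-engine', '--vm-status'])
>             if res.code == 0:
>                 return True
>             time.sleep(10)
>         return False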
>
>
> On host 0, vdsm.log shows it's restarting:
>
> 2018-07-05 23:28:19,646-0400 INFO (MainThread) [vds] Exiting (vdsmd:171)
> 2018-07-05 23:28:23,514-0400 INFO (MainThread) [vds] (PID: 19942) I am the actual vdsm 4.30.0-465.git1ad18aa.el7 lago-he-basic-iscsi-suite-master-host-0 (3.10.0-862.2.3.el7.x86_64) (vdsmd:149)
>
> vdsm then spends the above-mentioned 10 minutes waiting for the storage pool
> to come up:
>
> 2018-07-05 23:38:20,739-0400 INFO (vmrecovery) [vds] recovery: waiting for storage pool to go up (clientIF:704)
>
> while the HA agent tries to get the hosted engine stats:
>
> 2018-07-05 23:38:30,363-0400 WARN (vdsm.Scheduler) [Executor] Worker blocked: <Worker name=jsonrpc/7 running <Task <JsonRpcTask {'params': {}, 'jsonrpc': '2.0', 'method': u'Host.getStats', 'id': u'cddde340-37a8-4f72-a471-b5bc40c06a16'} at 0x7f262814ae90> timeout=60, duration=600.00 at 0x7f262815c050> task#=0 at 0x7f262834b810>, traceback:
> File: "/usr/lib64/python2.7/threading.py", line 785, in __bootstrap
>   self.__bootstrap_inner()
> File: "/usr/lib64/python2.7/threading.py", line 812, in __bootstrap_inner
>   self.run()
> File: "/usr/lib64/python2.7/threading.py", line 765, in run
>   self.__target(*self.__args, **self.__kwargs)
> File: "/usr/lib/python2.7/site-packages/vdsm/common/concurrent.py", line 195, in run
>   ret = func(*args, **kwargs)
> File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 301, in _run
>   self._execute_task()
> File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 315, in _execute_task
>   task()
> File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 391, in __call__
>   self._callable()
> File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 261, in __call__
>   self._handler(self._ctx, self._req)
> File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 304, in _serveRequest
>   response = self._handle_request(req, ctx)
> File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 344, in _handle_request
>   res = method(**params)
> File: "/usr/lib/python2.7/site-packages/vdsm/rpc/Bridge.py", line 202, in _dynamicMethod
>   result = fn(*methodArgs)
> File: "<string>", line 2, in getStats
> File: "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 49, in method
>   ret = func(*args, **kwargs)
> File: "/usr/lib/python2.7/site-packages/vdsm/API.py", line 1409, in getStats
>   multipath=True)}
> File: "/usr/lib/python2.7/site-packages/vdsm/host/api.py", line 78, in get_stats
>   ret['haStats'] = _getHaInfo()
> File: "/usr/lib/python2.7/site-packages/vdsm/host/api.py", line 176, in _getHaInfo
>   stats = instance.get_all_stats()
> File: "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 94, in get_all_stats
>   stats = broker.get_stats_from_storage()
> File: "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 135, in get_stats_from_storage
>   result = self._proxy.get_stats()
> File: "/usr/lib64/python2.7/xmlrpclib.py", line 1233, in __call__
>   return self.__send(self.__name, args)
> File: "/usr/lib64/python2.7/xmlrpclib.py", line 1591, in __request
>   verbose=self.__verbose
> File: "/usr/lib64/python2.7/xmlrpclib.py", line 1273, in request
>   return self.single_request(host, handler, request_body, verbose)
> File: "/usr/lib64/python2.7/xmlrpclib.py", line 1303, in single_request
>   response = h.getresponse(buffering=True)
> File: "/usr/lib64/python2.7/httplib.py", line 1113, in getresponse
>   response.begin()
> File: "/usr/lib64/python2.7/httplib.py", line 444, in begin
>   version, status, reason = self._read_status()
> File: "/usr/lib64/python2.7/httplib.py", line 400, in _read_status
>   line = self.fp.readline(_MAXLINE + 1)
> File: "/usr/lib64/python2.7/socket.py", line 476, in readline
>   data = self._sock.recv(self._rbufsize) (executor:363)
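> The bottom of the traceback shows Host.getStats blocked inside the hosted
> engine HA client, waiting for ovirt-ha-broker to return the stats it keeps on
> the shared storage. The same call can be reproduced directly on the host with
> a small script (a diagnostic sketch only, not part of the suite):
>
>     # Run on the host: this is the call that Host.getStats ends up waiting on.
>     from ovirt_hosted_engine_ha.client import client
>
>     ha_client = client.HAClient()
>     # get_all_stats() asks the broker for the stats stored on shared storage;
>     # if the broker or the storage is unavailable it hangs or raises, which
>     # matches the "Worker blocked" warning above.
>     stats = ha_client.get_all_stats()
>     print(stats)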
>
>
> Andrej, I see the test has been added by you in
> commit e8d32f7375f2033b73544f47c1e1ca67abe8d35a
> I'm not sure about the purpose of this test, and I don't understand why we
> are restarting the services on the host.
>
> Nir, Tal, any idea why the storage pool is not coming up?
>
> I see vdsm is in recovery mode; I'm not sure whether this is what the test was
> supposed to do or whether the intention was to cleanly shut down vdsm and cleanly restart it.
>
>
>
>
> --
>
> SANDRO BONAZZOLA
>
> MANAGER, SOFTWARE ENGINEERING, EMEA R&D RHV
>
> Red Hat EMEA <https://www.redhat.com/>
>
> sbonazzo(a)redhat.com <https://red.ht/sig>
>