On Sun, Jul 8, 2018 at 12:23 PM, Yaniv Kaul <ykaul@redhat.com> wrote:


On Fri, Jul 6, 2018 at 1:01 PM, Sandro Bonazzola <sbonazzo@redhat.com> wrote:
https://jenkins.ovirt.org/job/ovirt-system-tests_hc-basic-suite-4.2/326

fails on add host test with:
Error: The response content type 'text/html; charset=iso-8859-1' isn't the expected XML

Something bad happened during the deployment, because the engine complains about a host not included in the cluster:

2018-07-05 21:34:47,768-04 WARN  [org.ovirt.engine.core.vdsbroker.gluster.GlusterVolumesListReturn] (DefaultQuartzScheduler6) [3009952a] Could not add brick 'lago-hc-basic-suite-4-2-host1:/rhs/brick1/engine' to volume 'c1146520-3bf7-4b81-b31a-7cc5475b6438' - server uuid '50e37ed8-86f3-4b50-9258-f516169025ea' not found in cluster '3125aa60-80bb-11e8-a143-00163e24d363'

In [2] we can see:
2018-07-05 22:03:42,975-0400 ERROR (monitor/f6c4ab4) [storage.Monitor] Error checking domain f6c4ab4a-005d-4ab7-acda-03810014c841 (monitor:424)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/monitor.py", line 405, in _checkDomainStatus
    self.domain.selftest()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 48, in __getattr__
    return getattr(self.getRealDomain(), attrName)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 51, in getRealDomain
    return self._cache._realProduce(self._sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 134, in _realProduce
    domain = self._findDomain(sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 151, in _findDomain
    return findMethod(sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/glusterSD.py", line 55, in findDomain
    return GlusterStorageDomain(GlusterStorageDomain.findDomainPath(sdUUID))
  File "/usr/lib/python2.7/site-packages/vdsm/storage/fileSD.py", line 391, in __init__
    validateFileSystemFeatures(manifest.sdUUID, manifest.mountpoint)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/fileSD.py", line 104, in validateFileSystemFeatures
    oop.getProcessPool(sdUUID).directTouch(testFilePath)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/outOfProcess.py", line 320, in directTouch
    ioproc.touch(path, flags, mode)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 567, in touch
    self.timeout)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 451, in _sendCommand
    raise OSError(errcode, errstr)
OSError: [Errno 30] Read-only file system 

And just before that:
2018-07-05 22:03:33,214-0400 INFO  (libvirt/events) [virt.vm] (vmId='a2f514e6-81ca-4d41-acf9-77cc910f6eaf') abnormal vm stop device ua-c0592bd6-20e6-4dbf-9610-9a35e3f566ab error eother (vm:5116)
2018-07-05 22:03:33,214-0400 INFO  (libvirt/events) [virt.vm] (vmId='a2f514e6-81ca-4d41-acf9-77cc910f6eaf') CPU stopped: onIOError (vm:6157)
2018-07-05 22:03:33,222-0400 INFO  (libvirt/events) [virt.vm] (vmId='a2f514e6-81ca-4d41-acf9-77cc910f6eaf') CPU stopped: onSuspend (vm:6157)
2018-07-05 22:03:33,225-0400 WARN  (libvirt/events) [virt.vm] (vmId='a2f514e6-81ca-4d41-acf9-77cc910f6eaf') device vda reported I/O error (vm:4065)

And indeed, in [3]:

[2018-07-05 22:04:38,936] WARNING [utils - 298:publish_to_webhook] - Event push failed to URL: http://hc-engine:80/ovirt-engine/services/glusterevents, Event: {"event": "QUORUM_LOST", "message": {"volume": "vmstore"}, "nodeid": "59bf7956-60a4-4152-9cf9-99fcdccb211f", "ts": 1530842614}, Status: ('Connection aborted.', error(113, 'No route to host'))

And we can also see https://bugzilla.redhat.com/show_bug.cgi?id=1595436 there as well.


Sahina, Gobinda, can you please investigate?

Ondra, I have no idea why the engine is returning text/html instead of XML here. Can you please check?

Because of the exception[1].
Y.

Thanks Yaniv!

The failure to add hosts is because the engine was down due to quorum loss.
I see that the HC suite has failed in the past due to similar errors, and even in runs that pass there are quorum loss messages (as glusterd is restarted whenever a host is added). I need to dig into the reason for the quorum loss: whether it is the parallel addition of hosts causing it, or something else. I will update this thread.
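
For anyone comparing runs, here is a rough sketch of how the QUORUM_LOST events could be pulled out of the gluster events log and lined up with the vdsm timestamps. It assumes the log lines have the same shape as the one quoted above (an "Event: {...}, Status:" pair) and that the hosts log in -0400, as the vdsm lines suggest. The function name, the regex, and the timezone constant are made up for illustration and are not part of any oVirt or gluster tooling.

```python
import json
import re
from datetime import datetime, timedelta, timezone

# Assumption: the JSON payload sits between "Event: " and ", Status:",
# as in the publish_to_webhook warning quoted earlier in this thread.
EVENT_RE = re.compile(r"Event: (\{.*\}), Status:")

# Assumption: the hosts log in -0400, matching the vdsm log timestamps.
HOST_TZ = timezone(timedelta(hours=-4))


def quorum_lost_events(lines):
    """Yield (time, volume, nodeid) for each QUORUM_LOST event found."""
    for line in lines:
        match = EVENT_RE.search(line)
        if not match:
            continue
        try:
            event = json.loads(match.group(1))
        except ValueError:
            continue  # payload was not valid JSON, skip the line
        if event.get("event") != "QUORUM_LOST":
            continue
        when = datetime.fromtimestamp(event["ts"], tz=HOST_TZ)
        yield when, event.get("message", {}).get("volume"), event.get("nodeid")


# The line quoted above, reused as sample input.
sample = (
    "[2018-07-05 22:04:38,936] WARNING [utils - 298:publish_to_webhook] - "
    "Event push failed to URL: "
    "http://hc-engine:80/ovirt-engine/services/glusterevents, "
    'Event: {"event": "QUORUM_LOST", "message": {"volume": "vmstore"}, '
    '"nodeid": "59bf7956-60a4-4152-9cf9-99fcdccb211f", "ts": 1530842614}, '
    "Status: ('Connection aborted.', error(113, 'No route to host'))"
)

for when, volume, nodeid in quorum_lost_events([sample]):
    print(when.isoformat(), volume, nodeid)
```

For the quoted line, the "ts" field converts to 22:03:34 -0400, one second after the I/O error vdsm reported on vda, whereas the 22:04:38 prefix on that line is only when the webhook push failed. Running the same function over the logs of passing runs might show whether quorum loss there also coincides with host addition.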