On Mon, Nov 20, 2017 at 3:10 PM, Dafna Ron <dron@redhat.com> wrote:

Hi,

We had a failure in OST for test 002_bootstrap.verify_add_all_hosts.

From the logs I can see that vdsm on host-0 reported that it could not find the physical volume, but eventually the storage domain was created and is reported as responsive.

However, host-1 is reported as becoming non-operational with a "storage domain does not exist" error, and I think there is a race.


I've opened https://bugzilla.redhat.com/show_bug.cgi?id=1514906 on this. 
 

I think we create the storage domain while host-1 is still being installed, and if the domain is not created and reported as activated in time, host-1 becomes non-operational.


And based on the above description, this is exactly the issue I've described in the BZ.
Y.
 

Are we starting the installation of host-1 before host-0 and the storage are active?
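If that is the case, one way to close the race would be for the suite to block until the master storage domain reports active before adding host-1. A minimal polling sketch of that idea (the `get_status` callable is a hypothetical stand-in for whatever SDK/REST query the suite actually uses, not the real OST helper):

```python
import time


def wait_for_domain_active(get_status, timeout=300, interval=2):
    """Poll get_status() until it returns 'active' or the timeout expires.

    get_status: callable returning the current storage-domain status
    string (hypothetical stand-in for an engine SDK query).
    Returns True if the domain became active in time, False otherwise.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if get_status() == 'active':
            return True
        time.sleep(interval)
    return False
```

The test would then call something like `wait_for_domain_active(...)` before starting host-1's installation, instead of letting the two run concurrently.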

Link to suspected patches (I do not think the reported patch is related to the error):

https://gerrit.ovirt.org/#/c/84133/


Link to Job:

http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/3902/


Link to all logs:


http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/3902/artifact/


(Relevant) error snippet from the log:

<error>


Lago log:

2017-11-18 11:15:25,472::log_utils.py::end_log_task::670::nose::INFO::  # add_master_storage_domain: Success (in 0:01:09)
2017-11-18 11:15:25,472::log_utils.py::start_log_task::655::nose::INFO::  # add_secondary_storage_domains:
2017-11-18 11:16:47,455::log_utils.py::end_log_task::670::nose::INFO::  # add_secondary_storage_domains: Success (in 0:01:21)
2017-11-18 11:16:47,456::log_utils.py::start_log_task::655::nose::INFO::  # import_templates:
2017-11-18 11:16:47,513::testlib.py::stopTest::198::nose::INFO::    * SKIPPED: Exported domain generation not supported yet
2017-11-18 11:16:47,514::log_utils.py::end_log_task::670::nose::INFO::  # import_templates: Success (in 0:00:00)
2017-11-18 11:16:47,514::log_utils.py::start_log_task::655::nose::INFO::  # verify_add_all_hosts:
2017-11-18 11:16:47,719::testlib.py::assert_equals_within::227::ovirtlago.testlib::ERROR::    * Unhandled exception in <function <lambda> at 0x2909230>
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line 219, in assert_equals_within
    res = func()
  File "/home/jenkins/workspace/ovirt-master_change-queue-tester/ovirt-system-tests/basic-suite-master/test-scenarios/002_bootstrap.py", line 430, in <lambda>
    lambda: _all_hosts_up(hosts_service, total_hosts)
  File "/home/jenkins/workspace/ovirt-master_change-queue-tester/ovirt-system-tests/basic-suite-master/test-scenarios/002_bootstrap.py", line 129, in _all_hosts_up
    _check_problematic_hosts(hosts_service)
  File "/home/jenkins/workspace/ovirt-master_change-queue-tester/ovirt-system-tests/basic-suite-master/test-scenarios/002_bootstrap.py", line 149, in _check_problematic_hosts
    raise RuntimeError(dump_hosts)
RuntimeError: 1 hosts failed installation:
lago-basic-suite-master-host-1: non_operational

2017-11-18 11:16:47,722::utils.py::wrapper::480::lago.utils::DEBUG::Looking for a workdir
2017-11-18 11:16:47,722::workdir.py::resolve_workdir_path::361::lago.workdir::DEBUG::Checking if /dev/shm/ost/deployment-basic-suite-master is a workdir
2017-11-18 11:16:47,724::log_utils.py::__enter__::600::lago.prefix::INFO::    * Collect artifacts:
2017-11-18 11:16:47,724::log_utils.py::__enter__::600::lago.prefix::INFO::    * Collect artifacts:

vdsm host0:

2017-11-18 06:14:23,980-0500 INFO  (jsonrpc/0) [vdsm.api] START getDeviceList(storageType=3, guids=[u'360014059618895272774e97a2aaf5dd6'], checkStatus=False, options={}) from=::ffff:192.168.201.4,45636, flow_id=ed8310a1-a7af-4a67-b351-8ff364766b8a, task_id=6ced0092-34cd-49f0-aa0f-6aae498af37f (api:46)
2017-11-18 06:14:24,353-0500 WARN  (jsonrpc/0) [storage.LVM] lvm pvs failed: 5 [] ['  Failed to find physical volume "/dev/mapper/360014059618895272774e97a2aaf5dd6".'] (lvm:322)
2017-11-18 06:14:24,353-0500 WARN  (jsonrpc/0) [storage.HSM] getPV failed for guid: 360014059618895272774e97a2aaf5dd6 (hsm:1973)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/hsm.py", line 1970, in _getDeviceList
    pv = lvm.getPV(guid)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/lvm.py", line 852, in getPV
    raise se.InaccessiblePhysDev((pvName,))
InaccessiblePhysDev: Multipath cannot access physical device(s): "devices=(u'360014059618895272774e97a2aaf5dd6',)"
2017-11-18 06:14:24,389-0500 INFO  (jsonrpc/0) [vdsm.api] FINISH getDeviceList return={'devList': [{'status': 'unknown', 'vendorID': 'LIO-ORG', 'capacity': '21474836480', 'fwrev': '4.0', 'discard_zeroes_data': 0, 'vgUUID': '', 'pvsize': '', 'pathlist': [{'initiatorname': u'default', 'connection': u'192.168.200.4', 'iqn': u'iqn.2014-07.org.ovirt:storage', 'portal': '1', 'user': u'username', 'password': '********', 'port': '3260'}, {'initiatorname': u'default', 'connection': u'192.168.201.4', 'iqn': u'iqn.2014-07.org.ovirt:storage', 'portal': '1', 'user': u'username', 'password': '********', 'port': '3260'}], 'logicalblocksize': '512', 'discard_max_bytes': 1073741824, 'pathstatus': [{'type': 'iSCSI', 'physdev': 'sda', 'capacity': '21474836480', 'state': 'active', 'lun': '0'}, {'type': 'iSCSI', 'physdev': 'sdf', 'capacity': '21474836480', 'state': 'active', 'lun': '0'}], 'devtype': 'iSCSI', 'physicalblocksize': '512', 'pvUUID': '', 'serial': 'SLIO-ORG_lun0_bdev_96188952-7277-4e97-a2aa-f5dd6aad6fc2', 'GUID': '360014059618895272774e97a2aaf5dd6', 'productID': 'lun0_bdev'}]} from=::ffff:192.168.201.4,45636, flow_id=ed8310a1-a7af-4a67-b351-8ff364766b8a, task_id=6ced0092-34cd-49f0-aa0f-6aae498af37f (api:52)


2017-11-18 06:14:31,788-0500 INFO  (jsonrpc/0) [vdsm.api] FINISH getStorageDomainInfo return={'info': {'uuid': 'cc61e074-a3b6-4371-9185-66079a39f123', 'vgMetadataDevice': '360014059618895272774e97a2aaf5dd6', 'vguuid': '7ifbmt-0elj-uWZZ-zSLG-plA8-8hd3-JG298b', 'metadataDevice': '360014059618895272774e97a2aaf5dd6', 'state': 'OK', 'version': '4', 'role': 'Regular', 'type': 'ISCSI', 'class': 'Data', 'pool': [], 'name': 'iscsi'}} from=::ffff:192.168.201.4,45636, flow_id=2c187699, task_id=c2080b61-d4a5-4bdb-9d75-f81580a8257a (api:

vdsm host1:

2017-11-18 06:16:34,315-0500 ERROR (monitor/c65437c) [storage.Monitor] Setting up monitor for c65437ce-339f-4b01-aeb5-45c1d486bf49 failed (monitor:329)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/monitor.py", line 326, in _setupLoop
    self._setupMonitor()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/monitor.py", line 348, in _setupMonitor
    self._produceDomain()
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 177, in wrapper
    value = meth(self, *a, **kw)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/monitor.py", line 366, in _produceDomain
    self.domain = sdCache.produce(self.sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 110, in produce
    domain.getRealDomain()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 51, in getRealDomain
    return self._cache._realProduce(self._sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 134, in _realProduce
    domain = self._findDomain(sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 151, in _findDomain
    return findMethod(sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 176, in _findUnfetchedDomain
    raise se.StorageDomainDoesNotExist(sdUUID)
StorageDomainDoesNotExist: Storage domain does not exist: (u'c65437ce-339f-4b01-aeb5-45c1d486bf49',)
2017-11-18 06:16:40,377-0500 INFO  (jsonrpc/7) [api.host] START getStats() from=::ffff:192.168.201.4,58722 (api:46)
2017-11-18 06:16:40,378-0500 INFO  (jsonrpc/7) [vdsm.api] START repoStats(domains=()) from=::ffff:192.168.201.4,58722, task_id=8fb74944-08c0-491e-ad55-a7a9f0a11ef8 (api:46)
2017-11-18 06:16:40,379-0500 INFO  (jsonrpc/7) [vdsm.api] FINISH repoStats return={u'c65437ce-339f-4b01-aeb5-45c1d486bf49': {'code': 358, 'actual': True, 'version': -1, 'acquired': False, 'delay': '0', 'lastCheck': '6.1', 'valid': False}, u'cc61e074-a3b6-4371-9185-66079a39f123': {'code': 0, 'actual': True, 'version': 4, 'acquired': True, 'delay': '0.00103987', 'lastCheck': '6.5', 'valid': True}} from=::ffff:192.168.201.4,58722, task_id=8fb74944-08c0-491e-ad55-a7a9f0a11ef8 (api:52)

engine log:

2017-11-18 06:15:54,040-05 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxy] (EE-ManagedThreadFactory-engine-Thread-29) [4ce8aff3] Domain 'c65437ce-339f-4b01-aeb5-45c1d486bf49:nfs' was reported with error code '358'
2017-11-18 06:15:54,041-05 ERROR [org.ovirt.engine.core.bll.InitVdsOnUpCommand] (EE-ManagedThreadFactory-engine-Thread-29) [4ce8aff3] Storage Domain 'nfs' of pool 'test-dc' is in problem in host 'lago-basic-suite-master-host-1'
2017-11-18 06:15:54,045-05 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-29) [4ce8aff3] EVENT_ID: VDS_STORAGE_VDS_STATS_FAILED(189), Host lago-basic-suite-master-host-1 reports about one of the Active Storage Domains as Problematic.


</error>




_______________________________________________
Devel mailing list
Devel@ovirt.org
http://lists.ovirt.org/mailman/listinfo/devel