oVirt master experimental system tests now failing on snapshot merge

After we've fixed the various system test issues that arose from the CentOS 7.3 release, we're now seeing a new failure that seems to have to do with snapshot merges. I'm guessing this may have to do with something that went in last week while we "weren't looking".

Failing job can be seen here:
http://jenkins.ovirt.org/job/test-repo_ovirt_experimental_master/4275

The test code snippet that is failing is as follows:

226     api.vms.get(VM0_NAME).snapshots.list()[-2].delete()
227     testlib.assert_true_within_short(
228         lambda:
229         (len(api.vms.get(VM0_NAME).snapshots.list()) == 2) and
230         (api.vms.get(VM0_NAME).snapshots.list()[-1].snapshot_status
231          == 'ok'),
232     )

The failure itself is a test timeout:

...
  File "/home/jenkins/workspace/test-repo_ovirt_experimental_master/ovirt-system-tests/basic-suite-master/test-scenarios/004_basic_sanity.py", line 228, in snapshot_merge
    lambda:
  File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line 248, in assert_true_within_short
    allowed_exceptions=allowed_exceptions,
  File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line 240, in assert_true_within
    raise AssertionError('Timed out after %s seconds' % timeout)
AssertionError: Timed out after 180 seconds

Engine log generated during the test can be found here:
http://jenkins.ovirt.org/job/test-repo_ovirt_experimental_master/4275/artifa...

Please have a look.

Thanks, Barak.

-- 
Barak Korren
bkorren@redhat.com
RHCE, RHCi, RHV-DevOps Team
https://ifireball.wordpress.com/

On Sun, Dec 18, 2016 at 4:17 PM, Barak Korren <bkorren@redhat.com> wrote:
After we've fixed the various system test issues that arose from the CentOS 7.3 release, we're now seeing a new failure that seems to have to do with snapshot merges.
I'm guessing this may have to do with something that went in last week while we "weren't looking".
Failing job can be seen here: http://jenkins.ovirt.org/job/test-repo_ovirt_experimental_master/4275
The test code snippet that is failing is as follows:
226     api.vms.get(VM0_NAME).snapshots.list()[-2].delete()
227     testlib.assert_true_within_short(
228         lambda:
229         (len(api.vms.get(VM0_NAME).snapshots.list()) == 2) and
230         (api.vms.get(VM0_NAME).snapshots.list()[-1].snapshot_status
231          == 'ok'),
232     )
The failure itself is a test timeout:
... File "/home/jenkins/workspace/test-repo_ovirt_experimental_master/ovirt-system-tests/basic-suite-master/test-scenarios/004_basic_sanity.py", line 228, in snapshot_merge lambda: File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line 248, in assert_true_within_short allowed_exceptions=allowed_exceptions, File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line 240, in assert_true_within raise AssertionError('Timed out after %s seconds' % timeout) AssertionError: Timed out after 180 seconds
Engine log generated during the test can be found here:
http://jenkins.ovirt.org/job/test-repo_ovirt_experimental_master/4275/artifa...
Quickly looking in the engine log does not reveal anything useful. Understanding issues like this require vdsm log from the host performing the operation. Looking in vdsm log: http://jenkins.ovirt.org/job/test-repo_ovirt_experimental_master/4275/artifa... 2016-12-18 07:54:44,325 INFO (jsonrpc/1) [dispatcher] Run and protect: getDeviceList(storageType=3, guids=(), checkStatus=False, options={}) (logUtils:49) 2016-12-18 07:54:44,565 INFO (jsonrpc/1) [dispatcher] Run and protect: getDeviceList, Return response: {'devList': [{'status': 'unknown', 'vendorID': 'LIO-ORG', 'capacity': '21474836480', 'fwrev': '4.0', 'discard_zeroes_data': 0, 'vgUUID': '', 'pvsize': '', 'pathlist': [{'initiatorname': 'default', 'connection': '192.168.200.3', 'iqn': 'iqn.2014-07.org.ovirt:storage', 'portal': '1', 'user': 'username', 'password': '********', 'port': '3260'}], 'logicalblocksize': '512', 'discard_max_bytes': 0, 'pathstatus': [{'type': 'iSCSI', 'physdev': 'sdd', 'capacity': '21474836480', 'state': 'active', 'lun': '3'}], 'devtype': 'iSCSI', 'physicalblocksize': '512', 'pvUUID': '', 'serial': 'SLIO-ORG_lun3_bdev_b94d3d26-d6e4-4bfc-9d91-4898731b721f', 'GUID': '36001405b94d3d26d6e44bfc9d9148987', 'productID': 'lun3_bdev'}, {'status': 'unknown', 'vendorID': 'LIO-ORG', 'capacity': '21474836480', 'fwrev': '4.0', 'discard_zeroes_data': 0, 'vgUUID': '', 'pvsize': '', 'pathlist': [{'initiatorname': 'default', 'connection': '192.168.200.3', 'iqn': 'iqn.2014-07.org.ovirt:storage', 'portal': '1', 'user': 'username', 'password': '********', 'port': '3260'}], 'logicalblocksize': '512', 'discard_max_bytes': 0, 'pathstatus': [{'type': 'iSCSI', 'physdev': 'sdc', 'capacity': '21474836480', 'state': 'active', 'lun': '2'}], 'devtype': 'iSCSI', 'physicalblocksize': '512', 'pvUUID': '', 'serial': 'SLIO-ORG_lun2_bdev_eddd3ce8-2ea8-4326-a9a8-ebc609a768db', 'GUID': '36001405eddd3ce82ea84326a9a8ebc60', 'productID': 'lun2_bdev'}, {'status': 'unknown', 'vendorID': 'LIO-ORG', 'capacity': '21474836480', 'fwrev': '4.0', 'discard_zeroes_data': 0, 'vgUUID': '', 'pvsize': '', 'pathlist': [{'initiatorname': 'default', 'connection': '192.168.200.3', 'iqn': 'iqn.2014-07.org.ovirt:storage', 'portal': '1', 'user': 'username', 'password': '********', 'port': '3260'}], 'logicalblocksize': '512', 'discard_max_bytes': 0, 'pathstatus': [{'type': 'iSCSI', 'physdev': 'sda', 'capacity': '21474836480', 'state': 'active', 'lun': '0'}], 'devtype': 'iSCSI', 'physicalblocksize': '512', 'pvUUID': '', 'serial': 'SLIO-ORG_lun0_bdev_2e4947ab-dee9-47b5-b55c-9e28a18d6831', 'GUID': '360014052e4947abdee947b5b55c9e28a', 'productID': 'lun0_bdev'}, {'status': 'unknown', 'vendorID': 'LIO-ORG', 'capacity': '21474836480', 'fwrev': '4.0', 'discard_zeroes_data': 0, 'vgUUID': '', 'pvsize': '', 'pathlist': [{'initiatorname': 'default', 'connection': '192.168.200.3', 'iqn': 'iqn.2014-07.org.ovirt:storage', 'portal': '1', 'user': 'username', 'password': '********', 'port': '3260'}], 'logicalblocksize': '512', 'discard_max_bytes': 0, 'pathstatus': [{'type': 'iSCSI', 'physdev': 'sdb', 'capacity': '21474836480', 'state': 'active', 'lun': '1'}], 'devtype': 'iSCSI', 'physicalblocksize': '512', 'pvUUID': '', 'serial': 'SLIO-ORG_lun1_bdev_c581fee8-4da8-42c8-8414-305e003cd57c', 'GUID': '36001405c581fee84da842c88414305e0', 'productID': 'lun1_bdev'}, {'status': 'unknown', 'vendorID': 'LIO-ORG', 'capacity': '21474836480', 'fwrev': '4.0', 'discard_zeroes_data': 0, 'vgUUID': '', 'pvsize': '', 'pathlist': [{'initiatorname': 'default', 'connection': '192.168.200.3', 'iqn': 
'iqn.2014-07.org.ovirt:storage', 'portal': '1', 'user': 'username', 'password': '********', 'port': '3260'}], 'logicalblocksize': '512', 'discard_max_bytes': 0, 'pathstatus': [{'type': 'iSCSI', 'physdev': 'sde', 'capacity': '21474836480', 'state': 'active', 'lun': '4'}], 'devtype': 'iSCSI', 'physicalblocksize': '512', 'pvUUID': '', 'serial': 'SLIO-ORG_lun4_bdev_18c78046-b159-4d14-a65e-5f9ad3b5b4f2', 'GUID': '3600140518c78046b1594d14a65e5f9ad', 'productID': 'lun4_bdev'}]} (logUtils:52) 2016-12-18 07:54:44,567 INFO (jsonrpc/1) [jsonrpc.JsonRpcServer] RPC call Host.getDeviceList succeeded in 0.24 seconds (__init__:515) 2016-12-18 07:54:45,692 INFO (jsonrpc/2) [dispatcher] Run and protect: getDeviceList(storageType=3, guids=[u'3600140518c78046b1594d14a65e5f9ad', u'360014052e4947abdee947b5b55c9e28a'], checkStatus=False, options={}) (logUtils:49) 2016-12-18 07:54:45,882 WARN (jsonrpc/2) [storage.LVM] lvm pvs failed: 5 [] [' WARNING: Not using lvmetad because config setting use_lvmetad=0.', ' WARNING: To avoid corruption, rescan devices to make changes visible (pvscan --cache).', ' Failed to find physical volume "/dev/mapper/3600140518c78046b1594d14a65e5f9ad".'] (lvm:324) 2016-12-18 07:54:45,883 WARN (jsonrpc/2) [storage.HSM] getPV failed for guid: 3600140518c78046b1594d14a65e5f9ad (hsm:1970) Traceback (most recent call last): File "/usr/share/vdsm/storage/hsm.py", line 1967, in _getDeviceList pv = lvm.getPV(guid) File "/usr/share/vdsm/storage/lvm.py", line 856, in getPV raise se.InaccessiblePhysDev((pvName,)) InaccessiblePhysDev: Multipath cannot access physical device(s): "devices=(u'3600140518c78046b1594d14a65e5f9ad',)" 2016-12-18 07:54:45,944 WARN (jsonrpc/2) [storage.LVM] lvm pvs failed: 5 [] [' WARNING: Not using lvmetad because config setting use_lvmetad=0.', ' WARNING: To avoid corruption, rescan devices to make changes visible (pvscan --cache).', ' Failed to find physical volume "/dev/mapper/360014052e4947abdee947b5b55c9e28a".'] (lvm:324) 2016-12-18 07:54:45,944 WARN (jsonrpc/2) [storage.HSM] getPV failed for guid: 360014052e4947abdee947b5b55c9e28a (hsm:1970) Traceback (most recent call last): File "/usr/share/vdsm/storage/hsm.py", line 1967, in _getDeviceList pv = lvm.getPV(guid) File "/usr/share/vdsm/storage/lvm.py", line 856, in getPV raise se.InaccessiblePhysDev((pvName,)) InaccessiblePhysDev: Multipath cannot access physical device(s): "devices=(u'360014052e4947abdee947b5b55c9e28a',)" 2016-12-18 07:54:45,977 INFO (jsonrpc/2) [dispatcher] Run and protect: getDeviceList, Return response: {'devList': [{'status': 'unknown', 'vendorID': 'LIO-ORG', 'capacity': '21474836480', 'fwrev': '4.0', 'discard_zeroes_data': 0, 'vgUUID': '', 'pvsize': '', 'pathlist': [{'initiatorname': 'default', 'connection': '192.168.200.3', 'iqn': 'iqn.2014-07.org.ovirt:storage', 'portal': '1', 'user': 'username', 'password': '********', 'port': '3260'}], 'logicalblocksize': '512', 'discard_max_bytes': 0, 'pathstatus': [{'type': 'iSCSI', 'physdev': 'sda', 'capacity': '21474836480', 'state': 'active', 'lun': '0'}], 'devtype': 'iSCSI', 'physicalblocksize': '512', 'pvUUID': '', 'serial': 'SLIO-ORG_lun0_bdev_2e4947ab-dee9-47b5-b55c-9e28a18d6831', 'GUID': '360014052e4947abdee947b5b55c9e28a', 'productID': 'lun0_bdev'}, {'status': 'unknown', 'vendorID': 'LIO-ORG', 'capacity': '21474836480', 'fwrev': '4.0', 'discard_zeroes_data': 0, 'vgUUID': '', 'pvsize': '', 'pathlist': [{'initiatorname': 'default', 'connection': '192.168.200.3', 'iqn': 'iqn.2014-07.org.ovirt:storage', 'portal': '1', 'user': 'username', 'password': 
'********', 'port': '3260'}], 'logicalblocksize': '512', 'discard_max_bytes': 0, 'pathstatus': [{'type': 'iSCSI', 'physdev': 'sde', 'capacity': '21474836480', 'state': 'active', 'lun': '4'}], 'devtype': 'iSCSI', 'physicalblocksize': '512', 'pvUUID': '', 'serial': 'SLIO-ORG_lun4_bdev_18c78046-b159-4d14-a65e-5f9ad3b5b4f2', 'GUID': '3600140518c78046b1594d14a65e5f9ad', 'productID': 'lun4_bdev'}]} (logUtils:52)
2016-12-18 07:54:45,978 INFO (jsonrpc/2) [jsonrpc.JsonRpcServer] RPC call Host.getDeviceList succeeded in 0.29 seconds (__init__:515)

Looks like a pv is not accessible - storage issue?

Please check that the pvs are clean before using them for the test. Clearing the luns on the storage before the test can be a good idea.

2016-12-18 07:54:47,165 INFO (jsonrpc/4) [dispatcher] Run and protect: createVG(vgname=u'097c3032-99da-4e18-b547-19ff340f846e', devlist=[u'3600140518c78046b1594d14a65e5f9ad', u'360014052e4947abdee947b5b55c9e28a'], force=False, options=None) (logUtils:49)

Engine is trying to create a vg with the pv 360014052e4947abdee947b5b55c9e28a that failed before in getDeviceList. This looks like a bug in engine, ignoring the results of getDeviceList.

2016-12-18 07:54:47,552 INFO (jsonrpc/4) [dispatcher] Run and protect: createVG, Return response: {'uuid': 'zJyjdF-AvTC-Z6wN-XNRO-1jDd-2SUH-dVQTJW'} (logUtils:52)
2016-12-18 07:54:47,552 INFO (jsonrpc/4) [jsonrpc.JsonRpcServer] RPC call LVMVolumeGroup.create succeeded in 0.38 seconds (__init__:515)
2016-12-18 07:54:48,562 INFO (jsonrpc/5) [dispatcher] Run and protect: createStorageDomain(storageType=3, sdUUID=u'097c3032-99da-4e18-b547-19ff340f846e', domainName=u'iscsi', typeSpecificArg=u'zJyjdF-AvTC-Z6wN-XNRO-1jDd-2SUH-dVQTJW', domClass=1, domVersion=u'3', options=None) (logUtils:49)
2016-12-18 07:54:48,563 ERROR (jsonrpc/5) [storage.StorageDomainCache] looking for unfetched domain 097c3032-99da-4e18-b547-19ff340f846e (sdc:151)
2016-12-18 07:54:48,563 ERROR (jsonrpc/5) [storage.StorageDomainCache] looking for domain 097c3032-99da-4e18-b547-19ff340f846e (sdc:168)
2016-12-18 07:54:48,564 ERROR (jsonrpc/5) [storage.StorageDomainCache] domain 097c3032-99da-4e18-b547-19ff340f846e not found (sdc:157)
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/sdc.py", line 155, in _findDomain
    dom = findMethod(sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 185, in _findUnfetchedDomain
    raise se.StorageDomainDoesNotExist(sdUUID)
StorageDomainDoesNotExist: Storage domain does not exist: (u'097c3032-99da-4e18-b547-19ff340f846e',)

We see a lot of these errors in the rest of the log. This means something is wrong with this vg.

Needs deeper investigation by a storage developer on both the engine and vdsm sides, but I would start by making sure we use clean luns. We are not trying to test esoteric negative flows in the system tests.

Did we change something in the system tests project or lago while we were not looking?

Can we reproduce this issue manually with the same engine and vdsm versions?

Nir
Please have a look.
Thanks, Barak.
-- 
Barak Korren
bkorren@redhat.com
RHCE, RHCi, RHV-DevOps Team
https://ifireball.wordpress.com/

On Sun, Dec 18, 2016 at 5:26 PM, Nir Soffer <nsoffer@redhat.com> wrote:
On Sun, Dec 18, 2016 at 4:17 PM, Barak Korren <bkorren@redhat.com> wrote:

After we've fixed the various system test issues that arose from the CentOS 7.3 release, we're now seeing a new failure that seems to have to do with snapshot merges.
I'm guessing this may have to do with something that went in last week while we "weren't looking".
Failing job can be seen here: http://jenkins.ovirt.org/job/test-repo_ovirt_experimental_master/4275
The test code snippet that is failing is as follows:
226     api.vms.get(VM0_NAME).snapshots.list()[-2].delete()
227     testlib.assert_true_within_short(
228         lambda:
229         (len(api.vms.get(VM0_NAME).snapshots.list()) == 2) and
230         (api.vms.get(VM0_NAME).snapshots.list()[-1].snapshot_status
231          == 'ok'),
232     )
The failure itself is a test timeout:
... File "/home/jenkins/workspace/test-repo_ovirt_experimental_ master/ovirt-system-tests/basic-suite-master/test- scenarios/004_basic_sanity.py", line 228, in snapshot_merge lambda: File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line 248, in assert_true_within_short allowed_exceptions=allowed_exceptions, File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line 240, in assert_true_within raise AssertionError('Timed out after %s seconds' % timeout) AssertionError: Timed out after 180 seconds
Engine log generated during the test can be found here:
http://jenkins.ovirt.org/job/test-repo_ovirt_experimental_master/4275/artifact/exported-artifacts/basic_suite_master.sh-el7/exported-artifacts/test_logs/basic-suite-master/post-004_basic_sanity.py/lago-basic-suite-master-engine/_var_log_ovirt-engine/engine.log
Quickly looking in the engine log does not reveal anything useful.
Understanding issues like this requires the vdsm log from the host performing the operation.
Looking in vdsm log:
http://jenkins.ovirt.org/job/test-repo_ovirt_experimental_master/4275/artifact/exported-artifacts/basic_suite_master.sh-el7/exported-artifacts/test_logs/basic-suite-master/post-004_basic_sanity.py/lago-basic-suite-master-host1/_var_log_vdsm/vdsm.log
2016-12-18 07:54:44,325 INFO (jsonrpc/1) [dispatcher] Run and protect: getDeviceList(storageType=3, guids=(), checkStatus=False, options={}) (logUtils:49)
2016-12-18 07:54:44,565 INFO (jsonrpc/1) [dispatcher] Run and protect: getDeviceList, Return response: {'devList': [{'status': 'unknown', 'vendorID': 'LIO-ORG', 'capacity': '21474836480', 'fwrev': '4.0', 'discard_zeroes_data': 0, 'vgUUID': '', 'pvsize': '', 'pathlist': [{'initiatorname': 'default', 'connection': '192.168.200.3', 'iqn': 'iqn.2014-07.org.ovirt:storage', 'portal': '1', 'user': 'username', 'password': '********', 'port': '3260'}], 'logicalblocksize': '512', 'discard_max_bytes': 0, 'pathstatus': [{'type': 'iSCSI', 'physdev': 'sdd', 'capacity': '21474836480', 'state': 'active', 'lun': '3'}], 'devtype': 'iSCSI', 'physicalblocksize': '512', 'pvUUID': '', 'serial': 'SLIO-ORG_lun3_bdev_b94d3d26-d6e4-4bfc-9d91-4898731b721f', 'GUID': '36001405b94d3d26d6e44bfc9d9148987', 'productID': 'lun3_bdev'}, {'status': 'unknown', 'vendorID': 'LIO-ORG', 'capacity': '21474836480', 'fwrev': '4.0', 'discard_zeroes_data': 0, 'vgUUID': '', 'pvsize': '', 'pathlist': [{'initiatorname': 'default', 'connection': '192.168.200.3', 'iqn': 'iqn.2014-07.org.ovirt:storage', 'portal': '1', 'user': 'username', 'password': '********', 'port': '3260'}], 'logicalblocksize': '512', 'discard_max_bytes': 0, 'pathstatus': [{'type': 'iSCSI', 'physdev': 'sdc', 'capacity': '21474836480', 'state': 'active', 'lun': '2'}], 'devtype': 'iSCSI', 'physicalblocksize': '512', 'pvUUID': '', 'serial': 'SLIO-ORG_lun2_bdev_eddd3ce8-2ea8-4326-a9a8-ebc609a768db', 'GUID': '36001405eddd3ce82ea84326a9a8ebc60', 'productID': 'lun2_bdev'}, {'status': 'unknown', 'vendorID': 'LIO-ORG', 'capacity': '21474836480', 'fwrev': '4.0', 'discard_zeroes_data': 0, 'vgUUID': '', 'pvsize': '', 'pathlist': [{'initiatorname': 'default', 'connection': '192.168.200.3', 'iqn': 'iqn.2014-07.org.ovirt:storage', 'portal': '1', 'user': 'username', 'password': '********', 'port': '3260'}], 'logicalblocksize': '512', 'discard_max_bytes': 0, 'pathstatus': [{'type': 'iSCSI', 'physdev': 'sda', 'capacity': '21474836480', 'state': 'active', 'lun': '0'}], 'devtype': 'iSCSI', 'physicalblocksize': '512', 'pvUUID': '', 'serial': 'SLIO-ORG_lun0_bdev_2e4947ab-dee9-47b5-b55c-9e28a18d6831', 'GUID': '360014052e4947abdee947b5b55c9e28a', 'productID': 'lun0_bdev'}, {'status': 'unknown', 'vendorID': 'LIO-ORG', 'capacity': '21474836480', 'fwrev': '4.0', 'discard_zeroes_data': 0, 'vgUUID': '', 'pvsize': '', 'pathlist': [{'initiatorname': 'default', 'connection': '192.168.200.3', 'iqn': 'iqn.2014-07.org.ovirt:storage', 'portal': '1', 'user': 'username', 'password': '********', 'port': '3260'}], 'logicalblocksize': '512', 'discard_max_bytes': 0, 'pathstatus': [{'type': 'iSCSI', 'physdev': 'sdb', 'capacity': '21474836480', 'state': 'active', 'lun': '1'}], 'devtype': 'iSCSI', 'physicalblocksize': '512', 'pvUUID': '', 'serial': 'SLIO-ORG_lun1_bdev_c581fee8-4da8-42c8-8414-305e003cd57c', 'GUID': '36001405c581fee84da842c88414305e0', 'productID': 'lun1_bdev'}, {'status': 'unknown', 'vendorID': 'LIO-ORG', 'capacity': '21474836480', 'fwrev': '4.0', 'discard_zeroes_data': 0, 'vgUUID': '', 'pvsize': '', 'pathlist': [{'initiatorname': 'default', 'connection': '192.168.200.3', 'iqn': 'iqn.2014-07.org.ovirt:storage', 'portal': '1', 'user': 'username', 'password': '********', 'port': '3260'}], 'logicalblocksize': '512', 'discard_max_bytes': 0, 'pathstatus': [{'type': 'iSCSI', 'physdev': 'sde', 'capacity': '21474836480', 'state': 'active', 'lun': '4'}], 'devtype': 'iSCSI', 'physicalblocksize': '512', 'pvUUID': '', 'serial': 'SLIO-ORG_lun4_bdev_18c78046-b159-4d14-a65e-5f9ad3b5b4f2', 'GUID': 
'3600140518c78046b1594d14a65e5f9ad', 'productID': 'lun4_bdev'}]} (logUtils:52)
2016-12-18 07:54:44,567 INFO (jsonrpc/1) [jsonrpc.JsonRpcServer] RPC call Host.getDeviceList succeeded in 0.24 seconds (__init__:515)
2016-12-18 07:54:45,692 INFO (jsonrpc/2) [dispatcher] Run and protect: getDeviceList(storageType=3, guids=[u'3600140518c78046b1594d14a65e5f9ad', u'360014052e4947abdee947b5b55c9e28a'], checkStatus=False, options={}) (logUtils:49)
2016-12-18 07:54:45,882 WARN (jsonrpc/2) [storage.LVM] lvm pvs failed: 5 [] [' WARNING: Not using lvmetad because config setting use_lvmetad=0.', ' WARNING: To avoid corruption, rescan devices to make changes visible (pvscan --cache).', ' Failed to find physical volume "/dev/mapper/3600140518c78046b1594d14a65e5f9ad".'] (lvm:324) 2016-12-18 07:54:45,883 WARN (jsonrpc/2) [storage.HSM] getPV failed for guid: 3600140518c78046b1594d14a65e5f9ad (hsm:1970) Traceback (most recent call last): File "/usr/share/vdsm/storage/hsm.py", line 1967, in _getDeviceList pv = lvm.getPV(guid) File "/usr/share/vdsm/storage/lvm.py", line 856, in getPV raise se.InaccessiblePhysDev((pvName,)) InaccessiblePhysDev: Multipath cannot access physical device(s): "devices=(u'3600140518c78046b1594d14a65e5f9ad',)" 2016-12-18 07:54:45,944 WARN (jsonrpc/2) [storage.LVM] lvm pvs failed: 5 [] [' WARNING: Not using lvmetad because config setting use_lvmetad=0.', ' WARNING: To avoid corruption, rescan devices to make changes visible (pvscan --cache).', ' Failed to find physical volume "/dev/mapper/360014052e4947abdee947b5b55c9e28a".'] (lvm:324) 2016-12-18 07:54:45,944 WARN (jsonrpc/2) [storage.HSM] getPV failed for guid: 360014052e4947abdee947b5b55c9e28a (hsm:1970) Traceback (most recent call last): File "/usr/share/vdsm/storage/hsm.py", line 1967, in _getDeviceList pv = lvm.getPV(guid) File "/usr/share/vdsm/storage/lvm.py", line 856, in getPV raise se.InaccessiblePhysDev((pvName,)) InaccessiblePhysDev: Multipath cannot access physical device(s): "devices=(u'360014052e4947abdee947b5b55c9e28a',)" 2016-12-18 07:54:45,977 INFO (jsonrpc/2) [dispatcher] Run and protect: getDeviceList, Return response: {'devList': [{'status': 'unknown', 'vendorID': 'LIO-ORG', 'capacity': '21474836480', 'fwrev': '4.0', 'discard_zeroes_data': 0, 'vgUUID': '', 'pvsize': '', 'pathlist': [{'initiatorname': 'default', 'connection': '192.168.200.3', 'iqn': 'iqn.2014-07.org.ovirt:storage', 'portal': '1', 'user': 'username', 'password': '********', 'port': '3260'}], 'logicalblocksize': '512', 'discard_max_bytes': 0, 'pathstatus': [{'type': 'iSCSI', 'physdev': 'sda', 'capacity': '21474836480', 'state': 'active', 'lun': '0'}], 'devtype': 'iSCSI', 'physicalblocksize': '512', 'pvUUID': '', 'serial': 'SLIO-ORG_lun0_bdev_2e4947ab-dee9-47b5-b55c-9e28a18d6831', 'GUID': '360014052e4947abdee947b5b55c9e28a', 'productID': 'lun0_bdev'}, {'status': 'unknown', 'vendorID': 'LIO-ORG', 'capacity': '21474836480', 'fwrev': '4.0', 'discard_zeroes_data': 0, 'vgUUID': '', 'pvsize': '', 'pathlist': [{'initiatorname': 'default', 'connection': '192.168.200.3', 'iqn': 'iqn.2014-07.org.ovirt:storage', 'portal': '1', 'user': 'username', 'password': '********', 'port': '3260'}], 'logicalblocksize': '512', 'discard_max_bytes': 0, 'pathstatus': [{'type': 'iSCSI', 'physdev': 'sde', 'capacity': '21474836480', 'state': 'active', 'lun': '4'}], 'devtype': 'iSCSI', 'physicalblocksize': '512', 'pvUUID': '', 'serial': 'SLIO-ORG_lun4_bdev_18c78046-b159-4d14-a65e-5f9ad3b5b4f2', 'GUID': '3600140518c78046b1594d14a65e5f9ad', 'productID': 'lun4_bdev'}]} (logUtils:52) 2016-12-18 07:54:45,978 INFO (jsonrpc/2) [jsonrpc.JsonRpcServer] RPC call Host.getDeviceList succeeded in 0.29 seconds (__init__:515)
Looks like a pv is not accessible - storage issue?
Please check that the pvs are clean before using them for the test. Clearing the luns on the storage before the test can be a good idea.
They should ALWAYS be clean, as we start from scratch (including creation of the devices on the storage) in every run.
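For what it's worth, checking that on the host - and wiping any leftover LVM or partition metadata before the run - is cheap. A rough sketch (the WWIDs are the two from the log above; multipath, pvs and wipefs are standard tools, the rest is just glue):

import subprocess

# The two device WWIDs that failed getPV in the vdsm log above.
GUIDS = [
    '3600140518c78046b1594d14a65e5f9ad',
    '360014052e4947abdee947b5b55c9e28a',
]

for guid in GUIDS:
    dev = '/dev/mapper/' + guid
    # Is the multipath map there, and does LVM already see a PV on it?
    subprocess.call(['multipath', '-ll', guid])
    subprocess.call(['pvs', dev])
    # If a previous run left metadata behind, wipe it before the test starts.
    # Commented out because it destroys any signatures on the device.
    # subprocess.check_call(['wipefs', '-a', dev])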
2016-12-18 07:54:47,165 INFO (jsonrpc/4) [dispatcher] Run and protect: createVG(vgname=u'097c3032-99da-4e18-b547-19ff340f846e', devlist=[u'3600140518c78046b1594d14a65e5f9ad', u'360014052e4947abdee947b5b55c9e28a'], force=False, options=None) (logUtils:49)
Engine is trying to create a vg with the pv 360014052e4947abdee947b5b55c9e28a that failed before in getDeviceList.
This looks like a bug in engine, ignoring the results of getDeviceList.
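In other words, devices whose PV lookup failed should be filtered out (or the flow aborted) before createVG is composed. The engine is Java, so this is only an illustration of the missing check, in Python, with a hypothetical vdsm client object whose call parameters mirror the log above:

def create_vg_checked(vdsm, vg_name, requested_guids):
    # Re-query the requested devices before building the storage domain.
    devices = vdsm.getDeviceList(storageType=3, guids=requested_guids,
                                 checkStatus=False)
    # In the run above, devices whose getPV failed came back with an empty
    # 'pvUUID' field, so that is used here as the "lookup failed" signal.
    usable = [d['GUID'] for d in devices if d.get('pvUUID')]
    missing = set(requested_guids) - set(usable)
    if missing:
        raise RuntimeError('devices not usable as PVs: %s' % sorted(missing))
    return vdsm.createVG(vgname=vg_name, devlist=usable, force=False)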
2016-12-18 07:54:47,552 INFO (jsonrpc/4) [dispatcher] Run and protect: createVG, Return response: {'uuid': 'zJyjdF-AvTC-Z6wN-XNRO-1jDd-2SUH-dVQTJW'} (logUtils:52) 2016-12-18 07:54:47,552 INFO (jsonrpc/4) [jsonrpc.JsonRpcServer] RPC call LVMVolumeGroup.create succeeded in 0.38 seconds (__init__:515) 2016-12-18 07:54:48,562 INFO (jsonrpc/5) [dispatcher] Run and protect: createStorageDomain(storageType=3, sdUUID=u'097c3032-99da-4e18-b547-19ff340f846e', domainName=u'iscsi', typeSpecificArg=u'zJyjdF-AvTC-Z6wN-XNRO-1jDd-2SUH-dVQTJW', domClass=1, domVersion=u'3', options=None) (logUtils:49) 2016-12-18 07:54:48,563 ERROR (jsonrpc/5) [storage.StorageDomainCache] looking for unfetched domain 097c3032-99da-4e18-b547-19ff340f846e (sdc:151) 2016-12-18 07:54:48,563 ERROR (jsonrpc/5) [storage.StorageDomainCache] looking for domain 097c3032-99da-4e18-b547-19ff340f846e (sdc:168) 2016-12-18 07:54:48,564 ERROR (jsonrpc/5) [storage.StorageDomainCache] domain 097c3032-99da-4e18-b547-19ff340f846e not found (sdc:157) Traceback (most recent call last): File "/usr/share/vdsm/storage/sdc.py", line 155, in _findDomain dom = findMethod(sdUUID) File "/usr/share/vdsm/storage/sdc.py", line 185, in _findUnfetchedDomain raise se.StorageDomainDoesNotExist(sdUUID) StorageDomainDoesNotExist: Storage domain does not exist: (u'097c3032-99da-4e18-b547-19ff340f846e',)
We see a lot of these errors in the rest of the log. This means something is wrong with this vg.
Needs deeper investigation by a storage developer on both the engine and vdsm sides, but I would start by making sure we use clean luns. We are not trying to test esoteric negative flows in the system tests.
Did we change something in the system tests project or lago while we were not looking?
CentOS 7.3... ? Y.
Can we reproduce this issue manually with the same engine and vdsm versions?
Nir
Please have a look.
Thanks, Barak.
-- 
Barak Korren
bkorren@redhat.com
RHCE, RHCi, RHV-DevOps Team
https://ifireball.wordpress.com/

On 18 December 2016 at 17:26, Nir Soffer <nsoffer@redhat.com> wrote:
On Sun, Dec 18, 2016 at 4:17 PM, Barak Korren <bkorren@redhat.com> wrote:
We see a lot of these errors in the rest of the log. This means something is wrong with this vg.
Needs deeper investigation by a storage developer on both the engine and vdsm sides, but I would start by making sure we use clean luns. We are not trying to test esoteric negative flows in the system tests.
Here is the storage setup script:
https://gerrit.ovirt.org/gitweb?p=ovirt-system-tests.git;a=blob;f=common/dep...

All storage used in the system tests comes from the engine VM itself, and is placed on a newly allocated QCOW2 file (exposed as /dev/sde to the engine VM), so it's unlikely the LUNs are not clean.
Did we change something in the system tests project or lago while we were not looking?
Not likely as well:
https://gerrit.ovirt.org/gitweb?p=ovirt-system-tests.git;a=shortlog

The ovirt-system-tests project has its own CI, testing against the last nightly (we will move it to the last build that passed the tests soon), so we are unlikely to merge breaking code there.

Then again, we're not gating the OS packages, so some breakage may have gone in via the CentOS repos...
Can we reproduce this issue manually with same engine and vdsm versions?
You have several options:

1: Get engine+vdsm builds from Jenkins:
   http://jenkins.ovirt.org/job/ovirt-engine_master_build-artifacts-fc24-x86_64...
   http://jenkins.ovirt.org/job/vdsm_master_build-artifacts-el7-x86_64/
   (Getting the exact builds that went into a given OST run takes tracing back the job invocation links from that run)

2: Use the latest experimental repo:
   http://resources.ovirt.org/repos/ovirt/experimental/master/latest/rpm/el7/

3: Run lago and OST locally, as documented here:
   http://ovirt-system-tests.readthedocs.io/en/latest/
   (you'd need to pass in the vdsm and engine packages to use)

-- 
Barak Korren
bkorren@redhat.com
RHCE, RHCi, RHV-DevOps Team
https://ifireball.wordpress.com/

On Sun, Dec 18, 2016 at 6:08 PM, Barak Korren <bkorren@redhat.com> wrote:
On 18 December 2016 at 17:26, Nir Soffer <nsoffer@redhat.com> wrote:
On Sun, Dec 18, 2016 at 4:17 PM, Barak Korren <bkorren@redhat.com> wrote:
We see a lot of these errors in the rest of the log. This means something is wrong with this vg.
Needs deeper investigation by a storage developer on both the engine and vdsm sides, but I would start by making sure we use clean luns. We are not trying to test esoteric negative flows in the system tests.
Here is the storage setup script: https://gerrit.ovirt.org/gitweb?p=ovirt-system-tests.git;a=blob;f=common/dep...
25 iscsiadm -m discovery -t sendtargets -p 127.0.0.1
26 iscsiadm -m node -L all

This is alarming. Before we serve these luns, we should log out from these nodes, and remove the nodes.
All storage used in the system tests comes from the engine VM itself, and is placed on a newly allocated QCOW2 file (exposed as /dev/sde to the engine VM), so it's unlikely the LUNs are not clean.
We did not change code related to getDeviceList lately; these getPV errors tell us that there is an issue in a lower-level component or the storage server.

Does this test pass with an older version of vdsm? engine?
Did we change something in the system tests project or lago while we were not looking?
Not likely as well: https://gerrit.ovirt.org/gitweb?p=ovirt-system-tests.git;a=shortlog
The ovirt-system-tests project has its own CI, testing against the last nightly (we will move it to the last build that passed the tests soon), so we are unlikely to merge breaking code there.
It depends on the tests. Do you have a test logging in to the target and creating a vg using the luns?
Then again we're not gating the OS packages so some breakage may have gone in via CentOS repos...
These failures are with CentOS 7.2 or 7.3? Both?
Can we reproduce this issue manually with the same engine and vdsm versions?
You have several options:

1: Get engine+vdsm builds from Jenkins:
   http://jenkins.ovirt.org/job/ovirt-engine_master_build-artifacts-fc24-x86_64...
   http://jenkins.ovirt.org/job/vdsm_master_build-artifacts-el7-x86_64/
   (Getting the exact builds that went into a given OST run takes tracing back the job invocation links from that run)
2: Use the latest experimental repo: http://resources.ovirt.org/repos/ovirt/experimental/master/latest/rpm/el7/
3: Run lago and OST locally: (as documented here: http://ovirt-system-tests.readthedocs.io/en/latest/ you'd need to pass in the vdsm and engine packages to use)
Do you know how to set up the system so it runs all the setup code up to the code that causes the getPV errors? We need to inspect the system at this point.

Nir

On Sun, Dec 18, 2016 at 7:17 PM, Nir Soffer <nsoffer@redhat.com> wrote:
On Sun, Dec 18, 2016 at 6:08 PM, Barak Korren <bkorren@redhat.com> wrote:
On 18 December 2016 at 17:26, Nir Soffer <nsoffer@redhat.com> wrote:
On Sun, Dec 18, 2016 at 4:17 PM, Barak Korren <bkorren@redhat.com> wrote:
We see a lot of these errors in the rest of the log. This means something is wrong with this vg.
Needs deeper investigation by a storage developer on both the engine and vdsm sides, but I would start by making sure we use clean luns. We are not trying to test esoteric negative flows in the system tests.
Here is the storage setup script:
https://gerrit.ovirt.org/gitweb?p=ovirt-system-tests.git;a=blob;f=common/deploy-scripts/setup_storage_unified_he_extra_iscsi_el7.sh;hb=refs/heads/master
25 iscsiadm -m discovery -t sendtargets -p 127.0.0.1
26 iscsiadm -m node -L all
This is alarming. Before we serve these luns, we should log out from these nodes, and remove the nodes.
This shows non-up-to-date code (or I have to update it). In the updated code, where it also happens, we do the following as well:

iscsiadm -m node -U all
iscsiadm -m node -o delete
systemctl stop iscsi.service
systemctl disable iscsi.service
All storage used in the system tests comes from the engine VM itself, and is placed on a newly allocated QCOW2 file (exposed as /dev/sde to the engine VM), so it's unlikely the LUNs are not clean.
We did not change code related to getDeviceList lately; these getPV errors tell us that there is an issue in a lower-level component or the storage server.
Does this test pass with an older version of vdsm? engine?
We did not test that. It's not very easy to do it in ovirt-system-tests, though I reckon it is possible with some additional work. Note that I suspect cold and live merge were not actually tested for ages / ever in ovirt-system-tests.
Did we change something in the system tests project or lago while we were not looking?
Mainly CentOS 7.2 -> CentOS 7.3 change.
Not likely as well: https://gerrit.ovirt.org/gitweb?p=ovirt-system-tests.git;a=shortlog
The ovirt-system-tests project has its own CI, testing against the last nightly (we will move it to the last build that passed the tests soon), so we are unlikely to merge breaking code there.
It depends on the tests.
Do you have a test logging in to the target and creating a vg using the luns?
Then again we're not gating the OS packages so some breakage may have gone in via CentOS repos...
These failures are with CentOS 7.2 or 7.3? Both?
Unsure.
Can we reproduce this issue manually with the same engine and vdsm versions?
You have several options:

1: Get engine+vdsm builds from Jenkins:
   http://jenkins.ovirt.org/job/ovirt-engine_master_build-artifacts-fc24-x86_64/
   http://jenkins.ovirt.org/job/vdsm_master_build-artifacts-el7-x86_64/
   (Getting the exact builds that went into a given OST run takes tracing back the job invocation links from that run)
2: Use the latest experimental repo:
   http://resources.ovirt.org/repos/ovirt/experimental/master/latest/rpm/el7/
3: Run lago and OST locally: (as documented here: http://ovirt-system-tests.readthedocs.io/en/latest/ you'd need to pass in the vdsm and engine packages to use)
That's what I do, on a daily basis.
Do you know how to set up the system so it runs all the setup code up to the code that causes the getPV errors?
Yes, that should be fairly easy to do.
We need to inspect the system at this point.
Let me know and I'll set up a live system quickly tomorrow. Y.
Nir