[ OST Failure Report ] [ oVirt 4.2 ] [ 2018-04-04 ] [006_migrations.prepare_migration_attachments_ipv6]

Test failed: [ 006_migrations.prepare_migration_attachments_ipv6 ]
Link to suspected patches: (Probably unrelated) https://gerrit.ovirt.org/#/c/89812/1 (ovirt-engine-sdk) - examples: export template to an export domain
This seems to happen multiple times sporadically, I thought this would be solved by https://gerrit.ovirt.org/#/c/89781/ but it isn't.
Link to Job: http://jenkins.ovirt.org/job/ovirt-4.2_change-queue-tester/1537/
Link to all logs: http://jenkins.ovirt.org/job/ovirt-4.2_change-queue-tester/1537/artifact/exp...
Error snippet from log:
<error>
Traceback (most recent call last): File "/usr/lib64/python2.7/unittest/case.py", line 369, in run testMethod() File "/usr/lib/python2.7/site-packages/nose/case.py", line 197, in runTest self.test(*self.arg) File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line 129, in wrapped_test test() File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line 59, in wrapper return func(get_test_prefix(), *args, **kwargs) File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line 78, in wrapper prefix.virt_env.engine_vm().get_api(api_ver=4), *args, **kwargs File "/home/jenkins/workspace/ovirt-4.2_change-queue-tester/ovirt-system-tests/basic-suite-4.2/test-scenarios/006_migrations.py", line 139, in prepare_migration_attachments_ipv6 engine, host_service, MIGRATION_NETWORK, ip_configuration) File "/home/jenkins/workspace/ovirt-4.2_change-queue-tester/ovirt-system-tests/basic-suite-4.2/test_utils/network_utils_v4.py", line 71, in modify_ip_config check_connectivity=True) File "/usr/lib64/python2.7/site-packages/ovirtsdk4/services.py", line 36729, in setup_networks return self._internal_action(action, 'setupnetworks', None, headers, query, wait) File "/usr/lib64/python2.7/site-packages/ovirtsdk4/service.py", line 299, in _internal_action return future.wait() if wait else future File "/usr/lib64/python2.7/site-packages/ovirtsdk4/service.py", line 55, in wait return self._code(response) File "/usr/lib64/python2.7/site-packages/ovirtsdk4/service.py", line 296, in callback self._check_fault(response) File "/usr/lib64/python2.7/site-packages/ovirtsdk4/service.py", line 132, in _check_fault self._raise_error(response, body) File "/usr/lib64/python2.7/site-packages/ovirtsdk4/service.py", line 118, in _raise_error raise error Error: Fault reason is "Operation Failed". Fault detail is "[Network error during communication with the Host.]". HTTP response code is 400.
</error>
-- Barak Korren RHV DevOps team , RHCE, RHCi Red Hat EMEA redhat.com | TRIED. TESTED. TRUSTED. | redhat.com/trusted
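For reference, the call that fails in this traceback is the SDK's HostService.setup_networks(), reached from network_utils_v4.modify_ip_config with check_connectivity=True. A minimal sketch of that kind of request with the Python SDK follows; the engine URL, credentials, host search, network name, NIC name and IPv6 address are placeholders, not the values the OST suite actually uses:

import ovirtsdk4 as sdk
import ovirtsdk4.types as types

# Placeholder engine URL and credentials, not the OST values.
connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',
    username='admin@internal',
    password='secret',
    insecure=True,
)

hosts_service = connection.system_service().hosts_service()
host = hosts_service.list(search='name=host-0')[0]        # placeholder host name
host_service = hosts_service.host_service(host.id)

attachment = types.NetworkAttachment(
    network=types.Network(name='migration'),               # placeholder network name
    host_nic=types.HostNic(name='eth1'),                   # placeholder NIC
    ip_address_assignments=[
        types.IpAddressAssignment(
            assignment_method=types.BootProtocol.STATIC,
            ip=types.Ip(
                address='fd8f:1391:3a82:200::1',           # placeholder IPv6 address
                netmask='64',
                version=types.IpVersion.V6,
            ),
        ),
    ],
)

# check_connectivity=True asks the engine to roll back if it loses contact with
# the host; the "Network error during communication with the Host" fault above
# is the engine reporting exactly that situation.
host_service.setup_networks(
    modified_network_attachments=[attachment],
    check_connectivity=True,
)
connection.close()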

On Wed, Apr 4, 2018 at 4:59 PM, Barak Korren <bkorren@redhat.com> wrote:
Test failed: [ 006_migrations.prepare_migration_attachments_ipv6 ]
Link to suspected patches: (Probably unrelated) https://gerrit.ovirt.org/#/c/89812/1 (ovirt-engine-sdk) - examples: export template to an export domain
This seems to happen multiple times sporadically, I thought this would be solved by https://gerrit.ovirt.org/#/c/89781/ but it isn't.
right, it is a completely unrelated issue there (with external networks). here, however, the host dies while setting setupNetworks of an ipv6 address. Setup network waits for Engine's confirmation at 08:33:00,711 http://jenkins.ovirt.org/job/ovirt-4.2_change-queue-tester/1537/artifact/exp... but kernel messages stop at 08:33:23 http://jenkins.ovirt.org/job/ovirt-4.2_change-queue-tester/1537/artifact/exp... Does the lago VM of this host crash? pause?
Link to Job: http://jenkins.ovirt.org/job/ovirt-4.2_change-queue-tester/1537/
Link to all logs: http://jenkins.ovirt.org/job/ovirt-4.2_change-queue-tester/1537/artifact/exp...
Error snippet from log:
<error>
Traceback (most recent call last): File "/usr/lib64/python2.7/unittest/case.py", line 369, in run testMethod() File "/usr/lib/python2.7/site-packages/nose/case.py", line 197, in runTest self.test(*self.arg) File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line 129, in wrapped_test test() File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line 59, in wrapper return func(get_test_prefix(), *args, **kwargs) File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line 78, in wrapper prefix.virt_env.engine_vm().get_api(api_ver=4), *args, **kwargs File "/home/jenkins/workspace/ovirt-4.2_change-queue-tester/ovirt-system-tests/basic-suite-4.2/test-scenarios/006_migrations.py", line 139, in prepare_migration_attachments_ipv6 engine, host_service, MIGRATION_NETWORK, ip_configuration) File "/home/jenkins/workspace/ovirt-4.2_change-queue-tester/ovirt-system-tests/basic-suite-4.2/test_utils/network_utils_v4.py", line 71, in modify_ip_config check_connectivity=True) File "/usr/lib64/python2.7/site-packages/ovirtsdk4/services.py", line 36729, in setup_networks return self._internal_action(action, 'setupnetworks', None, headers, query, wait) File "/usr/lib64/python2.7/site-packages/ovirtsdk4/service.py", line 299, in _internal_action return future.wait() if wait else future File "/usr/lib64/python2.7/site-packages/ovirtsdk4/service.py", line 55, in wait return self._code(response) File "/usr/lib64/python2.7/site-packages/ovirtsdk4/service.py", line 296, in callback self._check_fault(response) File "/usr/lib64/python2.7/site-packages/ovirtsdk4/service.py", line 132, in _check_fault self._raise_error(response, body) File "/usr/lib64/python2.7/site-packages/ovirtsdk4/service.py", line 118, in _raise_error raise error Error: Fault reason is "Operation Failed". Fault detail is "[Network error during communication with the Host.]". HTTP response code is 400.
</error>
-- Barak Korren RHV DevOps team , RHCE, RHCi Red Hat EMEA redhat.com | TRIED. TESTED. TRUSTED. | redhat.com/trusted

From lago's log, I see that lago collected the logs from the VMs using ssh (after the test failed), which means that the VM didn't crash.
On Wed, Apr 4, 2018 at 5:27 PM, Dan Kenigsberg <danken@redhat.com> wrote:
On Wed, Apr 4, 2018 at 4:59 PM, Barak Korren <bkorren@redhat.com> wrote:
Test failed: [ 006_migrations.prepare_migration_attachments_ipv6 ]
Link to suspected patches: (Probably unrelated) https://gerrit.ovirt.org/#/c/89812/1 (ovirt-engine-sdk) - examples: export template to an export domain
This seems to happen multiple times sporadically, I thought this would be solved by https://gerrit.ovirt.org/#/c/89781/ but it isn't.
right, it is a completely unrelated issue there (with external networks). here, however, the host dies while setting setupNetworks of an ipv6 address. Setup network waits for Engine's confirmation at 08:33:00,711 http://jenkins.ovirt.org/job/ovirt-4.2_change-queue-tester/1537/artifact/exported-artifacts/basic-suit-4.2-el7/test_logs/basic-suite-4.2/post-006_migrations.py/lago-basic-suite-4-2-host-0/_var_log/vdsm/supervdsm.log but kernel messages stop at 08:33:23 http://jenkins.ovirt.org/job/ovirt-4.2_change-queue-tester/1537/artifact/exported-artifacts/basic-suit-4.2-el7/test_logs/basic-suite-4.2/post-006_migrations.py/lago-basic-suite-4-2-host-0/_var_log/messages/*view*/
Does the lago VM of this host crash? pause?
Link to Job: http://jenkins.ovirt.org/job/ovirt-4.2_change-queue-tester/1537/
Link to all logs: http://jenkins.ovirt.org/job/ovirt-4.2_change-queue-tester/1537/artifact/exported-artifacts/basic-suit-4.2-el7/test_logs/basic-suite-4.2/post-006_migrations.py/
Error snippet from log:
<error>
Traceback (most recent call last): File "/usr/lib64/python2.7/unittest/case.py", line 369, in run testMethod() File "/usr/lib/python2.7/site-packages/nose/case.py", line 197, in runTest self.test(*self.arg) File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line 129, in wrapped_test test() File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line 59, in wrapper return func(get_test_prefix(), *args, **kwargs) File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line 78, in wrapper prefix.virt_env.engine_vm().get_api(api_ver=4), *args, **kwargs File "/home/jenkins/workspace/ovirt-4.2_change-queue-tester/ovirt-system-tests/basic-suite-4.2/test-scenarios/006_migrations.py", line 139, in prepare_migration_attachments_ipv6 engine, host_service, MIGRATION_NETWORK, ip_configuration) File "/home/jenkins/workspace/ovirt-4.2_change-queue-tester/ovirt-system-tests/basic-suite-4.2/test_utils/network_utils_v4.py", line 71, in modify_ip_config check_connectivity=True) File "/usr/lib64/python2.7/site-packages/ovirtsdk4/services.py", line 36729, in setup_networks return self._internal_action(action, 'setupnetworks', None, headers, query, wait) File "/usr/lib64/python2.7/site-packages/ovirtsdk4/service.py", line 299, in _internal_action return future.wait() if wait else future File "/usr/lib64/python2.7/site-packages/ovirtsdk4/service.py", line 55, in wait return self._code(response) File "/usr/lib64/python2.7/site-packages/ovirtsdk4/service.py", line 296, in callback self._check_fault(response) File "/usr/lib64/python2.7/site-packages/ovirtsdk4/service.py", line 132, in _check_fault self._raise_error(response, body) File "/usr/lib64/python2.7/site-packages/ovirtsdk4/service.py", line 118, in _raise_error raise error Error: Fault reason is "Operation Failed". Fault detail is "[Network error during communication with the Host.]". HTTP response code is 400.
</error>
-- Barak Korren RHV DevOps team , RHCE, RHCi Red Hat EMEA redhat.com | TRIED. TESTED. TRUSTED. | redhat.com/trusted
-- *GAL bEN HAIM* RHV DEVOPS

Dan, was there a fix for the issues? can I have a link to the fix if there was?
Thanks, Dafna
On Wed, Apr 4, 2018 at 5:01 PM, Gal Ben Haim <gbenhaim@redhat.com> wrote:
From lago's log, I see that lago collected the logs from the VMs using ssh (after the test failed), which means that the VM didn't crash.
On Wed, Apr 4, 2018 at 5:27 PM, Dan Kenigsberg <danken@redhat.com> wrote:
On Wed, Apr 4, 2018 at 4:59 PM, Barak Korren <bkorren@redhat.com> wrote:
Test failed: [ 006_migrations.prepare_migration_attachments_ipv6 ]
Link to suspected patches: (Probably unrelated) https://gerrit.ovirt.org/#/c/89812/1 (ovirt-engine-sdk) - examples: export template to an export domain
This seems to happen multiple times sporadically, I thought this would be solved by https://gerrit.ovirt.org/#/c/89781/ but it isn't.
right, it is a completely unrelated issue there (with external networks). here, however, the host dies while setting setupNetworks of an ipv6 address. Setup network waits for Engine's confirmation at 08:33:00,711 http://jenkins.ovirt.org/job/ovirt-4.2_change-queue-tester/1537/artifact/exported-artifacts/basic-suit-4.2-el7/test_logs/basic-suite-4.2/post-006_migrations.py/lago-basic-suite-4-2-host-0/_var_log/vdsm/supervdsm.log but kernel messages stop at 08:33:23 http://jenkins.ovirt.org/job/ovirt-4.2_change-queue-tester/1537/artifact/exported-artifacts/basic-suit-4.2-el7/test_logs/basic-suite-4.2/post-006_migrations.py/lago-basic-suite-4-2-host-0/_var_log/messages/*view*/
Does the lago VM of this host crash? pause?
Link to Job: http://jenkins.ovirt.org/job/ovirt-4.2_change-queue-tester/1537/
Link to all logs: http://jenkins.ovirt.org/job/ovirt-4.2_change-queue-tester/1537/artifact/exported-artifacts/basic-suit-4.2-el7/test_logs/basic-suite-4.2/post-006_migrations.py/
Error snippet from log:
<error>
Traceback (most recent call last): File "/usr/lib64/python2.7/unittest/case.py", line 369, in run testMethod() File "/usr/lib/python2.7/site-packages/nose/case.py", line 197, in runTest self.test(*self.arg) File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line 129, in wrapped_test test() File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line 59, in wrapper return func(get_test_prefix(), *args, **kwargs) File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line 78, in wrapper prefix.virt_env.engine_vm().get_api(api_ver=4), *args, **kwargs File "/home/jenkins/workspace/ovirt-4.2_change-queue-tester/ovirt-system-tests/basic-suite-4.2/test-scenarios/006_migrations.py", line 139, in prepare_migration_attachments_ipv6 engine, host_service, MIGRATION_NETWORK, ip_configuration) File "/home/jenkins/workspace/ovirt-4.2_change-queue-tester/ovirt-system-tests/basic-suite-4.2/test_utils/network_utils_v4.py", line 71, in modify_ip_config check_connectivity=True) File "/usr/lib64/python2.7/site-packages/ovirtsdk4/services.py", line 36729, in setup_networks return self._internal_action(action, 'setupnetworks', None, headers, query, wait) File "/usr/lib64/python2.7/site-packages/ovirtsdk4/service.py", line 299, in _internal_action return future.wait() if wait else future File "/usr/lib64/python2.7/site-packages/ovirtsdk4/service.py", line 55, in wait return self._code(response) File "/usr/lib64/python2.7/site-packages/ovirtsdk4/service.py", line 296, in callback self._check_fault(response) File "/usr/lib64/python2.7/site-packages/ovirtsdk4/service.py", line 132, in _check_fault self._raise_error(response, body) File "/usr/lib64/python2.7/site-packages/ovirtsdk4/service.py", line 118, in _raise_error raise error Error: Fault reason is "Operation Failed". Fault detail is "[Network error during communication with the Host.]". HTTP response code is 400.
</error>
-- Barak Korren RHV DevOps team , RHCE, RHCi Red Hat EMEA redhat.com | TRIED. TESTED. TRUSTED. | redhat.com/trusted
-- *GAL bEN HAIM* RHV DEVOPS

No, I am afraid that we have not managed to understand why setting an ipv6 address took the host off the grid. We shall continue researching this next week.
Edy, https://gerrit.ovirt.org/#/c/88637/ is already 4 weeks old, but could it possibly be related (I really doubt that)?
On Fri, Apr 6, 2018 at 2:20 PM, Dafna Ron <dron@redhat.com> wrote:
Dan, was there a fix for the issues? can I have a link to the fix if there was?
Thanks, Dafna
On Wed, Apr 4, 2018 at 5:01 PM, Gal Ben Haim <gbenhaim@redhat.com> wrote:
From lago's log, I see that lago collected the logs from the VMs using ssh (after the test failed), which means that the VM didn't crash.
On Wed, Apr 4, 2018 at 5:27 PM, Dan Kenigsberg <danken@redhat.com> wrote:
On Wed, Apr 4, 2018 at 4:59 PM, Barak Korren <bkorren@redhat.com> wrote:
Test failed: [ 006_migrations.prepare_migration_attachments_ipv6 ]
Link to suspected patches: (Probably unrelated) https://gerrit.ovirt.org/#/c/89812/1 (ovirt-engine-sdk) - examples: export template to an export domain
This seems to happen multiple times sporadically, I thought this would be solved by https://gerrit.ovirt.org/#/c/89781/ but it isn't.
right, it is a completely unrelated issue there (with external networks). here, however, the host dies while setting setupNetworks of an ipv6 address. Setup network waits for Engine's confirmation at 08:33:00,711
http://jenkins.ovirt.org/job/ovirt-4.2_change-queue-tester/1537/artifact/exp... but kernel messages stop at 08:33:23
http://jenkins.ovirt.org/job/ovirt-4.2_change-queue-tester/1537/artifact/exp...
Does the lago VM of this host crash? pause?
Link to Job: http://jenkins.ovirt.org/job/ovirt-4.2_change-queue-tester/1537/
Link to all logs:
http://jenkins.ovirt.org/job/ovirt-4.2_change-queue-tester/1537/artifact/exp...
Error snippet from log:
<error>
Traceback (most recent call last): File "/usr/lib64/python2.7/unittest/case.py", line 369, in run testMethod() File "/usr/lib/python2.7/site-packages/nose/case.py", line 197, in runTest self.test(*self.arg) File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line 129, in wrapped_test test() File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line 59, in wrapper return func(get_test_prefix(), *args, **kwargs) File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line 78, in wrapper prefix.virt_env.engine_vm().get_api(api_ver=4), *args, **kwargs File "/home/jenkins/workspace/ovirt-4.2_change-queue-tester/ovirt-system-tests/basic-suite-4.2/test-scenarios/006_migrations.py", line 139, in prepare_migration_attachments_ipv6 engine, host_service, MIGRATION_NETWORK, ip_configuration) File "/home/jenkins/workspace/ovirt-4.2_change-queue-tester/ovirt-system-tests/basic-suite-4.2/test_utils/network_utils_v4.py", line 71, in modify_ip_config check_connectivity=True) File "/usr/lib64/python2.7/site-packages/ovirtsdk4/services.py", line 36729, in setup_networks return self._internal_action(action, 'setupnetworks', None, headers, query, wait) File "/usr/lib64/python2.7/site-packages/ovirtsdk4/service.py", line 299, in _internal_action return future.wait() if wait else future File "/usr/lib64/python2.7/site-packages/ovirtsdk4/service.py", line 55, in wait return self._code(response) File "/usr/lib64/python2.7/site-packages/ovirtsdk4/service.py", line 296, in callback self._check_fault(response) File "/usr/lib64/python2.7/site-packages/ovirtsdk4/service.py", line 132, in _check_fault self._raise_error(response, body) File "/usr/lib64/python2.7/site-packages/ovirtsdk4/service.py", line 118, in _raise_error raise error Error: Fault reason is "Operation Failed". Fault detail is "[Network error during communication with the Host.]". HTTP response code is 400.
</error>
-- Barak Korren RHV DevOps team , RHCE, RHCi Red Hat EMEA redhat.com | TRIED. TESTED. TRUSTED. | redhat.com/trusted
-- GAL bEN HAIM RHV DEVOPS

On 7 April 2018 at 00:30, Dan Kenigsberg <danken@redhat.com> wrote:
No, I am afraid that we have not managed to understand why setting an ipv6 address took the host off the grid. We shall continue researching this next week.
Edy, https://gerrit.ovirt.org/#/c/88637/ is already 4 weeks old, but could it possibly be related (I really doubt that)?
at this point I think we should seriously consider disabling the relevant test, as it's impacting a large number of changes.
On Fri, Apr 6, 2018 at 2:20 PM, Dafna Ron <dron@redhat.com> wrote:
Dan, was there a fix for the issues? can I have a link to the fix if there was?
Thanks, Dafna
On Wed, Apr 4, 2018 at 5:01 PM, Gal Ben Haim <gbenhaim@redhat.com> wrote:
From lago's log, I see that lago collected the logs from the VMs using ssh (after the test failed), which means that the VM didn't crash.
On Wed, Apr 4, 2018 at 5:27 PM, Dan Kenigsberg <danken@redhat.com> wrote:
On Wed, Apr 4, 2018 at 4:59 PM, Barak Korren <bkorren@redhat.com> wrote:
Test failed: [ 006_migrations.prepare_migration_attachments_ipv6 ]
Link to suspected patches: (Probably unrelated) https://gerrit.ovirt.org/#/c/89812/1 (ovirt-engine-sdk) - examples: export template to an export domain
This seems to happen multiple times sporadically, I thought this would be solved by https://gerrit.ovirt.org/#/c/89781/ but it isn't.
right, it is a completely unrelated issue there (with external networks). here, however, the host dies while setting setupNetworks of an ipv6 address. Setup network waits for Engine's confirmation at 08:33:00,711
http://jenkins.ovirt.org/job/ovirt-4.2_change-queue-tester/1537/artifact/exp... but kernel messages stop at 08:33:23
http://jenkins.ovirt.org/job/ovirt-4.2_change-queue-tester/1537/artifact/exp...
Does the lago VM of this host crash? pause?
Link to Job: http://jenkins.ovirt.org/job/ovirt-4.2_change-queue-tester/1537/
Link to all logs:
http://jenkins.ovirt.org/job/ovirt-4.2_change-queue-tester/1537/artifact/exp...
Error snippet from log:
<error>
Traceback (most recent call last): File "/usr/lib64/python2.7/unittest/case.py", line 369, in run testMethod() File "/usr/lib/python2.7/site-packages/nose/case.py", line 197, in runTest self.test(*self.arg) File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line 129, in wrapped_test test() File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line 59, in wrapper return func(get_test_prefix(), *args, **kwargs) File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line 78, in wrapper prefix.virt_env.engine_vm().get_api(api_ver=4), *args, **kwargs File "/home/jenkins/workspace/ovirt-4.2_change-queue-tester/ovirt-system-tests/basic-suite-4.2/test-scenarios/006_migrations.py", line 139, in prepare_migration_attachments_ipv6 engine, host_service, MIGRATION_NETWORK, ip_configuration) File "/home/jenkins/workspace/ovirt-4.2_change-queue-tester/ovirt-system-tests/basic-suite-4.2/test_utils/network_utils_v4.py", line 71, in modify_ip_config check_connectivity=True) File "/usr/lib64/python2.7/site-packages/ovirtsdk4/services.py", line 36729, in setup_networks return self._internal_action(action, 'setupnetworks', None, headers, query, wait) File "/usr/lib64/python2.7/site-packages/ovirtsdk4/service.py", line 299, in _internal_action return future.wait() if wait else future File "/usr/lib64/python2.7/site-packages/ovirtsdk4/service.py", line 55, in wait return self._code(response) File "/usr/lib64/python2.7/site-packages/ovirtsdk4/service.py", line 296, in callback self._check_fault(response) File "/usr/lib64/python2.7/site-packages/ovirtsdk4/service.py", line 132, in _check_fault self._raise_error(response, body) File "/usr/lib64/python2.7/site-packages/ovirtsdk4/service.py", line 118, in _raise_error raise error Error: Fault reason is "Operation Failed". Fault detail is "[Network error during communication with the Host.]". HTTP response code is 400.
</error>
-- Barak Korren RHV DevOps team , RHCE, RHCi Red Hat EMEA redhat.com | TRIED. TESTED. TRUSTED. | redhat.com/trusted
-- GAL bEN HAIM RHV DEVOPS
-- Barak Korren RHV DevOps team , RHCE, RHCi Red Hat EMEA redhat.com | TRIED. TESTED. TRUSTED. | redhat.com/trusted

Was already done by Yaniv - https://gerrit.ovirt.org/#/c/89851. Is it still failing?
On Sun, Apr 8, 2018 at 8:59 AM, Barak Korren <bkorren@redhat.com> wrote:
On 7 April 2018 at 00:30, Dan Kenigsberg <danken@redhat.com> wrote:
No, I am afraid that we have not managed to understand why setting an ipv6 address took the host off the grid. We shall continue researching this next week.
Edy, https://gerrit.ovirt.org/#/c/88637/ is already 4 weeks old, but could it possibly be related (I really doubt that)?
at this point I think we should seriously consider disabling the relevant test, as it's impacting a large number of changes.
On Fri, Apr 6, 2018 at 2:20 PM, Dafna Ron <dron@redhat.com> wrote:
Dan, was there a fix for the issues? can I have a link to the fix if there was?
Thanks, Dafna
On Wed, Apr 4, 2018 at 5:01 PM, Gal Ben Haim <gbenhaim@redhat.com> wrote:
From lago's log, I see that lago collected the logs from the VMs using ssh (after the test failed), which means that the VM didn't crash.
On Wed, Apr 4, 2018 at 5:27 PM, Dan Kenigsberg <danken@redhat.com> wrote:
On Wed, Apr 4, 2018 at 4:59 PM, Barak Korren <bkorren@redhat.com> wrote:
Test failed: [ 006_migrations.prepare_migration_attachments_ipv6 ]
Link to suspected patches: (Probably unrelated) https://gerrit.ovirt.org/#/c/89812/1 (ovirt-engine-sdk) - examples: export template to an export domain
This seems to happen multiple times sporadically, I thought this would be solved by https://gerrit.ovirt.org/#/c/89781/ but it isn't.
right, it is a completely unrelated issue there (with external networks). here, however, the host dies while setting setupNetworks of an ipv6 address. Setup network waits for Engine's confirmation at 08:33:00,711
http://jenkins.ovirt.org/job/ovirt-4.2_change-queue-tester/1537/artifact/exported-artifacts/basic-suit-4.2-el7/test_logs/basic-suite-4.2/post-006_migrations.py/lago-basic-suite-4-2-host-0/_var_log/vdsm/supervdsm.log but kernel messages stop at 08:33:23
http://jenkins.ovirt.org/job/ovirt-4.2_change-queue-tester/1537/artifact/exported-artifacts/basic-suit-4.2-el7/test_logs/basic-suite-4.2/post-006_migrations.py/lago-basic-suite-4-2-host-0/_var_log/messages/*view*/
Does the lago VM of this host crash? pause?
Link to Job: http://jenkins.ovirt.org/job/ovirt-4.2_change-queue-tester/1537/
Link to all logs: http://jenkins.ovirt.org/job/ovirt-4.2_change-queue-tester/1537/artifact/exported-artifacts/basic-suit-4.2-el7/test_logs/basic-suite-4.2/post-006_migrations.py/
Error snippet from log:
<error>
Traceback (most recent call last): File "/usr/lib64/python2.7/unittest/case.py", line 369, in run testMethod() File "/usr/lib/python2.7/site-packages/nose/case.py", line 197, in runTest self.test(*self.arg) File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line 129, in wrapped_test test() File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line 59, in wrapper return func(get_test_prefix(), *args, **kwargs) File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line 78, in wrapper prefix.virt_env.engine_vm().get_api(api_ver=4), *args, **kwargs File "/home/jenkins/workspace/ovirt-4.2_change-queue-tester/ovirt-system-tests/basic-suite-4.2/test-scenarios/006_migrations.py", line 139, in prepare_migration_attachments_ipv6 engine, host_service, MIGRATION_NETWORK, ip_configuration) File "/home/jenkins/workspace/ovirt-4.2_change-queue-tester/ovirt-system-tests/basic-suite-4.2/test_utils/network_utils_v4.py", line 71, in modify_ip_config check_connectivity=True) File "/usr/lib64/python2.7/site-packages/ovirtsdk4/services.py", line 36729, in setup_networks return self._internal_action(action, 'setupnetworks', None, headers, query, wait) File "/usr/lib64/python2.7/site-packages/ovirtsdk4/service.py", line 299, in _internal_action return future.wait() if wait else future File "/usr/lib64/python2.7/site-packages/ovirtsdk4/service.py", line 55, in wait return self._code(response) File "/usr/lib64/python2.7/site-packages/ovirtsdk4/service.py", line 296, in callback self._check_fault(response) File "/usr/lib64/python2.7/site-packages/ovirtsdk4/service.py", line 132, in _check_fault self._raise_error(response, body) File "/usr/lib64/python2.7/site-packages/ovirtsdk4/service.py", line 118, in _raise_error raise error Error: Fault reason is "Operation Failed". Fault detail is "[Network error during communication with the Host.]". HTTP response code is 400.
</error>
-- Barak Korren RHV DevOps team , RHCE, RHCi Red Hat EMEA redhat.com | TRIED. TESTED. TRUSTED. | redhat.com/trusted
-- GAL bEN HAIM RHV DEVOPS
-- Barak Korren RHV DevOps team , RHCE, RHCi Red Hat EMEA redhat.com | TRIED. TESTED. TRUSTED. | redhat.com/trusted
-- Eyal edri MANAGER RHV DevOps EMEA VIRTUALIZATION R&D Red Hat EMEA <https://www.redhat.com/> <https://red.ht/sig> TRIED. TESTED. TRUSTED. <https://redhat.com/trusted> phone: +972-9-7692018 irc: eedri (on #tlv #rhev-dev #rhev-integ)

On Sun, Apr 8, 2018 at 9:15 AM, Eyal Edri <eedri@redhat.com> wrote:
Was already done by Yaniv - https://gerrit.ovirt.org/#/c/89851. Is it still failing?
On Sun, Apr 8, 2018 at 8:59 AM, Barak Korren <bkorren@redhat.com> wrote:
On 7 April 2018 at 00:30, Dan Kenigsberg <danken@redhat.com> wrote:
No, I am afraid that we have not managed to understand why setting an ipv6 address took the host off the grid. We shall continue researching this next week.
Edy, https://gerrit.ovirt.org/#/c/88637/ is already 4 weeks old, but could it possibly be related (I really doubt that)?
Sorry, but I do not see how this problem is related to VDSM. There is nothing that indicates that there is a VDSM problem. Has the RPC connection between Engine and VDSM failed?
at this point I think we should seriously consider disabling the relevant test, as it's impacting a large number of changes.
On Fri, Apr 6, 2018 at 2:20 PM, Dafna Ron <dron@redhat.com> wrote:
Dan, was there a fix for the issues? can I have a link to the fix if there was?
Thanks, Dafna
On Wed, Apr 4, 2018 at 5:01 PM, Gal Ben Haim <gbenhaim@redhat.com> wrote:
From lago's log, I see that lago collected the logs from the VMs using ssh (after the test failed), which means that the VM didn't crash.
On Wed, Apr 4, 2018 at 5:27 PM, Dan Kenigsberg <danken@redhat.com> wrote:
On Wed, Apr 4, 2018 at 4:59 PM, Barak Korren <bkorren@redhat.com> wrote:
Test failed: [ 006_migrations.prepare_migration_attachments_ipv6 ]
Link to suspected patches: (Probably unrelated) https://gerrit.ovirt.org/#/c/89812/1 (ovirt-engine-sdk) - examples: export template to an export domain
This seems to happen multiple times sporadically, I thought this would be solved by https://gerrit.ovirt.org/#/c/89781/ but it isn't.
right, it is a completely unrelated issue there (with external networks). here, however, the host dies while setting setupNetworks of an ipv6 address. Setup network waits for Engine's confirmation at 08:33:00,711
http://jenkins.ovirt.org/job/ovirt-4.2_change-queue-tester/1537/artifact/exported-artifacts/basic-suit-4.2-el7/test_logs/basic-suite-4.2/post-006_migrations.py/lago-basic-suite-4-2-host-0/_var_log/vdsm/supervdsm.log but kernel messages stop at 08:33:23
http://jenkins.ovirt.org/job/ovirt-4.2_change-queue-tester/1537/artifact/exported-artifacts/basic-suit-4.2-el7/test_logs/basic-suite-4.2/post-006_migrations.py/lago-basic-suite-4-2-host-0/_var_log/messages/*view*/
Does the lago VM of this host crash? pause?
Link to Job: http://jenkins.ovirt.org/job/ovirt-4.2_change-queue-tester/1537/
Link to all logs: http://jenkins.ovirt.org/job/ovirt-4.2_change-queue-tester/1537/artifact/exported-artifacts/basic-suit-4.2-el7/test_logs/basic-suite-4.2/post-006_migrations.py/
Error snippet from log:
<error>
Traceback (most recent call last): File "/usr/lib64/python2.7/unittest/case.py", line 369, in run testMethod() File "/usr/lib/python2.7/site-packages/nose/case.py", line 197, in runTest self.test(*self.arg) File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line 129, in wrapped_test test() File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line 59, in wrapper return func(get_test_prefix(), *args, **kwargs) File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line 78, in wrapper prefix.virt_env.engine_vm().get_api(api_ver=4), *args, **kwargs File "/home/jenkins/workspace/ovirt-4.2_change-queue-tester/ovirt-system-tests/basic-suite-4.2/test-scenarios/006_migrations.py", line 139, in prepare_migration_attachments_ipv6 engine, host_service, MIGRATION_NETWORK, ip_configuration) File "/home/jenkins/workspace/ovirt-4.2_change-queue-tester/ovirt-system-tests/basic-suite-4.2/test_utils/network_utils_v4.py", line 71, in modify_ip_config check_connectivity=True) File "/usr/lib64/python2.7/site-packages/ovirtsdk4/services.py", line 36729, in setup_networks return self._internal_action(action, 'setupnetworks', None, headers, query, wait) File "/usr/lib64/python2.7/site-packages/ovirtsdk4/service.py", line 299, in _internal_action return future.wait() if wait else future File "/usr/lib64/python2.7/site-packages/ovirtsdk4/service.py", line 55, in wait return self._code(response) File "/usr/lib64/python2.7/site-packages/ovirtsdk4/service.py", line 296, in callback self._check_fault(response) File "/usr/lib64/python2.7/site-packages/ovirtsdk4/service.py", line 132, in _check_fault self._raise_error(response, body) File "/usr/lib64/python2.7/site-packages/ovirtsdk4/service.py", line 118, in _raise_error raise error Error: Fault reason is "Operation Failed". Fault detail is "[Network error during communication with the Host.]". HTTP response code is 400.
</error>
-- Barak Korren RHV DevOps team , RHCE, RHCi Red Hat EMEA redhat.com | TRIED. TESTED. TRUSTED. | redhat.com/trusted
-- GAL bEN HAIM RHV DEVOPS
-- Barak Korren RHV DevOps team , RHCE, RHCi Red Hat EMEA redhat.com | TRIED. TESTED. TRUSTED. | redhat.com/trusted
--
Eyal edri
MANAGER
RHV DevOps
EMEA VIRTUALIZATION R&D
Red Hat EMEA <https://www.redhat.com/> <https://red.ht/sig> TRIED. TESTED. TRUSTED. <https://redhat.com/trusted> phone: +972-9-7692018 irc: eedri (on #tlv #rhev-dev #rhev-integ)

On Sun, Apr 8, 2018 at 9:21 AM, Edward Haas <ehaas@redhat.com> wrote:
On Sun, Apr 8, 2018 at 9:15 AM, Eyal Edri <eedri@redhat.com> wrote:
Was already done by Yaniv - https://gerrit.ovirt.org/#/c/89851. Is it still failing?
On Sun, Apr 8, 2018 at 8:59 AM, Barak Korren <bkorren@redhat.com> wrote:
On 7 April 2018 at 00:30, Dan Kenigsberg <danken@redhat.com> wrote:
No, I am afraid that we have not managed to understand why setting an ipv6 address took the host off the grid. We shall continue researching this next week.
Edy, https://gerrit.ovirt.org/#/c/88637/ is already 4 weeks old, but could it possibly be related (I really doubt that)?
Sorry, but I do not see how this problem is related to VDSM. There is nothing that indicates that there is a VDSM problem.
Has the RPC connection between Engine and VDSM failed?
Further up the thread, Piotr noticed that (at least on one failure of this test) the Vdsm host lost connectivity to its storage, and the Vdsm process was restarted. However, this does not seem to happen in all cases where this test fails.

Hi all,
Looking at the log it seems that the new GetCapabilitiesAsync is responsible for the mess.
- * 08:29:47 - engine loses connectivity to host 'lago-basic-suite-4-2-host-0'.*
*- Every 3 seconds a getCapabilitiesAsync request is sent to the host (unsuccessfully).*
* before each "getCapabilitiesAsync" the monitoring lock is taken (VdsManager,refreshImpl)
* "getCapabilitiesAsync" immediately fails and throws 'VDSNetworkException: java.net.ConnectException: Connection refused'. The exception is caught by 'GetCapabilitiesAsyncVDSCommand.executeVdsBrokerCommand' which calls 'onFailure' of the callback and re-throws the exception.
catch (Throwable t) { getParameters().getCallback().onFailure(t); throw t; }
* The 'onFailure' of the callback releases the "monitoringLock" ('postProcessRefresh()->afterRefreshTreatment()-> if (!succeeded) lockManager.releaseLock(monitoringLock);')
* 'VdsManager,refreshImpl' catches the network exception, marks 'releaseLock = true' and *tries to release the already released lock*.
The following warning is printed to the log -
WARN [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (EE-ManagedThreadFactory-engineScheduled-Thread-53) [] Trying to release exclusive lock which does not exist, lock key: 'ecf53d69-eb68-4b11-8df2-c4aa4e19bd93VDS_INIT'
*- 08:30:51 a successful getCapabilitiesAsync is sent.*
*- 08:32:55 - The failing test starts (Setup Networks for setting ipv6). *
* SetupNetworks takes the monitoring lock.
*- 08:33:00 - ResponseTracker cleans the getCapabilitiesAsync requests from 4 minutes ago from its queue and prints a VDSNetworkException: Vds timeout occured.*
* When the first request is removed from the queue ('ResponseTracker.remove()'), the *'Callback.onFailure' is invoked (for the second time) -> monitoring lock is released (the lock taken by the SetupNetworks!).*
* *The other requests removed from the queue also try to release the monitoring lock*, but there is nothing to release.
* The following warning log is printed - WARN [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (EE-ManagedThreadFactory-engineScheduled-Thread-14) [] Trying to release exclusive lock which does not exist, lock key: 'ecf53d69-eb68-4b11-8df2-c4aa4e19bd93VDS_INIT'
- *08:33:00 - SetupNetwork fails on Timeout ~4 seconds after it started*. Why? I'm not 100% sure, but I guess the late processing of the 'getCapabilitiesAsync' requests, which causes the loss of the monitoring lock, and the late + multiple processing of the failure are the root cause.
Ravi, 'getCapabilitiesAsync' failure is treated twice and the lock is trying to be released three times. Please share your opinion regarding how it should be fixed.
Thanks,
Alona.
On Sun, Apr 8, 2018 at 1:21 PM, Dan Kenigsberg <danken@redhat.com> wrote:
On Sun, Apr 8, 2018 at 9:21 AM, Edward Haas <ehaas@redhat.com> wrote:
On Sun, Apr 8, 2018 at 9:15 AM, Eyal Edri <eedri@redhat.com> wrote:
Was already done by Yaniv - https://gerrit.ovirt.org/#/c/89851. Is it still failing?
On Sun, Apr 8, 2018 at 8:59 AM, Barak Korren <bkorren@redhat.com> wrote:
On 7 April 2018 at 00:30, Dan Kenigsberg <danken@redhat.com> wrote:
No, I am afraid that we have not managed to understand why setting an ipv6 address took the host off the grid. We shall continue researching this next week.
Edy, https://gerrit.ovirt.org/#/c/88637/ is already 4 weeks old, but could it possibly be related (I really doubt that)?
Sorry, but I do not see how this problem is related to VDSM. There is nothing that indicates that there is a VDSM problem.
Has the RPC connection between Engine and VDSM failed?
Further up the thread, Piotr noticed that (at least on one failure of this test) the Vdsm host lost connectivity to its storage, and the Vdsm process was restarted. However, this does not seem to happen in all cases where this test fails.
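To make the race easier to follow, here is a small self-contained Python sketch of the sequence described in the GetCapabilitiesAsync analysis above - not the engine's actual Java code, and all names are invented. The failure callback that frees the monitoring lock runs once on the immediate connection error and again when the response tracker expires the stale request, so the second run frees a lock that SetupNetworks has meanwhile taken; the one-shot guard at the end shows one possible way to make the failure handling idempotent:

import threading

class LockManager:
    """Toy in-memory lock manager; releasing a key nobody holds only logs a warning."""

    def __init__(self):
        self._held = {}
        self._mutex = threading.Lock()

    def acquire(self, key, owner):
        with self._mutex:
            if key in self._held:
                raise RuntimeError('%s already held by %s' % (key, self._held[key]))
            self._held[key] = owner

    def release(self, key):
        with self._mutex:
            if key not in self._held:
                print('WARN Trying to release exclusive lock which does not exist, lock key: %r' % key)
                return
            del self._held[key]

locks = LockManager()
MONITORING_LOCK = 'VDS_INIT'

class GetCapsFailureCallback:
    """Failure handler for a getCapabilitiesAsync request: frees the monitoring lock."""

    def on_failure(self, error):
        locks.release(MONITORING_LOCK)

def refresh_host_caps(callback):
    """Monitoring flow: take the lock, send the async request, host refuses the connection."""
    locks.acquire(MONITORING_LOCK, owner='monitoring')
    try:
        raise ConnectionRefusedError('Connection refused')
    except ConnectionRefusedError as err:
        callback.on_failure(err)       # 1st invocation: releases the monitoring lock
        # the refresh flow then tries to release again -> "lock does not exist" warning
        locks.release(MONITORING_LOCK)

cb = GetCapsFailureCallback()
refresh_host_caps(cb)

# Later, SetupNetworks acquires the same monitoring lock for its own flow...
locks.acquire(MONITORING_LOCK, owner='setup_networks')

# ...then the response tracker expires the stale getCapabilitiesAsync request and
# invokes the same callback a 2nd time: it now frees the lock SetupNetworks holds.
cb.on_failure(TimeoutError('Vds timeout occured'))
# Further expired requests only hit the warning, as in the engine log.
cb.on_failure(TimeoutError('Vds timeout occured'))

class OneShotCallback(GetCapsFailureCallback):
    """One possible guard: process a request's failure at most once."""

    def __init__(self):
        self._fired = False

    def on_failure(self, error):
        if self._fired:
            return
        self._fired = True
        super().on_failure(error)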

I'm seeing the same error in [1], during 006_migrations.migrate_vm.
[1] http://jenkins.ovirt.org/job/ovirt-4.2_change-queue-tester/1650/
On Tue, Apr 10, 2018 at 4:14 PM, Alona Kaplan <alkaplan@redhat.com> wrote:
Hi all,
Looking at the log it seems that the new GetCapabilitiesAsync is responsible for the mess.
- * 08:29:47 - engine loses connectivity to host 'lago-basic-suite-4-2-host-0'.*
*- Every 3 seconds a getCapabilitiesAsync request is sent to the host (unsuccessfully).*
* before each "getCapabilitiesAsync" the monitoring lock is taken (VdsManager,refreshImpl)
* "getCapabilitiesAsync" immediately fails and throws 'VDSNetworkException: java.net.ConnectException: Connection refused'. The exception is caught by 'GetCapabilitiesAsyncVDSCommand.executeVdsBrokerCommand' which calls 'onFailure' of the callback and re-throws the exception.
catch (Throwable t) { getParameters().getCallback().onFailure(t); throw t; }
* The 'onFailure' of the callback releases the "monitoringLock" ('postProcessRefresh()->afterRefreshTreatment()-> if (!succeeded) lockManager.releaseLock(monitoringLock);')
* 'VdsManager,refreshImpl' catches the network exception, marks 'releaseLock = true' and *tries to release the already released lock*.
The following warning is printed to the log -
WARN [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (EE-ManagedThreadFactory-engineScheduled-Thread-53) [] Trying to release exclusive lock which does not exist, lock key: 'ecf53d69-eb68-4b11-8df2-c4aa4e19bd93VDS_INIT'
*- 08:30:51 a successful getCapabilitiesAsync is sent.*
*- 08:32:55 - The failing test starts (Setup Networks for setting ipv6). *
* SetupNetworks takes the monitoring lock.
*- 08:33:00 - ResponseTracker cleans the getCapabilitiesAsync requests from 4 minutes ago from its queue and prints a VDSNetworkException: Vds timeout occured.*
* When the first request is removed from the queue ('ResponseTracker.remove()'), the *'Callback.onFailure' is invoked (for the second time) -> monitoring lock is released (the lock taken by the SetupNetworks!).*
* *The other requests removed from the queue also try to release the monitoring lock*, but there is nothing to release.
* The following warning log is printed - WARN [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (EE-ManagedThreadFactory-engineScheduled-Thread-14) [] Trying to release exclusive lock which does not exist, lock key: 'ecf53d69-eb68-4b11-8df2-c4aa4e19bd93VDS_INIT'
- *08:33:00 - SetupNetwork fails on Timeout ~4 seconds after it started*. Why? I'm not 100% sure, but I guess the late processing of the 'getCapabilitiesAsync' requests, which causes the loss of the monitoring lock, and the late + multiple processing of the failure are the root cause.
Ravi, 'getCapabilitiesAsync' failure is treated twice and the lock is trying to be released three times. Please share your opinion regarding how it should be fixed.
Thanks,
Alona.
On Sun, Apr 8, 2018 at 1:21 PM, Dan Kenigsberg <danken@redhat.com> wrote:
On Sun, Apr 8, 2018 at 9:21 AM, Edward Haas <ehaas@redhat.com> wrote:
On Sun, Apr 8, 2018 at 9:15 AM, Eyal Edri <eedri@redhat.com> wrote:
Was already done by Yaniv - https://gerrit.ovirt.org/#/c/89851. Is it still failing?
On Sun, Apr 8, 2018 at 8:59 AM, Barak Korren <bkorren@redhat.com> wrote:
On 7 April 2018 at 00:30, Dan Kenigsberg <danken@redhat.com> wrote:
No, I am afraid that we have not managed to understand why setting an ipv6 address took the host off the grid. We shall continue researching this next week.
Edy, https://gerrit.ovirt.org/#/c/88637/ is already 4 weeks old, but could it possibly be related (I really doubt that)?
Sorry, but I do not see how this problem is related to VDSM. There is nothing that indicates that there is a VDSM problem.
Has the RPC connection between Engine and VDSM failed?
Further up the thread, Piotr noticed that (at least on one failure of this test) the Vdsm host lost connectivity to its storage, and the Vdsm process was restarted. However, this does not seem to happen in all cases where this test fails.
-- *GAL bEN HAIM* RHV DEVOPS

On Tue, Apr 10, 2018 at 6:52 PM, Gal Ben Haim <gbenhaim@redhat.com> wrote:
I'm seeing the same error in [1], during 006_migrations.migrate_vm.
[1] http://jenkins.ovirt.org/job/ovirt-4.2_change-queue-tester/1650/
Seems like another bug. The migration failed since for some reason the vm is already defined on the destination host.
2018-04-10 11:08:08,685-0400 ERROR (jsonrpc/0) [api] FINISH create error=Virtual machine already exists (api:129) Traceback (most recent call last): File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 122, in method ret = func(*args, **kwargs) File "/usr/lib/python2.7/site-packages/vdsm/API.py", line 191, in create raise exception.VMExists() VMExists: Virtual machine already exists
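For context on that error path, vdsm's create call refuses a VM whose id is already present in its VM container, which is roughly what the traceback above shows (API.py raising exception.VMExists). A schematic, simplified sketch of that guard, with invented class names and a placeholder VM id - not vdsm's actual code:

class VMExists(Exception):
    pass

class FakeClientIF:
    """Stand-in for the per-host VM registry; real vdsm keeps this state in clientIF."""

    def __init__(self):
        self.vmContainer = {}          # vmId -> VM object

    def create_vm(self, vm_params):
        vm_id = vm_params['vmId']
        if vm_id in self.vmContainer:
            # The destination host hits this branch: the VM left over from the
            # earlier migration was never destroyed, so a new create is refused.
            raise VMExists('Virtual machine already exists')
        self.vmContainer[vm_id] = object()
        return self.vmContainer[vm_id]

cif = FakeClientIF()
cif.create_vm({'vmId': 'a612fcb2-0000-0000-0000-000000000001'})   # placeholder id
try:
    cif.create_vm({'vmId': 'a612fcb2-0000-0000-0000-000000000001'})
except VMExists as err:
    print('second create refused:', err)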
On Tue, Apr 10, 2018 at 4:14 PM, Alona Kaplan <alkaplan@redhat.com> wrote:
Hi all,
Looking at the log it seems that the new GetCapabilitiesAsync is responsible for the mess.
- * 08:29:47 - engine loses connectivity to host 'lago-basic-suite-4-2-host-0'.*
*- Every 3 seconds a getCapabilitiesAsync request is sent to the host (unsuccessfully).*
* before each "getCapabilitiesAsync" the monitoring lock is taken (VdsManager,refreshImpl)
* "getCapabilitiesAsync" immediately fails and throws 'VDSNetworkException: java.net.ConnectException: Connection refused'. The exception is caught by 'GetCapabilitiesAsyncVDSCommand.executeVdsBrokerCommand' which calls 'onFailure' of the callback and re-throws the exception.
catch (Throwable t) { getParameters().getCallback().onFailure(t); throw t; }
* The 'onFailure' of the callback releases the "monitoringLock" ('postProcessRefresh()->afterRefreshTreatment()-> if (!succeeded) lockManager.releaseLock(monitoringLock);')
* 'VdsManager,refreshImpl' catches the network exception, marks 'releaseLock = true' and *tries to release the already released lock*.
The following warning is printed to the log -
WARN [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (EE-ManagedThreadFactory-engineScheduled-Thread-53) [] Trying to release exclusive lock which does not exist, lock key: 'ecf53d69-eb68-4b11-8df2-c4aa4e19bd93VDS_INIT'
*- 08:30:51 a successful getCapabilitiesAsync is sent.*
*- 08:32:55 - The failing test starts (Setup Networks for setting ipv6). *
* SetupNetworks takes the monitoring lock.
*- 08:33:00 - ResponseTracker cleans the getCapabilitiesAsync requests from 4 minutes ago from its queue and prints a VDSNetworkException: Vds timeout occured.*
* When the first request is removed from the queue ('ResponseTracker.remove()'), the *'Callback.onFailure' is invoked (for the second time) -> monitoring lock is released (the lock taken by the SetupNetworks!).*
* *The other requests removed from the queue also try to release the monitoring lock*, but there is nothing to release.
* The following warning log is printed - WARN [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (EE-ManagedThreadFactory-engineScheduled-Thread-14) [] Trying to release exclusive lock which does not exist, lock key: 'ecf53d69-eb68-4b11-8df2-c4aa4e19bd93VDS_INIT'
- *08:33:00 - SetupNetwork fails on Timeout ~4 seconds after it started*. Why? I'm not 100% sure, but I guess the late processing of the 'getCapabilitiesAsync' requests, which causes the loss of the monitoring lock, and the late + multiple processing of the failure are the root cause.
Ravi, 'getCapabilitiesAsync' failure is treated twice and the lock is trying to be released three times. Please share your opinion regarding how it should be fixed.
Thanks,
Alona.
On Sun, Apr 8, 2018 at 1:21 PM, Dan Kenigsberg <danken@redhat.com> wrote:
On Sun, Apr 8, 2018 at 9:21 AM, Edward Haas <ehaas@redhat.com> wrote:
On Sun, Apr 8, 2018 at 9:15 AM, Eyal Edri <eedri@redhat.com> wrote:
Was already done by Yaniv - https://gerrit.ovirt.org/#/c/89851. Is it still failing?
On Sun, Apr 8, 2018 at 8:59 AM, Barak Korren <bkorren@redhat.com> wrote:
On 7 April 2018 at 00:30, Dan Kenigsberg <danken@redhat.com> wrote:
No, I am afraid that we have not managed to understand why setting an ipv6 address took the host off the grid. We shall continue researching this next week.
Edy, https://gerrit.ovirt.org/#/c/88637/ is already 4 weeks old, but could it possibly be related (I really doubt that)?
Sorry, but I do not see how this problem is related to VDSM. There is nothing that indicates that there is a VDSM problem.
Has the RPC connection between Engine and VDSM failed?
Further up the thread, Piotr noticed that (at least on one failure of this test) the Vdsm host lost connectivity to its storage, and the Vdsm process was restarted. However, this does not seem to happen in all cases where this test fails.
-- *GAL bEN HAIM* RHV DEVOPS

On Wed, Apr 11, 2018 at 12:45 PM, Alona Kaplan <alkaplan@redhat.com> wrote:
On Tue, Apr 10, 2018 at 6:52 PM, Gal Ben Haim <gbenhaim@redhat.com> wrote:
I'm seeing the same error in [1], during 006_migrations.migrate_vm.
[1] http://jenkins.ovirt.org/job/ovirt-4.2_change-queue-tester/1650/
Seems like another bug. The migration failed since for some reason the vm is already defined on the destination host.
2018-04-10 11:08:08,685-0400 ERROR (jsonrpc/0) [api] FINISH create error=Virtual machine already exists (api:129) Traceback (most recent call last): File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 122, in method ret = func(*args, **kwargs) File "/usr/lib/python2.7/site-packages/vdsm/API.py", line 191, in create raise exception.VMExists() VMExists: Virtual machine already exists
Milan, Francesco, could it be that because of [1] that appears on the destination host right after shutting down the VM, it remained defined on that host?
[1] 2018-04-10 11:01:40,005-0400 ERROR (libvirt/events) [vds] Error running VM callback (clientIF:683) Traceback (most recent call last): File "/usr/lib/python2.7/site-packages/vdsm/clientIF.py", line 646, in dispatchLibvirtEvents v.onLibvirtLifecycleEvent(event, detail, None) AttributeError: 'NoneType' object has no attribute 'onLibvirtLifecycleEvent'
On Tue, Apr 10, 2018 at 4:14 PM, Alona Kaplan <alkaplan@redhat.com> wrote:
Hi all,
Looking at the log it seems that the new GetCapabilitiesAsync is responsible for the mess.
- * 08:29:47 - engine loses connectivity to host 'lago-basic-suite-4-2-host-0'.*
*- Every 3 seconds a getCapabilitiesAsync request is sent to the host (unsuccessfully).*
* before each "getCapabilitiesAsync" the monitoring lock is taken (VdsManager,refreshImpl)
* "getCapabilitiesAsync" immediately fails and throws 'VDSNetworkException: java.net.ConnectException: Connection refused'. The exception is caught by 'GetCapabilitiesAsyncVDSCommand.executeVdsBrokerCommand' which calls 'onFailure' of the callback and re-throws the exception.
catch (Throwable t) { getParameters().getCallback().onFailure(t); throw t; }
* The 'onFailure' of the callback releases the "monitoringLock" ('postProcessRefresh()->afterRefreshTreatment()-> if (!succeeded) lockManager.releaseLock(monitoringLock);')
* 'VdsManager,refreshImpl' catches the network exception, marks 'releaseLock = true' and *tries to release the already released lock*.
The following warning is printed to the log -
WARN [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (EE-ManagedThreadFactory-engineScheduled-Thread-53) [] Trying to release exclusive lock which does not exist, lock key: 'ecf53d69-eb68-4b11-8df2-c4aa4e19bd93VDS_INIT'
*- 08:30:51 a successful getCapabilitiesAsync is sent.*
*- 08:32:55 - The failing test starts (Setup Networks for setting ipv6). *
* SetupNetworks takes the monitoring lock.
*- 08:33:00 - ResponseTracker cleans the getCapabilitiesAsync requests from 4 minutes ago from its queue and prints a VDSNetworkException: Vds timeout occured.*
* When the first request is removed from the queue ('ResponseTracker.remove()'), the *'Callback.onFailure' is invoked (for the second time) -> monitoring lock is released (the lock taken by the SetupNetworks!).*
* *The other requests removed from the queue also try to release the monitoring lock*, but there is nothing to release.
* The following warning log is printed - WARN [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (EE-ManagedThreadFactory-engineScheduled-Thread-14) [] Trying to release exclusive lock which does not exist, lock key: 'ecf53d69-eb68-4b11-8df2-c4aa4e19bd93VDS_INIT'
- *08:33:00 - SetupNetwork fails on Timeout ~4 seconds after it started*. Why? I'm not 100% sure, but I guess the late processing of the 'getCapabilitiesAsync' requests, which causes the loss of the monitoring lock, and the late + multiple processing of the failure are the root cause.
Ravi, 'getCapabilitiesAsync' failure is treated twice and the lock is trying to be released three times. Please share your opinion regarding how it should be fixed.
Thanks,
Alona.
On Sun, Apr 8, 2018 at 1:21 PM, Dan Kenigsberg <danken@redhat.com> wrote:
On Sun, Apr 8, 2018 at 9:21 AM, Edward Haas <ehaas@redhat.com> wrote:
On Sun, Apr 8, 2018 at 9:15 AM, Eyal Edri <eedri@redhat.com> wrote:
Was already done by Yaniv - https://gerrit.ovirt.org/#/c/89851. Is it still failing?
On Sun, Apr 8, 2018 at 8:59 AM, Barak Korren <bkorren@redhat.com> wrote:
On 7 April 2018 at 00:30, Dan Kenigsberg <danken@redhat.com> wrote:
No, I am afraid that we have not managed to understand why setting an ipv6 address took the host off the grid. We shall continue researching this next week.
Edy, https://gerrit.ovirt.org/#/c/88637/ is already 4 weeks old, but could it possibly be related (I really doubt that)?
Sorry, but I do not see how this problem is related to VDSM. There is nothing that indicates that there is a VDSM problem.
Has the RPC connection between Engine and VDSM failed?
Further up the thread, Piotr noticed that (at least on one failure of this test) the Vdsm host lost connectivity to its storage, and the Vdsm process was restarted. However, this does not seem to happen in all cases where this test fails.
-- *GAL bEN HAIM* RHV DEVOPS

Arik Hadas <ahadas@redhat.com> writes:
On Wed, Apr 11, 2018 at 12:45 PM, Alona Kaplan <alkaplan@redhat.com> wrote:
On Tue, Apr 10, 2018 at 6:52 PM, Gal Ben Haim <gbenhaim@redhat.com> wrote:
I'm seeing the same error in [1], during 006_migrations.migrate_vm.
[1] http://jenkins.ovirt.org/job/ovirt-4.2_change-queue-tester/1650/
Seems like another bug. The migration failed since for some reason the vm is already defined on the destination host.
2018-04-10 11:08:08,685-0400 ERROR (jsonrpc/0) [api] FINISH create error=Virtual machine already exists (api:129) Traceback (most recent call last): File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 122, in method ret = func(*args, **kwargs) File "/usr/lib/python2.7/site-packages/vdsm/API.py", line 191, in create raise exception.VMExists() VMExists: Virtual machine already exists
Milan, Francesco, could it be that because of [1] that appears on the destination host right after shutting down the VM, it remained defined on that host?
I can't see any destroy call in the logs after the successful preceding migration from the given host. That would explain the “VMExists” error.
[1] 2018-04-10 11:01:40,005-0400 ERROR (libvirt/events) [vds] Error running VM callback (clientIF:683)
Traceback (most recent call last):
File "/usr/lib/python2.7/site-packages/vdsm/clientIF.py", line 646, in dispatchLibvirtEvents
v.onLibvirtLifecycleEvent(event, detail, None)
AttributeError: 'NoneType' object has no attribute 'onLibvirtLifecycleEvent'
That means that a life cycle event on an unknown VM has arrived, in this case apparently a destroy event, following the destroy call after the failed incoming migration. The reported AttributeError is a minor bug, already fixed in master. So it's most likely unrelated to the discussed problem.
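For illustration only - not the actual vdsm patch - a minimal sketch, with invented class and variable names, of how a None check in the event dispatcher turns that AttributeError into a harmless warning when a lifecycle event arrives for a VM vdsm no longer tracks:

import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger('clientIF')

class FakeVM:
    def onLibvirtLifecycleEvent(self, event, detail, opaque):
        log.info('lifecycle event %s/%s', event, detail)

class FakeClientIF:
    def __init__(self):
        self.vmContainer = {}          # vmId -> VM wrapper known to vdsm

    def dispatchLibvirtEvents(self, vm_id, event, detail):
        v = self.vmContainer.get(vm_id)
        # Without this check, v is None for a VM vdsm no longer tracks, and
        # v.onLibvirtLifecycleEvent(...) raises the AttributeError quoted above.
        if v is None:
            log.warning('unknown VM %s, ignoring lifecycle event %s', vm_id, event)
            return
        v.onLibvirtLifecycleEvent(event, detail, None)

cif = FakeClientIF()
cif.vmContainer['known-vm'] = FakeVM()
cif.dispatchLibvirtEvents('known-vm', event=5, detail=1)     # dispatched normally
cif.dispatchLibvirtEvents('unknown-vm', event=5, detail=1)   # warning, no traceback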
On Tue, Apr 10, 2018 at 4:14 PM, Alona Kaplan <alkaplan@redhat.com> wrote:
Hi all,
Looking at the log it seems that the new GetCapabilitiesAsync is responsible for the mess.
- * 08:29:47 - engine loses connectivity to host 'lago-basic-suite-4-2-host-0'.*
*- Every 3 seconds a getCapabilitiesAsync request is sent to the host (unsuccessfully).*
* before each "getCapabilitiesAsync" the monitoring lock is taken (VdsManager,refreshImpl)
* "getCapabilitiesAsync" immediately fails and throws 'VDSNetworkException: java.net.ConnectException: Connection refused'. The exception is caught by 'GetCapabilitiesAsyncVDSCommand.executeVdsBrokerCommand' which calls 'onFailure' of the callback and re-throws the exception.
catch (Throwable t) { getParameters().getCallback().onFailure(t); throw t; }
* The 'onFailure' of the callback releases the "monitoringLock" ('postProcessRefresh()->afterRefreshTreatment()-> if (!succeeded) lockManager.releaseLock(monitoringLock);')
* 'VdsManager,refreshImpl' catches the network exception, marks 'releaseLock = true' and *tries to release the already released lock*.
The following warning is printed to the log -
WARN [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (EE-ManagedThreadFactory-engineScheduled-Thread-53) [] Trying to release exclusive lock which does not exist, lock key: 'ecf53d69-eb68-4b11-8df2-c4aa4e19bd93VDS_INIT'
*- 08:30:51 a successful getCapabilitiesAsync is sent.*
*- 08:32:55 - The failing test starts (Setup Networks for setting ipv6). *
* SetupNetworks takes the monitoring lock.
*- 08:33:00 - ResponseTracker cleans the getCapabilitiesAsync requests from 4 minutes ago from its queue and prints a VDSNetworkException: Vds timeout occurred.*
* When the first request is removed from the queue ('ResponseTracker.remove()'), the *'Callback.onFailure' is invoked (for the second time) -> monitoring lock is released (the lock taken by the SetupNetworks!).*
* *The other requests removed from the queue also try to release the monitoring lock*, but there is nothing to release.
* The following warning log is printed - WARN [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (EE-ManagedThreadFactory-engineScheduled-Thread-14) [] Trying to release exclusive lock which does not exist, lock key: 'ecf53d69-eb68-4b11-8df2-c4aa4e19bd93VDS_INIT'
- *08:33:00 - SetupNetwork fails on Timeout ~4 seconds after it is started*. Why? I'm not 100% sure, but I guess the late processing of 'getCapabilitiesAsync', which causes the loss of the monitoring lock, and the late + multiple processing of the failure are the root cause.
Ravi, the 'getCapabilitiesAsync' failure is handled twice and there are three attempts to release the lock. Please share your opinion regarding how it should be fixed.
Thanks,
Alona.
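For illustration, a minimal Python model of the double-release sequence described in the analysis above (the engine code is Java; the class and names here are purely illustrative): a lock that is released by key alone lets a stale getCapabilitiesAsync failure callback free a lock that SetupNetworks has since acquired, while an ownership check would reject it.

import threading

class KeyedLockManager(object):
    # Illustrative only, not the engine's InMemoryLockManager.
    def __init__(self):
        self._held = {}                 # lock key -> owner tag
        self._mutex = threading.Lock()

    def acquire(self, key, owner):
        with self._mutex:
            if key in self._held:
                return False
            self._held[key] = owner
            return True

    def release(self, key, owner=None):
        with self._mutex:
            current = self._held.get(key)
            if current is None:
                print('WARN Trying to release lock which does not exist: %s' % key)
            elif owner is not None and current != owner:
                # An ownership check stops a late callback from the old
                # monitoring cycle from freeing SetupNetworks' lock.
                print('WARN %s tried to release lock held by %s' % (owner, current))
            else:
                del self._held[key]

locks = KeyedLockManager()
locks.acquire('VDS_INIT', owner='monitoring-cycle')
locks.release('VDS_INIT', owner='monitoring-cycle')   # immediate failure path
locks.acquire('VDS_INIT', owner='setup-networks')     # the test takes the lock
locks.release('VDS_INIT', owner='monitoring-cycle')   # delayed timeout fires onFailure again; rejected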
On Sun, Apr 8, 2018 at 1:21 PM, Dan Kenigsberg <danken@redhat.com> wrote:
On Sun, Apr 8, 2018 at 9:21 AM, Edward Haas <ehaas@redhat.com> wrote:
On Sun, Apr 8, 2018 at 9:15 AM, Eyal Edri <eedri@redhat.com> wrote:
> Was already done by Yaniv - https://gerrit.ovirt.org/#/c/89851.
> Is it still failing?
>
> On Sun, Apr 8, 2018 at 8:59 AM, Barak Korren <bkorren@redhat.com> wrote:
>> On 7 April 2018 at 00:30, Dan Kenigsberg <danken@redhat.com> wrote:
>> > No, I am afraid that we have not managed to understand why setting an ipv6 address took the host off the grid. We shall continue researching this next week.
>> >
>> > Edy, https://gerrit.ovirt.org/#/c/88637/ is already 4 weeks old, but could it possibly be related (I really doubt that)?
>> >
>
> Sorry, but I do not see how this problem is related to VDSM. There is nothing that indicates that there is a VDSM problem.
Has the RPC connection between Engine and VDSM failed?
Further up the thread, Piotr noticed that (at least on one failure of this test) the Vdsm host lost connectivity to its storage, and the Vdsm process was restarted. However, this does not seem to happen in all cases where this test fails.
_______________________________________________ Devel mailing list Devel@ovirt.org http://lists.ovirt.org/mailman/listinfo/devel
_______________________________________________ Devel mailing list Devel@ovirt.org http://lists.ovirt.org/mailman/listinfo/devel
-- *GAL bEN HAIM* RHV DEVOPS
_______________________________________________ Devel mailing list Devel@ovirt.org http://lists.ovirt.org/mailman/listinfo/devel

Hi,

We are failing randomly on test 006_migrations.migrate_vm with what seems to be the same issue.

The VM seems to be migrated successfully, but the engine thinks that it failed and re-calls migration, getting a "VM already exists" response.

I don't think this is an issue with the test but rather a regression, so I opened a bug:
https://bugzilla.redhat.com/show_bug.cgi?id=1566594

Thanks,
Dafna

On Wed, Apr 11, 2018 at 1:52 PM, Milan Zamazal <mzamazal@redhat.com> wrote:
Arik Hadas <ahadas@redhat.com> writes:
On Wed, Apr 11, 2018 at 12:45 PM, Alona Kaplan <alkaplan@redhat.com> wrote:
On Tue, Apr 10, 2018 at 6:52 PM, Gal Ben Haim <gbenhaim@redhat.com> wrote:
I'm seeing the same error in [1], during 006_migrations.migrate_vm.
[1] http://jenkins.ovirt.org/job/ovirt-4.2_change-queue-tester/1650/
Seems like another bug. The migration failed since for some reason the vm is already defined on the destination host.
2018-04-10 11:08:08,685-0400 ERROR (jsonrpc/0) [api] FINISH create error=Virtual machine already exists (api:129)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 122, in method
    ret = func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/API.py", line 191, in create
    raise exception.VMExists()
VMExists: Virtual machine already exists
Milan, Francesco, could it be that because of [1] that appears on the destination host right after shutting down the VM, it remained defined on that host?
I can't see any destroy call in the logs after the successful preceding migration from the given host. That would explain “VMExists” error.
[1] 2018-04-10 11:01:40,005-0400 ERROR (libvirt/events) [vds] Error running VM callback (clientIF:683)
Traceback (most recent call last):
File "/usr/lib/python2.7/site-packages/vdsm/clientIF.py", line 646, in dispatchLibvirtEvents
v.onLibvirtLifecycleEvent(event, detail, None)
AttributeError: 'NoneType' object has no attribute 'onLibvirtLifecycleEvent'
That means that a life cycle event on an unknown VM has arrived, in this case apparently destroy event, following the destroy call after the failed incoming migration. The reported AttributeError is a minor bug, already fixed in master. So it's most likely unrelated to the discussed problem.
On Tue, Apr 10, 2018 at 4:14 PM, Alona Kaplan <alkaplan@redhat.com> wrote:
Hi all,
Looking at the log it seems that the new GetCapabilitiesAsync is responsible for the mess.
- * 08:29:47 - engine loses connectivity to host 'lago-basic-suite-4-2-host-0'.*
*- Every 3 seconds a getCapabilitiesAsync request is sent to the host (unsuccessfully).*
* before each "getCapabilitiesAsync" the monitoring lock is taken (VdsManager,refreshImpl)
* "getCapabilitiesAsync" immediately fails and throws 'VDSNetworkException: java.net.ConnectException: Connection refused'. The exception is caught by 'GetCapabilitiesAsyncVDSCommand.executeVdsBrokerCommand' which calls 'onFailure' of the callback and re-throws the exception.
catch (Throwable t) { getParameters().getCallback().onFailure(t); throw t; }
* The 'onFailure' of the callback releases the "monitoringLock" ('postProcessRefresh()->afterRefreshTreatment()-> if (!succeeded) lockManager.releaseLock(monitoringLock);')
* 'VdsManager,refreshImpl' catches the network exception, marks 'releaseLock = true' and *tries to release the already released lock*.
The following warning is printed to the log -
WARN [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (EE-ManagedThreadFactory-engineScheduled-Thread-53) [] Trying to release exclusive lock which does not exist, lock key: 'ecf53d69-eb68-4b11-8df2-c4aa4e19bd93VDS_INIT'
*- 08:30:51 a successful getCapabilitiesAsync is sent.*
*- 08:32:55 - The failing test starts (Setup Networks for setting ipv6). *
* SetupNetworks takes the monitoring lock.
*- 08:33:00 - ResponseTracker cleans the getCapabilitiesAsync requests from 4 minutes ago from its queue and prints a VDSNetworkException: Vds timeout occurred.*
* When the first request is removed from the queue ('ResponseTracker.remove()'), the *'Callback.onFailure' is invoked (for the second time) -> monitoring lock is released (the lock taken by the SetupNetworks!).*
* *The other requests removed from the queue also try to release the monitoring lock*, but there is nothing to release.
* The following warning log is printed - WARN [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (EE-ManagedThreadFactory-engineScheduled-Thread-14) [] Trying to release exclusive lock which does not exist, lock key: 'ecf53d69-eb68-4b11-8df2-c4aa4e19bd93VDS_INIT'
- *08:33:00 - SetupNetwork fails on Timeout ~4 seconds after it is started*. Why? I'm not 100% sure, but I guess the late processing of 'getCapabilitiesAsync', which causes the loss of the monitoring lock, and the late + multiple processing of the failure are the root cause.
Ravi, the 'getCapabilitiesAsync' failure is handled twice and there are three attempts to release the lock. Please share your opinion regarding how it should be fixed.
Thanks,
Alona.
On Sun, Apr 8, 2018 at 1:21 PM, Dan Kenigsberg <danken@redhat.com> wrote:
On Sun, Apr 8, 2018 at 9:21 AM, Edward Haas <ehaas@redhat.com> wrote:
>
> On Sun, Apr 8, 2018 at 9:15 AM, Eyal Edri <eedri@redhat.com> wrote:
>> Was already done by Yaniv - https://gerrit.ovirt.org/#/c/89851.
>> Is it still failing?
>>
>> On Sun, Apr 8, 2018 at 8:59 AM, Barak Korren <bkorren@redhat.com> wrote:
>>> On 7 April 2018 at 00:30, Dan Kenigsberg <danken@redhat.com> wrote:
>>> > No, I am afraid that we have not managed to understand why setting an ipv6 address took the host off the grid. We shall continue researching this next week.
>>> >
>>> > Edy, https://gerrit.ovirt.org/#/c/88637/ is already 4 weeks old, but could it possibly be related (I really doubt that)?
>>> >
>
> Sorry, but I do not see how this problem is related to VDSM.
> There is nothing that indicates that there is a VDSM problem.
>
> Has the RPC connection between Engine and VDSM failed?
>
> Further up the thread, Piotr noticed that (at least on one failure of this test) the Vdsm host lost connectivity to its storage, and the Vdsm process was restarted. However, this does not seem to happen in all cases where this test fails.
_______________________________________________ Devel mailing list Devel@ovirt.org http://lists.ovirt.org/mailman/listinfo/devel
_______________________________________________ Devel mailing list Devel@ovirt.org http://lists.ovirt.org/mailman/listinfo/devel
-- *GAL bEN HAIM* RHV DEVOPS
_______________________________________________ Devel mailing list Devel@ovirt.org http://lists.ovirt.org/mailman/listinfo/devel
Devel mailing list Devel@ovirt.org http://lists.ovirt.org/mailman/listinfo/devel

On 12 Apr 2018, at 17:15, Dafna Ron <dron@redhat.com> wrote:
> Hi,
>
> We are failing randomly on test 006_migrations.migrate_vm with what seems to be the same issue.
>
> The VM seems to be migrated successfully, but the engine thinks that it failed and re-calls migration, getting a "VM already exists" response.
>
> I don't think this is an issue with the test but rather a regression, so I opened a bug:

I do not think so, I've heard someone removed a test in between migrating A->B and B->A? If that's the case, that is the real issue. You can't migrate back to A without waiting for A to be cleared out properly. https://gerrit.ovirt.org/#/c/90166/ should fix it.

> https://bugzilla.redhat.com/show_bug.cgi?id=1566594
>
> Thanks,
> Dafna
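For illustration, a rough sketch (in Python, with ovirtsdk4) of the kind of wait implied here: poll the engine until the migrated VM is reported UP on the destination host before the suite migrates it back. The helper name and its use are hypothetical, and this only covers the engine's view of the migration, not vdsm-level cleanup on the source host.

import time

import ovirtsdk4.types as types


def wait_for_migration_settled(connection, vm_name, dst_host_name,
                               timeout=120, interval=3):
    # Hypothetical OST-style helper: block until `vm_name` is UP and the
    # engine reports it running on `dst_host_name`.
    system = connection.system_service()
    vms = system.vms_service()
    hosts = system.hosts_service()
    dst_host = hosts.list(search='name=%s' % dst_host_name)[0]
    deadline = time.time() + timeout
    while time.time() < deadline:
        vm = vms.list(search='name=%s' % vm_name)[0]
        # `vm.host` is the host link the engine reports for a running VM.
        if vm.status == types.VmStatus.UP and vm.host and vm.host.id == dst_host.id:
            return
        time.sleep(interval)
    raise AssertionError('%s did not settle on %s within %ss'
                         % (vm_name, dst_host_name, timeout))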

Working on a patch, will post a fix.

Thanks,
Ravi

On Tue, Apr 10, 2018 at 9:14 AM, Alona Kaplan <alkaplan@redhat.com> wrote:
Hi all,
Looking at the log it seems that the new GetCapabilitiesAsync is responsible for the mess.
- * 08:29:47 - engine loses connectivity to host 'lago-basic-suite-4-2-host-0'.*
*- Every 3 seconds a getCapabilitiesAsync request is sent to the host (unsuccessfully).*
* before each "getCapabilitiesAsync" the monitoring lock is taken (VdsManager,refreshImpl)
* "getCapabilitiesAsync" immediately fails and throws 'VDSNetworkException: java.net.ConnectException: Connection refused'. The exception is caught by 'GetCapabilitiesAsyncVDSCommand.executeVdsBrokerCommand' which calls 'onFailure' of the callback and re-throws the exception.
catch (Throwable t) { getParameters().getCallback().onFailure(t); throw t; }
* The 'onFailure' of the callback releases the "monitoringLock" ('postProcessRefresh()->afterRefreshTreatment()-> if (!succeeded) lockManager.releaseLock(monitoringLock);')
* 'VdsManager,refreshImpl' catches the network exception, marks 'releaseLock = true' and *tries to release the already released lock*.
The following warning is printed to the log -
WARN [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (EE-ManagedThreadFactory-engineScheduled-Thread-53) [] Trying to release exclusive lock which does not exist, lock key: 'ecf53d69-eb68-4b11-8df2-c4aa4e19bd93VDS_INIT'
*- 08:30:51 a successful getCapabilitiesAsync is sent.*
*- 08:32:55 - The failing test starts (Setup Networks for setting ipv6). *
* SetupNetworks takes the monitoring lock.
*- 08:33:00 - ResponseTracker cleans the getCapabilitiesAsync requests from 4 minutes ago from its queue and prints a VDSNetworkException: Vds timeout occurred.*
* When the first request is removed from the queue ('ResponseTracker.remove()'), the *'Callback.onFailure' is invoked (for the second time) -> monitoring lock is released (the lock taken by the SetupNetworks!).*
* *The other requests removed from the queue also try to release the monitoring lock*, but there is nothing to release.
* The following warning log is printed - WARN [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (EE-ManagedThreadFactory-engineScheduled-Thread-14) [] Trying to release exclusive lock which does not exist, lock key: 'ecf53d69-eb68-4b11-8df2-c4aa4e19bd93VDS_INIT'
- *08:33:00 - SetupNetwork fails on Timeout ~4 seconds after it is started*. Why? I'm not 100% sure, but I guess the late processing of 'getCapabilitiesAsync', which causes the loss of the monitoring lock, and the late + multiple processing of the failure are the root cause.
Ravi, the 'getCapabilitiesAsync' failure is handled twice and there are three attempts to release the lock. Please share your opinion regarding how it should be fixed.
Thanks,
Alona.
On Sun, Apr 8, 2018 at 1:21 PM, Dan Kenigsberg <danken@redhat.com> wrote:
On Sun, Apr 8, 2018 at 9:21 AM, Edward Haas <ehaas@redhat.com> wrote:
On Sun, Apr 8, 2018 at 9:15 AM, Eyal Edri <eedri@redhat.com> wrote:
Was already done by Yaniv - https://gerrit.ovirt.org/#/c/89851. Is it still failing?
On Sun, Apr 8, 2018 at 8:59 AM, Barak Korren <bkorren@redhat.com> wrote:
On 7 April 2018 at 00:30, Dan Kenigsberg <danken@redhat.com> wrote:
No, I am afraid that we have not managed to understand why setting an ipv6 address took the host off the grid. We shall continue researching this next week.
Edy, https://gerrit.ovirt.org/#/c/88637/ is already 4 weeks old, but could it possibly be related (I really doubt that)?
Sorry, but I do not see how this problem is related to VDSM. There is nothing that indicates that there is a VDSM problem.
Has the RPC connection between Engine and VDSM failed?
Further up the thread, Piotr noticed that (at least on one failure of this test) the Vdsm host lost connectivity to its storage, and the Vdsm process was restarted. However, this does not seem to happen in all cases where this test fails.
_______________________________________________ Devel mailing list Devel@ovirt.org http://lists.ovirt.org/mailman/listinfo/devel

This [1] should fix the multiple release lock issue.

[1] https://gerrit.ovirt.org/#/c/90077/

On Tue, Apr 10, 2018 at 3:53 PM, Ravi Shankar Nori <rnori@redhat.com> wrote:
Working on a patch will post a fix
Thanks
Ravi
On Tue, Apr 10, 2018 at 9:14 AM, Alona Kaplan <alkaplan@redhat.com> wrote:
Hi all,
Looking at the log it seems that the new GetCapabilitiesAsync is responsible for the mess.
- * 08:29:47 - engine loses connectivity to host 'lago-basic-suite-4-2-host-0'.*
*- Every 3 seconds a getCapabilitiesAsync request is sent to the host (unsuccessfully).*
* before each "getCapabilitiesAsync" the monitoring lock is taken (VdsManager,refreshImpl)
* "getCapabilitiesAsync" immediately fails and throws 'VDSNetworkException: java.net.ConnectException: Connection refused'. The exception is caught by 'GetCapabilitiesAsyncVDSCommand.executeVdsBrokerCommand' which calls 'onFailure' of the callback and re-throws the exception.
catch (Throwable t) { getParameters().getCallback().onFailure(t); throw t; }
* The 'onFailure' of the callback releases the "monitoringLock" ('postProcessRefresh()->afterRefreshTreatment()-> if (!succeeded) lockManager.releaseLock(monitoringLock);')
* 'VdsManager,refreshImpl' catches the network exception, marks 'releaseLock = true' and *tries to release the already released lock*.
The following warning is printed to the log -
WARN [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (EE-ManagedThreadFactory-engineScheduled-Thread-53) [] Trying to release exclusive lock which does not exist, lock key: 'ecf53d69-eb68-4b11-8df2-c4aa4e19bd93VDS_INIT'
*- 08:30:51 a successful getCapabilitiesAsync is sent.*
*- 08:32:55 - The failing test starts (Setup Networks for setting ipv6). *
* SetupNetworks takes the monitoring lock.
*- 08:33:00 - ResponseTracker cleans the getCapabilitiesAsync requests from 4 minutes ago from its queue and prints a VDSNetworkException: Vds timeout occurred.*
* When the first request is removed from the queue ('ResponseTracker.remove()'), the *'Callback.onFailure' is invoked (for the second time) -> monitoring lock is released (the lock taken by the SetupNetworks!).*
* *The other requests removed from the queue also try to release the monitoring lock*, but there is nothing to release.
* The following warning log is printed - WARN [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (EE-ManagedThreadFactory-engineScheduled-Thread-14) [] Trying to release exclusive lock which does not exist, lock key: 'ecf53d69-eb68-4b11-8df2-c4aa4e19bd93VDS_INIT'
- *08:33:00 - SetupNetwork fails on Timeout ~4 seconds after it is started*. Why? I'm not 100% sure, but I guess the late processing of 'getCapabilitiesAsync', which causes the loss of the monitoring lock, and the late + multiple processing of the failure are the root cause.
Ravi, the 'getCapabilitiesAsync' failure is handled twice and there are three attempts to release the lock. Please share your opinion regarding how it should be fixed.
Thanks,
Alona.
On Sun, Apr 8, 2018 at 1:21 PM, Dan Kenigsberg <danken@redhat.com> wrote:
On Sun, Apr 8, 2018 at 9:21 AM, Edward Haas <ehaas@redhat.com> wrote:
On Sun, Apr 8, 2018 at 9:15 AM, Eyal Edri <eedri@redhat.com> wrote:
Was already done by Yaniv - https://gerrit.ovirt.org/#/c/89851. Is it still failing?
On Sun, Apr 8, 2018 at 8:59 AM, Barak Korren <bkorren@redhat.com> wrote:
On 7 April 2018 at 00:30, Dan Kenigsberg <danken@redhat.com> wrote:
> No, I am afraid that we have not managed to understand why setting an ipv6 address took the host off the grid. We shall continue researching this next week.
>
> Edy, https://gerrit.ovirt.org/#/c/88637/ is already 4 weeks old, but could it possibly be related (I really doubt that)?
>
Sorry, but I do not see how this problem is related to VDSM. There is nothing that indicates that there is a VDSM problem.
Has the RPC connection between Engine and VDSM failed?
Further up the thread, Piotr noticed that (at least on one failure of this test) the Vdsm host lost connectivity to its storage, and the Vdsm process was restarted. However, this does not seem to happen in all cases where this test fails.
_______________________________________________ Devel mailing list Devel@ovirt.org http://lists.ovirt.org/mailman/listinfo/devel

Hi Ravi,

Added comments to the patch.

Regarding the lock - the lock shouldn't be released until the command and its callbacks have finished. The handling of a delayed failure should be done under the lock, since it doesn't make sense to take care of the failure while another monitoring process is possibly running.

Besides the locking issue, IMO the main problem is the delayed failures. If vdsm is down, why is there both an immediate exception and a delayed one? The delayed one is redundant. In any case, 'callback.onFailure' shouldn't be executed twice.

Thanks,
Alona.
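For illustration, a minimal sketch (in Python; the engine code is Java and these names are illustrative) of one way to guarantee that a callback's failure path runs at most once, whichever of the immediate exception or the delayed ResponseTracker timeout arrives first:

import threading

class OneShotCallback(object):
    # Conceptual sketch: deliver success or failure at most once.
    def __init__(self, on_success, on_failure):
        self._on_success = on_success
        self._on_failure = on_failure
        self._lock = threading.Lock()
        self._done = False

    def _claim(self):
        with self._lock:
            if self._done:
                return False
            self._done = True
            return True

    def succeeded(self, result):
        if self._claim():
            self._on_success(result)

    def failed(self, error):
        # A second failure (e.g. the delayed ResponseTracker timeout after
        # the immediate ConnectException) becomes a no-op instead of
        # releasing a monitoring lock it no longer owns.
        if self._claim():
            self._on_failure(error)

On Wed, Apr 11, 2018 at 1:04 AM, Ravi Shankar Nori <rnori@redhat.com> wrote: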
This [1] should fix the multiple release lock issue
[1] https://gerrit.ovirt.org/#/c/90077/
On Tue, Apr 10, 2018 at 3:53 PM, Ravi Shankar Nori <rnori@redhat.com> wrote:
Working on a patch will post a fix
Thanks
Ravi
On Tue, Apr 10, 2018 at 9:14 AM, Alona Kaplan <alkaplan@redhat.com> wrote:
Hi all,
Looking at the log it seems that the new GetCapabilitiesAsync is responsible for the mess.
- * 08:29:47 - engine loses connectivity to host 'lago-basic-suite-4-2-host-0'.*
*- Every 3 seconds a getCapabilitiesAsync request is sent to the host (unsuccessfully).*
* before each "getCapabilitiesAsync" the monitoring lock is taken (VdsManager,refreshImpl)
* "getCapabilitiesAsync" immediately fails and throws 'VDSNetworkException: java.net.ConnectException: Connection refused'. The exception is caught by 'GetCapabilitiesAsyncVDSCommand.executeVdsBrokerCommand' which calls 'onFailure' of the callback and re-throws the exception.
catch (Throwable t) { getParameters().getCallback().onFailure(t); throw t; }
* The 'onFailure' of the callback releases the "monitoringLock" ('postProcessRefresh()->afterRefreshTreatment()-> if (!succeeded) lockManager.releaseLock(monitoringLock);')
* 'VdsManager,refreshImpl' catches the network exception, marks 'releaseLock = true' and *tries to release the already released lock*.
The following warning is printed to the log -
WARN [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (EE-ManagedThreadFactory-engineScheduled-Thread-53) [] Trying to release exclusive lock which does not exist, lock key: 'ecf53d69-eb68-4b11-8df2-c4aa4e19bd93VDS_INIT'
*- 08:30:51 a successful getCapabilitiesAsync is sent.*
*- 08:32:55 - The failing test starts (Setup Networks for setting ipv6). *
* SetupNetworks takes the monitoring lock.
*- 08:33:00 - ResponseTracker cleans the getCapabilitiesAsync requests from 4 minutes ago from its queue and prints a VDSNetworkException: Vds timeout occurred.*
* When the first request is removed from the queue ('ResponseTracker.remove()'), the *'Callback.onFailure' is invoked (for the second time) -> monitoring lock is released (the lock taken by the SetupNetworks!).*
* *The other requests removed from the queue also try to release the monitoring lock*, but there is nothing to release.
* The following warning log is printed - WARN [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (EE-ManagedThreadFactory-engineScheduled-Thread-14) [] Trying to release exclusive lock which does not exist, lock key: 'ecf53d69-eb68-4b11-8df2-c4aa4e19bd93VDS_INIT'
- *08:33:00 - SetupNetwork fails on Timeout ~4 seconds after it is started*. Why? I'm not 100% sure, but I guess the late processing of 'getCapabilitiesAsync', which causes the loss of the monitoring lock, and the late + multiple processing of the failure are the root cause.
Ravi, the 'getCapabilitiesAsync' failure is handled twice and there are three attempts to release the lock. Please share your opinion regarding how it should be fixed.
Thanks,
Alona.
On Sun, Apr 8, 2018 at 1:21 PM, Dan Kenigsberg <danken@redhat.com> wrote:
On Sun, Apr 8, 2018 at 9:21 AM, Edward Haas <ehaas@redhat.com> wrote:
On Sun, Apr 8, 2018 at 9:15 AM, Eyal Edri <eedri@redhat.com> wrote:
Was already done by Yaniv - https://gerrit.ovirt.org/#/c/89851. Is it still failing?
On Sun, Apr 8, 2018 at 8:59 AM, Barak Korren <bkorren@redhat.com> wrote:
> On 7 April 2018 at 00:30, Dan Kenigsberg <danken@redhat.com> wrote:
> > No, I am afraid that we have not managed to understand why setting an ipv6 address took the host off the grid. We shall continue researching this next week.
> >
> > Edy, https://gerrit.ovirt.org/#/c/88637/ is already 4 weeks old, but could it possibly be related (I really doubt that)?
> >
Sorry, but I do not see how this problem is related to VDSM. There is nothing that indicates that there is a VDSM problem.
Has the RPC connection between Engine and VDSM failed?
Further up the thread, Piotr noticed that (at least on one failure of this test) the Vdsm host lost connectivity to its storage, and the Vdsm process was restarted. However, this does not seem to happen in all cases where this test fails.
_______________________________________________ Devel mailing list Devel@ovirt.org http://lists.ovirt.org/mailman/listinfo/devel

Ravi's patch is in, but a similar problem remains, and the test cannot be put back into its place.

It seems that while Vdsm was taken down, a couple of getCapsAsync requests queued up. At one point the host resumed its connection, before the requests had been cleared from the queue. After the host is up, the following tests resume, and at a pseudorandom point in time an old getCapsAsync request times out and kills our connection.

I believe that as long as ANY request is in flight, the monitoring lock should not be released, and the host should not be declared as up.
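For illustration, a minimal Python sketch of the gating described above: count in-flight requests and only release the monitoring lock / declare the host up once the queue has drained (conceptual only; the real bookkeeping would live in the engine's ResponseTracker/VdsManager).

import threading
import time


class InFlightTracker(object):
    # Illustrative only; counts outstanding getCapabilitiesAsync requests.
    def __init__(self):
        self._count = 0
        self._cond = threading.Condition()

    def started(self):
        with self._cond:
            self._count += 1

    def finished(self):
        # Called on success, failure, or timeout of a request.
        with self._cond:
            self._count -= 1
            if self._count == 0:
                self._cond.notify_all()

    def wait_until_drained(self, timeout):
        # Only once every queued request has completed or timed out is it
        # safe to release the monitoring lock and declare the host up.
        deadline = time.time() + timeout
        with self._cond:
            while self._count:
                remaining = deadline - time.time()
                if remaining <= 0:
                    return False
                self._cond.wait(remaining)
            return True

On Wed, Apr 11, 2018 at 1:04 AM, Ravi Shankar Nori <rnori@redhat.com> wrote: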
This [1] should fix the multiple release lock issue
[1] https://gerrit.ovirt.org/#/c/90077/
On Tue, Apr 10, 2018 at 3:53 PM, Ravi Shankar Nori <rnori@redhat.com> wrote:
Working on a patch will post a fix
Thanks
Ravi
On Tue, Apr 10, 2018 at 9:14 AM, Alona Kaplan <alkaplan@redhat.com> wrote:
Hi all,
Looking at the log it seems that the new GetCapabilitiesAsync is responsible for the mess.
- 08:29:47 - engine loses connectivity to host 'lago-basic-suite-4-2-host-0'.
- Every 3 seconds a getCapabilitiesAsync request is sent to the host (unsuccessfully).
* before each "getCapabilitiesAsync" the monitoring lock is taken (VdsManager,refreshImpl)
* "getCapabilitiesAsync" immediately fails and throws 'VDSNetworkException: java.net.ConnectException: Connection refused'. The exception is caught by 'GetCapabilitiesAsyncVDSCommand.executeVdsBrokerCommand' which calls 'onFailure' of the callback and re-throws the exception.
catch (Throwable t) { getParameters().getCallback().onFailure(t); throw t; }
* The 'onFailure' of the callback releases the "monitoringLock" ('postProcessRefresh()->afterRefreshTreatment()-> if (!succeeded) lockManager.releaseLock(monitoringLock);')
* 'VdsManager,refreshImpl' catches the network exception, marks 'releaseLock = true' and tries to release the already released lock.
The following warning is printed to the log -
WARN [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (EE-ManagedThreadFactory-engineScheduled-Thread-53) [] Trying to release exclusive lock which does not exist, lock key: 'ecf53d69-eb68-4b11-8df2-c4aa4e19bd93VDS_INIT'
- 08:30:51 a successful getCapabilitiesAsync is sent.
- 08:32:55 - The failing test starts (Setup Networks for setting ipv6).
* SetupNetworks takes the monitoring lock.
- 08:33:00 - ResponseTracker cleans the getCapabilitiesAsync requests from 4 minutes ago from its queue and prints a VDSNetworkException: Vds timeout occurred.
* When the first request is removed from the queue ('ResponseTracker.remove()'), the 'Callback.onFailure' is invoked (for the second time) -> monitoring lock is released (the lock taken by the SetupNetworks!).
* The other requests removed from the queue also try to release the monitoring lock, but there is nothing to release.
* The following warning log is printed - WARN [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (EE-ManagedThreadFactory-engineScheduled-Thread-14) [] Trying to release exclusive lock which does not exist, lock key: 'ecf53d69-eb68-4b11-8df2-c4aa4e19bd93VDS_INIT'
- 08:33:00 - SetupNetwork fails on Timeout ~4 seconds after it is started. Why? I'm not 100% sure, but I guess the late processing of 'getCapabilitiesAsync', which causes the loss of the monitoring lock, and the late + multiple processing of the failure are the root cause.
Ravi, the 'getCapabilitiesAsync' failure is handled twice and there are three attempts to release the lock. Please share your opinion regarding how it should be fixed.
Thanks,
Alona.
On Sun, Apr 8, 2018 at 1:21 PM, Dan Kenigsberg <danken@redhat.com> wrote:
On Sun, Apr 8, 2018 at 9:21 AM, Edward Haas <ehaas@redhat.com> wrote:
On Sun, Apr 8, 2018 at 9:15 AM, Eyal Edri <eedri@redhat.com> wrote:
Was already done by Yaniv - https://gerrit.ovirt.org/#/c/89851. Is it still failing?
On Sun, Apr 8, 2018 at 8:59 AM, Barak Korren <bkorren@redhat.com> wrote:
>
> On 7 April 2018 at 00:30, Dan Kenigsberg <danken@redhat.com> wrote:
> > No, I am afraid that we have not managed to understand why setting an ipv6 address took the host off the grid. We shall continue researching this next week.
> >
> > Edy, https://gerrit.ovirt.org/#/c/88637/ is already 4 weeks old, but could it possibly be related (I really doubt that)?
> >
Sorry, but I do not see how this problem is related to VDSM. There is nothing that indicates that there is a VDSM problem.
Has the RPC connection between Engine and VDSM failed?
Further up the thread, Piotr noticed that (at least on one failure of this test) the Vdsm host lost connectivity to its storage, and the Vdsm process was restarted. However, this does not seem to happen in all cases where this test fails.
_______________________________________________ Devel mailing list Devel@ovirt.org http://lists.ovirt.org/mailman/listinfo/devel

On Tue, Apr 24, 2018 at 2:00 PM, Dan Kenigsberg <danken@redhat.com> wrote:
Ravi's patch is in, but a similar problem remains, and the test cannot be put back into its place.
It seems that while Vdsm was taken down, a couple of getCapsAsync
- Why is VDSM down? Y.
requests queued up. At one point the host resumed its connection, before the requests had been cleared from the queue. After the host is up, the following tests resume, and at a pseudorandom point in time an old getCapsAsync request times out and kills our connection.
I believe that as long as ANY request is in flight, the monitoring lock should not be released, and the host should not be declared as up.

On Wed, Apr 11, 2018 at 1:04 AM, Ravi Shankar Nori <rnori@redhat.com> wrote:
This [1] should fix the multiple release lock issue
[1] https://gerrit.ovirt.org/#/c/90077/
On Tue, Apr 10, 2018 at 3:53 PM, Ravi Shankar Nori <rnori@redhat.com> wrote:
Working on a patch will post a fix
Thanks
Ravi
On Tue, Apr 10, 2018 at 9:14 AM, Alona Kaplan <alkaplan@redhat.com> wrote:
Hi all,
Looking at the log it seems that the new GetCapabilitiesAsync is responsible for the mess.
- 08:29:47 - engine loses connectivity to host 'lago-basic-suite-4-2-host-0'.
- Every 3 seconds a getCapabilitiesAsync request is sent to the host (unsuccessfully).
* before each "getCapabilitiesAsync" the monitoring lock is taken (VdsManager,refreshImpl)
* "getCapabilitiesAsync" immediately fails and throws 'VDSNetworkException: java.net.ConnectException: Connection refused'.
The exception is caught by 'GetCapabilitiesAsyncVDSCommand.executeVdsBrokerCommand' which calls 'onFailure' of the callback and re-throws the exception.
catch (Throwable t) { getParameters().getCallback().onFailure(t); throw t; }
* The 'onFailure' of the callback releases the "monitoringLock" ('postProcessRefresh()->afterRefreshTreatment()-> if (!succeeded) lockManager.releaseLock(monitoringLock);')
* 'VdsManager,refreshImpl' catches the network exception, marks 'releaseLock = true' and tries to release the already released lock.
The following warning is printed to the log -
WARN [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (EE-ManagedThreadFactory-engineScheduled-Thread-53) [] Trying to release exclusive lock which does not exist, lock key: 'ecf53d69-eb68-4b11-8df2-c4aa4e19bd93VDS_INIT'
- 08:30:51 a successful getCapabilitiesAsync is sent.
- 08:32:55 - The failing test starts (Setup Networks for setting ipv6).
* SetupNetworks takes the monitoring lock.
- 08:33:00 - ResponseTracker cleans the getCapabilitiesAsync requests from 4 minutes ago from its queue and prints a VDSNetworkException: Vds timeout occurred.
* When the first request is removed from the queue ('ResponseTracker.remove()'), the 'Callback.onFailure' is invoked (for the second time) -> monitoring lock is released (the lock taken by the SetupNetworks!).
* The other requests removed from the queue also try to release the monitoring lock, but there is nothing to release.
* The following warning log is printed - WARN [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (EE-ManagedThreadFactory-engineScheduled-Thread-14) [] Trying to release exclusive lock which does not exist, lock key: 'ecf53d69-eb68-4b11-8df2-c4aa4e19bd93VDS_INIT'
- 08:33:00 - SetupNetwork fails on Timeout ~4 seconds after it is started. Why? I'm not 100% sure, but I guess the late processing of 'getCapabilitiesAsync', which causes the loss of the monitoring lock, and the late + multiple processing of the failure are the root cause.
Ravi, the 'getCapabilitiesAsync' failure is handled twice and there are three attempts to release the lock. Please share your opinion regarding how it should be fixed.
Thanks,
Alona.
On Sun, Apr 8, 2018 at 1:21 PM, Dan Kenigsberg <danken@redhat.com> wrote:
On Sun, Apr 8, 2018 at 9:21 AM, Edward Haas <ehaas@redhat.com> wrote:
On Sun, Apr 8, 2018 at 9:15 AM, Eyal Edri <eedri@redhat.com> wrote:
>
> Was already done by Yaniv - https://gerrit.ovirt.org/#/c/89851.
> Is it still failing?
>
> On Sun, Apr 8, 2018 at 8:59 AM, Barak Korren <bkorren@redhat.com> wrote:
>>
>> On 7 April 2018 at 00:30, Dan Kenigsberg <danken@redhat.com> wrote:
>> > No, I am afraid that we have not managed to understand why setting an ipv6 address took the host off the grid. We shall continue researching this next week.
>> >
>> > Edy, https://gerrit.ovirt.org/#/c/88637/ is already 4 weeks old, but could it possibly be related (I really doubt that)?
>> >
Sorry, but I do not see how this problem is related to VDSM. There is nothing that indicates that there is a VDSM problem.
Has the RPC connection between Engine and VDSM failed?
Further up the thread, Piotr noticed that (at least on one failure of this test) the Vdsm host lost connectivity to its storage, and the Vdsm process was restarted. However, this does not seem to happen in all cases where this test fails.

On Tue, Apr 24, 2018 at 3:14 PM, Yaniv Kaul <ykaul@redhat.com> wrote:
On Tue, Apr 24, 2018 at 2:00 PM, Dan Kenigsberg <danken@redhat.com> wrote:
Ravi's patch is in, but a similar problem remains, and the test cannot be put back into its place.
It seems that while Vdsm was taken down, a couple of getCapsAsync [...]
- Why is VDSM down?
vdsm_recovery test, I presume.

On Tue, Apr 24, 2018 at 7:00 AM, Dan Kenigsberg <danken@redhat.com> wrote:
Ravi's patch is in, but a similar problem remains, and the test cannot be put back into its place.
It seems that while Vdsm was taken down, a couple of getCapsAsync requests queued up. At one point, the host resumed its connection, before the requests have been cleared from the queue. After the host is up, the following tests resume, and at a pseudorandom point in time, an old getCapsAsync request times out and kills our connection.
I believe that as long as ANY request is in flight, the monitoring lock should not be released, and the host should not be declared as up.
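A minimal sketch of the gating rule proposed above, assuming a simple per-host counter of in-flight getCapsAsync requests; the names here (InFlightGateSketch, requestSent, requestFinished) are made up for illustration and do not exist in the engine code.

import java.util.concurrent.atomic.AtomicInteger;

/**
 * Sketch: as long as any getCapsAsync request is still in flight, the
 * monitoring lock stays held and the host is not declared Up.
 */
public class InFlightGateSketch {

    private final AtomicInteger inFlight = new AtomicInteger(0);
    private volatile boolean hostUp = false;

    /** Called right before a getCapsAsync request is sent. */
    public void requestSent() {
        inFlight.incrementAndGet();
    }

    /** Called when a request completes, fails, or times out. */
    public void requestFinished(boolean success) {
        int remaining = inFlight.decrementAndGet();
        if (remaining == 0) {
            // Only now is it safe to release the monitoring lock and,
            // if the last answer was good, to consider declaring the host Up.
            releaseMonitoringLock();
            hostUp = success;
        }
    }

    public boolean isHostUp() {
        return hostUp;
    }

    private void releaseMonitoringLock() {
        System.out.println("monitoring lock released (no requests in flight)");
    }

    public static void main(String[] args) {
        InFlightGateSketch gate = new InFlightGateSketch();
        gate.requestSent();          // old request, queued while Vdsm was down
        gate.requestSent();          // fresh request after reconnect
        gate.requestFinished(true);  // fresh request succeeds, but one is left
        System.out.println("host up? " + gate.isHostUp());  // false
        gate.requestFinished(false); // stale request finally times out
        System.out.println("host up? " + gate.isHostUp());  // still false
    }
}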
Hi Dan, Can I have the link to the job on jenkins so I can look at the logs
This [1] should fix the multiple release lock issue
[1] https://gerrit.ovirt.org/#/c/90077/
On Tue, Apr 10, 2018 at 3:53 PM, Ravi Shankar Nori <rnori@redhat.com> wrote:
Working on a patch, will post a fix.
Thanks
Ravi
On Tue, Apr 10, 2018 at 9:14 AM, Alona Kaplan <alkaplan@redhat.com>
wrote:
Hi all,
Looking at the log it seems that the new GetCapabilitiesAsync is responsible for the mess.
- 08:29:47 - engine loses connectivity to host 'lago-basic-suite-4-2-host-0'.
- Every 3 seconds a getCapabilitiesAsync request is sent to the host (unsuccessfully).
* before each "getCapabilitiesAsync" the monitoring lock is taken (VdsManager.refreshImpl)
* "getCapabilitiesAsync" immediately fails and throws 'VDSNetworkException: java.net.ConnectException: Connection refused'.
The
exception is caught by 'GetCapabilitiesAsyncVDSCommand.executeVdsBrokerCommand' which calls 'onFailure' of the callback and re-throws the exception.
catch (Throwable t) { getParameters().getCallback().onFailure(t); throw t; }
* The 'onFailure' of the callback releases the "monitoringLock" ('postProcessRefresh()->afterRefreshTreatment()-> if (!succeeded) lockManager.releaseLock(monitoringLock);')
* 'VdsManager.refreshImpl' catches the network exception, marks 'releaseLock = true' and tries to release the already released lock.
The following warning is printed to the log -
WARN [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (EE-ManagedThreadFactory-engineScheduled-Thread-53) [] Trying to release exclusive lock which does not exist, lock key: 'ecf53d69-eb68-4b11-8df2-c4aa4e19bd93VDS_INIT'
- 08:30:51 a successful getCapabilitiesAsync is sent.
- 08:32:55 - The failing test starts (Setup Networks for setting ipv6).
* SetupNetworks takes the monitoring lock.
- 08:33:00 - ResponseTracker cleans the getCapabilitiesAsync requests from 4 minutes ago from its queue and prints a VDSNetworkException: Vds timeout occured.
* When the first request is removed from the queue ('ResponseTracker.remove()'), the 'Callback.onFailure' is invoked (for the second time) -> the monitoring lock is released (the lock taken by the SetupNetworks!).
* The other requests removed from the queue also try to release the monitoring lock, but there is nothing to release.
* The following warning log is printed - WARN [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (EE-ManagedThreadFactory-engineScheduled-Thread-14) [] Trying to release exclusive lock which does not exist, lock key: 'ecf53d69-eb68-4b11-8df2-c4aa4e19bd93VDS_INIT'
- 08:33:00 - SetupNetworks fails on timeout ~4 seconds after it started. Why? I'm not 100% sure, but I guess the late processing of the 'getCapabilitiesAsync', which causes the loss of the monitoring lock, and the late + multiple processing of the failure are the root cause.
Ravi, the 'getCapabilitiesAsync' failure is handled twice and there are three attempts to release the lock. Please share your opinion on how it should be fixed.
Thanks,
Alona.
On Sun, Apr 8, 2018 at 1:21 PM, Dan Kenigsberg <danken@redhat.com> wrote:
On Sun, Apr 8, 2018 at 9:21 AM, Edward Haas <ehaas@redhat.com> wrote:
On Sun, Apr 8, 2018 at 9:15 AM, Eyal Edri <eedri@redhat.com> wrote:
> Was already done by Yaniv - https://gerrit.ovirt.org/#/c/89851.
> Is it still failing?
>
> On Sun, Apr 8, 2018 at 8:59 AM, Barak Korren <bkorren@redhat.com> wrote:
>>
>> On 7 April 2018 at 00:30, Dan Kenigsberg <danken@redhat.com> wrote:
>> > No, I am afraid that we have not managed to understand why setting
>> > an ipv6 address took the host off the grid. We shall continue
>> > researching this next week.
>> >
>> > Edy, https://gerrit.ovirt.org/#/c/88637/ is already 4 weeks old, but
>> > could it possibly be related (I really doubt that)?
Sorry, but I do not see how this problem is related to VDSM. There is nothing that indicates that there is a VDSM problem.
Has the RPC connection between Engine and VDSM failed?
Further up the thread, Piotr noticed that (at least on one failure of this test) the Vdsm host lost connectivity to its storage, and the Vdsm process was restarted. However, this does not seem to happen in all cases where this test fails.

On Tue, Apr 24, 2018 at 3:17 PM, Ravi Shankar Nori <rnori@redhat.com> wrote:
On Tue, Apr 24, 2018 at 7:00 AM, Dan Kenigsberg <danken@redhat.com> wrote:
Ravi's patch is in, but a similar problem remains, and the test cannot be put back into its place.
It seems that while Vdsm was taken down, a couple of getCapsAsync requests queued up. At one point, the host resumed its connection, before the requests have been cleared from the queue. After the host is up, the following tests resume, and at a pseudorandom point in time, an old getCapsAsync request times out and kills our connection.
I believe that as long as ANY request is in flight, the monitoring lock should not be released, and the host should not be declared as up.
Hi Dan,
Can I have the link to the job on jenkins so I can look at the logs
http://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/346/
This [1] should fix the multiple release lock issue
[1] https://gerrit.ovirt.org/#/c/90077/
On Tue, Apr 10, 2018 at 3:53 PM, Ravi Shankar Nori <rnori@redhat.com> wrote:
Working on a patch, will post a fix.
Thanks
Ravi
On Tue, Apr 10, 2018 at 9:14 AM, Alona Kaplan <alkaplan@redhat.com>
wrote:
Hi all,
Looking at the log it seems that the new GetCapabilitiesAsync is responsible for the mess.
- 08:29:47 - engine loses connectivity to host 'lago-basic-suite-4-2-host-0'.
- Every 3 seconds a getCapabilitiesAsync request is sent to the host (unsuccessfully).
* before each "getCapabilitiesAsync" the monitoring lock is taken (VdsManager.refreshImpl)
* "getCapabilitiesAsync" immediately fails and throws 'VDSNetworkException: java.net.ConnectException: Connection refused'.
The
exception is caught by 'GetCapabilitiesAsyncVDSCommand.executeVdsBrokerCommand' which calls 'onFailure' of the callback and re-throws the exception.
catch (Throwable t) { getParameters().getCallback().onFailure(t); throw t; }
* The 'onFailure' of the callback releases the "monitoringLock" ('postProcessRefresh()->afterRefreshTreatment()-> if (!succeeded) lockManager.releaseLock(monitoringLock);')
* 'VdsManager.refreshImpl' catches the network exception, marks 'releaseLock = true' and tries to release the already released lock.
The following warning is printed to the log -
WARN [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (EE-ManagedThreadFactory-engineScheduled-Thread-53) [] Trying to release exclusive lock which does not exist, lock key: 'ecf53d69-eb68-4b11-8df2-c4aa4e19bd93VDS_INIT'
- 08:30:51 a successful getCapabilitiesAsync is sent.
- 08:32:55 - The failing test starts (Setup Networks for setting ipv6).
* SetupNetworks takes the monitoring lock.
- 08:33:00 - ResponseTracker cleans the getCapabilitiesAsync requests from 4 minutes ago from its queue and prints a VDSNetworkException: Vds timeout occured.
* When the first request is removed from the queue ('ResponseTracker.remove()'), the 'Callback.onFailure' is invoked (for the second time) -> the monitoring lock is released (the lock taken by the SetupNetworks!).
* The other requests removed from the queue also try to release the monitoring lock, but there is nothing to release.
* The following warning log is printed - WARN [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (EE-ManagedThreadFactory-engineScheduled-Thread-14) [] Trying to release exclusive lock which does not exist, lock key: 'ecf53d69-eb68-4b11-8df2-c4aa4e19bd93VDS_INIT'
- 08:33:00 - SetupNetworks fails on timeout ~4 seconds after it started. Why? I'm not 100% sure, but I guess the late processing of the 'getCapabilitiesAsync', which causes the loss of the monitoring lock, and the late + multiple processing of the failure are the root cause.
Ravi, the 'getCapabilitiesAsync' failure is handled twice and there are three attempts to release the lock. Please share your opinion on how it should be fixed.
Thanks,
Alona.
On Sun, Apr 8, 2018 at 1:21 PM, Dan Kenigsberg <danken@redhat.com> wrote:
On Sun, Apr 8, 2018 at 9:21 AM, Edward Haas <ehaas@redhat.com> wrote:
>
> On Sun, Apr 8, 2018 at 9:15 AM, Eyal Edri <eedri@redhat.com> wrote:
>> Was already done by Yaniv - https://gerrit.ovirt.org/#/c/89851.
>> Is it still failing?
>>
>> On Sun, Apr 8, 2018 at 8:59 AM, Barak Korren <bkorren@redhat.com> wrote:
>>>
>>> On 7 April 2018 at 00:30, Dan Kenigsberg <danken@redhat.com> wrote:
>>> > No, I am afraid that we have not managed to understand why setting
>>> > an ipv6 address took the host off the grid. We shall continue
>>> > researching this next week.
>>> >
>>> > Edy, https://gerrit.ovirt.org/#/c/88637/ is already 4 weeks old, but
>>> > could it possibly be related (I really doubt that)?
Further up the thread, Piotr noticed that (at least on one failure of this test) the Vdsm host lost connectivity to its storage, and the Vdsm process was restarted. However, this does not seem to happen in all cases where this test fails.
-- Martin Perina Associate Manager, Software Engineering Red Hat Czech s.r.o.

On Tue, Apr 24, 2018 at 9:24 AM, Martin Perina <mperina@redhat.com> wrote:
On Tue, Apr 24, 2018 at 3:17 PM, Ravi Shankar Nori <rnori@redhat.com> wrote:
On Tue, Apr 24, 2018 at 7:00 AM, Dan Kenigsberg <danken@redhat.com> wrote:
Ravi's patch is in, but a similar problem remains, and the test cannot be put back into its place.
It seems that while Vdsm was taken down, a couple of getCapsAsync requests queued up. At one point, the host resumed its connection, before the requests have been cleared from the queue. After the host is up, the following tests resume, and at a pseudorandom point in time, an old getCapsAsync request times out and kills our connection.
I believe that as long as ANY request is in flight, the monitoring lock should not be released, and the host should not be declared as up.
Hi Dan,
Can I have the link to the job on jenkins so I can look at the logs
http://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/346/
From the logs, the only VDS lock that is being released twice is the VDS_FENCE lock. Opened a BZ [1] for it. Will post a fix
[1] https://bugzilla.redhat.com/show_bug.cgi?id=1571300
This [1] should fix the multiple release lock issue
[1] https://gerrit.ovirt.org/#/c/90077/
On Tue, Apr 10, 2018 at 3:53 PM, Ravi Shankar Nori <rnori@redhat.com> wrote:
Working on a patch, will post a fix.
Thanks
Ravi
On Tue, Apr 10, 2018 at 9:14 AM, Alona Kaplan <alkaplan@redhat.com>
wrote:
Hi all,
Looking at the log it seems that the new GetCapabilitiesAsync is responsible for the mess.
- 08:29:47 - engine loses connectivity to host 'lago-basic-suite-4-2-host-0'.
- Every 3 seconds a getCapabilitiesAsync request is sent to the host (unsuccessfully).
* before each "getCapabilitiesAsync" the monitoring lock is taken (VdsManager.refreshImpl)
* "getCapabilitiesAsync" immediately fails and throws 'VDSNetworkException: java.net.ConnectException: Connection refused'. The exception is caught by 'GetCapabilitiesAsyncVDSCommand.executeVdsBrokerCommand' which calls 'onFailure' of the callback and re-throws the exception.
catch (Throwable t) { getParameters().getCallback().onFailure(t); throw t; }
* The 'onFailure' of the callback releases the "monitoringLock" ('postProcessRefresh()->afterRefreshTreatment()-> if (!succeeded) lockManager.releaseLock(monitoringLock);')
* 'VdsManager.refreshImpl' catches the network exception, marks 'releaseLock = true' and tries to release the already released lock.
The following warning is printed to the log -
WARN [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (EE-ManagedThreadFactory-engineScheduled-Thread-53) [] Trying to release exclusive lock which does not exist, lock key: 'ecf53d69-eb68-4b11-8df2-c4aa4e19bd93VDS_INIT'
- 08:30:51 a successful getCapabilitiesAsync is sent.
- 08:32:55 - The failing test starts (Setup Networks for setting ipv6).
* SetupNetworks takes the monitoring lock.
- 08:33:00 - ResponseTracker cleans the getCapabilitiesAsync requests from 4 minutes ago from its queue and prints a VDSNetworkException: Vds timeout occured.
* When the first request is removed from the queue ('ResponseTracker.remove()'), the 'Callback.onFailure' is invoked (for the second time) -> the monitoring lock is released (the lock taken by the SetupNetworks!).
* The other requests removed from the queue also try to release the monitoring lock, but there is nothing to release.
* The following warning log is printed - WARN [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (EE-ManagedThreadFactory-engineScheduled-Thread-14) [] Trying to release exclusive lock which does not exist, lock key: 'ecf53d69-eb68-4b11-8df2-c4aa4e19bd93VDS_INIT'
- 08:33:00 - SetupNetworks fails on timeout ~4 seconds after it started. Why? I'm not 100% sure, but I guess the late processing of the 'getCapabilitiesAsync', which causes the loss of the monitoring lock, and the late + multiple processing of the failure are the root cause.
Ravi, the 'getCapabilitiesAsync' failure is handled twice and there are three attempts to release the lock. Please share your opinion on how it should be fixed.
Thanks,
Alona.
On Sun, Apr 8, 2018 at 1:21 PM, Dan Kenigsberg <danken@redhat.com> wrote:
>
> On Sun, Apr 8, 2018 at 9:21 AM, Edward Haas <ehaas@redhat.com> wrote:
>>
>> On Sun, Apr 8, 2018 at 9:15 AM, Eyal Edri <eedri@redhat.com> wrote:
>>> Was already done by Yaniv - https://gerrit.ovirt.org/#/c/89851.
>>> Is it still failing?
>>>
>>> On Sun, Apr 8, 2018 at 8:59 AM, Barak Korren <bkorren@redhat.com> wrote:
>>>>
>>>> On 7 April 2018 at 00:30, Dan Kenigsberg <danken@redhat.com> wrote:
>>>> > No, I am afraid that we have not managed to understand why setting
>>>> > an ipv6 address took the host off the grid. We shall continue
>>>> > researching this next week.
>>>> >
>>>> > Edy, https://gerrit.ovirt.org/#/c/88637/ is already 4 weeks old, but
>>>> > could it possibly be related (I really doubt that)?
>>
>> Sorry, but I do not see how this problem is related to VDSM.
>> There is nothing that indicates that there is a VDSM problem.
>>
>> Has the RPC connection between Engine and VDSM failed?
>
> Further up the thread, Piotr noticed that (at least on one failure of
> this test) the Vdsm host lost connectivity to its storage, and the Vdsm
> process was restarted. However, this does not seem to happen in all cases
> where this test fails.
-- Martin Perina Associate Manager, Software Engineering Red Hat Czech s.r.o.

On Tue, Apr 24, 2018 at 4:36 PM, Ravi Shankar Nori <rnori@redhat.com> wrote:
On Tue, Apr 24, 2018 at 9:24 AM, Martin Perina <mperina@redhat.com> wrote:
On Tue, Apr 24, 2018 at 3:17 PM, Ravi Shankar Nori <rnori@redhat.com> wrote:
On Tue, Apr 24, 2018 at 7:00 AM, Dan Kenigsberg <danken@redhat.com> wrote:
Ravi's patch is in, but a similar problem remains, and the test cannot be put back into its place.
It seems that while Vdsm was taken down, a couple of getCapsAsync requests queued up. At one point, the host resumed its connection, before the requests have been cleared from the queue. After the host is up, the following tests resume, and at a pseudorandom point in time, an old getCapsAsync request times out and kills our connection.
I believe that as long as ANY request is in flight, the monitoring lock should not be released, and the host should not be declared as up.
Would you relate to this analysis ^^^ ?
Hi Dan,
Can I have the link to the job on jenkins so I can look at the logs
http://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/346/
From the logs the only VDS lock that is being released twice is VDS_FENCE lock. Opened a BZ [1] for it. Will post a fix
Can this possibly cause a surprise termination of host connection?

On Tue, Apr 24, 2018 at 9:47 AM, Dan Kenigsberg <danken@redhat.com> wrote:
On Tue, Apr 24, 2018 at 4:36 PM, Ravi Shankar Nori <rnori@redhat.com> wrote:
On Tue, Apr 24, 2018 at 9:24 AM, Martin Perina <mperina@redhat.com>
wrote:
On Tue, Apr 24, 2018 at 3:17 PM, Ravi Shankar Nori <rnori@redhat.com> wrote:
On Tue, Apr 24, 2018 at 7:00 AM, Dan Kenigsberg <danken@redhat.com> wrote:
Ravi's patch is in, but a similar problem remains, and the test cannot be put back into its place.
It seems that while Vdsm was taken down, a couple of getCapsAsync requests queued up. At one point, the host resumed its connection, before the requests have been cleared from the queue. After the host is up, the following tests resume, and at a pseudorandom point in time, an old getCapsAsync request times out and kills our connection.
I believe that as long as ANY request is in flight, the monitoring lock should not be released, and the host should not be declared as up.
Would you relate to this analysis ^^^ ?
The HostMonitoring lock issue has been fixed by https://gerrit.ovirt.org/#/c/90189/
Hi Dan,
Can I have the link to the job on jenkins so I can look at the logs
http://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/346/
From the logs the only VDS lock that is being released twice is VDS_FENCE lock. Opened a BZ [1] for it. Will post a fix
Can this possibly cause a surprise termination of host connection?
Not sure, from the logs VDS_FENCE is the only other VDS lock that is being released

On Tue, Apr 24, 2018 at 5:09 PM, Ravi Shankar Nori <rnori@redhat.com> wrote:
On Tue, Apr 24, 2018 at 9:47 AM, Dan Kenigsberg <danken@redhat.com> wrote:
On Tue, Apr 24, 2018 at 4:36 PM, Ravi Shankar Nori <rnori@redhat.com> wrote:
On Tue, Apr 24, 2018 at 9:24 AM, Martin Perina <mperina@redhat.com> wrote:
On Tue, Apr 24, 2018 at 3:17 PM, Ravi Shankar Nori <rnori@redhat.com> wrote:
On Tue, Apr 24, 2018 at 7:00 AM, Dan Kenigsberg <danken@redhat.com> wrote:
Ravi's patch is in, but a similar problem remains, and the test cannot be put back into its place.
It seems that while Vdsm was taken down, a couple of getCapsAsync requests queued up. At one point, the host resumed its connection, before the requests have been cleared from the queue. After the host is up, the following tests resume, and at a pseudorandom point in time, an old getCapsAsync request times out and kills our connection.
I believe that as long as ANY request is in flight, the monitoring lock should not be released, and the host should not be declared as up.
Would you relate to this analysis ^^^ ?
The HostMonitoring lock issue has been fixed by https://gerrit.ovirt.org/#/c/90189/
Is there still a chance that a host moves to Up while former getCapsAsync requests are still in flight?
Hi Dan,
Can I have the link to the job on jenkins so I can look at the logs
http://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/346/
From the logs the only VDS lock that is being released twice is VDS_FENCE lock. Opened a BZ [1] for it. Will post a fix
Can this possibly cause a surprise termination of host connection?
Not sure, from the logs VDS_FENCE is the only other VDS lock that is being released

On Tue, Apr 24, 2018 at 10:29 AM, Dan Kenigsberg <danken@redhat.com> wrote:
On Tue, Apr 24, 2018 at 5:09 PM, Ravi Shankar Nori <rnori@redhat.com> wrote:
On Tue, Apr 24, 2018 at 9:47 AM, Dan Kenigsberg <danken@redhat.com>
wrote:
On Tue, Apr 24, 2018 at 4:36 PM, Ravi Shankar Nori <rnori@redhat.com> wrote:
On Tue, Apr 24, 2018 at 9:24 AM, Martin Perina <mperina@redhat.com> wrote:
On Tue, Apr 24, 2018 at 3:17 PM, Ravi Shankar Nori <rnori@redhat.com> wrote:
On Tue, Apr 24, 2018 at 7:00 AM, Dan Kenigsberg <danken@redhat.com> wrote:
> Ravi's patch is in, but a similar problem remains, and the test cannot
> be put back into its place.
>
> It seems that while Vdsm was taken down, a couple of getCapsAsync
> requests queued up. At one point, the host resumed its connection,
> before the requests have been cleared from the queue. After the host is
> up, the following tests resume, and at a pseudorandom point in time,
> an old getCapsAsync request times out and kills our connection.
>
> I believe that as long as ANY request is in flight, the monitoring
> lock should not be released, and the host should not be declared as up.
Would you relate to this analysis ^^^ ?
The HostMonitoring lock issue has been fixed by https://gerrit.ovirt.org/#/c/90189/
Is there still a chance that a host moves to Up while former getCapsAsync requests are still in flight?
Should not happen. Is there a way to execute/reproduce the failing test on Dev env?
Hi Dan,
Can I have the link to the job on jenkins so I can look at the logs
http://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/346/
From the logs the only VDS lock that is being released twice is VDS_FENCE lock. Opened a BZ [1] for it. Will post a fix
Can this possibly cause a surprise termination of host connection?
Not sure, from the logs VDS_FENCE is the only other VDS lock that is being released

On Tue, Apr 24, 2018 at 10:46 AM, Ravi Shankar Nori <rnori@redhat.com> wrote:
On Tue, Apr 24, 2018 at 10:29 AM, Dan Kenigsberg <danken@redhat.com> wrote:
On Tue, Apr 24, 2018 at 5:09 PM, Ravi Shankar Nori <rnori@redhat.com> wrote:
On Tue, Apr 24, 2018 at 9:47 AM, Dan Kenigsberg <danken@redhat.com>
wrote:
On Tue, Apr 24, 2018 at 4:36 PM, Ravi Shankar Nori <rnori@redhat.com> wrote:
On Tue, Apr 24, 2018 at 9:24 AM, Martin Perina <mperina@redhat.com> wrote:
On Tue, Apr 24, 2018 at 3:17 PM, Ravi Shankar Nori <rnori@redhat.com> wrote:
>
> On Tue, Apr 24, 2018 at 7:00 AM, Dan Kenigsberg <danken@redhat.com> wrote:
>> Ravi's patch is in, but a similar problem remains, and the test cannot
>> be put back into its place.
>>
>> It seems that while Vdsm was taken down, a couple of getCapsAsync
>> requests queued up. At one point, the host resumed its connection,
>> before the requests have been cleared from the queue. After the host is
>> up, the following tests resume, and at a pseudorandom point in time,
>> an old getCapsAsync request times out and kills our connection.
>>
>> I believe that as long as ANY request is in flight, the monitoring
>> lock should not be released, and the host should not be declared as up.
Would you relate to this analysis ^^^ ?
The HostMonitoring lock issue has been fixed by https://gerrit.ovirt.org/#/c/90189/
Is there still a chance that a host moves to Up while former getCapsAsync requests are still in flight?
Should not happen. Is there a way to execute/reproduce the failing test on Dev env?
>> Hi Dan,
>>
>> Can I have the link to the job on jenkins so I can look at the logs
>
> http://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/346/
From the logs the only VDS lock that is being released twice is VDS_FENCE lock. Opened a BZ [1] for it. Will post a fix
Can this possibly cause a surprise termination of host connection?
Not sure, from the logs VDS_FENCE is the only other VDS lock that is being released
It would be helpful if I could get the exact flow that is failing, and also the steps, if any, needed to reproduce the issue.

On Tue, Apr 24, 2018 at 10:27 PM, Ravi Shankar Nori <rnori@redhat.com> wrote:
On Tue, Apr 24, 2018 at 10:46 AM, Ravi Shankar Nori <rnori@redhat.com> wrote:
On Tue, Apr 24, 2018 at 10:29 AM, Dan Kenigsberg <danken@redhat.com> wrote:
On Tue, Apr 24, 2018 at 5:09 PM, Ravi Shankar Nori <rnori@redhat.com> wrote:
On Tue, Apr 24, 2018 at 9:47 AM, Dan Kenigsberg <danken@redhat.com> wrote:
On Tue, Apr 24, 2018 at 4:36 PM, Ravi Shankar Nori <rnori@redhat.com> wrote:
On Tue, Apr 24, 2018 at 9:24 AM, Martin Perina <mperina@redhat.com> wrote:
>
> On Tue, Apr 24, 2018 at 3:17 PM, Ravi Shankar Nori <rnori@redhat.com> wrote:
>>
>> On Tue, Apr 24, 2018 at 7:00 AM, Dan Kenigsberg <danken@redhat.com> wrote:
>>> Ravi's patch is in, but a similar problem remains, and the test cannot
>>> be put back into its place.
>>>
>>> It seems that while Vdsm was taken down, a couple of getCapsAsync
>>> requests queued up. At one point, the host resumed its connection,
>>> before the requests have been cleared from the queue. After the host is
>>> up, the following tests resume, and at a pseudorandom point in time,
>>> an old getCapsAsync request times out and kills our connection.
>>>
>>> I believe that as long as ANY request is in flight, the monitoring
>>> lock should not be released, and the host should not be declared as up.
Would you relate to this analysis ^^^ ?
The HostMonitoring lock issue has been fixed by https://gerrit.ovirt.org/#/c/90189/
Is there still a chance that a host moves to Up while former getCapsAsync requests are still in flight?
Should not happen. Is there a way to execute/reproduce the failing test on Dev env?
>> Hi Dan,
>>
>> Can I have the link to the job on jenkins so I can look at the logs
>
> http://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/346/
From the logs the only VDS lock that is being released twice is VDS_FENCE lock. Opened a BZ [1] for it. Will post a fix
Can this possibly cause a surprise termination of host connection?
Not sure, from the logs VDS_FENCE is the only other VDS lock that is being released
It would be helpful if I could get the exact flow that is failing, and also the steps, if any, needed to reproduce the issue.
By now the logs of http://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/346/ have been garbage-collected, so I cannot point you to the location in the logs. Maybe Alona has a local copy. According to her analysis the issue manifests itself when setupNetworks follows a vdsm restart.
Have you tried running OST with prepare_migration_attachments_ipv6 reintroduced? It should always pass.
Regards,
Dan.

On Tue, Apr 24, 2018 at 4:17 PM, Ravi Shankar Nori <rnori@redhat.com> wrote:
On Tue, Apr 24, 2018 at 7:00 AM, Dan Kenigsberg <danken@redhat.com> wrote:
Ravi's patch is in, but a similar problem remains, and the test cannot be put back into its place.
It seems that while Vdsm was taken down, a couple of getCapsAsync requests queued up. At one point, the host resumed its connection, before the requests have been cleared from the queue. After the host is up, the following tests resume, and at a pseudorandom point in time, an old getCapsAsync request times out and kills our connection.
I believe that as long as ANY request is in flight, the monitoring lock should not be released, and the host should not be declared as up.
Hi Dan,
Can I have the link to the job on jenkins so I can look at the logs
We disabled a network test that started failing after getCapsAsync was merged. Please own its re-introduction to OST: https://gerrit.ovirt.org/#/c/90264/
Its most recent failure http://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/346/ has been discussed by Alona and Piotr over IRC.

On Tue, Apr 24, 2018 at 3:28 PM, Dan Kenigsberg <danken@redhat.com> wrote:
On Tue, Apr 24, 2018 at 4:17 PM, Ravi Shankar Nori <rnori@redhat.com> wrote:
On Tue, Apr 24, 2018 at 7:00 AM, Dan Kenigsberg <danken@redhat.com>
wrote:
Ravi's patch is in, but a similar problem remains, and the test cannot be put back into its place.
It seems that while Vdsm was taken down, a couple of getCapsAsync requests queued up. At one point, the host resumed its connection, before the requests have been cleared from the queue. After the host is up, the following tests resume, and at a pseudorandom point in time, an old getCapsAsync request times out and kills our connection.
I believe that as long as ANY request is in flight, the monitoring lock should not be released, and the host should not be declared as up.
Hi Dan,
Can I have the link to the job on jenkins so I can look at the logs
We disabled a network test that started failing after getCapsAsync was merged. Please own its re-introduction to OST: https://gerrit.ovirt.org/#/c/90264/
Its most recent failure http://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/346/ has been discussed by Alona and Piotr over IRC.
So https://bugzilla.redhat.com/1571768 was created to cover this issue, discovered during Alona's and Piotr's conversation. But after further discussion we have found out that this issue is not related to the non-blocking thread changes in engine 4.2; this behavior has existed since the beginning of vdsm-jsonrpc-java. Ravi will continue verifying the fix for BZ1571768 along with the other locking changes he already posted, to see if they will help the network OST to succeed.
But the fix for BZ1571768 is too dangerous for 4.2.3, so let's try to fix that on master and see whether it introduces any regressions. If it doesn't, then we can backport it to 4.2.4.
-- Martin Perina Associate Manager, Software Engineering Red Hat Czech s.r.o.

On Wed, Apr 25, 2018 at 5:57 PM, Martin Perina <mperina@redhat.com> wrote:
On Tue, Apr 24, 2018 at 3:28 PM, Dan Kenigsberg <danken@redhat.com> wrote:
On Tue, Apr 24, 2018 at 4:17 PM, Ravi Shankar Nori <rnori@redhat.com> wrote:
On Tue, Apr 24, 2018 at 7:00 AM, Dan Kenigsberg <danken@redhat.com> wrote:
Ravi's patch is in, but a similar problem remains, and the test cannot be put back into its place.
It seems that while Vdsm was taken down, a couple of getCapsAsync requests queued up. At one point, the host resumed its connection, before the requests have been cleared from the queue. After the host is up, the following tests resume, and at a pseudorandom point in time, an old getCapsAsync request times out and kills our connection.
I believe that as long as ANY request is in flight, the monitoring lock should not be released, and the host should not be declared as up.
Hi Dan,
Can I have the link to the job on jenkins so I can look at the logs
We disabled a network test that started failing after getCapsAsync was merged. Please own its re-introduction to OST: https://gerrit.ovirt.org/#/c/90264/
Its most recent failure http://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/346/ has been discussed by Alona and Piotr over IRC.
So https://bugzilla.redhat.com/1571768 was created to cover this issue, discovered during Alona's and Piotr's conversation. But after further discussion we have found out that this issue is not related to the non-blocking thread changes in engine 4.2; this behavior has existed since the beginning of vdsm-jsonrpc-java. Ravi will continue verifying the fix for BZ1571768 along with the other locking changes he already posted, to see if they will help the network OST to succeed.
But the fix for BZ1571768 is too dangerous for 4.2.3, so let's try to fix that on master and see whether it introduces any regressions. If it doesn't, then we can backport it to 4.2.4.
I sense that there is a regression in connection management that coincided with the introduction of async monitoring. I am not alone: Gal Ben Haim was reluctant to take our test back. Do you think that it is now safe to take it in https://gerrit.ovirt.org/#/c/90264/? I'd appreciate your support there. I don't want any test to be skipped without a very good reason.

On Wed, Apr 25, 2018 at 10:57 AM, Martin Perina <mperina@redhat.com> wrote:
On Tue, Apr 24, 2018 at 3:28 PM, Dan Kenigsberg <danken@redhat.com> wrote:
On Tue, Apr 24, 2018 at 4:17 PM, Ravi Shankar Nori <rnori@redhat.com> wrote:
On Tue, Apr 24, 2018 at 7:00 AM, Dan Kenigsberg <danken@redhat.com>
wrote:
Ravi's patch is in, but a similar problem remains, and the test cannot be put back into its place.
It seems that while Vdsm was taken down, a couple of getCapsAsync requests queued up. At one point, the host resumed its connection, before the requests have been cleared from the queue. After the host is up, the following tests resume, and at a pseudorandom point in time, an old getCapsAsync request times out and kills our connection.
I believe that as long as ANY request is in flight, the monitoring lock should not be released, and the host should not be declared as up.
Hi Dan,
Can I have the link to the job on jenkins so I can look at the logs
We disabled a network test that started failing after getCapsAsync was merged. Please own its re-introduction to OST: https://gerrit.ovirt.org/#/c/90264/
Its most recent failure http://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/346/ has been discussed by Alona and Piotr over IRC.
So https://bugzilla.redhat.com/1571768 was created to cover this issue, discovered during Alona's and Piotr's conversation. But after further discussion we have found out that this issue is not related to the non-blocking thread changes in engine 4.2; this behavior has existed since the beginning of vdsm-jsonrpc-java. Ravi will continue verifying the fix for BZ1571768 along with the other locking changes he already posted, to see if they will help the network OST to succeed.
But the fix for BZ1571768 is too dangerous for 4.2.3, so let's try to fix that on master and see whether it introduces any regressions. If it doesn't, then we can backport it to 4.2.4.
-- Martin Perina Associate Manager, Software Engineering Red Hat Czech s.r.o.
Posted a vdsm-jsonrpc-java patch [1] for BZ 1571768 [2] which fixes the OST issue with enabling 006_migrations.prepare_migration_attachments_ipv6.
I ran OST with the vdsm-jsonrpc-java patch [1] and the patch to add back 006_migrations.prepare_migration_attachments_ipv6 [3], and the jobs succeeded thrice [4][5][6].
[1] https://gerrit.ovirt.org/#/c/90646/
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1571768
[3] https://gerrit.ovirt.org/#/c/90264/
[4] http://jenkins.ovirt.org/job/ovirt-system-tests_manual/2643/
[5] http://jenkins.ovirt.org/job/ovirt-system-tests_manual/2644/
[6] http://jenkins.ovirt.org/job/ovirt-system-tests_manual/2645/
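The thread does not spell out how the vdsm-jsonrpc-java patch works, so the following is only a toy illustration of the failure mode being fixed (a stale request timing out and tearing down a newly established connection), not a description of the actual patch. All names here (ResponseTrackerSketch, connectionGeneration, sweep) are invented.

import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Toy tracker: each pending request remembers which connection "generation"
 * it was sent on, so a leftover from a previous connection cannot cause the
 * current, healthy connection to be closed when it finally times out.
 */
public class ResponseTrackerSketch {

    private static final long TIMEOUT_MILLIS = 180_000; // 3 minutes

    private static final class Pending {
        final long sentAtMillis;
        final int connectionGeneration;

        Pending(long sentAtMillis, int connectionGeneration) {
            this.sentAtMillis = sentAtMillis;
            this.connectionGeneration = connectionGeneration;
        }
    }

    private final Map<String, Pending> pending = new ConcurrentHashMap<>();
    private int generation = 0;

    /** Remember an outstanding request together with the connection it used. */
    public void track(String requestId, long sentAtMillis) {
        pending.put(requestId, new Pending(sentAtMillis, generation));
    }

    /** Called when the connection to the host is re-established. */
    public void connectionReestablished() {
        generation++;
    }

    /**
     * Periodic sweep of timed-out requests. Returns true only if a request
     * that belongs to the current connection timed out; leftovers from a
     * previous connection are reported to their callers but do not cause the
     * current connection to be closed.
     */
    public boolean sweep(long nowMillis) {
        boolean closeCurrentConnection = false;
        Iterator<Map.Entry<String, Pending>> it = pending.entrySet().iterator();
        while (it.hasNext()) {
            Pending p = it.next().getValue();
            if (nowMillis - p.sentAtMillis < TIMEOUT_MILLIS) {
                continue;
            }
            it.remove();
            if (p.connectionGeneration == generation) {
                closeCurrentConnection = true;
            }
        }
        return closeCurrentConnection;
    }

    public static void main(String[] args) {
        ResponseTrackerSketch tracker = new ResponseTrackerSketch();
        tracker.track("old-getCapsAsync", 0L);       // sent while the host was down
        tracker.connectionReestablished();           // host reconnected
        tracker.track("new-getCapsAsync", 200_000L); // sent on the new connection
        // At t=200s the old request is past its timeout, the new one is not:
        // the stale timeout is swallowed instead of killing the connection.
        System.out.println("close current connection? " + tracker.sweep(200_000L));
    }
}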

Ravi/Piotr, so what's the connection between non-blocking threads, the jsonrpc-java connection closing, and this network test failing? Does it mean that the non-blocking threads change just revealed a jsonrpc-java issue which we hadn't noticed before? And did the test really work with the code prior to the non-blocking threads changes, and are we missing something else?
On Wed, 25 Apr 2018, 18:21 Ravi Shankar Nori, <rnori@redhat.com> wrote:
On Wed, Apr 25, 2018 at 10:57 AM, Martin Perina <mperina@redhat.com> wrote:
On Tue, Apr 24, 2018 at 3:28 PM, Dan Kenigsberg <danken@redhat.com> wrote:
On Tue, Apr 24, 2018 at 4:17 PM, Ravi Shankar Nori <rnori@redhat.com> wrote:
On Tue, Apr 24, 2018 at 7:00 AM, Dan Kenigsberg <danken@redhat.com>
wrote:
Ravi's patch is in, but a similar problem remains, and the test cannot be put back into its place.
It seems that while Vdsm was taken down, a couple of getCapsAsync requests queued up. At one point, the host resumed its connection, before the requests have been cleared from the queue. After the host is up, the following tests resume, and at a pseudorandom point in time, an old getCapsAsync request times out and kills our connection.
I believe that as long as ANY request is in flight, the monitoring lock should not be released, and the host should not be declared as up.
Hi Dan,
Can I have the link to the job on jenkins so I can look at the logs
We disabled a network test that started failing after getCapsAsync was merged. Please own its re-introduction to OST: https://gerrit.ovirt.org/#/c/90264/
Its most recent failure http://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/346/ has been discussed by Alona and Piotr over IRC.
So https://bugzilla.redhat.com/1571768 was created to cover this issue, discovered during Alona's and Piotr's conversation. But after further discussion we have found out that this issue is not related to the non-blocking thread changes in engine 4.2; this behavior has existed since the beginning of vdsm-jsonrpc-java. Ravi will continue verifying the fix for BZ1571768 along with the other locking changes he already posted, to see if they will help the network OST to succeed.
But the fix for BZ1571768 is too dangerous for 4.2.3, so let's try to fix that on master and see whether it introduces any regressions. If it doesn't, then we can backport it to 4.2.4.
-- Martin Perina Associate Manager, Software Engineering Red Hat Czech s.r.o.
Posted a vdsm-jsonrpc-java patch [1] for BZ 1571768 [2] which fixes the OST issue with enabling 006_migrations.prepare_migration_attachments_ipv6.
I ran OST with the vdsm-jsonrpc-java patch [1] and the patch to add back 006_migrations.prepare_migration_attachments_ipv6 [3] and the jobs succeeded thrice [4][5][6]
[1] https://gerrit.ovirt.org/#/c/90646/ [2] https://bugzilla.redhat.com/show_bug.cgi?id=1571768 [3] https://gerrit.ovirt.org/#/c/90264/ [4] http://jenkins.ovirt.org/job/ovirt-system-tests_manual/2643/ [5] http://jenkins.ovirt.org/job/ovirt-system-tests_manual/2644/ [6] http://jenkins.ovirt.org/job/ovirt-system-tests_manual/2645/

śr., 25 kwi 2018, 19:07 użytkownik Martin Perina <mperina@redhat.com> napisał:
Ravi/Piotr, so what's the connection between non-blocking threads, the jsonrpc-java connection closing, and this network test failing? Does it mean that the non-blocking threads change just revealed a jsonrpc-java issue which we hadn't noticed before? And did the test really work with the code prior to the non-blocking threads changes, and are we missing something else?
I think that the test found something not related to non-blocking threads. This behavior has been in the code since the beginning.
On Wed, 25 Apr 2018, 18:21 Ravi Shankar Nori, <rnori@redhat.com> wrote:
On Wed, Apr 25, 2018 at 10:57 AM, Martin Perina <mperina@redhat.com> wrote:
On Tue, Apr 24, 2018 at 3:28 PM, Dan Kenigsberg <danken@redhat.com> wrote:
On Tue, Apr 24, 2018 at 4:17 PM, Ravi Shankar Nori <rnori@redhat.com> wrote:
On Tue, Apr 24, 2018 at 7:00 AM, Dan Kenigsberg <danken@redhat.com>
wrote:
Ravi's patch is in, but a similar problem remains, and the test cannot be put back into its place.
It seems that while Vdsm was taken down, a couple of getCapsAsync requests queued up. At one point, the host resumed its connection, before the requests have been cleared from the queue. After the host is up, the following tests resume, and at a pseudorandom point in time, an old getCapsAsync request times out and kills our connection.
I believe that as long as ANY request is in flight, the monitoring lock should not be released, and the host should not be declared as up.
Hi Dan,
Can I have the link to the job on jenkins so I can look at the logs
We disabled a network test that started failing after getCapsAsync was merged. Please own its re-introduction to OST: https://gerrit.ovirt.org/#/c/90264/
Its most recent failure
http://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/346/ has been discussed by Alona and Piotr over IRC.
So https://bugzilla.redhat.com/1571768 was created to cover this issue, discovered during Alona's and Piotr's conversation. But after further discussion we have found out that this issue is not related to the non-blocking thread changes in engine 4.2; this behavior has existed since the beginning of vdsm-jsonrpc-java. Ravi will continue verifying the fix for BZ1571768 along with the other locking changes he already posted, to see if they will help the network OST to succeed.
But the fix for BZ1571768 is too dangerous for 4.2.3, so let's try to fix that on master and see whether it introduces any regressions. If it doesn't, then we can backport it to 4.2.4.
-- Martin Perina Associate Manager, Software Engineering Red Hat Czech s.r.o.
Posted a vdsm-jsonrpc-java patch [1] for BZ 1571768 [2] which fixes the OST issue with enabling 006_migrations.prepare_migration_attachments_ipv6.
I ran OST with the vdsm-jsonrpc-java patch [1] and the patch to add back 006_migrations.prepare_migration_attachments_ipv6 [3] and the jobs succeeded thrice [4][5][6]
[1] https://gerrit.ovirt.org/#/c/90646/ [2] https://bugzilla.redhat.com/show_bug.cgi?id=1571768 [3] https://gerrit.ovirt.org/#/c/90264/ [4] http://jenkins.ovirt.org/job/ovirt-system-tests_manual/2643/ [5] http://jenkins.ovirt.org/job/ovirt-system-tests_manual/2644/ [6] http://jenkins.ovirt.org/job/ovirt-system-tests_manual/2645/

On Wed, Apr 25, 2018 at 8:15 PM, Piotr Kliczewski <pkliczew@redhat.com> wrote:
śr., 25 kwi 2018, 19:07 użytkownik Martin Perina <mperina@redhat.com> napisał:
Ravi/Piotr, so what's the connection between non-blocking threads, the jsonrpc-java connection closing, and this network test failing? Does it mean that the non-blocking threads change just revealed a jsonrpc-java issue which we hadn't noticed before? And did the test really work with the code prior to the non-blocking threads changes, and are we missing something else?
I think that the test found something not related to non-blocking threads. This behavior has been in the code since the beginning.
ipv6 migration test is quite old too:

commit dea714ba85d5c0b77859f504bb80840275277e68
Author: Ondřej Svoboda <osvoboda@redhat.com>
Date:   Tue Mar 14 11:21:53 2017 +0100

    Re-enable IPv6 migration.

and vdsm_recovery test is not so young:

commit 315cc100bc9bc0d81c01728a3576a10bfa576ac4
Author: Milan Zamazal <mzamazal@redhat.com>
Date:   Fri Sep 23 10:08:47 2016 +0200

    vdsm_recovery test

On Wed, Apr 25, 2018 at 7:20 PM, Ravi Shankar Nori <rnori@redhat.com> wrote:
On Wed, Apr 25, 2018 at 10:57 AM, Martin Perina <mperina@redhat.com> wrote:
On Tue, Apr 24, 2018 at 3:28 PM, Dan Kenigsberg <danken@redhat.com> wrote:
On Tue, Apr 24, 2018 at 4:17 PM, Ravi Shankar Nori <rnori@redhat.com> wrote:
On Tue, Apr 24, 2018 at 7:00 AM, Dan Kenigsberg <danken@redhat.com> wrote:
Ravi's patch is in, but a similar problem remains, and the test cannot be put back into its place.
It seems that while Vdsm was taken down, a couple of getCapsAsync requests queued up. At one point, the host resumed its connection, before the requests have been cleared from the queue. After the host is up, the following tests resume, and at a pseudorandom point in time, an old getCapsAsync request times out and kills our connection.
I believe that as long as ANY request is in flight, the monitoring lock should not be released, and the host should not be declared as up.
Hi Dan,
Can I have the link to the job on jenkins so I can look at the logs
We disabled a network test that started failing after getCapsAsync was merged. Please own its re-introduction to OST: https://gerrit.ovirt.org/#/c/90264/
Its most recent failure http://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/346/ has been discussed by Alona and Piotr over IRC.
So https://bugzilla.redhat.com/1571768 was created to cover this issue, discovered during Alona's and Piotr's conversation. But after further discussion we have found out that this issue is not related to the non-blocking thread changes in engine 4.2; this behavior has existed since the beginning of vdsm-jsonrpc-java. Ravi will continue verifying the fix for BZ1571768 along with the other locking changes he already posted, to see if they will help the network OST to succeed.
But the fix for BZ1571768 is too dangerous for 4.2.3, so let's try to fix that on master and see whether it introduces any regressions. If it doesn't, then we can backport it to 4.2.4.
-- Martin Perina Associate Manager, Software Engineering Red Hat Czech s.r.o.
Posted a vdsm-jsonrpc-java patch [1] for BZ 1571768 [2] which fixes the OST issue with enabling 006_migrations.prepare_migration_attachments_ipv6.
I ran OST with the vdsm-jsonrpc-java patch [1] and the patch to add back 006_migrations.prepare_migration_attachments_ipv6 [3] and the jobs succeeded thrice [4][5][6]
[1] https://gerrit.ovirt.org/#/c/90646/ [2] https://bugzilla.redhat.com/show_bug.cgi?id=1571768 [3] https://gerrit.ovirt.org/#/c/90264/ [4] http://jenkins.ovirt.org/job/ovirt-system-tests_manual/2643/ [5] http://jenkins.ovirt.org/job/ovirt-system-tests_manual/2644/ [6] http://jenkins.ovirt.org/job/ovirt-system-tests_manual/2645/
Eyal, Gal: would you please take the test back?
participants (14)
- Alona Kaplan
- Arik Hadas
- Barak Korren
- Dafna Ron
- Dan Kenigsberg
- Edward Haas
- Eyal Edri
- Gal Ben Haim
- Martin Perina
- Michal Skrivanek
- Milan Zamazal
- Piotr Kliczewski
- Ravi Shankar Nori
- Yaniv Kaul