[ OST Failure Report ] [ master ] [ 03.03.2017 ] [ 006_migrations host is in Connecting state ]

Hi,

Migration failed because the host is in the Connecting state. This looks like another jsonrpc-related issue ("Unrecognized message received").

Job: http://jenkins.ovirt.org/job/test-repo_ovirt_experimental_master/5654/
Logs: http://jenkins.ovirt.org/job/test-repo_ovirt_experimental_master/5654/artifact/exported-artifacts/*zip*/exported-artifacts.zip

-- Pavel

[LOGS SNIPPET]
2017-03-03 06:00:40,882-05 DEBUG [org.ovirt.vdsm.jsonrpc.client.reactors.Reactor] (SSL Stomp Reactor) [3155acab] Unable to process messages Unrecognized message received : org.ovirt.vdsm.jsonrpc.client.ClientConnectionException: Unrecognized message received
42965:2017-03-03 06:00:40,889-05 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler4) [] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM lago-basic-suite-master-host1 command SpmStatusVDS failed: Unrecognized message received
42966:2017-03-03 06:00:40,889-05 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStatusVDSCommand] (DefaultQuartzScheduler4) [] Command 'SpmStatusVDSCommand(HostName = lago-basic-suite-master-host1, SpmStatusVDSCommandParameters:{runAsync='true', hostId='bba0ec26-4856-4389-982d-2ad68cb3f682', storagePoolId='8cc227da-70e7-4557-aa01-6d8ddee6f847'})' execution failed: VDSGenericException: VDSNetworkException: Unrecognized message received
42999:2017-03-03 06:00:40,925-05 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler4) [6c7fb340] EVENT_ID: SYSTEM_CHANGE_STORAGE_POOL_STATUS_PROBLEMATIC_WITH_ERROR(987), Correlation ID: 6c7fb340, Call Stack: null, Custom Event ID: -1, Message: Invalid status on Data Center test-dc. Setting Data Center status to Non Responsive (On host lago-basic-suite-master-host1, Error: Network error during communication with the Host.).
43457:2017-03-03 06:00:44,466-05 ERROR [org.ovirt.engine.core.bll.network.host.HostValidator] (default task-17) [14be0e43-97b8-4882-bbcc-27392543fae6] Unable to setup network: operation can only be done when Host status is one of: Maintenance, Up, NonOperational; current status is Connecting
43460:2017-03-03 06:00:44,476-05 ERROR [org.ovirt.engine.api.restapi.resource.AbstractBackendResource] (default task-17) [] Operation Failed: [Cannot setup Networks. Operation can be performed only when Host status is Maintenance, Up, NonOperational.]

This one we saw already. The cause is that during network setup the engine sends messages which may fail or arrive only partially. We used to have a host-level lock to protect against this kind of situation, but it seems we do not have it anymore. The previous failure was triggered by host monitoring; this time it was SpmStatus.

On Fri, Mar 3, 2017 at 2:54 PM, Pavel Zhukov <pzhukov@redhat.com> wrote: [...]

Piotr, could you provide more information? Which setupNetworks action triggers this problem? Any idea which lock we used to take, and when we dropped it? Is it related to your https://gerrit.ovirt.org/#/q/Idaa54767bb7e54bf13e89887ca34fa8e01ade420 (jsonrpc incomplete message)?

On Fri, Mar 3, 2017 at 4:10 PM, Piotr Kliczewski <pkliczew@redhat.com> wrote: [...]

On Sun, Mar 5, 2017 at 8:29 AM, Dan Kenigsberg <danken@redhat.com> wrote:
Piotr, could you provide more information?
Which setupNetworks action triggers this problem? Any idea which lock we used to take, and when we dropped it?
I thought that this [1] would make sure that setupNetworks is an exclusive operation on a host, but that seems not to be the case. In the logs I saw the following message sent:

{"jsonrpc":"2.0","method":"Host.setupNetworks","params":{"networks":{"VLAN200_Network":{"vlan":"200","netmask":"255.255.255.0","ipv6autoconf":false,"nic":"eth0","bridged":"false","ipaddr":"192.0.3.1","dhcpv6":false,"mtu":1500,"switch":"legacy"}},"bondings":{},"options":{"connectivityTimeout":120,"connectivityCheck":"true"}},"id":"3f7f74ea-fc39-4815-831b-5e3b1c22131d"}

A few seconds later there was:

{"jsonrpc":"2.0","method":"Host.getAllVmStats","params":{},"id":"67d510eb-6dfc-4f67-97b6-a4e63c670ff2"}

and, while we were still calling pings, there was:

{"jsonrpc":"2.0","method":"StoragePool.getSpmStatus","params":{"storagepoolID":"8cc227da-70e7-4557-aa01-6d8ddee6f847"},"id":"d4d04c7c-47b8-44db-867b-770e1e19361c"}

My assumption was that those calls should not happen, and that the calls themselves, or their responses, could be corrupted. What do you think?

[1] https://github.com/oVirt/ovirt-engine/blob/master/backend/manager/modules/bll/src/main/java/org/ovirt/engine/core/bll/network/host/HostSetupNetworksCommand.java#L285
Is it related to your https://gerrit.ovirt.org/#/q/Idaa54767bb7e54bf13e89887ca34fa8e01ade420 (jsonrpc incomplete message)?
Seems not to be related.
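The lock referenced in [1] is meant to make setupNetworks exclusive on the host. As a minimal sketch of the idea under discussion (class and method names here are hypothetical; this is not the engine's actual EngineLock mechanism), a per-host read/write lock only prevents the overlap seen above if every outgoing RPC, including the monitoring calls, goes through the same gate:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.function.Supplier;

// Simplified sketch: one ReadWriteLock per host id. A command like
// setupNetworks would take the write (exclusive) lock; ordinary RPCs
// (getAllVmStats, getSpmStatus, ...) would take the read lock, so they
// could not overlap a network reconfiguration. Hypothetical, for illustration.
public class HostRpcGate {
    private final Map<String, ReadWriteLock> locks = new ConcurrentHashMap<>();

    private ReadWriteLock lockFor(String hostId) {
        return locks.computeIfAbsent(hostId, id -> new ReentrantReadWriteLock(true));
    }

    /** Run an operation that must be exclusive on the host, e.g. setupNetworks. */
    public <T> T exclusive(String hostId, Supplier<T> op) {
        lockFor(hostId).writeLock().lock();
        try {
            return op.get();
        } finally {
            lockFor(hostId).writeLock().unlock();
        }
    }

    /** Run an ordinary RPC that may share the host with other ordinary RPCs. */
    public <T> T shared(String hostId, Supplier<T> op) {
        lockFor(hostId).readLock().lock();
        try {
            return op.get();
        } finally {
            lockFor(hostId).readLock().unlock();
        }
    }
}

In the failing run, the Host.getAllVmStats and StoragePool.getSpmStatus requests quoted above evidently did not pass through any such gate.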

On Sun, Mar 5, 2017 at 9:50 PM, Piotr Kliczewski <pkliczew@redhat.com> wrote: [...]
I suspect that getVmStats and getSpmStatus simply do not take the host-monitoring lock, and I don't see anything wrong in that.

Note that during 006_migrations we set up only a mere migration network, not the management network. This operation should not interfere with Engine-Vdsm communication in any way; I don't yet understand why you suspect that it does.
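Note that the failing SpmStatusVDS call in the log snippet runs on a DefaultQuartzScheduler4 thread, i.e. a periodically scheduled monitoring job. A minimal sketch of why such pollers keep talking to the host while setupNetworks is in flight (using a plain ScheduledExecutorService rather than the engine's Quartz setup; the printed method names mirror the requests seen in the logs):

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration: monitoring pollers run on their own schedule
// and know nothing about commands executing concurrently. Unless they
// acquire the same per-host lock as setupNetworks, getAllVmStats and
// getSpmStatus go out mid-reconfiguration, exactly as seen in the logs.
public class MonitoringSketch {
    public static void main(String[] args) throws InterruptedException {
        ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(2);

        // Fire periodically, regardless of what other commands are doing.
        scheduler.scheduleAtFixedRate(
                () -> System.out.println("-> Host.getAllVmStats"), 0, 3, TimeUnit.SECONDS);
        scheduler.scheduleAtFixedRate(
                () -> System.out.println("-> StoragePool.getSpmStatus"), 0, 5, TimeUnit.SECONDS);

        // Meanwhile, a long-running setupNetworks on another thread...
        System.out.println("-> Host.setupNetworks (with a 120s connectivity check)");
        Thread.sleep(10_000); // the pollers keep firing during this window

        scheduler.shutdownNow();
    }
}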

On Mon, Mar 6, 2017 at 8:23 AM, Dan Kenigsberg <danken@redhat.com> wrote: [...]
My assumption here is based on having seen this failure twice, both times during setupNetworks. The pattern is that the call that fails is always one which "should not" occur during such an operation.

On Mon, Mar 6, 2017 at 10:11 AM, Piotr Kliczewski <pkliczew@redhat.com> wrote: [...]
It is fair to suspect an interaction with setupNetworks, but let us put some substance into it. What is the mode of failure of the other command?

On Mon, Mar 6, 2017 at 9:46 AM, Dan Kenigsberg <danken@redhat.com> wrote: [...]
I am not sure what you mean. Can you please explain?

On Mon, Mar 6, 2017 at 10:10 AM, Piotr Kliczewski <pkliczew@redhat.com> wrote: [...]
Now I understand the question (explanation offline). The reason the parse method fails is that we have a heartbeat frame glued together with a (partial) response we should not be getting. During setupNetworks we ignore heartbeats, but that doesn't mean we do not receive them. It seems that vdsm sends a heartbeat assuming there was no interaction, when actually there was. We may want to fix this on the vdsm side. On the other hand, we need to fix setupNetworks locking on the engine side: we should either not lock at all, or make sure the lock is taken for all possible interactions.
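To make the failure mode concrete: in STOMP 1.2, a heart-beat is a bare EOL byte sent between frames, while a real frame is terminated by a NUL byte. A parser that expects the buffer to start with a frame command can reject input ("Unrecognized message received") when a heart-beat arrives glued to the front of a partial frame. Below is a minimal, hypothetical sketch of a heartbeat-tolerant frame splitter (this is not the actual vdsm-jsonrpc-java reactor code): it skips leading EOLs and buffers partial frames instead of failing.

import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of a heartbeat-tolerant STOMP frame splitter.
// STOMP 1.2: heart-beats are single EOL bytes between frames; a frame is
// "COMMAND\nheaders\n\nbody\0". Illustrative only.
public class StompDecoder {
    private final StringBuilder buffer = new StringBuilder();

    /** Feed raw bytes; return complete frames, keep partial data buffered. */
    public List<String> feed(byte[] data) {
        buffer.append(new String(data, StandardCharsets.UTF_8));
        List<String> frames = new ArrayList<>();
        while (true) {
            // Skip heart-beat EOLs that may be glued in front of the next frame.
            while (buffer.length() > 0 && (buffer.charAt(0) == '\n' || buffer.charAt(0) == '\r')) {
                buffer.deleteCharAt(0);
            }
            int nul = buffer.indexOf("\0");
            if (nul < 0) {
                return frames; // partial frame: wait for more bytes, do NOT fail
            }
            frames.add(buffer.substring(0, nul));
            buffer.delete(0, nul + 1);
        }
    }

    public static void main(String[] args) {
        StompDecoder decoder = new StompDecoder();
        // Heart-beat glued to the start of a partial response, as in this failure:
        System.out.println(decoder.feed(
                "\nMESSAGE\ncontent-type:application/json\n\n{\"jsonrpc\"".getBytes(StandardCharsets.UTF_8)));
        // The rest of the frame arrives later; only now is a full frame emitted.
        System.out.println(decoder.feed(":\"2.0\"}\0".getBytes(StandardCharsets.UTF_8)));
    }
}

Run on glued input like the above, the first feed yields no frame and no error; the frame is emitted only once the terminating NUL arrives.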
participants (3)
- Dan Kenigsberg
- Pavel Zhukov
- Piotr Kliczewski