[ovirt-devel] [ OST Failure Report ] [ master ] [ 03.03.2017 ] [006_migrations host is in Connecting state]

Piotr Kliczewski pkliczew at redhat.com
Sun Mar 5 19:50:49 UTC 2017


On Sun, Mar 5, 2017 at 8:29 AM, Dan Kenigsberg <danken at redhat.com> wrote:

> Piotr, could you provide more information?
>
> Which setupNetworks action triggers this problem? Any idea which lock
> we used to take and when we dropped it?
>

I thought that this [1] would make sure that setupNetworks is an exclusive
operation on a host, which seems not to be the case.
In the logs I saw the following message sent:

{"jsonrpc":"2.0","method":"Host.setupNetworks","params":{"networks":{"VLAN200_Network":{"vlan":"200","netmask":"255.255.255.0","ipv6autoconf":false,"nic":"eth0","bridged":"false","ipaddr":"192.0.3.1","dhcpv6":false,"mtu":1500,"switch":"legacy"}},"bondings":{},"options":{"connectivityTimeout":120,"connectivityCheck":"true"}},"id":"3f7f74ea-fc39-4815-831b-5e3b1c22131d"}

A few seconds later there was:

{"jsonrpc":"2.0","method":"Host.getAllVmStats","params":{},"id":"67d510eb-6dfc-4f67-97b6-a4e63c670ff2"}

and, while we were still calling pings, there was:

{"jsonrpc":"2.0","method":"StoragePool.getSpmStatus","params":{"storagepoolID":"8cc227da-70e7-4557-aa01-6d8ddee6f847"},"id":"d4d04c7c-47b8-44db-867b-770e1e19361c"}

My assumption was that those calls should not happen, and that the calls
themselves or their responses could be corrupted.
What do you think?
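
To make the overlap easier to spot in other runs, here is a minimal sketch
that walks an ordered list of JSON-RPC messages and reports every request
sent while Host.setupNetworks is still unanswered. The function name, the
abbreviated params and the final response line are assumptions made for
illustration; they are not taken from the engine or vdsm code.

```python
import json


def calls_during_setup_networks(lines):
    """Return the methods of requests sent while Host.setupNetworks is pending.

    `lines` is an ordered iterable of JSON-RPC messages (one JSON document
    per string), in the order they were sent or received.
    """
    pending_id = None  # id of the setupNetworks request still awaiting a reply
    overlapping = []
    for line in lines:
        msg = json.loads(line)
        method = msg.get("method")
        if method == "Host.setupNetworks":
            pending_id = msg["id"]
        elif method is None:
            # A message without "method" is a response; it closes the window
            # only if it answers the pending setupNetworks request.
            if pending_id is not None and msg.get("id") == pending_id:
                pending_id = None
        elif pending_id is not None:
            overlapping.append(method)
    return overlapping


# Requests from the log above with params abbreviated; the final response
# line is hypothetical and only marks where the window would close.
sample = [
    '{"jsonrpc":"2.0","method":"Host.setupNetworks","params":{},"id":"3f7f74ea-fc39-4815-831b-5e3b1c22131d"}',
    '{"jsonrpc":"2.0","method":"Host.getAllVmStats","params":{},"id":"67d510eb-6dfc-4f67-97b6-a4e63c670ff2"}',
    '{"jsonrpc":"2.0","method":"StoragePool.getSpmStatus","params":{},"id":"d4d04c7c-47b8-44db-867b-770e1e19361c"}',
    '{"jsonrpc":"2.0","result":{},"id":"3f7f74ea-fc39-4815-831b-5e3b1c22131d"}',
]

print(calls_during_setup_networks(sample))
# ['Host.getAllVmStats', 'StoragePool.getSpmStatus']
```

If a host-level lock were effective, this list would be expected to come
back empty for the window in which setupNetworks is running.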

[1]
https://github.com/oVirt/ovirt-engine/blob/master/backend/manager/modules/bll/src/main/java/org/ovirt/engine/core/bll/network/host/HostSetupNetworksCommand.java#L285

>
> Is it related to your
> https://gerrit.ovirt.org/#/q/Idaa54767bb7e54bf13e89887ca34fa8e01ade420
> (jsonrpc incomplete message)?
>
>
Seems not to be related.


>
> On Fri, Mar 3, 2017 at 4:10 PM, Piotr Kliczewski <pkliczew at redhat.com>
> wrote:
> > We saw this one already. The cause is that during network setup the engine
> > sends messages which may fail or arrive only partially.
> > We used to have a host-level lock to protect against this kind of
> > situation, but it seems like we do not have it anymore.
> >
> > The previous failure was triggered by host monitoring; now it was SpmStatus.
> >
> > On Fri, Mar 3, 2017 at 2:54 PM, Pavel Zhukov <pzhukov at redhat.com> wrote:
> >>
> >>
> >> Hi,
> >>
> >> Migration failed because the host is in Connecting state.
> >> Seems like another jsonrpc-related issue (Unrecognized message received).
> >>
> >> Job:
> >> http://jenkins.ovirt.org/job/test-repo_ovirt_experimental_master/5654/
> >> Logs:
> >> http://jenkins.ovirt.org/job/test-repo_ovirt_experimental_master/5654/artifact/exported-artifacts/*zip*/exported-artifacts.zip
> >>
> >> --
> >> Pavel
> >>
> >> [LOGS SNIPPET]
> >> 2017-03-03 06:00:40,882-05 DEBUG
> >> [org.ovirt.vdsm.jsonrpc.client.reactors.Reactor] (SSL Stomp Reactor)
> >> [3155acab] Unable to process messages Unrecognized message received :
> >> org.ovirt.vdsm.jsonrpc.client.ClientConnectionException: Unrecognized
> >> message received
> >>   42965:2017-03-03 06:00:40,889-05 ERROR
> >> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
> >> (DefaultQuartzScheduler4) [] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,
> >> 802), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message:
> >> VDSM lago-basic-suite-master-host1 command SpmStatusVDS failed: Unrecognized
> >> message received
> >>   42966:2017-03-03 06:00:40,889-05 ERROR
> >> [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStatusVDSCommand]
> >> (DefaultQuartzScheduler4) [] Command 'SpmStatusVDSCommand(HostName =
> >> lago-basic-suite-master-host1,
> >> SpmStatusVDSCommandParameters:{runAsync='true',
> >> hostId='bba0ec26-4856-4389-982d-2ad68cb3f682',
> >> storagePoolId='8cc227da-70e7-4557-aa01-6d8ddee6f847'})' execution failed:
> >> VDSGenericException: VDSNetworkException: Unrecognized message received
> >>   42999:2017-03-03 06:00:40,925-05 WARN
> >> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
> >> (DefaultQuartzScheduler4) [6c7fb340] EVENT_ID:
> >> SYSTEM_CHANGE_STORAGE_POOL_STATUS_PROBLEMATIC_WITH_ERROR(987), Correlation
> >> ID: 6c7fb340, Call Stack: null, Custom Event ID: -1, Message: Invalid status
> >> on Data Center test-dc. Setting Data Center status to Non Responsive (On
> >> host lago-basic-suite-master-host1, Error: Network error during
> >> communication with the Host.).
> >>   43457:2017-03-03 06:00:44,466-05 ERROR
> >> [org.ovirt.engine.core.bll.network.host.HostValidator] (default task-17)
> >> [14be0e43-97b8-4882-bbcc-27392543fae6] Unable to setup network: operation
> >> can only be done when Host status is one of: Maintenance, Up,
> >> NonOperational; current status is Connecting
> >>   43460:2017-03-03 06:00:44,476-05 ERROR
> >> [org.ovirt.engine.api.restapi.resource.AbstractBackendResource] (default
> >> task-17) [] Operation Failed: [Cannot setup Networks. Operation can be
> >> performed only when Host status is  Maintenance, Up, NonOperational.]
> >
> >
> >