[ovirt-devel] [ OST Failure Report ] [ master ] [ 03.03.2017 ] [006_migrations host is in Connecting state]

Piotr Kliczewski pkliczew at redhat.com
Mon Mar 6 12:49:07 UTC 2017


On Mon, Mar 6, 2017 at 10:10 AM, Piotr Kliczewski <pkliczew at redhat.com>
wrote:

>
>
> On Mon, Mar 6, 2017 at 9:46 AM, Dan Kenigsberg <danken at redhat.com> wrote:
>
>> On Mon, Mar 6, 2017 at 10:11 AM, Piotr Kliczewski <pkliczew at redhat.com>
>> wrote:
>> >
>> >
>> > On Mon, Mar 6, 2017 at 8:23 AM, Dan Kenigsberg <danken at redhat.com>
>> > wrote:
>> >>
>> >> On Sun, Mar 5, 2017 at 9:50 PM, Piotr Kliczewski <pkliczew at redhat.com>
>> >> wrote:
>> >> >
>> >> >
>> >> > On Sun, Mar 5, 2017 at 8:29 AM, Dan Kenigsberg <danken at redhat.com>
>> >> > wrote:
>> >> >>
>> >> >> Piotr, could you provide more information?
>> >> >>
>> >> >> Which setupNetworks action triggers this problem? Any idea which
>> >> >> lock we took, and when we dropped it?
>> >> >
>> >> >
>> >> > I thought that this [1] would make sure that setupNetworks is an
>> >> > exclusive operation on a host, which seems not to be the case.
>> >> > In the logs I saw the following message sent:
>> >> >
>> >> >
>> >> > {"jsonrpc":"2.0","method":"Host.setupNetworks","params":{"
>> networks":{"VLAN200_Network":{"vlan":"200","netmask":"255.25
>> 5.255.0","ipv6autoconf":false,"nic":"eth0","bridged":"false"
>> ,"ipaddr":"192.0.3.1","dhcpv6":false,"mtu":1500,"switch":"
>> legacy"}},"bondings":{},"options":{"connectivityTimeout
>> ":120,"connectivityCheck":"true"}},"id":"3f7f74ea-fc39-
>> 4815-831b-5e3b1c22131d"}
>> >> >
>> >> > A few seconds later there was:
>> >> >
>> >> >
>> >> > {"jsonrpc":"2.0","method":"Host.getAllVmStats","params":{},"
>> id":"67d510eb-6dfc-4f67-97b6-a4e63c670ff2"}
>> >> >
>> >> > and, while we were still calling pings, there was:
>> >> >
>> >> >
>> >> > {"jsonrpc":"2.0","method":"StoragePool.getSpmStatus","params
>> ":{"storagepoolID":"8cc227da-70e7-4557-aa01-6d8ddee6f847"},
>> "id":"d4d04c7c-47b8-44db-867b-770e1e19361c"}
>> >> >
>> >> > My assumption was that those calls should not happen, and that the
>> >> > calls themselves, or their responses, could be corrupted.
>> >> > What do you think?
>> >> >
>> >> > [1]
>> >> >
>> >> > https://github.com/oVirt/ovirt-engine/blob/master/backend/manager/modules/bll/src/main/java/org/ovirt/engine/core/bll/network/host/HostSetupNetworksCommand.java#L285
>> >>
>> >> I suspect that getVmStats and getSpmStatus simply do not take the
>> >> hostmonitoring lock, and I don't see anything wrong in that.
>> >>
>> >> Note that during 006_migration, we set only a mere migration network,
>> >> not the management network. This operation should not interfere with
>> >> Engine-Vdsm communication in any way; I don't yet understand why you
>> >> suspect that it does.
>> >
>> >
>> > My assumption here is based on the fact that I saw this failure twice,
>> > and both times it was during setupNetworks.
>> > The pattern is that a call which "should not" occur during such an
>> > operation always fails.
>> >
>>
>> It is fair to suspect an interaction with setupNetworks, but let us
>> put some substance into it.
>> What is the mode of failure of the other command?
>>
>
> I am not sure what you mean. Can you please explain?
>
>
Now I understand the question (it was explained offline). The reason the
parse method fails is that we have a heartbeat frame glued together with a
(partial) response that we should not get.
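
For illustration, here is a minimal Python sketch of that failure mode
(this is not vdsm's or the engine's actual parser; the buffer contents and
names are made up): a STOMP heart-beat is just a bare EOL byte, and when it
arrives glued to the front of a partial frame, a parser that treats the
whole buffer as one frame chokes, while skipping bare EOLs first does not:

    # Hypothetical receive buffer, for illustration only: a STOMP
    # heart-beat (a bare EOL byte) glued to a partial MESSAGE frame.
    buf = b'\n' + b'MESSAGE\ncontent-type:application/json\n\n{"jsonrpc": "2.0", "result":'

    def strip_heartbeats(data):
        # Per the STOMP spec, heart-beats are single EOL bytes that may
        # appear between frames; drop them before looking for a frame.
        return data.lstrip(b'\r\n')

    # After stripping the heart-beat, the (still partial) frame can simply
    # be buffered until the rest of it arrives.
    assert strip_heartbeats(buf).startswith(b'MESSAGE')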

During setupNetworks we ignore heartbeats, but that does not mean we do
not receive them.
It seems that vdsm sends a heartbeat assuming that there was no
interaction, when in fact there was.
We may want to fix this on the vdsm side.
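
A rough sketch, in Python, of that vdsm-side fix (the class and names are
illustrative, not vdsm's actual API): treat every outgoing frame as proof
of liveness and only emit a heart-beat when the connection has really been
idle:

    import time

    HEARTBEAT_INTERVAL = 5.0  # seconds; value chosen only for illustration

    class HeartbeatSender(object):
        def __init__(self, sock):
            self._sock = sock
            self._last_sent = time.monotonic()

        def send_frame(self, data):
            # Any outgoing frame already tells the peer we are alive.
            self._sock.sendall(data)
            self._last_sent = time.monotonic()

        def maybe_send_heartbeat(self):
            # Called periodically by a timer loop (not shown): emit the
            # bare-EOL heart-beat only if nothing was written recently.
            if time.monotonic() - self._last_sent >= HEARTBEAT_INTERVAL:
                self._sock.sendall(b'\n')
                self._last_sent = time.monotonic()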

On the other hand, we need to fix setupNetworks locking on the engine side.
We should either not lock at all, or make sure the lock is taken for all
possible interactions.
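
The engine code is Java and has its own locking infrastructure, but as a
language-neutral sketch of the second option (all names below are made up):
both setupNetworks and the monitoring-style verbs take the same per-host
lock, so a monitoring call cannot be sent while setupNetworks is running:

    import threading
    from contextlib import contextmanager

    # Made-up per-host lock registry, only to illustrate "the lock is
    # taken for all possible interactions" with a given host.
    _host_locks = {}
    _registry_lock = threading.Lock()

    @contextmanager
    def host_lock(host_id):
        with _registry_lock:
            lock = _host_locks.setdefault(host_id, threading.Lock())
        with lock:
            yield

    def setup_networks(host_id, params):
        with host_lock(host_id):
            pass  # send Host.setupNetworks and run the connectivity check

    def get_all_vm_stats(host_id):
        with host_lock(host_id):
            pass  # monitoring call waits while setupNetworks holds the lock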