On Mon, Mar 6, 2017 at 10:10 AM, Piotr Kliczewski <pkliczew@redhat.com> wrote:


On Mon, Mar 6, 2017 at 9:46 AM, Dan Kenigsberg <danken@redhat.com> wrote:
On Mon, Mar 6, 2017 at 10:11 AM, Piotr Kliczewski <pkliczew@redhat.com> wrote:
>
>
> On Mon, Mar 6, 2017 at 8:23 AM, Dan Kenigsberg <danken@redhat.com> wrote:
>>
>> On Sun, Mar 5, 2017 at 9:50 PM, Piotr Kliczewski <pkliczew@redhat.com>
>> wrote:
>> >
>> >
>> > On Sun, Mar 5, 2017 at 8:29 AM, Dan Kenigsberg <danken@redhat.com>
>> > wrote:
>> >>
>> >> Piotr, could you provide more information?
>> >>
>> >> Which setupNetworks action triggers this problem? Any idea which lock
>> >> we took, and when we dropped it?
>> >
>> >
>> > I thought that this [1] would make sure that setupNetworks is an exclusive
>> > operation on a host, which seems not to be the case.
>> > In the logs I saw the following message sent:
>> >
>> >
>> > {"jsonrpc":"2.0","method":"Host.setupNetworks","params":{"networks":{"VLAN200_Network":{"vlan":"200","netmask":"255.255.255.0","ipv6autoconf":false,"nic":"eth0","bridged":"false","ipaddr":"192.0.3.1","dhcpv6":false,"mtu":1500,"switch":"legacy"}},"bondings":{},"options":{"connectivityTimeout":120,"connectivityCheck":"true"}},"id":"3f7f74ea-fc39-4815-831b-5e3b1c22131d"}
>> >
>> > A few seconds later there was:
>> >
>> >
>> > {"jsonrpc":"2.0","method":"Host.getAllVmStats","params":{},"id":"67d510eb-6dfc-4f67-97b6-a4e63c670ff2"}
>> >
>> > and, while we were still sending pings, there was:
>> >
>> >
>> > {"jsonrpc":"2.0","method":"StoragePool.getSpmStatus","params":{"storagepoolID":"8cc227da-70e7-4557-aa01-6d8ddee6f847"},"id":"d4d04c7c-47b8-44db-867b-770e1e19361c"}
>> >
>> > My assumption was that those calls should not happen, and that the calls
>> > themselves, or their responses, could be corrupted.
>> > What do you think?
>> >
>> > [1]
>> >
>> > https://github.com/oVirt/ovirt-engine/blob/master/backend/manager/modules/bll/src/main/java/org/ovirt/engine/core/bll/network/host/HostSetupNetworksCommand.java#L285
>>
>> I suspect that getVmStats and getSpmStatus simply do not take the
>> hostmonitoring lock, and I don't see anything wrong with that.
>>
>> Note that during 006_migration, we set only a mere migration network,
>> not the management network. This operation should not interfere with
>> Engine-Vdsm communication in any way; I don't yet understand why you
>> suspect that it does.
>
>
> My assumption here comes from having seen this failure twice, both times
> during setupNetworks.
> The pattern is that the call that fails is always one which "should not"
> occur during such an operation.
>

It is fair to suspect an interaction with setupNetworks, but let us
put some substance into it.
What is the mode of failure of the other command?

I am not sure what you mean. Can you please explain?


Now I understand the question (explanation offline). The reason the parse method fails is that we have a heartbeat frame glued together
with a (partial) response that we should not be getting.
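
For illustration only, assuming the STOMP transport used between engine and vdsm, where a heart-beat is a bare EOL byte: the class and method names below are made up, and this is not the vdsm-jsonrpc-java decoder. The sketch shows how a heart-beat glued to the start of a (partial) frame can defeat a parser that expects the buffer to begin with a frame command, and how skipping leading EOLs sidesteps that.

    // Illustrative sketch only -- not the real decoder.
    public class FrameSplitSketch {

        // Drop leading heart-beat EOLs before looking for a frame command.
        static String stripHeartbeats(String buffer) {
            int i = 0;
            while (i < buffer.length()
                    && (buffer.charAt(i) == '\n' || buffer.charAt(i) == '\r')) {
                i++;
            }
            return buffer.substring(i);
        }

        public static void main(String[] args) {
            // A heart-beat (single LF) glued to the start of a partial MESSAGE frame.
            String buffer =
                    "\nMESSAGE\ncontent-type:application/json\n\n{\"jsonrpc\":\"2.0\",";

            // Naive handling: the buffer does not start with a command, so the
            // whole chunk is rejected even though the data itself is fine.
            System.out.println(buffer.startsWith("MESSAGE"));                    // false

            // Tolerant handling: strip heart-beat bytes first, then keep buffering
            // until the frame terminator (NUL) arrives before parsing the body.
            System.out.println(stripHeartbeats(buffer).startsWith("MESSAGE"));   // true
        }
    }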

During setupNetworks we ignore heartbeats, but that doesn't mean we do not receive them.
It seems that vdsm sends a heartbeat assuming that there was no interaction, when in fact there was.
We may want to fix this on the vdsm side.
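
A minimal sketch of that fix, in Java purely for illustration (vdsm itself is Python, and the names here are invented): the heart-beat timer fires only when nothing else has been written within the negotiated interval, so real responses and heart-beats do not get interleaved on an active line.

    import java.util.concurrent.atomic.AtomicLong;

    public class HeartbeatGate {
        private final long intervalMillis;
        private final AtomicLong lastSentMillis = new AtomicLong(System.currentTimeMillis());

        public HeartbeatGate(long intervalMillis) {
            this.intervalMillis = intervalMillis;
        }

        // Call whenever a real frame (response, event, ...) is written to the socket.
        public void onFrameSent() {
            lastSentMillis.set(System.currentTimeMillis());
        }

        // Called by the heart-beat timer; send only if the line has been idle.
        public boolean shouldSendHeartbeat() {
            return System.currentTimeMillis() - lastSentMillis.get() >= intervalMillis;
        }
    }

With something like this on the sending side, the situation above (a heart-beat emitted while a response was already in flight) should not arise.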

On the other hand, we need to fix setupNetworks locking on the engine side. We should either not lock at all, or
make sure the lock is taken for all possible interactions.
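
A rough sketch of the second option, with invented names (this is not the ovirt-engine locking API): route every engine-to-vdsm call for a given host through the same per-host lock, so setupNetworks cannot overlap with monitoring calls such as Host.getAllVmStats or StoragePool.getSpmStatus.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.locks.ReentrantLock;
    import java.util.function.Supplier;

    public class PerHostCallSerializer {
        private final Map<String, ReentrantLock> hostLocks = new ConcurrentHashMap<>();

        // All verbs for a host share one lock, so setupNetworks and the
        // monitoring verbs are serialized against each other.
        public <T> T call(String hostId, Supplier<T> rpc) {
            ReentrantLock lock = hostLocks.computeIfAbsent(hostId, id -> new ReentrantLock());
            lock.lock();
            try {
                return rpc.get();
            } finally {
                lock.unlock();
            }
        }
    }

The trade-off is that monitoring stalls for the duration of setupNetworks; whether that is acceptable is exactly the "should not lock at all" side of the question.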