<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On Mon, Mar 6, 2017 at 10:10 AM, Piotr Kliczewski <span dir="ltr"><<a href="mailto:pkliczew@redhat.com" target="_blank">pkliczew@redhat.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div><div class="h5"><br><div class="gmail_extra"><br><div class="gmail_quote">On Mon, Mar 6, 2017 at 9:46 AM, Dan Kenigsberg <span dir="ltr"><<a href="mailto:danken@redhat.com" target="_blank">danken@redhat.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="m_-3871231027632991197HOEnZb"><div class="m_-3871231027632991197h5">On Mon, Mar 6, 2017 at 10:11 AM, Piotr Kliczewski <<a href="mailto:pkliczew@redhat.com" target="_blank">pkliczew@redhat.com</a>> wrote:<br>
><br>
><br>
> On Mon, Mar 6, 2017 at 8:23 AM, Dan Kenigsberg <<a href="mailto:danken@redhat.com" target="_blank">danken@redhat.com</a>> wrote:<br>
>><br>
>> On Sun, Mar 5, 2017 at 9:50 PM, Piotr Kliczewski <<a href="mailto:pkliczew@redhat.com" target="_blank">pkliczew@redhat.com</a>><br>
>> wrote:<br>
>> ><br>
>> ><br>
>> > On Sun, Mar 5, 2017 at 8:29 AM, Dan Kenigsberg <<a href="mailto:danken@redhat.com" target="_blank">danken@redhat.com</a>><br>
>> > wrote:<br>
>> >><br>
>> >> Piotr, could you provide more information?<br>
>> >><br>
>> >> Which setupNetworks action triggers this problem? Any idea which lock<br>
>> >> we used to take, and when we dropped it?<br>
>> ><br>
>> ><br>
>> > I thought that this [1] would make sure that setupNetworks is an exclusive<br>
>> > operation on a host, which seems not to be the case.<br>
>> > In the logs I saw the following message sent:<br>
>> ><br>
>> ><br>
>> > {"jsonrpc":"2.0","method":"Host.setupNetworks","params":{"networks":{"VLAN200_Network":{"vlan":"200","netmask":"255.255.255.0","ipv6autoconf":false,"nic":"eth0","bridged":"false","ipaddr":"192.0.3.1","dhcpv6":false,"mtu":1500,"switch":"legacy"}},"bondings":{},"options":{"connectivityTimeout":120,"connectivityCheck":"true"}},"id":"3f7f74ea-fc39-4815-831b-5e3b1c22131d"}<br>
>> ><br>
>> > A few seconds later there was:<br>
>> ><br>
>> ><br>
>> > {"jsonrpc":"2.0","method":"Host.getAllVmStats","params":{},"id":"67d510eb-6dfc-4f67-97b6-a4e63c670ff2"}<br>
>> ><br>
>> > and still while we were calling pings there was:<br>
>> ><br>
>> ><br>
>> > {"jsonrpc":"2.0","method":"StoragePool.getSpmStatus","params":{"storagepoolID":"8cc227da-70e7-4557-aa01-6d8ddee6f847"},"id":"d4d04c7c-47b8-44db-867b-770e1e19361c"}<br>
>> ><br>
>> > My assumption was that those calls should not happen, and that the calls<br>
>> > themselves or their responses could be corrupted.<br>
>> > What do you think?<br>
>> ><br>
>> > [1]<br>
>> ><br>
>> > <a href="https://github.com/oVirt/ovirt-engine/blob/master/backend/manager/modules/bll/src/main/java/org/ovirt/engine/core/bll/network/host/HostSetupNetworksCommand.java#L285" rel="noreferrer" target="_blank">https://github.com/oVirt/ovirt-engine/blob/master/backend/manager/modules/bll/src/main/java/org/ovirt/engine/core/bll/network/host/HostSetupNetworksCommand.java#L285</a><br>
>><br>
>> I suspect that getAllVmStats and getSpmStatus simply do not take the<br>
>> hostmonitoring lock, and I don't see anything wrong with that.<br>
>><br>
>> Note that during 006_migration, we set up only a migration network,<br>
>> not the management network. This operation should not interfere with<br>
>> Engine-Vdsm communication in any way; I don't yet understand why you<br>
>> suspect that it does.<br>
><br>
><br>
> My suspicion comes from seeing this failure twice, both times during<br>
> setupNetworks.<br>
> The pattern is that the failing call is always one which "should not" occur<br>
> during such an operation.<br>
><br>
<br>
</div></div>It is fair to suspect an interaction with setupNetworks, but let us<br>
put some substance into it.<br>
What is the mode of failure of the other command?<br>
</blockquote></div><br></div></div></div><div class="gmail_extra">I am not sure what you mean. Can you please explain?<br></div><div class="gmail_extra"><br></div></div>
</blockquote></div><br></div><div class="gmail_extra">Now I understand the question (it was explained offline). The parse method fails because a heartbeat frame arrives glued together<br></div><div class="gmail_extra">with a (partial) response that we should not be getting. <br><br></div><div class="gmail_extra">During setupNetworks we ignore heartbeats, but that does not mean we do not receive them.<br></div><div class="gmail_extra">It seems that vdsm sends a heartbeat on the assumption that there was no interaction, when actually there was.<br></div><div class="gmail_extra">We may want to fix this on the vdsm side.<br><br></div><div class="gmail_extra">On the other hand, we need to fix setupNetworks locking on the engine side. We should either not lock at all or<br></div><div class="gmail_extra">make sure the lock is taken for all possible interactions.<br></div><div class="gmail_extra"><br></div></div>
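<div class="gmail_extra">To make the failure mode concrete, here is a minimal sketch of a receiver that tolerates a heartbeat glued to a partial response. It is only an illustration, not the engine or vdsm code. It assumes STOMP-style framing, where a heartbeat is a single newline byte and every frame ends with a NUL byte; the names FrameSplitter and feed are hypothetical.<br></div>

```python
from typing import List

# Assumption (not taken from the thread): STOMP-style framing, where a
# heartbeat is one newline byte and each frame is terminated by a NUL byte.
HEARTBEAT = b"\n"
FRAME_END = b"\x00"


class FrameSplitter:
    """Hypothetical helper that separates heartbeats from frames in a byte stream."""

    def __init__(self) -> None:
        self._buf = bytearray()
        self.heartbeats = 0  # heartbeats seen so far, counted even if ignored

    def feed(self, data: bytes) -> List[bytes]:
        """Add received bytes and return every complete frame found so far."""
        self._buf.extend(data)
        frames = []
        while True:
            # Heartbeats may precede a frame; consume them instead of
            # handing them to the response parser.
            while self._buf.startswith(HEARTBEAT):
                del self._buf[:1]
                self.heartbeats += 1
            end = self._buf.find(FRAME_END)
            if end < 0:
                break  # partial frame: keep it buffered until more data arrives
            frames.append(bytes(self._buf[:end]))
            del self._buf[:end + 1]
        return frames


splitter = FrameSplitter()
# A heartbeat glued to the first half of a response yields no frame yet.
first = splitter.feed(b'\nMESSAGE\n\n{"id":')
# The rest of the response completes the frame; a trailing heartbeat follows.
second = splitter.feed(b'1}\x00\n')
```

<div class="gmail_extra">With this approach a heartbeat is never passed to the response parser, and a partial response stays buffered until its terminator arrives.<br></div>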