<div dir="ltr">Great root cause analysis!<div>Maybe we should add something like 'outage' guidelines to the wiki or readthedocs for any infra member who is about to do something </div><div>that might affect the DC?</div><div><br></div><div>Probably an email to the infra/devel list should be OK, or even emailing <a href="mailto:infra-support@ovirt.org">infra-support@ovirt.org</a> so there will be an open ticket.</div><div><br></div><div>Thoughts?</div><div><br></div><div>E.</div></div><div class="gmail_extra"><br><div class="gmail_quote">On Tue, Feb 9, 2016 at 4:18 PM, David Caro <span dir="ltr"><<a href="mailto:dcaro@redhat.com" target="_blank">dcaro@redhat.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">On 02/08 20:28, David Caro wrote:<br>
><br>
> Hi everyone!<br>
><br>
> There has been a storage outage today; it started around 17:30 CEST and lasted<br>
> until ~20:15. All the services are back up and running now, but a bunch of<br>
> Jenkins jobs failed due to the outage (all the slaves use that storage),<br>
> so you might see some false positives in your CI runs. To retrigger, you can use<br>
> this job:<br>
><br>
> <a href="http://jenkins.ovirt.org/gerrit_manual_trigger/" rel="noreferrer" target="_blank">http://jenkins.ovirt.org/gerrit_manual_trigger/</a><br>
><br>
> And/or submit a new patchset (rebasing should work). In any case, if you have<br>
> any issues or doubts, please respond to this email or ping me (dcaro/dcaroest)<br>
> on IRC.<br>
><br>
> Sorry for the inconvenience; we are gathering logs to find out what happened and<br>
> prevent it from happening in the future.<br>
<br>
<br>
</span>So the source of the issue has been tracked down: an uncoordinated intervention<br>
changed the LACP settings on the switches for all the hosts, which caused a<br>
global network outage (all the hosts were affected). That in turn froze the<br>
clustering, since neither node could reach the other over the network and both<br>
went down.<br>
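For future reference, a quick host-side sanity check after a switch-side LACP change is to read the Linux bonding status under /proc. This is only a sketch; the `bond0` interface name is an assumption and may differ on our hosts:

```shell
# Minimal sketch: check 802.3ad (LACP) bonding health from a host.
# "bond0" is an assumed interface name; adjust for the actual hosts.
BOND=/proc/net/bonding/bond0

if [ -r "$BOND" ]; then
    # Mode should report 802.3ad, MII Status should be "up" for the bond
    # and every slave, and all slaves should share the same Aggregator ID.
    grep -E 'Bonding Mode|MII Status|Aggregator ID' "$BOND"
    status=ok
else
    # No bonding info available on this machine.
    echo "no bonding info at $BOND"
    status=missing
fi
```

Running this on every host right after an intervention would have shown the aggregation breaking before the cluster noticed.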
<br>
Then, once the network came back up, the master of the cluster tried to remount<br>
the DRBD storage but could not, because some process was keeping it busy, so it<br>
did not fully start up.<br>
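Something we could script into the recovery path is checking what holds the mount busy before retrying the remount. A minimal sketch, assuming a DRBD-backed filesystem at a hypothetical mount point (the real path will differ):

```shell
# Sketch, assuming a DRBD-backed filesystem mounted at $MNT
# (/srv/drbd0 is a made-up path; substitute the real mount point).
MNT="${MNT:-/srv/drbd0}"

# fuser -m lists PIDs with open files on that filesystem; an empty
# result means nothing holds the mount and a clean umount/remount
# should succeed.
holders=$(fuser -m "$MNT" 2>/dev/null)

if [ -z "$holders" ]; then
    echo "nothing holds $MNT; safe to remount"
else
    echo "busy, held by PIDs:$holders"
    # Inspect with e.g. ps -fp $holders, then stop those services cleanly
    # before retrying the umount/remount.
fi
```

Had the master logged this during startup, we would have seen immediately which process blocked the DRBD remount instead of silently failing to come up.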
<br>
That is a scenario we had not tested (we tested a single-node failure, not both<br>
nodes failing at once), so we will have to investigate that failure mode and<br>
find a solution for the clustering.<br>
<br>
We are also talking with the hosting provider so that they coordinate this type<br>
of intervention with us, so this will not happen again.<br>
<br>
<br>
Thanks for your patience<br>
<div class="HOEnZb"><div class="h5"><br>
><br>
> --<br>
> David Caro<br>
><br>
> Red Hat S.L.<br>
> Continuous Integration Engineer - EMEA ENG Virtualization R&D<br>
><br>
> Tel.: <a href="tel:%2B420%20532%20294%20605" value="+420532294605">+420 532 294 605</a><br>
> Email: <a href="mailto:dcaro@redhat.com">dcaro@redhat.com</a><br>
> IRC: dcaro|dcaroest@{freenode|oftc|redhat}<br>
> Web: <a href="http://www.redhat.com" rel="noreferrer" target="_blank">www.redhat.com</a><br>
> RHT Global #: 82-62605<br>
<br>
<br>
<br>
--<br>
David Caro<br>
<br>
Red Hat S.L.<br>
Continuous Integration Engineer - EMEA ENG Virtualization R&D<br>
<br>
Tel.: <a href="tel:%2B420%20532%20294%20605" value="+420532294605">+420 532 294 605</a><br>
Email: <a href="mailto:dcaro@redhat.com">dcaro@redhat.com</a><br>
IRC: dcaro|dcaroest@{freenode|oftc|redhat}<br>
Web: <a href="http://www.redhat.com" rel="noreferrer" target="_blank">www.redhat.com</a><br>
RHT Global #: 82-62605<br>
</div></div><br>_______________________________________________<br>
Devel mailing list<br>
<a href="mailto:Devel@ovirt.org">Devel@ovirt.org</a><br>
<a href="http://lists.ovirt.org/mailman/listinfo/devel" rel="noreferrer" target="_blank">http://lists.ovirt.org/mailman/listinfo/devel</a><br></blockquote></div><br><br clear="all"><div><br></div>-- <br><div class="gmail_signature"><div dir="ltr"><div>Eyal Edri<br>Associate Manager<br>EMEA ENG Virtualization R&D<br>Red Hat Israel<br><br>phone: +972-9-7692018<br>irc: eedri (on #tlv #rhev-dev #rhev-integ)</div></div></div>
</div>