<div dir="ltr">Great root cause analysis!<div>Maybe we should add something like 'outage' guidelines to the wiki or readthedocs for any infra member who is about to do something </div><div>that might affect the DC?</div><div><br></div><div>Probably an email to the infra/devel list should be OK, or even emailing <a href="mailto:infra-support@ovirt.org">infra-support@ovirt.org</a> so there will be an open ticket.</div><div><br></div><div>Thoughts?</div><div><br></div><div>E.</div></div><div class="gmail_extra"><br><div class="gmail_quote">On Tue, Feb 9, 2016 at 4:18 PM, David Caro <span dir="ltr"><<a href="mailto:dcaro@redhat.com" target="_blank">dcaro@redhat.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">On 02/08 20:28, David Caro wrote:<br>
><br>
> Hi everyone!<br>
><br>
> There has been a storage outage today; it started around 17:30 CEST and lasted<br>
> until ~20:15. All the services are back up and running now, but a bunch of<br>
> Jenkins jobs failed due to the outage (all the slaves use that storage),<br>
> so you might see some false positives in your CI runs. To retrigger, you can use<br>
> this job:<br>
><br>
> <a href="http://jenkins.ovirt.org/gerrit_manual_trigger/" rel="noreferrer" target="_blank">http://jenkins.ovirt.org/gerrit_manual_trigger/</a><br>
><br>
> And/or submit a new patchset (rebasing should work). In any case, if you have<br>
> any issues or doubts, please respond to this email or ping me (dcaro/dcaroest)<br>
> on IRC.<br>
><br>
> Sorry for the inconvenience; we are gathering logs to find out what happened and<br>
> prevent it from happening in the future.<br>
<br>
<br>
</span>So the source of the issue has been tracked down: an uncoordinated intervention<br>
changed the LACP settings on the switches for all the hosts, which caused a<br>
global network outage (all the hosts were affected). That in turn froze the<br>
clustering, since neither node could reach the other over the network and both<br>
went down.<br>
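For future reference, a quick host-side sanity check after a switch-side LACP change is to read the Linux bonding status under /proc. This is only a sketch; the `bond0` interface name is an assumption and may differ on our hosts:

```shell
# Minimal sketch: check 802.3ad (LACP) bonding health from a host.
# "bond0" is an assumed interface name; adjust for the actual hosts.
BOND=/proc/net/bonding/bond0

if [ -r "$BOND" ]; then
    # Mode should report 802.3ad, MII Status should be "up" for the bond
    # and every slave, and all slaves should share the same Aggregator ID.
    grep -E 'Bonding Mode|MII Status|Aggregator ID' "$BOND"
    status=ok
else
    # No bonding info available on this machine.
    echo "no bonding info at $BOND"
    status=missing
fi
```

Running this on every host right after an intervention would have shown the aggregation breaking before the cluster noticed.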
<br>
Then, once the network came back up, the master of the cluster tried to remount<br>
the DRBD storage but could not, because some process was keeping it busy, so it<br>
did not fully start up.<br>
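Something we could script into the recovery path is checking what holds the mount busy before retrying the remount. A minimal sketch, assuming a DRBD-backed filesystem at a hypothetical mount point (the real path will differ):

```shell
# Sketch, assuming a DRBD-backed filesystem mounted at $MNT
# (/srv/drbd0 is a made-up path; substitute the real mount point).
MNT="${MNT:-/srv/drbd0}"

# fuser -m lists PIDs with open files on that filesystem; an empty
# result means nothing holds the mount and a clean umount/remount
# should succeed.
holders=$(fuser -m "$MNT" 2>/dev/null)

if [ -z "$holders" ]; then
    echo "nothing holds $MNT; safe to remount"
else
    echo "busy, held by PIDs:$holders"
    # Inspect with e.g. ps -fp $holders, then stop those services cleanly
    # before retrying the umount/remount.
fi
```

Had the master logged this during startup, we would have seen immediately which process blocked the DRBD remount instead of silently failing to come up.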
<br>
That is a scenario we had not tested (we tested a single-node failure, not both<br>
nodes failing at once), so we will have to investigate that failure mode and<br>
find a solution for the clustering.<br>
<br>
We are also talking with the hosting provider so that they coordinate this type<br>
of intervention with us, so this will not happen again.<br>
<br>
<br>
Thanks for your patience<br>
<div class="HOEnZb"><div class="h5"><br>
><br>
> --<br>
> David Caro<br>
><br>
> Red Hat S.L.<br>
> Continuous Integration Engineer - EMEA ENG Virtualization R&D<br>
><br>
> Tel.: <a href="tel:%2B420%20532%20294%20605" value="+420532294605">+420 532 294 605</a><br>
> Email: <a href="mailto:dcaro@redhat.com">dcaro@redhat.com</a><br>
> IRC: dcaro|dcaroest@{freenode|oftc|redhat}<br>
> Web: <a href="http://www.redhat.com" rel="noreferrer" target="_blank">www.redhat.com</a><br>
> RHT Global #: 82-62605<br>
<br>
<br>
<br>
--<br>
David Caro<br>
<br>
Red Hat S.L.<br>
Continuous Integration Engineer - EMEA ENG Virtualization R&D<br>
<br>
Tel.: <a href="tel:%2B420%20532%20294%20605" value="+420532294605">+420 532 294 605</a><br>
Email: <a href="mailto:dcaro@redhat.com">dcaro@redhat.com</a><br>
IRC: dcaro|dcaroest@{freenode|oftc|redhat}<br>
Web: <a href="http://www.redhat.com" rel="noreferrer" target="_blank">www.redhat.com</a><br>
RHT Global #: 82-62605<br>
</div></div><br>_______________________________________________<br>
Devel mailing list<br>
<a href="mailto:Devel@ovirt.org">Devel@ovirt.org</a><br>
<a href="http://lists.ovirt.org/mailman/listinfo/devel" rel="noreferrer" target="_blank">http://lists.ovirt.org/mailman/listinfo/devel</a><br></blockquote></div><br><br clear="all"><div><br></div>-- <br><div class="gmail_signature"><div dir="ltr"><div>Eyal Edri<br>Associate Manager<br>EMEA ENG Virtualization R&D<br>Red Hat Israel<br><br>phone: +972-9-7692018<br>irc: eedri (on #tlv #rhev-dev #rhev-integ)</div></div></div>
</div>