[ovirt-devel] PHX storage outage - all jenkins slaves affected

Wed Feb 10 11:49:22 UTC 2016

Great root cause analysis!
Maybe we should add something like 'outage' guidelines on the wiki or
readthedocs for any infra member that is about to do something
that might affect the DC?

An email to the infra/devel list should probably be OK, or even an email to
infra-support at ovirt.org so that a ticket gets opened.

Thoughts?

E.

On Tue, Feb 9, 2016 at 4:18 PM, David Caro <dcaro at redhat.com> wrote:

> On 02/08 20:28, David Caro wrote:
> >
> > Hi everyone!
> >
> > There was a storage outage today; it started around 17:30 CEST and lasted
> > until ~20:15. All the services are back up and running now, but a bunch of
> > Jenkins jobs failed due to the outage (all the slaves use that storage),
> > so you might see some false positives in your CI runs. To retrigger, you
> > can use this job:
> >
> >     http://jenkins.ovirt.org/gerrit_manual_trigger/
> >
> > And/or submit a new patchset (rebasing should work). In any case, if you
> > have any issues or doubts, please respond to this email or ping me
> > (dcaro/dcaroest) on IRC.
> >
> > Sorry for the inconvenience; we are gathering logs to find out what
> > happened and prevent it from happening in the future.
>
>
> So the source of the issue has been sorted out: there was some uncoordinated
> effort that ended up changing the LACP settings in the switches for all the
> hosts, which caused a global network outage (all the hosts were affected).
> That in turn caused the clustering to freeze: as neither of the nodes was
> able to contact the network, both went down.
>
> Then, once the network came back up, the master of the cluster tried to
> remount the DRBD storage but was unable to, because some process was keeping
> it busy, and it did not fully start up.
>
> That is a scenario that we did not test (we tested one node failing, not
> both), so we will have to investigate that failure case and find a solution
> for the clustering.
>
> We are also talking with the hosting provider so that they properly sync
> with us on that type of intervention and this does not happen again.
>
>
> Thanks for your patience
>
> >
> > --
> > David Caro
> >
> > Red Hat S.L.
> > Continuous Integration Engineer - EMEA ENG Virtualization R&D
> >
> > Tel.: +420 532 294 605
> > Email: dcaro at redhat.com
> > IRC: dcaro|dcaroest@{freenode|oftc|redhat}
> > Web: www.redhat.com
> > RHT Global #: 82-62605
>
>

-- 
Eyal Edri
Associate Manager
EMEA ENG Virtualization R&D
Red Hat Israel

phone: +972-9-7692018
irc: eedri (on #tlv #rhev-dev #rhev-integ)