[ovirt-devel] PHX storage outage - all jenkins slaves affected

Eyal Edri eedri at redhat.com
Wed Feb 10 11:49:40 UTC 2016


On Wed, Feb 10, 2016 at 11:49 AM, Eyal Edri <eedri at redhat.com> wrote:

> Great root call analysis!
>

s/call/cause :)


> Maybe we should add something like 'outage' guidelines on the wiki or
> readthedocs for any infra member who is about to do something that might
> affect the DC?
>
> Probably an email to the infra/devel list would be OK, or even emailing
> infra-support at ovirt.org so there will be an open ticket.
>
> Thoughts?
>
> E.
>
> On Tue, Feb 9, 2016 at 4:18 PM, David Caro <dcaro at redhat.com> wrote:
>
>> On 02/08 20:28, David Caro wrote:
>> >
>> > Hi everyone!
>> >
>> > There has been a storage outage today; it started around 17:30 CEST and
>> > lasted until ~20:15. All the services are back up and running now, but a
>> > bunch of Jenkins jobs failed due to the outage (all the slaves use that
>> > storage), so you might see some false positives in your CI runs. To
>> > retrigger, you can use this job:
>> >
>> >     http://jenkins.ovirt.org/gerrit_manual_trigger/
>> >
>> > And/or submit a new patchset (rebasing should work). In any case, if you
>> > have any issues or doubts, please respond to this email or ping me
>> > (dcaro/dcaroest) on IRC.
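>> >
>> > If you prefer scripting the retrigger, something like the sketch below
>> > should work against the Jenkins remote API (the parameter names
>> > GERRIT_PROJECT and GERRIT_REFSPEC and the example values are a guess,
>> > check the job's "Build with Parameters" page for the real ones):
>> >
>> >     # Minimal sketch: trigger the manual-trigger job over the Jenkins
>> >     # remote API. Needs the 'requests' package and a Jenkins API token.
>> >     import requests
>> >
>> >     JOB_URL = 'http://jenkins.ovirt.org/gerrit_manual_trigger/'
>> >
>> >     # Assumed parameter names and example values, adjust to your patch.
>> >     params = {
>> >         'GERRIT_PROJECT': 'ovirt-engine',
>> >         'GERRIT_REFSPEC': 'refs/changes/45/12345/2',
>> >     }
>> >
>> >     resp = requests.post(
>> >         JOB_URL + 'buildWithParameters',
>> >         params=params,
>> >         auth=('your_user', 'your_api_token'),  # Jenkins credentials
>> >     )
>> >     resp.raise_for_status()
>> >     print('Triggered, queue item:', resp.headers.get('Location'))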
>> >
>> > Sorry for the inconvenience; we are gathering logs to find out what
>> > happened and prevent it from happening in the future.
>>
>>
>> So the source of the issue has been sorted out: there was some
>> uncoordinated effort that ended up changing the LACP settings in the
>> switches for all the hosts, which caused a global network outage (all the
>> hosts were affected), and that in turn caused the clustering to freeze,
>> since neither node was able to reach the network and both went down.
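>>
>> For the record, a periodic check of the bonding state on the hosts might
>> help us catch this kind of link/LACP degradation faster next time. A
>> minimal sketch (assuming the hosts use a bond0 interface in 802.3ad mode,
>> the interface name is an assumption):
>>
>>     # Read the kernel's bonding status and complain if the bond is not in
>>     # LACP (802.3ad) mode or if any slave link is down.
>>     BOND = '/proc/net/bonding/bond0'  # interface name is an assumption
>>
>>     with open(BOND) as f:
>>         lines = f.read().splitlines()
>>
>>     mode_ok = any('Bonding Mode' in l and '802.3ad' in l for l in lines)
>>     slaves_up = all('up' in l for l in lines if l.startswith('MII Status'))
>>
>>     if not (mode_ok and slaves_up):
>>         print('WARNING: bonding/LACP state looks wrong on this host')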
>>
>> Then, once the network came back up, the master of the cluster tried to
>> remount the DRBD storage but was unable to because some process was
>> keeping it busy, so it did not fully start up.
>>
>> That is a scenario that we did not test (we tested one node failing, not
>> both), so we will have to investigate that failure case and find a
>> solution for the clustering.
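>>
>> For the busy-mount part, a first step is probably just to see what is
>> holding the device when the remount fails. A minimal sketch (the mount
>> point path is made up, use whatever the cluster actually mounts):
>>
>>     import subprocess
>>
>>     MOUNT_POINT = '/srv/drbd'  # hypothetical path, adjust to the real one
>>
>>     # 'fuser -vm' lists the processes keeping the filesystem busy; it
>>     # prints its verbose table on stderr and exits non-zero when nothing
>>     # is using the mount.
>>     result = subprocess.run(['fuser', '-vm', MOUNT_POINT],
>>                             stdout=subprocess.PIPE, stderr=subprocess.PIPE,
>>                             universal_newlines=True)
>>     if result.returncode == 0:
>>         print('Mount is busy:')
>>         print(result.stderr)
>>     else:
>>         print('Nothing is holding', MOUNT_POINT)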
>>
>> We are also talking with the hosting provider so that they properly sync
>> with us on this type of intervention, so this will not happen again.
>>
>>
>> Thanks for your patience
>>
>> >
>> > --
>> > David Caro
>> >
>> > Red Hat S.L.
>> > Continuous Integration Engineer - EMEA ENG Virtualization R&D
>> >
>> > Tel.: +420 532 294 605
>> > Email: dcaro at redhat.com
>> > IRC: dcaro|dcaroest@{freenode|oftc|redhat}
>> > Web: www.redhat.com
>> > RHT Global #: 82-62605
>>
>>
>>
>> --
>> David Caro
>>
>> Red Hat S.L.
>> Continuous Integration Engineer - EMEA ENG Virtualization R&D
>>
>> Tel.: +420 532 294 605
>> Email: dcaro at redhat.com
>> IRC: dcaro|dcaroest@{freenode|oftc|redhat}
>> Web: www.redhat.com
>> RHT Global #: 82-62605
>>
>> _______________________________________________
>> Devel mailing list
>> Devel at ovirt.org
>> http://lists.ovirt.org/mailman/listinfo/devel
>>
>
>
>
> --
> Eyal Edri
> Associate Manager
> EMEA ENG Virtualization R&D
> Red Hat Israel
>
> phone: +972-9-7692018
> irc: eedri (on #tlv #rhev-dev #rhev-integ)
>



-- 
Eyal Edri
Associate Manager
EMEA ENG Virtualization R&D
Red Hat Israel

phone: +972-9-7692018
irc: eedri (on #tlv #rhev-dev #rhev-integ)