PHX storage outage - all Jenkins slaves affected

Hi everyone!

There has been a storage outage today; it started around 17:30 CEST and lasted until ~20:15. All the services are back up and running now, but a bunch of Jenkins jobs failed due to the outage (all the slaves use that storage), so you might see some false positives in your CI runs. To retrigger, you can use this job:

http://jenkins.ovirt.org/gerrit_manual_trigger/

And/or submit a new patchset (rebasing should work). In any case, if you have any issues or doubts, please respond to this email or ping me (dcaro/dcaroest) on IRC.

Sorry for the inconvenience; we are gathering logs to find out what happened and prevent it from happening in the future.

-- David Caro
Red Hat S.L. Continuous Integration Engineer - EMEA ENG Virtualization R&D
Tel.: +420 532 294 605 Email: dcaro@redhat.com IRC: dcaro|dcaroest@{freenode|oftc|redhat} Web: www.redhat.com RHT Global #: 82-62605
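[For the "new patchset" route, here is a minimal sketch of the usual Gerrit flow, written as a small Python wrapper around git. The remote name "origin" and the "master" target branch are assumptions for illustration, not anything the trigger job requires; adjust them to your checkout.]

import subprocess

def run(*cmd):
    # run one git command, stopping on the first failure
    subprocess.run(cmd, check=True)

# Rebase the open change onto the latest master: the rebase gives the
# commit a new SHA, so Gerrit records the re-push as a new patchset and
# CI runs again. If master has not moved, amend the commit instead so
# the SHA still changes.
run("git", "fetch", "origin")
run("git", "rebase", "origin/master")

# "refs/for/<branch>" is Gerrit's standard push target for review uploads.
run("git", "push", "origin", "HEAD:refs/for/master")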

On 02/08 20:28, David Caro wrote:
> There has been a storage outage today; it started around 17:30 CEST and lasted until ~20:15. [...]

So the source of the issue has been sorted out: an uncoordinated intervention ended up changing the LACP settings on the switches for all the hosts, which caused a global network outage (all the hosts were affected). That in turn froze the cluster, since neither node could reach the network; both went down.

Then, once the network came back up, the master of the cluster tried to remount the DRBD storage but could not, because some process was keeping it busy, and so it did not fully start up.

That is a scenario we had not tested (we tested one node failing, not both), so we will have to investigate that failure case and find a solution for the clustering.

We are also talking with the hosting provider to make sure this type of intervention is coordinated with us, so it will not happen again.

Thanks for your patience.

-- David Caro
Red Hat S.L. Continuous Integration Engineer - EMEA ENG Virtualization R&D
Tel.: +420 532 294 605 Email: dcaro@redhat.com IRC: dcaro|dcaroest@{freenode|oftc|redhat} Web: www.redhat.com RHT Global #: 82-62605
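[On the "some process keeping it busy" remount failure: the standard first check is to list what still holds files under the mount, e.g. with fuser or lsof on the box itself. Below is a rough script-form sketch of the same idea, assuming the psutil library is available and using /srv/drbd as a stand-in for the real DRBD mount point, which is not named in this thread.]

import psutil

MOUNT = "/srv/drbd"  # placeholder, not the cluster's actual mount point

# Report every process holding an open file, or its working directory,
# under the mount; any hit is a process that would block umount/remount.
for proc in psutil.process_iter(["pid", "name"]):
    try:
        paths = [f.path for f in proc.open_files()]
        paths.append(proc.cwd())
    except (psutil.AccessDenied, psutil.NoSuchProcess):
        continue
    held = [p for p in paths if p.startswith(MOUNT)]
    if held:
        print(proc.info["pid"], proc.info["name"], held)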

Great root call analysis!

Maybe we should add something like 'outage' guidelines on the wiki or readthedocs for any infra member who is about to do something that might affect the DC? Probably an email to the infra/devel list would be OK, or even emailing infra-support@ovirt.org so there will be an open ticket.

Thoughts?

E.

On Tue, Feb 9, 2016 at 4:18 PM, David Caro <dcaro@redhat.com> wrote:
On 02/08 20:28, David Caro wrote:
Hi everyone!
There has been a storage outage today; it started around 17:30 CEST and lasted until ~20:15. All the services are back up and running now, but a bunch of Jenkins jobs failed due to the outage (all the slaves use that storage), so you might see some false positives in your CI runs. To retrigger, you can use this job:
http://jenkins.ovirt.org/gerrit_manual_trigger/
And/or submit a new patchset (rebasing should work). In any case, if you have any issues or doubts, please respond to this email or ping me (dcaro/dcaroest) on IRC.

Sorry for the inconvenience; we are gathering logs to find out what happened and prevent it from happening in the future.
So the source of the issue has been sorted out: an uncoordinated intervention ended up changing the LACP settings on the switches for all the hosts, which caused a global network outage (all the hosts were affected). That in turn froze the cluster, since neither node could reach the network; both went down.
Then, once the network came back up, the master of the cluster tried to remount the DRBD storage but could not, because some process was keeping it busy, and so it did not fully start up.
That is a scenario we had not tested (we tested one node failing, not both), so we will have to investigate that failure case and find a solution for the clustering.
We are also talking with the hosting provider to make sure this type of intervention is coordinated with us, so it will not happen again.

Thanks for your patience.
-- David Caro
Red Hat S.L. Continuous Integration Engineer - EMEA ENG Virtualization R&D
Tel.: +420 532 294 605 Email: dcaro@redhat.com IRC: dcaro|dcaroest@{freenode|oftc|redhat} Web: www.redhat.com RHT Global #: 82-62605
-- David Caro
Red Hat S.L. Continuous Integration Engineer - EMEA ENG Virtualization R&D
Tel.: +420 532 294 605 Email: dcaro@redhat.com IRC: dcaro|dcaroest@{freenode|oftc|redhat} Web: www.redhat.com RHT Global #: 82-62605
-- Eyal Edri Associate Manager EMEA ENG Virtualization R&D Red Hat Israel phone: +972-9-7692018 irc: eedri (on #tlv #rhev-dev #rhev-integ)

On Wed, Feb 10, 2016 at 11:49 AM, Eyal Edri <eedri@redhat.com> wrote:
Great root call analysis!
s/call/cause :)
-- Eyal Edri Associate Manager EMEA ENG Virtualization R&D Red Hat Israel phone: +972-9-7692018 irc: eedri (on #tlv #rhev-dev #rhev-integ)
participants (2)
- David Caro
- Eyal Edri