July 2018 - Infra - oVirt List Archives

[oVirt Jenkins] ovirt-system-tests_he-basic-suite-4.2 - Build # 391 - Failure!
by jenkins＠jenkins.phx.ovirt.org 18 Jul '18

Project: http://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-4.2/
Build: http://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-4.2/391/
Build Number: 391
Build Status: Failure
Triggered By: Started by timer

-------------------------------------
Changes Since Last Success:
-------------------------------------
Changes for Build #391
[Dafna Ron] ovirt-master: skipping deactivae/activate storage tests

-----------------
Failed Tests:
-----------------
All tests passed

[oVirt Jenkins] ovirt-system-tests_he-node-ng-suite-4.2 - Build # 137 - Failure!
by jenkins＠jenkins.phx.ovirt.org 18 Jul '18

Project: http://jenkins.ovirt.org/job/ovirt-system-tests_he-node-ng-suite-4.2/
Build: http://jenkins.ovirt.org/job/ovirt-system-tests_he-node-ng-suite-4.2/137/
Build Number: 137
Build Status: Failure
Triggered By: Started by timer

-------------------------------------
Changes Since Last Success:
-------------------------------------
Changes for Build #137
[Dafna Ron] ovirt-master: skipping deactivae/activate storage tests

-----------------
Failed Tests:
-----------------
No tests ran.

[JIRA] (OVIRT-2345) Allow parallel reposyncs
by Daniel Belenky (oVirt JIRA) 18 Jul '18

[ https://ovirt-jira.atlassian.net/browse/OVIRT-2345?page=com.atlassian.jira.… ]

Daniel Belenky updated OVIRT-2345:
----------------------------------
Description: Since we're going to run multiple suites in parallel on the same host, the repolock may become a bottleneck. We will probably need a per-container cache or something similar.

> Allow parallel reposyncs
> ------------------------
>
> Key: OVIRT-2345
> URL: https://ovirt-jira.atlassian.net/browse/OVIRT-2345
> Project: oVirt - virtualization made easy
> Issue Type: Task
> Reporter: Daniel Belenky
> Assignee: infra
>
> Since we're going to run multiple suites in parallel on the same host, the repolock may become a bottleneck. We will probably need a per-container cache or something similar.

--
This message was sent by Atlassian Jira (v1001.0.0-SNAPSHOT#100090)
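
Editorial sketch of the per-container cache idea mentioned in the ticket. This is only an illustration and not the project's actual reposync code: the function names, the `reposync-cache` directory layout, and the `.reposync.lock` file name are all hypothetical. It assumes each suite gets its own cache directory with its own lock file, so suites running in parallel on one host no longer contend for a single host-wide lock.

```python
import fcntl
import os
import tempfile
from contextlib import contextmanager


def per_container_cache(base_dir, container_id):
    """Return a cache directory private to one container (hypothetical layout)."""
    return os.path.join(base_dir, "reposync-cache", container_id)


@contextmanager
def cache_lock(cache_dir):
    """Hold an exclusive lock that is scoped to a single cache directory."""
    os.makedirs(cache_dir, exist_ok=True)
    lock_path = os.path.join(cache_dir, ".reposync.lock")
    with open(lock_path, "w") as lock_file:
        # Blocks only callers that use the same cache directory.
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            yield lock_path
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)


if __name__ == "__main__":
    base = tempfile.mkdtemp()
    suite_a = per_container_cache(base, "suite-a")
    suite_b = per_container_cache(base, "suite-b")
    # Both locks can be held at once because they live in different directories.
    with cache_lock(suite_a), cache_lock(suite_b):
        print("both suites hold their own lock")
```

The trade-off in this sketch is disk space and bandwidth: separate caches mean each container may download the same packages again.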

[JIRA] (OVIRT-2345) Allow parallel reposyncs
by Daniel Belenky (oVirt JIRA) 18 Jul '18

[ https://ovirt-jira.atlassian.net/browse/OVIRT-2345?page=com.atlassian.jira.… ]

Daniel Belenky updated OVIRT-2345:
----------------------------------
Epic Link: OVIRT-2326

> Allow parallel reposyncs
> ------------------------
>
> Key: OVIRT-2345
> URL: https://ovirt-jira.atlassian.net/browse/OVIRT-2345
> Project: oVirt - virtualization made easy
> Issue Type: Task
> Reporter: Daniel Belenky
> Assignee: infra
>

--
This message was sent by Atlassian Jira (v1001.0.0-SNAPSHOT#100090)

[JIRA] (OVIRT-2345) Allow parallel reposyncs
by Daniel Belenky (oVirt JIRA) 18 Jul '18

Daniel Belenky created OVIRT-2345:
-------------------------------------
Summary: Allow parallel reposyncs
Key: OVIRT-2345
URL: https://ovirt-jira.atlassian.net/browse/OVIRT-2345
Project: oVirt - virtualization made easy
Issue Type: Task
Reporter: Daniel Belenky
Assignee: infra

--
This message was sent by Atlassian Jira (v1001.0.0-SNAPSHOT#100090)

[JIRA] (OVIRT-2264) Cannot deactivate Storage while there are running tasks on this Storage
by Dafna Ron (oVirt JIRA) 18 Jul '18

[ https://ovirt-jira.atlassian.net/browse/OVIRT-2264?page=com.atlassian.jira.… ]

Dafna Ron commented on OVIRT-2264:
----------------------------------
[~gbenhaim(a)redhat.com] can you please merge?

> Cannot deactivate Storage while there are running tasks on this Storage
> -----------------------------------------------------------------------
>
> Key: OVIRT-2264
> URL: https://ovirt-jira.atlassian.net/browse/OVIRT-2264
> Project: oVirt - virtualization made easy
> Issue Type: Bug
> Reporter: Dafna Ron
> Assignee: infra
> Labels: ost_failures, ost_race
>
> we failed test: 007_sd_reattach.deactivate_storage_domain in vdsm project with error: Cannot deactivate Storage while there are running tasks on this Storage. -Please wait until tasks will finish and try again.]". HTTP response code is 409.
> I am opening this as a race in case something changed in OST tests which would cause this to repeat.
> https://jenkins.ovirt.org/job/ovirt-4.2_change-queue-tester/2506
> https://gerrit.ovirt.org/#/c/92313/2

--
This message was sent by Atlassian Jira (v1001.0.0-SNAPSHOT#100090)

[JIRA] (OVIRT-2338) resources got stuck
by Barak Korren (oVirt JIRA) 18 Jul '18

[ https://ovirt-jira.atlassian.net/browse/OVIRT-2338?page=com.atlassian.jira.… ]

Barak Korren commented on OVIRT-2338:
-------------------------------------
[~dron] no, we want this ticket to track immediate setup of a watchdog device.

> resources got stuck
> -------------------
>
> Key: OVIRT-2338
> URL: https://ovirt-jira.atlassian.net/browse/OVIRT-2338
> Project: oVirt - virtualization made easy
> Issue Type: Bug
> Reporter: Emil Natan
> Assignee: infra
> Labels: ost_failures, ost_infra
>
> resources.ovirt.org got stuck. Initially we received different nagios alerts about the number of processes and filesystem usage, but the root cause was "Socket timeout after 10 seconds". There was no SSH connectivity, so resetting the VM through the engine UI helped to get it running again.
> The issue affected a few CQ tests.
> A possible improvement could be to set up a watchdog to automatically reboot the VM if it gets stuck.

--
This message was sent by Atlassian Jira (v1001.0.0-SNAPSHOT#100090)

[JIRA] (OVIRT-2338) resources got stuck
by Dafna Ron (oVirt JIRA) 18 Jul '18

[ https://ovirt-jira.atlassian.net/browse/OVIRT-2338?page=com.atlassian.jira.… ]

Dafna Ron commented on OVIRT-2338:
----------------------------------
[~bkorren(a)redhat.com] [~ena(a)redhat.com] based on the mail, can we close this jira as resolved and continue to follow up through the epic?

> resources got stuck
> -------------------
>
> Key: OVIRT-2338
> URL: https://ovirt-jira.atlassian.net/browse/OVIRT-2338
> Project: oVirt - virtualization made easy
> Issue Type: Bug
> Reporter: Emil Natan
> Assignee: infra
> Labels: ost_failures, ost_infra
>
> resources.ovirt.org got stuck. Initially we received different nagios alerts about the number of processes and filesystem usage, but the root cause was "Socket timeout after 10 seconds". There was no SSH connectivity, so resetting the VM through the engine UI helped to get it running again.
> The issue affected a few CQ tests.
> A possible improvement could be to set up a watchdog to automatically reboot the VM if it gets stuck.

--
This message was sent by Atlassian Jira (v1001.0.0-SNAPSHOT#100090)

Failure of resources.ovirt.org on Tuesday
by Barak Korren 18 Jul '18

We had an issue with resources.ovirt.org becoming unresponsive to all network access on Tuesday, July 17th. The issue was tracked in:
https://ovirt-jira.atlassian.net/browse/OVIRT-2338

While trying to analyse and fix this issue, we came across several other issues that have to do with our ability to gain access to the production VMs. Those issues have been documented and collected in the following epic:
https://ovirt-jira.atlassian.net/browse/OVIRT-2337

We need to resolve these issues ASAP to ensure that the next time such a production outage occurs, we don't end up struggling for many minutes to get basic access.

In the end we reached the conclusion that the server itself was stuck and needed a hard reset. This raises the question of why we didn't have a watchdog device configured on the VM to automatically detect and deal with such issues.

This experience led us to the understanding that resources is fulfilling far too many critical roles ATM for us to be able to responsibly keep it as a simple single VM. I've created the following epic to track and discuss work for improving the infrastructure behind resources.ovirt.org to make it less fragile and more reliable:
https://ovirt-jira.atlassian.net/browse/OVIRT-2344

--
Barak Korren
RHV DevOps team, RHCE, RHCi
Red Hat EMEA
redhat.com | TRIED. TESTED. TRUSTED. | redhat.com/trusted
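
Editorial sketch of the watchdog principle raised in the message. It is a conceptual model only, not the configuration of a VM watchdog device: the `Watchdog` class, its method names, and the 10-second timeout are hypothetical. The idea is that a healthy system "pets" the watchdog periodically, and if the pets stop for longer than the timeout, the watchdog is considered expired and a recovery action such as a reset would be triggered.

```python
import time


class Watchdog:
    """Minimal software model of a watchdog timer (illustrative only)."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self._clock = clock  # injectable so the behaviour can be tested
        self._last_pet = clock()

    def pet(self):
        """Record a heartbeat from the monitored system."""
        self._last_pet = self._clock()

    def expired(self):
        """True when no heartbeat arrived within the timeout."""
        return self._clock() - self._last_pet > self.timeout_s


if __name__ == "__main__":
    fake_now = [0.0]
    wd = Watchdog(10.0, clock=lambda: fake_now[0])
    fake_now[0] = 11.0  # the monitored system went silent for 11 seconds
    if wd.expired():
        print("no heartbeat within timeout; a real watchdog would reset the VM")
```

A hardware or virtual watchdog device follows the same pattern, except that the reset is performed outside the stuck guest, so it still works when the guest cannot run any code.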

[JIRA] (OVIRT-2337) Ensure emergency access to PHX production VMs
by Barak Korren (oVirt JIRA) 18 Jul '18

[ https://ovirt-jira.atlassian.net/browse/OVIRT-2337?page=com.atlassian.jira.… ]

Barak Korren updated OVIRT-2337:
--------------------------------
Summary: Ensure emergency access to PHX production VMs (was: Ensure emergency access to PHC production VMs)

> Ensure emergency access to PHX production VMs
> ---------------------------------------------
>
> Key: OVIRT-2337
> URL: https://ovirt-jira.atlassian.net/browse/OVIRT-2337
> Project: oVirt - virtualization made easy
> Issue Type: Epic
> Components: oVirt Infra
> Reporter: Barak Korren
> Assignee: infra
> Priority: High
>
> Make sure that in case of malfunction we have multiple fail-safe ways of gaining access to the PHX production VMs to resolve issues.

--
This message was sent by Atlassian Jira (v1001.0.0-SNAPSHOT#100090)
