Hi all,
I noticed that our hosted-engine suites [1] have been failing often
recently, and decided to have a look at [2], which runs on 4.2 and
should hopefully be "rock solid" and basically never fail.
I looked at these, [3][4][5][6][7], which are all the builds that still
appear in [2] and are marked as failed.
Among them:
- All but one failed while "Waiting for agent to be ready" and timing
out after 10 minutes, as part of 008_restart_he_vm.py, which was added
a month ago [8] and then patched [9].
- The other one [7] failed while "Waiting for engine to migrate", also
eventually timing out after 10 minutes, as part of
010_local_mainentance.py, which was also added in [9].
I also had a look at the last ones that succeeded, builds 329 to 337
of [2]. There:
- "Waiting for agent to be ready" took between 26 and 48 seconds
- "Waiting for engine to migrate" took between 69 and 82 seconds
Assuming these numbers are reasonable (which might be debatable), 10
minutes indeed sounds like a reasonable timeout, and I think we should
handle each failure specifically. Did anyone check them? Was it an
infra issue/load/etc.? A bug? Something else?
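To put rough numbers on that, here is a quick sketch of the margin the
10-minute timeout leaves over the slowest successful runs. The durations
are the ones quoted above from builds 329 to 337; the helper name
headroom is mine and is not something from the test suite.

```python
# Durations in seconds, as observed in the successful builds 329 to 337.
TIMEOUT_SECONDS = 10 * 60

observed = {
    "Waiting for agent to be ready": (26, 48),
    "Waiting for engine to migrate": (69, 82),
}


def headroom(timeout, slowest):
    """Return how many times the slowest observed run fits in the timeout."""
    return timeout / slowest


for step, (fastest, slowest) in observed.items():
    factor = headroom(TIMEOUT_SECONDS, slowest)
    print(f"{step}: {fastest}-{slowest}s, headroom x{factor:.1f}")
```

So even the slowest successful run finished with a margin of roughly 7x
to 12x, which suggests the failed runs were stuck rather than merely slow.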
I haven't checked the logs yet; I might do so later. I also haven't
checked the failures in other jobs in [1].
Best regards,
[1]
https://jenkins.ovirt.org/search/?q=he-basic
[2]
https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-ansible-suite-4.2/
[3]
https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-ansible-suite-4...
[4]
https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-ansible-suite-4...
[5]
https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-ansible-suite-4...
[6]
https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-ansible-suite-4...
[7]
https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-ansible-suite-4...
[8]
https://gerrit.ovirt.org/91952
[9]
https://gerrit.ovirt.org/92341
--
Didi