
Hi all,

I noticed that our hosted-engine suites [1] have been failing often recently, and decided to have a look at [2], which is on 4.2 and should hopefully be "rock solid" and basically never fail.

I looked at these, [3][4][5][6][7], which are all the builds that still appear in [2] and are marked as failed. Among them:

- All but one failed while "Waiting for agent to be ready", timing out after 10 minutes, as part of 008_restart_he_vm.py, which was added a month ago [8] and then patched [9].

- The other one [7] failed while "Waiting for engine to migrate", also eventually timing out after 10 minutes, as part of 010_local_mainentance.py, which was also added in [9].

I also had a look at the last ones that succeeded, builds 329 to 337 of [2]. There:

- "Waiting for agent to be ready" took between 26 and 48 seconds
- "Waiting for engine to migrate" took between 69 and 82 seconds

Assuming these numbers are reasonable (which might be debatable), 10 minutes indeed sounds like a reasonable timeout, and I think we should handle each failure specifically.

Did anyone check them? Was it an infra issue/load/etc.? A bug? Something else?

I didn't check the logs yet, might do this later. I also didn't check the failures in other jobs in [1].

Best regards,

[1] https://jenkins.ovirt.org/search/?q=he-basic
[2] https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-ansible-suite-4.2/
[3] https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-ansible-suite-4.2/...
[4] https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-ansible-suite-4.2/...
[5] https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-ansible-suite-4.2/...
[6] https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-ansible-suite-4.2/...
[7] https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-ansible-suite-4.2/...
[8] https://gerrit.ovirt.org/91952
[9] https://gerrit.ovirt.org/92341

--
Didi