hc-basic-suite-master fails due to missing glusterfs firewalld services


Yedidyah Bar David

16 Jun 2021

12:23 p.m.

Hi,

I now tried running locally hc-basic-suite-master with a patched OST, and it failed due to $subject. I checked and see that this also happened on CI, e.g. [1], before it started failing due to an unrelated reason later:

E TASK [gluster.infra/roles/firewall_config : Add/Delete services to firewalld rules] ***
E failed: [lago-hc-basic-suite-master-host-0] (item=glusterfs) => {"ansible_loop_var": "item", "changed": false, "item": "glusterfs", "msg": "ERROR: Exception caught: org.fedoraproject.FirewallD1.Exception: INVALID_SERVICE: 'glusterfs' not among existing services Permanent and Non-Permanent(immediate) operation, Services are defined by port/tcp relationship and named as they are in /etc/services (on most systems)"}
E failed: [lago-hc-basic-suite-master-host-2] (item=glusterfs) => {"ansible_loop_var": "item", "changed": false, "item": "glusterfs", "msg": "ERROR: Exception caught: org.fedoraproject.FirewallD1.Exception: INVALID_SERVICE: 'glusterfs' not among existing services Permanent and Non-Permanent(immediate) operation, Services are defined by port/tcp relationship and named as they are in /etc/services (on most systems)"}
E failed: [lago-hc-basic-suite-master-host-1] (item=glusterfs) => {"ansible_loop_var": "item", "changed": false, "item": "glusterfs", "msg": "ERROR: Exception caught: org.fedoraproject.FirewallD1.Exception: INVALID_SERVICE: 'glusterfs' not among existing services Permanent and Non-Permanent(immediate) operation, Services are defined by port/tcp relationship and named as they are in /etc/services (on most systems)"}

This seems similar to [2], and indeed I can't see the package 'glusterfs-server' installed locally on host-0. Any idea?

Thanks and best regards,

[1] https://jenkins.ovirt.org/job/ovirt-system-tests_hc-basic-suite-master/2088/
[2] https://github.com/oVirt/ovirt-ansible/issues/124

-- Didi
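For anyone wanting to confirm the symptom on a host before the Ansible role runs, the following is a minimal sketch, not part of OST or gluster-ansible. It checks whether a service name appears in the space-separated list printed by `firewall-cmd --get-services`. The function names `parse_services`, `missing_services` and `query_firewalld` are made up for this illustration.

```python
import subprocess


def parse_services(get_services_output):
    """Split the space-separated output of `firewall-cmd --get-services`."""
    return set(get_services_output.split())


def missing_services(required, get_services_output):
    """Return the required service names that firewalld does not know about."""
    known = parse_services(get_services_output)
    return sorted(set(required) - known)


def query_firewalld():
    """Ask the local firewalld for its known services (firewalld must be running)."""
    return subprocess.run(
        ["firewall-cmd", "--get-services"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout


if __name__ == "__main__":
    # On a host hitting INVALID_SERVICE, this would be expected to list 'glusterfs'.
    print(missing_services(["glusterfs"], query_firewalld()))
```

Running it on host-0 right after the failure would show whether 'glusterfs' is really unknown to firewalld at that point, which would fit the observation that 'glusterfs-server' is not installed.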


Yedidyah Bar David

17 Jun

1:44 p.m.

On Wed, Jun 16, 2021 at 1:23 PM Yedidyah Bar David <didi@redhat.com> wrote:

...


I think I understand: It seems like the deployment of hc relied on the order of running the deploy scripts as written in lagoinitfile. With the new deploy code, all of them run in parallel. Does this make sense?


-- Didi

Marcin Sobczyk

5:26 p.m.


On 6/17/21 1:44 PM, Yedidyah Bar David wrote:

...

The scripts run in parallel as in "on all VMs at the same time", but sequentially as in "one script at a time on each VM" - this is the same behavior we had with lago deployment.
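To make the intended semantics concrete, here is a minimal sketch, not the actual ost_utils code: each VM gets its own worker and VMs proceed concurrently, while the scripts for a given VM run strictly in list order. The mapping `DEPLOY_SCRIPTS`, the short host names and the placeholder `run_script` are assumptions for illustration only.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical VM -> ordered script list; the real lists live in the suite config.
DEPLOY_SCRIPTS = {
    "host-0": ["setup_host.sh", "hc_setup_host.sh", "setup_first_host.sh"],
    "host-1": ["setup_host.sh", "hc_setup_host.sh"],
    "host-2": ["setup_host.sh", "hc_setup_host.sh"],
}


def run_script(vm, script):
    """Placeholder: run one script on one VM and block until it finishes."""
    return f"{vm}:{script}"


def deploy_vm(vm, scripts, runner=run_script):
    """Run the scripts of a single VM one after another, in list order."""
    return [runner(vm, script) for script in scripts]


def deploy_all(scripts_by_vm, runner=run_script):
    """Deploy all VMs concurrently; each VM keeps its own script order."""
    with ThreadPoolExecutor(max_workers=len(scripts_by_vm)) as pool:
        futures = {
            vm: pool.submit(deploy_vm, vm, scripts, runner)
            for vm, scripts in scripts_by_vm.items()
        }
        return {vm: future.result() for vm, future in futures.items()}
```

The ordering guarantee holds only if `runner` really blocks until the script has finished. A runner that returns as soon as the script is started would let the next script begin early.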

Regards, Marcin


Yedidyah Bar David

6:59 p.m.

On Thu, Jun 17, 2021 at 6:27 PM Marcin Sobczyk <msobczyk@redhat.com> wrote:

...

Well, I do not think it works as intended, then. When running locally, I logged into host-0, and after it failed, I had:

# dnf history
ID | Command line | Date and time | Action(s) | Altered
--------------------------------------------------------------------------------
 4 | install -y --nogpgcheck ansible gluster-ansible-roles ovirt-hosted-engine-setup ovirt-ansible-hosted-engine-setup ovirt-ansible-reposit | 2021-06-17 11:54 | I, U | 8
 3 | -y --nogpgcheck install ovirt-host python3-coverage vdsm-hook-vhostmd | 2021-06-08 02:15 | Install | 493 EE
 2 | install -y dnf-utils https://resources.ovirt.org/pub/yum-repo/ovirt-release-master.rpm | 2021-06-08 02:14 | Install | 1
 1 | | 2021-06-08 02:06 | Install | 511 EE

Meaning, it already ran setup_first_host.sh (and failed there), but didn't run hc_setup_host.sh, although it appears before it.

If you check [1], which is a build that failed due to this reason (unlike the later ones), you see there:

------------------------------ Captured log setup ------------------------------
2021-06-07 01:58:38+0000,594 INFO [ost_utils.pytest.fixtures.deployment] Waiting for SSH on the VMs (deployment:40)
2021-06-07 01:59:11+0000,947 INFO [ost_utils.deployment_utils.package_mgmt] oVirt packages used on VMs: (package_mgmt:133)
2021-06-07 01:59:11+0000,948 INFO [ost_utils.deployment_utils.package_mgmt] vdsm-4.40.70.2-1.git34cdc8884.el8.x86_64 (package_mgmt:135)
2021-06-07 01:59:11+0000,950 INFO [ost_utils.deployment_utils.scripts] Running /home/jenkins/workspace/ovirt-system-tests_hc-basic-suite-master/ovirt-system-tests/common/deploy-scripts/setup_host.sh on lago-hc-basic-suite-master-host-1 (scripts:36)
2021-06-07 01:59:11+0000,950 INFO [ost_utils.deployment_utils.scripts] Running /home/jenkins/workspace/ovirt-system-tests_hc-basic-suite-master/ovirt-system-tests/common/deploy-scripts/setup_host.sh on lago-hc-basic-suite-master-host-2 (scripts:36)
2021-06-07 01:59:11+0000,952 INFO [ost_utils.deployment_utils.scripts] Running /home/jenkins/workspace/ovirt-system-tests_hc-basic-suite-master/ovirt-system-tests/common/deploy-scripts/setup_host.sh on lago-hc-basic-suite-master-host-0 (scripts:36)
2021-06-07 01:59:13+0000,260 INFO [ost_utils.deployment_utils.scripts] Running /home/jenkins/workspace/ovirt-system-tests_hc-basic-suite-master/ovirt-system-tests/hc-basic-suite-master/hc_setup_host.sh on lago-hc-basic-suite-master-host-1 (scripts:36)
2021-06-07 01:59:13+0000,370 INFO [ost_utils.deployment_utils.scripts] Running /home/jenkins/workspace/ovirt-system-tests_hc-basic-suite-master/ovirt-system-tests/hc-basic-suite-master/hc_setup_host.sh on lago-hc-basic-suite-master-host-0 (scripts:36)
2021-06-07 01:59:13+0000,526 INFO [ost_utils.deployment_utils.scripts] Running /home/jenkins/workspace/ovirt-system-tests_hc-basic-suite-master/ovirt-system-tests/hc-basic-suite-master/hc_setup_host.sh on lago-hc-basic-suite-master-host-2 (scripts:36)
2021-06-07 01:59:15+0000,250 INFO [ost_utils.deployment_utils.scripts] Running /home/jenkins/workspace/ovirt-system-tests_hc-basic-suite-master/ovirt-system-tests/hc-basic-suite-master/setup_first_host.sh on lago-hc-basic-suite-master-host-0 (scripts:36)

So you see that hc_setup_host.sh was at least logged as being started _after_ setup_host.sh, but very _close_ to it - I can't believe it finished in 2 seconds. This part of the log is the same also for later runs, although they fail earlier. You can compare this with the log of the last successful run (using lago deploy), which also does not very clearly show when each script finished, but at least logs their start in the correct order.

That said, I do not think the solution should be to now spend time on investigating this, finding the root cause, and fixing - I think we should instead stop keeping the list of deploy scripts in lagoinitfile but move them simply to python code.

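The timing argument above can be checked mechanically. The following is a rough sketch, assuming the log line layout shown in the captured log. It extracts the "Running <script> on <host>" entries and reports the gap between consecutive script starts per host. The names `LINE_RE`, `script_starts` and `start_gaps` are made up for this example, and the sample contains only the three host-0 lines quoted above.

```python
import re
from collections import defaultdict
from datetime import datetime

# Matches the "Running <path> on <host>" lines from ost_utils.deployment_utils.scripts.
LINE_RE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\+0000,(?P<ms>\d{3}) INFO\s+"
    r"\[ost_utils\.deployment_utils\.scripts\] Running (?P<path>\S+) on (?P<host>\S+)"
)

SAMPLE_LOG = """\
2021-06-07 01:59:11+0000,952 INFO [ost_utils.deployment_utils.scripts] Running /home/jenkins/workspace/ovirt-system-tests_hc-basic-suite-master/ovirt-system-tests/common/deploy-scripts/setup_host.sh on lago-hc-basic-suite-master-host-0 (scripts:36)
2021-06-07 01:59:13+0000,370 INFO [ost_utils.deployment_utils.scripts] Running /home/jenkins/workspace/ovirt-system-tests_hc-basic-suite-master/ovirt-system-tests/hc-basic-suite-master/hc_setup_host.sh on lago-hc-basic-suite-master-host-0 (scripts:36)
2021-06-07 01:59:15+0000,250 INFO [ost_utils.deployment_utils.scripts] Running /home/jenkins/workspace/ovirt-system-tests_hc-basic-suite-master/ovirt-system-tests/hc-basic-suite-master/setup_first_host.sh on lago-hc-basic-suite-master-host-0 (scripts:36)
"""


def script_starts(log_text):
    """Return {host: [(start_time, script_name), ...]} sorted by start time."""
    starts = defaultdict(list)
    for match in LINE_RE.finditer(log_text):
        when = datetime.strptime(
            f"{match['ts']}.{match['ms']}", "%Y-%m-%d %H:%M:%S.%f"
        )
        script = match["path"].rsplit("/", 1)[-1]
        starts[match["host"]].append((when, script))
    return {host: sorted(events) for host, events in starts.items()}


def start_gaps(events):
    """Seconds between the starts of consecutive scripts on one host."""
    return [
        (prev[1], cur[1], round((cur[0] - prev[0]).total_seconds(), 3))
        for prev, cur in zip(events, events[1:])
    ]


if __name__ == "__main__":
    for host, events in script_starts(SAMPLE_LOG).items():
        print(host, start_gaps(events))
```

The log records only start times, so a small gap shows that the next script was started soon after the previous one, not when the previous one actually finished.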
Best regards,


-- Didi

Marcin Sobczyk

18 Jun

9:18 a.m.


On 6/17/21 6:59 PM, Yedidyah Bar David wrote:

...

[1] is unfortunately already gone - please ping me when you notice this kind of behavior again.

Regards, Marcin


Yedidyah Bar David

20 Jun

11:39 a.m.

On Fri, Jun 18, 2021 at 10:18 AM Marcin Sobczyk <msobczyk@redhat.com> wrote:

...

OK, it seems to be unrelated. Should be fixed with something like:

https://gerrit.ovirt.org/c/ovirt-system-tests/+/115318

hc is not ready yet, though - no urgency in merging this.


-- Didi
