On Fri, Jun 18, 2021 at 10:18 AM Marcin Sobczyk <msobczyk(a)redhat.com> wrote:
On 6/17/21 6:59 PM, Yedidyah Bar David wrote:
> On Thu, Jun 17, 2021 at 6:27 PM Marcin Sobczyk <msobczyk(a)redhat.com> wrote:
>>
>>
>> On 6/17/21 1:44 PM, Yedidyah Bar David wrote:
>>> On Wed, Jun 16, 2021 at 1:23 PM Yedidyah Bar David <didi(a)redhat.com> wrote:
>>>> Hi,
>>>>
>>>> I now tried running locally hc-basic-suite-master with a patched OST,
>>>> and it failed due to $subject. I checked and see that this also
>>>> happened on CI, e.g. [1], before it started failing due to an unrelated
>>>> reason later:
>>>>
>>>> E TASK [gluster.infra/roles/firewall_config : Add/Delete
>>>> services to firewalld rules] ***
>>>> E failed: [lago-hc-basic-suite-master-host-0]
>>>> (item=glusterfs) => {"ansible_loop_var": "item", "changed": false,
>>>> "item": "glusterfs", "msg": "ERROR: Exception caught:
>>>> org.fedoraproject.FirewallD1.Exception: INVALID_SERVICE: 'glusterfs'
>>>> not among existing services Permanent and Non-Permanent(immediate)
>>>> operation, Services are defined by port/tcp relationship and named as
>>>> they are in /etc/services (on most systems)"}
>>>> E failed: [lago-hc-basic-suite-master-host-2]
>>>> (item=glusterfs) => {"ansible_loop_var": "item", "changed": false,
>>>> "item": "glusterfs", "msg": "ERROR: Exception caught:
>>>> org.fedoraproject.FirewallD1.Exception: INVALID_SERVICE: 'glusterfs'
>>>> not among existing services Permanent and Non-Permanent(immediate)
>>>> operation, Services are defined by port/tcp relationship and named as
>>>> they are in /etc/services (on most systems)"}
>>>> E failed: [lago-hc-basic-suite-master-host-1]
>>>> (item=glusterfs) => {"ansible_loop_var": "item", "changed": false,
>>>> "item": "glusterfs", "msg": "ERROR: Exception caught:
>>>> org.fedoraproject.FirewallD1.Exception: INVALID_SERVICE: 'glusterfs'
>>>> not among existing services Permanent and Non-Permanent(immediate)
>>>> operation, Services are defined by port/tcp relationship and named as
>>>> they are in /etc/services (on most systems)"}
>>>>
>>>> This seems similar to [2], and indeed I can't see the package
>>>> 'glusterfs-server' installed locally on host-0. Any idea?
>>> I think I understand:
>>>
>>> It seems like the deployment of hc relied on the order of running the
>>> deploy scripts as written in lagoinitfile. With the new deploy code, all
>>> of them run in parallel. Does this make sense?
>> The scripts run in parallel as in "on all VMs at the same time", but
>> sequentially as in "one script at a time on each VM" - this is the same
>> behavior we had with lago deployment.
> Well, I do not think it works as intended, then. When running locally,
> I logged into host-0, and after it failed, I had:
>
> # dnf history
> ID | Command line | Date and time | Action(s) | Altered
> ---------------------------------------------------------------------------
> 4 | install -y --nogpgcheck ansible gluster-ansible-roles ovirt-hosted-engine-setup ovirt-ansible-hosted-engine-setup ovirt-ansible-reposit | 2021-06-17 11:54 | I, U | 8
> 3 | -y --nogpgcheck install ovirt-host python3-coverage vdsm-hook-vhostmd | 2021-06-08 02:15 | Install | 493 EE
> 2 | install -y dnf-utils https://resources.ovirt.org/pub/yum-repo/ovirt-release-master.rpm | 2021-06-08 02:14 | Install | 1
> 1 | | 2021-06-08 02:06 | Install | 511 EE
>
> Meaning, it already ran setup_first_host.sh (and failed there), but
> didn't run hc_setup_host.sh, although it appears before it.
>
> If you check [1], which is a build that failed due to this reason
> (unlike the later ones), you see there:
>
> ------------------------------ Captured log setup ------------------------------
> 2021-06-07 01:58:38+0000,594 INFO [ost_utils.pytest.fixtures.deployment] Waiting for SSH on the VMs (deployment:40)
> 2021-06-07 01:59:11+0000,947 INFO [ost_utils.deployment_utils.package_mgmt] oVirt packages used on VMs: (package_mgmt:133)
> 2021-06-07 01:59:11+0000,948 INFO [ost_utils.deployment_utils.package_mgmt] vdsm-4.40.70.2-1.git34cdc8884.el8.x86_64 (package_mgmt:135)
> 2021-06-07 01:59:11+0000,950 INFO [ost_utils.deployment_utils.scripts] Running /home/jenkins/workspace/ovirt-system-tests_hc-basic-suite-master/ovirt-system-tests/common/deploy-scripts/setup_host.sh on lago-hc-basic-suite-master-host-1 (scripts:36)
> 2021-06-07 01:59:11+0000,950 INFO [ost_utils.deployment_utils.scripts] Running /home/jenkins/workspace/ovirt-system-tests_hc-basic-suite-master/ovirt-system-tests/common/deploy-scripts/setup_host.sh on lago-hc-basic-suite-master-host-2 (scripts:36)
> 2021-06-07 01:59:11+0000,952 INFO [ost_utils.deployment_utils.scripts] Running /home/jenkins/workspace/ovirt-system-tests_hc-basic-suite-master/ovirt-system-tests/common/deploy-scripts/setup_host.sh on lago-hc-basic-suite-master-host-0 (scripts:36)
> 2021-06-07 01:59:13+0000,260 INFO [ost_utils.deployment_utils.scripts] Running /home/jenkins/workspace/ovirt-system-tests_hc-basic-suite-master/ovirt-system-tests/hc-basic-suite-master/hc_setup_host.sh on lago-hc-basic-suite-master-host-1 (scripts:36)
> 2021-06-07 01:59:13+0000,370 INFO [ost_utils.deployment_utils.scripts] Running /home/jenkins/workspace/ovirt-system-tests_hc-basic-suite-master/ovirt-system-tests/hc-basic-suite-master/hc_setup_host.sh on lago-hc-basic-suite-master-host-0 (scripts:36)
> 2021-06-07 01:59:13+0000,526 INFO [ost_utils.deployment_utils.scripts] Running /home/jenkins/workspace/ovirt-system-tests_hc-basic-suite-master/ovirt-system-tests/hc-basic-suite-master/hc_setup_host.sh on lago-hc-basic-suite-master-host-2 (scripts:36)
> 2021-06-07 01:59:15+0000,250 INFO [ost_utils.deployment_utils.scripts] Running /home/jenkins/workspace/ovirt-system-tests_hc-basic-suite-master/ovirt-system-tests/hc-basic-suite-master/setup_first_host.sh on lago-hc-basic-suite-master-host-0 (scripts:36)
>
> So you see that hc_setup_host.sh was at least logged as being started
> _after_ setup_host.sh, but very _close_ to it - I can't believe it
> finished in 2 seconds. This part of the log is the same also for later
> runs, although they fail earlier. You can compare this with the log of
> the last successful run (using lago deploy), which also does not very
> clearly show when each script finished, but at least logs their start
> in the correct order.
>
> That said, I do not think the solution should be to now spend time on
> investigating this, finding the root cause, and fixing it - I think we
> should instead stop keeping the list of deploy scripts in lagoinitfile
> and simply move them to Python code.
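
A minimal sketch of what keeping the deploy script list in Python could look like, with the ordering described earlier in the thread: one script at a time on each VM, all VMs at the same time. The host and script names are taken from the log above. `DEPLOY_SCRIPTS`, `deploy_host`, `deploy_all` and the `run_script` callable are hypothetical names used for illustration, not the actual ost_utils API, and `run_script` is assumed to block until the script exits and to return its exit code.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-host script lists, in the order they must run.
# Names come from the CI log quoted above; the structure is an assumption.
DEPLOY_SCRIPTS = {
    "lago-hc-basic-suite-master-host-0": [
        "setup_host.sh",
        "hc_setup_host.sh",
        "setup_first_host.sh",
    ],
    "lago-hc-basic-suite-master-host-1": ["setup_host.sh", "hc_setup_host.sh"],
    "lago-hc-basic-suite-master-host-2": ["setup_host.sh", "hc_setup_host.sh"],
}


def deploy_host(host, scripts, run_script):
    """Run the scripts for one host strictly in order, stopping on failure."""
    for script in scripts:
        # run_script is assumed to block until the script has finished.
        exit_code = run_script(host, script)
        if exit_code != 0:
            raise RuntimeError(
                f"{script} failed on {host} with exit code {exit_code}"
            )
    return host


def deploy_all(scripts_by_host, run_script):
    """Deploy all hosts in parallel; each host runs its own scripts serially."""
    with ThreadPoolExecutor(max_workers=len(scripts_by_host)) as pool:
        futures = [
            pool.submit(deploy_host, host, scripts, run_script)
            for host, scripts in scripts_by_host.items()
        ]
        # result() re-raises any exception from the worker thread.
        return [future.result() for future in futures]
```

With this shape, the next script on a host starts only after the previous one has returned, so a 2-second gap between setup_host.sh and hc_setup_host.sh would mean setup_host.sh really did finish that quickly, or that `run_script` is not actually waiting.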
[1] is unfortunately already gone - please ping me when you notice this
kind of behavior again.
OK, it seems to be unrelated. Should be fixed with something like:
Regards, Marcin
> Best regards,
>
>> Regards, Marcin
>>
>>>> Thanks and best regards,
>>>>
>>>> [1] https://jenkins.ovirt.org/job/ovirt-system-tests_hc-basic-suite-master/2088/
>>>>
>>>> [2] https://github.com/oVirt/ovirt-ansible/issues/124
>>>> --
>>>> Didi
>>>
>>> --
>>> Didi
>>>
>