[ovirt-users] hosted-engine Debug Help

Simone Tiraboschi stirabos at redhat.com
Tue Apr 3 20:50:32 UTC 2018


On Tue, Apr 3, 2018 at 9:46 PM, RabidCicada <rabidcicada at gmail.com> wrote:

> Indeed that is the problem.  I am sshing as root
>
> The only possible explanations I have for this working in the ovirt
> automated testing setup are:
>
>    - Either the ovirt automated tests must be using a host inventory file
>    with become: true?
>    - The inventory file ssh credentials happen to be to vdsm user? (And
>    become is just not doing anything)
>    - Some other janky thing that allows the access
>
Yes, you are absolutely right:
we are running the CI suite with no_root_squash
https://github.com/oVirt/ovirt-system-tests/blob/master/common/deploy-scripts/setup_storage_nfs_el7.sh#L17
while we recommend anonuid=36,anongid=36,all_squash
https://www.ovirt.org/documentation/how-to/troubleshooting/troubleshooting-nfs-storage-issues/
and this explains why it can write as root in the CI env.
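
For reference, a minimal /etc/exports sketch with the recommended options; the
/srv/data path and the rw flag are assumptions here (the path is inferred from
the node.local:_srv_data mount seen in the logs), the squash options are the
recommended ones:

    # /etc/exports on the NFS server -- all_squash maps every client uid/gid
    # to 36:36 (vdsm:kvm), so files land with the ownership VDSM expects
    /srv/data  *(rw,anonuid=36,anongid=36,all_squash)

    # re-export without restarting the NFS server
    exportfs -ra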

Thanks for the report!


>
> It also happens for the immediately following task `Initialize metadata
> volume` and many more following that task.
>
> I will correct it and then open a pull request with the fix, now that you've
> confirmed it's a problem.
>
> ~Kyle
>
>
> On Tue, Apr 3, 2018 at 3:08 PM, RabidCicada <rabidcicada at gmail.com> wrote:
>
>> Alright.  By differentially comparing against the successful postgres-command
>> sudo-becomes, it looks like the salient difference is that `Copy
>> configuration archive to storage` is missing become: true.  Testing now.
>>
>> The similar postgres commands have become:true while this one does not.
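
A quick way to do that comparison, assuming the shipped playbooks live under
/usr/share/ovirt-hosted-engine-setup/ansible/ (the same directory the
create_target_vm.retry file below points at):

    # Sketch only: print every become_user line with a few lines of context,
    # so tasks that lack an accompanying "become:" stand out
    grep -n -B4 -A2 'become_user' /usr/share/ovirt-hosted-engine-setup/ansible/*.yml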
>>
>> ~Kyle
>>
>> On Tue, Apr 3, 2018 at 1:19 PM, Simone Tiraboschi <stirabos at redhat.com>
>> wrote:
>>
>>>
>>>
>>> On Tue, Apr 3, 2018 at 5:21 PM, RabidCicada <rabidcicada at gmail.com>
>>> wrote:
>>>
>>>> I've attached a full debug packet below.  It includes the log file from
>>>> ovirt-hosted-engine-setup, the relevant command-line info, output from the
>>>> command line where epdb has a breakpoint in ansible's own playbook.py, and
>>>> info from commands I ran after it failed.  I've also attached the
>>>> ferried-over script from /root/.ansible/tmp that gets run.
>>>>
>>>>
>>>> *Output on cmd line:*
>>>> [ INFO  ] TASK [Copy configuration archive to storage]
>>>> [ ERROR ] fatal: [localhost]: FAILED! => {
>>>>              "changed": true,
>>>>              "cmd": [
>>>>                  "dd",
>>>>                  "bs=20480",
>>>>                  "count=1",
>>>>                  "oflag=direct",
>>>>                  "if=/var/tmp/localvmbCDQIR/5ef881f5-c992-48d2-b969-a0b6156bdf7c",
>>>>                  "of=/rhev/data-center/mnt/node.local:_srv_data/81292f3f-11d3-4e38-9afa-62e133aa8017/images/c5510e77-1ee0-479c-b6cf-24c179313a45/5ef881f5-c992-48d2-b969-a0b6156bdf7c"
>>>>              ],
>>>>              "delta": "0:00:00.004336",
>>>>              "end": "2018-04-03 15:01:55.581823",
>>>>              "invocation": {
>>>>                  "module_args": {
>>>>                      "_raw_params": "dd bs=20480 count=1 oflag=direct if=\"/var/tmp/localvmbCDQIR/5ef881f5-c992-48d2-b969-a0b6156bdf7c\" of=\"/rhev/data-center/mnt/node.local:_srv_data/81292f3f-11d3-4e38-9afa-62e133aa8017/images/c5510e77-1ee0-479c-b6cf-24c179313a45/5ef881f5-c992-48d2-b969-a0b6156bdf7c\"",
>>>>                      "_uses_shell": false,
>>>>                      "chdir": null,
>>>>                      "creates": null,
>>>>                      "executable": null,
>>>>                      "removes": null,
>>>>                      "stdin": null,
>>>>                      "warn": true
>>>>                  }
>>>>              },
>>>>              "msg": "non-zero return code",
>>>>              "rc": 1,
>>>>              "start": "2018-04-03 15:01:55.577487",
>>>>              "stderr": "dd: failed to open ‘/rhev/data-center/mnt/node.local:_srv_data/81292f3f-11d3-4e38-9afa-62e133aa8017/images/c5510e77-1ee0-479c-b6cf-24c179313a45/5ef881f5-c992-48d2-b969-a0b6156bdf7c’: Permission denied",
>>>>              "stderr_lines": [
>>>>                  "dd: failed to open ‘/rhev/data-center/mnt/node.local:_srv_data/81292f3f-11d3-4e38-9afa-62e133aa8017/images/c5510e77-1ee0-479c-b6cf-24c179313a45/5ef881f5-c992-48d2-b969-a0b6156bdf7c’: Permission denied"
>>>>              ],
>>>>              "stdout": "",
>>>>              "stdout_lines": []
>>>>          }
>>>> [ ERROR ] Failed to execute stage 'Closing up': Failed executing
>>>> ansible-playbook
>>>>
>>>
>>> In the playbook we have on that task:
>>>     become_user: vdsm
>>>     become_method: sudo
>>>
>>> but I fear it got somehow ignored.
>>> I'll investigate it.
>>>
>>>
>>>>
>>>> *Output from ansible epdb tracepoint:*
>>>>
>>>> Using module file /usr/lib/python2.7/site-packages/ansible/modules/commands/command.py
>>>> <localhost> ESTABLISH LOCAL CONNECTION FOR USER: root
>>>> <localhost> EXEC /bin/sh -c 'echo ~ && sleep 0'
>>>> <localhost> EXEC /bin/sh -c '( umask 77 && mkdir -p "` echo /root/.ansible/tmp/ansible-tmp-1522767715.36-81496549401055 `" && echo ansible-tmp-1522767715.36-81496549401055="` echo /root/.ansible/tmp/ansible-tmp-1522767715.36-81496549401055 `" ) && sleep 0'
>>>> <localhost> PUT /tmp/tmpGMGdjh TO /root/.ansible/tmp/ansible-tmp-1522767715.36-81496549401055/command.py
>>>> <localhost> EXEC /bin/sh -c 'chmod u+x /root/.ansible/tmp/ansible-tmp-1522767715.36-81496549401055/ /root/.ansible/tmp/ansible-tmp-1522767715.36-81496549401055/command.py && sleep 0'
>>>> <localhost> EXEC /bin/sh -c '/usr/bin/python /root/.ansible/tmp/ansible-tmp-1522767715.36-81496549401055/command.py && sleep 0'
>>>>         to retry, use: --limit @/usr/share/ovirt-hosted-engine-setup/ansible/create_target_vm.retry
>>>>
>>>> The above command.py is the one I have attached as problematic_command.py
>>>>
>>>> *Investigation After Failure:*
>>>> [root@node ~]# ls -al '/rhev/data-center/mnt/node.local:_srv_data/81292f3f-11d3-4e38-9afa-62e133aa8017/images/c5510e77-1ee0-479c-b6cf-24c179313a45/5ef881f5-c992-48d2-b969-a0b6156bdf7c'
>>>> -rw-rw----. 1 vdsm kvm 20480 Apr  3 15:01 /rhev/data-center/mnt/node.local:_srv_data/81292f3f-11d3-4e38-9afa-62e133aa8017/images/c5510e77-1ee0-479c-b6cf-24c179313a45/5ef881f5-c992-48d2-b969-a0b6156bdf7c
>>>>
>>>> sudo -u vdsm dd bs=20480 count=1 oflag=direct if="/var/tmp/localvmbCDQIR/5ef881f5-c992-48d2-b969-a0b6156bdf7c" of="/rhev/data-center/mnt/node.local:_srv_data/81292f3f-11d3-4e38-9afa-62e133aa8017/images/c5510e77-1ee0-479c-b6cf-24c179313a45/5ef881f5-c992-48d2-b969-a0b6156bdf7c"
>>>> 1+0 records in
>>>> 1+0 records out
>>>>
>>>>
>>>> It seems to me that somehow it is not getting the right permissions
>>>> even though the playbook has:
>>>>   - name: Copy configuration archive to storage
>>>>     command: dd bs=20480 count=1 oflag=direct if="{{ LOCAL_VM_DIR }}/{{ he_conf_disk_details.disk.image_id }}" of="{{ he_conf_disk_path }}"
>>>>     become_user: vdsm
>>>>     become_method: sudo
>>>>     changed_when: True
>>>>     tags: [ 'skip_ansible_lint' ]
>>>>
>>>>
>>>> On Tue, Apr 3, 2018 at 8:51 AM, RabidCicada <rabidcicada at gmail.com>
>>>> wrote:
>>>>
>>>>> I am now also running with:
>>>>>
>>>>> export ANSIBLE_VERBOSITY=5
>>>>>
>>>>> export ANSIBLE_FORKS=1
>>>>>
>>>>> export ANSIBLE_KEEP_REMOTE_FILES=1
>>>>>
>>>>>
>>>>> On Tue, Apr 3, 2018 at 8:49 AM, RabidCicada <rabidcicada at gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Here's the log.
>>>>>>
>>>>>> So the command that it says it ran is:
>>>>>> dd bs=20480 count=1 oflag=direct if=/var/tmp/localvmHaWb6G/1cce8df2-1810-4063-b4e2-e19a2c5b1909 of=/rhev/data-center/mnt/node.local:_srv_data/3c7485ea-14e3-40c1-b627-f89a819ed1d6/images/2c1f7c2f-b8f7-46d4-ac66-8ff1e9649e29/1cce8df2-1810-4063-b4e2-e19a2c5b1909
>>>>>>
>>>>>> But we all know that was done with:
>>>>>>
>>>>>>   - name: Copy configuration archive to storage
>>>>>>     command: dd bs=20480 count=1 oflag=direct if="{{ LOCAL_VM_DIR }}/{{ he_conf_disk_details.disk.image_id }}" of="{{ he_conf_disk_path }}"
>>>>>>     become_user: vdsm
>>>>>>     become_method: sudo
>>>>>>     changed_when: True
>>>>>>     tags: [ 'skip_ansible_lint' ]
>>>>>>
>>>>>> So I manually replicated with `sudo -u vdsm dd bs=20480 count=1 if=/var/tmp/localvmHaWb6G/1cce8df2-1810-4063-b4e2-e19a2c5b1909 of=/rhev/data-center/mnt/node.local:_srv_data/3c7485ea-14e3-40c1-b627-f89a819ed1d6/images/2c1f7c2f-b8f7-46d4-ac66-8ff1e9649e29/1cce8df2-1810-4063-b4e2-e19a2c5b1909`
>>>>>>
>>>>>> And it works when I manually do it.  Though I think I didn't use the
>>>>>> oflag=direct (Just realised this)
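
For an exact replication, including oflag=direct and running as the vdsm user
(paths taken verbatim from the log above), something like:

    # replicate the failing task as vdsm, with direct I/O this time
    sudo -u vdsm dd bs=20480 count=1 oflag=direct \
        if=/var/tmp/localvmHaWb6G/1cce8df2-1810-4063-b4e2-e19a2c5b1909 \
        of=/rhev/data-center/mnt/node.local:_srv_data/3c7485ea-14e3-40c1-b627-f89a819ed1d6/images/2c1f7c2f-b8f7-46d4-ac66-8ff1e9649e29/1cce8df2-1810-4063-b4e2-e19a2c5b1909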
>>>>>>
>>>>>> I eventually put a pause task directly preceding it, with debug
>>>>>> output that showed the file paths.  I manually ran the command and it
>>>>>> worked.  Then I let the playbook do it... and it failed.  I checked all the
>>>>>> permissions, e.g. vdsm:kvm.  All looks good from a filesystem point of view.
>>>>>> I'm beginning (naively) to suspect a race condition for the file access
>>>>>> problem... but have come nowhere close to solving it.
>>>>>>
>>>>>> *Can you suggest a good way to continue executing the install process
>>>>>> from the create_target_vm.yml playbook (with proper variables and context
>>>>>> from otopi etc)?*  I currently have to restart the entire process
>>>>>> over again and wait quite a while for it to circle back around.
>>>>>>
>>>>>> I have since discovered epdb and I've set breakpoints directly in
>>>>>> playbook.py of ansible just to see better log output.  I insert epdb.serve
>>>>>> and use netcat to connect since epdb on python 2.7.5 and up seems to have
>>>>>> problems using epdb.connect() itself.
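
A rough sketch of that workflow, assuming epdb.serve() was told to listen on
port 8080 (the port and the exact breakpoint location are assumptions):

    # after inserting "import epdb; epdb.serve(port=8080)" at the breakpoint
    # in ansible's playbook.py, attach from another terminal with netcat
    nc 127.0.0.1 8080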
>>>>>>
>>>>>> ~Kyle
>>>>>>
>>>>>> On Tue, Apr 3, 2018 at 4:06 AM, Simone Tiraboschi <
>>>>>> stirabos at redhat.com> wrote:
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Apr 2, 2018 at 4:52 PM, RabidCicada <rabidcicada at gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Heyo everyone.  I'm trying to debug hosted-engine --deploy.  It is
>>>>>>>> failing in `Copy configuration archive to storage` in
>>>>>>>> `create_target_vm.yml` from `hosted-engine --deploy`.  My general
>>>>>>>> and most important query here is how to get good debug output from ansible
>>>>>>>> through hosted-engine.  I'm running hosted-engine through an ssh session.
>>>>>>>>
>>>>>>>> I can't figure out how to get good debug output from ansible within
>>>>>>>> that workflow.  I see it's running through otopi; I tried setting the
>>>>>>>> typical `debugger: on_failed` hooks and many incantations on the command
>>>>>>>> line and in config files to get ansible to help me out.  The debugger:
>>>>>>>> directive and the other debugger-related ansible config options never
>>>>>>>> resulted in any debugger popping up.  I also can't seem to pass the normal
>>>>>>>> -vvvv flags through hosted-engine to ansible.  Ultimately I tried a `pause`
>>>>>>>> directive and it complained that it was in a non-interactive shell.  I
>>>>>>>> figured it might be the result of my ssh session, so I enabled tty
>>>>>>>> allocation with -t -t.  It did not resolve the issue.
>>>>>>>>
>>>>>>>> I eventually wrote my own (partly borrowed) callback_plugin that checks an
>>>>>>>> environment variable and sets `display.verbosity = int(v)`, since I can't
>>>>>>>> seem to pass the typical -vvvv flags to ansible through `hosted-engine
>>>>>>>> --deploy`.  It gives me the best info that I have so far, but it won't give
>>>>>>>> me enough to debug issues around Gathering Facts or what looks like a
>>>>>>>> sudo/permission problem in `Copy configuration archive to storage` in
>>>>>>>> `create_target_vm.yml`.  I took the exact command it uses and it works when
>>>>>>>> I run it manually (but I can't get debug output to show me the exact sudo
>>>>>>>> command being executed), hence my interest in passing -vvvv or equivalent to
>>>>>>>> ansible through `hosted-engine`.  I intentionally disabled the VM_directory
>>>>>>>> cleanup so that I could run the same steps.
>>>>>>>>
>>>>>>>> So....after all that...what is a good way to get deep debug info
>>>>>>>> from hosted-engine ansible stuff?
>>>>>>>>
>>>>>>>
>>>>>>> You should already find all the relevant log entries in a file called
>>>>>>> /var/log/ovirt-hosted-engine-setup/ovirt-hosted-engine-setup-ansible-create_target_vm-{timestamp}-{hash}.log
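
To find and follow the newest of those logs without knowing the timestamp and
hash, something like:

    # pick the most recent create_target_vm ansible log and follow it
    tail -f "$(ls -t /var/log/ovirt-hosted-engine-setup/ovirt-hosted-engine-setup-ansible-create_target_vm-*.log | head -n1)"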
>>>>>>>
>>>>>>> Can you please share it?
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> Or does anyone have intuition for the possible sudo problem?
>>>>>>>> ~Kyle
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> Users mailing list
>>>>>>>> Users at ovirt.org
>>>>>>>> http://lists.ovirt.org/mailman/listinfo/users
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>