[ovirt-users] HA agent fails to start
Simone Tiraboschi
stirabos at redhat.com
Thu Apr 14 20:38:44 UTC 2016
On Thu, Apr 14, 2016 at 6:53 PM, Richard Neuboeck <hawk at tbi.univie.ac.at> wrote:
> On 14.04.16 18:46, Simone Tiraboschi wrote:
>> On Thu, Apr 14, 2016 at 4:04 PM, Richard Neuboeck <hawk at tbi.univie.ac.at> wrote:
>>> On 04/14/2016 02:14 PM, Simone Tiraboschi wrote:
>>>> On Thu, Apr 14, 2016 at 12:51 PM, Richard Neuboeck
>>>> <hawk at tbi.univie.ac.at> wrote:
>>>>> On 04/13/2016 10:00 AM, Simone Tiraboschi wrote:
>>>>>> On Wed, Apr 13, 2016 at 9:38 AM, Richard Neuboeck <hawk at tbi.univie.ac.at> wrote:
>>>>>>> The answers file shows the setup time of both machines.
>>>>>>>
>>>>>>> On both machines hosted-engine.conf got rotated right before I wrote
>>>>>>> this mail. Is it possible that I managed to interrupt the rotation with
>>>>>>> the reboot so the backup was accurate but the update not yet written to
>>>>>>> hosted-engine.conf?
>>>>>>
>>>>>> AFAIK we don't have any rotation mechanism for that file; something
>>>>>> else you have in place on that host?
>>>>>
>>>>> Those machines are all CentOS 7.2 minimal installs. The only
>>>>> adaptation I do is installing vim, removing postfix and installing
>>>>> exim, removing firewalld and installing iptables-service. Then I add
>>>>> the oVirt repos (3.6 and 3.6-snapshot) and deploy the host.
>>>>>
>>>>> But checking lsof shows that 'ovirt-ha-agent --no-daemon' has access
>>>>> to the config file (and the one ending with ~):
>>>>>
>>>>> # lsof | grep 'hosted-engine.conf~'
>>>>> ovirt-ha- 193446 vdsm 351u REG
>>>>> 253,0 1021 135070683
>>>>> /etc/ovirt-hosted-engine/hosted-engine.conf~
>>>>
>>>> This is not that much relevant if the file was renamed after
>>>> ovirt-ha-agent opened it.
>>>> Try this:
>>>>
>>>> [root at c72he20160405h1 ovirt-hosted-engine-setup]# tail -n1 -f
>>>> /etc/ovirt-hosted-engine/hosted-engine.conf &
>>>> [1] 28866
>>>> [root at c72he20160405h1 ovirt-hosted-engine-setup]# port=
>>>>
>>>> [root at c72he20160405h1 ovirt-hosted-engine-setup]# lsof | grep hosted-engine.conf
>>>> tail 28866 root 3r REG
>>>> 253,0 1014 1595898 /etc/ovirt-hosted-engine/hosted-engine.conf
>>>> [root at c72he20160405h1 ovirt-hosted-engine-setup]# mv
>>>> /etc/ovirt-hosted-engine/hosted-engine.conf
>>>> /etc/ovirt-hosted-engine/hosted-engine.conf_123
>>>> [root at c72he20160405h1 ovirt-hosted-engine-setup]# lsof | grep hosted-engine.conf
>>>> tail 28866 root 3r REG
>>>> 253,0 1014 1595898
>>>> /etc/ovirt-hosted-engine/hosted-engine.conf_123
>>>> [root at c72he20160405h1 ovirt-hosted-engine-setup]#
>>>>
>>>
>>> I've issued the commands you suggested but I don't know how that
>>> helps to find the process accessing the config files.
>>>
>>> After moving the hosted-engine.conf file the HA agent crashed
>>> logging the information that the config file is not available.
>>>
>>> Here is the output from every command:
>>>
>>> # tail -n1 -f /etc/ovirt-hosted-engine/hosted-engine.conf &
>>> [1] 167865
>>> [root at cube-two ~]# port=
>>> # lsof | grep hosted-engine.conf
>>> ovirt-ha- 166609 vdsm 5u REG
>>> 253,0 1021 134433491
>>> /etc/ovirt-hosted-engine/hosted-engine.conf (deleted)
>>> ovirt-ha- 166609 vdsm 7u REG
>>> 253,0 1021 134433453
>>> /etc/ovirt-hosted-engine/hosted-engine.conf (deleted)
>>> ovirt-ha- 166609 vdsm 8u REG
>>> 253,0 1021 134433489
>>> /etc/ovirt-hosted-engine/hosted-engine.conf (deleted)
>>> ovirt-ha- 166609 vdsm 9u REG
>>> 253,0 1021 134433493
>>> /etc/ovirt-hosted-engine/hosted-engine.conf~
>>> ovirt-ha- 166609 vdsm 10u REG
>>> 253,0 1021 134433495
>>> /etc/ovirt-hosted-engine/hosted-engine.conf
>>> tail 167865 root 3r REG
>>> 253,0 1021 134433493
>>> /etc/ovirt-hosted-engine/hosted-engine.conf~
>>> # mv /etc/ovirt-hosted-engine/hosted-engine.conf
>>> /etc/ovirt-hosted-engine/hosted-engine.conf_123
>>> # lsof | grep hosted-engine.conf
>>> ovirt-ha- 166609 vdsm 5u REG
>>> 253,0 1021 134433491
>>> /etc/ovirt-hosted-engine/hosted-engine.conf (deleted)
>>> ovirt-ha- 166609 vdsm 7u REG
>>> 253,0 1021 134433453
>>> /etc/ovirt-hosted-engine/hosted-engine.conf (deleted)
>>> ovirt-ha- 166609 vdsm 8u REG
>>> 253,0 1021 134433489
>>> /etc/ovirt-hosted-engine/hosted-engine.conf (deleted)
>>> ovirt-ha- 166609 vdsm 9u REG
>>> 253,0 1021 134433493
>>> /etc/ovirt-hosted-engine/hosted-engine.conf (deleted)
>>> ovirt-ha- 166609 vdsm 10u REG
>>> 253,0 1021 134433495
>>> /etc/ovirt-hosted-engine/hosted-engine.conf (deleted)
>>> ovirt-ha- 166609 vdsm 12u REG
>>> 253,0 1021 134433498
>>> /etc/ovirt-hosted-engine/hosted-engine.conf~
>>> ovirt-ha- 166609 vdsm 13u REG
>>> 253,0 1021 134433499
>>> /etc/ovirt-hosted-engine/hosted-engine.conf_123
>>> tail 167865 root 3r REG
>>> 253,0 1021 134433493
>>> /etc/ovirt-hosted-engine/hosted-engine.conf (deleted)
>>>
>>>
>>>> The issue is understanding who renames that file on your host.
>>>
>>> From what I've seen so far it looks like a child of vdsm accesses
>>> /etc/ovirt-hosted-engine/hosted-engine.conf periodically but is not
>>> responsible for the ~ file.
>>>
>>> # auditctl -w /etc/ovirt-hosted-engine/hosted-engine.conf
>>> and
>>> # auditctl -w /etc/ovirt-hosted-engine/hosted-engine.conf~
>>>
>>> auditd.log shows this:
>>>
>>> type=SYSCALL msg=audit(1460639783.613:482590): arch=c000003e
>>> syscall=2 success=yes exit=75 a0=7f29b400f0b0 a1=0 a2=1b6 a3=24
>>> items=1 ppid=1 pid=3701 auid=4294967295 uid=36 gid=36 euid=36
>>> suid=36 fsuid=36 egid=36 sgid=36 fsgid=36 tty=(none) ses=4294967295
>>> comm="jsonrpc.Executo" exe="/usr/bin/python2.7"
>>> subj=system_u:system_r:virtd_t:s0-s0:c0.c1023 key=(null)
>>> type=CWD msg=audit(1460639783.613:482590): cwd="/"
>>> type=PATH msg=audit(1460639783.613:482590): item=0
>>> name="/etc/ovirt-hosted-engine/hosted-engine.conf" inode=134433499
>>> dev=fd:00 mode=0100644 ouid=0 ogid=0 rdev=00:00
>>> obj=system_u:object_r:etc_t:s0 objtype=NORMAL
>>>
>>>
>>> Now that the HA agent is dead I'm removing the ~ file and starting
>>> the HA agent again. The ~ file immediately appears again.
>>>
>>> # rm hosted-engine.conf~
>>> rm: remove regular file ‘hosted-engine.conf~’? y
>>> [root at cube-two ovirt-hosted-engine]# ls -l
>>> total 6800
>>> -rw-r--r--. 1 root root 3252 Apr 8 10:35 answers.conf
>>> -rw-r--r--. 1 root root 6948582 Apr 14 14:48 ha-trace.log
>>> -rw-r--r--. 1 root root 1021 Apr 14 15:07 hosted-engine.conf
>>> -rw-r--r--. 1 root root 413 Apr 8 10:35 iptables.example
>>> [root at cube-two ovirt-hosted-engine]# systemctl start ovirt-ha-agent
>>> [root at cube-two ovirt-hosted-engine]# ls -l
>>> total 6804
>>> -rw-r--r--. 1 root root 3252 Apr 8 10:35 answers.conf
>>> -rw-r--r--. 1 root root 6948582 Apr 14 14:48 ha-trace.log
>>> -rw-r--r--. 1 root root 1021 Apr 14 15:18 hosted-engine.conf
>>> -rw-r--r--. 1 root root 1021 Apr 14 15:07 hosted-engine.conf~
>>> -rw-r--r--. 1 root root 413 Apr 8 10:35 iptables.example
>>>
>>> The auditd.log shows that ~ file is moved into place but not what
>>> issued the mv:
>>>
>>> type=CONFIG_CHANGE msg=audit(1460639919.277:482750): auid=4294967295
>>> ses=4294967295 op="updated_rules"
>>> path="/etc/ovirt-hosted-engine/hosted-engine.conf~" key=(null)
>>> list=4 res=1
>>> type=SYSCALL msg=audit(1460639919.277:482751): arch=c000003e
>>> syscall=82 success=yes exit=0 a0=7ffe4b3c0e90 a1=7ffe4b3bf920
>>> a2=7f68083a2778 a3=7ffe4b3bf680 items=5 ppid=170233 pid=170234
>>> auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 eg
>>> id=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="mv"
>>> exe="/usr/bin/mv" subj=system_u:system_r:unconfined_service_t:s0
>>> key=(null)
>>> type=CWD msg=audit(1460639919.277:482751): cwd="/"
>>> type=PATH msg=audit(1460639919.277:482751): item=0
>>> name="/etc/ovirt-hosted-engine/" inode=69555 dev=fd:00 mode=040755
>>> ouid=0 ogid=0 rdev=00:00 obj=system_u:object_r:etc_t:s0 objtype=PARENT
>>> type=PATH msg=audit(1460639919.277:482751): item=1
>>> name="/etc/ovirt-hosted-engine/" inode=69555 dev=fd:00 mode=040755
>>> ouid=0 ogid=0 rdev=00:00 obj=system_u:object_r:etc_t:s0 objtype=PARENT
>>> type=PATH msg=audit(1460639919.277:482751): item=2
>>> name="/etc/ovirt-hosted-engine/hosted-engine.conf" inode=134433453
>>> dev=fd:00 mode=0100644 ouid=0 ogid=0 rdev=00:00
>>> obj=system_u:object_r:etc_t:s0 objtype=DELETE
>>> type=PATH msg=audit(1460639919.277:482751): item=3
>>> name="/etc/ovirt-hosted-engine/hosted-engine.conf~" inode=134433499
>>> dev=fd:00 mode=0100644 ouid=0 ogid=0 rdev=00:00
>>> obj=system_u:object_r:etc_t:s0 objtype=DELETE
>>> type=PATH msg=audit(1460639919.277:482751): item=4
>>> name="/etc/ovirt-hosted-engine/hosted-engine.conf~" inode=134433453
>>> dev=fd:00 mode=0100644 ouid=0 ogid=0 rdev=00:00
>>> obj=system_u:object_r:etc_t:s0 objtype=CREATE
>>>
>>>
>>>> As a thumb rule, if a file name is appended with a tilde~, it only
>>>> means that it is a backup created by a text editor or similar program.
>>>
>>> If anyone except myself would have access to these systems I would
>>> guess the same. But since I'm not editing anything in
>>> /etc/ovirt-hosted-engine there must be another reason. And there is.
>>>
>>> Aside from auditd I tried to strace the whole thing just to make
>>> sure it comes from the HA agent.
>>>
>>> [root at cube-two ~]# strace -o ha-trace.log -f
>>> /usr/share/ovirt-hosted-engine-ha/ovirt-ha-agent --no-daemon
>>>
>>> Looking at the trace log I found this:
>>>
>>> 183409 statfs("/etc/ovirt-hosted-engine/.", {f_type=0x58465342,
>>> f_bsize=4096, f_blocks=13100800, f_bfree=12523576,
>>> f_bavail=12523576, f_files=52428800, f_ffree=52379892,
>>> f_fsid={64768, 0}, f_namelen=255, f_frsize=4096}) = 0
>>> 183409 rename("/etc/ovirt-hosted-engine/hosted-engine.conf",
>>> "/etc/ovirt-hosted-engine/hosted-engine.conf~") = 0
>>> 183409 rename("/var/lib/ovirt-hosted-engine-ha/tmpNjTElr",
>>> "/etc/ovirt-hosted-engine/hosted-engine.conf") = 0
>>> 183409 newfstatat(AT_FDCWD,
>>> "/etc/ovirt-hosted-engine/hosted-engine.conf",
>>> {st_mode=S_IFREG|0600, st_size=1021, ...}, AT_SYMLINK_NOFOLLOW) = 0
>>> 183409 open("/etc/ovirt-hosted-engine/hosted-engine.conf",
>>> O_RDONLY|O_NOFOLLOW) = 3
>>>
>>>
>>> Putting it all together I started reading the HA agent sources and
>>> found the function _wrote_updated_conf_file in
>>> /usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/upgrade.py
>>> which issues a mv -b which creates the ~ file.
>>
>> This should just trigger during 3.5 to 3.6 upgrade but your host are new.
>> Can you please attach /var/log/ovirt-hosted-engine-ha/agent.log from
>> one of them?
>
> The agent.log of host cube-two is attached to this mail.
Yes, you are right:
it's looping trying to fix a path in the config file (on 3.5 we didn't
check if an NFS path was ending with a '/' while for other reasons it
wasn't working on 3.6 and so we need to fix it) but its doesn't seams
you case and so the strange loop.
Now I need to understand why it enters there.
Can you please execute
tree /rhev/data-center/
and post me the output?
Thanks again
>>> The question now is why is this done so frequently. Especially
>>> considering since there are no modifications to the file. Is this
>>> behavior normal?
>>>
>>> [root at cube-two ~]# diff /etc/ovirt-hosted-engine/hosted-engine.conf*
>>> [root at cube-two ~]#
>>>
>>>
>>>>>>> [root at cube-two ~]# ls -l /etc/ovirt-hosted-engine
>>>>>>> total 16
>>>>>>> -rw-r--r--. 1 root root 3252 Apr 8 10:35 answers.conf
>>>>>>> -rw-r--r--. 1 root root 1021 Apr 13 09:31 hosted-engine.conf
>>>>>>> -rw-r--r--. 1 root root 1021 Apr 13 09:30 hosted-engine.conf~
>>>>>>>
>>>>>>> [root at cube-three ~]# ls -l /etc/ovirt-hosted-engine
>>>>>>> total 16
>>>>>>> -rw-r--r--. 1 root root 3233 Apr 11 08:02 answers.conf
>>>>>>> -rw-r--r--. 1 root root 1002 Apr 13 09:31 hosted-engine.conf
>>>>>>> -rw-r--r--. 1 root root 1002 Apr 13 09:31 hosted-engine.conf~
>>>>>>>
>>>>>>> On 12.04.16 16:01, Simone Tiraboschi wrote:
>>>>>>>> Everything seams fine here,
>>>>>>>> /etc/ovirt-hosted-engine/hosted-engine.conf seams to be correctly
>>>>>>>> created with the right name.
>>>>>>>> Can you please check the latest modification time of your
>>>>>>>> /etc/ovirt-hosted-engine/hosted-engine.conf~ and compare it with the
>>>>>>>> setup time?
>>>>>>>>
>>>>>>>> On Tue, Apr 12, 2016 at 2:34 PM, Richard Neuboeck <hawk at tbi.univie.ac.at> wrote:
>>>>>>>>> On 04/12/2016 11:32 AM, Simone Tiraboschi wrote:
>>>>>>>>>> On Mon, Apr 11, 2016 at 8:11 AM, Richard Neuboeck <hawk at tbi.univie.ac.at> wrote:
>>>>>>>>>>> Hi oVirt Group,
>>>>>>>>>>>
>>>>>>>>>>> in my attempts to get all aspects of oVirt 3.6 up and running I
>>>>>>>>>>> stumbled upon something I'm not sure how to fix:
>>>>>>>>>>>
>>>>>>>>>>> Initially I installed a hosted engine setup. After that I added
>>>>>>>>>>> another HA host (with hosted-engine --deploy). The host was
>>>>>>>>>>> registered in the Engine correctly and HA agent came up as expected.
>>>>>>>>>>>
>>>>>>>>>>> However if I reboot the second host (through the Engine UI or
>>>>>>>>>>> manually) HA agent fails to start. The reason seems to be that
>>>>>>>>>>> /etc/ovirt-hosted-engine/hosted-engine.conf is empty. The backup
>>>>>>>>>>> file ending with ~ exists though.
>>>>>>>>>>
>>>>>>>>>> Can you please attach hosted-engine-setup logs from your additional hosts?
>>>>>>>>>> AFAIK our code will never take a ~ ending backup of that file.
>>>>>>>>>
>>>>>>>>> ovirt-hosted-engine-setup logs from both additional hosts are
>>>>>>>>> attached to this mail.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> Here are the log messages from the journal:
>>>>>>>>>>> Apr 11 07:29:39 cube-two.tbi.univie.ac.at systemd[1]: Starting oVirt
>>>>>>>>>>> Hosted Engine High Availability Monitoring Agent...
>>>>>>>>>>> Apr 11 07:29:39 cube-two.tbi.univie.ac.at ovirt-ha-agent[3747]:
>>>>>>>>>>> INFO:ovirt_hosted_engine_ha.agent.agent.Agent:ovirt-hosted-engine-ha
>>>>>>>>>>> agent 1.3.5.3-0.0.master started
>>>>>>>>>>> Apr 11 07:29:39 cube-two.tbi.univie.ac.at ovirt-ha-agent[3747]:
>>>>>>>>>>> INFO:ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine:Found
>>>>>>>>>>> certificate common name: cube-two.tbi.univie.ac.at
>>>>>>>>>>> Apr 11 07:29:39 cube-two.tbi.univie.ac.at ovirt-ha-agent[3747]:
>>>>>>>>>>> ovirt-ha-agent
>>>>>>>>>>> ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine ERROR Hosted
>>>>>>>>>>> Engine is not configured. Shutting down.
>>>>>>>>>>> Apr 11 07:29:39 cube-two.tbi.univie.ac.at ovirt-ha-agent[3747]:
>>>>>>>>>>> ERROR:ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine:Hosted
>>>>>>>>>>> Engine is not configured. Shutting down.
>>>>>>>>>>> Apr 11 07:29:39 cube-two.tbi.univie.ac.at ovirt-ha-agent[3747]:
>>>>>>>>>>> INFO:ovirt_hosted_engine_ha.agent.agent.Agent:Agent shutting down
>>>>>>>>>>> Apr 11 07:29:39 cube-two.tbi.univie.ac.at systemd[1]:
>>>>>>>>>>> ovirt-ha-agent.service: main process exited, code=exited, status=255/n/a
>>>>>>>>>>>
>>>>>>>>>>> If I restore the configuration from the backup file and manually
>>>>>>>>>>> restart the HA agent it's working properly.
>>>>>>>>>>>
>>>>>>>>>>> For testing purposes I added a third HA host which turn out to
>>>>>>>>>>> behave exactly the same.
>>>>>>>>>>>
>>>>>>>>>>> Any help would be appreciated!
>>>>>>>>>>> Thanks
>>>>>>>>>>> Cheers
>>>>>>>>>>> Richard
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> /dev/null
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> Users mailing list
>>>>>>>>>>> Users at ovirt.org
>>>>>>>>>>> http://lists.ovirt.org/mailman/listinfo/users
>>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> /dev/null
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Users mailing list
>>>>>>> Users at ovirt.org
>>>>>>> http://lists.ovirt.org/mailman/listinfo/users
>>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> /dev/null
>>>>>
>>>
>>>
>>> --
>>> /dev/null
>>>
>
More information about the Users
mailing list