[ovirt-users] HA agent fails to start

Simone Tiraboschi stirabos at redhat.com
Thu Apr 14 16:46:44 UTC 2016


On Thu, Apr 14, 2016 at 4:04 PM, Richard Neuboeck <hawk at tbi.univie.ac.at> wrote:
> On 04/14/2016 02:14 PM, Simone Tiraboschi wrote:
>> On Thu, Apr 14, 2016 at 12:51 PM, Richard Neuboeck
>> <hawk at tbi.univie.ac.at> wrote:
>>> On 04/13/2016 10:00 AM, Simone Tiraboschi wrote:
>>>> On Wed, Apr 13, 2016 at 9:38 AM, Richard Neuboeck <hawk at tbi.univie.ac.at> wrote:
>>>>> The answers file shows the setup time of both machines.
>>>>>
>>>>> On both machines hosted-engine.conf got rotated right before I wrote
>>>>> this mail. Is it possible that I managed to interrupt the rotation with
>>>>> the reboot so the backup was accurate but the update not yet written to
>>>>> hosted-engine.conf?
>>>>
>>>> AFAIK we don't have any rotation mechanism for that file; something
>>>> else you have in place on that host?
>>>
>>> Those machines are all CentOS 7.2 minimal installs. The only
>>> adaptation I do is installing vim, removing postfix and installing
>>> exim, removing firewalld and installing iptables-service. Then I add
>>> the oVirt repos (3.6 and 3.6-snapshot) and deploy the host.
>>>
>>> But checking lsof shows that 'ovirt-ha-agent --no-daemon' has access
>>> to the config file (and the one ending with ~):
>>>
>>> # lsof | grep 'hosted-engine.conf~'
>>> ovirt-ha- 193446                   vdsm  351u      REG
>>> 253,0        1021            135070683
>>> /etc/ovirt-hosted-engine/hosted-engine.conf~
>>
>> This is not that much relevant if the file was renamed after
>> ovirt-ha-agent opened it.
>> Try this:
>>
>> [root at c72he20160405h1 ovirt-hosted-engine-setup]# tail -n1 -f
>> /etc/ovirt-hosted-engine/hosted-engine.conf &
>> [1] 28866
>> [root at c72he20160405h1 ovirt-hosted-engine-setup]# port=
>>
>> [root at c72he20160405h1 ovirt-hosted-engine-setup]# lsof | grep hosted-engine.conf
>> tail      28866                  root    3r      REG
>> 253,0      1014    1595898 /etc/ovirt-hosted-engine/hosted-engine.conf
>> [root at c72he20160405h1 ovirt-hosted-engine-setup]# mv
>> /etc/ovirt-hosted-engine/hosted-engine.conf
>> /etc/ovirt-hosted-engine/hosted-engine.conf_123
>> [root at c72he20160405h1 ovirt-hosted-engine-setup]# lsof | grep hosted-engine.conf
>> tail      28866                  root    3r      REG
>> 253,0      1014    1595898
>> /etc/ovirt-hosted-engine/hosted-engine.conf_123
>> [root at c72he20160405h1 ovirt-hosted-engine-setup]#
>>
>
> I've issued the commands you suggested but I don't know how that
> helps to find the process accessing the config files.
>
> After moving the hosted-engine.conf file the HA agent crashed
> logging the information that the config file is not available.
>
> Here is the output from every command:
>
> # tail -n1 -f /etc/ovirt-hosted-engine/hosted-engine.conf &
> [1] 167865
> [root at cube-two ~]# port=
> # lsof | grep hosted-engine.conf
> ovirt-ha- 166609                   vdsm    5u      REG
> 253,0        1021            134433491
> /etc/ovirt-hosted-engine/hosted-engine.conf (deleted)
> ovirt-ha- 166609                   vdsm    7u      REG
> 253,0        1021            134433453
> /etc/ovirt-hosted-engine/hosted-engine.conf (deleted)
> ovirt-ha- 166609                   vdsm    8u      REG
> 253,0        1021            134433489
> /etc/ovirt-hosted-engine/hosted-engine.conf (deleted)
> ovirt-ha- 166609                   vdsm    9u      REG
> 253,0        1021            134433493
> /etc/ovirt-hosted-engine/hosted-engine.conf~
> ovirt-ha- 166609                   vdsm   10u      REG
> 253,0        1021            134433495
> /etc/ovirt-hosted-engine/hosted-engine.conf
> tail      167865                   root    3r      REG
> 253,0        1021            134433493
> /etc/ovirt-hosted-engine/hosted-engine.conf~
> # mv /etc/ovirt-hosted-engine/hosted-engine.conf
> /etc/ovirt-hosted-engine/hosted-engine.conf_123
> # lsof | grep hosted-engine.conf
> ovirt-ha- 166609                   vdsm    5u      REG
> 253,0        1021            134433491
> /etc/ovirt-hosted-engine/hosted-engine.conf (deleted)
> ovirt-ha- 166609                   vdsm    7u      REG
> 253,0        1021            134433453
> /etc/ovirt-hosted-engine/hosted-engine.conf (deleted)
> ovirt-ha- 166609                   vdsm    8u      REG
> 253,0        1021            134433489
> /etc/ovirt-hosted-engine/hosted-engine.conf (deleted)
> ovirt-ha- 166609                   vdsm    9u      REG
> 253,0        1021            134433493
> /etc/ovirt-hosted-engine/hosted-engine.conf (deleted)
> ovirt-ha- 166609                   vdsm   10u      REG
> 253,0        1021            134433495
> /etc/ovirt-hosted-engine/hosted-engine.conf (deleted)
> ovirt-ha- 166609                   vdsm   12u      REG
> 253,0        1021            134433498
> /etc/ovirt-hosted-engine/hosted-engine.conf~
> ovirt-ha- 166609                   vdsm   13u      REG
> 253,0        1021            134433499
> /etc/ovirt-hosted-engine/hosted-engine.conf_123
> tail      167865                   root    3r      REG
> 253,0        1021            134433493
> /etc/ovirt-hosted-engine/hosted-engine.conf (deleted)
>
>
>> The issue is understanding who renames that file on your host.
>
> From what I've seen so far it looks like a child of vdsm accesses
> /etc/ovirt-hosted-engine/hosted-engine.conf periodically but is not
> responsible for the ~ file.
>
> # auditctl -w /etc/ovirt-hosted-engine/hosted-engine.conf
> and
> # auditctl -w /etc/ovirt-hosted-engine/hosted-engine.conf~
>
> auditd.log shows this:
>
> type=SYSCALL msg=audit(1460639783.613:482590): arch=c000003e
> syscall=2 success=yes exit=75 a0=7f29b400f0b0 a1=0 a2=1b6 a3=24
> items=1 ppid=1 pid=3701 auid=4294967295 uid=36 gid=36 euid=36
> suid=36 fsuid=36 egid=36 sgid=36 fsgid=36 tty=(none) ses=4294967295
> comm="jsonrpc.Executo" exe="/usr/bin/python2.7"
> subj=system_u:system_r:virtd_t:s0-s0:c0.c1023 key=(null)
> type=CWD msg=audit(1460639783.613:482590):  cwd="/"
> type=PATH msg=audit(1460639783.613:482590): item=0
> name="/etc/ovirt-hosted-engine/hosted-engine.conf" inode=134433499
> dev=fd:00 mode=0100644 ouid=0 ogid=0 rdev=00:00
> obj=system_u:object_r:etc_t:s0 objtype=NORMAL
>
>
> Now that the HA agent is dead I'm removing the ~ file and starting
> the HA agent again. The ~ file immediately appears again.
>
> # rm hosted-engine.conf~
> rm: remove regular file ‘hosted-engine.conf~’? y
> [root at cube-two ovirt-hosted-engine]# ls -l
> total 6800
> -rw-r--r--. 1 root root    3252 Apr  8 10:35 answers.conf
> -rw-r--r--. 1 root root 6948582 Apr 14 14:48 ha-trace.log
> -rw-r--r--. 1 root root    1021 Apr 14 15:07 hosted-engine.conf
> -rw-r--r--. 1 root root     413 Apr  8 10:35 iptables.example
> [root at cube-two ovirt-hosted-engine]# systemctl start ovirt-ha-agent
> [root at cube-two ovirt-hosted-engine]# ls -l
> total 6804
> -rw-r--r--. 1 root root    3252 Apr  8 10:35 answers.conf
> -rw-r--r--. 1 root root 6948582 Apr 14 14:48 ha-trace.log
> -rw-r--r--. 1 root root    1021 Apr 14 15:18 hosted-engine.conf
> -rw-r--r--. 1 root root    1021 Apr 14 15:07 hosted-engine.conf~
> -rw-r--r--. 1 root root     413 Apr  8 10:35 iptables.example
>
> The auditd.log shows that ~ file is moved into place but not what
> issued the mv:
>
> type=CONFIG_CHANGE msg=audit(1460639919.277:482750): auid=4294967295
> ses=4294967295 op="updated_rules"
> path="/etc/ovirt-hosted-engine/hosted-engine.conf~" key=(null)
> list=4 res=1
> type=SYSCALL msg=audit(1460639919.277:482751): arch=c000003e
> syscall=82 success=yes exit=0 a0=7ffe4b3c0e90 a1=7ffe4b3bf920
> a2=7f68083a2778 a3=7ffe4b3bf680 items=5 ppid=170233 pid=170234
> auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 eg
> id=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="mv"
> exe="/usr/bin/mv" subj=system_u:system_r:unconfined_service_t:s0
> key=(null)
> type=CWD msg=audit(1460639919.277:482751):  cwd="/"
> type=PATH msg=audit(1460639919.277:482751): item=0
> name="/etc/ovirt-hosted-engine/" inode=69555 dev=fd:00 mode=040755
> ouid=0 ogid=0 rdev=00:00 obj=system_u:object_r:etc_t:s0 objtype=PARENT
> type=PATH msg=audit(1460639919.277:482751): item=1
> name="/etc/ovirt-hosted-engine/" inode=69555 dev=fd:00 mode=040755
> ouid=0 ogid=0 rdev=00:00 obj=system_u:object_r:etc_t:s0 objtype=PARENT
> type=PATH msg=audit(1460639919.277:482751): item=2
> name="/etc/ovirt-hosted-engine/hosted-engine.conf" inode=134433453
> dev=fd:00 mode=0100644 ouid=0 ogid=0 rdev=00:00
> obj=system_u:object_r:etc_t:s0 objtype=DELETE
> type=PATH msg=audit(1460639919.277:482751): item=3
> name="/etc/ovirt-hosted-engine/hosted-engine.conf~" inode=134433499
> dev=fd:00 mode=0100644 ouid=0 ogid=0 rdev=00:00
> obj=system_u:object_r:etc_t:s0 objtype=DELETE
> type=PATH msg=audit(1460639919.277:482751): item=4
> name="/etc/ovirt-hosted-engine/hosted-engine.conf~" inode=134433453
> dev=fd:00 mode=0100644 ouid=0 ogid=0 rdev=00:00
> obj=system_u:object_r:etc_t:s0 objtype=CREATE
>
>
>> As a thumb rule, if a file name is appended with a tilde~, it only
>> means that it is a backup created by a text editor or similar program.
>
> If anyone except myself would have access to these systems I would
> guess the same. But since I'm not editing anything in
> /etc/ovirt-hosted-engine there must be another reason. And there is.
>
> Aside from auditd I tried to strace the whole thing just to make
> sure it comes from the HA agent.
>
> [root at cube-two ~]# strace -o ha-trace.log -f
> /usr/share/ovirt-hosted-engine-ha/ovirt-ha-agent --no-daemon
>
> Looking at the trace log I found this:
>
> 183409 statfs("/etc/ovirt-hosted-engine/.", {f_type=0x58465342,
> f_bsize=4096, f_blocks=13100800, f_bfree=12523576,
> f_bavail=12523576, f_files=52428800, f_ffree=52379892,
> f_fsid={64768, 0}, f_namelen=255, f_frsize=4096}) = 0
> 183409 rename("/etc/ovirt-hosted-engine/hosted-engine.conf",
> "/etc/ovirt-hosted-engine/hosted-engine.conf~") = 0
> 183409 rename("/var/lib/ovirt-hosted-engine-ha/tmpNjTElr",
> "/etc/ovirt-hosted-engine/hosted-engine.conf") = 0
> 183409 newfstatat(AT_FDCWD,
> "/etc/ovirt-hosted-engine/hosted-engine.conf",
> {st_mode=S_IFREG|0600, st_size=1021, ...}, AT_SYMLINK_NOFOLLOW) = 0
> 183409 open("/etc/ovirt-hosted-engine/hosted-engine.conf",
> O_RDONLY|O_NOFOLLOW) = 3
>
>
> Putting it all together I started reading the HA agent sources and
> found the function _wrote_updated_conf_file in
> /usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/upgrade.py
> which issues a mv -b which creates the ~ file.

This should just trigger during 3.5 to 3.6 upgrade but your host are new.
Can you please attach /var/log/ovirt-hosted-engine-ha/agent.log from
one of them?

> The question now is why is this done so frequently. Especially
> considering since there are no modifications to the file. Is this
> behavior normal?
>
> [root at cube-two ~]# diff /etc/ovirt-hosted-engine/hosted-engine.conf*
> [root at cube-two ~]#
>
>
>>>>> [root at cube-two ~]# ls -l /etc/ovirt-hosted-engine
>>>>> total 16
>>>>> -rw-r--r--. 1 root root 3252 Apr  8 10:35 answers.conf
>>>>> -rw-r--r--. 1 root root 1021 Apr 13 09:31 hosted-engine.conf
>>>>> -rw-r--r--. 1 root root 1021 Apr 13 09:30 hosted-engine.conf~
>>>>>
>>>>> [root at cube-three ~]# ls -l /etc/ovirt-hosted-engine
>>>>> total 16
>>>>> -rw-r--r--. 1 root root 3233 Apr 11 08:02 answers.conf
>>>>> -rw-r--r--. 1 root root 1002 Apr 13 09:31 hosted-engine.conf
>>>>> -rw-r--r--. 1 root root 1002 Apr 13 09:31 hosted-engine.conf~
>>>>>
>>>>> On 12.04.16 16:01, Simone Tiraboschi wrote:
>>>>>> Everything seams fine here,
>>>>>> /etc/ovirt-hosted-engine/hosted-engine.conf seams to be correctly
>>>>>> created with the right name.
>>>>>> Can you please check the latest modification time of your
>>>>>> /etc/ovirt-hosted-engine/hosted-engine.conf~ and compare it with the
>>>>>> setup time?
>>>>>>
>>>>>> On Tue, Apr 12, 2016 at 2:34 PM, Richard Neuboeck <hawk at tbi.univie.ac.at> wrote:
>>>>>>> On 04/12/2016 11:32 AM, Simone Tiraboschi wrote:
>>>>>>>> On Mon, Apr 11, 2016 at 8:11 AM, Richard Neuboeck <hawk at tbi.univie.ac.at> wrote:
>>>>>>>>> Hi oVirt Group,
>>>>>>>>>
>>>>>>>>> in my attempts to get all aspects of oVirt 3.6 up and running I
>>>>>>>>> stumbled upon something I'm not sure how to fix:
>>>>>>>>>
>>>>>>>>> Initially I installed a hosted engine setup. After that I added
>>>>>>>>> another HA host (with hosted-engine --deploy). The host was
>>>>>>>>> registered in the Engine correctly and HA agent came up as expected.
>>>>>>>>>
>>>>>>>>> However if I reboot the second host (through the Engine UI or
>>>>>>>>> manually) HA agent fails to start. The reason seems to be that
>>>>>>>>> /etc/ovirt-hosted-engine/hosted-engine.conf is empty. The backup
>>>>>>>>> file ending with ~ exists though.
>>>>>>>>
>>>>>>>> Can you please attach hosted-engine-setup logs from your additional hosts?
>>>>>>>> AFAIK our code will never take a ~ ending backup of that file.
>>>>>>>
>>>>>>> ovirt-hosted-engine-setup logs from both additional hosts are
>>>>>>> attached to this mail.
>>>>>>>
>>>>>>>>
>>>>>>>>> Here are the log messages from the journal:
>>>>>>>>> Apr 11 07:29:39 cube-two.tbi.univie.ac.at systemd[1]: Starting oVirt
>>>>>>>>> Hosted Engine High Availability Monitoring Agent...
>>>>>>>>> Apr 11 07:29:39 cube-two.tbi.univie.ac.at ovirt-ha-agent[3747]:
>>>>>>>>> INFO:ovirt_hosted_engine_ha.agent.agent.Agent:ovirt-hosted-engine-ha
>>>>>>>>> agent 1.3.5.3-0.0.master started
>>>>>>>>> Apr 11 07:29:39 cube-two.tbi.univie.ac.at ovirt-ha-agent[3747]:
>>>>>>>>> INFO:ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine:Found
>>>>>>>>> certificate common name: cube-two.tbi.univie.ac.at
>>>>>>>>> Apr 11 07:29:39 cube-two.tbi.univie.ac.at ovirt-ha-agent[3747]:
>>>>>>>>> ovirt-ha-agent
>>>>>>>>> ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine ERROR Hosted
>>>>>>>>> Engine is not configured. Shutting down.
>>>>>>>>> Apr 11 07:29:39 cube-two.tbi.univie.ac.at ovirt-ha-agent[3747]:
>>>>>>>>> ERROR:ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine:Hosted
>>>>>>>>> Engine is not configured. Shutting down.
>>>>>>>>> Apr 11 07:29:39 cube-two.tbi.univie.ac.at ovirt-ha-agent[3747]:
>>>>>>>>> INFO:ovirt_hosted_engine_ha.agent.agent.Agent:Agent shutting down
>>>>>>>>> Apr 11 07:29:39 cube-two.tbi.univie.ac.at systemd[1]:
>>>>>>>>> ovirt-ha-agent.service: main process exited, code=exited, status=255/n/a
>>>>>>>>>
>>>>>>>>> If I restore the configuration from the backup file and manually
>>>>>>>>> restart the HA agent it's working properly.
>>>>>>>>>
>>>>>>>>> For testing purposes I added a third HA host which turn out to
>>>>>>>>> behave exactly the same.
>>>>>>>>>
>>>>>>>>> Any help would be appreciated!
>>>>>>>>> Thanks
>>>>>>>>> Cheers
>>>>>>>>> Richard
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> /dev/null
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> Users mailing list
>>>>>>>>> Users at ovirt.org
>>>>>>>>> http://lists.ovirt.org/mailman/listinfo/users
>>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> /dev/null
>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Users mailing list
>>>>> Users at ovirt.org
>>>>> http://lists.ovirt.org/mailman/listinfo/users
>>>>>
>>>
>>>
>>> --
>>> /dev/null
>>>
>
>
> --
> /dev/null
>



More information about the Users mailing list