[ovirt-users] HA agent fails to start

Simone Tiraboschi stirabos at redhat.com
Thu Apr 14 21:03:06 UTC 2016


On Thu, Apr 14, 2016 at 10:38 PM, Simone Tiraboschi <stirabos at redhat.com> wrote:
> On Thu, Apr 14, 2016 at 6:53 PM, Richard Neuboeck <hawk at tbi.univie.ac.at> wrote:
>> On 14.04.16 18:46, Simone Tiraboschi wrote:
>>> On Thu, Apr 14, 2016 at 4:04 PM, Richard Neuboeck <hawk at tbi.univie.ac.at> wrote:
>>>> On 04/14/2016 02:14 PM, Simone Tiraboschi wrote:
>>>>> On Thu, Apr 14, 2016 at 12:51 PM, Richard Neuboeck
>>>>> <hawk at tbi.univie.ac.at> wrote:
>>>>>> On 04/13/2016 10:00 AM, Simone Tiraboschi wrote:
>>>>>>> On Wed, Apr 13, 2016 at 9:38 AM, Richard Neuboeck <hawk at tbi.univie.ac.at> wrote:
>>>>>>>> The answers file shows the setup time of both machines.
>>>>>>>>
>>>>>>>> On both machines hosted-engine.conf got rotated right before I wrote
>>>>>>>> this mail. Is it possible that I managed to interrupt the rotation with
>>>>>>>> the reboot so the backup was accurate but the update not yet written to
>>>>>>>> hosted-engine.conf?
>>>>>>>
>>>>>>> AFAIK we don't have any rotation mechanism for that file; something
>>>>>>> else you have in place on that host?
>>>>>>
>>>>>> Those machines are all CentOS 7.2 minimal installs. The only
>>>>>> adaptation I do is installing vim, removing postfix and installing
>>>>>> exim, removing firewalld and installing iptables-service. Then I add
>>>>>> the oVirt repos (3.6 and 3.6-snapshot) and deploy the host.
>>>>>>
>>>>>> But checking lsof shows that 'ovirt-ha-agent --no-daemon' has access
>>>>>> to the config file (and the one ending with ~):
>>>>>>
>>>>>> # lsof | grep 'hosted-engine.conf~'
>>>>>> ovirt-ha- 193446                   vdsm  351u      REG
>>>>>> 253,0        1021            135070683
>>>>>> /etc/ovirt-hosted-engine/hosted-engine.conf~
>>>>>
>>>>> This is not that much relevant if the file was renamed after
>>>>> ovirt-ha-agent opened it.
>>>>> Try this:
>>>>>
>>>>> [root at c72he20160405h1 ovirt-hosted-engine-setup]# tail -n1 -f
>>>>> /etc/ovirt-hosted-engine/hosted-engine.conf &
>>>>> [1] 28866
>>>>> [root at c72he20160405h1 ovirt-hosted-engine-setup]# port=
>>>>>
>>>>> [root at c72he20160405h1 ovirt-hosted-engine-setup]# lsof | grep hosted-engine.conf
>>>>> tail      28866                  root    3r      REG
>>>>> 253,0      1014    1595898 /etc/ovirt-hosted-engine/hosted-engine.conf
>>>>> [root at c72he20160405h1 ovirt-hosted-engine-setup]# mv
>>>>> /etc/ovirt-hosted-engine/hosted-engine.conf
>>>>> /etc/ovirt-hosted-engine/hosted-engine.conf_123
>>>>> [root at c72he20160405h1 ovirt-hosted-engine-setup]# lsof | grep hosted-engine.conf
>>>>> tail      28866                  root    3r      REG
>>>>> 253,0      1014    1595898
>>>>> /etc/ovirt-hosted-engine/hosted-engine.conf_123
>>>>> [root at c72he20160405h1 ovirt-hosted-engine-setup]#
>>>>>
>>>>
>>>> I've issued the commands you suggested but I don't know how that
>>>> helps to find the process accessing the config files.
>>>>
>>>> After moving the hosted-engine.conf file the HA agent crashed
>>>> logging the information that the config file is not available.
>>>>
>>>> Here is the output from every command:
>>>>
>>>> # tail -n1 -f /etc/ovirt-hosted-engine/hosted-engine.conf &
>>>> [1] 167865
>>>> [root at cube-two ~]# port=
>>>> # lsof | grep hosted-engine.conf
>>>> ovirt-ha- 166609                   vdsm    5u      REG
>>>> 253,0        1021            134433491
>>>> /etc/ovirt-hosted-engine/hosted-engine.conf (deleted)
>>>> ovirt-ha- 166609                   vdsm    7u      REG
>>>> 253,0        1021            134433453
>>>> /etc/ovirt-hosted-engine/hosted-engine.conf (deleted)
>>>> ovirt-ha- 166609                   vdsm    8u      REG
>>>> 253,0        1021            134433489
>>>> /etc/ovirt-hosted-engine/hosted-engine.conf (deleted)
>>>> ovirt-ha- 166609                   vdsm    9u      REG
>>>> 253,0        1021            134433493
>>>> /etc/ovirt-hosted-engine/hosted-engine.conf~
>>>> ovirt-ha- 166609                   vdsm   10u      REG
>>>> 253,0        1021            134433495
>>>> /etc/ovirt-hosted-engine/hosted-engine.conf
>>>> tail      167865                   root    3r      REG
>>>> 253,0        1021            134433493
>>>> /etc/ovirt-hosted-engine/hosted-engine.conf~
>>>> # mv /etc/ovirt-hosted-engine/hosted-engine.conf
>>>> /etc/ovirt-hosted-engine/hosted-engine.conf_123
>>>> # lsof | grep hosted-engine.conf
>>>> ovirt-ha- 166609                   vdsm    5u      REG
>>>> 253,0        1021            134433491
>>>> /etc/ovirt-hosted-engine/hosted-engine.conf (deleted)
>>>> ovirt-ha- 166609                   vdsm    7u      REG
>>>> 253,0        1021            134433453
>>>> /etc/ovirt-hosted-engine/hosted-engine.conf (deleted)
>>>> ovirt-ha- 166609                   vdsm    8u      REG
>>>> 253,0        1021            134433489
>>>> /etc/ovirt-hosted-engine/hosted-engine.conf (deleted)
>>>> ovirt-ha- 166609                   vdsm    9u      REG
>>>> 253,0        1021            134433493
>>>> /etc/ovirt-hosted-engine/hosted-engine.conf (deleted)
>>>> ovirt-ha- 166609                   vdsm   10u      REG
>>>> 253,0        1021            134433495
>>>> /etc/ovirt-hosted-engine/hosted-engine.conf (deleted)
>>>> ovirt-ha- 166609                   vdsm   12u      REG
>>>> 253,0        1021            134433498
>>>> /etc/ovirt-hosted-engine/hosted-engine.conf~
>>>> ovirt-ha- 166609                   vdsm   13u      REG
>>>> 253,0        1021            134433499
>>>> /etc/ovirt-hosted-engine/hosted-engine.conf_123
>>>> tail      167865                   root    3r      REG
>>>> 253,0        1021            134433493
>>>> /etc/ovirt-hosted-engine/hosted-engine.conf (deleted)
>>>>
>>>>
>>>>> The issue is understanding who renames that file on your host.
>>>>
>>>> From what I've seen so far it looks like a child of vdsm accesses
>>>> /etc/ovirt-hosted-engine/hosted-engine.conf periodically but is not
>>>> responsible for the ~ file.
>>>>
>>>> # auditctl -w /etc/ovirt-hosted-engine/hosted-engine.conf
>>>> and
>>>> # auditctl -w /etc/ovirt-hosted-engine/hosted-engine.conf~
>>>>
>>>> auditd.log shows this:
>>>>
>>>> type=SYSCALL msg=audit(1460639783.613:482590): arch=c000003e
>>>> syscall=2 success=yes exit=75 a0=7f29b400f0b0 a1=0 a2=1b6 a3=24
>>>> items=1 ppid=1 pid=3701 auid=4294967295 uid=36 gid=36 euid=36
>>>> suid=36 fsuid=36 egid=36 sgid=36 fsgid=36 tty=(none) ses=4294967295
>>>> comm="jsonrpc.Executo" exe="/usr/bin/python2.7"
>>>> subj=system_u:system_r:virtd_t:s0-s0:c0.c1023 key=(null)
>>>> type=CWD msg=audit(1460639783.613:482590):  cwd="/"
>>>> type=PATH msg=audit(1460639783.613:482590): item=0
>>>> name="/etc/ovirt-hosted-engine/hosted-engine.conf" inode=134433499
>>>> dev=fd:00 mode=0100644 ouid=0 ogid=0 rdev=00:00
>>>> obj=system_u:object_r:etc_t:s0 objtype=NORMAL
>>>>
>>>>
>>>> Now that the HA agent is dead I'm removing the ~ file and starting
>>>> the HA agent again. The ~ file immediately appears again.
>>>>
>>>> # rm hosted-engine.conf~
>>>> rm: remove regular file ‘hosted-engine.conf~’? y
>>>> [root at cube-two ovirt-hosted-engine]# ls -l
>>>> total 6800
>>>> -rw-r--r--. 1 root root    3252 Apr  8 10:35 answers.conf
>>>> -rw-r--r--. 1 root root 6948582 Apr 14 14:48 ha-trace.log
>>>> -rw-r--r--. 1 root root    1021 Apr 14 15:07 hosted-engine.conf
>>>> -rw-r--r--. 1 root root     413 Apr  8 10:35 iptables.example
>>>> [root at cube-two ovirt-hosted-engine]# systemctl start ovirt-ha-agent
>>>> [root at cube-two ovirt-hosted-engine]# ls -l
>>>> total 6804
>>>> -rw-r--r--. 1 root root    3252 Apr  8 10:35 answers.conf
>>>> -rw-r--r--. 1 root root 6948582 Apr 14 14:48 ha-trace.log
>>>> -rw-r--r--. 1 root root    1021 Apr 14 15:18 hosted-engine.conf
>>>> -rw-r--r--. 1 root root    1021 Apr 14 15:07 hosted-engine.conf~
>>>> -rw-r--r--. 1 root root     413 Apr  8 10:35 iptables.example
>>>>
>>>> The auditd.log shows that ~ file is moved into place but not what
>>>> issued the mv:
>>>>
>>>> type=CONFIG_CHANGE msg=audit(1460639919.277:482750): auid=4294967295
>>>> ses=4294967295 op="updated_rules"
>>>> path="/etc/ovirt-hosted-engine/hosted-engine.conf~" key=(null)
>>>> list=4 res=1
>>>> type=SYSCALL msg=audit(1460639919.277:482751): arch=c000003e
>>>> syscall=82 success=yes exit=0 a0=7ffe4b3c0e90 a1=7ffe4b3bf920
>>>> a2=7f68083a2778 a3=7ffe4b3bf680 items=5 ppid=170233 pid=170234
>>>> auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 eg
>>>> id=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="mv"
>>>> exe="/usr/bin/mv" subj=system_u:system_r:unconfined_service_t:s0
>>>> key=(null)
>>>> type=CWD msg=audit(1460639919.277:482751):  cwd="/"
>>>> type=PATH msg=audit(1460639919.277:482751): item=0
>>>> name="/etc/ovirt-hosted-engine/" inode=69555 dev=fd:00 mode=040755
>>>> ouid=0 ogid=0 rdev=00:00 obj=system_u:object_r:etc_t:s0 objtype=PARENT
>>>> type=PATH msg=audit(1460639919.277:482751): item=1
>>>> name="/etc/ovirt-hosted-engine/" inode=69555 dev=fd:00 mode=040755
>>>> ouid=0 ogid=0 rdev=00:00 obj=system_u:object_r:etc_t:s0 objtype=PARENT
>>>> type=PATH msg=audit(1460639919.277:482751): item=2
>>>> name="/etc/ovirt-hosted-engine/hosted-engine.conf" inode=134433453
>>>> dev=fd:00 mode=0100644 ouid=0 ogid=0 rdev=00:00
>>>> obj=system_u:object_r:etc_t:s0 objtype=DELETE
>>>> type=PATH msg=audit(1460639919.277:482751): item=3
>>>> name="/etc/ovirt-hosted-engine/hosted-engine.conf~" inode=134433499
>>>> dev=fd:00 mode=0100644 ouid=0 ogid=0 rdev=00:00
>>>> obj=system_u:object_r:etc_t:s0 objtype=DELETE
>>>> type=PATH msg=audit(1460639919.277:482751): item=4
>>>> name="/etc/ovirt-hosted-engine/hosted-engine.conf~" inode=134433453
>>>> dev=fd:00 mode=0100644 ouid=0 ogid=0 rdev=00:00
>>>> obj=system_u:object_r:etc_t:s0 objtype=CREATE
>>>>
>>>>
>>>>> As a thumb rule, if a file name is appended with a tilde~, it only
>>>>> means that it is a backup created by a text editor or similar program.
>>>>
>>>> If anyone except myself would have access to these systems I would
>>>> guess the same. But since I'm not editing anything in
>>>> /etc/ovirt-hosted-engine there must be another reason. And there is.
>>>>
>>>> Aside from auditd I tried to strace the whole thing just to make
>>>> sure it comes from the HA agent.
>>>>
>>>> [root at cube-two ~]# strace -o ha-trace.log -f
>>>> /usr/share/ovirt-hosted-engine-ha/ovirt-ha-agent --no-daemon
>>>>
>>>> Looking at the trace log I found this:
>>>>
>>>> 183409 statfs("/etc/ovirt-hosted-engine/.", {f_type=0x58465342,
>>>> f_bsize=4096, f_blocks=13100800, f_bfree=12523576,
>>>> f_bavail=12523576, f_files=52428800, f_ffree=52379892,
>>>> f_fsid={64768, 0}, f_namelen=255, f_frsize=4096}) = 0
>>>> 183409 rename("/etc/ovirt-hosted-engine/hosted-engine.conf",
>>>> "/etc/ovirt-hosted-engine/hosted-engine.conf~") = 0
>>>> 183409 rename("/var/lib/ovirt-hosted-engine-ha/tmpNjTElr",
>>>> "/etc/ovirt-hosted-engine/hosted-engine.conf") = 0
>>>> 183409 newfstatat(AT_FDCWD,
>>>> "/etc/ovirt-hosted-engine/hosted-engine.conf",
>>>> {st_mode=S_IFREG|0600, st_size=1021, ...}, AT_SYMLINK_NOFOLLOW) = 0
>>>> 183409 open("/etc/ovirt-hosted-engine/hosted-engine.conf",
>>>> O_RDONLY|O_NOFOLLOW) = 3
>>>>
>>>>
>>>> Putting it all together I started reading the HA agent sources and
>>>> found the function _wrote_updated_conf_file in
>>>> /usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/upgrade.py
>>>> which issues a mv -b which creates the ~ file.
>>>
>>> This should just trigger during 3.5 to 3.6 upgrade but your host are new.
>>> Can you please attach /var/log/ovirt-hosted-engine-ha/agent.log from
>>> one of them?
>>
>> The agent.log of host cube-two is attached to this mail.
>
> Yes, you are right:
> it's looping trying to fix a path in the config file (on 3.5 we didn't
> check if an NFS path was ending with a '/' while for other reasons it
> wasn't working on 3.6 and so we need to fix it) but its doesn't seams
> you case and so the strange loop.
>
> Now I need to understand why it enters there.
> Can you please execute
>  tree /rhev/data-center/
> and post me the output?
>
> Thanks again

OK, I think it's just a side effect if this bug:
https://bugzilla.redhat.com/show_bug.cgi?id=1317699

The local mount point of NFS and GlusterFS storage domain are slightly
different.
The hosted-engine storage domain autoimport procedure didn't probably
recognized the gluster storage domain as a gluster one and so marked
it as nfs in the engine and so it got mounted twice on different paths
(by ovirt-ha-agent at boot time and by engine further on);
ovirt-ha-agent notices it and think that's due to the old trailing
slash issue on NFS paths and so it tries to fix but the there is
nothing to fix in the config file and so the infinite loop.

I'll try to push a patch ASAP.
Can you please still provide the output of
 tree /rhev/data-center/

>>>> The question now is why is this done so frequently. Especially
>>>> considering since there are no modifications to the file. Is this
>>>> behavior normal?
>>>>
>>>> [root at cube-two ~]# diff /etc/ovirt-hosted-engine/hosted-engine.conf*
>>>> [root at cube-two ~]#
>>>>
>>>>
>>>>>>>> [root at cube-two ~]# ls -l /etc/ovirt-hosted-engine
>>>>>>>> total 16
>>>>>>>> -rw-r--r--. 1 root root 3252 Apr  8 10:35 answers.conf
>>>>>>>> -rw-r--r--. 1 root root 1021 Apr 13 09:31 hosted-engine.conf
>>>>>>>> -rw-r--r--. 1 root root 1021 Apr 13 09:30 hosted-engine.conf~
>>>>>>>>
>>>>>>>> [root at cube-three ~]# ls -l /etc/ovirt-hosted-engine
>>>>>>>> total 16
>>>>>>>> -rw-r--r--. 1 root root 3233 Apr 11 08:02 answers.conf
>>>>>>>> -rw-r--r--. 1 root root 1002 Apr 13 09:31 hosted-engine.conf
>>>>>>>> -rw-r--r--. 1 root root 1002 Apr 13 09:31 hosted-engine.conf~
>>>>>>>>
>>>>>>>> On 12.04.16 16:01, Simone Tiraboschi wrote:
>>>>>>>>> Everything seams fine here,
>>>>>>>>> /etc/ovirt-hosted-engine/hosted-engine.conf seams to be correctly
>>>>>>>>> created with the right name.
>>>>>>>>> Can you please check the latest modification time of your
>>>>>>>>> /etc/ovirt-hosted-engine/hosted-engine.conf~ and compare it with the
>>>>>>>>> setup time?
>>>>>>>>>
>>>>>>>>> On Tue, Apr 12, 2016 at 2:34 PM, Richard Neuboeck <hawk at tbi.univie.ac.at> wrote:
>>>>>>>>>> On 04/12/2016 11:32 AM, Simone Tiraboschi wrote:
>>>>>>>>>>> On Mon, Apr 11, 2016 at 8:11 AM, Richard Neuboeck <hawk at tbi.univie.ac.at> wrote:
>>>>>>>>>>>> Hi oVirt Group,
>>>>>>>>>>>>
>>>>>>>>>>>> in my attempts to get all aspects of oVirt 3.6 up and running I
>>>>>>>>>>>> stumbled upon something I'm not sure how to fix:
>>>>>>>>>>>>
>>>>>>>>>>>> Initially I installed a hosted engine setup. After that I added
>>>>>>>>>>>> another HA host (with hosted-engine --deploy). The host was
>>>>>>>>>>>> registered in the Engine correctly and HA agent came up as expected.
>>>>>>>>>>>>
>>>>>>>>>>>> However if I reboot the second host (through the Engine UI or
>>>>>>>>>>>> manually) HA agent fails to start. The reason seems to be that
>>>>>>>>>>>> /etc/ovirt-hosted-engine/hosted-engine.conf is empty. The backup
>>>>>>>>>>>> file ending with ~ exists though.
>>>>>>>>>>>
>>>>>>>>>>> Can you please attach hosted-engine-setup logs from your additional hosts?
>>>>>>>>>>> AFAIK our code will never take a ~ ending backup of that file.
>>>>>>>>>>
>>>>>>>>>> ovirt-hosted-engine-setup logs from both additional hosts are
>>>>>>>>>> attached to this mail.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> Here are the log messages from the journal:
>>>>>>>>>>>> Apr 11 07:29:39 cube-two.tbi.univie.ac.at systemd[1]: Starting oVirt
>>>>>>>>>>>> Hosted Engine High Availability Monitoring Agent...
>>>>>>>>>>>> Apr 11 07:29:39 cube-two.tbi.univie.ac.at ovirt-ha-agent[3747]:
>>>>>>>>>>>> INFO:ovirt_hosted_engine_ha.agent.agent.Agent:ovirt-hosted-engine-ha
>>>>>>>>>>>> agent 1.3.5.3-0.0.master started
>>>>>>>>>>>> Apr 11 07:29:39 cube-two.tbi.univie.ac.at ovirt-ha-agent[3747]:
>>>>>>>>>>>> INFO:ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine:Found
>>>>>>>>>>>> certificate common name: cube-two.tbi.univie.ac.at
>>>>>>>>>>>> Apr 11 07:29:39 cube-two.tbi.univie.ac.at ovirt-ha-agent[3747]:
>>>>>>>>>>>> ovirt-ha-agent
>>>>>>>>>>>> ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine ERROR Hosted
>>>>>>>>>>>> Engine is not configured. Shutting down.
>>>>>>>>>>>> Apr 11 07:29:39 cube-two.tbi.univie.ac.at ovirt-ha-agent[3747]:
>>>>>>>>>>>> ERROR:ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine:Hosted
>>>>>>>>>>>> Engine is not configured. Shutting down.
>>>>>>>>>>>> Apr 11 07:29:39 cube-two.tbi.univie.ac.at ovirt-ha-agent[3747]:
>>>>>>>>>>>> INFO:ovirt_hosted_engine_ha.agent.agent.Agent:Agent shutting down
>>>>>>>>>>>> Apr 11 07:29:39 cube-two.tbi.univie.ac.at systemd[1]:
>>>>>>>>>>>> ovirt-ha-agent.service: main process exited, code=exited, status=255/n/a
>>>>>>>>>>>>
>>>>>>>>>>>> If I restore the configuration from the backup file and manually
>>>>>>>>>>>> restart the HA agent it's working properly.
>>>>>>>>>>>>
>>>>>>>>>>>> For testing purposes I added a third HA host which turn out to
>>>>>>>>>>>> behave exactly the same.
>>>>>>>>>>>>
>>>>>>>>>>>> Any help would be appreciated!
>>>>>>>>>>>> Thanks
>>>>>>>>>>>> Cheers
>>>>>>>>>>>> Richard
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> /dev/null
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>> Users mailing list
>>>>>>>>>>>> Users at ovirt.org
>>>>>>>>>>>> http://lists.ovirt.org/mailman/listinfo/users
>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> /dev/null
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> Users mailing list
>>>>>>>> Users at ovirt.org
>>>>>>>> http://lists.ovirt.org/mailman/listinfo/users
>>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> /dev/null
>>>>>>
>>>>
>>>>
>>>> --
>>>> /dev/null
>>>>
>>



More information about the Users mailing list