[ovirt-users] vm.conf on one of the nodes is missing

Simone Tiraboschi stirabos at redhat.com
Fri Nov 25 08:52:11 UTC 2016


On Fri, Nov 25, 2016 at 7:52 AM, knarra <knarra at redhat.com> wrote:

> On 11/25/2016 03:35 AM, Simone Tiraboschi wrote:
>
>
> On Thu, Nov 24, 2016 at 3:35 PM, Simone Tiraboschi <stirabos at redhat.com>
> wrote:
>
>>
>>
>> On Thu, Nov 24, 2016 at 3:23 PM, knarra <knarra at redhat.com> wrote:
>>
>>> On 11/24/2016 07:47 PM, Simone Tiraboschi wrote:
>>>
>>>
>>>
>>> On Thu, Nov 24, 2016 at 3:06 PM, knarra <knarra at redhat.com> wrote:
>>>
>>>> On 11/24/2016 07:27 PM, Simone Tiraboschi wrote:
>>>>
>>>>
>>>>
>>>> On Thu, Nov 24, 2016 at 2:39 PM, knarra <knarra at redhat.com> wrote:
>>>>
>>>>> On 11/24/2016 06:56 PM, Simone Tiraboschi wrote:
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Nov 24, 2016 at 2:08 PM, knarra <knarra at redhat.com> wrote:
>>>>>
>>>>>> On 11/24/2016 06:15 PM, Simone Tiraboschi wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Nov 24, 2016 at 1:26 PM, knarra <knarra at redhat.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>>     I have three nodes with glusterfs as the storage domain. For some
>>>>>>> reason I see that vm.conf from /var/run/ovirt-hosted-engine-ha is
>>>>>>> missing, and because of this one of my hosts shows Hosted Engine HA: Not
>>>>>>> Active. Once I copy the file from another node and restart the
>>>>>>> ovirt-ha-broker and ovirt-ha-agent services everything works fine, but
>>>>>>> then it happens again. Can someone please help me identify why this
>>>>>>> happens? Below is the log I see in ovirt-ha-agent.log:
>>>>>>>
>>>>>>>
>>>>>>> https://paste.fedoraproject.org/489120/79990345/
>>>>>>>
>>>>>>>
>>>>>> Once the engine has correctly imported the hosted-engine storage domain,
>>>>>> a couple of OVF_STORE volumes will appear there.
>>>>>> Every modification to the engine VM configuration is written by the
>>>>>> engine into that OVF_STORE, so all the ovirt-ha-agent instances running
>>>>>> on the hosted-engine hosts are able to restart the engine VM with a
>>>>>> coherent configuration.
>>>>>>
>>>>>> Until the engine imports the hosted-engine storage domain,
>>>>>> ovirt-ha-agent will fall back to the initial vm.conf.
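>>>>>>
>>>>>> As a quick sketch (the volume path is the one from the log lines just
>>>>>> below; tar's -t flag lists archive members without extracting), one can
>>>>>> verify that the OVF_STORE tar is readable at all:
>>>>>>
>>>>>> sudo -u vdsm dd if=/rhev/data-center/mnt/glusterSD/10.70.36.79:_engine/27f054c3-c245-4039-b42a-c28b37043016/images/fdf49778-9a06-49c6-bf7a-a0f12425911c/8c954add-6bcf-47f8-ac2e-4c85fc3f8699 2>/dev/null | tar -tvf -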
>>>>>>
>>>>>> In your case the OVF_STORE volume is there,
>>>>>> but the agent fails to extract the engine VM configuration:
>>>>>> MainThread::INFO::2016-11-24 17:55:04,914::ovf_store::112::ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(getEngineVMOVF) Extracting Engine VM OVF from the OVF_STORE
>>>>>> MainThread::INFO::2016-11-24 17:55:04,919::ovf_store::119::ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(getEngineVMOVF) OVF_STORE volume path: /rhev/data-center/mnt/glusterSD/10.70.36.79:_engine/27f054c3-c245-4039-b42a-c28b37043016/images/fdf49778-9a06-49c6-bf7a-a0f12425911c/8c954add-6bcf-47f8-ac2e-4c85fc3f8699
>>>>>> MainThread::ERROR::2016-11-24 17:55:04,928::ovf_store::124::ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(getEngineVMOVF) Unable to extract HEVM OVF
>>>>>>
>>>>>> So it tries to roll back to the initial vm.conf, but that one also
>>>>>> seems to be missing some values, so the agent fails:
>>>>>> MainThread::ERROR::2016-11-24 17:55:04,974::agent::205::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Error: ''Configuration value not found: file=/var/run/ovirt-hosted-engine-ha/vm.conf, key=memSize'' - trying to restart agent
>>>>>>
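>>>>>> A quick sanity check (the key name here is simply the one from the
>>>>>> error above) could be to confirm whether the vm.conf copy the agent
>>>>>> reads still contains it:
>>>>>>
>>>>>> grep '^memSize' /var/run/ovirt-hosted-engine-ha/vm.conf
>>>>>>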
>>>>>> Both issues seem storage related; could you please share your gluster
>>>>>> logs?
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>> kasturi
>>>>>>>
>>>>>>>
>>>>>> Hi Simone,
>>>>>>
>>>>>>     Below [1] is the link to the sosreports for the first two hosts.
>>>>>> The third host has some issue; once it is up I will provide the
>>>>>> sosreport from there as well.
>>>>>>
>>>>>
>>>>> And the host where you see the initial issue was the third one?
>>>>>
>>>>> It is on the first host.
>>>>>
>>>>>
>>>>>
>>>>>
>>>> It seems that host1 is failing to read from the hosted-engine storage
>>>> domain:
>>>>
>>>> [2016-11-24 12:33:43.678467] W [MSGID: 114031]
>>>> [client-rpc-fops.c:2938:client3_3_lookup_cbk] 0-engine-client-2:
>>>> remote operation failed. Path: / (00000000-0000-0000-0000-000000000001)
>>>> [Transport endpoint is not connected]
>>>> [2016-11-24 12:33:43.678747] E [rpc-clnt.c:365:saved_frames_unwind]
>>>> (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7f077eba1642]
>>>> (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f077e96775e]
>>>> (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f077e96786e]
>>>> (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x84)[0x7f077e968fc4]
>>>> (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x120)[0x7f077e9698a0] )))))
>>>> 0-engine-client-2: forced unwinding frame type(GlusterFS 3.3)
>>>> op(LOOKUP(27)) called at 2016-11-24 12:33:07.495178 (xid=0x82a1c)
>>>> [2016-11-24 12:33:43.678982] E [rpc-clnt.c:365:saved_frames_unwind]
>>>> (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7f077eba1642]
>>>> (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f077e96775e]
>>>> (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f077e96786e]
>>>> (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x84)[0x7f077e968fc4]
>>>> (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x120)[0x7f077e9698a0] )))))
>>>> 0-engine-client-2: forced unwinding frame type(GlusterFS 3.3)
>>>> op(LOOKUP(27)) called at 2016-11-24 12:33:08.770637 (xid=0x82a1d)
>>>> [2016-11-24 12:33:43.679001] W [MSGID: 114031]
>>>> [client-rpc-fops.c:2938:client3_3_lookup_cbk] 0-engine-client-2:
>>>> remote operation failed. Path: /27f054c3-c245-4039-b42a-c28b37043016/images/39960f40-4aae-4714-ba73-1637785fae7c/38fa3519-f21e-4671-8c69-d1497ff8a490 (1090c25b-9c90-434e-a133-faf9647cc992) [Transport endpoint is not connected]
>>>> [2016-11-24 12:33:43.679303] E [rpc-clnt.c:365:saved_frames_unwind]
>>>> (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7f077eba1642]
>>>> (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f077e96775e]
>>>> (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f077e96786e]
>>>> (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x84)[0x7f077e968fc4]
>>>> (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x120)[0x7f077e9698a0] )))))
>>>> 0-engine-client-2: forced unwinding frame type(GlusterFS 3.3)
>>>> op(LOOKUP(27)) called at 2016-11-24 12:33:11.096856 (xid=0x82a1e)
>>>> [2016-11-24 12:33:43.679596] E [rpc-clnt.c:365:saved_frames_unwind]
>>>> (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7f077eba1642]
>>>> (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f077e96775e]
>>>> (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f077e96786e]
>>>> (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x84)[0x7f077e968fc4]
>>>> (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x120)[0x7f077e9698a0] )))))
>>>> 0-engine-client-2: forced unwinding frame type(GF-DUMP) op(NULL(2)) called
>>>> at 2016-11-24 12:33:13.673743 (xid=0x82a1f)
>>>> [2016-11-24 12:33:43.682310] I [socket.c:3401:socket_submit_request]
>>>> 0-engine-client-2: not connected (priv->connected = 0)
>>>> [2016-11-24 12:33:43.682328] W [rpc-clnt.c:1640:rpc_clnt_submit]
>>>> 0-engine-client-2: failed to submit rpc-request (XID: 0x82a20 Program:
>>>> GlusterFS 3.3, ProgVers: 330, Proc: 30) to rpc-transport (engine-client-2)
>>>> [2016-11-24 12:33:43.682391] W [rpc-clnt.c:1640:rpc_clnt_submit]
>>>> 0-engine-client-2: failed to submit rpc-request (XID: 0x82a21 Program:
>>>> GlusterFS 3.3, ProgVers: 330, Proc: 27) to rpc-transport (engine-client-2)
>>>> [2016-11-24 12:33:43.682441] W [rpc-clnt.c:1640:rpc_clnt_submit]
>>>> 0-engine-client-2: failed to submit rpc-request (XID: 0x82a22 Program:
>>>> GlusterFS 3.3, ProgVers: 330, Proc: 27) to rpc-transport (engine-client-2)
>>>> [2016-11-24 12:33:43.682441] W [MSGID: 114031]
>>>> [client-rpc-fops.c:2938:client3_3_lookup_cbk] 0-engine-client-2:
>>>> remote operation failed. Path: /27f054c3-c245-4039-b42a-c28b37043016
>>>> (a64398f5-3fa3-48fe-9d40-d3860876cc2c) [Transport endpoint is not
>>>> connected]
>>>> [2016-11-24 12:33:43.682492] W [rpc-clnt-ping.c:203:rpc_clnt_ping_cbk]
>>>> 0-engine-client-2: socket disconnected
>>>> [2016-11-24 12:33:43.682536] I [MSGID: 114018]
>>>> [client.c:2280:client_rpc_notify] 0-engine-client-2: disconnected from
>>>> engine-client-2. Client process will keep trying to connect to glusterd
>>>> until brick's port is available
>>>> [2016-11-24 12:33:43.682562] W [rpc-clnt.c:1640:rpc_clnt_submit]
>>>> 0-engine-client-2: failed to submit rpc-request (XID: 0x82a23 Program:
>>>> GlusterFS 3.3, ProgVers: 330, Proc: 27) to rpc-transport (engine-client-2)
>>>> The message "W [MSGID: 114031] [client-rpc-fops.c:2938:client3_3_lookup_cbk]
>>>> 0-engine-client-2: remote operation failed. Path:
>>>> /27f054c3-c245-4039-b42a-c28b37043016 (a64398f5-3fa3-48fe-9d40-d3860876cc2c)
>>>> [Transport endpoint is not connected]" repeated 2 times between [2016-11-24
>>>> 12:33:43.682441] and [2016-11-24 12:33:43.682599]
>>>> [2016-11-24 12:33:43.688324] W [MSGID: 114031]
>>>> [client-rpc-fops.c:2938:client3_3_lookup_cbk] 0-engine-client-2:
>>>> remote operation failed. Path: (null) (00000000-0000-0000-0000-000000000000)
>>>> [Transport endpoint is not connected]
>>>>
>>>> Before that there was a lot of self-healing activity.
>>>>
>>>> Simone, these logs indicate that the first host is not able to connect to
>>>> the brick on the third host since it was powered down. It reports 'remote
>>>> operation failed' on engine-client-2.
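>>>>
>>>> A hedged diagnostic sketch ('engine' is the volume name assumed from the
>>>> mount path in these logs): check whether all bricks are back online and
>>>> whether self-heal still has pending entries:
>>>>
>>>> gluster volume status engine
>>>> gluster volume heal engine info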
>>>>
>>>
>>> Can you please share the output of
>>>  source /etc/ovirt-hosted-engine/hosted-engine.conf
>>>  find /rhev/data-center/ -path "*/${sdUUID}/images/${conf_image_UUID}/${conf_volume_UUID}"
>>> -type f -exec sh -c 'sudo -u vdsm dd if=$1 2>/dev/null | tar -xOvf -
>>> vm.conf  2>/dev/null' {} {} \;
>>> executed on your first host?
>>>
>>> Output for the above command: https://paste.fedoraproject.org/489159/97176147/
>>>
>>>
>> This one looks fine; now the point is why the OVF_STORE doesn't seem to
>> be.
>>
>> Could you please try executing:
>> sudo -u vdsm dd if=/rhev/data-center/mnt/glusterSD/10.70.36.79:_engine/27f054c3-c245-4039-b42a-c28b37043016/images/fdf49778-9a06-49c6-bf7a-a0f12425911c/8c954add-6bcf-47f8-ac2e-4c85fc3f8699 | tar -xOvf - 21e0e248-19bf-47b3-b72f-6a3740d9ff43.ovf
>>
>>
>
> OK, found: it's definitely a regression introduced (just on master) with
> https://gerrit.ovirt.org/#/c/66103/
>
> The code that reads and writes the HE configuration and the OVF_STORE
> from/to the volume on the shared storage (with dd) is shared between
> ovirt-hosted-engine-setup (which runs as root) and ovirt-ha-agent (which
> runs as the vdsm user).
> In order to behave correctly (against root squash on NFS, for instance), it
> was always running 'sudo -u vdsm dd ...' instead of just 'dd', regardless
> of the active user.
> But since https://gerrit.ovirt.org/#/c/66103/ the vdsm user can no longer
> run 'sudo -u vdsm dd ...', hence this issue.
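>
> A minimal way to reproduce the sudoers issue could be (just a sketch,
> running the same dd invocation as the vdsm user itself; sudo's -n flag
> makes it fail instead of prompting):
>
> su -s /bin/bash vdsm -c 'sudo -n -u vdsm dd if=/dev/null of=/dev/null count=0'
>
> On an affected build this should fail with a sudo permission error instead
> of completing silently.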
>
> I opened a bug:
> https://bugzilla.redhat.com/show_bug.cgi?id=1398443
>
> Is this the same reason why the vm.conf file is missing from the
> /var/run/ovirt-hosted-engine-ha folder?
>

/var/run/ovirt-hosted-engine-ha is just a temporary location on the host
where the agent saves the file extracted from the OVF_STORE.
The temporary file is not there since the agent fails to read both the
OVF_STORE and the initial vm.conf due to the missing sudoers rule.
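
Once a build with the fix is in place, restarting the HA services (the same
workaround you already used) should let the agent re-extract it, e.g.:

systemctl restart ovirt-ha-broker ovirt-ha-agent
ls -l /var/run/ovirt-hosted-engine-ha/vm.conf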


>
>
> Kasturi, you got it just on one of your hosts since the other two run an
> older VDSM build.
>
>
>
>>
>> Simone, I ran the following steps and I see that the vm.conf file is missing.
>>>
>>> 1) Installed three nodes with glusterfs as the storage domain.
>>>
>>> 2) Configured power management on all the nodes and ran the command
>>> 'poweroff -f' on one of the nodes.
>>>
>>> 3) The engine tried to fence the host and logged the message 'Host is
>>> rebooting', but the host never came back up.
>>>
>>> 4) After some time I manually powered on the host, and once it was up I
>>> saw Hosted Engine HA : Not Active.
>>>
>>> 5) I see the following error message in broker.log, and the vm.conf file
>>> is missing.
>>>
>>> Thread-1::ERROR::2016-11-24 19:32:11,651::listener::192::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) Error handling request, data: 'set-storage-domain FilesystemBackend dom_type=glusterfs sd_uuid=27f054c3-c245-4039-b42a-c28b37043016'
>>> Traceback (most recent call last):
>>>   File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/listener.py",
>>> line 166, in handle
>>>     data)
>>>   File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/listener.py",
>>> line 299, in _dispatch
>>>     .set_storage_domain(client, sd_type, **options)
>>>   File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py",
>>> line 66, in set_storage_domain
>>>     self._backends[client].connect()
>>>   File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/storage_backends.py",
>>> line 461, in connect
>>>     self._dom_type)
>>>   File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/storage_backends.py",
>>> line 108, in get_domain_path
>>>     " in {1}".format(sd_uuid, parent))
>>> BackendFailureException: path to storage domain
>>> 27f054c3-c245-4039-b42a-c28b37043016 not found in
>>> /rhev/data-center/mnt/glusterSD
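>>>
>>> A hedged check for the BackendFailureException above: verify whether the
>>> gluster mount is present at all under the path the broker searches:
>>>
>>> ls /rhev/data-center/mnt/glusterSD/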
>>>
>>> Thanks
>>> kasturi
>>>
>>>
>>>
>>>>
>>>>
>>>>
>>>>> [1] http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/HC/vm_conf/
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> kasturi
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>
>

