On 11/24/2016 07:27 PM, Simone Tiraboschi wrote:
On Thu, Nov 24, 2016 at 2:39 PM, knarra <knarra(a)redhat.com> wrote:
> On 11/24/2016 06:56 PM, Simone Tiraboschi wrote:
>
>
>
> On Thu, Nov 24, 2016 at 2:08 PM, knarra <knarra(a)redhat.com> wrote:
>
>> On 11/24/2016 06:15 PM, Simone Tiraboschi wrote:
>>
>>
>>
>> On Thu, Nov 24, 2016 at 1:26 PM, knarra <knarra(a)redhat.com> wrote:
>>
>>> Hi,
>>>
>>> I have three nodes with glusterfs as the storage domain. For some
>>> reason I see that vm.conf from /var/run/ovirt-hosted-engine-ha is
>>> missing, and because of this one of my hosts shows Hosted Engine HA: Not
>>> Active. Once I copy the file from another node and restart the
>>> ovirt-ha-broker and ovirt-ha-agent services everything works fine, but then
>>> this happens again. Can someone please help me identify why this happens?
>>> Below is the log I see in ovirt-ha-agent.log.
>>>
>>>
>>>
https://paste.fedoraproject.org/489120/79990345/
>>>
>>>
>> Once the engine has correctly imported the hosted-engine storage domain, a
>> couple of OVF_STORE volumes will appear there.
>> Every modification to the engine VM configuration is then written by the
>> engine into that OVF_STORE, so all the ovirt-ha-agent instances running on the
>> hosted-engine hosts are able to restart the engine VM with a coherent
>> configuration.
>>
>> Until the engine imports the hosted-engine storage domain, ovirt-ha-agent
>> falls back to the initial vm.conf.
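>>
>> A quick way to see which source the agent is actually using on a given host
>> is to grep its log for the OVF_STORE messages; assuming the default log
>> location, something like:
>>
>>     grep -E 'OVF_STORE|vm\.conf' /var/log/ovirt-hosted-engine-ha/agent.log | tail -n 20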
>>
>> In your case the OVF_STORE volume is there,
>> but the agent fails to extract the engine VM configuration:
>> MainThread::INFO::2016-11-24 17:55:04,914::ovf_store::112::
>> ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(getEngineVMOVF)
>> Extracting Engine VM OVF from the OVF_STORE
>> MainThread::INFO::2016-11-24 17:55:04,919::ovf_store::119::
>> ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(getEngineVMOVF)
>> OVF_STORE volume path: /rhev/data-center/mnt/glusterSD/10.70.36.79:
>> _engine/27f054c3-c245-4039-b42a-c28b37043016/i
>> mages/fdf49778-9a06-49c6-bf7a-a0f12425911c/8c954add-6bcf-
>> 47f8-ac2e-4c85fc3f8699
>> MainThread::ERROR::2016-11-24 17:55:04,928::ovf_store::124::
>> ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(getEngineVMOVF)
>> Unable to extract HEVM OVF
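>>
>> You could reproduce that read manually on the host to see whether plain
>> storage access is what fails; something along these lines, reusing the
>> volume path reported in the log above:
>>
>>     sudo -u vdsm dd if=/rhev/data-center/mnt/glusterSD/10.70.36.79:_engine/27f054c3-c245-4039-b42a-c28b37043016/images/fdf49778-9a06-49c6-bf7a-a0f12425911c/8c954add-6bcf-47f8-ac2e-4c85fc3f8699 2>/dev/null | tar -tvf -
>>
>> If the volume is readable this should list the archive contents; if the dd
>> itself fails, the problem is on the storage side.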
>>
>> So it tries to roll back to the initial vm.conf, but that one also seems
>> to be missing some values, so the agent fails:
>> MainThread::ERROR::2016-11-24 17:55:04,974::agent::205::ovir
>> t_hosted_engine_ha.agent.agent.Agent::(_run_agent) Error:
>> ''Configuration value not found:
>> file=/var/run/ovirt-hosted-engine-ha/vm.conf,
>> key=memSize'' - trying to restart agent
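>>
>> To confirm what the fallback copy is missing, you can check it directly on
>> that host, e.g.:
>>
>>     grep memSize /var/run/ovirt-hosted-engine-ha/vm.conf
>>
>> memSize is the key the agent complains about, so a healthy vm.conf should
>> contain it.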
>>
>> Both issues seem storage related; could you please share your
>> gluster logs?
>>
>>
>>>
>>>
>>> Thanks
>>>
>>> kasturi
>>>
>>>
>> Hi Simone,
>>
>> Below [1] is the link to the sosreports from the first two hosts. The
>> third host has some issue; once it is up I will share the sosreport from
>> there as well.
>>
>
> And the host where you see the initial issue was the third one?
>
> It is on the first host.
>
>
>
>
It seems that host1 is failing to read from the hosted-engine storage
domain:
[2016-11-24 12:33:43.678467] W [MSGID: 114031]
[client-rpc-fops.c:2938:client3_3_lookup_cbk]
0-engine-client-2: remote operation failed. Path: /
(00000000-0000-0000-0000-000000000001) [Transport endpoint is not
connected]
[2016-11-24 12:33:43.678747] E [rpc-clnt.c:365:saved_frames_unwind] (-->
/lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7f077eba1642] (-->
/lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f077e96775e] (-->
/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f077e96786e] (-->
/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x84)[0x7f077e968fc4]
(--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x120)[0x7f077e9698a0] )))))
0-engine-client-2: forced unwinding frame type(GlusterFS 3.3)
op(LOOKUP(27)) called at 2016-11-24 12:33:07.495178 (xid=0x82a1c)
[2016-11-24 12:33:43.678982] E [rpc-clnt.c:365:saved_frames_unwind] (-->
/lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7f077eba1642] (-->
/lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f077e96775e] (-->
/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f077e96786e] (-->
/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x84)[0x7f077e968fc4]
(--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x120)[0x7f077e9698a0] )))))
0-engine-client-2: forced unwinding frame type(GlusterFS 3.3)
op(LOOKUP(27)) called at 2016-11-24 12:33:08.770637 (xid=0x82a1d)
[2016-11-24 12:33:43.679001] W [MSGID: 114031]
[client-rpc-fops.c:2938:client3_3_lookup_cbk]
0-engine-client-2: remote operation failed. Path: /27f054c3-c245-4039-b42a-
c28b37043016/images/39960f40-4aae-4714-ba73-1637785fae7c/
38fa3519-f21e-4671-8c69-d1497ff8a490 (1090c25b-9c90-434e-a133-faf9647cc992)
[Transport endpoint is not connected]
[2016-11-24 12:33:43.679303] E [rpc-clnt.c:365:saved_frames_unwind] (-->
/lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7f077eba1642] (-->
/lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f077e96775e] (-->
/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f077e96786e] (-->
/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x84)[0x7f077e968fc4]
(--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x120)[0x7f077e9698a0] )))))
0-engine-client-2: forced unwinding frame type(GlusterFS 3.3)
op(LOOKUP(27)) called at 2016-11-24 12:33:11.096856 (xid=0x82a1e)
[2016-11-24 12:33:43.679596] E [rpc-clnt.c:365:saved_frames_unwind] (-->
/lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7f077eba1642] (-->
/lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f077e96775e] (-->
/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f077e96786e] (-->
/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x84)[0x7f077e968fc4]
(--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x120)[0x7f077e9698a0] )))))
0-engine-client-2: forced unwinding frame type(GF-DUMP) op(NULL(2)) called
at 2016-11-24 12:33:13.673743 (xid=0x82a1f)
[2016-11-24 12:33:43.682310] I [socket.c:3401:socket_submit_request]
0-engine-client-2: not connected (priv->connected = 0)
[2016-11-24 12:33:43.682328] W [rpc-clnt.c:1640:rpc_clnt_submit]
0-engine-client-2: failed to submit rpc-request (XID: 0x82a20 Program:
GlusterFS 3.3, ProgVers: 330, Proc: 30) to rpc-transport (engine-client-2)
[2016-11-24 12:33:43.682391] W [rpc-clnt.c:1640:rpc_clnt_submit]
0-engine-client-2: failed to submit rpc-request (XID: 0x82a21 Program:
GlusterFS 3.3, ProgVers: 330, Proc: 27) to rpc-transport (engine-client-2)
[2016-11-24 12:33:43.682441] W [rpc-clnt.c:1640:rpc_clnt_submit]
0-engine-client-2: failed to submit rpc-request (XID: 0x82a22 Program:
GlusterFS 3.3, ProgVers: 330, Proc: 27) to rpc-transport (engine-client-2)
[2016-11-24 12:33:43.682441] W [MSGID: 114031]
[client-rpc-fops.c:2938:client3_3_lookup_cbk]
0-engine-client-2: remote operation failed. Path: /27f054c3-c245-4039-b42a-c28b37043016
(a64398f5-3fa3-48fe-9d40-d3860876cc2c) [Transport endpoint is not
connected]
[2016-11-24 12:33:43.682492] W [rpc-clnt-ping.c:203:rpc_clnt_ping_cbk]
0-engine-client-2: socket disconnected
[2016-11-24 12:33:43.682536] I [MSGID: 114018] [client.c:2280:client_rpc_notify]
0-engine-client-2: disconnected from engine-client-2. Client process will
keep trying to connect to glusterd until brick's port is available
[2016-11-24 12:33:43.682562] W [rpc-clnt.c:1640:rpc_clnt_submit]
0-engine-client-2: failed to submit rpc-request (XID: 0x82a23 Program:
GlusterFS 3.3, ProgVers: 330, Proc: 27) to rpc-transport (engine-client-2)
The message "W [MSGID: 114031] [client-rpc-fops.c:2938:client3_3_lookup_cbk]
0-engine-client-2: remote operation failed. Path: /27f054c3-c245-4039-b42a-c28b37043016
(a64398f5-3fa3-48fe-9d40-d3860876cc2c) [Transport endpoint is not
connected]" repeated 2 times between [2016-11-24 12:33:43.682441] and
[2016-11-24 12:33:43.682599]
[2016-11-24 12:33:43.688324] W [MSGID: 114031]
[client-rpc-fops.c:2938:client3_3_lookup_cbk]
0-engine-client-2: remote operation failed. Path: (null)
(00000000-0000-0000-0000-000000000000) [Transport endpoint is not
connected]
Before that there was a lot of self-healing activity.
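If you want to double-check brick connectivity and heal state from host1,
something like this should show it (assuming the volume is simply named
"engine", as the 0-engine-client-2 client name and the _engine mount path
suggest):

    gluster volume status engine
    gluster volume heal engine info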
Simone, these logs indicate that the first host is not able to connect to the
brick on the third host, since it was powered down. It reports "remote
operation failed" on engine-client-2.
Can you please share the output of:

    source /etc/ovirt-hosted-engine/hosted-engine.conf
    find /rhev/data-center/ -path "*/${sdUUID}/images/${conf_image_UUID}/${conf_volume_UUID}" -type f \
      -exec sh -c 'sudo -u vdsm dd if=$1 2>/dev/null | tar -xOvf - vm.conf 2>/dev/null' {} {} \;

executed on your first host?
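If the configuration volume is readable, that should print the content of the
vm.conf stored there (key=value lines, which should include memSize); empty
output would again point to a read failure on the storage side.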