[ovirt-users] vm.conf on one of the node is missing

knarra knarra at redhat.com
Thu Nov 24 14:23:55 UTC 2016


On 11/24/2016 07:47 PM, Simone Tiraboschi wrote:
>
>
> On Thu, Nov 24, 2016 at 3:06 PM, knarra <knarra at redhat.com> wrote:
>
>     On 11/24/2016 07:27 PM, Simone Tiraboschi wrote:
>>
>>
>>     On Thu, Nov 24, 2016 at 2:39 PM, knarra <knarra at redhat.com> wrote:
>>
>>         On 11/24/2016 06:56 PM, Simone Tiraboschi wrote:
>>>
>>>
>>>         On Thu, Nov 24, 2016 at 2:08 PM, knarra <knarra at redhat.com> wrote:
>>>
>>>             On 11/24/2016 06:15 PM, Simone Tiraboschi wrote:
>>>>
>>>>
>>>>             On Thu, Nov 24, 2016 at 1:26 PM, knarra <knarra at redhat.com> wrote:
>>>>
>>>>                 Hi,
>>>>
>>>>                     I have three nodes with glusterfs as the
>>>>                 storage domain. For some reason I see that vm.conf
>>>>                 from /var/run/ovirt-hosted-engine-ha is missing,
>>>>                 and because of this, on one of my hosts I see
>>>>                 Hosted Engine HA : Not Active. Once I copy the
>>>>                 file from another node and restart the
>>>>                 ovirt-ha-broker and ovirt-ha-agent services,
>>>>                 everything works fine. But then this happens
>>>>                 again. Can someone please help me identify why
>>>>                 this happens? Below is the log I see in the
>>>>                 ovirt-ha-agent log.
>>>>
>>>>
>>>>                 https://paste.fedoraproject.org/489120/79990345/
>>>>
>>>>
>>>>             Once the engine has correctly imported the
>>>>             hosted-engine storage domain, a couple of OVF_STORE
>>>>             volumes will appear there.
>>>>             Every modification to the engine VM configuration is
>>>>             written by the engine into that OVF_STORE, so all the
>>>>             ovirt-ha-agent instances running on the hosted-engine
>>>>             hosts will be able to restart the engine VM with a
>>>>             coherent configuration.
>>>>
>>>>             Until the engine imports the hosted-engine storage
>>>>             domain, ovirt-ha-agent will fall back to the initial
>>>>             vm.conf.
>>>>
>>>>             In your case the OVF_STORE volume is there,
>>>>             but the agent fails to extract the engine VM configuration:
>>>>             MainThread::INFO::2016-11-24
>>>>             17:55:04,914::ovf_store::112::ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(getEngineVMOVF)
>>>>             Extracting Engine VM OVF from the OVF_STORE
>>>>             MainThread::INFO::2016-11-24
>>>>             17:55:04,919::ovf_store::119::ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(getEngineVMOVF)
>>>>             OVF_STORE volume path:
>>>>             /rhev/data-center/mnt/glusterSD/10.70.36.79:_engine/27f054c3-c245-4039-b42a-c28b37043016/images/fdf49778-9a06-49c6-bf7a-a0f12425911c/8c954add-6bcf-47f8-ac2e-4c85fc3f8699
>>>>             MainThread::ERROR::2016-11-24
>>>>             17:55:04,928::ovf_store::124::ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(getEngineVMOVF)
>>>>             Unable to extract HEVM OVF
>>>>
>>>>             So it tries to roll back to the initial vm.conf, but
>>>>             that one also seems to be missing some values, so the
>>>>             agent is failing:
>>>>             MainThread::ERROR::2016-11-24
>>>>             17:55:04,974::agent::205::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent)
>>>>             Error: ''Configuration value not found:
>>>>             file=/var/run/ovirt-hosted-engine-ha/vm.conf,
>>>>             key=memSize'' - trying to restart agent
>>>>
>>>>             Both of the issues seem storage related; could you
>>>>             please share your gluster logs?
>>>>
>>>>
>>>>                 Thanks
>>>>
>>>>                 kasturi
>>>>
>>>>
>>>             Hi Simone,
>>>
>>>                 Below [1] is the link to the sosreports for the
>>>             first two hosts. The third host has some issue; once
>>>             it is up, I will provide the sosreport from there as well.
>>>
>>>
>>>         And the host where you saw the initial issue was the third one?
>>         It is on the first host.
>>
>>
>>     It seems that host1 is failing to read from the hosted-engine
>>     storage domain:
>>
>>     [2016-11-24 12:33:43.678467] W [MSGID: 114031]
>>     [client-rpc-fops.c:2938:client3_3_lookup_cbk] 0-engine-client-2:
>>     remote operation failed. Path: /
>>     (00000000-0000-0000-0000-000000000001) [Transport endpoint is not
>>     connected]
>>     [2016-11-24 12:33:43.678747] E
>>     [rpc-clnt.c:365:saved_frames_unwind] (-->
>>     /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7f077eba1642]
>>     (-->
>>     /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f077e96775e]
>>     (-->
>>     /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f077e96786e]
>>     (-->
>>     /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x84)[0x7f077e968fc4]
>>     (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x120)[0x7f077e9698a0]
>>     ))))) 0-engine-client-2: forced unwinding frame type(GlusterFS
>>     3.3) op(LOOKUP(27)) called at 2016-11-24 12:33:07.495178
>>     (xid=0x82a1c)
>>     [2016-11-24 12:33:43.678982] E
>>     [rpc-clnt.c:365:saved_frames_unwind] (-->
>>     /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7f077eba1642]
>>     (-->
>>     /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f077e96775e]
>>     (-->
>>     /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f077e96786e]
>>     (-->
>>     /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x84)[0x7f077e968fc4]
>>     (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x120)[0x7f077e9698a0]
>>     ))))) 0-engine-client-2: forced unwinding frame type(GlusterFS
>>     3.3) op(LOOKUP(27)) called at 2016-11-24 12:33:08.770637
>>     (xid=0x82a1d)
>>     [2016-11-24 12:33:43.679001] W [MSGID: 114031]
>>     [client-rpc-fops.c:2938:client3_3_lookup_cbk] 0-engine-client-2:
>>     remote operation failed. Path:
>>     /27f054c3-c245-4039-b42a-c28b37043016/images/39960f40-4aae-4714-ba73-1637785fae7c/38fa3519-f21e-4671-8c69-d1497ff8a490
>>     (1090c25b-9c90-434e-a133-faf9647cc992) [Transport endpoint is not
>>     connected]
>>     [2016-11-24 12:33:43.679303] E
>>     [rpc-clnt.c:365:saved_frames_unwind] (-->
>>     /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7f077eba1642]
>>     (-->
>>     /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f077e96775e]
>>     (-->
>>     /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f077e96786e]
>>     (-->
>>     /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x84)[0x7f077e968fc4]
>>     (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x120)[0x7f077e9698a0]
>>     ))))) 0-engine-client-2: forced unwinding frame type(GlusterFS
>>     3.3) op(LOOKUP(27)) called at 2016-11-24 12:33:11.096856
>>     (xid=0x82a1e)
>>     [2016-11-24 12:33:43.679596] E
>>     [rpc-clnt.c:365:saved_frames_unwind] (-->
>>     /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7f077eba1642]
>>     (-->
>>     /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f077e96775e]
>>     (-->
>>     /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f077e96786e]
>>     (-->
>>     /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x84)[0x7f077e968fc4]
>>     (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x120)[0x7f077e9698a0]
>>     ))))) 0-engine-client-2: forced unwinding frame type(GF-DUMP)
>>     op(NULL(2)) called at 2016-11-24 12:33:13.673743 (xid=0x82a1f)
>>     [2016-11-24 12:33:43.682310] I
>>     [socket.c:3401:socket_submit_request] 0-engine-client-2: not
>>     connected (priv->connected = 0)
>>     [2016-11-24 12:33:43.682328] W [rpc-clnt.c:1640:rpc_clnt_submit]
>>     0-engine-client-2: failed to submit rpc-request (XID: 0x82a20
>>     Program: GlusterFS 3.3, ProgVers: 330, Proc: 30) to rpc-transport
>>     (engine-client-2)
>>     [2016-11-24 12:33:43.682391] W [rpc-clnt.c:1640:rpc_clnt_submit]
>>     0-engine-client-2: failed to submit rpc-request (XID: 0x82a21
>>     Program: GlusterFS 3.3, ProgVers: 330, Proc: 27) to rpc-transport
>>     (engine-client-2)
>>     [2016-11-24 12:33:43.682441] W [rpc-clnt.c:1640:rpc_clnt_submit]
>>     0-engine-client-2: failed to submit rpc-request (XID: 0x82a22
>>     Program: GlusterFS 3.3, ProgVers: 330, Proc: 27) to rpc-transport
>>     (engine-client-2)
>>     [2016-11-24 12:33:43.682441] W [MSGID: 114031]
>>     [client-rpc-fops.c:2938:client3_3_lookup_cbk] 0-engine-client-2:
>>     remote operation failed. Path:
>>     /27f054c3-c245-4039-b42a-c28b37043016
>>     (a64398f5-3fa3-48fe-9d40-d3860876cc2c) [Transport endpoint is not
>>     connected]
>>     [2016-11-24 12:33:43.682492] W
>>     [rpc-clnt-ping.c:203:rpc_clnt_ping_cbk] 0-engine-client-2: socket
>>     disconnected
>>     [2016-11-24 12:33:43.682536] I [MSGID: 114018]
>>     [client.c:2280:client_rpc_notify] 0-engine-client-2: disconnected
>>     from engine-client-2. Client process will keep trying to connect
>>     to glusterd until brick's port is available
>>     [2016-11-24 12:33:43.682562] W [rpc-clnt.c:1640:rpc_clnt_submit]
>>     0-engine-client-2: failed to submit rpc-request (XID: 0x82a23
>>     Program: GlusterFS 3.3, ProgVers: 330, Proc: 27) to rpc-transport
>>     (engine-client-2)
>>     The message "W [MSGID: 114031]
>>     [client-rpc-fops.c:2938:client3_3_lookup_cbk] 0-engine-client-2:
>>     remote operation failed. Path:
>>     /27f054c3-c245-4039-b42a-c28b37043016
>>     (a64398f5-3fa3-48fe-9d40-d3860876cc2c) [Transport endpoint is not
>>     connected]" repeated 2 times between [2016-11-24 12:33:43.682441]
>>     and [2016-11-24 12:33:43.682599]
>>     [2016-11-24 12:33:43.688324] W [MSGID: 114031]
>>     [client-rpc-fops.c:2938:client3_3_lookup_cbk] 0-engine-client-2:
>>     remote operation failed. Path: (null)
>>     (00000000-0000-0000-0000-000000000000) [Transport endpoint is not
>>     connected]
>>
>>     Before that there was a lot of self-healing activity.
>>
>     Simone, these logs indicate that the first host is not able to
>     connect to the brick on the third host, since that host was powered down.
>     They read 'remote operation failed' on engine-client-2.
>
>
> Can you please share the output of
>  source /etc/ovirt-hosted-engine/hosted-engine.conf
>  find /rhev/data-center/ -path 
> "*/${sdUUID}/images/${conf_image_UUID}/${conf_volume_UUID}" -type f 
> -exec sh -c 'sudo -u vdsm dd if=$1 2>/dev/null | tar -xOvf - vm.conf 
>  2>/dev/null' {} {} \;
> executed on your first host?
Output of the above command:
https://paste.fedoraproject.org/489159/97176147/
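
For readability, here is the same command reflowed from the quote above,
with a comment on what each step does (this is just the command Simone
posted, nothing extra):

   # load sdUUID, conf_image_UUID and conf_volume_UUID from the hosted-engine config
   source /etc/ovirt-hosted-engine/hosted-engine.conf
   # locate the configuration volume on the mounted storage domain and extract
   # vm.conf from the tar archive stored inside it
   find /rhev/data-center/ \
       -path "*/${sdUUID}/images/${conf_image_UUID}/${conf_volume_UUID}" -type f \
       -exec sh -c 'sudo -u vdsm dd if=$1 2>/dev/null | tar -xOvf - vm.conf 2>/dev/null' {} {} \;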

Simone, I ran the following steps and I see that the vm.conf file is missing.

1) Installed three nodes with glusterfs storage domains.

2) Configured power management on all the nodes and ran the command
'poweroff -f' on one of the nodes.

3) The engine tried to fence the host and logged the message 'Host is
rebooting', but the host never came back up.

4) After some time I manually powered on the host, and once the host was
up I saw Hosted Engine HA : Not Active.

5) I see the following error message in broker.log, and the vm.conf file
is missing.

Thread-1::ERROR::2016-11-24 
19:32:11,651::listener::192::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) 
Error handling request, data: 'set-storage-domain
  FilesystemBackend dom_type=glusterfs 
sd_uuid=27f054c3-c245-4039-b42a-c28b37043016'
Traceback (most recent call last):
   File 
"/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/listener.py", 
line 166, in handle
     data)
   File 
"/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/listener.py", 
line 299, in _dispatch
     .set_storage_domain(client, sd_type, **options)
   File 
"/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py", 
line 66, in set_storage_domain
     self._backends[client].connect()
   File 
"/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/storage_backends.py", 
line 461, in connect
     self._dom_type)
   File 
"/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/storage_backends.py", 
line 108, in get_domain_path
     " in {1}".format(sd_uuid, parent))
BackendFailureException: path to storage domain 
27f054c3-c245-4039-b42a-c28b37043016 not found in 
/rhev/data-center/mnt/glusterSD
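
In case it is useful, here are a few quick checks I would run on the
affected host (just a rough sketch; the mount point is the one from the
traceback above and the volume path is the one from the earlier agent log):

   # is the hosted-engine gluster volume mounted on this host at all?
   mount | grep glusterSD
   # does the storage domain directory the broker is looking for exist?
   ls /rhev/data-center/mnt/glusterSD/10.70.36.79:_engine/27f054c3-c245-4039-b42a-c28b37043016
   # is the fallback vm.conf present, and does it still contain the memSize
   # key the agent complained about?
   grep memSize /var/run/ovirt-hosted-engine-ha/vm.conf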

Thanks
kasturi

>
>
>>
>>>             [1]
>>>             http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/HC/vm_conf/
>>>
>>>             Thanks
>>>
>>>             kasturi
>>>
>>>
>>
>>
>
>
