[ovirt-users] vm.conf on one of the nodes is missing

knarra knarra at redhat.com
Fri Nov 25 06:52:28 UTC 2016


On 11/25/2016 03:35 AM, Simone Tiraboschi wrote:
>
> On Thu, Nov 24, 2016 at 3:35 PM, Simone Tiraboschi
> <stirabos at redhat.com> wrote:
>
>
>
>     On Thu, Nov 24, 2016 at 3:23 PM, knarra <knarra at redhat.com> wrote:
>
>         On 11/24/2016 07:47 PM, Simone Tiraboschi wrote:
>>
>>
>>         On Thu, Nov 24, 2016 at 3:06 PM, knarra <knarra at redhat.com> wrote:
>>
>>             On 11/24/2016 07:27 PM, Simone Tiraboschi wrote:
>>>
>>>
>>>             On Thu, Nov 24, 2016 at 2:39 PM, knarra
>>>             <knarra at redhat.com> wrote:
>>>
>>>                 On 11/24/2016 06:56 PM, Simone Tiraboschi wrote:
>>>>
>>>>
>>>>                 On Thu, Nov 24, 2016 at 2:08 PM, knarra
>>>>                 <knarra at redhat.com> wrote:
>>>>
>>>>                     On 11/24/2016 06:15 PM, Simone Tiraboschi wrote:
>>>>>
>>>>>
>>>>>                     On Thu, Nov 24, 2016 at 1:26 PM, knarra
>>>>>                     <knarra at redhat.com> wrote:
>>>>>
>>>>>                         Hi,
>>>>>
>>>>>                             I have three nodes with glusterfs as
>>>>>                         the storage domain. For some reason I see
>>>>>                         that vm.conf from
>>>>>                         /var/run/ovirt-hosted-engine-ha is missing,
>>>>>                         and because of this one of my hosts shows
>>>>>                         Hosted Engine HA : Not Active. Once I copy
>>>>>                         the file from another node and restart the
>>>>>                         ovirt-ha-broker and ovirt-ha-agent
>>>>>                         services, everything works fine, but then
>>>>>                         it happens again. Can someone please help
>>>>>                         me identify why this happens? Below is the
>>>>>                         log I see in the ovirt-ha-agent logs.
>>>>>
>>>>>
>>>>>                         https://paste.fedoraproject.org/489120/79990345/
>>>>>
>>>>>
>>>>>                     Once the engine has correctly imported the
>>>>>                     hosted-engine storage domain, a couple of
>>>>>                     OVF_STORE volumes will appear there.
>>>>>                     Every modification to the engine VM
>>>>>                     configuration is written by the engine into
>>>>>                     that OVF_STORE, so all the ovirt-ha-agent
>>>>>                     instances running on the hosted-engine hosts
>>>>>                     are able to restart the engine VM with a
>>>>>                     coherent configuration.
>>>>>
>>>>>                     Until the engine imports the hosted-engine
>>>>>                     storage domain, ovirt-ha-agent will fall back
>>>>>                     to the initial vm.conf.
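>>>>>
>>>>>                     As a rough illustration only (not the actual
>>>>>                     ovirt-ha-agent code), that fallback behaves
>>>>>                     like this minimal Python sketch, where
>>>>>                     extract_ovf_vm_conf() and the initial vm.conf
>>>>>                     path are assumptions:
>>>>>
>>>>>                     INITIAL_VM_CONF = "/etc/ovirt-hosted-engine/vm.conf"  # assumed path
>>>>>                     RUNTIME_VM_CONF = "/var/run/ovirt-hosted-engine-ha/vm.conf"
>>>>>
>>>>>                     def refresh_vm_conf(extract_ovf_vm_conf):
>>>>>                         # Prefer the engine-managed copy from the OVF_STORE;
>>>>>                         # fall back to the initial vm.conf written at deploy time.
>>>>>                         content = extract_ovf_vm_conf()  # hypothetical helper
>>>>>                         if content is None:
>>>>>                             with open(INITIAL_VM_CONF) as f:
>>>>>                                 content = f.read()
>>>>>                         with open(RUNTIME_VM_CONF, "w") as f:
>>>>>                             f.write(content)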
>>>>>
>>>>>                     In your case the OVF_STORE volume is there,
>>>>>                     but the agent fails to extract the engine VM
>>>>>                     configuration:
>>>>>                     MainThread::INFO::2016-11-24
>>>>>                     17:55:04,914::ovf_store::112::ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(getEngineVMOVF)
>>>>>                     Extracting Engine VM OVF from the OVF_STORE
>>>>>                     MainThread::INFO::2016-11-24
>>>>>                     17:55:04,919::ovf_store::119::ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(getEngineVMOVF)
>>>>>                     OVF_STORE volume path:
>>>>>                     /rhev/data-center/mnt/glusterSD/10.70.36.79:_engine/27f054c3-c245-4039-b42a-c28b37043016/images/fdf49778-9a06-49c6-bf7a-a0f12425911c/8c954add-6bcf-47f8-ac2e-4c85fc3f8699
>>>>>                     MainThread::ERROR::2016-11-24
>>>>>                     17:55:04,928::ovf_store::124::ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(getEngineVMOVF)
>>>>>                     Unable to extract HEVM OVF
>>>>>
>>>>>                     So it tries to roll back to the initial
>>>>>                     vm.conf, but that one also seems to be missing
>>>>>                     some values, so the agent is failing:
>>>>>                     MainThread::ERROR::2016-11-24
>>>>>                     17:55:04,974::agent::205::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent)
>>>>>                     Error: ''Configuration value not found:
>>>>>                     file=/var/run/ovirt-hosted-engine-ha/vm.conf,
>>>>>                     key=memSize'' - trying to restart agent
>>>>>
>>>>>                     Both of the issues seem storage related; could
>>>>>                     you please share your gluster logs?
>>>>>
>>>>>
>>>>>                         Thanks
>>>>>
>>>>>                         kasturi
>>>>>
>>>>>
>>>>                     Hi Simone,
>>>>
>>>>                         Below [1] is the link for the sosreports on
>>>>                     the first two hosts. The third host has some
>>>>                     issues. Once it is up, I will provide the
>>>>                     sosreport from there as well.
>>>>
>>>>
>>>>                 And the host where you see the initial issue was
>>>>                 the third one?
>>>                 It is on the first host.
>>>
>>>
>>>             It seems that host1 is failing to read from the
>>>             hosted-engine storage domain:
>>>
>>>             [2016-11-24 12:33:43.678467] W [MSGID: 114031]
>>>             [client-rpc-fops.c:2938:client3_3_lookup_cbk]
>>>             0-engine-client-2: remote operation failed. Path: /
>>>             (00000000-0000-0000-0000-000000000001) [Transport
>>>             endpoint is not connected]
>>>             [2016-11-24 12:33:43.678747] E
>>>             [rpc-clnt.c:365:saved_frames_unwind] (-->
>>>             /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7f077eba1642]
>>>             (-->
>>>             /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f077e96775e]
>>>             (-->
>>>             /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f077e96786e]
>>>             (-->
>>>             /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x84)[0x7f077e968fc4]
>>>             (-->
>>>             /lib64/libgfrpc.so.0(rpc_clnt_notify+0x120)[0x7f077e9698a0]
>>>             ))))) 0-engine-client-2: forced unwinding frame
>>>             type(GlusterFS 3.3) op(LOOKUP(27)) called at 2016-11-24
>>>             12:33:07.495178 (xid=0x82a1c)
>>>             [2016-11-24 12:33:43.678982] E
>>>             [rpc-clnt.c:365:saved_frames_unwind] (-->
>>>             /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7f077eba1642]
>>>             (-->
>>>             /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f077e96775e]
>>>             (-->
>>>             /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f077e96786e]
>>>             (-->
>>>             /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x84)[0x7f077e968fc4]
>>>             (-->
>>>             /lib64/libgfrpc.so.0(rpc_clnt_notify+0x120)[0x7f077e9698a0]
>>>             ))))) 0-engine-client-2: forced unwinding frame
>>>             type(GlusterFS 3.3) op(LOOKUP(27)) called at 2016-11-24
>>>             12:33:08.770637 (xid=0x82a1d)
>>>             [2016-11-24 12:33:43.679001] W [MSGID: 114031]
>>>             [client-rpc-fops.c:2938:client3_3_lookup_cbk]
>>>             0-engine-client-2: remote operation failed. Path:
>>>             /27f054c3-c245-4039-b42a-c28b37043016/images/39960f40-4aae-4714-ba73-1637785fae7c/38fa3519-f21e-4671-8c69-d1497ff8a490
>>>             (1090c25b-9c90-434e-a133-faf9647cc992) [Transport
>>>             endpoint is not connected]
>>>             [2016-11-24 12:33:43.679303] E
>>>             [rpc-clnt.c:365:saved_frames_unwind] (-->
>>>             /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7f077eba1642]
>>>             (-->
>>>             /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f077e96775e]
>>>             (-->
>>>             /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f077e96786e]
>>>             (-->
>>>             /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x84)[0x7f077e968fc4]
>>>             (-->
>>>             /lib64/libgfrpc.so.0(rpc_clnt_notify+0x120)[0x7f077e9698a0]
>>>             ))))) 0-engine-client-2: forced unwinding frame
>>>             type(GlusterFS 3.3) op(LOOKUP(27)) called at 2016-11-24
>>>             12:33:11.096856 (xid=0x82a1e)
>>>             [2016-11-24 12:33:43.679596] E
>>>             [rpc-clnt.c:365:saved_frames_unwind] (-->
>>>             /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7f077eba1642]
>>>             (-->
>>>             /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f077e96775e]
>>>             (-->
>>>             /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f077e96786e]
>>>             (-->
>>>             /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x84)[0x7f077e968fc4]
>>>             (-->
>>>             /lib64/libgfrpc.so.0(rpc_clnt_notify+0x120)[0x7f077e9698a0]
>>>             ))))) 0-engine-client-2: forced unwinding frame
>>>             type(GF-DUMP) op(NULL(2)) called at 2016-11-24
>>>             12:33:13.673743 (xid=0x82a1f)
>>>             [2016-11-24 12:33:43.682310] I
>>>             [socket.c:3401:socket_submit_request] 0-engine-client-2:
>>>             not connected (priv->connected = 0)
>>>             [2016-11-24 12:33:43.682328] W
>>>             [rpc-clnt.c:1640:rpc_clnt_submit] 0-engine-client-2:
>>>             failed to submit rpc-request (XID: 0x82a20 Program:
>>>             GlusterFS 3.3, ProgVers: 330, Proc: 30) to rpc-transport
>>>             (engine-client-2)
>>>             [2016-11-24 12:33:43.682391] W
>>>             [rpc-clnt.c:1640:rpc_clnt_submit] 0-engine-client-2:
>>>             failed to submit rpc-request (XID: 0x82a21 Program:
>>>             GlusterFS 3.3, ProgVers: 330, Proc: 27) to rpc-transport
>>>             (engine-client-2)
>>>             [2016-11-24 12:33:43.682441] W
>>>             [rpc-clnt.c:1640:rpc_clnt_submit] 0-engine-client-2:
>>>             failed to submit rpc-request (XID: 0x82a22 Program:
>>>             GlusterFS 3.3, ProgVers: 330, Proc: 27) to rpc-transport
>>>             (engine-client-2)
>>>             [2016-11-24 12:33:43.682441] W [MSGID: 114031]
>>>             [client-rpc-fops.c:2938:client3_3_lookup_cbk]
>>>             0-engine-client-2: remote operation failed. Path:
>>>             /27f054c3-c245-4039-b42a-c28b37043016
>>>             (a64398f5-3fa3-48fe-9d40-d3860876cc2c) [Transport
>>>             endpoint is not connected]
>>>             [2016-11-24 12:33:43.682492] W
>>>             [rpc-clnt-ping.c:203:rpc_clnt_ping_cbk]
>>>             0-engine-client-2: socket disconnected
>>>             [2016-11-24 12:33:43.682536] I [MSGID: 114018]
>>>             [client.c:2280:client_rpc_notify] 0-engine-client-2:
>>>             disconnected from engine-client-2. Client process will
>>>             keep trying to connect to glusterd until brick's port is
>>>             available
>>>             [2016-11-24 12:33:43.682562] W
>>>             [rpc-clnt.c:1640:rpc_clnt_submit] 0-engine-client-2:
>>>             failed to submit rpc-request (XID: 0x82a23 Program:
>>>             GlusterFS 3.3, ProgVers: 330, Proc: 27) to rpc-transport
>>>             (engine-client-2)
>>>             The message "W [MSGID: 114031]
>>>             [client-rpc-fops.c:2938:client3_3_lookup_cbk]
>>>             0-engine-client-2: remote operation failed. Path:
>>>             /27f054c3-c245-4039-b42a-c28b37043016
>>>             (a64398f5-3fa3-48fe-9d40-d3860876cc2c) [Transport
>>>             endpoint is not connected]" repeated 2 times between
>>>             [2016-11-24 12:33:43.682441] and [2016-11-24
>>>             12:33:43.682599]
>>>             [2016-11-24 12:33:43.688324] W [MSGID: 114031]
>>>             [client-rpc-fops.c:2938:client3_3_lookup_cbk]
>>>             0-engine-client-2: remote operation failed. Path: (null)
>>>             (00000000-0000-0000-0000-000000000000) [Transport
>>>             endpoint is not connected]
>>>
>>>             Before that there was a lot of self-healing activity.
>>>
>>             Simone, these logs indicate that the first host is not
>>             able to connect to the brick on the third host since it
>>             was powered down. It reads 'remote operation failed' on
>>             engine-client-2.
>>
>>
>>         Can you please share the output of
>>          source /etc/ovirt-hosted-engine/hosted-engine.conf
>>          find /rhev/data-center/ -path
>>         "*/${sdUUID}/images/${conf_image_UUID}/${conf_volume_UUID}"
>>         -type f -exec sh -c 'sudo -u vdsm dd if=$1 2>/dev/null | tar
>>         -xOvf - vm.conf  2>/dev/null' {} {} \;
>>         executed on your first host?
>         Output for the above command:
>         https://paste.fedoraproject.org/489159/97176147/
>
>
>     This one looks fine;
>     now the point is why the OVF_STORE doesn't seem so.
>
>     Could you please try executing:
>     sudo -u vdsm dd
>     if=/rhev/data-center/mnt/glusterSD/10.70.36.79:_engine/27f054c3-c245-4039-b42a-c28b37043016/images/fdf49778-9a06-49c6-bf7a-a0f12425911c/8c954add-6bcf-47f8-ac2e-4c85fc3f8699 |
>     tar -xOvf - 21e0e248-19bf-47b3-b72f-6a3740d9ff43.ovf
>
>
> OK, found: it's definitely a regression introduced (just on master)
> with https://gerrit.ovirt.org/#/c/66103/
>
> The code that reads and writes the HE configuration and the OVF_STORE
> from/to the volume on shared storage (with dd) is shared between
> ovirt-hosted-engine-setup (which runs as root) and ovirt-ha-agent
> (which runs as the vdsm user).
> In order to behave correctly (against root squash on NFS, for
> instance), it always ran 'sudo -u vdsm dd ...' instead of just
> 'dd', regardless of the active user.
> But since https://gerrit.ovirt.org/#/c/66103/ , the vdsm user can no
> longer run 'sudo -u vdsm dd ...', hence this issue.
>
> I opened a bug:
> https://bugzilla.redhat.com/show_bug.cgi?id=1398443
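>
> As an illustration only (not the actual ovirt-hosted-engine-ha code), a
> guard like the following minimal Python sketch would avoid the problem,
> by prepending sudo only when the caller is not already the vdsm user;
> VDSM_USER and the helper name are assumptions:
>
> import getpass
> import subprocess
>
> VDSM_USER = "vdsm"  # assumed service user name
>
> def dd_as_vdsm(args):
>     # Run dd as the vdsm user; skip sudo when we already are vdsm,
>     # since the vdsm user may not be allowed to call sudo at all.
>     cmd = ["dd"] + list(args)
>     if getpass.getuser() != VDSM_USER:
>         cmd = ["sudo", "-u", VDSM_USER] + cmd
>     return subprocess.check_output(cmd)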
Is that the same reason why the vm.conf file is missing from the
/var/run/ovirt-hosted-engine-ha folder?
>
> Kasturi, you got it just on one of your hosts since the other two run 
> an older VDSM build.
>
>
>         Simone, I ran the following steps and I see that the vm.conf
>         file is missing.
>
>         1) Installed three nodes with glusterfs as storage domains.
>
>         2) Configured power management on all the nodes and ran the
>         command 'poweroff -f' on one of the nodes.
>
>         3) The engine tried to fence the host and logged the message
>         'Host is rebooting', but the host never came back up.
>
>         4) After some time I manually powered on the host, and once
>         the host was up I saw Hosted Engine HA : Not Active.
>
>         5) I see the following error message in broker.log, and the
>         vm.conf file is missing.
>
>         Thread-1::ERROR::2016-11-24
>         19:32:11,651::listener::192::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle)
>         Error handling request, data: 'set-storage-domain
>          FilesystemBackend dom_type=glusterfs
>         sd_uuid=27f054c3-c245-4039-b42a-c28b37043016'
>         Traceback (most recent call last):
>           File
>         "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/listener.py",
>         line 166, in handle
>             data)
>           File
>         "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/listener.py",
>         line 299, in _dispatch
>             .set_storage_domain(client, sd_type, **options)
>           File
>         "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py",
>         line 66, in set_storage_domain
>             self._backends[client].connect()
>           File
>         "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/storage_backends.py",
>         line 461, in connect
>             self._dom_type)
>           File
>         "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/storage_backends.py",
>         line 108, in get_domain_path
>             " in {1}".format(sd_uuid, parent))
>         BackendFailureException: path to storage domain
>         27f054c3-c245-4039-b42a-c28b37043016 not found in
>         /rhev/data-center/mnt/glusterSD
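>
>         The traceback indicates that the broker could not find a
>         directory for that storage domain UUID under the glusterSD
>         mount point, which suggests the gluster volume was not yet
>         mounted when the broker tried to connect. A minimal Python
>         sketch of an equivalent check (the scan logic here is an
>         assumption, not the exact storage_backends.py code):
>
>         import os
>
>         MNT_PARENT = "/rhev/data-center/mnt/glusterSD"
>         SD_UUID = "27f054c3-c245-4039-b42a-c28b37043016"
>
>         def find_domain_path(parent, sd_uuid):
>             # Look for a <mount>/<sd_uuid> directory under the mount parent.
>             for mount in os.listdir(parent):
>                 candidate = os.path.join(parent, mount, sd_uuid)
>                 if os.path.isdir(candidate):
>                     return candidate
>             raise RuntimeError("path to storage domain %s not found in %s"
>                                % (sd_uuid, parent))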
>
>         Thanks
>         kasturi
>
>>
>>
>>>
>>>>                     [1]
>>>>                     http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/HC/vm_conf/
>>>>
>>>>                     Thanks
>>>>
>>>>                     kasturi
>>>>
>>>>
>>>
>>>
>>
>>
>
>
>
