vm.conf on one of the nodes is missing

Hi, I have three nodes with glusterfs as storage domain. For some reason I see that vm.conf from /var/run/ovirt-hosted-engine-ha is missing and, due to this, on one of my hosts I see that Hosted Engine HA : Not Active. Once I copy the file from some other node and restart the ovirt-ha-broker and ovirt-ha-agent services, everything works fine. But then this happens again. Can someone please help me identify why this happens? Below is the log I see in the ovirt-ha-agent logs.
https://paste.fedoraproject.org/489120/79990345/
Thanks
kasturi

On Thu, Nov 24, 2016 at 1:26 PM, knarra <knarra@redhat.com> wrote:
Hi,
I have three nodes with glusterfs as storage domain. For some reason I see that vm.conf from /var/run/ovirt-hosted-engine-ha is missing and, due to this, on one of my hosts I see that Hosted Engine HA : Not Active. Once I copy the file from some other node and restart the ovirt-ha-broker and ovirt-ha-agent services, everything works fine. But then this happens again. Can someone please help me identify why this happens? Below is the log I see in the ovirt-ha-agent logs.
Once the engine has correctly imported the hosted-engine storage domain, a couple of OVF_STORE volumes will appear there. Every modification to the engine VM configuration will be written by the engine into that OVF_STORE, so all the ovirt-ha-agent instances running on the hosted-engine hosts will be able to re-start the engine VM with a coherent configuration.
Till the engine imports the hosted-engine storage domain, ovirt-ha-agent will fall back to the initial vm.conf.
In your case the OVF_STORE volume is there, but the agent fails extracting the engine VM configuration:
MainThread::INFO::2016-11-24 17:55:04,914::ovf_store::112::ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(getEngineVMOVF) Extracting Engine VM OVF from the OVF_STORE
MainThread::INFO::2016-11-24 17:55:04,919::ovf_store::119::ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(getEngineVMOVF) OVF_STORE volume path: /rhev/data-center/mnt/glusterSD/10.70.36.79:_engine/27f054c3-c245-4039-b42a-c28b37043016/images/fdf49778-9a06-49c6-bf7a-a0f12425911c/8c954add-6bcf-47f8-ac2e-4c85fc3f8699
MainThread::ERROR::2016-11-24 17:55:04,928::ovf_store::124::ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(getEngineVMOVF) Unable to extract HEVM OVF
So it tries to roll back to the initial vm.conf, but also that one seems to be missing some values and so the agent is failing:
MainThread::ERROR::2016-11-24 17:55:04,974::agent::205::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Error: ''Configuration value not found: file=/var/run/ovirt-hosted-engine-ha/vm.conf, key=memSize'' - trying to restart agent
Both of the issues seem storage related; could you please share your gluster logs?
Thanks
kasturi
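A quick sanity check for the missing/incomplete vm.conf described above (just a sketch, assuming the default hosted-engine paths) is to verify on each host that the runtime copy still exists and still carries the key the agent is asking for, and to look at the HA agent/broker logs for the reason it could not be refreshed:
  # check that the runtime copy of vm.conf exists and still carries memSize
  ls -l /var/run/ovirt-hosted-engine-ha/vm.conf
  grep -E '^memSize' /var/run/ovirt-hosted-engine-ha/vm.conf
  # the HA agent and broker logs usually show why the file could not be refreshed
  tail -n 50 /var/log/ovirt-hosted-engine-ha/agent.log /var/log/ovirt-hosted-engine-ha/broker.log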

On 11/24/2016 06:15 PM, Simone Tiraboschi wrote:
On Thu, Nov 24, 2016 at 1:26 PM, knarra <knarra@redhat.com> wrote:
Hi,
I have three nodes with glusterfs as storage domain. For some reason I see that vm.conf from /var/run/ovirt-hosted-engine-ha is missing and, due to this, on one of my hosts I see that Hosted Engine HA : Not Active. Once I copy the file from some other node and restart the ovirt-ha-broker and ovirt-ha-agent services, everything works fine. But then this happens again. Can someone please help me identify why this happens? Below is the log I see in the ovirt-ha-agent logs.
https://paste.fedoraproject.org/489120/79990345/
Once the engine has correctly imported the hosted-engine storage domain, a couple of OVF_STORE volumes will appear there. Every modification to the engine VM configuration will be written by the engine into that OVF_STORE, so all the ovirt-ha-agent instances running on the hosted-engine hosts will be able to re-start the engine VM with a coherent configuration.
Till the engine imports the hosted-engine storage domain, ovirt-ha-agent will fall back to the initial vm.conf.
In your case the OVF_STORE volume is there, but the agent fails extracting the engine VM configuration:
MainThread::INFO::2016-11-24 17:55:04,914::ovf_store::112::ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(getEngineVMOVF) Extracting Engine VM OVF from the OVF_STORE
MainThread::INFO::2016-11-24 17:55:04,919::ovf_store::119::ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(getEngineVMOVF) OVF_STORE volume path: /rhev/data-center/mnt/glusterSD/10.70.36.79:_engine/27f054c3-c245-4039-b42a-c28b37043016/images/fdf49778-9a06-49c6-bf7a-a0f12425911c/8c954add-6bcf-47f8-ac2e-4c85fc3f8699
MainThread::ERROR::2016-11-24 17:55:04,928::ovf_store::124::ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(getEngineVMOVF) Unable to extract HEVM OVF
So it tries to roll back to the initial vm.conf, but also that one seems to be missing some values and so the agent is failing:
MainThread::ERROR::2016-11-24 17:55:04,974::agent::205::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Error: ''Configuration value not found: file=/var/run/ovirt-hosted-engine-ha/vm.conf, key=memSize'' - trying to restart agent
Both of the issues seem storage related; could you please share your gluster logs?
Thanks
kasturi
Hi Simone,
Below [1] is the link for the sosreports on the first two hosts. The third host has some issue. Once it is up I will give the sosreport from there as well.
[1] http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/HC/vm_conf/
Thanks
kasturi

On Thu, Nov 24, 2016 at 2:08 PM, knarra <knarra@redhat.com> wrote:
On 11/24/2016 06:15 PM, Simone Tiraboschi wrote:
On Thu, Nov 24, 2016 at 1:26 PM, knarra <knarra@redhat.com> wrote:
Hi,
I have three nodes with glusterfs as storage domain. For some reason I see that vm.conf from /var/run/ovirt-hosted-engine-ha is missing and, due to this, on one of my hosts I see that Hosted Engine HA : Not Active. Once I copy the file from some other node and restart the ovirt-ha-broker and ovirt-ha-agent services, everything works fine. But then this happens again. Can someone please help me identify why this happens? Below is the log I see in the ovirt-ha-agent logs.
Once the engine has correctly imported the hosted-engine storage domain, a couple of OVF_STORE volumes will appear there. Every modification to the engine VM configuration will be written by the engine into that OVF_STORE, so all the ovirt-ha-agent instances running on the hosted-engine hosts will be able to re-start the engine VM with a coherent configuration.
Till the engine imports the hosted-engine storage domain, ovirt-ha-agent will fall back to the initial vm.conf.
In your case the OVF_STORE volume is there, but the agent fails extracting the engine VM configuration:
MainThread::INFO::2016-11-24 17:55:04,914::ovf_store::112::ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(getEngineVMOVF) Extracting Engine VM OVF from the OVF_STORE
MainThread::INFO::2016-11-24 17:55:04,919::ovf_store::119::ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(getEngineVMOVF) OVF_STORE volume path: /rhev/data-center/mnt/glusterSD/10.70.36.79:_engine/27f054c3-c245-4039-b42a-c28b37043016/images/fdf49778-9a06-49c6-bf7a-a0f12425911c/8c954add-6bcf-47f8-ac2e-4c85fc3f8699
MainThread::ERROR::2016-11-24 17:55:04,928::ovf_store::124::ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(getEngineVMOVF) Unable to extract HEVM OVF
So it tries to roll back to the initial vm.conf, but also that one seems to be missing some values and so the agent is failing:
MainThread::ERROR::2016-11-24 17:55:04,974::agent::205::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Error: ''Configuration value not found: file=/var/run/ovirt-hosted-engine-ha/vm.conf, key=memSize'' - trying to restart agent
Both of the issues seem storage related; could you please share your gluster logs?
Thanks
kasturi
Hi Simone,
Below [1] is the link for the sosreports on the first two hosts. The third host has some issue. Once it is up I will give the sosreport from there as well.
And the host where you see the initial issue was the third one?
[1] http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/HC/vm_conf/
Thanks
kasturi

On 11/24/2016 06:56 PM, Simone Tiraboschi wrote:
On Thu, Nov 24, 2016 at 2:08 PM, knarra <knarra@redhat.com> wrote:
On 11/24/2016 06:15 PM, Simone Tiraboschi wrote:
On Thu, Nov 24, 2016 at 1:26 PM, knarra <knarra@redhat.com> wrote:
Hi,
I have three nodes with glusterfs as storage domain. For some reason I see that vm.conf from /var/run/ovirt-hosted-engine-ha is missing and, due to this, on one of my hosts I see that Hosted Engine HA : Not Active. Once I copy the file from some other node and restart the ovirt-ha-broker and ovirt-ha-agent services, everything works fine. But then this happens again. Can someone please help me identify why this happens? Below is the log I see in the ovirt-ha-agent logs.
https://paste.fedoraproject.org/489120/79990345/
Once the engine has correctly imported the hosted-engine storage domain, a couple of OVF_STORE volumes will appear there. Every modification to the engine VM configuration will be written by the engine into that OVF_STORE, so all the ovirt-ha-agent instances running on the hosted-engine hosts will be able to re-start the engine VM with a coherent configuration.
Till the engine imports the hosted-engine storage domain, ovirt-ha-agent will fall back to the initial vm.conf.
In your case the OVF_STORE volume is there, but the agent fails extracting the engine VM configuration:
MainThread::INFO::2016-11-24 17:55:04,914::ovf_store::112::ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(getEngineVMOVF) Extracting Engine VM OVF from the OVF_STORE
MainThread::INFO::2016-11-24 17:55:04,919::ovf_store::119::ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(getEngineVMOVF) OVF_STORE volume path: /rhev/data-center/mnt/glusterSD/10.70.36.79:_engine/27f054c3-c245-4039-b42a-c28b37043016/images/fdf49778-9a06-49c6-bf7a-a0f12425911c/8c954add-6bcf-47f8-ac2e-4c85fc3f8699
MainThread::ERROR::2016-11-24 17:55:04,928::ovf_store::124::ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(getEngineVMOVF) Unable to extract HEVM OVF
So it tries to roll back to the initial vm.conf, but also that one seems to be missing some values and so the agent is failing:
MainThread::ERROR::2016-11-24 17:55:04,974::agent::205::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Error: ''Configuration value not found: file=/var/run/ovirt-hosted-engine-ha/vm.conf, key=memSize'' - trying to restart agent
Both of the issues seem storage related; could you please share your gluster logs?
Thanks
kasturi
Hi Simone,
Below [1] is the link for the sosreports on the first two hosts. The third host has some issue. Once it is up I will give the sosreport from there as well.
And the host where you see the initial issue was the third one?
It is on the first host.
[1] http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/HC/vm_conf/
Thanks
kasturi

On Thu, Nov 24, 2016 at 2:39 PM, knarra <knarra@redhat.com> wrote:
On 11/24/2016 06:56 PM, Simone Tiraboschi wrote:
On Thu, Nov 24, 2016 at 2:08 PM, knarra <knarra@redhat.com> wrote:
On 11/24/2016 06:15 PM, Simone Tiraboschi wrote:
On Thu, Nov 24, 2016 at 1:26 PM, knarra <knarra@redhat.com> wrote:
Hi,
I have three nodes with glusterfs as storage domain. For some reason I see that vm.conf from /var/run/ovirt-hosted-engine-ha is missing and, due to this, on one of my hosts I see that Hosted Engine HA : Not Active. Once I copy the file from some other node and restart the ovirt-ha-broker and ovirt-ha-agent services, everything works fine. But then this happens again. Can someone please help me identify why this happens? Below is the log I see in the ovirt-ha-agent logs.
Once the engine has correctly imported the hosted-engine storage domain, a couple of OVF_STORE volumes will appear there. Every modification to the engine VM configuration will be written by the engine into that OVF_STORE, so all the ovirt-ha-agent instances running on the hosted-engine hosts will be able to re-start the engine VM with a coherent configuration.
Till the engine imports the hosted-engine storage domain, ovirt-ha-agent will fall back to the initial vm.conf.
In your case the OVF_STORE volume is there, but the agent fails extracting the engine VM configuration:
MainThread::INFO::2016-11-24 17:55:04,914::ovf_store::112::ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(getEngineVMOVF) Extracting Engine VM OVF from the OVF_STORE
MainThread::INFO::2016-11-24 17:55:04,919::ovf_store::119::ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(getEngineVMOVF) OVF_STORE volume path: /rhev/data-center/mnt/glusterSD/10.70.36.79:_engine/27f054c3-c245-4039-b42a-c28b37043016/images/fdf49778-9a06-49c6-bf7a-a0f12425911c/8c954add-6bcf-47f8-ac2e-4c85fc3f8699
MainThread::ERROR::2016-11-24 17:55:04,928::ovf_store::124::ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(getEngineVMOVF) Unable to extract HEVM OVF
So it tries to roll back to the initial vm.conf, but also that one seems to be missing some values and so the agent is failing:
MainThread::ERROR::2016-11-24 17:55:04,974::agent::205::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Error: ''Configuration value not found: file=/var/run/ovirt-hosted-engine-ha/vm.conf, key=memSize'' - trying to restart agent
Both of the issues seem storage related; could you please share your gluster logs?
Thanks
kasturi
Hi Simone,
Below [1] is the link for the sosreports on the first two hosts. The third host has some issue. Once it is up I will give the sosreport from there as well.
And the host where you see the initial issue was the third one?
It is on the first host.
It seems that host1 is failing to read from the hosted-engine storage domain:
[2016-11-24 12:33:43.678467] W [MSGID: 114031] [client-rpc-fops.c:2938:client3_3_lookup_cbk] 0-engine-client-2: remote operation failed. Path: / (00000000-0000-0000-0000-000000000001) [Transport endpoint is not connected]
[2016-11-24 12:33:43.678747] E [rpc-clnt.c:365:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7f077eba1642] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f077e96775e] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f077e96786e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x84)[0x7f077e968fc4] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x120)[0x7f077e9698a0] ))))) 0-engine-client-2: forced unwinding frame type(GlusterFS 3.3) op(LOOKUP(27)) called at 2016-11-24 12:33:07.495178 (xid=0x82a1c)
[2016-11-24 12:33:43.678982] E [rpc-clnt.c:365:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7f077eba1642] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f077e96775e] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f077e96786e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x84)[0x7f077e968fc4] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x120)[0x7f077e9698a0] ))))) 0-engine-client-2: forced unwinding frame type(GlusterFS 3.3) op(LOOKUP(27)) called at 2016-11-24 12:33:08.770637 (xid=0x82a1d)
[2016-11-24 12:33:43.679001] W [MSGID: 114031] [client-rpc-fops.c:2938:client3_3_lookup_cbk] 0-engine-client-2: remote operation failed. Path: /27f054c3-c245-4039-b42a-c28b37043016/images/39960f40-4aae-4714-ba73-1637785fae7c/38fa3519-f21e-4671-8c69-d1497ff8a490 (1090c25b-9c90-434e-a133-faf9647cc992) [Transport endpoint is not connected]
[2016-11-24 12:33:43.679303] E [rpc-clnt.c:365:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7f077eba1642] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f077e96775e] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f077e96786e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x84)[0x7f077e968fc4] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x120)[0x7f077e9698a0] ))))) 0-engine-client-2: forced unwinding frame type(GlusterFS 3.3) op(LOOKUP(27)) called at 2016-11-24 12:33:11.096856 (xid=0x82a1e)
[2016-11-24 12:33:43.679596] E [rpc-clnt.c:365:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7f077eba1642] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f077e96775e] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f077e96786e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x84)[0x7f077e968fc4] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x120)[0x7f077e9698a0] ))))) 0-engine-client-2: forced unwinding frame type(GF-DUMP) op(NULL(2)) called at 2016-11-24 12:33:13.673743 (xid=0x82a1f)
[2016-11-24 12:33:43.682310] I [socket.c:3401:socket_submit_request] 0-engine-client-2: not connected (priv->connected = 0)
[2016-11-24 12:33:43.682328] W [rpc-clnt.c:1640:rpc_clnt_submit] 0-engine-client-2: failed to submit rpc-request (XID: 0x82a20 Program: GlusterFS 3.3, ProgVers: 330, Proc: 30) to rpc-transport (engine-client-2)
[2016-11-24 12:33:43.682391] W [rpc-clnt.c:1640:rpc_clnt_submit] 0-engine-client-2: failed to submit rpc-request (XID: 0x82a21 Program: GlusterFS 3.3, ProgVers: 330, Proc: 27) to rpc-transport (engine-client-2)
[2016-11-24 12:33:43.682441] W [rpc-clnt.c:1640:rpc_clnt_submit] 0-engine-client-2: failed to submit rpc-request (XID: 0x82a22 Program: GlusterFS 3.3, ProgVers: 330, Proc: 27) to rpc-transport (engine-client-2)
[2016-11-24 12:33:43.682441] W [MSGID: 114031] [client-rpc-fops.c:2938:client3_3_lookup_cbk] 0-engine-client-2: remote operation failed. Path: /27f054c3-c245-4039-b42a-c28b37043016 (a64398f5-3fa3-48fe-9d40-d3860876cc2c) [Transport endpoint is not connected]
[2016-11-24 12:33:43.682492] W [rpc-clnt-ping.c:203:rpc_clnt_ping_cbk] 0-engine-client-2: socket disconnected
[2016-11-24 12:33:43.682536] I [MSGID: 114018] [client.c:2280:client_rpc_notify] 0-engine-client-2: disconnected from engine-client-2. Client process will keep trying to connect to glusterd until brick's port is available
[2016-11-24 12:33:43.682562] W [rpc-clnt.c:1640:rpc_clnt_submit] 0-engine-client-2: failed to submit rpc-request (XID: 0x82a23 Program: GlusterFS 3.3, ProgVers: 330, Proc: 27) to rpc-transport (engine-client-2)
The message "W [MSGID: 114031] [client-rpc-fops.c:2938:client3_3_lookup_cbk] 0-engine-client-2: remote operation failed. Path: /27f054c3-c245-4039-b42a-c28b37043016 (a64398f5-3fa3-48fe-9d40-d3860876cc2c) [Transport endpoint is not connected]" repeated 2 times between [2016-11-24 12:33:43.682441] and [2016-11-24 12:33:43.682599]
[2016-11-24 12:33:43.688324] W [MSGID: 114031] [client-rpc-fops.c:2938:client3_3_lookup_cbk] 0-engine-client-2: remote operation failed. Path: (null) (00000000-0000-0000-0000-000000000000) [Transport endpoint is not connected]
Before that there was a lot of self-healing activities.
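A quick way to cross-check the brick connectivity from that host (only a sketch; it assumes the hosted-engine volume is the one named 'engine', as the 0-engine-client-2 translator name suggests):
  # show which bricks of the engine volume are online and on which ports
  gluster volume status engine
  # confirm that all peers are connected from this host's point of view
  gluster peer status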
[1] http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/HC/vm_conf/
Thanks
kasturi

On 11/24/2016 07:27 PM, Simone Tiraboschi wrote:
On Thu, Nov 24, 2016 at 2:39 PM, knarra <knarra@redhat.com> wrote:
On 11/24/2016 06:56 PM, Simone Tiraboschi wrote:
On Thu, Nov 24, 2016 at 2:08 PM, knarra <knarra@redhat.com> wrote:
On 11/24/2016 06:15 PM, Simone Tiraboschi wrote:
On Thu, Nov 24, 2016 at 1:26 PM, knarra <knarra@redhat.com> wrote:
Hi,
I have three nodes with glusterfs as storage domain. For some reason I see that vm.conf from /var/run/ovirt-hosted-engine-ha is missing and, due to this, on one of my hosts I see that Hosted Engine HA : Not Active. Once I copy the file from some other node and restart the ovirt-ha-broker and ovirt-ha-agent services, everything works fine. But then this happens again. Can someone please help me identify why this happens? Below is the log I see in the ovirt-ha-agent logs.
https://paste.fedoraproject.org/489120/79990345/
Once the engine has correctly imported the hosted-engine storage domain, a couple of OVF_STORE volumes will appear there. Every modification to the engine VM configuration will be written by the engine into that OVF_STORE, so all the ovirt-ha-agent instances running on the hosted-engine hosts will be able to re-start the engine VM with a coherent configuration.
Till the engine imports the hosted-engine storage domain, ovirt-ha-agent will fall back to the initial vm.conf.
In your case the OVF_STORE volume is there, but the agent fails extracting the engine VM configuration:
MainThread::INFO::2016-11-24 17:55:04,914::ovf_store::112::ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(getEngineVMOVF) Extracting Engine VM OVF from the OVF_STORE
MainThread::INFO::2016-11-24 17:55:04,919::ovf_store::119::ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(getEngineVMOVF) OVF_STORE volume path: /rhev/data-center/mnt/glusterSD/10.70.36.79:_engine/27f054c3-c245-4039-b42a-c28b37043016/images/fdf49778-9a06-49c6-bf7a-a0f12425911c/8c954add-6bcf-47f8-ac2e-4c85fc3f8699
MainThread::ERROR::2016-11-24 17:55:04,928::ovf_store::124::ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(getEngineVMOVF) Unable to extract HEVM OVF
So it tries to roll back to the initial vm.conf, but also that one seems to be missing some values and so the agent is failing:
MainThread::ERROR::2016-11-24 17:55:04,974::agent::205::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Error: ''Configuration value not found: file=/var/run/ovirt-hosted-engine-ha/vm.conf, key=memSize'' - trying to restart agent
Both of the issues seem storage related; could you please share your gluster logs?
Thanks
kasturi
Hi Simone,
Below [1] is the link for the sosreports on the first two hosts. The third host has some issue. Once it is up I will give the sosreport from there as well.
And the host where you see the initial issue was the third one?
It is on the first host.
It seems that host1 is failing to read from the hosted-engine storage domain:
[2016-11-24 12:33:43.678467] W [MSGID: 114031] [client-rpc-fops.c:2938:client3_3_lookup_cbk] 0-engine-client-2: remote operation failed. Path: / (00000000-0000-0000-0000-000000000001) [Transport endpoint is not connected] [2016-11-24 12:33:43.678747] E [rpc-clnt.c:365:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7f077eba1642] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f077e96775e] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f077e96786e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x84)[0x7f077e968fc4] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x120)[0x7f077e9698a0] ))))) 0-engine-client-2: forced unwinding frame type(GlusterFS 3.3) op(LOOKUP(27)) called at 2016-11-24 12:33:07.495178 (xid=0x82a1c) [2016-11-24 12:33:43.678982] E [rpc-clnt.c:365:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7f077eba1642] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f077e96775e] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f077e96786e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x84)[0x7f077e968fc4] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x120)[0x7f077e9698a0] ))))) 0-engine-client-2: forced unwinding frame type(GlusterFS 3.3) op(LOOKUP(27)) called at 2016-11-24 12:33:08.770637 (xid=0x82a1d) [2016-11-24 12:33:43.679001] W [MSGID: 114031] [client-rpc-fops.c:2938:client3_3_lookup_cbk] 0-engine-client-2: remote operation failed. Path: /27f054c3-c245-4039-b42a-c28b37043016/images/39960f40-4aae-4714-ba73-1637785fae7c/38fa3519-f21e-4671-8c69-d1497ff8a490 (1090c25b-9c90-434e-a133-faf9647cc992) [Transport endpoint is not connected] [2016-11-24 12:33:43.679303] E [rpc-clnt.c:365:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7f077eba1642] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f077e96775e] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f077e96786e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x84)[0x7f077e968fc4] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x120)[0x7f077e9698a0] ))))) 0-engine-client-2: forced unwinding frame type(GlusterFS 3.3) op(LOOKUP(27)) called at 2016-11-24 12:33:11.096856 (xid=0x82a1e) [2016-11-24 12:33:43.679596] E [rpc-clnt.c:365:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7f077eba1642] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f077e96775e] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f077e96786e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x84)[0x7f077e968fc4] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x120)[0x7f077e9698a0] ))))) 0-engine-client-2: forced unwinding frame type(GF-DUMP) op(NULL(2)) called at 2016-11-24 12:33:13.673743 (xid=0x82a1f) [2016-11-24 12:33:43.682310] I [socket.c:3401:socket_submit_request] 0-engine-client-2: not connected (priv->connected = 0) [2016-11-24 12:33:43.682328] W [rpc-clnt.c:1640:rpc_clnt_submit] 0-engine-client-2: failed to submit rpc-request (XID: 0x82a20 Program: GlusterFS 3.3, ProgVers: 330, Proc: 30) to rpc-transport (engine-client-2) [2016-11-24 12:33:43.682391] W [rpc-clnt.c:1640:rpc_clnt_submit] 0-engine-client-2: failed to submit rpc-request (XID: 0x82a21 Program: GlusterFS 3.3, ProgVers: 330, Proc: 27) to rpc-transport (engine-client-2) [2016-11-24 12:33:43.682441] W [rpc-clnt.c:1640:rpc_clnt_submit] 0-engine-client-2: failed to submit rpc-request (XID: 0x82a22 Program: GlusterFS 3.3, ProgVers: 330, Proc: 27) to rpc-transport (engine-client-2) [2016-11-24 12:33:43.682441] W [MSGID: 
114031] [client-rpc-fops.c:2938:client3_3_lookup_cbk] 0-engine-client-2: remote operation failed. Path: /27f054c3-c245-4039-b42a-c28b37043016 (a64398f5-3fa3-48fe-9d40-d3860876cc2c) [Transport endpoint is not connected] [2016-11-24 12:33:43.682492] W [rpc-clnt-ping.c:203:rpc_clnt_ping_cbk] 0-engine-client-2: socket disconnected [2016-11-24 12:33:43.682536] I [MSGID: 114018] [client.c:2280:client_rpc_notify] 0-engine-client-2: disconnected from engine-client-2. Client process will keep trying to connect to glusterd until brick's port is available [2016-11-24 12:33:43.682562] W [rpc-clnt.c:1640:rpc_clnt_submit] 0-engine-client-2: failed to submit rpc-request (XID: 0x82a23 Program: GlusterFS 3.3, ProgVers: 330, Proc: 27) to rpc-transport (engine-client-2) The message "W [MSGID: 114031] [client-rpc-fops.c:2938:client3_3_lookup_cbk] 0-engine-client-2: remote operation failed. Path: /27f054c3-c245-4039-b42a-c28b37043016 (a64398f5-3fa3-48fe-9d40-d3860876cc2c) [Transport endpoint is not connected]" repeated 2 times between [2016-11-24 12:33:43.682441] and [2016-11-24 12:33:43.682599] [2016-11-24 12:33:43.688324] W [MSGID: 114031] [client-rpc-fops.c:2938:client3_3_lookup_cbk] 0-engine-client-2: remote operation failed. Path: (null) (00000000-0000-0000-0000-000000000000) [Transport endpoint is not connected]
Before that there was a lot of self-healing activities.
Simone, these logs indicate that the first host is not able to connect to the brick on the third host since it was powered down. It reads 'remote operation failed' on engine-client-2.
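If the third host was indeed down, a rough way to confirm the volume has fully recovered once it is back (a sketch, again assuming the volume is named 'engine') is to wait until the heal queue drains before expecting the OVF_STORE and conf volumes to be readable again:
  # entries still pending self-heal on the engine volume
  gluster volume heal engine info
  # any files in split-brain would also block clean reads
  gluster volume heal engine info split-brain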
[1] http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/HC/vm_conf/
Thanks
kasturi

On Thu, Nov 24, 2016 at 3:06 PM, knarra <knarra@redhat.com> wrote:
On 11/24/2016 07:27 PM, Simone Tiraboschi wrote:
On Thu, Nov 24, 2016 at 2:39 PM, knarra <knarra@redhat.com> wrote:
On 11/24/2016 06:56 PM, Simone Tiraboschi wrote:
On Thu, Nov 24, 2016 at 2:08 PM, knarra <knarra@redhat.com> wrote:
On 11/24/2016 06:15 PM, Simone Tiraboschi wrote:
On Thu, Nov 24, 2016 at 1:26 PM, knarra <knarra@redhat.com> wrote:
Hi,
I have three nodes with glusterfs as storage domain. For some reason I see that vm.conf from /var/run/ovirt-hosted-engine-ha is missing and, due to this, on one of my hosts I see that Hosted Engine HA : Not Active. Once I copy the file from some other node and restart the ovirt-ha-broker and ovirt-ha-agent services, everything works fine. But then this happens again. Can someone please help me identify why this happens? Below is the log I see in the ovirt-ha-agent logs.
Once the engine has correctly imported the hosted-engine storage domain, a couple of OVF_STORE volumes will appear there. Every modification to the engine VM configuration will be written by the engine into that OVF_STORE, so all the ovirt-ha-agent instances running on the hosted-engine hosts will be able to re-start the engine VM with a coherent configuration.
Till the engine imports the hosted-engine storage domain, ovirt-ha-agent will fall back to the initial vm.conf.
In your case the OVF_STORE volume is there, but the agent fails extracting the engine VM configuration:
MainThread::INFO::2016-11-24 17:55:04,914::ovf_store::112::ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(getEngineVMOVF) Extracting Engine VM OVF from the OVF_STORE
MainThread::INFO::2016-11-24 17:55:04,919::ovf_store::119::ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(getEngineVMOVF) OVF_STORE volume path: /rhev/data-center/mnt/glusterSD/10.70.36.79:_engine/27f054c3-c245-4039-b42a-c28b37043016/images/fdf49778-9a06-49c6-bf7a-a0f12425911c/8c954add-6bcf-47f8-ac2e-4c85fc3f8699
MainThread::ERROR::2016-11-24 17:55:04,928::ovf_store::124::ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(getEngineVMOVF) Unable to extract HEVM OVF
So it tries to roll back to the initial vm.conf, but also that one seems to be missing some values and so the agent is failing:
MainThread::ERROR::2016-11-24 17:55:04,974::agent::205::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Error: ''Configuration value not found: file=/var/run/ovirt-hosted-engine-ha/vm.conf, key=memSize'' - trying to restart agent
Both of the issues seem storage related; could you please share your gluster logs?
Thanks
kasturi
Hi Simone,
Below [1] is the link for the sosreports on the first two hosts. The third host has some issue. Once it is up I will give the sosreport from there as well.
And the host where you see the initial issue was the third one?
It is on the first host.
It seems that host1 is failing to read from the hosted-engine storage domain:
[2016-11-24 12:33:43.678467] W [MSGID: 114031] [client-rpc-fops.c:2938:client3_3_lookup_cbk] 0-engine-client-2: remote operation failed. Path: / (00000000-0000-0000-0000-000000000001) [Transport endpoint is not connected]
[2016-11-24 12:33:43.678747] E [rpc-clnt.c:365:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7f077eba1642] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f077e96775e] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f077e96786e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x84)[0x7f077e968fc4] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x120)[0x7f077e9698a0] ))))) 0-engine-client-2: forced unwinding frame type(GlusterFS 3.3) op(LOOKUP(27)) called at 2016-11-24 12:33:07.495178 (xid=0x82a1c)
[2016-11-24 12:33:43.678982] E [rpc-clnt.c:365:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7f077eba1642] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f077e96775e] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f077e96786e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x84)[0x7f077e968fc4] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x120)[0x7f077e9698a0] ))))) 0-engine-client-2: forced unwinding frame type(GlusterFS 3.3) op(LOOKUP(27)) called at 2016-11-24 12:33:08.770637 (xid=0x82a1d)
[2016-11-24 12:33:43.679001] W [MSGID: 114031] [client-rpc-fops.c:2938:client3_3_lookup_cbk] 0-engine-client-2: remote operation failed. Path: /27f054c3-c245-4039-b42a-c28b37043016/images/39960f40-4aae-4714-ba73-1637785fae7c/38fa3519-f21e-4671-8c69-d1497ff8a490 (1090c25b-9c90-434e-a133-faf9647cc992) [Transport endpoint is not connected]
[2016-11-24 12:33:43.679303] E [rpc-clnt.c:365:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7f077eba1642] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f077e96775e] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f077e96786e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x84)[0x7f077e968fc4] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x120)[0x7f077e9698a0] ))))) 0-engine-client-2: forced unwinding frame type(GlusterFS 3.3) op(LOOKUP(27)) called at 2016-11-24 12:33:11.096856 (xid=0x82a1e)
[2016-11-24 12:33:43.679596] E [rpc-clnt.c:365:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7f077eba1642] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f077e96775e] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f077e96786e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x84)[0x7f077e968fc4] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x120)[0x7f077e9698a0] ))))) 0-engine-client-2: forced unwinding frame type(GF-DUMP) op(NULL(2)) called at 2016-11-24 12:33:13.673743 (xid=0x82a1f)
[2016-11-24 12:33:43.682310] I [socket.c:3401:socket_submit_request] 0-engine-client-2: not connected (priv->connected = 0)
[2016-11-24 12:33:43.682328] W [rpc-clnt.c:1640:rpc_clnt_submit] 0-engine-client-2: failed to submit rpc-request (XID: 0x82a20 Program: GlusterFS 3.3, ProgVers: 330, Proc: 30) to rpc-transport (engine-client-2)
[2016-11-24 12:33:43.682391] W [rpc-clnt.c:1640:rpc_clnt_submit] 0-engine-client-2: failed to submit rpc-request (XID: 0x82a21 Program: GlusterFS 3.3, ProgVers: 330, Proc: 27) to rpc-transport (engine-client-2)
[2016-11-24 12:33:43.682441] W [rpc-clnt.c:1640:rpc_clnt_submit] 0-engine-client-2: failed to submit rpc-request (XID: 0x82a22 Program: GlusterFS 3.3, ProgVers: 330, Proc: 27) to rpc-transport (engine-client-2)
[2016-11-24 12:33:43.682441] W [MSGID: 114031] [client-rpc-fops.c:2938:client3_3_lookup_cbk] 0-engine-client-2: remote operation failed. Path: /27f054c3-c245-4039-b42a-c28b37043016 (a64398f5-3fa3-48fe-9d40-d3860876cc2c) [Transport endpoint is not connected]
[2016-11-24 12:33:43.682492] W [rpc-clnt-ping.c:203:rpc_clnt_ping_cbk] 0-engine-client-2: socket disconnected
[2016-11-24 12:33:43.682536] I [MSGID: 114018] [client.c:2280:client_rpc_notify] 0-engine-client-2: disconnected from engine-client-2. Client process will keep trying to connect to glusterd until brick's port is available
[2016-11-24 12:33:43.682562] W [rpc-clnt.c:1640:rpc_clnt_submit] 0-engine-client-2: failed to submit rpc-request (XID: 0x82a23 Program: GlusterFS 3.3, ProgVers: 330, Proc: 27) to rpc-transport (engine-client-2)
The message "W [MSGID: 114031] [client-rpc-fops.c:2938:client3_3_lookup_cbk] 0-engine-client-2: remote operation failed. Path: /27f054c3-c245-4039-b42a-c28b37043016 (a64398f5-3fa3-48fe-9d40-d3860876cc2c) [Transport endpoint is not connected]" repeated 2 times between [2016-11-24 12:33:43.682441] and [2016-11-24 12:33:43.682599]
[2016-11-24 12:33:43.688324] W [MSGID: 114031] [client-rpc-fops.c:2938:client3_3_lookup_cbk] 0-engine-client-2: remote operation failed. Path: (null) (00000000-0000-0000-0000-000000000000) [Transport endpoint is not connected]
Before that there was a lot of self-healing activities.
Simone, these logs indicate that the first host is not able to connect to the brick on the third host since it was powered down. It reads 'remote operation failed' on engine-client-2.
Can you please share the output of source /etc/ovirt-hosted-engine/hosted-engine.conf find /rhev/data-center/ -path "*/${sdUUID}/images/${conf_image_UUID}/${conf_volume_UUID}" -type f -exec sh -c 'sudo -u vdsm dd if=$1 2>/dev/null | tar -xOvf - vm.conf 2>/dev/null' {} {} \; executed on your first host?
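For readability, the same one-liner expanded (identical commands, just split across lines with comments; the variables are set by sourcing hosted-engine.conf):

    # Load sdUUID, conf_image_UUID and conf_volume_UUID from the HE configuration
    source /etc/ovirt-hosted-engine/hosted-engine.conf

    # Locate the HE configuration volume under the mounted storage domain and
    # stream it through tar to print the embedded vm.conf
    find /rhev/data-center/ \
        -path "*/${sdUUID}/images/${conf_image_UUID}/${conf_volume_UUID}" \
        -type f \
        -exec sh -c 'sudo -u vdsm dd if=$1 2>/dev/null | tar -xOvf - vm.conf 2>/dev/null' {} {} \;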
[1] http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/HC/vm_conf/
Thanks
kasturi

On 11/24/2016 07:47 PM, Simone Tiraboschi wrote:
On Thu, Nov 24, 2016 at 3:06 PM, knarra <knarra@redhat.com> wrote:
On 11/24/2016 07:27 PM, Simone Tiraboschi wrote:
On Thu, Nov 24, 2016 at 2:39 PM, knarra <knarra@redhat.com> wrote:
On 11/24/2016 06:56 PM, Simone Tiraboschi wrote:
On Thu, Nov 24, 2016 at 2:08 PM, knarra <knarra@redhat.com> wrote:
On 11/24/2016 06:15 PM, Simone Tiraboschi wrote:
On Thu, Nov 24, 2016 at 1:26 PM, knarra <knarra@redhat.com> wrote:
Hi,
I have three nodes with glusterfs as storage domain. For some reason i see that vm.conf from /var/run/ovirt-hosted-engine-ha is missing and due to this on one of my host i see that Hosted Engine HA : Not Active. Once i copy the file from some other node and restart ovirt-ha-broker and ovirt-ha-agent services everything works fine. But then this happens again. Can some please help me identify why this happens. Below is the log i see in ovirt-ha-agent.logs.
https://paste.fedoraproject.org/489120/79990345/
Once the engine correctly imported the hosted-engine storage domain, a couple of OVF_STORE volumes will appear there. Every modification to the engine VM configuration will be written by the engine into that OVF_STORE, so all the ovirt-ha-agent running on the hosted-engine hosts will be able to re-start the engine VM with a coherent configuration.
Till the engine imports the hosted-engine storage domain, ovirt-ha-agent will fall back to the initial vm.conf.
In you case the OVF_STORE volume is there, but the agent fails extracting the engine VM configuration: MainThread::INFO::2016-11-24 17:55:04,914::ovf_store::112::ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(getEngineVMOVF) Extracting Engine VM OVF from the OVF_STORE MainThread::INFO::2016-11-24 17:55:04,919::ovf_store::119::ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(getEngineVMOVF) OVF_STORE volume path: /rhev/data-center/mnt/glusterSD/10.70.36.79:_engine/27f054c3-c245-4039-b42a-c28b37043016/images/fdf49778-9a06-49c6-bf7a-a0f12425911c/8c954add-6bcf-47f8-ac2e-4c85fc3f8699 MainThread::ERROR::2016-11-24 17:55:04,928::ovf_store::124::ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(getEngineVMOVF) Unable to extract HEVM OVF
So it tries to rollback to the initial vm.conf, but also that one seams to miss some values and so the agent is failing: MainThread::ERROR::2016-11-24 17:55:04,974::agent::205::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Error: ''Configuration value not found: file=/var/run/ovirt-hosted-engine-ha/vm.conf, key=memSize'' - trying to restart agent
Both of the issue seams storage related, could yuo please share your gluster logs?
Thanks
kasturi
Hi Simone,
Below [1] is the link for the sosreports on the first two hosts. The third host has some issue. Once it is up will give the sosreport from there as well.
And the host where you see the initial issue was the third one?
It is on the first host.
It seems that host1 is failing to read from the hosted-engine storage domain:
[2016-11-24 12:33:43.678467] W [MSGID: 114031] [client-rpc-fops.c:2938:client3_3_lookup_cbk] 0-engine-client-2: remote operation failed. Path: / (00000000-0000-0000-0000-000000000001) [Transport endpoint is not connected] [2016-11-24 12:33:43.678747] E [rpc-clnt.c:365:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7f077eba1642] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f077e96775e] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f077e96786e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x84)[0x7f077e968fc4] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x120)[0x7f077e9698a0] ))))) 0-engine-client-2: forced unwinding frame type(GlusterFS 3.3) op(LOOKUP(27)) called at 2016-11-24 12:33:07.495178 (xid=0x82a1c) [2016-11-24 12:33:43.678982] E [rpc-clnt.c:365:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7f077eba1642] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f077e96775e] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f077e96786e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x84)[0x7f077e968fc4] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x120)[0x7f077e9698a0] ))))) 0-engine-client-2: forced unwinding frame type(GlusterFS 3.3) op(LOOKUP(27)) called at 2016-11-24 12:33:08.770637 (xid=0x82a1d) [2016-11-24 12:33:43.679001] W [MSGID: 114031] [client-rpc-fops.c:2938:client3_3_lookup_cbk] 0-engine-client-2: remote operation failed. Path: /27f054c3-c245-4039-b42a-c28b37043016/images/39960f40-4aae-4714-ba73-1637785fae7c/38fa3519-f21e-4671-8c69-d1497ff8a490 (1090c25b-9c90-434e-a133-faf9647cc992) [Transport endpoint is not connected] [2016-11-24 12:33:43.679303] E [rpc-clnt.c:365:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7f077eba1642] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f077e96775e] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f077e96786e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x84)[0x7f077e968fc4] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x120)[0x7f077e9698a0] ))))) 0-engine-client-2: forced unwinding frame type(GlusterFS 3.3) op(LOOKUP(27)) called at 2016-11-24 12:33:11.096856 (xid=0x82a1e) [2016-11-24 12:33:43.679596] E [rpc-clnt.c:365:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7f077eba1642] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f077e96775e] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f077e96786e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x84)[0x7f077e968fc4] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x120)[0x7f077e9698a0] ))))) 0-engine-client-2: forced unwinding frame type(GF-DUMP) op(NULL(2)) called at 2016-11-24 12:33:13.673743 (xid=0x82a1f) [2016-11-24 12:33:43.682310] I [socket.c:3401:socket_submit_request] 0-engine-client-2: not connected (priv->connected = 0) [2016-11-24 12:33:43.682328] W [rpc-clnt.c:1640:rpc_clnt_submit] 0-engine-client-2: failed to submit rpc-request (XID: 0x82a20 Program: GlusterFS 3.3, ProgVers: 330, Proc: 30) to rpc-transport (engine-client-2) [2016-11-24 12:33:43.682391] W [rpc-clnt.c:1640:rpc_clnt_submit] 0-engine-client-2: failed to submit rpc-request (XID: 0x82a21 Program: GlusterFS 3.3, ProgVers: 330, Proc: 27) to rpc-transport (engine-client-2) [2016-11-24 12:33:43.682441] W [rpc-clnt.c:1640:rpc_clnt_submit] 0-engine-client-2: failed to submit rpc-request (XID: 0x82a22 Program: GlusterFS 3.3, ProgVers: 330, Proc: 27) to rpc-transport (engine-client-2) [2016-11-24 12:33:43.682441] W [MSGID: 
114031] [client-rpc-fops.c:2938:client3_3_lookup_cbk] 0-engine-client-2: remote operation failed. Path: /27f054c3-c245-4039-b42a-c28b37043016 (a64398f5-3fa3-48fe-9d40-d3860876cc2c) [Transport endpoint is not connected] [2016-11-24 12:33:43.682492] W [rpc-clnt-ping.c:203:rpc_clnt_ping_cbk] 0-engine-client-2: socket disconnected [2016-11-24 12:33:43.682536] I [MSGID: 114018] [client.c:2280:client_rpc_notify] 0-engine-client-2: disconnected from engine-client-2. Client process will keep trying to connect to glusterd until brick's port is available [2016-11-24 12:33:43.682562] W [rpc-clnt.c:1640:rpc_clnt_submit] 0-engine-client-2: failed to submit rpc-request (XID: 0x82a23 Program: GlusterFS 3.3, ProgVers: 330, Proc: 27) to rpc-transport (engine-client-2) The message "W [MSGID: 114031] [client-rpc-fops.c:2938:client3_3_lookup_cbk] 0-engine-client-2: remote operation failed. Path: /27f054c3-c245-4039-b42a-c28b37043016 (a64398f5-3fa3-48fe-9d40-d3860876cc2c) [Transport endpoint is not connected]" repeated 2 times between [2016-11-24 12:33:43.682441] and [2016-11-24 12:33:43.682599] [2016-11-24 12:33:43.688324] W [MSGID: 114031] [client-rpc-fops.c:2938:client3_3_lookup_cbk] 0-engine-client-2: remote operation failed. Path: (null) (00000000-0000-0000-0000-000000000000) [Transport endpoint is not connected]
Before that there was a lot of self-healing activities.
simone, these logs indicates that first host is not able to connect to the brick in the third host since it was powered down. It reads remote operation failed on engine_client_2 .
Can you please share the output of source /etc/ovirt-hosted-engine/hosted-engine.conf find /rhev/data-center/ -path "*/${sdUUID}/images/${conf_image_UUID}/${conf_volume_UUID}" -type f -exec sh -c 'sudo -u vdsm dd if=$1 2>/dev/null | tar -xOvf - vm.conf 2>/dev/null' {} {} \; executed on your first host?
output for the above command: https://paste.fedoraproject.org/489159/97176147/

Simone, I ran the following steps and I see that the vm.conf file is missing.

1) Installed three nodes with glusterfs as storage domains.
2) Configured power management on all the nodes and ran the command 'poweroff -f' on one of the nodes.
3) Engine tried to fence the host and it logged the message 'Host is rebooting', but the host never came up.
4) After some time I manually powered the host back on, and once the host was up I see that Hosted Engine HA : Not active.
5) I see the following error message in broker.log and the vm.conf file is missing.

Thread-1::ERROR::2016-11-24 19:32:11,651::listener::192::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) Error handling request, data: 'set-storage-domain FilesystemBackend dom_type=glusterfs sd_uuid=27f054c3-c245-4039-b42a-c28b37043016'
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/listener.py", line 166, in handle
    data)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/listener.py", line 299, in _dispatch
    .set_storage_domain(client, sd_type, **options)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py", line 66, in set_storage_domain
    self._backends[client].connect()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/storage_backends.py", line 461, in connect
    self._dom_type)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/storage_backends.py", line 108, in get_domain_path
    " in {1}".format(sd_uuid, parent))
BackendFailureException: path to storage domain 27f054c3-c245-4039-b42a-c28b37043016 not found in /rhev/data-center/mnt/glusterSD

Thanks
kasturi
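The BackendFailureException above only says that the broker could not find the storage domain directory under the gluster mount point. A few quick checks that could be run on the affected host (a sketch; the domain UUID and the volume name 'engine' are taken from the logs in this thread):

    # Is the hosted-engine gluster volume actually mounted on this host?
    mount | grep glusterSD

    # Does the storage domain directory exist under the mount point?
    ls -ld /rhev/data-center/mnt/glusterSD/*/27f054c3-c245-4039-b42a-c28b37043016

    # Are all bricks of the engine volume online?
    gluster volume status engine

    # What does the HA stack itself report for this host?
    hosted-engine --vm-status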
[1] http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/HC/vm_conf/
Thanks
kasturi

On Thu, Nov 24, 2016 at 3:23 PM, knarra <knarra@redhat.com> wrote:
On 11/24/2016 07:47 PM, Simone Tiraboschi wrote:
On Thu, Nov 24, 2016 at 3:06 PM, knarra <knarra@redhat.com> wrote:
On 11/24/2016 07:27 PM, Simone Tiraboschi wrote:
On Thu, Nov 24, 2016 at 2:39 PM, knarra <knarra@redhat.com> wrote:
On 11/24/2016 06:56 PM, Simone Tiraboschi wrote:
On Thu, Nov 24, 2016 at 2:08 PM, knarra <knarra@redhat.com> wrote:
On 11/24/2016 06:15 PM, Simone Tiraboschi wrote:
On Thu, Nov 24, 2016 at 1:26 PM, knarra <knarra@redhat.com> wrote:
Hi,
I have three nodes with glusterfs as storage domain. For some reason i see that vm.conf from /var/run/ovirt-hosted-engine-ha is missing and due to this on one of my host i see that Hosted Engine HA : Not Active. Once i copy the file from some other node and restart ovirt-ha-broker and ovirt-ha-agent services everything works fine. But then this happens again. Can some please help me identify why this happens. Below is the log i see in ovirt-ha-agent.logs.
Once the engine correctly imported the hosted-engine storage domain, a couple of OVF_STORE volumes will appear there. Every modification to the engine VM configuration will be written by the engine into that OVF_STORE, so all the ovirt-ha-agent running on the hosted-engine hosts will be able to re-start the engine VM with a coherent configuration.
Till the engine imports the hosted-engine storage domain, ovirt-ha-agent will fall back to the initial vm.conf.
In you case the OVF_STORE volume is there, but the agent fails extracting the engine VM configuration: MainThread::INFO::2016-11-24 17:55:04,914::ovf_store::112:: ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(getEngineVMOVF) Extracting Engine VM OVF from the OVF_STORE MainThread::INFO::2016-11-24 17:55:04,919::ovf_store::119:: ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(getEngineVMOVF) OVF_STORE volume path: /rhev/data-center/mnt/glusterSD/10.70.36.79: _engine/27f054c3-c245-4039-b42a-c28b37043016/i mages/fdf49778-9a06-49c6-bf7a-a0f12425911c/8c954add-6bcf-47f 8-ac2e-4c85fc3f8699 MainThread::ERROR::2016-11-24 17:55:04,928::ovf_store::124:: ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(getEngineVMOVF) Unable to extract HEVM OVF
So it tries to rollback to the initial vm.conf, but also that one seams to miss some values and so the agent is failing: MainThread::ERROR::2016-11-24 17:55:04,974::agent::205::ovir t_hosted_engine_ha.agent.agent.Agent::(_run_agent) Error: ''Configuration value not found: file=/var/run/ovirt-hosted-engine-ha/vm.conf, key=memSize'' - trying to restart agent
Both of the issue seams storage related, could yuo please share your gluster logs?
Thanks
kasturi
Hi Simone,
Below [1] is the link for the sosreports on the first two hosts. The third host has some issue. Once it is up will give the sosreport from there as well.
And the host where you see the initial issue was the third one?
It is on the first host.
It seams that host1 is failing reading from the the hosted-engine storage domain:
[2016-11-24 12:33:43.678467] W [MSGID: 114031] [client-rpc-fops.c:2938:client3_3_lookup_cbk] 0-engine-client-2: remote operation failed. Path: / (00000000-0000-0000-0000-000000000001) [Transport endpoint is not connected] [2016-11-24 12:33:43.678747] E [rpc-clnt.c:365:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7f077eba1642] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f077e96775e] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f077e96786e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x84)[0x7f077e968fc4] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x120)[0x7f077e9698a0] ))))) 0-engine-client-2: forced unwinding frame type(GlusterFS 3.3) op(LOOKUP(27)) called at 2016-11-24 12:33:07.495178 (xid=0x82a1c) [2016-11-24 12:33:43.678982] E [rpc-clnt.c:365:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7f077eba1642] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f077e96775e] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f077e96786e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x84)[0x7f077e968fc4] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x120)[0x7f077e9698a0] ))))) 0-engine-client-2: forced unwinding frame type(GlusterFS 3.3) op(LOOKUP(27)) called at 2016-11-24 12:33:08.770637 (xid=0x82a1d) [2016-11-24 12:33:43.679001] W [MSGID: 114031] [client-rpc-fops.c:2938:client3_3_lookup_cbk] 0-engine-client-2: remote operation failed. Path: /27f054c3-c245-4039-b42a-c28b3 7043016/images/39960f40-4aae-4714-ba73-1637785fae7c/38fa3519-f21e-4671-8c69-d1497ff8a490 (1090c25b-9c90-434e-a133-faf9647cc992) [Transport endpoint is not connected] [2016-11-24 12:33:43.679303] E [rpc-clnt.c:365:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7f077eba1642] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f077e96775e] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f077e96786e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x84)[0x7f077e968fc4] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x120)[0x7f077e9698a0] ))))) 0-engine-client-2: forced unwinding frame type(GlusterFS 3.3) op(LOOKUP(27)) called at 2016-11-24 12:33:11.096856 (xid=0x82a1e) [2016-11-24 12:33:43.679596] E [rpc-clnt.c:365:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7f077eba1642] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f077e96775e] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f077e96786e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x84)[0x7f077e968fc4] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x120)[0x7f077e9698a0] ))))) 0-engine-client-2: forced unwinding frame type(GF-DUMP) op(NULL(2)) called at 2016-11-24 12:33:13.673743 (xid=0x82a1f) [2016-11-24 12:33:43.682310] I [socket.c:3401:socket_submit_request] 0-engine-client-2: not connected (priv->connected = 0) [2016-11-24 12:33:43.682328] W [rpc-clnt.c:1640:rpc_clnt_submit] 0-engine-client-2: failed to submit rpc-request (XID: 0x82a20 Program: GlusterFS 3.3, ProgVers: 330, Proc: 30) to rpc-transport (engine-client-2) [2016-11-24 12:33:43.682391] W [rpc-clnt.c:1640:rpc_clnt_submit] 0-engine-client-2: failed to submit rpc-request (XID: 0x82a21 Program: GlusterFS 3.3, ProgVers: 330, Proc: 27) to rpc-transport (engine-client-2) [2016-11-24 12:33:43.682441] W [rpc-clnt.c:1640:rpc_clnt_submit] 0-engine-client-2: failed to submit rpc-request (XID: 0x82a22 Program: GlusterFS 3.3, ProgVers: 330, Proc: 27) to rpc-transport (engine-client-2) [2016-11-24 12:33:43.682441] W [MSGID: 
114031] [client-rpc-fops.c:2938:client3_3_lookup_cbk] 0-engine-client-2: remote operation failed. Path: /27f054c3-c245-4039-b42a-c28b37043016 (a64398f5-3fa3-48fe-9d40-d3860876cc2c) [Transport endpoint is not connected] [2016-11-24 12:33:43.682492] W [rpc-clnt-ping.c:203:rpc_clnt_ping_cbk] 0-engine-client-2: socket disconnected [2016-11-24 12:33:43.682536] I [MSGID: 114018] [client.c:2280:client_rpc_notify] 0-engine-client-2: disconnected from engine-client-2. Client process will keep trying to connect to glusterd until brick's port is available [2016-11-24 12:33:43.682562] W [rpc-clnt.c:1640:rpc_clnt_submit] 0-engine-client-2: failed to submit rpc-request (XID: 0x82a23 Program: GlusterFS 3.3, ProgVers: 330, Proc: 27) to rpc-transport (engine-client-2) The message "W [MSGID: 114031] [client-rpc-fops.c:2938:client3_3_lookup_cbk] 0-engine-client-2: remote operation failed. Path: /27f054c3-c245-4039-b42a-c28b37043016 (a64398f5-3fa3-48fe-9d40-d3860876cc2c) [Transport endpoint is not connected]" repeated 2 times between [2016-11-24 12:33:43.682441] and [2016-11-24 12:33:43.682599] [2016-11-24 12:33:43.688324] W [MSGID: 114031] [client-rpc-fops.c:2938:client3_3_lookup_cbk] 0-engine-client-2: remote operation failed. Path: (null) (00000000-0000-0000-0000-000000000000) [Transport endpoint is not connected]
Before that there was a lot of self-healing activities.
simone, these logs indicates that first host is not able to connect to the brick in the third host since it was powered down. It reads remote operation failed on engine_client_2 .
Can you please share the output of source /etc/ovirt-hosted-engine/hosted-engine.conf find /rhev/data-center/ -path "*/${sdUUID}/images/${conf_image_UUID}/${conf_volume_UUID}" -type f -exec sh -c 'sudo -u vdsm dd if=$1 2>/dev/null | tar -xOvf - vm.conf 2>/dev/null' {} {} \; executed on your first host?
output for the above command https://paste.fedoraproject.org/489159/97176147/
This one looks fine; now the point is why the OVF_STORE doesn't seem so. Could you please try executing: sudo -u vdsm dd if=/rhev/data-center/mnt/glusterSD/10.70.36.79:_engine/27f054c3-c245-4039-b42a-c28b37043016/images/fdf49778-9a06-49c6-bf7a-a0f12425911c/8c954add-6bcf-47f8-ac2e-4c85fc3f8699 | tar -xOvf - 21e0e248-19bf-47b3-b72f-6a3740d9ff43.ovf Simone, I ran the following steps and I see that the vm.conf file is missing.
1) Installed three nodes with glusterfs as storage domains.
2) Configured power management on all the nodes and ran the command 'poweroff -f' on one of the node.
3) Engine tried to fence the host and it logged the message 'Host is rebooting' but this never came up.
4) After some time i manually powered on the host and once the host is up i see that Hosted Engine HA : Not active.
5) I see the following error message in broker.log and vm.conf file is missing.
Thread-1::ERROR::2016-11-24 19:32:11,651::listener::192:: ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) Error handling request, data: 'set-storage-domain FilesystemBackend dom_type=glusterfs sd_uuid=27f054c3-c245-4039- b42a-c28b37043016' Traceback (most recent call last): File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/listener.py", line 166, in handle data) File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/listener.py", line 299, in _dispatch .set_storage_domain(client, sd_type, **options) File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py", line 66, in set_storage_domain self._backends[client].connect() File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/storage_backends.py", line 461, in connect self._dom_type) File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/storage_backends.py", line 108, in get_domain_path " in {1}".format(sd_uuid, parent)) BackendFailureException: path to storage domain 27f054c3-c245-4039-b42a-c28b37043016 not found in /rhev/data-center/mnt/glusterSD
Thanks kasturi
[1] http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/HC/vm_conf/
Thanks
kasturi

On Thu, Nov 24, 2016 at 3:35 PM, Simone Tiraboschi <stirabos@redhat.com> wrote:
On Thu, Nov 24, 2016 at 3:23 PM, knarra <knarra@redhat.com> wrote:
On 11/24/2016 07:47 PM, Simone Tiraboschi wrote:
On Thu, Nov 24, 2016 at 3:06 PM, knarra <knarra@redhat.com> wrote:
On 11/24/2016 07:27 PM, Simone Tiraboschi wrote:
On Thu, Nov 24, 2016 at 2:39 PM, knarra <knarra@redhat.com> wrote:
On 11/24/2016 06:56 PM, Simone Tiraboschi wrote:
On Thu, Nov 24, 2016 at 2:08 PM, knarra <knarra@redhat.com> wrote:
On 11/24/2016 06:15 PM, Simone Tiraboschi wrote:
On Thu, Nov 24, 2016 at 1:26 PM, knarra <knarra@redhat.com> wrote:
Hi,
I have three nodes with glusterfs as storage domain. For some reason i see that vm.conf from /var/run/ovirt-hosted-engine-ha is missing and due to this on one of my host i see that Hosted Engine HA : Not Active. Once i copy the file from some other node and restart ovirt-ha-broker and ovirt-ha-agent services everything works fine. But then this happens again. Can some please help me identify why this happens. Below is the log i see in ovirt-ha-agent.logs.
Once the engine correctly imported the hosted-engine storage domain, a couple of OVF_STORE volumes will appear there. Every modification to the engine VM configuration will be written by the engine into that OVF_STORE, so all the ovirt-ha-agent running on the hosted-engine hosts will be able to re-start the engine VM with a coherent configuration.
Till the engine imports the hosted-engine storage domain, ovirt-ha-agent will fall back to the initial vm.conf.
In you case the OVF_STORE volume is there, but the agent fails extracting the engine VM configuration: MainThread::INFO::2016-11-24 17:55:04,914::ovf_store::112:: ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(getEngineVMOVF) Extracting Engine VM OVF from the OVF_STORE MainThread::INFO::2016-11-24 17:55:04,919::ovf_store::119:: ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(getEngineVMOVF) OVF_STORE volume path: /rhev/data-center/mnt/glusterSD/10.70.36.79: _engine/27f054c3-c245-4039-b42a-c28b37043016/i mages/fdf49778-9a06-49c6-bf7a-a0f12425911c/8c954add-6bcf-47f 8-ac2e-4c85fc3f8699 MainThread::ERROR::2016-11-24 17:55:04,928::ovf_store::124:: ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(getEngineVMOVF) Unable to extract HEVM OVF
So it tries to rollback to the initial vm.conf, but also that one seams to miss some values and so the agent is failing: MainThread::ERROR::2016-11-24 17:55:04,974::agent::205::ovir t_hosted_engine_ha.agent.agent.Agent::(_run_agent) Error: ''Configuration value not found: file=/var/run/ovirt-hosted-engine-ha/vm.conf, key=memSize'' - trying to restart agent
Both of the issue seams storage related, could yuo please share your gluster logs?
Thanks
kasturi
Hi Simone,
Below [1] is the link for the sosreports on the first two hosts. The third host has some issue. Once it is up will give the sosreport from there as well.
And the host where you see the initial issue was the third one?
It is on the first host.
It seams that host1 is failing reading from the the hosted-engine storage domain:
[2016-11-24 12:33:43.678467] W [MSGID: 114031] [client-rpc-fops.c:2938:client3_3_lookup_cbk] 0-engine-client-2: remote operation failed. Path: / (00000000-0000-0000-0000-000000000001) [Transport endpoint is not connected] [2016-11-24 12:33:43.678747] E [rpc-clnt.c:365:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7f077eba1642] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f077e96775e] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f077e96786e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x84)[0x7f077e968fc4] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x120)[0x7f077e9698a0] ))))) 0-engine-client-2: forced unwinding frame type(GlusterFS 3.3) op(LOOKUP(27)) called at 2016-11-24 12:33:07.495178 (xid=0x82a1c) [2016-11-24 12:33:43.678982] E [rpc-clnt.c:365:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7f077eba1642] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f077e96775e] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f077e96786e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x84)[0x7f077e968fc4] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x120)[0x7f077e9698a0] ))))) 0-engine-client-2: forced unwinding frame type(GlusterFS 3.3) op(LOOKUP(27)) called at 2016-11-24 12:33:08.770637 (xid=0x82a1d) [2016-11-24 12:33:43.679001] W [MSGID: 114031] [client-rpc-fops.c:2938:client3_3_lookup_cbk] 0-engine-client-2: remote operation failed. Path: /27f054c3-c245-4039-b42a-c28b3 7043016/images/39960f40-4aae-4714-ba73-1637785fae7c/38fa3519-f21e-4671-8c69-d1497ff8a490 (1090c25b-9c90-434e-a133-faf9647cc992) [Transport endpoint is not connected] [2016-11-24 12:33:43.679303] E [rpc-clnt.c:365:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7f077eba1642] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f077e96775e] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f077e96786e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x84)[0x7f077e968fc4] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x120)[0x7f077e9698a0] ))))) 0-engine-client-2: forced unwinding frame type(GlusterFS 3.3) op(LOOKUP(27)) called at 2016-11-24 12:33:11.096856 (xid=0x82a1e) [2016-11-24 12:33:43.679596] E [rpc-clnt.c:365:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7f077eba1642] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f077e96775e] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f077e96786e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x84)[0x7f077e968fc4] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x120)[0x7f077e9698a0] ))))) 0-engine-client-2: forced unwinding frame type(GF-DUMP) op(NULL(2)) called at 2016-11-24 12:33:13.673743 (xid=0x82a1f) [2016-11-24 12:33:43.682310] I [socket.c:3401:socket_submit_request] 0-engine-client-2: not connected (priv->connected = 0) [2016-11-24 12:33:43.682328] W [rpc-clnt.c:1640:rpc_clnt_submit] 0-engine-client-2: failed to submit rpc-request (XID: 0x82a20 Program: GlusterFS 3.3, ProgVers: 330, Proc: 30) to rpc-transport (engine-client-2) [2016-11-24 12:33:43.682391] W [rpc-clnt.c:1640:rpc_clnt_submit] 0-engine-client-2: failed to submit rpc-request (XID: 0x82a21 Program: GlusterFS 3.3, ProgVers: 330, Proc: 27) to rpc-transport (engine-client-2) [2016-11-24 12:33:43.682441] W [rpc-clnt.c:1640:rpc_clnt_submit] 0-engine-client-2: failed to submit rpc-request (XID: 0x82a22 Program: GlusterFS 3.3, ProgVers: 330, Proc: 27) to rpc-transport (engine-client-2) [2016-11-24 12:33:43.682441] W [MSGID: 
114031] [client-rpc-fops.c:2938:client3_3_lookup_cbk] 0-engine-client-2: remote operation failed. Path: /27f054c3-c245-4039-b42a-c28b37043016 (a64398f5-3fa3-48fe-9d40-d3860876cc2c) [Transport endpoint is not connected] [2016-11-24 12:33:43.682492] W [rpc-clnt-ping.c:203:rpc_clnt_ping_cbk] 0-engine-client-2: socket disconnected [2016-11-24 12:33:43.682536] I [MSGID: 114018] [client.c:2280:client_rpc_notify] 0-engine-client-2: disconnected from engine-client-2. Client process will keep trying to connect to glusterd until brick's port is available [2016-11-24 12:33:43.682562] W [rpc-clnt.c:1640:rpc_clnt_submit] 0-engine-client-2: failed to submit rpc-request (XID: 0x82a23 Program: GlusterFS 3.3, ProgVers: 330, Proc: 27) to rpc-transport (engine-client-2) The message "W [MSGID: 114031] [client-rpc-fops.c:2938:client3_3_lookup_cbk] 0-engine-client-2: remote operation failed. Path: /27f054c3-c245-4039-b42a-c28b37043016 (a64398f5-3fa3-48fe-9d40-d3860876cc2c) [Transport endpoint is not connected]" repeated 2 times between [2016-11-24 12:33:43.682441] and [2016-11-24 12:33:43.682599] [2016-11-24 12:33:43.688324] W [MSGID: 114031] [client-rpc-fops.c:2938:client3_3_lookup_cbk] 0-engine-client-2: remote operation failed. Path: (null) (00000000-0000-0000-0000-000000000000) [Transport endpoint is not connected]
Before that there was a lot of self-healing activities.
simone, these logs indicates that first host is not able to connect to the brick in the third host since it was powered down. It reads remote operation failed on engine_client_2 .
Can you please share the output of source /etc/ovirt-hosted-engine/hosted-engine.conf find /rhev/data-center/ -path "*/${sdUUID}/images/${conf_image_UUID}/${conf_volume_UUID}" -type f -exec sh -c 'sudo -u vdsm dd if=$1 2>/dev/null | tar -xOvf - vm.conf 2>/dev/null' {} {} \; executed on your first host?
output for the above command https://paste.fedoraproject.org/489159/97176147/
This one looks fine; now the point is why the OVF_STORE doesn't seem so.
Could you please try executing: sudo -u vdsm dd if=/rhev/data-center/mnt/glusterSD/10.70.36.79:_engine/27f054c3-c245-4039-b42a-c28b37043016/images/fdf49778-9a06-49c6-bf7a-a0f12425911c/8c954add-6bcf-47f8-ac2e-4c85fc3f8699 | tar -xOvf - 21e0e248-19bf-47b3-b72f-6a3740d9ff43.ovf
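For reference, the same extraction laid out (the volume path and the OVF entry name are the ones reported by ovirt-ha-agent earlier in this thread):

    # OVF_STORE volume path as logged by ovirt-ha-agent
    OVF_VOL="/rhev/data-center/mnt/glusterSD/10.70.36.79:_engine/27f054c3-c245-4039-b42a-c28b37043016/images/fdf49778-9a06-49c6-bf7a-a0f12425911c/8c954add-6bcf-47f8-ac2e-4c85fc3f8699"

    # The OVF_STORE volume is a tar archive: read it as the vdsm user and
    # print the engine VM OVF entry to stdout
    sudo -u vdsm dd if="$OVF_VOL" | tar -xOvf - 21e0e248-19bf-47b3-b72f-6a3740d9ff43.ovf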
OK, found it: it's definitely a regression introduced (just on master) with https://gerrit.ovirt.org/#/c/66103/

The code that reads and writes the HE configuration and the OVF_STORE from/to the volume on shared storage (with dd) is shared between ovirt-hosted-engine-setup (which runs as root) and ovirt-ha-agent (which runs as the vdsm user). In order to behave correctly (against root squash on NFS, for instance), it was always running 'sudo -u vdsm dd ...' instead of just 'dd', regardless of the active user. But since https://gerrit.ovirt.org/#/c/66103/ , the vdsm user can no longer run 'sudo -u vdsm dd ...', hence this issue.

I opened a bug: https://bugzilla.redhat.com/show_bug.cgi?id=1398443

Kasturi, you got it just on one of your hosts since the other two run an older VDSM build.
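In other words the shared helper unconditionally prefixes the dd call with 'sudo -u vdsm', which breaks once the vdsm user can no longer sudo to itself. A rough illustration of the difference, in shell only for clarity (the real code is Python inside ovirt-hosted-engine-ha, and VOL_PATH here is just a placeholder):

    # What the helper effectively does today, even when it already runs as vdsm:
    #     sudo -u vdsm dd if="$VOL_PATH" ...    # fails for the vdsm user after the change above
    # A user-aware variant would only switch user when actually needed:
    if [ "$(id -un)" = "vdsm" ]; then
        dd if="$VOL_PATH"                  # ovirt-ha-agent: already vdsm, read directly
    else
        sudo -u vdsm dd if="$VOL_PATH"     # ovirt-hosted-engine-setup: running as root
    fi

To confirm which hosts are affected, the installed builds can be compared with something like 'rpm -q vdsm ovirt-hosted-engine-ha' on each host.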
Simone, i ran the following steps and i see that vm.conf file is missing.
1) Installed three nodes with glusterfs as storage domains.
2) Configured power management on all the nodes and ran the command 'poweroff -f' on one of the node.
3) Engine tried to fence the host and it logged the message 'Host is rebooting' but this never came up.
4) After some time i manually powered on the host and once the host is up i see that Hosted Engine HA : Not active.
5) I see the following error message in broker.log and vm.conf file is missing.
Thread-1::ERROR::2016-11-24 19:32:11,651::listener::192::o virt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) Error handling request, data: 'set-storage-domain FilesystemBackend dom_type=glusterfs sd_uuid=27f054c3-c245-4039-b42 a-c28b37043016' Traceback (most recent call last): File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/listener.py", line 166, in handle data) File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/listener.py", line 299, in _dispatch .set_storage_domain(client, sd_type, **options) File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py", line 66, in set_storage_domain self._backends[client].connect() File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/storage_backends.py", line 461, in connect self._dom_type) File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/storage_backends.py", line 108, in get_domain_path " in {1}".format(sd_uuid, parent)) BackendFailureException: path to storage domain 27f054c3-c245-4039-b42a-c28b37043016 not found in /rhev/data-center/mnt/glusterSD
Thanks kasturi
[1] http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/HC/vm_conf/
Thanks
kasturi

On 11/25/2016 03:35 AM, Simone Tiraboschi wrote:
On Thu, Nov 24, 2016 at 3:35 PM, Simone Tiraboschi <stirabos@redhat.com> wrote:
On Thu, Nov 24, 2016 at 3:23 PM, knarra <knarra@redhat.com> wrote:
On 11/24/2016 07:47 PM, Simone Tiraboschi wrote:
On Thu, Nov 24, 2016 at 3:06 PM, knarra <knarra@redhat.com> wrote:
On 11/24/2016 07:27 PM, Simone Tiraboschi wrote:
On Thu, Nov 24, 2016 at 2:39 PM, knarra <knarra@redhat.com> wrote:
On 11/24/2016 06:56 PM, Simone Tiraboschi wrote:
On Thu, Nov 24, 2016 at 2:08 PM, knarra <knarra@redhat.com> wrote:
On 11/24/2016 06:15 PM, Simone Tiraboschi wrote:
On Thu, Nov 24, 2016 at 1:26 PM, knarra <knarra@redhat.com> wrote:
Hi,
I have three nodes with glusterfs as storage domain. For some reason i see that vm.conf from /var/run/ovirt-hosted-engine-ha is missing and due to this on one of my host i see that Hosted Engine HA : Not Active. Once i copy the file from some other node and restart ovirt-ha-broker and ovirt-ha-agent services everything works fine. But then this happens again. Can some please help me identify why this happens. Below is the log i see in ovirt-ha-agent.logs.
https://paste.fedoraproject.org/489120/79990345/
Once the engine correctly imported the hosted-engine storage domain, a couple of OVF_STORE volumes will appear there. Every modification to the engine VM configuration will be written by the engine into that OVF_STORE, so all the ovirt-ha-agent running on the hosted-engine hosts will be able to re-start the engine VM with a coherent configuration.
Till the engine imports the hosted-engine storage domain, ovirt-ha-agent will fall back to the initial vm.conf.
In you case the OVF_STORE volume is there, but the agent fails extracting the engine VM configuration: MainThread::INFO::2016-11-24 17:55:04,914::ovf_store::112::ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(getEngineVMOVF) Extracting Engine VM OVF from the OVF_STORE MainThread::INFO::2016-11-24 17:55:04,919::ovf_store::119::ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(getEngineVMOVF) OVF_STORE volume path: /rhev/data-center/mnt/glusterSD/10.70.36.79:_engine/27f054c3-c245-4039-b42a-c28b37043016/images/fdf49778-9a06-49c6-bf7a-a0f12425911c/8c954add-6bcf-47f8-ac2e-4c85fc3f8699 MainThread::ERROR::2016-11-24 17:55:04,928::ovf_store::124::ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(getEngineVMOVF) Unable to extract HEVM OVF
So it tries to rollback to the initial vm.conf, but also that one seams to miss some values and so the agent is failing: MainThread::ERROR::2016-11-24 17:55:04,974::agent::205::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Error: ''Configuration value not found: file=/var/run/ovirt-hosted-engine-ha/vm.conf, key=memSize'' - trying to restart agent
Both of the issue seams storage related, could yuo please share your gluster logs?
Thanks
kasturi
Hi Simone,
Below [1] is the link for the sosreports on the first two hosts. The third host has some issue. Once it is up will give the sosreport from there as well.
And the host where you see the initial issue was the third one?
It is on the first host.
It seams that host1 is failing reading from the the hosted-engine storage domain:
[2016-11-24 12:33:43.678467] W [MSGID: 114031] [client-rpc-fops.c:2938:client3_3_lookup_cbk] 0-engine-client-2: remote operation failed. Path: / (00000000-0000-0000-0000-000000000001) [Transport endpoint is not connected] [2016-11-24 12:33:43.678747] E [rpc-clnt.c:365:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7f077eba1642] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f077e96775e] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f077e96786e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x84)[0x7f077e968fc4] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x120)[0x7f077e9698a0] ))))) 0-engine-client-2: forced unwinding frame type(GlusterFS 3.3) op(LOOKUP(27)) called at 2016-11-24 12:33:07.495178 (xid=0x82a1c) [2016-11-24 12:33:43.678982] E [rpc-clnt.c:365:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7f077eba1642] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f077e96775e] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f077e96786e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x84)[0x7f077e968fc4] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x120)[0x7f077e9698a0] ))))) 0-engine-client-2: forced unwinding frame type(GlusterFS 3.3) op(LOOKUP(27)) called at 2016-11-24 12:33:08.770637 (xid=0x82a1d) [2016-11-24 12:33:43.679001] W [MSGID: 114031] [client-rpc-fops.c:2938:client3_3_lookup_cbk] 0-engine-client-2: remote operation failed. Path: /27f054c3-c245-4039-b42a-c28b37043016/images/39960f40-4aae-4714-ba73-1637785fae7c/38fa3519-f21e-4671-8c69-d1497ff8a490 (1090c25b-9c90-434e-a133-faf9647cc992) [Transport endpoint is not connected] [2016-11-24 12:33:43.679303] E [rpc-clnt.c:365:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7f077eba1642] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f077e96775e] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f077e96786e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x84)[0x7f077e968fc4] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x120)[0x7f077e9698a0] ))))) 0-engine-client-2: forced unwinding frame type(GlusterFS 3.3) op(LOOKUP(27)) called at 2016-11-24 12:33:11.096856 (xid=0x82a1e) [2016-11-24 12:33:43.679596] E [rpc-clnt.c:365:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7f077eba1642] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f077e96775e] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f077e96786e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x84)[0x7f077e968fc4] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x120)[0x7f077e9698a0] ))))) 0-engine-client-2: forced unwinding frame type(GF-DUMP) op(NULL(2)) called at 2016-11-24 12:33:13.673743 (xid=0x82a1f) [2016-11-24 12:33:43.682310] I [socket.c:3401:socket_submit_request] 0-engine-client-2: not connected (priv->connected = 0) [2016-11-24 12:33:43.682328] W [rpc-clnt.c:1640:rpc_clnt_submit] 0-engine-client-2: failed to submit rpc-request (XID: 0x82a20 Program: GlusterFS 3.3, ProgVers: 330, Proc: 30) to rpc-transport (engine-client-2) [2016-11-24 12:33:43.682391] W [rpc-clnt.c:1640:rpc_clnt_submit] 0-engine-client-2: failed to submit rpc-request (XID: 0x82a21 Program: GlusterFS 3.3, ProgVers: 330, Proc: 27) to rpc-transport (engine-client-2) [2016-11-24 12:33:43.682441] W [rpc-clnt.c:1640:rpc_clnt_submit] 0-engine-client-2: failed to submit rpc-request (XID: 0x82a22 Program: GlusterFS 3.3, ProgVers: 330, Proc: 27) to rpc-transport (engine-client-2) [2016-11-24 12:33:43.682441] W [MSGID: 
114031] [client-rpc-fops.c:2938:client3_3_lookup_cbk] 0-engine-client-2: remote operation failed. Path: /27f054c3-c245-4039-b42a-c28b37043016 (a64398f5-3fa3-48fe-9d40-d3860876cc2c) [Transport endpoint is not connected] [2016-11-24 12:33:43.682492] W [rpc-clnt-ping.c:203:rpc_clnt_ping_cbk] 0-engine-client-2: socket disconnected [2016-11-24 12:33:43.682536] I [MSGID: 114018] [client.c:2280:client_rpc_notify] 0-engine-client-2: disconnected from engine-client-2. Client process will keep trying to connect to glusterd until brick's port is available [2016-11-24 12:33:43.682562] W [rpc-clnt.c:1640:rpc_clnt_submit] 0-engine-client-2: failed to submit rpc-request (XID: 0x82a23 Program: GlusterFS 3.3, ProgVers: 330, Proc: 27) to rpc-transport (engine-client-2) The message "W [MSGID: 114031] [client-rpc-fops.c:2938:client3_3_lookup_cbk] 0-engine-client-2: remote operation failed. Path: /27f054c3-c245-4039-b42a-c28b37043016 (a64398f5-3fa3-48fe-9d40-d3860876cc2c) [Transport endpoint is not connected]" repeated 2 times between [2016-11-24 12:33:43.682441] and [2016-11-24 12:33:43.682599] [2016-11-24 12:33:43.688324] W [MSGID: 114031] [client-rpc-fops.c:2938:client3_3_lookup_cbk] 0-engine-client-2: remote operation failed. Path: (null) (00000000-0000-0000-0000-000000000000) [Transport endpoint is not connected]
Before that there was a lot of self-healing activity.
Simone, these logs indicate that the first host is not able to connect to the brick on the third host since it was powered down. It reads 'remote operation failed' on engine-client-2.
Can you please share the output of

source /etc/ovirt-hosted-engine/hosted-engine.conf
find /rhev/data-center/ -path "*/${sdUUID}/images/${conf_image_UUID}/${conf_volume_UUID}" -type f -exec sh -c 'sudo -u vdsm dd if=$1 2>/dev/null | tar -xOvf - vm.conf 2>/dev/null' {} {} \;

executed on your first host?
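For reference, the one-liner above just reads three UUIDs from hosted-engine.conf and extracts the initial vm.conf from the matching configuration volume. A minimal Python 3 sketch of the same idea (assuming hosted-engine.conf keeps its usual flat key=value layout; the key names are the ones the shell command itself references):

import glob

def read_he_conf(path="/etc/ovirt-hosted-engine/hosted-engine.conf"):
    # hosted-engine.conf is a flat key=value file; this mirrors the
    # 'source' step in the shell one-liner above.
    conf = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#") and "=" in line:
                key, value = line.split("=", 1)
                conf[key] = value.strip('"')
    return conf

conf = read_he_conf()
# The find command then looks for the configuration volume under the
# storage domain mount, i.e. something like
# /rhev/data-center/mnt/glusterSD/<server:_volume>/<sdUUID>/images/<conf_image_UUID>/<conf_volume_UUID>
pattern = "/rhev/data-center/**/{0}/images/{1}/{2}".format(
    conf["sdUUID"], conf["conf_image_UUID"], conf["conf_volume_UUID"])
print(glob.glob(pattern, recursive=True))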
Output for the above command: https://paste.fedoraproject.org/489159/97176147/
This one looks fine; now the point is why the OVF_STORE doesn't seem to be.
Could you please try executing:

sudo -u vdsm dd if=/rhev/data-center/mnt/glusterSD/10.70.36.79:_engine/27f054c3-c245-4039-b42a-c28b37043016/images/fdf49778-9a06-49c6-bf7a-a0f12425911c/8c954add-6bcf-47f8-ac2e-4c85fc3f8699 | tar -xOvf - 21e0e248-19bf-47b3-b72f-6a3740d9ff43.ovf
OK, found: it's definitely a regression introduced (just on master) with https://gerrit.ovirt.org/#/c/66103/
The code that reads and writes the HE configuration and the OVF_STORE from/to the volume on shared storage (with dd) is shared between ovirt-hosted-engine-setup (which runs as root) and ovirt-ha-agent (which runs as the vdsm user). In order to behave correctly (against root squash on NFS, for instance), it always ran 'sudo -u vdsm dd ...' instead of just 'dd', regardless of the active user. But since https://gerrit.ovirt.org/#/c/66103/ the vdsm user can no longer run 'sudo -u vdsm dd ...', hence this issue.
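As a rough illustration of the pattern described here (a sketch, not the actual ovirt-hosted-engine-ha code): the shared helper always re-executes dd through 'sudo -u vdsm', whether the caller is root (ovirt-hosted-engine-setup) or already the vdsm user (ovirt-ha-agent), so dropping the vdsm sudoers rule breaks the agent's path while the setup path keeps working:

import subprocess

def read_raw_volume(path, block_size=4096):
    # Always go through 'sudo -u vdsm dd', regardless of the current user.
    # Run as root this still works; run as the vdsm user it now fails,
    # because the sudoers rule allowing vdsm to sudo was removed by the
    # VDSM change referenced above.
    cmd = ["sudo", "-u", "vdsm", "dd",
           "if={0}".format(path), "bs={0}".format(block_size)]
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, err = proc.communicate()
    if proc.returncode != 0:
        # Roughly where ovirt-ha-agent gives up on the OVF_STORE and falls
        # back to (and then also fails on) the initial vm.conf.
        raise RuntimeError("dd failed: {0}".format(err))
    return out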
I opened a bug: https://bugzilla.redhat.com/show_bug.cgi?id=1398443
Is it the same reason why the vm.conf file is missing from the /var/run/ovirt-hosted-engine-ha folder?
Kasturi, you got it just on one of your hosts since the other two run an older VDSM build.
Simone, I ran the following steps and I see that the vm.conf file is missing.
1) Installed three nodes with glusterfs as the storage domain.
2) Configured power management on all the nodes and ran 'poweroff -f' on one of the nodes.
3) The engine tried to fence the host and logged the message 'Host is rebooting', but the host never came up.
4) After some time I manually powered on the host, and once the host was up I see Hosted Engine HA : Not active.
5) I see the following error message in broker.log, and the vm.conf file is missing.
Thread-1::ERROR::2016-11-24 19:32:11,651::listener::192::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) Error handling request, data: 'set-storage-domain FilesystemBackend dom_type=glusterfs sd_uuid=27f054c3-c245-4039-b42a-c28b37043016'
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/listener.py", line 166, in handle
    data)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/listener.py", line 299, in _dispatch
    .set_storage_domain(client, sd_type, **options)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py", line 66, in set_storage_domain
    self._backends[client].connect()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/storage_backends.py", line 461, in connect
    self._dom_type)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/storage_backends.py", line 108, in get_domain_path
    " in {1}".format(sd_uuid, parent))
BackendFailureException: path to storage domain 27f054c3-c245-4039-b42a-c28b37043016 not found in /rhev/data-center/mnt/glusterSD
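For what it's worth, the failing check is easy to picture. A simplified sketch (not the real storage_backends.py; only the names visible in the traceback are taken from it) of what get_domain_path is doing when it raises that exception:

import glob
import os

class BackendFailureException(Exception):
    # Stand-in for the real exception class used by the broker.
    pass

def get_domain_path(sd_uuid, parent="/rhev/data-center/mnt/glusterSD"):
    # The broker expects the gluster storage domain to already be mounted
    # under /rhev/data-center/mnt/glusterSD/<server:_volume>/<sd_uuid>;
    # if the mount is not there yet (e.g. right after the host came back
    # up), the lookup fails with the error seen in broker.log.
    for candidate in glob.glob(os.path.join(parent, "*", sd_uuid)):
        if os.path.isdir(candidate):
            return candidate
    raise BackendFailureException(
        "path to storage domain {0} not found in {1}".format(sd_uuid, parent))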
Thanks kasturi
[1] http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/HC/vm_conf/
Thanks
kasturi

On Fri, Nov 25, 2016 at 7:52 AM, knarra <knarra@redhat.com> wrote:
On 11/25/2016 03:35 AM, Simone Tiraboschi wrote:
On Thu, Nov 24, 2016 at 3:35 PM, Simone Tiraboschi <stirabos@redhat.com> wrote:
On Thu, Nov 24, 2016 at 3:23 PM, knarra <knarra@redhat.com> wrote:
On 11/24/2016 07:47 PM, Simone Tiraboschi wrote:
On Thu, Nov 24, 2016 at 3:06 PM, knarra <knarra@redhat.com> wrote:
On 11/24/2016 07:27 PM, Simone Tiraboschi wrote:
On Thu, Nov 24, 2016 at 2:39 PM, knarra <knarra@redhat.com> wrote:
On 11/24/2016 06:56 PM, Simone Tiraboschi wrote:
On Thu, Nov 24, 2016 at 2:08 PM, knarra <knarra@redhat.com> wrote:
On 11/24/2016 06:15 PM, Simone Tiraboschi wrote:
On Thu, Nov 24, 2016 at 1:26 PM, knarra <knarra@redhat.com> wrote:
> Hi, > > I have three nodes with glusterfs as storage domain. For some > reason i see that vm.conf from /var/run/ovirt-hosted-engine-ha is > missing and due to this on one of my host i see that Hosted Engine HA : Not > Active. Once i copy the file from some other node and restart > ovirt-ha-broker and ovirt-ha-agent services everything works fine. But then > this happens again. Can some please help me identify why this happens. > Below is the log i see in ovirt-ha-agent.logs. > > > https://paste.fedoraproject.org/489120/79990345/ > > Once the engine correctly imported the hosted-engine storage domain, a couple of OVF_STORE volumes will appear there. Every modification to the engine VM configuration will be written by the engine into that OVF_STORE, so all the ovirt-ha-agent running on the hosted-engine hosts will be able to re-start the engine VM with a coherent configuration.
Till the engine imports the hosted-engine storage domain, ovirt-ha-agent will fall back to the initial vm.conf.
In you case the OVF_STORE volume is there, but the agent fails extracting the engine VM configuration: MainThread::INFO::2016-11-24 17:55:04,914::ovf_store::112:: ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(getEngineVMOVF) Extracting Engine VM OVF from the OVF_STORE MainThread::INFO::2016-11-24 17:55:04,919::ovf_store::119:: ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(getEngineVMOVF) OVF_STORE volume path: /rhev/data-center/mnt/glusterSD/10.70.36.79: _engine/27f054c3-c245-4039-b42a-c28b37043016/i mages/fdf49778-9a06-49c6-bf7a-a0f12425911c/8c954add-6bcf-47f 8-ac2e-4c85fc3f8699 MainThread::ERROR::2016-11-24 17:55:04,928::ovf_store::124:: ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(getEngineVMOVF) Unable to extract HEVM OVF
So it tries to rollback to the initial vm.conf, but also that one seams to miss some values and so the agent is failing: MainThread::ERROR::2016-11-24 17:55:04,974::agent::205::ovir t_hosted_engine_ha.agent.agent.Agent::(_run_agent) Error: ''Configuration value not found: file=/var/run/ovirt-hosted-engine-ha/vm.conf, key=memSize'' - trying to restart agent
Both of the issue seams storage related, could yuo please share your gluster logs?
> > > Thanks > > kasturi > > Hi Simone,
Below [1] is the link for the sosreports on the first two hosts. The third host has some issue. Once it is up will give the sosreport from there as well.
And the host where you see the initial issue was the third one?
It is on the first host.
It seems that host1 is failing reading from the hosted-engine storage domain:
[2016-11-24 12:33:43.678467] W [MSGID: 114031] [client-rpc-fops.c:2938:client3_3_lookup_cbk] 0-engine-client-2: remote operation failed. Path: / (00000000-0000-0000-0000-000000000001) [Transport endpoint is not connected]
[2016-11-24 12:33:43.678747] E [rpc-clnt.c:365:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7f077eba1642] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f077e96775e] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f077e96786e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x84)[0x7f077e968fc4] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x120)[0x7f077e9698a0] ))))) 0-engine-client-2: forced unwinding frame type(GlusterFS 3.3) op(LOOKUP(27)) called at 2016-11-24 12:33:07.495178 (xid=0x82a1c)
[2016-11-24 12:33:43.678982] E [rpc-clnt.c:365:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7f077eba1642] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f077e96775e] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f077e96786e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x84)[0x7f077e968fc4] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x120)[0x7f077e9698a0] ))))) 0-engine-client-2: forced unwinding frame type(GlusterFS 3.3) op(LOOKUP(27)) called at 2016-11-24 12:33:08.770637 (xid=0x82a1d)
[2016-11-24 12:33:43.679001] W [MSGID: 114031] [client-rpc-fops.c:2938:client3_3_lookup_cbk] 0-engine-client-2: remote operation failed. Path: /27f054c3-c245-4039-b42a-c28b37043016/images/39960f40-4aae-4714-ba73-1637785fae7c/38fa3519-f21e-4671-8c69-d1497ff8a490 (1090c25b-9c90-434e-a133-faf9647cc992) [Transport endpoint is not connected]
[2016-11-24 12:33:43.679303] E [rpc-clnt.c:365:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7f077eba1642] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f077e96775e] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f077e96786e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x84)[0x7f077e968fc4] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x120)[0x7f077e9698a0] ))))) 0-engine-client-2: forced unwinding frame type(GlusterFS 3.3) op(LOOKUP(27)) called at 2016-11-24 12:33:11.096856 (xid=0x82a1e)
[2016-11-24 12:33:43.679596] E [rpc-clnt.c:365:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7f077eba1642] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f077e96775e] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f077e96786e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x84)[0x7f077e968fc4] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x120)[0x7f077e9698a0] ))))) 0-engine-client-2: forced unwinding frame type(GF-DUMP) op(NULL(2)) called at 2016-11-24 12:33:13.673743 (xid=0x82a1f)
[2016-11-24 12:33:43.682310] I [socket.c:3401:socket_submit_request] 0-engine-client-2: not connected (priv->connected = 0)
[2016-11-24 12:33:43.682328] W [rpc-clnt.c:1640:rpc_clnt_submit] 0-engine-client-2: failed to submit rpc-request (XID: 0x82a20 Program: GlusterFS 3.3, ProgVers: 330, Proc: 30) to rpc-transport (engine-client-2)
[2016-11-24 12:33:43.682391] W [rpc-clnt.c:1640:rpc_clnt_submit] 0-engine-client-2: failed to submit rpc-request (XID: 0x82a21 Program: GlusterFS 3.3, ProgVers: 330, Proc: 27) to rpc-transport (engine-client-2)
[2016-11-24 12:33:43.682441] W [rpc-clnt.c:1640:rpc_clnt_submit] 0-engine-client-2: failed to submit rpc-request (XID: 0x82a22 Program: GlusterFS 3.3, ProgVers: 330, Proc: 27) to rpc-transport (engine-client-2)
[2016-11-24 12:33:43.682441] W [MSGID: 114031] [client-rpc-fops.c:2938:client3_3_lookup_cbk] 0-engine-client-2: remote operation failed. Path: /27f054c3-c245-4039-b42a-c28b37043016 (a64398f5-3fa3-48fe-9d40-d3860876cc2c) [Transport endpoint is not connected]
[2016-11-24 12:33:43.682492] W [rpc-clnt-ping.c:203:rpc_clnt_ping_cbk] 0-engine-client-2: socket disconnected
[2016-11-24 12:33:43.682536] I [MSGID: 114018] [client.c:2280:client_rpc_notify] 0-engine-client-2: disconnected from engine-client-2. Client process will keep trying to connect to glusterd until brick's port is available
[2016-11-24 12:33:43.682562] W [rpc-clnt.c:1640:rpc_clnt_submit] 0-engine-client-2: failed to submit rpc-request (XID: 0x82a23 Program: GlusterFS 3.3, ProgVers: 330, Proc: 27) to rpc-transport (engine-client-2)
The message "W [MSGID: 114031] [client-rpc-fops.c:2938:client3_3_lookup_cbk] 0-engine-client-2: remote operation failed. Path: /27f054c3-c245-4039-b42a-c28b37043016 (a64398f5-3fa3-48fe-9d40-d3860876cc2c) [Transport endpoint is not connected]" repeated 2 times between [2016-11-24 12:33:43.682441] and [2016-11-24 12:33:43.682599]
[2016-11-24 12:33:43.688324] W [MSGID: 114031] [client-rpc-fops.c:2938:client3_3_lookup_cbk] 0-engine-client-2: remote operation failed. Path: (null) (00000000-0000-0000-0000-000000000000) [Transport endpoint is not connected]
Before that there was a lot of self-healing activities.
simone, these logs indicates that first host is not able to connect to the brick in the third host since it was powered down. It reads remote operation failed on engine_client_2 .
Can you please share the output of source /etc/ovirt-hosted-engine/hosted-engine.conf find /rhev/data-center/ -path "*/${sdUUID}/images/${conf_image_UUID}/${conf_volume_UUID}" -type f -exec sh -c 'sudo -u vdsm dd if=$1 2>/dev/null | tar -xOvf - vm.conf 2>/dev/null' {} {} \; executed on your first host?
output for the above command https://paste.fedoraproject.org/489159/97176147/
This one looks fine, now the point is why the OVF_STORE doesn't seams so,
Could you please try executing: sudo -u vdsm dd if=/rhev/data-center/mnt/glusterSD/10.70.36.79:_engine/27f054c3-c245-4039-b42a-c28b37043016/images/fdf49778-9a06-49c6-bf7a-a0f12425911c/8c954add-6bcf-47f8-ac2e-4c85fc3f8699 | tar -xOvf - 21e0e248-19bf-47b3-b72f-6a3740d9ff43.ovf
OK, found: it's definitively a regression introduced (just on master) with https://gerrit.ovirt.org/#/c/66103/
The code that read and writes the HE configuration and the OVF_STORE from/to the volume on shared storage (with dd) is shared between ovirt-hosted-engine-setup (which runs as root) and ovirt-ha-agent (which runs as vdsm user). In order to behave correctly (against root squash on NFC for instance), it was always running 'sudo -u vdsm dd ...' instead of just 'dd' regardless of the active user. But since https://gerrit.ovirt.org/#/c/66103/ , vdsm user cannot run anymore 'sudo -u vdsm dd ...' and so this issue.
I opened a bug: https://bugzilla.redhat.com/show_bug.cgi?id=1398443
Is it the same reason why vm.conf file is missing from /var/run/ovirt-hosted-engine-ha folder ?
/var/run/ovirt-hosted-engine-ha is just a temporary location on the host where the agent saves the file extracted from the OVF_STORE. The temporary file is not there since the agent fails reading both the OVF_STORE and the initial vm.conf due to the missing sudoers rule.
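In other words, the fallback the agent attempts looks roughly like this (a simplified sketch, not the actual code; the two reader functions are placeholders for the agent's OVF_STORE and initial-vm.conf readers, both of which go through the broken 'sudo -u vdsm dd' path in the affected build):

import os

RUN_DIR = "/var/run/ovirt-hosted-engine-ha"

def refresh_vm_conf(read_from_ovf_store, read_initial_vm_conf):
    # Try the OVF_STORE first, then fall back to the initial vm.conf.
    # In the affected build both readers raise, so the exception propagates
    # and nothing is ever cached under /var/run/ovirt-hosted-engine-ha.
    try:
        content = read_from_ovf_store()
    except Exception:
        content = read_initial_vm_conf()
    path = os.path.join(RUN_DIR, "vm.conf")
    with open(path, "w") as f:  # only reached if one of the reads succeeded
        f.write(content)
    return path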
Kasturi, you got it just on one of your hosts since the other two run an older VDSM build.
Simone, i ran the following steps and i see that vm.conf file is missing.
1) Installed three nodes with glusterfs as storage domains.
2) Configured power management on all the nodes and ran the command 'poweroff -f' on one of the node.
3) Engine tried to fence the host and it logged the message 'Host is rebooting' but this never came up.
4) After some time i manually powered on the host and once the host is up i see that Hosted Engine HA : Not active.
5) I see the following error message in broker.log and vm.conf file is missing.
Thread-1::ERROR::2016-11-24 19:32:11,651::listener::192::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) Error handling request, data: 'set-storage-domain FilesystemBackend dom_type=glusterfs sd_uuid=27f054c3-c245-4039-b42a-c28b37043016'
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/listener.py", line 166, in handle
    data)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/listener.py", line 299, in _dispatch
    .set_storage_domain(client, sd_type, **options)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py", line 66, in set_storage_domain
    self._backends[client].connect()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/storage_backends.py", line 461, in connect
    self._dom_type)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/storage_backends.py", line 108, in get_domain_path
    " in {1}".format(sd_uuid, parent))
BackendFailureException: path to storage domain 27f054c3-c245-4039-b42a-c28b37043016 not found in /rhev/data-center/mnt/glusterSD
Thanks kasturi
[1] http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/HC/vm_conf/
Thanks
kasturi
participants (2)
- knarra
- Simone Tiraboschi