<div dir="ltr">I took the HE VM down and stopped the ovirt-ha-agent services on both hosts. <div>I tried hosted-engine --reinitialize-lockspace, but the command just executes silently and I'm not sure whether it does anything at all.<br>I also tried to clean the metadata. On one host it went fine, but on the second host it always fails with the following messages: <br><br><div>INFO:ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine:VDSM domain monitor status: PENDING</div><div>INFO:ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine:VDSM domain monitor status: PENDING</div><div>INFO:ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine:VDSM domain monitor status: PENDING</div><div>INFO:ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine:VDSM domain monitor status: PENDING</div><div>ERROR:ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine:Failed to start monitoring domain (sd_uuid=4a7f8717-9bb0-4d80-8016-498fa4b88162, host_id=2): timeout during domain acquisition</div><div>ERROR:ovirt_hosted_engine_ha.agent.agent.Agent:Traceback (most recent call last):</div><div> File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 191, in _run_agent</div><div> return action(he)</div><div> File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 67, in action_clean</div><div> return he.clean(options.force_cleanup)</div><div> File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 345, in clean</div><div> self._initialize_domain_monitor()</div><div> File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 829, in _initialize_domain_monitor</div><div> raise Exception(msg)</div><div>Exception: Failed to start monitoring domain (sd_uuid=4a7f8717-9bb0-4d80-8016-498fa4b88162, host_id=2): timeout during domain 
acquisition</div><div><br></div><div>ERROR:ovirt_hosted_engine_ha.agent.agent.Agent:Trying to restart agent</div><div>WARNING:ovirt_hosted_engine_ha.agent.agent.Agent:Restarting agent, attempt '0'</div><div>ERROR:ovirt_hosted_engine_ha.agent.agent.Agent:Too many errors occurred, giving up. Please review the log and consider filing a bug.</div><div>INFO:ovirt_hosted_engine_ha.agent.agent.Agent:Agent shutting down</div><div><br></div><div>I'm not an expert at reading sanlock output, but this looks a bit strange to me: </div><div><br>From the first host (host_id=2):</div><div><br></div><div><div>[root@ovirt1 ~]# sanlock client status </div><div>daemon b1d7fea2-e8a9-4645-b449-97702fc3808e.ovirt1.tel</div><div>p -1 helper</div><div>p -1 listener</div><div>p -1 status</div><div>p 3763 </div><div>p 62861 quaggaVM</div><div>p 63111 powerDNS</div><div>p 107818 pjsip_freepbx_14</div><div>p 109092 revizorro_dev</div><div>p 109589 routerVM</div><div>s hosted-engine:2:/var/run/vdsm/storage/4a7f8717-9bb0-4d80-8016-498fa4b88162/093faa75-5e33-4559-84fa-1f1f8d48153b/911c7637-b49d-463e-b186-23b404e50769:0</div><div>s a40cc3a9-54d6-40fd-acee-525ef29c8ce3:2:/rhev/data-center/mnt/glusterSD/ovirt2.telia.ru\:_data/a40cc3a9-54d6-40fd-acee-525ef29c8ce3/dom_md/ids:0</div><div>s 4a7f8717-9bb0-4d80-8016-498fa4b88162:1:/rhev/data-center/mnt/glusterSD/ovirt2.telia.ru\:_engine/4a7f8717-9bb0-4d80-8016-498fa4b88162/dom_md/ids:0</div><div>r a40cc3a9-54d6-40fd-acee-525ef29c8ce3:SDM:/rhev/data-center/mnt/glusterSD/ovirt2.telia.ru\:_data/a40cc3a9-54d6-40fd-acee-525ef29c8ce3/dom_md/leases:1048576:49 p 
3763</div></div><div><br></div><div> </div><div>From the second host (host_id=1):</div><div><br></div><div><div>[root@ovirt2 ~]# sanlock client status </div><div>daemon 9263e081-e5ea-416b-866a-0a73fe32fe16.ovirt2.tel</div><div>p -1 helper</div><div>p -1 listener</div><div>p 150440 CentOS-Desk</div><div>p 151061 centos-dev-box</div><div>p 151288 revizorro_nfq</div><div>p 151954 gitlabVM</div><div>p -1 status</div><div>s hosted-engine:1:/var/run/vdsm/storage/4a7f8717-9bb0-4d80-8016-498fa4b88162/093faa75-5e33-4559-84fa-1f1f8d48153b/911c7637-b49d-463e-b186-23b404e50769:0</div><div>s a40cc3a9-54d6-40fd-acee-525ef29c8ce3:1:/rhev/data-center/mnt/glusterSD/ovirt2.telia.ru\:_data/a40cc3a9-54d6-40fd-acee-525ef29c8ce3/dom_md/ids:0</div><div>s 4a7f8717-9bb0-4d80-8016-498fa4b88162:1:/rhev/data-center/mnt/glusterSD/ovirt2.telia.ru\:_engine/4a7f8717-9bb0-4d80-8016-498fa4b88162/dom_md/ids:0 ADD<br><br>I'm not sure whether there is a problem with lockspace 4a7f8717-9bb0-4d80-8016-498fa4b88162, but both hosts show 1 as the host_id there. Is this correct? Shouldn't they have different IDs?<br><br>Once the ha-agents have been started, hosted-engine --vm-status shows 'unknown-stale-data' for the second host, and the HE doesn't start on the second host at all. <br>Redeploying the host hasn't helped either. 
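<div><br>To make the comparison above less error-prone, here is a rough Python sketch (not an official tool) that reads the lockspace lines from two saved "sanlock client status" outputs and reports any lockspace in which both hosts claim the same host_id. It assumes each lockspace line has the form "s NAME:HOST_ID:PATH:OFFSET", which is how the output above reads; the function names and the host labels ovirt1/ovirt2 are made up for the illustration, and the sample text is just the "s" lines pasted above.</div>
<pre>
```python
from collections import defaultdict

# The "s" (lockspace) lines copied from the two outputs above.
OVIRT1 = r"""
p 3763
s hosted-engine:2:/var/run/vdsm/storage/4a7f8717-9bb0-4d80-8016-498fa4b88162/093faa75-5e33-4559-84fa-1f1f8d48153b/911c7637-b49d-463e-b186-23b404e50769:0
s a40cc3a9-54d6-40fd-acee-525ef29c8ce3:2:/rhev/data-center/mnt/glusterSD/ovirt2.telia.ru\:_data/a40cc3a9-54d6-40fd-acee-525ef29c8ce3/dom_md/ids:0
s 4a7f8717-9bb0-4d80-8016-498fa4b88162:1:/rhev/data-center/mnt/glusterSD/ovirt2.telia.ru\:_engine/4a7f8717-9bb0-4d80-8016-498fa4b88162/dom_md/ids:0
"""

OVIRT2 = r"""
p -1 status
s hosted-engine:1:/var/run/vdsm/storage/4a7f8717-9bb0-4d80-8016-498fa4b88162/093faa75-5e33-4559-84fa-1f1f8d48153b/911c7637-b49d-463e-b186-23b404e50769:0
s a40cc3a9-54d6-40fd-acee-525ef29c8ce3:1:/rhev/data-center/mnt/glusterSD/ovirt2.telia.ru\:_data/a40cc3a9-54d6-40fd-acee-525ef29c8ce3/dom_md/ids:0
s 4a7f8717-9bb0-4d80-8016-498fa4b88162:1:/rhev/data-center/mnt/glusterSD/ovirt2.telia.ru\:_engine/4a7f8717-9bb0-4d80-8016-498fa4b88162/dom_md/ids:0 ADD
"""


def lockspace_ids(status_text):
    """Map lockspace name to host_id, using only lines that start with 's '."""
    ids = {}
    for line in status_text.splitlines():
        line = line.strip()
        if not line.startswith("s "):
            continue
        # Name and host_id are the first two colon-separated fields;
        # the rest (path, offset, flags) is ignored here.
        name, host_id, _rest = line[2:].split(":", 2)
        ids[name] = int(host_id)
    return ids


def conflicts(per_host):
    """Return {lockspace: {host_id: [hosts]}} for host_ids claimed by several hosts."""
    seen = defaultdict(lambda: defaultdict(list))
    for host, text in per_host.items():
        for name, host_id in lockspace_ids(text).items():
            seen[name][host_id].append(host)
    found = {}
    for name, by_id in seen.items():
        for host_id, hosts in by_id.items():
            if len(hosts) > 1:
                found.setdefault(name, {})[host_id] = sorted(hosts)
    return found


if __name__ == "__main__":
    for name, by_id in conflicts({"ovirt1": OVIRT1, "ovirt2": OVIRT2}).items():
        print(name, by_id)
```
</pre>
<div>With the two outputs above, only lockspace 4a7f8717-9bb0-4d80-8016-498fa4b88162 is reported, with host_id 1 claimed by both hosts.<br><br></div>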
</div></div><div><br></div><div>Any advice on this?<br>Regards,</div><div>Artem</div><br></div></div><div class="gmail_extra"><br><div class="gmail_quote">On Mon, Feb 19, 2018 at 9:32 PM, Artem Tambovskiy <span dir="ltr"><<a href="mailto:artem.tambovskiy@gmail.com" target="_blank">artem.tambovskiy@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="HOEnZb"><div class="h5"><div dir="ltr"><div class="gmail_quote"><div dir="ltr">Thanks Martin.<div><br></div><div>As you suggested, I updated hosted-engine.conf with the correct host_id values and restarted the ovirt-ha-agent services on both hosts, and now I've run into a problem with the status "unknown-stale-data" :(<br>And the second host still doesn't look capable of running the HE.</div><div><br></div><div>Should I stop the HE VM, bring down the ovirt-ha-agents, run reinitialize-lockspace, and start the ovirt-ha-agents again?</div><div><br></div><div>Regards,</div><div>Artem<br><br><br></div></div><div class="m_-533449850839018416HOEnZb"><div class="m_-533449850839018416h5"><div class="gmail_extra"><br><div class="gmail_quote">On Mon, Feb 19, 2018 at 6:45 PM, Martin Sivak <span dir="ltr"><<a href="mailto:msivak@redhat.com" target="_blank">msivak@redhat.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi Artem,<br>
<br>
Just restarting the ovirt-ha-agent services should be enough.<br>
<br>
Best regards<br>
<br>
Martin Sivak<br>
<br>
On Mon, Feb 19, 2018 at 4:40 PM, Artem Tambovskiy<br>
<div class="m_-533449850839018416m_-4283445358333830619HOEnZb"><div class="m_-533449850839018416m_-4283445358333830619h5"><<a href="mailto:artem.tambovskiy@gmail.com" target="_blank">artem.tambovskiy@gmail.com</a>> wrote:<br>
> Ok, understood.<br>
> Once I set the correct host_id on both hosts, how do I put the changes into effect with<br>
> minimal downtime? Or do I need to reboot both hosts anyway?<br>
><br>
> Regards,<br>
> Artem<br>
><br>
> On Feb 19, 2018, at 18:18, "Simone Tiraboschi"<br>
> <<a href="mailto:stirabos@redhat.com" target="_blank">stirabos@redhat.com</a>> wrote:<br>
><br>
>><br>
>><br>
>> On Mon, Feb 19, 2018 at 4:12 PM, Artem Tambovskiy<br>
>> <<a href="mailto:artem.tambovskiy@gmail.com" target="_blank">artem.tambovskiy@gmail.com</a>> wrote:<br>
>>><br>
>>><br>
>>> Thanks a lot, Simone!<br>
>>><br>
>>> This clearly shows a problem:<br>
>>><br>
>>> [root@ov-eng ovirt-engine]# sudo -u postgres psql -d engine -c 'select<br>
>>> vds_name, vds_spm_id from vds'<br>
>>> vds_name | vds_spm_id<br>
>>> -----------------+------------<br>
>>> ovirt1.local | 2<br>
>>> ovirt2.local | 1<br>
>>> (2 rows)<br>
>>><br>
>>> Meanwhile, hosted-engine.conf on ovirt1.local has host_id=1, and on ovirt2.local<br>
>>> it has host_id=2. So the values are exactly reversed.<br>
>>> How can I fix this in a simple way? Update the engine DB?<br>
>><br>
>><br>
>> I'd suggest manually fixing /etc/ovirt-hosted-engine/hosted-engine.conf on<br>
>> both hosts.<br>
>><br>
>>><br>
>>><br>
>>> Regards,<br>
>>> Artem<br>
>>><br>
>>> On Mon, Feb 19, 2018 at 5:37 PM, Simone Tiraboschi <<a href="mailto:stirabos@redhat.com" target="_blank">stirabos@redhat.com</a>><br>
>>> wrote:<br>
>>>><br>
>>>><br>
>>>><br>
>>>> On Mon, Feb 19, 2018 at 12:13 PM, Artem Tambovskiy<br>
>>>> <<a href="mailto:artem.tambovskiy@gmail.com" target="_blank">artem.tambovskiy@gmail.com</a>> wrote:<br>
>>>>><br>
>>>>> Hello,<br>
>>>>><br>
>>>>> Last weekend my cluster suffered from a massive power outage due to<br>
>>>>> human error.<br>
>>>>> I'm using an SHE setup with Gluster. I managed to bring the cluster up<br>
>>>>> quickly, but once again I have a problem with a duplicated host_id<br>
>>>>> (<a href="https://bugzilla.redhat.com/show_bug.cgi?id=1543988" rel="noreferrer" target="_blank">https://bugzilla.redhat.com/show_bug.cgi?id=1543988</a>) on the second host, and because<br>
>>>>> of this the second host is not capable of running the HE.<br>
>>>>><br>
>>>>> I manually updated the hosted-engine.conf file with the correct host_id and<br>
>>>>> restarted the agent and broker, with no effect. Then I rebooted the host itself,<br>
>>>>> still with no change. How can I fix this issue?<br>
>>>><br>
>>>><br>
>>>> I'd suggest running this command on the engine VM:<br>
>>>> sudo -u postgres scl enable rh-postgresql95 -- psql -d engine -c<br>
>>>> 'select vds_name, vds_spm_id from vds'<br>
>>>> (just sudo -u postgres psql -d engine -c 'select vds_name, vds_spm_id<br>
>>>> from vds' if still on 4.1) and checking<br>
>>>> /etc/ovirt-hosted-engine/hosted-engine.conf on all the involved hosts.<br>
>>>> You may also have a leftover configuration file on an undeployed<br>
>>>> host.<br>
>>>><br>
>>>> When you find a conflict, you should manually bring down sanlock.<br>
>>>> If in doubt, a reboot of both hosts will certainly resolve it.<br>
>>>><br>
>>>><br>
>>>>><br>
>>>>><br>
>>>>> Regards,<br>
>>>>> Artem<br>
>>>>><br>
>>>>> _______________________________________________<br>
>>>>> Users mailing list<br>
>>>>> <a href="mailto:Users@ovirt.org" target="_blank">Users@ovirt.org</a><br>
>>>>> <a href="http://lists.ovirt.org/mailman/listinfo/users" rel="noreferrer" target="_blank">http://lists.ovirt.org/mailman/listinfo/users</a><br>
>>>>><br>
>>>><br>
>>><br>
>>><br>
>>><br>
>>><br>
>><br>
><br>
><br>
</div></div></blockquote></div><br></div>
</div></div></div><br></div>
</div></div></blockquote></div><br></div>
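<div><br>The check suggested earlier in the thread (compare vds_spm_id from the engine database with host_id in /etc/ovirt-hosted-engine/hosted-engine.conf on each host) can be sketched in Python as below. This is only an illustration under assumptions: it works on text that has already been collected, the function names are hypothetical, the psql output is the one quoted in the thread, and the one-line configuration snippets stand in for the real files, which contain many more keys.</div>
<pre>
```python
# Output of the psql query, as quoted in the thread.
PSQL_OUTPUT = """
    vds_name     | vds_spm_id
-----------------+------------
 ovirt1.local    |          2
 ovirt2.local    |          1
(2 rows)
"""

# Stand-ins for hosted-engine.conf on each host (values as reported in the thread).
CONF_BY_HOST = {
    "ovirt1.local": "host_id=1\n",
    "ovirt2.local": "host_id=2\n",
}


def parse_vds_table(psql_output):
    """Parse 'vds_name | vds_spm_id' rows into {host: spm_id}."""
    mapping = {}
    for line in psql_output.splitlines():
        name, sep, value = line.partition("|")
        value = value.strip()
        # Skips the header, the separator line, and the row-count footer.
        if sep and value.isdigit():
            mapping[name.strip()] = int(value)
    return mapping


def read_host_id(conf_text):
    """Return the integer host_id from key=value configuration text, or None."""
    for line in conf_text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == "host_id":
            return int(value.strip())
    return None


def mismatches(engine_ids, conf_by_host):
    """Return {host: (engine_spm_id, conf_host_id)} for hosts that disagree."""
    result = {}
    for host, spm_id in engine_ids.items():
        conf_id = read_host_id(conf_by_host.get(host, ""))
        if conf_id != spm_id:
            result[host] = (spm_id, conf_id)
    return result


if __name__ == "__main__":
    for host, (spm_id, conf_id) in mismatches(
        parse_vds_table(PSQL_OUTPUT), CONF_BY_HOST
    ).items():
        print(f"{host}: engine says {spm_id}, hosted-engine.conf says {conf_id}")
```
</pre>
<div>With the values reported in the thread, both hosts are flagged, which matches the reversed host_id values described above.</div>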