On Tue, Sep 15, 2020 at 5:40 PM Rapsilber, Marcus
<Marcus.Rapsilber(a)isotravel.com> wrote:
This is the result of "hosted-engine --vm-status" on the first node, which
currently runs the hosted-engine:
--== Host ipc1.dc (id: 1) status ==--
Host ID : 1
Host timestamp : 89980
Score : 3400
Engine status : {"vm": "up", "health":
"good", "detail": "Up"}
Hostname : ipc1.dc
Local maintenance : False
stopped : False
crc32 : 256cb440
conf_on_shared_storage : True
local_conf_timestamp : 89980
Status up-to-date : True
Extra metadata (valid at timestamp):
metadata_parse_version=1
metadata_feature_version=1
timestamp=89980 (Tue Sep 15 16:17:00 2020)
I assume this is more-or-less the time when you ran this command (this
is updated routinely, I don't remember how often).
host-id=1
score=3400
vm_conf_refresh_time=89980 (Tue Sep 15 16:17:00 2020)
conf_on_shared_storage=True
maintenance=False
state=EngineUp
stopped=False
--== Host ipc3.dc (id: 2) status ==--
Host ID : 2
Host timestamp : 65213
Score : 3400
Engine status : unknown stale-data
Hostname : ipc3.dc
Local maintenance : False
stopped : False
crc32 : c4f62c8b
conf_on_shared_storage : True
local_conf_timestamp : 65213
Status up-to-date : False
Extra metadata (valid at timestamp):
metadata_parse_version=1
metadata_feature_version=1
timestamp=65213 (Wed Sep 9 11:01:18 2020)
So this is 6 days old. Is this from before you started the
reinstallation plan, or in the middle of it? Can you check/remember?
host-id=2
score=3400
vm_conf_refresh_time=65213 (Wed Sep 9 11:01:18 2020)
conf_on_shared_storage=True
maintenance=False
state=EngineDown
stopped=False
--== Host ipc2.dc (id: 3) status ==--
Host ID : 3
Host timestamp : 93167
Score : 3400
Engine status : {"vm": "down",
"health": "bad", "detail": "unknown",
"reason": "vm not running on this host"}
Hostname : ipc2.dc
Local maintenance : False
stopped : False
crc32 : f02f19b0
conf_on_shared_storage : True
local_conf_timestamp : 93167
Status up-to-date : True
Extra metadata (valid at timestamp):
metadata_parse_version=1
metadata_feature_version=1
timestamp=93167 (Tue Sep 15 16:16:58 2020)
host-id=3
score=3400
vm_conf_refresh_time=93167 (Tue Sep 15 16:16:58 2020)
conf_on_shared_storage=True
maintenance=False
state=EngineDown
stopped=False
For the newly added node it is:
"The hosted engine configuration has not been retrieved from shared storage. Please
ensure that ovirt-ha-agent is running and the storage server is reachable. "
I also asked you to check/share the logs. Did you find anything there?
Can it mount the shared storage?
If not, you should check this manually and first troubleshoot mount
issues. Without this, reinstalling everything won't help you.
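For example, something like this (just a sketch - I took the volume path
from the lease error below, and the mount point is arbitrary, so adjust
both if yours differ):

    # Is the hosted-engine storage mounted right now?
    mount | grep glusterSD

    # If not, try mounting the engine volume manually, to separate
    # plain mount problems from hosted-engine problems:
    mkdir -p /mnt/he-check
    mount -t glusterfs ipc1.dc:/engine /mnt/he-check
    ls /mnt/he-check
    umount /mnt/he-check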
I don't know anything about gluster. Is gluster status on this host
ok? Is its status as seen by the other hosts ok?
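I don't know the exact commands off-hand, but I believe something like
this, run on this host and on the others, should give a first impression
(please verify against the gluster docs):

    gluster peer status
    gluster volume status engine
    gluster volume heal engine info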
But the mentioned service statuses seem to be OK, too. Actually, I've noticed
them restarting from time to time.
This is normal - on certain conditions, both of these services restart
themselves upon severe errors, just to be on the safe side.
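If you want to see when and why they restarted, the journal should have
it, e.g.:

    journalctl -u ovirt-ha-agent -u ovirt-ha-broker --since today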
● ovirt-ha-broker.service - oVirt Hosted Engine High Availability Communications Broker
   Loaded: loaded (/usr/lib/systemd/system/ovirt-ha-broker.service; enabled; vendor preset: disabled)
   Active: active (running) since Tue 2020-09-15 10:13:11 EDT; 2min 11s ago
 Main PID: 23971 (ovirt-ha-broker)
    Tasks: 11 (limit: 100744)
   Memory: 29.3M
   CGroup: /system.slice/ovirt-ha-broker.service
           └─23971 /usr/libexec/platform-python /usr/share/ovirt-hosted-engine-ha/ovirt-ha-broker

Sep 15 10:13:11 ipc3.dc systemd[1]: Started oVirt Hosted Engine High Availability Communications Broker.
● ovirt-ha-agent.service - oVirt Hosted Engine High Availability Monitoring Agent
   Loaded: loaded (/usr/lib/systemd/system/ovirt-ha-agent.service; enabled; vendor preset: disabled)
   Active: active (running) since Tue 2020-09-15 10:13:22 EDT; 2min 1s ago
 Main PID: 24165 (ovirt-ha-agent)
    Tasks: 2 (limit: 100744)
   Memory: 27.2M
   CGroup: /system.slice/ovirt-ha-agent.service
           └─24165 /usr/libexec/platform-python /usr/share/ovirt-hosted-engine-ha/ovirt-ha-agent
Sometimes it says:
● ovirt-ha-agent.service - oVirt Hosted Engine High Availability Monitoring Agent
   Loaded: loaded (/usr/lib/systemd/system/ovirt-ha-agent.service; enabled; vendor preset: disabled)
   Active: activating (auto-restart) (Result: exit-code) since Tue 2020-09-15 10:23:15 EDT; 4s ago
  Process: 28372 ExecStart=/usr/share/ovirt-hosted-engine-ha/ovirt-ha-agent (code=exited, status=157)
 Main PID: 28372 (code=exited, status=157)
And sometimes it's:
● ovirt-ha-broker.service - oVirt Hosted Engine High Availability Communications Broker
   Loaded: loaded (/usr/lib/systemd/system/ovirt-ha-broker.service; enabled; vendor preset: disabled)
   Active: active (running) since Tue 2020-09-15 10:23:14 EDT; 5min ago
 Main PID: 28370 (ovirt-ha-broker)
    Tasks: 11 (limit: 100744)
   Memory: 29.7M
   CGroup: /system.slice/ovirt-ha-broker.service
           └─28370 /usr/libexec/platform-python /usr/share/ovirt-hosted-engine-ha/ovirt-ha-broker

Sep 15 10:23:14 ipc3.dc systemd[1]: Started oVirt Hosted Engine High Availability Communications Broker.
Sep 15 10:27:31 ipc3.dc ovirt-ha-broker[28370]: ovirt-ha-broker ovirt_hosted_engine_ha.broker.storage_broker.StorageBroker ERROR Failed to start monitoring domain (sd_uuid=e83f0c32-bb91-4909-8e80-6fa974b61968, >
Sep 15 10:27:31 ipc3.dc ovirt-ha-broker[28370]: ovirt-ha-broker ovirt_hosted_engine_ha.broker.listener.Action.start_domain_monitor ERROR Error in RPC call: Failed to start monitoring domain (sd_uuid=e83f0c32-bb>
Sep 15 10:28:02 ipc3.dc ovirt-ha-broker[28370]: ovirt-ha-broker ovirt_hosted_engine_ha.broker.notifications.Notifications ERROR [Errno 111] Connection refused
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/notifications.py", line 29, in send_email
    timeout=float(cfg["smtp-timeout"]))
  File "/usr/lib64/python3.6/smtplib.py", line 251, in __init__
    (code, msg) = self.connect(host, port)
  File "/usr/lib64/python3.6/smtplib.py", line 336, in connect
    self.sock = self._get_socket(host, port, self.timeout)
  File "/usr/lib64/python3.6/smtplib.py", line 307, in _get_socket
    self.source_address)
  File "/usr/lib64/python3.6/socket.py", line 724, in create_connection
    raise err
  File "/usr/lib64/python3.6/socket.py", line 713, in create_connection
    sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused
This is normal and can be ignored. It's because you didn't configure a
mail server on the machine.
You should configure it to work with the credentials you provided
during deploy, if you want to get notifications.
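IIRC the notification settings live in the broker section of the shared
config. Something like this should show/set them (hedged - please check
'hosted-engine --help' for the exact syntax; the server and address
below are placeholders):

    hosted-engine --get-shared-config smtp-server --type=broker
    hosted-engine --set-shared-config smtp-server your.smtp.server --type=broker
    hosted-engine --set-shared-config destination-emails you@example.com --type=broker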
● ovirt-ha-agent.service - oVirt Hosted Engine High Availability Monitoring Agent
   Loaded: loaded (/usr/lib/systemd/system/ovirt-ha-agent.service; enabled; vendor preset: disabled)
   Active: active (running) since Tue 2020-09-15 10:23:25 EDT; 5min ago
 Main PID: 28520 (ovirt-ha-agent)
    Tasks: 2 (limit: 100744)
   Memory: 27.8M
   CGroup: /system.slice/ovirt-ha-agent.service
           └─28520 /usr/libexec/platform-python /usr/share/ovirt-hosted-engine-ha/ovirt-ha-agent

Sep 15 10:23:25 ipc3.dc systemd[1]: Started oVirt Hosted Engine High Availability Monitoring Agent.
Sep 15 10:28:02 ipc3.dc ovirt-ha-agent[28520]: ovirt-ha-agent ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config.vm ERROR Failed scanning for OVF_STORE due to Command Volume.getInfo with args {'volu>
(code=100, message=Cannot inquire Lease(name='66b004b7-504c-4376-acc1-27890b17213b', path='/rhev/data-center/mnt/glusterSD/ipc1.dc:_engine/e83f0c32-bb91-4909-8e80->
I can't tell from this alone if it succeeded to mount (and just failed
to access OVF_STORE inside) or failed to mount.
I think at this point we've even managed to make it worse. Now we have several
different problems on all 3 nodes, like:
- HSMGetTaskStatusVDS failed
- SpmStopVDS failed
- HSMGetAllTasksStatusesVDS failed
- Sync errors
We're going to reinstall the whole cluster from scratch.
If you can, this is probably safest. I mean, if you can re-create the
data, or have reliable backups.
But I think the initial issue/scenario - replacing (adding) a host and
making it able to run the hosted-engine - is still not solved at this point.
As I said, I do not know gluster. Please check this list's archive, as
well as gluster list(s), and the Internet at large, to learn how to
properly restore a failed gluster member host. I agree it's important
to be able to exercise this, and AFAIK should work well - many people
on this list use gluster successfully.
From oVirt's POV, specifically hosted-engine daemons, if the storage
is ok, things should (eventually) work as expected - you put the dead
host to maintenance, reinstall/re-add it with the hosted-engine
checkbox marked, and everything should work. If not, I suggest first
checking the logs and --vm-status, and waiting until things seem to
stabilize, before deciding to take some action.
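E.g. keep an eye on both for a while:

    watch -n 10 hosted-engine --vm-status
    tail -f /var/log/ovirt-hosted-engine-ha/agent.log \
            /var/log/ovirt-hosted-engine-ha/broker.log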
Good luck and best regards,
Thanks and greetings
Marcus
-----Original Message-----
From: Yedidyah Bar David <didi(a)redhat.com>
Sent: Tuesday, 15 September 2020 14:04
To: Rapsilber, Marcus <Marcus.Rapsilber(a)isotravel.com>
Cc: users <users(a)ovirt.org>
Subject: Re: [ovirt-users] Enable a cluster node to run the hosted engine
On Tue, Sep 15, 2020 at 2:40 PM Rapsilber, Marcus <Marcus.Rapsilber(a)isotravel.com>
wrote:
>
> I'm not sure if this log file tells anything about why the node
> "ipc3.dc" isn't capable of running the hosted engine.
> Today we tried the whole procedure again, but this time we didn't install the
> new node via the single-node cluster setup; we set up the cluster storage
> manually. When we added the host ("New Host") we made sure that "Hosted engine
> deployment action" was set to "deploy". Nevertheless, we're still not
> able to allow the new node to run the hosted engine. The grey crown is missing.
What's the output of 'hosted-engine --vm-status' on this host, and on other
hosts (that are ok)?
>
> What are the criteria for a host to be able to run the hosted engine? Is some
> special service required?
> Do we have to install another package? Or is there an Ansible script that does
> the required setup?
Generally speaking, it should be fully automatic, if you mark the checkbox in "Add
host", and AFAICT, the log you attached looks ok.
Also:
- The host needs to be in the same DC/cluster, needs to have access to the shared
storage, etc.
You can try to start the services manually, if they are not up:

    systemctl status ovirt-ha-broker ovirt-ha-agent
    systemctl start ovirt-ha-broker ovirt-ha-agent

- and/or check their logs, in /var/log/ovirt-hosted-engine-ha .
Best regards,
>
> Thanks and greetings
> Marcus
>
> -----Original Message-----
> From: Yedidyah Bar David <didi(a)redhat.com>
> Sent: Tuesday, 15 September 2020 09:33
> To: Rapsilber, Marcus <Marcus.Rapsilber(a)isotravel.com>
> Cc: users <users(a)ovirt.org>
> Subject: Re: [ovirt-users] Enable a cluster node to run the hosted
> engine
>
> On Tue, Sep 15, 2020 at 10:10 AM Rapsilber, Marcus
> <Marcus.Rapsilber(a)isotravel.com> wrote:
> >
> > Hello again,
> >
> > To answer your question of how I made a clean install and reintegrated the
> > node into the cluster: maybe my approach was a bit awkward/inconvenient, but
> > this is what I did:
> > - Install CentOS 8
> > - Install oVirt Repository and packages: cockpit-ovirt-dashboard,
> > vdsm-gluster, ovirt-host
> > - Remove the Gluster bricks of the old node from the
> > data/engine/vmstore volumes
> > - Process a single cluster node installation on the new node via the
> > oVirt Dashboard, in order to setup Gluster and the bricks
> > (hosted-engine setup was skipped)
> > - On the new node: Delete the vmstore/engine/data volumes and the
> > file metadata in the bricks folder
> > - Added the bricks to the volumes of the existing cluster again
> > - Added the host to the cluster
> >
> > Would you suggest a better approach to set up a new node for an existing
> > cluster?
>
> Sorry, I have no experience with gluster, so I can't comment on your particular
> steps, although they sound reasonable.
> The main missing thing is enabling hosted-engine when adding the host to the
> engine.
>
> >
> > At this point I'm not sure if I just overlooked the "hosted engine
> > deployment action" when I added the new host. Unfortunately, I cannot try to
> > edit the host anymore, since my colleague did another reinstall of the node.
>
> Very well.
>
> If this happens again, please tell us.
>
> Best regards,
>
> >
> > Thanks so far and greetings,
> > Marcus
> >
> > -----Original Message-----
> > From: Yedidyah Bar David <didi(a)redhat.com>
> > Sent: Monday, 14 September 2020 10:56
> > To: Rapsilber, Marcus <Marcus.Rapsilber(a)isotravel.com>
> > Cc: users <users(a)ovirt.org>
> > Subject: Re: [ovirt-users] Enable a cluster node to run the hosted
> > engine
> >
> > On Mon, Sep 14, 2020 at 11:18 AM <rap(a)isogmbh.de> wrote:
> > >
> > > Hi there,
> > >
> > > currently my team is evaluating oVirt and we're also testing several
> > > failure scenarios, backups and so on.
> > > One scenario was:
> > > - hyperconverged oVirt cluster with 3 nodes
> > > - self-hosted engine
> > > - simulate the break down of one of the nodes by power off
> > > - to replace it make a clean install of a new node and reintegrate
> > > it in the cluster
> >
> > How exactly did you do that?
> >
> > >
> > > Actually everything worked out fine. The newly installed node and related
> > > bricks (vmstore, data, engine) were added to the existing Gluster storage,
> > > and it was added to the oVirt cluster (as a host).
> > >
> > > But there's one remaining problem: the new host doesn't have the
> > > grey crown, which means it's unable to run the hosted engine. How can I
> > > achieve that?
> > > I also found out that ovirt-ha-agent and ovirt-ha-broker aren't
> > > started/enabled on that node. The reason is that
> > > /etc/ovirt-hosted-engine/hosted-engine.conf doesn't exist. I guess this is
> > > not only a problem for the hosted engine, but also for HA VMs.
> >
> > When you add a host to the engine, one of the options in the dialog is to
> > deploy it as a hosted-engine.
> > If you don't, you won't get this crown, nor these services, nor its
> > status in 'hosted-engine --vm-status'.
> >
> > If you didn't, perhaps try to move to maintenance and reinstall, adding
> > this option.
> >
> > If you did choose it, that's perhaps a bug - please check/share relevant
> > logs (e.g. in /var/log/ovirt-engine, including host-deploy/).
> >
> > Best regards,
> >
> > >
> > > Thank you for any advice and greetings, Marcus
> > > _______________________________________________
> > > Users mailing list -- users(a)ovirt.org
> > > To unsubscribe send an email to users-leave(a)ovirt.org
> > > Privacy Statement: https://www.ovirt.org/privacy-policy.html
> > > oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
> > > List Archives:
> >
> >
> >
> > --
> > Didi
> >
>
>
> --
> Didi
>
--
Didi