[Help] The engine is not stable on a HostedEngine with GlusterFS hyperconverged deployment

Hi,

I deployed an oVirt (4.3.10) cluster with a HostedEngine and GlusterFS volumes (engine, vmstore, data). The GlusterFS cluster runs on node1/node2/node3, and the engine VM can run on those 3 nodes. Then I added a 4th node to the cluster. But when I operate on the Engine Web Portal, it always reports a 503 error. I then checked `hosted-engine --vm-status`, see below:

```
[root@vhost1 ~]# hosted-engine --vm-status

--== Host vhost1.yhmk.lan (id: 1) status ==--

conf_on_shared_storage             : True
Status up-to-date                  : True
Hostname                           : vhost1.alatest.lan
Host ID                            : 1
Engine status                      : {"reason": "bad vm status", "health": "bad", "vm": "down_unexpected", "detail": "Down"}
Score                              : 0
stopped                            : False
Local maintenance                  : False
crc32                              : 1f25baff
local_conf_timestamp               : 1253650
Host timestamp                     : 1253649
Extra metadata (valid at timestamp):
    metadata_parse_version=1
    metadata_feature_version=1
    timestamp=1253649 (Thu Apr 8 08:05:48 2021)
    host-id=1
    score=0
    vm_conf_refresh_time=1253650 (Thu Apr 8 08:05:48 2021)
    conf_on_shared_storage=True
    maintenance=False
    state=EngineUnexpectedlyDown
    stopped=False
    timeout=Thu Jan 15 20:23:29 1970

--== Host vhost2.yhmk.lan (id: 2) status ==--

conf_on_shared_storage             : True
Status up-to-date                  : True
Hostname                           : vhost2.alatest.lan
Host ID                            : 2
Engine status                      : {"reason": "vm not running on this host", "health": "bad", "vm": "down_unexpected", "detail": "unknown"}
Score                              : 3400
stopped                            : False
Local maintenance                  : False
crc32                              : 539fc30c
local_conf_timestamp               : 1253343
Host timestamp                     : 1253343
Extra metadata (valid at timestamp):
    metadata_parse_version=1
    metadata_feature_version=1
    timestamp=1253343 (Thu Apr 8 08:05:46 2021)
    host-id=2
    score=3400
    vm_conf_refresh_time=1253343 (Thu Apr 8 08:05:46 2021)
    conf_on_shared_storage=True
    maintenance=False
    state=EngineDown
    stopped=False

--== Host vhost3.yhmk.lan (id: 3) status ==--

conf_on_shared_storage             : True
Status up-to-date                  : True
Hostname                           : vhost3.alatest.lan
Host ID                            : 3
Engine status                      : {"reason": "bad vm status", "health": "bad", "vm": "up", "detail": "Powering up"}
Score                              : 3400
stopped                            : False
Local maintenance                  : False
crc32                              : 4072e0b8
local_conf_timestamp               : 1252345
Host timestamp                     : 1252345
Extra metadata (valid at timestamp):
    metadata_parse_version=1
    metadata_feature_version=1
    timestamp=1252345 (Thu Apr 8 08:05:42 2021)
    host-id=3
    score=3400
    vm_conf_refresh_time=1252345 (Thu Apr 8 08:05:42 2021)
    conf_on_shared_storage=True
    maintenance=False
    state=EngineStarting
    stopped=False
```

Then, after waiting a moment, I can access the web portal again. When I check the host status, it often reports one or more hosts with the label `unavailable as HA score`, but the label disappears later. I also found that sometimes the engine VM migrates to another node while this problem occurs.

So it seems the HostedEngine is not stable and this problem keeps happening. Could you please help me with this? Thanks!

On Thu, Apr 8, 2021 at 8:26 AM <mengz.you@outlook.com> wrote:
Hi,
I deployed an oVirt (4.3.10) cluster with a HostedEngine and GlusterFS volumes (engine, vmstore, data); the GlusterFS cluster is on node1/node2/node3, and the engine VM can run on those 3 nodes. Then I added a 4th node to the cluster.
How exactly did you add it? What is "cluster"?
- the cluster inside the engine's DB (what you see in the admin UI)
- the gluster cluster
- the hosted-engine cluster (to do that, you should choose that option when you add the host)
But when I operate on the Engine Web Portal, it always reports a 503 error,
What do you mean by "always"? Can you no longer use the web portal at all?
then I checked `hosted-engine --vm-status`, see below:
```
[root@vhost1 ~]# hosted-engine --vm-status

--== Host vhost1.yhmk.lan (id: 1) status ==--

conf_on_shared_storage             : True
Status up-to-date                  : True
Hostname                           : vhost1.alatest.lan
Host ID                            : 1
Engine status                      : {"reason": "bad vm status", "health": "bad", "vm": "down_unexpected", "detail": "Down"}
Score                              : 0
stopped                            : False
Local maintenance                  : False
crc32                              : 1f25baff
local_conf_timestamp               : 1253650
Host timestamp                     : 1253649
Extra metadata (valid at timestamp):
    metadata_parse_version=1
    metadata_feature_version=1
    timestamp=1253649 (Thu Apr 8 08:05:48 2021)
    host-id=1
    score=0
    vm_conf_refresh_time=1253650 (Thu Apr 8 08:05:48 2021)
    conf_on_shared_storage=True
    maintenance=False
    state=EngineUnexpectedlyDown
    stopped=False
    timeout=Thu Jan 15 20:23:29 1970

--== Host vhost2.yhmk.lan (id: 2) status ==--

conf_on_shared_storage             : True
Status up-to-date                  : True
Hostname                           : vhost2.alatest.lan
Host ID                            : 2
Engine status                      : {"reason": "vm not running on this host", "health": "bad", "vm": "down_unexpected", "detail": "unknown"}
Score                              : 3400
stopped                            : False
Local maintenance                  : False
crc32                              : 539fc30c
local_conf_timestamp               : 1253343
Host timestamp                     : 1253343
Extra metadata (valid at timestamp):
    metadata_parse_version=1
    metadata_feature_version=1
    timestamp=1253343 (Thu Apr 8 08:05:46 2021)
    host-id=2
    score=3400
    vm_conf_refresh_time=1253343 (Thu Apr 8 08:05:46 2021)
    conf_on_shared_storage=True
    maintenance=False
    state=EngineDown
    stopped=False

--== Host vhost3.yhmk.lan (id: 3) status ==--

conf_on_shared_storage             : True
Status up-to-date                  : True
Hostname                           : vhost3.alatest.lan
Host ID                            : 3
Engine status                      : {"reason": "bad vm status", "health": "bad", "vm": "up", "detail": "Powering up"}
Score                              : 3400
stopped                            : False
Local maintenance                  : False
crc32                              : 4072e0b8
local_conf_timestamp               : 1252345
Host timestamp                     : 1252345
Extra metadata (valid at timestamp):
    metadata_parse_version=1
    metadata_feature_version=1
    timestamp=1252345 (Thu Apr 8 08:05:42 2021)
    host-id=3
    score=3400
    vm_conf_refresh_time=1252345 (Thu Apr 8 08:05:42 2021)
    conf_on_shared_storage=True
    maintenance=False
    state=EngineStarting
    stopped=False
```
Then, after waiting a moment, I can access the web portal again. When I check the host status, it often reports one or more hosts with the label `unavailable as HA score`, but the label disappears later. I also found that sometimes the engine VM migrates to another node while this problem occurs.
So it seems the HostedEngine is not stable and this problem keeps happening. Could you please help me with this? Thanks!
Please check/share /var/log/ovirt-hosted-engine-ha/* on all hosts. Thanks. Best regards, -- Didi
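If it helps, here is a minimal way to collect those logs from each host for sharing; this is only a sketch and assumes the default oVirt 4.3 log locations on the hosts (the archive name is just illustrative):

```
# Run on each of vhost1/2/3: bundle the hosted-engine HA agent and broker logs.
tar czf "/tmp/ha-logs-$(hostname -s).tar.gz" /var/log/ovirt-hosted-engine-ha/
ls -lh "/tmp/ha-logs-$(hostname -s).tar.gz"
```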

Hi Didi,

By "always" I mean: when I operate on the Admin Portal normally, sometimes an operation reports a dialog box with a 503 error. After waiting some time, it goes back to the login page, and then I can log in and do some operations again.

I did not add the 4th node to GlusterFS; I just added it to the oVirt cluster as a compute host.
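A quick way to check whether the engine service itself went down around the time of those 503 errors is to look at it from inside the engine VM; a sketch assuming the default oVirt 4.3 service name and log path:

```
# On the engine VM: a 503 from the portal usually means Apache is up but the
# ovirt-engine service behind it is down or restarting.
systemctl status ovirt-engine --no-pager
journalctl -u ovirt-engine --since "-2 hours" | tail -n 50
grep -iE 'ERROR|Exception' /var/log/ovirt-engine/engine.log | tail -n 20
```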

On Thu, Apr 8, 2021 at 2:40 PM <mengz.you@outlook.com> wrote:
Hi Didi,
By "always" I mean: when I operate on the Admin Portal normally, sometimes an operation reports a dialog box with a 503 error. After waiting some time, it goes back to the login page, and then I can log in and do some operations again.
This indeed sounds like the engine or the engine VM is sometimes rebooted/restarted, which might be due to a misconfiguration somewhere.
I did not add the 4th node to GlusterFS; I just added it to the oVirt cluster as a compute host.
OK, and did you choose to add it to the hosted-engine cluster? Did you check the -ha logs? Best regards, -- Didi
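For reference, a rough way to see when and why the HA agent changed state on each host, assuming the default log locations (the exact log wording varies between versions):

```
# On each of vhost1/2/3: the state names match those shown by `hosted-engine --vm-status`.
grep -E 'Engine(Up|Down|Starting|UnexpectedlyDown)' \
    /var/log/ovirt-hosted-engine-ha/agent.log | tail -n 40
# Storage and monitor problems tend to show up in the broker log.
grep -iE 'error|exception' /var/log/ovirt-hosted-engine-ha/broker.log | tail -n 40
```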

Hi,

Sorry, what do you mean by "choose to add it to the hosted-engine cluster"? I just added the 4th host without choosing the hosted-engine option (meaning the hosted-engine VM will not migrate to the 4th host, only to hosts 1, 2 and 3).

Sorry, where are the -ha logs? On the hosts or on the engine VM?

And could this be related to the DNS server? There seem to be lots of queries to the DNS server.

On Thu, Apr 8, 2021 at 6:02 PM <mengz.you@outlook.com> wrote:
Hi,
Sorry, what do you mean by "choose to add it to the hosted-engine cluster"?
Compute -> Hosts -> New -> "Hosted Engine" -> set "Choose hosted engine deployment action" to "Deploy"
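One way to double-check which hosts are actually part of the hosted-engine cluster is that only hosts deployed as hosted-engine hosts appear in the status output; a quick sketch based on the output format shown earlier in this thread:

```
# On any of vhost1/2/3: vhost4 should not appear here if it was added as a
# plain compute host without the "Deploy" hosted-engine action.
hosted-engine --vm-status | grep -E '^--== Host|^Score'
```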
I just added the 4th host without choosing the hosted-engine option (meaning the hosted-engine VM will not migrate to the 4th host, only to hosts 1, 2 and 3).
If so, it is not part of your hosted-engine cluster, which means your problems are likely not related to it.
Sorry, where are the -ha logs? On the hosts or on the engine VM?
Hosts
And could this be related to the DNS server? There seem to be lots of queries to the DNS server.
Might be, yes. Perhaps you misconfigured something? Set the wrong name in the engine, or hosts, or DNS? Best regards, -- Didi
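A simple sanity check for the DNS angle is to confirm forward and reverse resolution for every host FQDN and the engine's FQDN from each host; the host names below are taken from this thread, while the engine FQDN is only a placeholder:

```
# Run on each host and on the engine VM: every name should resolve, and the
# reverse lookup should map back to the expected FQDN.
for h in vhost1.yhmk.lan vhost2.yhmk.lan vhost3.yhmk.lan vhost4.yhmk.lan \
         engine.yhmk.lan; do   # engine.yhmk.lan is a placeholder for your engine FQDN
    ip=$(getent hosts "$h" | awk '{print $1}')
    echo "$h -> ${ip:-NO FORWARD RECORD}"
    [ -n "$ip" ] && echo "    reverse: $(getent hosts "$ip" | awk '{print $2}')"
done
```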

Hi Didi,

Another question, about GlusterFS: I only used host1, host2 and host3 in the gluster cluster, see:

```
[root@vhost2 ~]# gluster pool list
UUID                                    Hostname        State
b30d556d-73f9-4d63-8bc6-8ee541684b9c    gs3.yhmk.sg     Connected
2eb2f9fe-0237-4825-bf89-facc9d99534a    gs1.yhmk.sg     Connected
c6e85a13-5186-4433-a3c8-0284f826361d    localhost       Connected
```

But when I run `gluster pool list` on host4, the result is:

```
[root@vhost4 ~]# gluster pool list
UUID                                    Hostname        State
c6e85a13-5186-4433-a3c8-0284f826361d    gs2.yhmk.sg     Disconnected
2eb2f9fe-0237-4825-bf89-facc9d99534a    gs1.yhmk.sg     Connected
b30d556d-73f9-4d63-8bc6-8ee541684b9c    gs3.yhmk.sg     Connected
4259036b-f4db-4d89-bb88-4dcf3c5f882c    localhost       Connected
[root@vhost4 ~]# gluster volume list
data
engine
vmstore
```

This is strange: why is host4 in the gluster pool, and why can it see the volume info?

It's an effect that also had me puzzled for a long time: to my understanding, Gluster should only ever show peers that contribute bricks to a volume, not peers in general. Perhaps an exception needs to be made for hosts that have been enabled to run the management engine, as that might require Gluster insight. But I have also seen this with hosts that I only added temporarily to oVirt as compute nodes, and I believe I have even seen really ill effects:

Given G1+G2+G3 holding 3R or 2R+1A bricks for the typical oVirt volumes engine/data/vmstore, I had added C1+C2+C3 as mere compute hosts. When I then rebooted G3 while C1+C2 were also down, all of a sudden G1+G2 would shut down their bricks (and refuse to bring them back up) for lack of quorum. I had to bring C1+C2 back up to regain quorum, and then delete C1+C2 from oVirt to take them out of the quorum for a volume to which they contributed no bricks. And often enough, I then had to actually detach them as peers via the gluster CLI, because the GUI didn't finish the job. Of course, that only works when G1+G2+G3 are actually all up too, because otherwise peers can't be detached.

I posted a query on this issue just yesterday. Hopefully someone from the development team will shed some insight into the logic, so we can test better and potentially open an issue to fix it.
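For completeness, a sketch of the gluster CLI checks referred to above, run on one of the brick-holding hosts; the volume name `engine` comes from this thread, and the quorum options are just the usual ones to inspect:

```
gluster peer status                                     # every peer and its state
gluster volume info engine                              # which peers actually hold bricks
gluster volume get engine cluster.server-quorum-type    # server-side quorum setting
gluster volume get engine cluster.quorum-type           # client-side (write) quorum setting
# Removing a brick-less peer; only works while the remaining peers are up:
# gluster peer detach <hostname>
```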

Hi Didi,

I also checked `/var/log/ovirt-engine/engine.log`, and it reports lots of WARN messages from [org.ovirt.engine.core.vdsbroker.gluster.GlusterVolumesListReturn], see below:

```
2021-04-12 03:15:58,840+08 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler7) [41356b43] EVENT_ID: GLUSTER_SERVER_STATUS_DISCONNECTED(4,163), Gluster server vhost4 set to DISCONNECTED on cluster RQ940Cluster.
2021-04-12 03:15:58,841+08 INFO  [org.ovirt.engine.core.vdsbroker.gluster.GlusterVolumesListVDSCommand] (DefaultQuartzScheduler7) [41356b43] START, GlusterVolumesListVDSCommand(HostName = vhost2, GlusterVolumesListVDSParameters:{hostId='044dc235-e067-4d39-9c4f-e32b1be6f2df'}), log id: 57eb6144
2021-04-12 03:15:59,120+08 WARN  [org.ovirt.engine.core.vdsbroker.gluster.GlusterVolumesListReturn] (DefaultQuartzScheduler7) [41356b43] Could not associate brick 'gs1.yhmk.sg:/gluster_bricks/engine/engine' of volume '5b9caa26-9e8e-42a5-b39f-99a623eb9f1c' with correct network as no gluster network found in cluster 'ee5c7870-8ca2-11eb-b58f-00163e2d1b80'
2021-04-12 03:15:59,125+08 WARN  [org.ovirt.engine.core.vdsbroker.gluster.GlusterVolumesListReturn] (DefaultQuartzScheduler7) [41356b43] Could not associate brick 'gs2.yhmk.sg:/gluster_bricks/engine/engine' of volume '5b9caa26-9e8e-42a5-b39f-99a623eb9f1c' with correct network as no gluster network found in cluster 'ee5c7870-8ca2-11eb-b58f-00163e2d1b80'
2021-04-12 03:15:59,128+08 WARN  [org.ovirt.engine.core.vdsbroker.gluster.GlusterVolumesListReturn] (DefaultQuartzScheduler7) [41356b43] Could not associate brick 'gs3.yhmk.sg:/gluster_bricks/engine/engine' of volume '5b9caa26-9e8e-42a5-b39f-99a623eb9f1c' with correct network as no gluster network found in cluster 'ee5c7870-8ca2-11eb-b58f-00163e2d1b80'
2021-04-12 03:15:59,131+08 WARN  [org.ovirt.engine.core.vdsbroker.gluster.GlusterVolumesListReturn] (DefaultQuartzScheduler7) [41356b43] Could not associate brick 'gs1.yhmk.sg:/gluster_bricks/data/data' of volume '28f02a31-3a37-4c53-bd10-b85339860984' with correct network as no gluster network found in cluster 'ee5c7870-8ca2-11eb-b58f-00163e2d1b80'
2021-04-12 03:15:59,134+08 WARN  [org.ovirt.engine.core.vdsbroker.gluster.GlusterVolumesListReturn] (DefaultQuartzScheduler7) [41356b43] Could not associate brick 'gs2.yhmk.sg:/gluster_bricks/data/data' of volume '28f02a31-3a37-4c53-bd10-b85339860984' with correct network as no gluster network found in cluster 'ee5c7870-8ca2-11eb-b58f-00163e2d1b80'
2021-04-12 03:15:59,138+08 WARN  [org.ovirt.engine.core.vdsbroker.gluster.GlusterVolumesListReturn] (DefaultQuartzScheduler7) [41356b43] Could not associate brick 'gs3.yhmk.sg:/gluster_bricks/data/data' of volume '28f02a31-3a37-4c53-bd10-b85339860984' with correct network as no gluster network found in cluster 'ee5c7870-8ca2-11eb-b58f-00163e2d1b80'
2021-04-12 03:15:59,141+08 WARN  [org.ovirt.engine.core.vdsbroker.gluster.GlusterVolumesListReturn] (DefaultQuartzScheduler7) [41356b43] Could not associate brick 'gs1.yhmk.sg:/gluster_bricks/vmstore/vmstore' of volume '3b2d9e50-1645-4471-aa79-5778d5cc5d04' with correct network as no gluster network found in cluster 'ee5c7870-8ca2-11eb-b58f-00163e2d1b80'
2021-04-12 03:15:59,144+08 WARN  [org.ovirt.engine.core.vdsbroker.gluster.GlusterVolumesListReturn] (DefaultQuartzScheduler7) [41356b43] Could not associate brick 'gs2.yhmk.sg:/gluster_bricks/vmstore/vmstore' of volume '3b2d9e50-1645-4471-aa79-5778d5cc5d04' with correct network as no gluster network found in cluster 'ee5c7870-8ca2-11eb-b58f-00163e2d1b80'
2021-04-12 03:15:59,147+08 WARN  [org.ovirt.engine.core.vdsbroker.gluster.GlusterVolumesListReturn] (DefaultQuartzScheduler7) [41356b43] Could not associate brick 'gs3.yhmk.sg:/gluster_bricks/vmstore/vmstore' of volume '3b2d9e50-1645-4471-aa79-5778d5cc5d04' with correct network as no gluster network found in cluster 'ee5c7870-8ca2-11eb-b58f-00163e2d1b80'
```

What do those logs mean? From the admin web portal, I see all bricks are in the Up status.
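If I read that warning correctly, the engine is saying it cannot map the brick addresses (gs1/gs2/gs3.yhmk.sg) to any logical network with the gluster role in that cluster, so the bricks end up with no gluster network association. A first check is that those names resolve consistently on every host and land on the network intended for gluster traffic. A sketch, with the brick hostnames taken from the log above:

```
# Run on each host (and optionally on the engine VM): the brick names should
# resolve to addresses on the storage/gluster network, not the management one.
for b in gs1.yhmk.sg gs2.yhmk.sg gs3.yhmk.sg; do
    echo "$b -> $(getent hosts "$b" | awk '{print $1}')"
done
ip -brief addr show   # compare the resolved IPs against the hosts' configured subnets
```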
participants (3)
- mengz.you@outlook.com
- Thomas Hoberg
- Yedidyah Bar David