Hosted-Engine constantly dies

Hi Guys,

As I'm still quite new to oVirt, I have some trouble pinpointing the problem on this one. My Hosted Engine (4.3.2) is constantly dying (even when Global Maintenance is enabled). My interpretation of the logs points to some lease problem, but I don't get the whole picture yet.

I'm attaching the output of 'journalctl -f | grep -Ev "Started Session|session opened|session closed"' after I tried to power on the hosted engine (hosted-engine --vm-start).

The nodes are fully updated and I don't see anything in the gluster v5.5 logs, but I can double check.

Any hints are appreciated and thanks in advance.

Best Regards,
Strahil Nikolov
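
For reference, here is a minimal sketch of the host-side checks usually involved when chasing a lease problem like this one; it assumes only the standard hosted-engine, sanlock, and systemd unit names that ship with oVirt 4.3, nothing specific to this setup:

    # HA state of the hosted engine as seen by the agent
    hosted-engine --vm-status

    # Active sanlock lockspaces and resources; failed lease acquisitions show up here
    sanlock client status

    # Recent entries from the lease and HA daemons around the failed start
    journalctl -u sanlock -u ovirt-ha-agent -u ovirt-ha-broker --since "1 hour ago"
    tail -n 100 /var/log/sanlock.log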

On Sun, Mar 31, 2019 at 11:49 PM Strahil Nikolov <hunter86_bg@yahoo.com> wrote:
Hi Guys,
As I'm still quite new to oVirt, I have some trouble pinpointing the problem on this one. My Hosted Engine (4.3.2) is constantly dying (even when Global Maintenance is enabled). My interpretation of the logs points to some lease problem, but I don't get the whole picture yet.
I'm attaching the output of 'journalctl -f | grep -Ev "Started Session|session opened|session closed"' after I tried to power on the hosted engine (hosted-engine --vm-start).
The nodes are fully updated and I don't see anything in the gluster v5.5 logs, but I can double check.
Can you please attach the gluster logs for the same time frame?
Any hints are appreciated and thanks in advance.
Best Regards, Strahil Nikolov

Hi Simone,
I am attaching the gluster logs from ovirt1. I hope you see something I missed.
Best Regards,
Strahil Nikolov

Hi Simone,

I'm attaching the sanlock.log file from ovirt1. Would you be able to share your thoughts, as I find it hard to decode that one? I would be happy to resolve such issues on my own - if that is possible.

P.S.: Ignore the errors from this morning - I was trying to reinitialize dom_md/ids, but in the end I had to restore it from the backup.

Best Regards,
Strahil Nikolov

On Monday, April 1, 2019, 21:58:04 GMT+3, Simone Tiraboschi <stirabos@redhat.com> wrote:

Can you please add also /var/log/sanlock.log ?

On Mon, Apr 1, 2019 at 3:42 PM Strahil Nikolov <hunter86_bg@yahoo.com> wrote:
Hi Simone,
Sorry, it looks empty. Sadly it's true. This one should be OK.
Best Regards,
Strahil Nikolov
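
As a side note, a quick way to pull just the lease-related lines out of sanlock.log when the full file is hard to read (plain grep; the patterns are only a guess at the interesting message types):

    # Show the most recent lockspace / lease events recorded by sanlock
    grep -E "add_lockspace|rem_lockspace|acquire|renewal" /var/log/sanlock.log | tail -n 50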

Hi Simone,

In a short mail chain on gluster-users, Amar confirmed my suspicion that Gluster v5.5 is performing a little bit slower than 3.12.15. As a result, the sanlock reservations take too much time.

I have updated my setup: I uncached my data bricks (I used lvm caching in writeback mode) and used the SSD for the engine volume. Now the engine is running quite well and no more issues were observed.

Can you share any thoughts about oVirt being updated to Gluster v6.x? I know that there are some hooks between vdsm and gluster and I'm not sure how vdsm will react to the new version.

Best Regards,
Strahil Nikolov
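
For anyone reading along, the uncache step mentioned above is roughly the following; this is a sketch only, and the VG/LV names (data_vg/data_lv) are hypothetical placeholders for the actual brick layout:

    # Flush the dirty blocks and detach the writeback cache from the data-brick LV
    lvconvert --uncache data_vg/data_lv

    # Verify the LV is no longer cached
    lvs -a data_vg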

On Fri, Apr 5, 2019 at 10:48 AM Strahil Nikolov <hunter86_bg@yahoo.com> wrote:
Hi Simone,
In a short mail chain on gluster-users, Amar confirmed my suspicion that Gluster v5.5 is performing a little bit slower than 3.12.15. As a result, the sanlock reservations take too much time.
Thanks for the report!
I have updated my setup: I uncached my data bricks (I used lvm caching in writeback mode) and used the SSD for the engine volume. Now the engine is running quite well and no more issues were observed.
This definitely helps, but in my experience the network speed is really the determining factor here. Can you describe your network configuration? A 10 Gbps network is definitely fine here. A few bonded 1 Gbps NICs could work. A single 1 Gbps NIC could be an issue.
Can you share any thoughts about oVirt being updated to Gluster v6.x? I know that there are some hooks between vdsm and gluster and I'm not sure how vdsm will react to the new version.
It's definitely planned, see: https://bugzilla.redhat.com/show_bug.cgi?id=1693998 - I'm not really sure about its time plan.
Best Regards, Strahil Nikolov

This definitely helps, but in my experience the network speed is really the determining factor here. Can you describe your network configuration? A 10 Gbps network is definitely fine here. A few bonded 1 Gbps NICs could work. A single 1 Gbps NIC could be an issue.
I have a gigabit interface on my workstations and sadly I have no option to upgrade without switching the hardware. I have observed my network traffic for days with iftop and gtop and I have never reached my Gbit interface's maximum bandwidth, not even half of it. Even when resetting my bricks (gluster volume reset-brick) and running a full heal, I do not observe more than 50 MiB/s of utilization. I am not sure whether FUSE uses the network to access the local brick, but I hope that it does not.

Checking disk performance - everything is in the expected ranges.

I suspect that the Gluster v5 enhancements are increasing both network and IOPS requirements and my setup was not dealing with it properly.
It's definitely planned, see: https://bugzilla.redhat.com/show_bug.cgi?id=1693998 - I'm not really sure about its time plan.

I will try to get involved and provide feedback to both the oVirt and Gluster dev teams.

Best Regards,
Strahil Nikolov
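
For context, the brick reset and full heal sequence referred to above looks roughly like this; the volume and brick names (engine, ovirt1:/gluster_bricks/engine/engine) are made up for the example:

    # Take the brick offline for maintenance
    gluster volume reset-brick engine ovirt1:/gluster_bricks/engine/engine start

    # ... recreate or clean up the brick filesystem here ...

    # Bring the same brick back and let it resync from the other replicas
    gluster volume reset-brick engine ovirt1:/gluster_bricks/engine/engine \
        ovirt1:/gluster_bricks/engine/engine commit force

    # Trigger a full heal and watch its progress and the traffic it generates
    gluster volume heal engine full
    gluster volume heal engine info
    iftop -i eth0    # interface name is a placeholder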

On Fri, Apr 5, 2019 at 2:18 PM Strahil Nikolov <hunter86_bg@yahoo.com> wrote:
This definitely helps, but in my experience the network speed is really the determining factor here. Can you describe your network configuration? A 10 Gbps network is definitely fine here. A few bonded 1 Gbps NICs could work. A single 1 Gbps NIC could be an issue.
I have a gigabit interface on my workstations and sadly I have no option to upgrade without switching the hardware. I have observed my network traffic for days with iftop and gtop and I have never reached my Gbit interface's maximum bandwidth, not even half of it.
Even when resetting my bricks (gluster volume reset-brick) and running a full heal, I do not observe more than 50 MiB/s of utilization. I am not sure whether FUSE uses the network to access the local brick, but I hope that it does not.
GlusterFS is a scalable *network* filesystem: the network is always there. You can use caching techniques to read from the local peer first, but sooner or later you will have to compare it with data from other peers or sync the data to other peers on writes.

According to the gluster administration guide: https://docs.gluster.org/en/latest/Administrator%20Guide/Network%20Configura... in the "when to bond" section we can read:

network throughput limit of client/server << storage throughput limit
1 GbE (almost always)
10-Gbps links or faster -- for writes, replication doubles the load on the network and replicas are usually on different peers to which the client can transmit in parallel

So if you are using oVirt hyper-converged in replica 3, you have to transmit everything two times over the storage network to sync it with the other peers. I'm not really into the details, but if https://bugzilla.redhat.com/1673058 is really like it's described, we even have a 5x overhead with the current gluster 5.x.

This means that with a 1000 Mbps NIC we cannot expect more than: 1000 Mbps / 2 (other replicas) / 5 (overhead in Gluster 5.x ???) / 8 (bits per byte) = 12.5 MB per second, and this is definitely enough to have sanlock failing, especially because we don't have just the sanlock load, as you can imagine.

I'd strongly advise moving to 10 Gigabit Ethernet (nowadays, with a few hundred dollars you can buy a 4- or 5-port 10GBASE-T copper switch plus 3 NICs and the cables, just for the gluster network) or bonding a few 1 Gigabit Ethernet links.
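
Written out as a quick calculation (the 2x replica traffic and the assumed 5x Gluster 5.x overhead are taken straight from the reasoning above):

    # Rough usable client write throughput on a single 1 GbE link, replica 3
    awk 'BEGIN {
        link_mbps = 1000   # single 1 Gbps NIC
        replicas  = 2      # data also has to reach the two other peers
        overhead  = 5      # assumed Gluster 5.x overhead (see BZ 1673058)
        printf "~%.1f MB/s usable\n", link_mbps / replicas / overhead / 8
    }'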
Checking disk performance - everything is in the expected ranges.
I suspect that the Gluster v5 enhancements are increasing both network and IOPS requirements and my setup was not dealing with it properly.
It's definitely planned, see: https://bugzilla.redhat.com/show_bug.cgi?id=1693998 - I'm not really sure about its time plan.
I will try to get involved and provide feedback to both the oVirt and Gluster dev teams.
Best Regards, Strahil Nikolov