Ovirt cluster unstable; gluster to blame (again)

Hi all: Once again my production oVirt cluster is collapsing in on itself. My servers are intermittently unavailable or degrading, customers are noticing and calling in. This seems to be yet another gluster failure that I haven't been able to pin down. I posted about this a while ago, but didn't get anywhere (no replies that I found).

The problem started out as a glusterfsd process consuming large amounts of RAM (up to the point where RAM and swap were exhausted and the kernel OOM killer killed off the glusterfsd process). For reasons not clear to me at this time, that resulted in any VMs running on that host and that gluster volume being paused with I/O errors (the glusterfs process is usually unharmed; why it didn't continue I/O with the other servers is confusing to me).

I have 3 servers and a total of 4 gluster volumes (engine, iso, data, and data-hdd). The first 3 are replica 2+arbiter; the 4th (data-hdd) is replica 3. The first 3 are backed by an LVM partition (some thin provisioned) on an SSD; the 4th is on a Seagate hybrid disk (HDD plus some internal flash for acceleration). data-hdd is the only thing on that disk. Servers are Dell R610s with the PERC 6/i RAID card, with the disks individually passed through to the OS (no RAID enabled).

The above RAM usage issue came from the data-hdd volume. Yesterday, I caught one of the glusterfsd high-RAM-usage episodes before the OOM killer had to run. I was able to migrate the VMs off the machine and, for good measure, reboot the entire machine (after taking the opportunity to run the software updates that oVirt said were pending). Upon booting back up, the necessary volume healing began. However, this time the healing caused all three servers to go to very, very high load averages (I saw just under 200 on one server; typically they've been 40-70) with top reporting I/O wait at 7-20%. The network for this volume is a dedicated gigabit network. According to bwm-ng, the network bandwidth initially hit 50MB/s (yes, bytes), but tailed off to mostly kB/s for a while. All machines' load averages were still 40+ and "gluster volume heal data-hdd info" reported 5 items needing healing. Servers were intermittently experiencing I/O issues, even on the 3 gluster volumes that appeared largely unaffected. Even OS activities on the hosts themselves (logging in, running commands) would often be very delayed. The oVirt engine was seemingly randomly throwing engine down / engine up / engine failed notifications. Responsiveness on ANY VM was horrific most of the time, with random VMs being inaccessible.

I let the gluster heal run overnight. By morning, there were still 5 items needing healing, all three servers were still experiencing high load, and the servers were still largely unstable.

I've noticed that all of my oVirt outages (and I've had a lot, way more than is acceptable for a production cluster) have come from gluster. I still have 3 VMs whose hard disk images were corrupted by my last gluster crash and that I haven't had time to repair / rebuild yet (I believe that crash was caused by the OOM issue previously mentioned, but I didn't know it at the time).

Is gluster really ready for production yet? It seems so unstable to me... I'm looking at replacing gluster with a dedicated NFS server, likely FreeNAS. Any suggestions? What is the "right" way to do production storage on this (3-node cluster)? Can I get this gluster volume stable enough to get my VMs to run reliably again until I can deploy another storage solution?

--Jim
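(For reference, the heal count and bandwidth figures above came from commands roughly like these - heal info per volume, and bwm-ng in bytes mode; interface selection omitted:)

    # entries still pending heal on the VM volume
    gluster volume heal data-hdd info
    # live per-interface throughput, reported in bytes
    bwm-ng -u bytes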

Hi Jim, On Fri, Jul 6, 2018 at 4:22 PM Jim Kusznir <jim@palousetech.com> wrote:
I posted about this a while ago, but didn't get anywhere (no replies that I found).
cc'ing some people that might be able to assist.
--
Greg Sheremeta
Senior Software Engineer - Team Lead - RHV UX
Red Hat NA <https://www.redhat.com/>
gshereme@redhat.com | IRC: gshereme

Jim-

In addition to my comments on the gluster-users list (go conservative on your cluster.shd-* settings for all volumes), I have one oVirt-specific suggestion that can help in the situation you're in, at least if you're seeing the same client-side memory use issue I am on gluster 3.12.9+. Since it's client-side, you can (temporarily) recover the RAM by putting a node into maintenance (without stopping gluster, and ignoring pending heals if needed), then re-activating it. That unmounts the gluster volumes, restarting the client-side gluster processes that are hogging the RAM. Then do the same to the next node, and the next. That keeps you from having to reboot a node and making your heal situation worse. You may have to repeat it occasionally, but it will keep you going, and you can stagger it between nodes and/or just redistribute VMs afterward.

-Darrell
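For reference, "conservative cluster.shd-* settings" can be sketched like this (volume name taken from this thread; the values are illustrative starting points, not a prescription - check what you currently have first):

    # keep the self-heal daemon to a single thread and a short queue so heals
    # don't swamp the bricks (the 3.12 defaults are already close to this)
    gluster volume set data-hdd cluster.shd-max-threads 1
    gluster volume set data-hdd cluster.shd-wait-qlength 1024
    # confirm what is currently in effect
    gluster volume get data-hdd cluster.shd-max-threads
    gluster volume get data-hdd cluster.shd-wait-qlength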

Hi Jim,

I don't have any targeted suggestions, because there isn't much to latch on to. I can say that Gluster replica three (no arbiters) on dedicated servers serving a couple of oVirt VM clusters here has not had these sorts of issues.

I suspect your long heal times (and the resultant long periods of high load) are at least partly related to 1G networking. That is just a matter of I/O - heals of VMs involve moving a lot of bits. My cluster uses 10G bonded NICs on the gluster and ovirt boxes for storage traffic and separate bonded 1G for ovirtmgmt and communication with other machines/people, and we're occasionally hitting the bandwidth ceiling on the storage network. I'm starting to think about 40/100G, different ways of splitting up intensive systems, and considering iSCSI for specific volumes, although I really don't want to go there.

I don't run FreeNAS[1], but I do run FreeBSD as storage servers for their excellent ZFS implementation, mostly for backups. ZFS will make your `heal` problem go away, but not your bandwidth problems, which become worse (because of fewer NICs pushing traffic). 10G hardware is not exactly in impulse-buy territory, but if you can, I'd recommend doing some testing with it. I think at least some of your problems are related.

If that's not possible, my next stop would be optimizing everything I could about sharding and healing - tuning the shard size to squeeze as much performance out of 1G as I could - but that will only go so far.

-j

[1] FreeNAS is just a storage-tuned FreeBSD with a GUI.
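As a rough sketch of what that tuning pass could look like (volume name from this thread; whether "full" beats the default "diff" heal algorithm on a saturated 1G link is an assumption to test, not a recommendation):

    # current shard size and self-heal algorithm on the VM volume
    gluster volume get data-hdd features.shard-block-size
    gluster volume get data-hdd cluster.data-self-heal-algorithm
    # "full" copies whole shards instead of checksumming for diffs,
    # trading bandwidth for CPU during heals
    gluster volume set data-hdd cluster.data-self-heal-algorithm full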

Thank you for the advice and help.

I do plan on going to 10Gbps networking; I haven't quite jumped off that cliff yet, though.

I did put my data-hdd (main VM storage) volume onto a dedicated 1Gbps network, and I've watched throughput on that and never seen more than 60MB/s achieved (as reported by bwm-ng). I have a separate 1Gbps network for communication and oVirt migration, but I wanted to break that up further (separate out VM traffic from migration/mgmt traffic). My three SSD-backed gluster volumes run on the main network too, as I haven't been able to get them to move to the new network (which I was trying to use for all gluster traffic). I tried bonding, but that seemed to reduce performance rather than improve it.

--Jim
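(For what it's worth, the throughput numbers above were gathered roughly like this; the interface name is an example, not my real one:)

    # confirm which addresses the data-hdd bricks are actually bound to
    gluster volume info data-hdd | grep -i brick
    # per-interface throughput in bytes on the dedicated storage NIC
    bwm-ng -u bytes -I em2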

So far it does not appear to be helping much. I'm still getting VMs locking up and all kinds of notices from the oVirt engine about non-responsive hosts. I'm still seeing load averages in the 20-30 range.

Jim

Load like that is mostly I/O based: either the machine is swapping or the network is too slow. Check I/O wait in top.

And the problem where you get the OOM killer to kill off gluster - that means you don't monitor RAM usage on the servers? Either it's eating all your RAM and swap gets really I/O intensive before it gets killed off, or you have the wrong swap settings in sysctl.conf. (There are tons of broken guides that recommend setting swappiness to 0, but that disables swap on newer kernels. The proper swappiness for swapping only when necessary is 1, or a sufficiently low number like 10; the default is 60.)

Moving to NFS will not improve things. You will get more memory back since gluster isn't running, and that is good. But you would have a single node that can fail with all your storage, it would still be on 1 gigabit only, and your three-node cluster would easily saturate that link.
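A minimal sketch of that swappiness check/change (10 is just the example value from above):

    # current value
    sysctl vm.swappiness
    # persist a low-but-nonzero value and apply it
    echo 'vm.swappiness = 10' >> /etc/sysctl.conf
    sysctl -p
    # then watch the 'wa' field in top's %Cpu(s) line for I/O wait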

So, I'm still at a loss... It sounds like it's either insufficient RAM/swap or insufficient network. It seems to be neither now. At this point, it appears that gluster is just "broke" and killing my systems for no discernible reason. Here are the details, all from the same system (currently running 3 VMs):

[root@ovirt3 ~]# w
 22:26:53 up 36 days,  4:34,  1 user,  load average: 42.78, 55.98, 53.31
USER     TTY      FROM             LOGIN@   IDLE   JCPU   PCPU WHAT
root     pts/0    192.168.8.90     22:26    2.00s  0.12s  0.11s w

bwm-ng reports the highest data usage was about 6MB/s during this test (and that was combined; I have two different gig networks. One gluster network (primary VM storage) runs on one, the other network handles everything else).

[root@ovirt3 ~]# free -m
              total        used        free      shared  buff/cache   available
Mem:          31996       13236         232          18       18526       18195
Swap:         16383        1475       14908

top - 22:32:56 up 36 days,  4:41,  1 user,  load average: 17.99, 39.69, 47.66
Tasks: 407 total,   1 running, 405 sleeping,   1 stopped,   0 zombie
%Cpu(s):  8.6 us,  2.1 sy,  0.0 ni, 87.6 id,  1.6 wa,  0.0 hi,  0.1 si,  0.0 st
KiB Mem : 32764284 total,   228296 free, 13541952 used, 18994036 buff/cache
KiB Swap: 16777212 total, 15246200 free,  1531012 used.  18643960 avail Mem

  PID USER     PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
30036 qemu     20   0 6872324   5.2g  13532 S 144.6 16.5 216:14.55 /usr/libexec/qemu-kvm -name guest=BillingWin,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/v+
28501 qemu     20   0 5034968   3.6g  12880 S  16.2 11.7  73:44.99 /usr/libexec/qemu-kvm -name guest=FusionPBX,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/va+
 2694 root     20   0 2169224  12164   3108 S   5.0  0.0   3290:42 /usr/sbin/glusterfsd -s ovirt3.nwfiber.com --volfile-id data.ovirt3.nwfiber.com.gluster-brick2-data -p /var/run/+
14293 root     15  -5  944700  13356   4436 S   4.0  0.0  16:32.15 /usr/sbin/glusterfs --volfile-server=192.168.8.11 --volfile-server=192.168.8.12 --volfile-server=192.168.8.13 --+
25100 vdsm      0 -20 6747440 107868  12836 S   2.3  0.3  21:35.20 /usr/bin/python2 /usr/share/vdsm/vdsmd
28971 qemu     20   0 2842592   1.5g  13548 S   1.7  4.7 241:46.49 /usr/libexec/qemu-kvm -name guest=unifi.palousetech.com,debug-threads=on -S -object secret,id=masterKey0,format=+
12095 root     20   0  162276   2836   1868 R   1.3  0.0   0:00.25 top
 2708 root     20   0 1906040  12404   3080 S   1.0  0.0   1083:33 /usr/sbin/glusterfsd -s ovirt3.nwfiber.com --volfile-id engine.ovirt3.nwfiber.com.gluster-brick1-engine -p /var/+
28623 qemu     20   0 4749536   1.7g  12896 S   0.7  5.5   4:30.64 /usr/libexec/qemu-kvm -name guest=billing.nwfiber.com,debug-threads=on -S -object secret,id=masterKey0,format=ra+
   10 root     20   0       0      0      0 S   0.3  0.0 215:54.72 [rcu_sched]
 1030 sanlock  rt   0  773804  27908   2744 S   0.3  0.1  35:55.61 /usr/sbin/sanlock daemon
 1890 zabbix   20   0   83904   1696   1612 S   0.3  0.0  24:30.63 /usr/sbin/zabbix_agentd: collector [idle 1 sec]
 2722 root     20   0 1298004   6148   2580 S   0.3  0.0  38:10.82 /usr/sbin/glusterfsd -s ovirt3.nwfiber.com --volfile-id iso.ovirt3.nwfiber.com.gluster-brick4-iso -p /var/run/gl+
 6340 root     20   0       0      0      0 S   0.3  0.0   0:04.30 [kworker/7:0]
10652 root     20   0       0      0      0 S   0.3  0.0   0:00.23 [kworker/u64:2]
14724 root     20   0 1076344  17400   3200 S   0.3  0.1  10:04.13 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/run/gluster/glustershd/glustershd.pid -+
22011 root     20   0       0      0      0 S   0.3  0.0   0:05.04 [kworker/10:1]

Not sure why the system load dropped other than I was trying to take a picture of it :)

In any case, it appears that at this time I have plenty of swap, RAM, and network capacity, and yet things are still running very sluggish; I'm still getting e-mails from servers complaining about loss of communication with something or another, and I still get e-mails from the engine about bad engine status, then recovery, etc. I've shut down 2/3 of my VMs, too... just trying to keep the critical ones operating.

At this point, I don't believe the problem is the memory leak, but it seems to be triggered by the memory leak, in that all my problems started when I got low-RAM warnings from one of my 3 nodes and began recovery efforts from that.

I do really like the idea / concept behind glusterfs, but I really have to figure out why it's performed so poorly from day one and why it's caused 95% of my outages (including several large ones lately). If I can get it stable, reliable, and well performing, then I'd love to keep it. If I can't, then perhaps NFS is the way to go? I don't like the single-point-of-failure aspect of it, but the other NAS boxes I run for clients (central storage for Windows boxes) have been very solid; if I could get that kind of reliability for my oVirt stack, it would be a substantial improvement. Currently, it seems about every other month I have a gluster-induced outage.

Sometimes I wonder if hyperconverged itself is just the issue, but my infrastructure doesn't justify three servers at the same location... I might be able to do two, but even that seems like it's pushing it.

Looks like I can upgrade to 10G for about $900. I can order a dual-Xeon Supermicro 12-disk server, loaded with 2TB WD Enterprise disks and a pair of SSDs for the OS, 32GB RAM, 2.67GHz CPUs, for about $720 delivered. I've got to do something to improve my reliability; I can't keep going the way I have been...

--Jim
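P.S. A quick way to keep an eye on the gluster processes' memory between snapshots like the above (assumed invocation; largest resident set first):

    # resident/virtual size and age of every brick and client gluster process
    ps -C glusterfsd,glusterfs -o pid,rss,vsz,etime,args --sort=-rss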

In case it matters, the data-hdd gluster volume uses these hard drives: https://www.amazon.com/gp/product/B01M1NHCZT/ref=oh_aui_detailpage_o05_s00?ie=UTF8&psc=1

This is in a Dell R610 with a PERC 6/i (one drive per server, configured as a single-drive volume to pass it through as its own /dev/sd* device). Inside the OS, it's partitioned with LVM thin provisioning, then an LVM volume formatted with XFS and mounted as /gluster/brick3, with the data-hdd volume created inside that.

--Jim
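(For completeness, the brick stack described above can be checked on each host with roughly the following; the mount point is the one named here:)

    # thin pool / thin LV fill levels - a full thin pool behaves very badly
    lvs -a -o lv_name,pool_lv,data_percent,metadata_percent
    # brick filesystem usage and XFS geometry
    df -h /gluster/brick3
    xfs_info /gluster/brick3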
So, I'm still at a loss...It sounds like its either insufficient ram/swap, or insufficient network. It seems to be neither now. At this point, it appears that gluster is just "broke" and killing my systems for no descernable reason. Here's detals, all from the same system (currently running 3 VMs):
[root@ovirt3 ~]# w 22:26:53 up 36 days, 4:34, 1 user, load average: 42.78, 55.98, 53.31 USER TTY FROM LOGIN@ IDLE JCPU PCPU WHAT root pts/0 192.168.8.90 22:26 2.00s 0.12s 0.11s w
bwm-ng reports the highest data usage was about 6MB/s during this test (and that was combined; I have two different gig networks. One gluster network (primary VM storage) runs on one, the other network handles everything else).
[root@ovirt3 ~]# free -m total used free shared buff/cache available Mem: 31996 13236 232 18 18526 18195 Swap: 16383 1475 14908
top - 22:32:56 up 36 days, 4:41, 1 user, load average: 17.99, 39.69, 47.66 Tasks: 407 total, 1 running, 405 sleeping, 1 stopped, 0 zombie %Cpu(s): 8.6 us, 2.1 sy, 0.0 ni, 87.6 id, 1.6 wa, 0.0 hi, 0.1 si, 0.0 st KiB Mem : 32764284 total, 228296 free, 13541952 used, 18994036 buff/cache KiB Swap: 16777212 total, 15246200 free, 1531012 used. 18643960 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
30036 qemu 20 0 6872324 5.2g 13532 S 144.6 16.5 216:14.55 /usr/libexec/qemu-kvm -name guest=BillingWin,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/v+ 28501 qemu 20 0 5034968 3.6g 12880 S 16.2 11.7 73:44.99 /usr/libexec/qemu-kvm -name guest=FusionPBX,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/va+ 2694 root 20 0 2169224 12164 3108 S 5.0 0.0 3290:42 /usr/sbin/glusterfsd -s ovirt3.nwfiber.com --volfile-id data.ovirt3.nwfiber.com.gluster-brick2-data -p /var/run/+ 14293 root 15 -5 944700 13356 4436 S 4.0 0.0 16:32.15 /usr/sbin/glusterfs --volfile-server=192.168.8.11 --volfile-server=192.168.8.12 --volfile-server=192.168.8.13 --+ 25100 vdsm 0 -20 6747440 107868 12836 S 2.3 0.3 21:35.20 /usr/bin/python2 /usr/share/vdsm/vdsmd
28971 qemu 20 0 2842592 1.5g 13548 S 1.7 4.7 241:46.49 /usr/libexec/qemu-kvm -name guest=unifi.palousetech.com,debug-threads=on -S -object secret,id=masterKey0,format=+ 12095 root 20 0 162276 2836 1868 R 1.3 0.0 0:00.25 top
2708 root 20 0 1906040 12404 3080 S 1.0 0.0 1083:33 /usr/sbin/glusterfsd -s ovirt3.nwfiber.com --volfile-id engine.ovirt3.nwfiber.com.gluster-brick1-engine -p /var/+ 28623 qemu 20 0 4749536 1.7g 12896 S 0.7 5.5 4:30.64 /usr/libexec/qemu-kvm -name guest=billing.nwfiber.com,debug-threads=on -S -object secret,id=masterKey0,format=ra+ 10 root 20 0 0 0 0 S 0.3 0.0 215:54.72 [rcu_sched]
1030 sanlock rt 0 773804 27908 2744 S 0.3 0.1 35:55.61 /usr/sbin/sanlock daemon
1890 zabbix 20 0 83904 1696 1612 S 0.3 0.0 24:30.63 /usr/sbin/zabbix_agentd: collector [idle 1 sec]
2722 root 20 0 1298004 6148 2580 S 0.3 0.0 38:10.82 /usr/sbin/glusterfsd -s ovirt3.nwfiber.com --volfile-id iso.ovirt3.nwfiber.com.gluster-brick4-iso -p /var/run/gl+ 6340 root 20 0 0 0 0 S 0.3 0.0 0:04.30 [kworker/7:0]
10652 root 20 0 0 0 0 S 0.3 0.0 0:00.23 [kworker/u64:2]
14724 root 20 0 1076344 17400 3200 S 0.3 0.1 10:04.13 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/run/gluster/glustershd/glustershd.pid -+ 22011 root 20 0 0 0 0 S 0.3 0.0 0:05.04 [kworker/10:1]
Not sure why the system load dropped, other than that I was trying to take a picture of it :)
In any case, it appears that at this time I have plenty of swap, RAM, and network capacity, and yet things are still running very sluggishly. I'm still getting e-mails from servers complaining about loss of communication with something or other, and I still get e-mails from the engine about bad engine status, then recovery, etc.
I've shut down 2/3 of my VMs, too... just trying to keep the critical ones operating.
At this point, I don't believe the problem is the memory leak itself, but it seems to be triggered by the memory leak, in that all my problems started when I got low-RAM warnings from one of my 3 nodes and began recovery efforts from that.
I do really like the idea / concept behind glusterfs, but I really have to figure out why it's performed so poorly from day one and why it's caused 95% of my outages (including several large ones lately). If I can get it stable, reliable, and well performing, then I'd love to keep it. If I can't, then perhaps NFS is the way to go? I don't like the single-point-of-failure aspect of it, but my other NAS boxes I run for clients (central storage for windows boxes) have been very solid; if I could get that kind of reliability for my ovirt stack, it would be a substantial improvement. Currently, it seems about every other month I have a gluster-induced outage.
Sometimes I wonder if hyperconverged itself is the issue, but my infrastructure doesn't justify three servers at the same location... I might be able to do two, but even that seems like it's pushing it.
Looks like I can upgrade to 10G for about $900. I can order a dual-Xeon Supermicro 12-disk server, loaded with 2TB WD Enterprise disks and a pair of SSDs for the OS, 32GB RAM, and 2.67GHz CPUs for about $720 delivered. I've got to do something to improve my reliability; I can't keep going the way I have been....
--Jim
On Fri, Jul 6, 2018 at 9:13 PM, Johan Bernhardsson <johan@kafit.se> wrote:
Load like that is mostly I/O based: either the machine is swapping or the network is too slow. Check I/O wait in top.
As for the problem where the OOM killer kills off gluster: does that mean you don't monitor RAM usage on the servers? Either something is eating all your RAM, swap gets really I/O intensive, and the process is then killed off, or you have the wrong swap settings in sysctl.conf. (There are tons of broken guides that recommend setting swappiness to 0, but that disables swap on newer kernels. The proper swappiness for swapping only when necessary is 1, or a sufficiently low number like 10; the default is 60.)
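For example, something like this (a rough sketch; verify the value against your distro's docs before applying it everywhere):

sysctl vm.swappiness                                 # show the current value
echo 'vm.swappiness = 10' > /etc/sysctl.d/99-swappiness.conf
sysctl -p /etc/sysctl.d/99-swappiness.conf           # apply it without a reboot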
Moving to NFS will not improve things. You will get more memory back since gluster isn't running, and that is good, but you will have a single node that can fail with all your storage, it would still be on 1 gigabit only, and your three-node cluster would easily saturate that link.
On July 7, 2018 04:13:13 Jim Kusznir <jim@palousetech.com> wrote:
So far it does not appear to be helping much. I'm still getting VMs locking up and all kinds of notices from the ovirt engine about non-responsive hosts. I'm still seeing load averages in the 20-30 range.
Jim
On Fri, Jul 6, 2018, 3:13 PM Jim Kusznir <jim@palousetech.com> wrote:
Thank you for the advice and help
I do plan on going 10Gbps networking; haven't quite jumped off that cliff yet, though.
I did put my data-hdd (main VM storage volume) onto a dedicated 1Gbps network, and I've watched throughput on that and never seen more than 60MB/s achieved (as reported by bwm-ng). I have a separate 1Gbps network for communication and ovirt migration, but I wanted to break that up further (separate out VM traffic from migration/mgmt traffic). My three SSD-backed gluster volumes run on the main network too, as I haven't been able to get them moved to the new network (which I was trying to use for all gluster traffic). I tried bonding, but that seemed to reduce performance rather than improve it.
--Jim
On Fri, Jul 6, 2018 at 2:52 PM, Jamie Lawrence < jlawrence@squaretrade.com> wrote:
Hi Jim,
I don't have any targeted suggestions, because there isn't much to latch on to. I can say Gluster replica three (no arbiters) on dedicated servers serving a couple Ovirt VM clusters here have not had these sorts of issues.
I suspect your long heal times (and the resultant long periods of high load) are at least partly related to 1G networking. That is just a matter of IO - heals of VMs involve moving a lot of bits. My cluster uses 10G bonded NICs on the gluster and ovirt boxes for storage traffic and separate bonded 1G for ovirtmgmt and communication with other machines/people, and we're occasionally hitting the bandwidth ceiling on the storage network. I'm starting to think about 40/100G, different ways of splitting up intensive systems, and considering iSCSI for specific volumes, although I really don't want to go there.
I don't run FreeNAS[1], but I do run FreeBSD as storage servers for their excellent ZFS implementation, mostly for backups. ZFS will make your `heal` problem go away, but not your bandwidth problems, which become worse (because of fewer NICs pushing traffic). 10G hardware is not exactly in impulse-buy territory, but if you can, I'd recommend doing some testing using it. I think at least some of your problems are related.
If that's not possible, my next stops would be optimizing everything I could about sharding and healing, tuning the shard size and heal settings to squeeze as much performance out of 1G as I could, but that will only go so far.
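For example, starting points might look like the following (illustrative values only; check `gluster volume set help` and test outside production first, and note that the shard size only applies to newly created files):

gluster volume get data-hdd features.shard-block-size        # see what the volume uses now
gluster volume set data-hdd features.shard-block-size 64MB   # only affects newly created images
gluster volume set data-hdd cluster.shd-max-threads 4        # more parallel self-heal workers
gluster volume set data-hdd cluster.data-self-heal-algorithm full   # copy whole shards rather than computing diffs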
-j
[1] FreeNAS is just a storage-tuned FreeBSD with a GUI.

I think I should throw one more thing out there: the current batch of problems started essentially today, and I did apply the updates waiting in the ovirt repos (through the ovirt mgmt interface: "install updates"). Perhaps something from that is now breaking things.

On Fri, Jul 6, 2018 at 10:51 PM, Jim Kusznir <jim@palousetech.com> wrote:
In case it matters, the data-hdd gluster volume uses these hard drives:
https://www.amazon.com/gp/product/B01M1NHCZT/ref=oh_aui_detailpage_o05_s00?ie=UTF8&psc=1
This is in a Dell R610 with PERC6/i (one drive per server, configured as a single-drive volume to pass it through as its own /dev/sd* device). Inside the OS, it's partitioned with LVM thin provisioning, then an LVM volume formatted with XFS and mounted as /gluster/brick3, with the data-hdd volume created inside that.
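Roughly, the brick layout looks like this (reconstructed from memory; the device name and sizes are illustrative, not the exact commands I ran):

pvcreate /dev/sdb                                      # the hybrid drive passed through by the PERC
vgcreate gluster_vg_hdd /dev/sdb
lvcreate -L 900G --thinpool gluster_vg_hdd/thinpool    # thin pool for the brick
lvcreate -V 900G --thin -n data_hdd gluster_vg_hdd/thinpool
mkfs.xfs -i size=512 /dev/gluster_vg_hdd/data_hdd      # 512-byte inodes, per the gluster docs
mkdir -p /gluster/brick3
mount /dev/gluster_vg_hdd/data_hdd /gluster/brick3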
--Jim

That is a single SATA drive that is slow on random I/O, and it has to be synced with 2 other servers. Gluster writes synchronously, so each write has to be written and acknowledged on all three nodes.

So you have a bottleneck in I/O on the drives and one on the network, and depending on how many virtual servers you have and how much RAM they take, you may have one on memory as well.

Load spikes when you have a wait somewhere and are overusing capacity. It's not only CPU that load is counted on; processes waiting for resources count too, so it can be memory or network or drives.

How many virtual servers do you run, and how much RAM do they consume?
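To see where the wait actually is, something like this on each host is a reasonable start (iostat comes from the sysstat package; profiling adds a little overhead, so stop it when you are done):

iostat -x 5 3                            # watch %util and await on the brick disk
vmstat 5 3                               # the b and wa columns show blocked/waiting
gluster volume profile data-hdd start
gluster volume profile data-hdd info     # per-brick latency for each operation
gluster volume profile data-hdd stop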

I run 4-7 VMs, and most of them are 2GB RAM. I have 2 VMs with 4GB.

RAM hasn't been an issue until the recent ovirt/gluster upgrades. Storage has always been slow, especially with these drives. However, even watching network utilization on my switch, the gig-e links never max out.

The loadavg issues and unresponsive behavior started with yesterday's ovirt updates. I now have one VM with low I/O that lives on a separate storage volume (data, fully SSD-backed instead of data-hdd, which was having the issues). I moved it to an ovirt host with no other VMs on it, which had freshly been rebooted. Before it had this one VM on it, loadavg was >0.5. Now it's up in the 20's, with only one low-disk-I/O, 4GB-RAM VM on the host.

This tells me there's now a new problem, separate from gluster. I don't have any non-gluster storage available to test with. I did notice that the last update included a new kernel, and it appears it's the qemu-kvm processes that are consuming way more CPU than they used to.

Are there any known issues? I'm going to reboot into my previous kernel to see if it's kernel-caused.

--Jim
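P.S. Roughly what I plan to run to see what came in with the update and to fall back to the older kernel (commands from memory; the menu index will differ per host):

yum history info last                                      # exactly what yesterday's update pulled in
rpm -q kernel                                              # which kernels are installed
awk -F\' '/^menuentry/ {print $2}' /boot/grub2/grub.cfg    # list boot entries in order
grub2-set-default 1                                        # pick the previous kernel (index 1) for next boot
grub2-editenv list                                         # confirm saved_entry, then reboot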
This host has NO VMs running on it, only 3 running cluster-wide (including the engine, which is on its own storage):

top - 10:44:41 up 1 day, 17:10, 1 user, load average: 15.86, 14.33, 13.39
Tasks: 381 total, 1 running, 379 sleeping, 1 stopped, 0 zombie
%Cpu(s): 2.7 us, 2.1 sy, 0.0 ni, 89.0 id, 6.1 wa, 0.0 hi, 0.2 si, 0.0 st
KiB Mem : 32764284 total, 338232 free, 842324 used, 31583728 buff/cache
KiB Swap: 12582908 total, 12258660 free, 324248 used. 31076748 avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
13279 root 20 0 2380708 37628 4396 S 51.7 0.1 3768:03 glusterfsd
13273 root 20 0 2233212 20460 4380 S 17.2 0.1 105:50.44 glusterfsd
13287 root 20 0 2233212 20608 4340 S 4.3 0.1 34:27.20 glusterfsd
16205 vdsm 0 -20 5048672 88940 13364 S 1.3 0.3 0:32.69 vdsmd
16300 vdsm 20 0 608488 25096 5404 S 1.3 0.1 0:05.78 python
1109 vdsm 20 0 3127696 44228 8552 S 0.7 0.1 18:49.76 ovirt-ha-broker
25555 root 20 0 0 0 0 S 0.7 0.0 0:00.13 kworker/u64:3
10 root 20 0 0 0 0 S 0.3 0.0 4:22.36 rcu_sched
572 root 0 -20 0 0 0 S 0.3 0.0 0:12.02 kworker/1:1H
797 root 20 0 0 0 0 S 0.3 0.0 1:59.59 kdmwork-253:2
877 root 0 -20 0 0 0 S 0.3 0.0 0:11.34 kworker/3:1H
1028 root 20 0 0 0 0 S 0.3 0.0 0:35.35 xfsaild/dm-10
1869 root 20 0 1496472 10540 6564 S 0.3 0.0 2:15.46 python
3747 root 20 0 0 0 0 D 0.3 0.0 0:01.21 kworker/u64:1
10979 root 15 -5 723504 15644 3920 S 0.3 0.0 22:46.27 glusterfs
15085 root 20 0 680884 10792 4328 S 0.3 0.0 0:01.13 glusterd
16102 root 15 -5 1204216 44948 11160 S 0.3 0.1 0:18.61 supervdsmd

At the moment, the engine is barely usable, my other VMs appear to be unresponsive. Two on one host, one on another, and none on the third.

On Sat, Jul 7, 2018 at 10:38 AM, Jim Kusznir <jim@palousetech.com> wrote:
I run 4-7 VMs, and most of them are 2GB ram. I have 2 VMs with 4GB.
Ram hasn't been an issue until recent ovirt/gluster upgrades. Storage has always been slow, especially with these drives. However, even watching network utilization on my switch, the gig-e links never max out.
The loadavg issues and unresponsive behavior started with yesterday's ovirt updates. I now have one VM with low I/O that lives on a separate storage volume (data, fully SSD backed instead of data-hdd, which was having the issues). I moved it to a ovirt host with no other VMs on it, and that had reshly been rebooted. Before it had this one VM on it, loadavg was >0.5. Now its up in the 20's, with only one low Disk I/O, 4GB ram VM on the host.
This to me says there's now a new problem separate from Gluster. I don't have any non-gluster storage available to test with. I did notice that the last update included a new kernel, and it appears its the qemu-kvm processes that are consuming way more CPU than they used to now.
Are there any known issues? I'm going to reboot into my previous kernel to see if its kernel-caused.
--Jim
On Fri, Jul 6, 2018 at 11:07 PM, Johan Bernhardsson <johan@kafit.se> wrote:
That is a single sata drive that is slow on random I/O and that has to be synced with 2 other servers. Gluster works syncronous so one write has to be written and acknowledged on all the three nodes.
So you have a bottle neck in io on drives and one on network and depending on how many virtual servers you have and how much ram they take you might have memory.
Load spikes when you have a wait somewhere and are overusing capacity. But it's now only CPU that load is counted on. It is waiting for resources so it can be memory or Network or drives.
How many virtual server do you run and how much ram do they consume?
On July 7, 2018 09:51:42 Jim Kusznir <jim@palousetech.com> wrote:
In case it matters, the data-hdd gluster volume uses these hard drives:
https://www.amazon.com/gp/product/B01M1NHCZT/ref=oh_aui_deta ilpage_o05_s00?ie=UTF8&psc=1
This is in a Dell R610 with PERC6/i (one drive per server, configured as a single drive volume to pass it through as its own /dev/sd* device). Inside the OS, its partitioned with lvm_thin, then an lvm volume formatted with XFS and mounted as /gluster/brick3, with the data-hdd volume created inside that.
--Jim
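(Since that brick sits on a thin pool, it may also be worth checking how full the pool and its metadata are; a thin pool close to full can stall writes badly. This is a read-only check.)

# show data and metadata usage for thin pools and volumes
lvs -a -o lv_name,vg_name,lv_size,data_percent,metadata_percent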
On Fri, Jul 6, 2018 at 10:45 PM, Jim Kusznir <jim@palousetech.com> wrote:
So, I'm still at a loss... It sounds like it's either insufficient ram/swap or insufficient network. It seems to be neither now. At this point, it appears that gluster is just "broke" and killing my systems for no discernible reason. Here are the details, all from the same system (currently running 3 VMs):
[root@ovirt3 ~]# w
 22:26:53 up 36 days, 4:34, 1 user, load average: 42.78, 55.98, 53.31
USER     TTY      FROM             LOGIN@   IDLE   JCPU   PCPU WHAT
root     pts/0    192.168.8.90     22:26    2.00s  0.12s  0.11s w
bwm-ng reports the highest data usage was about 6MB/s during this test (and that was combined; I have two different gig networks. One gluster network (primary VM storage) runs on one, the other network handles everything else).
[root@ovirt3 ~]# free -m
       total   used   free   shared   buff/cache   available
Mem:   31996   13236    232       18        18526       18195
Swap:  16383    1475  14908

top - 22:32:56 up 36 days, 4:41, 1 user, load average: 17.99, 39.69, 47.66
Tasks: 407 total, 1 running, 405 sleeping, 1 stopped, 0 zombie
%Cpu(s): 8.6 us, 2.1 sy, 0.0 ni, 87.6 id, 1.6 wa, 0.0 hi, 0.1 si, 0.0 st
KiB Mem : 32764284 total, 228296 free, 13541952 used, 18994036 buff/cache
KiB Swap: 16777212 total, 15246200 free, 1531012 used. 18643960 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
30036 qemu 20 0 6872324 5.2g 13532 S 144.6 16.5 216:14.55 /usr/libexec/qemu-kvm -name guest=BillingWin,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/v+
28501 qemu 20 0 5034968 3.6g 12880 S 16.2 11.7 73:44.99 /usr/libexec/qemu-kvm -name guest=FusionPBX,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/va+
2694 root 20 0 2169224 12164 3108 S 5.0 0.0 3290:42 /usr/sbin/glusterfsd -s ovirt3.nwfiber.com --volfile-id data.ovirt3.nwfiber.com.gluster-brick2-data -p /var/run/+
14293 root 15 -5 944700 13356 4436 S 4.0 0.0 16:32.15 /usr/sbin/glusterfs --volfile-server=192.168.8.11 --volfile-server=192.168.8.12 --volfile-server=192.168.8.13 --+
25100 vdsm 0 -20 6747440 107868 12836 S 2.3 0.3 21:35.20 /usr/bin/python2 /usr/share/vdsm/vdsmd
28971 qemu 20 0 2842592 1.5g 13548 S 1.7 4.7 241:46.49 /usr/libexec/qemu-kvm -name guest=unifi.palousetech.com,debug-threads=on -S -object secret,id=masterKey0,format=+
12095 root 20 0 162276 2836 1868 R 1.3 0.0 0:00.25 top
2708 root 20 0 1906040 12404 3080 S 1.0 0.0 1083:33 /usr/sbin/glusterfsd -s ovirt3.nwfiber.com --volfile-id engine.ovirt3.nwfiber.com.gluster-brick1-engine -p /var/+
28623 qemu 20 0 4749536 1.7g 12896 S 0.7 5.5 4:30.64 /usr/libexec/qemu-kvm -name guest=billing.nwfiber.com,debug-threads=on -S -object secret,id=masterKey0,format=ra+
10 root 20 0 0 0 0 S 0.3 0.0 215:54.72 [rcu_sched]
1030 sanlock rt 0 773804 27908 2744 S 0.3 0.1 35:55.61 /usr/sbin/sanlock daemon
1890 zabbix 20 0 83904 1696 1612 S 0.3 0.0 24:30.63 /usr/sbin/zabbix_agentd: collector [idle 1 sec]
2722 root 20 0 1298004 6148 2580 S 0.3 0.0 38:10.82 /usr/sbin/glusterfsd -s ovirt3.nwfiber.com --volfile-id iso.ovirt3.nwfiber.com.gluster-brick4-iso -p /var/run/gl+
6340 root 20 0 0 0 0 S 0.3 0.0 0:04.30 [kworker/7:0]
10652 root 20 0 0 0 0 S 0.3 0.0 0:00.23 [kworker/u64:2]
14724 root 20 0 1076344 17400 3200 S 0.3 0.1 10:04.13 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/run/gluster/glustershd/glustershd.pid -+
22011 root 20 0 0 0 0 S 0.3 0.0 0:05.04 [kworker/10:1]
Not sure why the system load dropped other than I was trying to take a picture of it :)
In any case, it appears that at this time, I have plenty of swap, ram, and network capacity, and yet things are still running very sluggish; I'm still getting e-mails from servers complaining about loss of communication with something or another; I still get e-mails from the engine about bad engine status, then recovery, etc.
I've shut down 2/3 of my VMs, too....just trying to keep the critical ones operating.
At this point, I don't believe the problem is the memory leak, but it seems to be triggered by the memory leak, as in all my problems started when I got low ram warnings from one of my 3 nodes and began recovery efforts from that.
I do really like the idea / concept behind glusterfs, but I really have to figure out why it's performed so poorly from day one, and why it's caused 95% of my outages (including several large ones lately). If I can get it stable, reliable, and well performing, then I'd love to keep it. If I can't, then perhaps NFS is the way to go? I don't like the single point of failure aspect of it, but my other NAS boxes I run for clients (central storage for windows boxes) have been very solid; if I could get that kind of reliability for my ovirt stack, it would be a substantial improvement. Currently, it seems about every other month I have a gluster-induced outage.
Sometimes I wonder if hyperconverged itself is the issue, but my infrastructure doesn't justify three servers at the same location... I might be able to do two, but even that seems like it's pushing it.
Looks like I can upgrade to 10G for about $900. I can order a dual-Xeon supermicro 12-disk server, loaded with 2TB WD Enterprise disks and a pair of SSDs for the OS, 32GB RAM, 2.67GHz CPUs, for about $720 delivered. I've got to do something to improve my reliability; I can't keep going the way I have been....
--Jim
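(For anyone watching a heal like this, a couple of read-only gluster commands that may help track the backlog, assuming gluster 3.12 and the data-hdd volume named earlier:)

# list entries still needing heal
gluster volume heal data-hdd info
# just the per-brick counts, which is quicker to poll
gluster volume heal data-hdd statistics heal-count
# confirm which clients are connected to the bricks
gluster volume status data-hdd clients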
On Fri, Jul 6, 2018 at 9:13 PM, Johan Bernhardsson <johan@kafit.se> wrote:
Load like that is mostly I/O based: either the machine is swapping or the network is too slow. Check I/O wait in top.
And the problem where you get the OOM killer to kill off gluster: does that mean you don't monitor RAM usage on the servers? Either it's eating all your RAM, swap gets really I/O intensive, and then it is killed off, or you have the wrong swap settings in sysctl.conf (there are tons of broken guides that recommend setting swappiness to 0, but that disables swap on newer kernels; the proper swappiness for swapping only when necessary is 1, or a sufficiently low number like 10; the default is 60).
Moving to nfs will not improve things. You will get more memory since gluster isn't running and that is good. But you will have a single node that can fail with all your storage and it would still be on 1 gigabit only and your three node cluster would easily saturate that link.
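(For reference, a minimal way to check and persist that on a stock CentOS 7 host; the file name below is arbitrary.)

# current value
sysctl vm.swappiness
# persist a conservative value and apply it now
echo 'vm.swappiness = 10' > /etc/sysctl.d/99-swappiness.conf
sysctl -p /etc/sysctl.d/99-swappiness.conf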
On July 7, 2018 04:13:13 Jim Kusznir <jim@palousetech.com> wrote:
So far it does not appear to be helping much. I'm still getting VMs locking up and all kinds of notices from the ovirt engine about non-responsive hosts. I'm still seeing load averages in the 20-30 range.
Jim
On Fri, Jul 6, 2018, 3:13 PM Jim Kusznir <jim@palousetech.com> wrote:
Thank you for the advice and help.

I do plan on going 10Gbps networking; haven't quite jumped off that cliff yet, though.

I did put my data-hdd (main VM storage volume) onto a dedicated 1Gbps network, and I've watched throughput on that and never seen more than 60MB/s achieved (as reported by bwm-ng). I have a separate 1Gbps network for communication and ovirt migration, but I wanted to break that up further (separate out VM traffic from migration/mgmt traffic). My three SSD-backed gluster volumes run the main network too, as I haven't been able to get them to move to the new network (which I was trying to use as all gluster). I tried bonding, but that seemed to reduce performance rather than improve it.

--Jim

On Fri, Jul 6, 2018 at 2:52 PM, Jamie Lawrence <jlawrence@squaretrade.com> wrote:

Hi Jim,

I don't have any targeted suggestions, because there isn't much to latch on to. I can say Gluster replica three (no arbiters) on dedicated servers serving a couple Ovirt VM clusters here have not had these sorts of issues.

I suspect your long heal times (and the resultant long periods of high load) are at least partly related to 1G networking. That is just a matter of IO - heals of VMs involve moving a lot of bits. My cluster uses 10G bonded NICs on the gluster and ovirt boxes for storage traffic and separate bonded 1G for ovirtmgmt and communication with other machines/people, and we're occasionally hitting the bandwidth ceiling on the storage network. I'm starting to think about 40/100G, different ways of splitting up intensive systems, and considering iSCSI for specific volumes, although I really don't want to go there.

I don't run FreeNAS[1], but I do run FreeBSD as storage servers for their excellent ZFS implementation, mostly for backups. ZFS will make your `heal` problem go away, but not your bandwidth problems, which become worse (because of fewer NICs pushing traffic). 10G hardware is not exactly in impulse-buy territory, but if you can, I'd recommend doing some testing using it. I think at least some of your problems are related.

If that's not possible, my next stops would be optimizing everything I could about sharding, healing and optimizing for serving the shard size to squeeze as much performance out of 1G as I could, but that will only go so far.

-j

[1] FreeNAS is just a storage-tuned FreeBSD with a GUI.
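(Before tuning anything around sharding, the current settings can be read without changing the volume; this assumes the shard translator is enabled, as oVirt-created volumes usually have it.)

# show the shard-related options currently in effect for the volume
gluster volume get data-hdd all | grep -i shard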

Just to add my .02 here. I've opened a bug on this issue where HV/hosts connected to glusterfs volumes are running out of RAM. This seemed to be a bug fixed in gluster 3.13, but that patch doesn't seem to be available any longer, and 3.12 is what ovirt is using. For example, I have a host that was showing 72% memory consumption with 3 VMs running on it. If I migrate those VMs to another host, memory consumption drops to 52%. If I put this host into maintenance and then activate it, it drops down to 2% or so. Since I ran into this issue I've been manually watching memory consumption on each host and migrating VMs from it to others to keep things from dying. I'm hoping that with the announcement of gluster 3.12 end of life and the move to gluster 4.1 this will get fixed, or that the patch from 3.13 can get backported, so this problem will go away.

https://bugzilla.redhat.com/show_bug.cgi?id=1593826
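(Until a fix lands, one cheap way to keep an eye on the leak is to track the resident size of the gluster mount processes on each host; the 60-second interval below is arbitrary.)

# largest glusterfs (client/mount) processes by resident memory, refreshed every 60s
watch -n 60 'ps -C glusterfs -o pid,rss,vsz,etime,args --sort=-rss | head -n 5'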

See the response about the bug at https://lists.ovirt.org/archives/list/users@ovirt.org/thread/WRYEBOLNHJZGKKJ... which seems to indicate the referenced bug is fixed in 3.12.2 and higher. Could you attach the statedump of the process to the bug https://bugzilla.redhat.com/show_bug.cgi?id=1593826 as requested?
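(In case it helps anyone else attach data to that bug, a statedump can be captured roughly as follows; this assumes the default dump directory of /var/run/gluster, and the PID placeholder is whichever gluster process is growing on your host.)

# brick (server-side) processes:
gluster volume statedump data-hdd
# FUSE mount (client-side) process: send it SIGUSR1
kill -USR1 <pid-of-glusterfs-mount-process>
# dumps are written under the statedump directory
ls -lt /var/run/gluster/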

I encountered this after upgrading clients to 3.12.9 as well. It's not present in 3.12.8 or 3.12.6. I've added some data I had to that bug and can produce more if needed. I forgot to mention that my server cluster is at 3.12.9 and is not showing any problems; it's just the clients. A test cluster on 3.12.11 also shows it, just more slowly because it has fewer clients on it.
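For anyone wanting to confirm whether a host is on the affected client release, something along these lines should work on CentOS/RHEL; the package names are the stock glusterfs client packages, and the downgrade step is only a sketch that assumes the 3.12.8 packages are still available in an enabled repo:

# show installed gluster packages and versions
rpm -qa | grep glusterfs

# roll client packages back to 3.12.8, only if that version is still in your repos
yum downgrade glusterfs-3.12.8 glusterfs-fuse-3.12.8 glusterfs-libs-3.12.8 glusterfs-client-xlators-3.12.8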
From: Sahina Bose <sabose@redhat.com>
Subject: [ovirt-users] Re: Ovirt cluster unstable; gluster to blame (again)
Date: July 9, 2018 at 10:42:15 AM CDT
To: Edward Clay; Jim Kusznir
Cc: users
see response about the bug at https://lists.ovirt.org/archives/list/users@ovirt.org/thread/WRYEBOLNHJZGKKJUNF77TJ7WMBS66ZYK/ which seems to indicate the referenced bug is fixed in 3.12.2 and higher.
Could you attach the statedump of the process to the bug https://bugzilla.redhat.com/show_bug.cgi?id=1593826 as requested?
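For reference, a statedump can be gathered roughly as follows; the volume name comes from this thread, the PID placeholder is an assumption you have to fill in yourself, and the dump files land under /var/run/gluster by default:

# statedump of the brick (glusterfsd) processes of the leaking volume
gluster volume statedump data-hdd

# statedump of a fuse client (glusterfs) process: send SIGUSR1 to its PID
kill -USR1 <glusterfs-pid>

# collect the resulting dump files
ls -ltr /var/run/gluster/*dump*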
On Mon, Jul 9, 2018 at 8:38 PM, Edward Clay <edward.clay@uk2group.com> wrote: Just to add my .02 here. I've opened a bug on this issue where HV/hosts connected to glusterfs volumes are running out of RAM. This seemed to be a bug fixed in gluster 3.13, but that patch doesn't seem to be available any longer and 3.12 is what ovirt is using. For example, I have a host that was showing 72% memory consumption with 3 VMs running on it. If I migrate those VMs to another host, memory consumption drops to 52%. If I put this host into maintenance and then activate it, it drops down to 2% or so. Since I ran into this issue I've been manually watching memory consumption on each host and migrating VMs from it to others to keep things from dying. I'm hoping, with the announcement of gluster 3.12 end of life and the move to gluster 4.1, that this will get fixed or that the patch from 3.13 can get backported so this problem will go away.
https://bugzilla.redhat.com/show_bug.cgi?id=1593826
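Until a fixed build is available, a crude watch on the gluster processes can at least warn before the OOM killer steps in. A minimal sketch, assuming an 8 GiB RSS threshold and a working local mail setup (both are examples, not anyone's actual configuration):

# show RSS (KiB) of all gluster processes, largest first
ps -o pid,rss,comm -C glusterfs,glusterfsd --sort=-rss

# from cron every few minutes: mail a warning when any gluster process exceeds ~8 GiB RSS
ps -o rss= -C glusterfs,glusterfsd | awk '$1 > 8*1024*1024 {exit 1}' || \
  echo "gluster RSS over threshold on $(hostname)" | mail -s "gluster memory warning" root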
On 07/07/2018 11:49 AM, Jim Kusznir wrote:
This host has NO VMs running on it, only 3 running cluster-wide (including the engine, which is on its own storage):
top - 10:44:41 up 1 day, 17:10, 1 user, load average: 15.86, 14.33, 13.39
Tasks: 381 total, 1 running, 379 sleeping, 1 stopped, 0 zombie
%Cpu(s): 2.7 us, 2.1 sy, 0.0 ni, 89.0 id, 6.1 wa, 0.0 hi, 0.2 si, 0.0 st
KiB Mem : 32764284 total, 338232 free, 842324 used, 31583728 buff/cache
KiB Swap: 12582908 total, 12258660 free, 324248 used. 31076748 avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
13279 root 20 0 2380708 37628 4396 S 51.7 0.1 3768:03 glusterfsd
13273 root 20 0 2233212 20460 4380 S 17.2 0.1 105:50.44 glusterfsd
13287 root 20 0 2233212 20608 4340 S 4.3 0.1 34:27.20 glusterfsd
16205 vdsm 0 -20 5048672 88940 13364 S 1.3 0.3 0:32.69 vdsmd
16300 vdsm 20 0 608488 25096 5404 S 1.3 0.1 0:05.78 python
1109 vdsm 20 0 3127696 44228 8552 S 0.7 0.1 18:49.76 ovirt-ha-broker
25555 root 20 0 0 0 0 S 0.7 0.0 0:00.13 kworker/u64:3
10 root 20 0 0 0 0 S 0.3 0.0 4:22.36 rcu_sched
572 root 0 -20 0 0 0 S 0.3 0.0 0:12.02 kworker/1:1H
797 root 20 0 0 0 0 S 0.3 0.0 1:59.59 kdmwork-253:2
877 root 0 -20 0 0 0 S 0.3 0.0 0:11.34 kworker/3:1H
1028 root 20 0 0 0 0 S 0.3 0.0 0:35.35 xfsaild/dm-10
1869 root 20 0 1496472 10540 6564 S 0.3 0.0 2:15.46 python
3747 root 20 0 0 0 0 D 0.3 0.0 0:01.21 kworker/u64:1
10979 root 15 -5 723504 15644 3920 S 0.3 0.0 22:46.27 glusterfs
15085 root 20 0 680884 10792 4328 S 0.3 0.0 0:01.13 glusterd
16102 root 15 -5 1204216 44948 11160 S 0.3 0.1 0:18.61 supervdsmd
At the moment, the engine is barely usable, my other VMs appear to be unresponsive. Two on one host, one on another, and none on the third.
On Sat, Jul 7, 2018 at 10:38 AM, Jim Kusznir <jim@palousetech.com> wrote: I run 4-7 VMs, and most of them are 2GB ram. I have 2 VMs with 4GB.
Ram hasn't been an issue until recent ovirt/gluster upgrades. Storage has always been slow, especially with these drives. However, even watching network utilization on my switch, the gig-e links never max out.
The loadavg issues and unresponsive behavior started with yesterday's ovirt updates. I now have one VM with low I/O that lives on a separate storage volume (data, fully SSD-backed instead of data-hdd, which was having the issues). I moved it to an ovirt host with no other VMs on it, and that had freshly been rebooted. Before it had this one VM on it, loadavg was >0.5. Now it's up in the 20's, with only one low-disk-I/O, 4GB-RAM VM on the host.
This to me says there's now a new problem separate from Gluster. I don't have any non-gluster storage available to test with. I did notice that the last update included a new kernel, and it appears it's the qemu-kvm processes that are consuming way more CPU than they used to.
Are there any known issues? I'm going to reboot into my previous kernel to see if it's kernel-caused.
--Jim
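For the record, dropping back to the previous kernel on CentOS 7 can be done roughly like this; index 1 is usually the previous kernel, but verify with the listing first, and treat this as a sketch rather than something taken from the thread:

# list installed kernels in GRUB menu order (0 = newest)
awk -F\' '/^menuentry /{print i++ " : " $2}' /etc/grub2.cfg

# make the previous kernel the default, then reboot
grub2-set-default 1
reboot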
On Fri, Jul 6, 2018 at 11:07 PM, Johan Bernhardsson <johan@kafit.se> wrote: That is a single SATA drive that is slow on random I/O and that has to be synced with 2 other servers. Gluster works synchronously, so one write has to be written and acknowledged on all three nodes.
So you have a bottleneck in I/O on the drives and another on the network, and depending on how many virtual servers you have and how much RAM they take, you might have a memory bottleneck as well.
Load spikes when you have a wait somewhere and are overusing capacity. It's not only CPU that load is counted on; it is waiting for resources, so it can be memory or network or drives.
How many virtual servers do you run and how much RAM do they consume?
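A quick way to see whether load of this kind comes from tasks stuck in I/O rather than from CPU is to watch the blocked-task and iowait figures, for example (iostat comes from the sysstat package, which may need installing):

# 'b' column = processes blocked on I/O, 'wa' = iowait percentage
vmstat 5

# per-device utilisation and latency, refreshed every 5 seconds
iostat -xm 5

# list processes currently stuck in uninterruptible (D) sleep
ps -eo pid,state,wchan:32,comm | awk '$2 == "D"'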
On July 7, 2018 09:51:42 Jim Kusznir <jim@palousetech.com> wrote:
In case it matters, the data-hdd gluster volume uses these hard drives:
https://www.amazon.com/gp/product/B01M1NHCZT/ref=oh_aui_detailpage_o05_s00?ie=UTF8&psc=1
This is in a Dell R610 with PERC6/i (one drive per server, configured as a single-drive volume to pass it through as its own /dev/sd* device). Inside the OS, it's partitioned with lvm_thin, then an lvm volume formatted with XFS and mounted as /gluster/brick3, with the data-hdd volume created inside that.
--Jim
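For anyone reproducing that brick layout, it corresponds roughly to the following; the device name, sizes and volume-group names are assumptions, and the XFS inode size of 512 is the usual gluster recommendation:

pvcreate /dev/sdb
vgcreate gluster_vg /dev/sdb
lvcreate -l 100%FREE --thinpool gluster_pool gluster_vg
lvcreate -V 900G --thin -n data_hdd gluster_vg/gluster_pool
mkfs.xfs -i size=512 /dev/gluster_vg/data_hdd
mkdir -p /gluster/brick3
mount /dev/gluster_vg/data_hdd /gluster/brick3   # plus an fstab entry to make it persistent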
On Fri, Jul 6, 2018 at 10:45 PM, Jim Kusznir <jim@palousetech.com> wrote: So, I'm still at a loss... It sounds like it's either insufficient ram/swap, or insufficient network. It seems to be neither now. At this point, it appears that gluster is just "broke" and killing my systems for no discernible reason. Here are the details, all from the same system (currently running 3 VMs):
[root@ovirt3 ~]# w
 22:26:53 up 36 days, 4:34, 1 user, load average: 42.78, 55.98, 53.31
USER     TTY      FROM            LOGIN@   IDLE   JCPU   PCPU WHAT
root     pts/0    192.168.8.90    22:26    2.00s  0.12s  0.11s w
bwm-ng reports the highest data usage was about 6MB/s during this test (and that was combined; I have two different gig networks. One gluster network (primary VM storage) runs on one, the other network handles everything else).
[root@ovirt3 ~]# free -m
        total    used    free   shared  buff/cache  available
Mem:    31996   13236     232       18       18526      18195
Swap:   16383    1475   14908
top - 22:32:56 up 36 days, 4:41, 1 user, load average: 17.99, 39.69, 47.66
Tasks: 407 total, 1 running, 405 sleeping, 1 stopped, 0 zombie
%Cpu(s): 8.6 us, 2.1 sy, 0.0 ni, 87.6 id, 1.6 wa, 0.0 hi, 0.1 si, 0.0 st
KiB Mem : 32764284 total, 228296 free, 13541952 used, 18994036 buff/cache
KiB Swap: 16777212 total, 15246200 free, 1531012 used. 18643960 avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
30036 qemu 20 0 6872324 5.2g 13532 S 144.6 16.5 216:14.55 /usr/libexec/qemu-kvm -name guest=BillingWin,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/v+
28501 qemu 20 0 5034968 3.6g 12880 S 16.2 11.7 73:44.99 /usr/libexec/qemu-kvm -name guest=FusionPBX,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/va+
2694 root 20 0 2169224 12164 3108 S 5.0 0.0 3290:42 /usr/sbin/glusterfsd -s ovirt3.nwfiber.com --volfile-id data.ovirt3.nwfiber.com.gluster-brick2-data -p /var/run/+
14293 root 15 -5 944700 13356 4436 S 4.0 0.0 16:32.15 /usr/sbin/glusterfs --volfile-server=192.168.8.11 --volfile-server=192.168.8.12 --volfile-server=192.168.8.13 --+
25100 vdsm 0 -20 6747440 107868 12836 S 2.3 0.3 21:35.20 /usr/bin/python2 /usr/share/vdsm/vdsmd
28971 qemu 20 0 2842592 1.5g 13548 S 1.7 4.7 241:46.49 /usr/libexec/qemu-kvm -name guest=unifi.palousetech.com,debug-threads=on -S -object secret,id=masterKey0,format=+
12095 root 20 0 162276 2836 1868 R 1.3 0.0 0:00.25 top
2708 root 20 0 1906040 12404 3080 S 1.0 0.0 1083:33 /usr/sbin/glusterfsd -s ovirt3.nwfiber.com --volfile-id engine.ovirt3.nwfiber.com.gluster-brick1-engine -p /var/+
28623 qemu 20 0 4749536 1.7g 12896 S 0.7 5.5 4:30.64 /usr/libexec/qemu-kvm -name guest=billing.nwfiber.com,debug-threads=on -S -object secret,id=masterKey0,format=ra+
10 root 20 0 0 0 0 S 0.3 0.0 215:54.72 [rcu_sched]
1030 sanlock rt 0 773804 27908 2744 S 0.3 0.1 35:55.61 /usr/sbin/sanlock daemon
1890 zabbix 20 0 83904 1696 1612 S 0.3 0.0 24:30.63 /usr/sbin/zabbix_agentd: collector [idle 1 sec]
2722 root 20 0 1298004 6148 2580 S 0.3 0.0 38:10.82 /usr/sbin/glusterfsd -s ovirt3.nwfiber.com --volfile-id iso.ovirt3.nwfiber.com.gluster-brick4-iso -p /var/run/gl+
6340 root 20 0 0 0 0 S 0.3 0.0 0:04.30 [kworker/7:0]
10652 root 20 0 0 0 0 S 0.3 0.0 0:00.23 [kworker/u64:2]
14724 root 20 0 1076344 17400 3200 S 0.3 0.1 10:04.13 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/run/gluster/glustershd/glustershd.pid -+
22011 root 20 0 0 0 0 S 0.3 0.0 0:05.04 [kworker/10:1]
Not sure why the system load dropped other than I was trying to take a picture of it :)
In any case, it appears that at this time, I have plenty of swap, ram, and network capacity, and yet things are still running very sluggish; I'm still getting e-mails from servers complaining about loss of communication with something or another; I still get e-mails from the engine about bad engine status, then recovery, etc.
I've shut down 2/3 of my VMs, too....just trying to keep the critical ones operating.
At this point, I don't believe the problem is the memory leak, but it seems to be triggered by the memory leak, as in all my problems started when I got low ram warnings from one of my 3 nodes and began recovery efforts from that.
I do really like the idea / concept behind glusterfs, but I really have to figure out why it's been performing so poorly from day one, and why it's caused 95% of my outages (including several large ones lately). If I can get it stable, reliable, and well performing, then I'd love to keep it. If I can't, then perhaps NFS is the way to go? I don't like the single-point-of-failure aspect of it, but my other NAS boxes I run for clients (central storage for windows boxes) have been very solid; if I could get that kind of reliability for my ovirt stack, it would be a substantial improvement. Currently, it seems about every other month I have a gluster-induced outage.
Sometimes I wonder if it's just hyperconverged that is the issue, but my infrastructure doesn't justify three servers at the same location... I might be able to do two, but even that seems like it's pushing it.
Looks like I can upgrade to 10G for about $900. I can order a dual-Xeon supermicro 12-disk server, loaded with 2TB WD Enterprise disks and a pair of SSDs for the OS, 32GB RAM, 2.67GHz CPUs for about $720 delivered. I've got to do something to improve my reliability; I can't keep going the way I have been....
--Jim
On Fri, Jul 6, 2018 at 9:13 PM, Johan Bernhardsson <johan@kafit.se> wrote: Load like that is mostly I/O based: either the machine is swapping or the network is too slow. Check I/O wait in top.
And the problem where you get the OOM killer to kill off gluster: that means you don't monitor RAM usage on the servers? Either it's eating all your RAM and swap gets really I/O intensive before it is killed off, or you have the wrong swap settings in sysctl.conf (there are tons of broken guides that recommend swappiness of 0, but that disables swap on newer kernels; the proper swappiness for swapping only when necessary is 1, or a sufficiently low number like 10, and the default is 60).
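In concrete terms, the swappiness Johan refers to is set like this (10 is just an example of a low-but-nonzero value):

# check the current value
sysctl vm.swappiness

# set it for the running system
sysctl -w vm.swappiness=10

# make it persistent across reboots
echo "vm.swappiness = 10" > /etc/sysctl.d/99-swappiness.conf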
Moving to nfs will not improve things. You will get more memory since gluster isn't running and that is good. But you will have a single node that can fail with all your storage and it would still be on 1 gigabit only and your three node cluster would easily saturate that link.
On July 7, 2018 04:13:13 Jim Kusznir <jim@palousetech.com> wrote:
So far it does not appear to be helping much. I'm still getting VMs locking up and all kinds of notices from the ovirt engine about non-responsive hosts. I'm still seeing load averages in the 20-30 range.
Jim
On Fri, Jul 6, 2018, 3:13 PM Jim Kusznir <jim@palousetech.com> wrote: Thank you for the advice and help.
I do plan on going 10Gbps networking; haven't quite jumped off that cliff yet, though.
I did put my data-hdd (main VM storage volume) onto a dedicated 1Gbps network, and I've watched throughput on that and never seen more than 60MB/s achieved (as reported by bwm-ng). I have a separate 1Gbps network for communication and ovirt migration, but I wanted to break that up further (separate out VM traffic from migration/mgmt traffic). My three SSD-backed gluster volumes run on the main network too, as I haven't been able to get them to move to the new network (which I was trying to use as all gluster). I tried bonding, but that seemed to reduce performance rather than improve it.
--Jim
On Fri, Jul 6, 2018 at 2:52 PM, Jamie Lawrence <jlawrence@squaretrade.com> wrote: Hi Jim,
I don't have any targeted suggestions, because there isn't much to latch on to. I can say Gluster replica three (no arbiters) on dedicated servers serving a couple Ovirt VM clusters here have not had these sorts of issues.
I suspect your long heal times (and the resultant long periods of high load) are at least partly related to 1G networking. That is just a matter of IO - heals of VMs involve moving a lot of bits. My cluster uses 10G bonded NICs on the gluster and ovirt boxes for storage traffic and separate bonded 1G for ovirtmgmt and communication with other machines/people, and we're occasionally hitting the bandwidth ceiling on the storage network. I'm starting to think about 40/100G, different ways of splitting up intensive systems, and considering iSCSI for specific volumes, although I really don't want to go there.
I don't run FreeNAS[1], but I do run FreeBSD as storage servers for their excellent ZFS implementation, mostly for backups. ZFS will make your `heal` problem go away, but not your bandwidth problems, which become worse (because of fewer NICS pushing traffic). 10G hardware is not exactly in the impulse-buy territory, but if you can, I'd recommend doing some testing using it. I think at least some of your problems are related.
If that's not possible, my next stops would be optimizing everything I could about sharding, healing and optimizing for serving the shard size to squeeze as much performance out of 1G as I could, but that will only go so far.
-j
[1] FreeNAS is just a storage-tuned FreeBSD with a GUI.
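Jamie's sharding and heal-tuning suggestion maps onto volume options roughly like the following; the values are examples only, not recommendations from this thread, and note that a changed shard block size applies only to newly created files:

# use full-file heal for VM images and let the self-heal daemon work on more entries in parallel
gluster volume set data-hdd cluster.data-self-heal-algorithm full
gluster volume set data-hdd cluster.shd-max-threads 2

# sharding keeps heals per-shard instead of per-image; the block size is tunable
gluster volume set data-hdd features.shard on
gluster volume set data-hdd features.shard-block-size 64MB

# watch heal progress
gluster volume heal data-hdd info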

I see also that over the last 4 or 5 weeks (after I upgraded from 4.1 to 4.2) I have had to go and refresh the servers (maintenance, reboot) almost every week to release the RAM. If I leave them, the RAM will eventually be depleted by the gluster services. I am running gluster 3.12.9-1 with ovirt 4.2.4.5-1.el7.

Alex

Hi Jim,

Just to throw my 2 cents in, one of my clusters is very similar to yours, & I'm not having any of the issues you complain about. One thing I would strongly recommend you do, however, is bond your NICs with LACP 802.3ad - either 2x1Gbit for oVirt & 2x1Gbit for Gluster, or bond all of your NICs together & separate the storage & management networks with VLANs. Swap should generally be avoided today; RAM is cheap.

4x R710s, each w/:
* 96GB RAM
* 6x 7.2k SATA HDDs in RAID 0 (Gluster dist-rep 2+1 arb)
* 2x USB sticks in RAID 1 (CentOS)
* 4x 1Gbit, bonded with LACP/802.3ad/mode 4

[root@v0 ~]# gluster volume info data

Volume Name: data
Type: Distributed-Replicate
Volume ID: bded65c7-e79e-4bc9-9630-36a69ad2e684
Status: Started
Snapshot Count: 0
Number of Bricks: 3 x (2 + 1) = 6
Transport-type: tcp
Bricks:
Brick1: s0:/gluster/data/brick
Brick2: s1:/gluster/data/brick
Brick3: s2:/gluster/data/arbiter (arbiter)
Brick4: s2:/gluster/data/brick
Brick5: s3:/gluster/data/brick
Brick6: s0:/gluster/data/arbiter (arbiter)
Options Reconfigured:
performance.readdir-ahead: on
transport.address-family: inet
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
cluster.eager-lock: enable
network.remote-dio: enable
cluster.quorum-type: auto
cluster.server-quorum-type: server
features.shard: on
cluster.data-self-heal-algorithm: full
storage.owner-uid: 36
storage.owner-gid: 36
server.allow-insecure: on
network.ping-timeout: 30
features.shard-block-size: 512MB
performance.low-prio-threads: 32
cluster.data-self-heal: on
cluster.metadata-self-heal: on
cluster.entry-self-heal: on
cluster.granular-entry-heal: enable
features.lock-heal: on

I've got about 30 VMs running on this setup without issue. Upgrading to 10Gbit will be the next step; however, I/O is generally nicely balanced across all of the NICs, so it's rarely an issue. Considering none of this kit is particularly new or high performance, it's not been overly noticeable, as all aspects of the load & I/O are very evenly distributed.
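As a concrete starting point for the LACP suggestion, a bond can be built on CentOS 7 with nmcli roughly as follows; the interface names, the address (borrowed from the gluster network mentioned earlier in the thread), and the bond options are assumptions, the switch ports must be configured for 802.3ad as well, and bonds carrying oVirt logical networks are normally set up through the engine's host network UI rather than by hand:

nmcli con add type bond ifname bond0 con-name bond0 bond.options "mode=802.3ad,miimon=100"
nmcli con add type ethernet ifname em1 master bond0 con-name bond0-em1
nmcli con add type ethernet ifname em2 master bond0 con-name bond0-em2
nmcli con mod bond0 ipv4.method manual ipv4.addresses 192.168.8.11/24
nmcli con up bond0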
I see also that the last 4 or 5 weeks (after I upgraded from 4.1 to 4.2) I have to almost every week go and refresh the servers (maintenance, reboot) to release the RAM. If I leave them the RAM ill be eventually depleted from gluster services. I am running gluster 3.12.9-1 with ovirt 4.2.4.5-1.el7.
Alex
On Mon, Jul 9, 2018 at 6:08 PM, Edward Clay <edward.clay@uk2group.com> wrote:
Just to add my .02 here. I've opened a bug on this issue where HV/host connected to clusterfs volumes are running out of ram. This seemed to be a bug fixed in gluster 3.13 but that patch doesn't seem to be avaiable any longer and 3.12 is what ovirt is using. For example I have a host that was showing 72% of memory consumption with 3 VMs running on it. If I migrate those VMs to another Host memory consumption drops to 52%. If i put this host into maintenance and then activate it it drops down to 2% or so. Since I ran into this issue I've been manually watching memory consumption on each host and migrating VMs from it to others to keep things from dying. I'm hoping with the announcement of gluster 3.12 end of life and the move to gluster 4.1 that this will get fixed or that the patch from 3.13 can get backported so this problem will go away.
https://bugzilla.redhat.com/show_bug.cgi?id=1593826
On 07/07/2018 11:49 AM, Jim Kusznir wrote:
**Security Notice - This external email is NOT from The Hut Group**
This host has NO VMs running on it, only 3 running cluster-wide (including the engine, which is on its own storage):
top - 10:44:41 up 1 day, 17:10, 1 user, load average: 15.86, 14.33, 13.39 Tasks: 381 total, 1 running, 379 sleeping, 1 stopped, 0 zombie %Cpu(s): 2.7 us, 2.1 sy, 0.0 ni, 89.0 id, 6.1 wa, 0.0 hi, 0.2 si, 0.0 st KiB Mem : 32764284 total, 338232 free, 842324 used, 31583728 buff/cache KiB Swap: 12582908 total, 12258660 free, 324248 used. 31076748 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
13279 root 20 0 2380708 37628 4396 S 51.7 0.1 3768:03 glusterfsd
13273 root 20 0 2233212 20460 4380 S 17.2 0.1 105:50.44 glusterfsd
13287 root 20 0 2233212 20608 4340 S 4.3 0.1 34:27.20 glusterfsd
16205 vdsm 0 -20 5048672 88940 13364 S 1.3 0.3 0:32.69 vdsmd
16300 vdsm 20 0 608488 25096 5404 S 1.3 0.1 0:05.78 python
1109 vdsm 20 0 3127696 44228 8552 S 0.7 0.1 18:49.76 ovirt-ha-broker
25555 root 20 0 0 0 0 S 0.7 0.0 0:00.13 kworker/u64:3
10 root 20 0 0 0 0 S 0.3 0.0 4:22.36 rcu_sched
572 root 0 -20 0 0 0 S 0.3 0.0 0:12.02 kworker/1:1H
797 root 20 0 0 0 0 S 0.3 0.0 1:59.59 kdmwork-253:2
877 root 0 -20 0 0 0 S 0.3 0.0 0:11.34 kworker/3:1H
1028 root 20 0 0 0 0 S 0.3 0.0 0:35.35 xfsaild/dm-10
1869 root 20 0 1496472 10540 6564 S 0.3 0.0 2:15.46 python
3747 root 20 0 0 0 0 D 0.3 0.0 0:01.21 kworker/u64:1
10979 root 15 -5 723504 15644 3920 S 0.3 0.0 22:46.27 glusterfs
15085 root 20 0 680884 10792 4328 S 0.3 0.0 0:01.13 glusterd
16102 root 15 -5 1204216 44948 11160 S 0.3 0.1 0:18.61 supervdsmd
At the moment, the engine is barely usable, my other VMs appear to be unresponsive. Two on one host, one on another, and none on the third.
On Sat, Jul 7, 2018 at 10:38 AM, Jim Kusznir <jim@palousetech.com> wrote:
I run 4-7 VMs, and most of them are 2GB ram. I have 2 VMs with 4GB.
Ram hasn't been an issue until recent ovirt/gluster upgrades. Storage has always been slow, especially with these drives. However, even watching network utilization on my switch, the gig-e links never max out.
The loadavg issues and unresponsive behavior started with yesterday's ovirt updates. I now have one VM with low I/O that lives on a separate storage volume (data, fully SSD backed instead of data-hdd, which was having the issues). I moved it to a ovirt host with no other VMs on it, and that had reshly been rebooted. Before it had this one VM on it, loadavg was >0.5. Now its up in the 20's, with only one low Disk I/O, 4GB ram VM on the host.
This to me says there's now a new problem separate from Gluster. I don't have any non-gluster storage available to test with. I did notice that the last update included a new kernel, and it appears its the qemu-kvm processes that are consuming way more CPU than they used to now.
Are there any known issues? I'm going to reboot into my previous kernel to see if its kernel-caused.
--Jim
On Fri, Jul 6, 2018 at 11:07 PM, Johan Bernhardsson <johan@kafit.se> wrote:
That is a single sata drive that is slow on random I/O and that has to be synced with 2 other servers. Gluster works syncronous so one write has to be written and acknowledged on all the three nodes.
So you have a bottle neck in io on drives and one on network and depending on how many virtual servers you have and how much ram they take you might have memory.
Load spikes when you have a wait somewhere and are overusing capacity. But it's now only CPU that load is counted on. It is waiting for resources so it can be memory or Network or drives.
How many virtual server do you run and how much ram do they consume?
On July 7, 2018 09:51:42 Jim Kusznir <jim@palousetech.com> wrote:
In case it matters, the data-hdd gluster volume uses these hard drives:
https://www.amazon.com/gp/product/B01M1NHCZT/ref=oh_aui_deta ilpage_o05_s00?ie=UTF8&psc=1
This is in a Dell R610 with PERC6/i (one drive per server, configured as a single drive volume to pass it through as its own /dev/sd* device). Inside the OS, its partitioned with lvm_thin, then an lvm volume formatted with XFS and mounted as /gluster/brick3, with the data-hdd volume created inside that.
--Jim
On Fri, Jul 6, 2018 at 10:45 PM, Jim Kusznir <jim@palousetech.com> wrote:
So, I'm still at a loss...It sounds like its either insufficient ram/swap, or insufficient network. It seems to be neither now. At this point, it appears that gluster is just "broke" and killing my systems for no descernable reason. Here's detals, all from the same system (currently running 3 VMs):
[root@ovirt3 ~]# w 22:26:53 up 36 days, 4:34, 1 user, load average: 42.78, 55.98, 53.31 USER TTY FROM LOGIN@ IDLE JCPU PCPU WHAT root pts/0 192.168.8.90 22:26 2.00s 0.12s 0.11s w
bwm-ng reports the highest data usage was about 6MB/s during this test (and that was combined; I have two different gig networks. One gluster network (primary VM storage) runs on one, the other network handles everything else).
[root@ovirt3 ~]# free -m total used free shared buff/cache available Mem: 31996 13236 232 18 18526 18195 Swap: 16383 1475 14908
top - 22:32:56 up 36 days, 4:41, 1 user, load average: 17.99, 39.69, 47.66 Tasks: 407 total, 1 running, 405 sleeping, 1 stopped, 0 zombie %Cpu(s): 8.6 us, 2.1 sy, 0.0 ni, 87.6 id, 1.6 wa, 0.0 hi, 0.1 si, 0.0 st KiB Mem : 32764284 total, 228296 free, 13541952 used, 18994036 buff/cache KiB Swap: 16777212 total, 15246200 free, 1531012 used. 18643960 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
30036 qemu 20 0 6872324 5.2g 13532 S 144.6 16.5 216:14.55 /usr/libexec/qemu-kvm -name guest=BillingWin,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/v+ 28501 qemu 20 0 5034968 3.6g 12880 S 16.2 11.7 73:44.99 /usr/libexec/qemu-kvm -name guest=FusionPBX,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/va+ 2694 root 20 0 2169224 12164 3108 S 5.0 0.0 3290:42 /usr/sbin/glusterfsd -s ovirt3.nwfiber.com --volfile-id data.ovirt3.nwfiber.com.gluster-brick2-data -p /var/run/+ 14293 root 15 -5 944700 13356 4436 S 4.0 0.0 16:32.15 /usr/sbin/glusterfs --volfile-server=192.168.8.11 --volfile-server=192.168.8.12 --volfile-server=192.168.8.13 --+ 25100 vdsm 0 -20 6747440 107868 12836 S 2.3 0.3 21:35.20 /usr/bin/python2 /usr/share/vdsm/vdsmd
28971 qemu 20 0 2842592 1.5g 13548 S 1.7 4.7 241:46.49 /usr/libexec/qemu-kvm -name guest=unifi.palousetech.com,debug-threads=on -S -object secret,id=masterKey0,format=+ 12095 root 20 0 162276 2836 1868 R 1.3 0.0 0:00.25 top
2708 root 20 0 1906040 12404 3080 S 1.0 0.0 1083:33 /usr/sbin/glusterfsd -s ovirt3.nwfiber.com --volfile-id engine.ovirt3.nwfiber.com.gluster-brick1-engine -p /var/+ 28623 qemu 20 0 4749536 1.7g 12896 S 0.7 5.5 4:30.64 /usr/libexec/qemu-kvm -name guest=billing.nwfiber.com,debug-threads=on -S -object secret,id=masterKey0,format=ra+ 10 root 20 0 0 0 0 S 0.3 0.0 215:54.72 [rcu_sched]
1030 sanlock rt 0 773804 27908 2744 S 0.3 0.1 35:55.61 /usr/sbin/sanlock daemon
1890 zabbix 20 0 83904 1696 1612 S 0.3 0.0 24:30.63 /usr/sbin/zabbix_agentd: collector [idle 1 sec]
2722 root 20 0 1298004 6148 2580 S 0.3 0.0 38:10.82 /usr/sbin/glusterfsd -s ovirt3.nwfiber.com --volfile-id iso.ovirt3.nwfiber.com.gluster-brick4-iso -p /var/run/gl+ 6340 root 20 0 0 0 0 S 0.3 0.0 0:04.30 [kworker/7:0]
10652 root 20 0 0 0 0 S 0.3 0.0 0:00.23 [kworker/u64:2]
14724 root 20 0 1076344 17400 3200 S 0.3 0.1 10:04.13 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/run/gluster/glustershd/glustershd.pid -+ 22011 root 20 0 0 0 0 S 0.3 0.0 0:05.04 [kworker/10:1]
Not sure why the system load dropped other than I was trying to take a picture of it :)
In any case, it appears that at this time I have plenty of swap, RAM, and network capacity, and yet things are still running very sluggishly; I'm still getting e-mails from servers complaining about loss of communication with something or other, and I still get e-mails from the engine about bad engine status, then recovery, etc.
I've shut down 2/3 of my VMs, too....just trying to keep the critical ones operating.
At this point, I don't believe the problem is the memory leak itself, but it seems to have been triggered by it: all my problems started when I got low-RAM warnings from one of my 3 nodes and began recovery efforts from that.
I do really like the idea / concept behind glusterfs, but I really have to figure out why it's performed so poorly from day one and why it's caused 95% of my outages (including several large ones lately). If I can get it stable, reliable, and well performing, then I'd love to keep it. If I can't, then perhaps NFS is the way to go? I don't like the single-point-of-failure aspect of it, but the other NAS boxes I run for clients (central storage for Windows boxes) have been very solid; if I could get that kind of reliability for my ovirt stack, it would be a substantial improvement. Currently, it seems about every other month I have a gluster-induced outage.
Sometimes I wonder if hyperconverged itself is the issue, but my infrastructure doesn't justify three servers at the same location... I might be able to do two, but even that seems like it's pushing it.
Looks like I can upgrade to 10G for about $900. I can order a dual-Xeon Supermicro 12-disk server, loaded with 2TB WD Enterprise disks and a pair of SSDs for the OS, 32GB RAM, 2.67GHz CPUs, for about $720 delivered. I've got to do something to improve my reliability; I can't keep going the way I have been....
--Jim
-- Doug

Has there been any word about ovirt-node being updated to gluster 3.13.x? There are known issues with 3.12.x where gluster crashes on boot because networking is not yet available.

On Sat, Jul 7, 2018 at 8:45 AM, Jim Kusznir <jim@palousetech.com> wrote:
So, I'm still at a loss... It sounds like it's either insufficient RAM/swap or insufficient network. It seems to be neither now. At this point, it appears that gluster is just "broke" and killing my systems for no discernible reason. Here are details, all from the same system (currently running 3 VMs):
[root@ovirt3 ~]# w 22:26:53 up 36 days, 4:34, 1 user, load average: 42.78, 55.98, 53.31 USER TTY FROM LOGIN@ IDLE JCPU PCPU WHAT root pts/0 192.168.8.90 22:26 2.00s 0.12s 0.11s w
bwm-ng reports the highest data usage was about 6MB/s during this test (and that was combined; I have two different gig networks. One gluster network (primary VM storage) runs on one, the other network handles everything else).
[root@ovirt3 ~]# free -m total used free shared buff/cache available Mem: 31996 13236 232 18 18526 18195 Swap: 16383 1475 14908
top - 22:32:56 up 36 days, 4:41, 1 user, load average: 17.99, 39.69, 47.66
That is indeed a high load average. How many CPUs do you have, btw?
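(For reference, a quick way to check that on the host:)

  nproc
  lscpu | egrep '^CPU\(s\)|^Socket|^Core|^Thread'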
Tasks: 407 total, 1 running, 405 sleeping, 1 stopped, 0 zombie %Cpu(s): 8.6 us, 2.1 sy, 0.0 ni, 87.6 id, 1.6 wa, 0.0 hi, 0.1 si, 0.0 st KiB Mem : 32764284 total, 228296 free, 13541952 used, 18994036 buff/cache KiB Swap: 16777212 total, 15246200 free, 1531012 used. 18643960 avail Mem
Can you check what's swapping here? (a tweak to top output will show that)
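(One rough way on an EL7 box: in top itself you can press 'f' and enable the SWAP column, or sum it up from /proc like this:)

  # largest swap users first (by PID); map a PID to its name with: ps -p <pid> -o comm=
  grep VmSwap /proc/[0-9]*/status | sort -t: -k3 -n -r | head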
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
30036 qemu 20 0 6872324 5.2g 13532 S 144.6 16.5 216:14.55 /usr/libexec/qemu-kvm -name guest=BillingWin,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/v+ 28501 qemu 20 0 5034968 3.6g 12880 S 16.2 11.7 73:44.99 /usr/libexec/qemu-kvm -name guest=FusionPBX,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/va+ 2694 root 20 0 2169224 12164 3108 S 5.0 0.0 3290:42 /usr/sbin/glusterfsd -s ovirt3.nwfiber.com --volfile-id data.ovirt3.nwfiber.com.gluster-brick2-data -p /var/run/+
This one's certainly taking quite a bit of your CPU usage overall.
14293 root 15 -5 944700 13356 4436 S 4.0 0.0 16:32.15 /usr/sbin/glusterfs --volfile-server=192.168.8.11 --volfile-server=192.168.8.12 --volfile-server=192.168.8.13 --+
I'm not sure what the sorting order is, but doesn't look like Gluster is taking a lot of memory?
25100 vdsm 0 -20 6747440 107868 12836 S 2.3 0.3 21:35.20 /usr/bin/python2 /usr/share/vdsm/vdsmd
28971 qemu 20 0 2842592 1.5g 13548 S 1.7 4.7 241:46.49 /usr/libexec/qemu-kvm -name guest=unifi.palousetech.com,debug-threads=on -S -object secret,id=masterKey0,format=+ 12095 root 20 0 162276 2836 1868 R 1.3 0.0 0:00.25 top
2708 root 20 0 1906040 12404 3080 S 1.0 0.0 1083:33 /usr/sbin/glusterfsd -s ovirt3.nwfiber.com --volfile-id engine.ovirt3.nwfiber.com.gluster-brick1-engine -p /var/+ 28623 qemu 20 0 4749536 1.7g 12896 S 0.7 5.5 4:30.64 /usr/libexec/qemu-kvm -name guest=billing.nwfiber.com,debug-threads=on -S -object secret,id=masterKey0,format=ra+
The VMs I see here and above together account for most? (5.2+3.6+1.5+1.7 = 12GB) - still plenty of memory left.
10 root 20 0 0 0 0 S 0.3 0.0 215:54.72 [rcu_sched]
1030 sanlock rt 0 773804 27908 2744 S 0.3 0.1 35:55.61 /usr/sbin/sanlock daemon
1890 zabbix 20 0 83904 1696 1612 S 0.3 0.0 24:30.63 /usr/sbin/zabbix_agentd: collector [idle 1 sec]
2722 root 20 0 1298004 6148 2580 S 0.3 0.0 38:10.82 /usr/sbin/glusterfsd -s ovirt3.nwfiber.com --volfile-id iso.ovirt3.nwfiber.com.gluster-brick4-iso -p /var/run/gl+ 6340 root 20 0 0 0 0 S 0.3 0.0 0:04.30 [kworker/7:0]
10652 root 20 0 0 0 0 S 0.3 0.0 0:00.23 [kworker/u64:2]
14724 root 20 0 1076344 17400 3200 S 0.3 0.1 10:04.13 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/run/gluster/glustershd/glustershd.pid -+ 22011 root 20 0 0 0 0 S 0.3 0.0 0:05.04 [kworker/10:1]
Not sure why the system load dropped other than I was trying to take a picture of it :)
In any case, it appears that at this time, I have plenty of swap, ram, and network capacity, and yet things are still running very sluggish; I'm still getting e-mails from servers complaining about loss of communication with something or another; I still get e-mails from the engine about bad engine status, then recovery, etc.
1G isn't good enough for Gluster. It doesn't help that you have SSDs, because the network is certainly your bottleneck even for regular performance, not to mention when you are healing. Jumbo frames would give you an additional 5% or so - nothing to write home about.
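(If you do try jumbo frames, the MTU has to match end to end - hosts and switch - or things break in confusing ways. The interface name below is just an example; since these are oVirt-managed networks, the cleaner route is to set the MTU on the logical network in the engine so VDSM applies it.)

  ip link set dev em2 mtu 9000
  # persist on EL7 by adding MTU=9000 to the matching ifcfg file
  echo 'MTU=9000' >> /etc/sysconfig/network-scripts/ifcfg-em2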
I've shut down 2/3 of my VMs, too....just trying to keep the critical ones operating.
At this point, I don't believe the problem is the memory leak, but it seems to be triggered by the memory leak, as in all my problems started when I got low ram warnings from one of my 3 nodes and began recovery efforts from that.
I do really like the idea / concept behind glusterfs, but I really have to figure out why it's performed so poorly from day one and why it's caused 95% of my outages (including several large ones lately). If I can get it stable, reliable, and well performing, then I'd love to keep it. If I can't, then perhaps NFS is the way to go? I don't like the single-point-of-failure aspect of it, but the other NAS boxes I run for clients (central storage for Windows boxes) have been very solid; if I could get that kind of reliability for my ovirt stack, it would be a substantial improvement. Currently, it seems about every other month I have a gluster-induced outage.
Sometimes I wonder if hyperconverged itself is the issue, but my infrastructure doesn't justify three servers at the same location... I might be able to do two, but even that seems like it's pushing it.
We have many happy users running Gluster and hyperconverged. We need to understand where the failure is in your setup.
Looks like I can upgrade to 10G for about $900. I can order a dual-Xeon supermicro 12-disk server, loaded with 2TB WD Enterprise disks and a pair of SSDs for the os, 32GB ram, 2.67Ghz CPUs for about $720 delivered. I've got to do something to improve my reliability; I can't keep going the way I have been....
Agreed. Thanks for continuing to look into this; we'll probably need some Gluster logs to understand what's going on. Y.
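(For gathering that: the client and brick logs live under /var/log/glusterfs/ on each host, and a per-brick latency profile - volume name assumed here - can show whether one brick is the slow one:)

  ls /var/log/glusterfs/ /var/log/glusterfs/bricks/
  gluster volume profile data-hdd start
  # let it run a few minutes under normal load, then:
  gluster volume profile data-hdd info
  gluster volume profile data-hdd stop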
--Jim
On Fri, Jul 6, 2018 at 9:13 PM, Johan Bernhardsson <johan@kafit.se> wrote:
Load like that is mostly I/O-based: either the machine is swapping or the network is too slow. Check I/O wait in top.
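(A couple of standard ways to watch that, assuming the sysstat package is installed for iostat:)

  iostat -xm 5     # per-device utilization and await times
  vmstat 5         # the 'wa' column is I/O wait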
And the problem where the OOM killer kills off gluster - does that mean you don't monitor RAM usage on the servers? Either something is eating all your RAM, swap gets really I/O intensive, and the process is then killed off, or you have the wrong swap settings in sysctl.conf. (There are tons of broken guides that recommend a swappiness of 0, but that disables swap on newer kernels. The proper swappiness for swapping only when necessary is 1, or a sufficiently low number like 10; the default is 60.)
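(For the record, checking and setting it looks like this on EL7:)

  sysctl vm.swappiness            # show the current value
  sysctl -w vm.swappiness=10      # set it at runtime
  echo 'vm.swappiness = 10' > /etc/sysctl.d/99-swappiness.conf   # persist across reboots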
Moving to NFS will not improve things. You will get more memory back since gluster isn't running, and that is good, but you will have a single node that can fail with all your storage on it, it would still be on 1 gigabit only, and your three-node cluster would easily saturate that link.
On July 7, 2018 04:13:13 Jim Kusznir <jim@palousetech.com> wrote:
So far it does not appear to be helping much. I'm still getting VMs locking up and all kinds of notices from the ovirt engine about non-responsive hosts. I'm still seeing load averages in the 20-30 range.
Jim
On Fri, Jul 6, 2018, 3:13 PM Jim Kusznir <jim@palousetech.com> wrote:
Thank you for the advice and help
I do plan on going 10Gbps networking; haven't quite jumped off that cliff yet, though.
I did put my data-hdd (main VM storage volume) onto a dedicated 1Gbps network, and I've watched throughput on that and never seen more than 60MB/s achieved (as reported by bwm-ng). I have a separate 1Gbps network for communication and ovirt migration, but I wanted to break that up further (separate out VM traffic from migration/mgmt traffic). My three SSD-backed gluster volumes run on the main network too, as I haven't been able to get them to move to the new network (which I was trying to use as an all-gluster network). I tried bonding, but that seemed to reduce performance rather than improve it.
--Jim
On Fri, Jul 6, 2018 at 2:52 PM, Jamie Lawrence < jlawrence@squaretrade.com> wrote:
Hi Jim,
I don't have any targeted suggestions, because there isn't much to latch on to. I can say Gluster replica three (no arbiters) on dedicated servers serving a couple Ovirt VM clusters here have not had these sorts of issues.
I suspect your long heal times (and the resultant long periods of high load) are at least partly related to 1G networking. That is just a matter of IO - heals of VMs involve moving a lot of bits. My cluster uses 10G bonded NICs on the gluster and ovirt boxes for storage traffic and separate bonded 1G for ovirtmgmt and communication with other machines/people, and we're occasionally hitting the bandwidth ceiling on the storage network. I'm starting to think about 40/100G, different ways of splitting up intensive systems, and considering iSCSI for specific volumes, although I really don't want to go there.
I don't run FreeNAS[1], but I do run FreeBSD as storage servers for their excellent ZFS implementation, mostly for backups. ZFS will make your `heal` problem go away, but not your bandwidth problems, which become worse (because of fewer NICS pushing traffic). 10G hardware is not exactly in the impulse-buy territory, but if you can, I'd recommend doing some testing using it. I think at least some of your problems are related.
If that's not possible, my next stops would be optimizing everything I could about sharding, healing and optimizing for serving the shard size to squeeze as much performance out of 1G as I could, but that will only go so far.
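(Purely as an illustration of the kind of knobs meant here - the option names are real gluster options, but the values are examples, so check current settings and test before changing anything on a production volume:)

  gluster volume get data-hdd all | egrep 'shard|heal'
  # larger shards mean fewer files to track per heal
  gluster volume set data-hdd features.shard-block-size 64MB
  # more parallelism for the self-heal daemon
  gluster volume set data-hdd cluster.shd-max-threads 4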
-j
[1] FreeNAS is just a storage-tuned FreeBSD with a GUI.

Thank you for your help.

After more troubleshooting and host reboots, I accidentally discovered that the backing disk on ovirt2 (host) had suffered a failure. On reboot, the RAID card refused to see it at all. It said it had cache waiting to be written to disk, and in the end, since it couldn't (wouldn't) see that disk, I had no choice but to discard that cache and boot up without the physical disk. Since doing so (and running a gluster volume remove for the affected host), things are running like normal, although it appears the failure corrupted two disk images (I've now lost 5 VMs to gluster-induced disk failures during poorly handled failures).

I don't understand why the one bad disk wasn't simply failed, or why, if one underlying process was having such a problem, the other hosts didn't take it offline and continue (much like RAID would have done). Instead, everything was broken (including gluster volumes on unaffected disks that are fully functional across all hosts), the affected machine performed very poorly, AND there were no diagnostic reports that would allude to a failing hard drive. Is this expected behavior?

--Jim
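(In case it helps anyone hitting the same thing, these are the checks I would have wanted at that point - the volume name and log path are assumptions for my setup:)

  gluster volume status data-hdd          # does gluster still list the brick as online?
  gluster volume heal data-hdd info       # are the pending heals all pointed at one host?
  less /var/log/glusterfs/bricks/gluster-brick3-data-hdd.log   # brick log on the suspect host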

In some cases Linux does not reject a broken SATA drive; it just gets horribly slow. In my experience it depends on how the drive fails. It might have shown signs in SMART, and it might have shown some signs in syslog as write errors and drive queue errors.

For gluster to notice that the drive is gone, the drive needs to be rejected and marked as failed in Linux; then gluster would have reported it as dead. This is one reason it's good practice with gluster to run a brick on a RAID volume instead of on a single drive.

/Johan
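(Concretely, the kind of signs I mean - the device name is just an example, and smartmontools is assumed to be installed:)

  smartctl -H /dev/sdb                                    # overall SMART health
  smartctl -A /dev/sdb | egrep -i 'realloc|pending|uncorrect'
  # kernel-side ATA/SCSI errors that often show up before a drive goes "slow"
  dmesg | egrep -i 'ata[0-9]|i/o error|blk_update'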

One more thing. Gluster is writing everything in sync, so gluster will queue until the host with the broken drive has acknowledged the write. This creates I/O wait and high load.
/Johan
On July 10, 2018 04:21:33 Jim Kusznir <jim@palousetech.com> wrote:
Thank you for your help.
After more troubleshooting and host reboots, I accidentally discovered that the backing disk on ovirt2 (host) had suffered a failure. On reboot, the raid card refused to see it at all. It said it had cache waiting to be written to disk, and in the end, as it couldn't (wouldn't) see that disk, I had no choice but to discard that cache and boot up without the physical disk. Since doing so (and running a gluster volume remove for the affected host), things are running normally, although it appears two more VM disk images were corrupted (I've now lost 5 VMs to gluster-induced disk failures during poorly handled failures).
I don't understand why the one bad disk wasn't simply failed, or, if one underlying process was having such a problem, why the other hosts didn't take it offline and continue (much like RAID would have done). Instead, everything broke (including gluster volumes on unaffected disks that are fully functional across all hosts), the affected machine performed very poorly, and there were no diagnostic reports that would point to a failing hard drive. Is this expected behavior?
--Jim
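A brick disk in that state usually does show symptoms before it takes the volume down; the problem is that nothing in the oVirt UI surfaces them by default. A minimal sketch of where to look, and of roughly what the brick removal mentioned above looks like (the device name, megaraid index, and brick path are assumptions for illustration only):

    # are all brick processes up and responding?
    gluster volume status data-hdd

    # per-brick latency stats: a dying disk usually shows far higher
    # latencies on one brick than on its replicas
    gluster volume profile data-hdd start
    gluster volume profile data-hdd info
    gluster volume profile data-hdd stop

    # SMART data for a disk behind a PERC/megaraid controller
    # (needs smartmontools; adjust the device and index for your layout)
    smartctl -a -d megaraid,0 /dev/sda

    # roughly what dropping the dead host's brick from the replica-3 volume
    # looks like; this reduces data-hdd to replica 2 until the brick returns
    gluster volume remove-brick data-hdd replica 2 \
        ovirt2.nwfiber.com:/gluster/brick3/data-hdd force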
On Sun, Jul 8, 2018 at 3:54 AM, Yaniv Kaul <ykaul@redhat.com> wrote:
On Sat, Jul 7, 2018 at 8:45 AM, Jim Kusznir <jim@palousetech.com> wrote:
So, I'm still at a loss... It sounds like it's either insufficient RAM/swap or insufficient network, yet it seems to be neither now. At this point, it appears that gluster is just "broken" and killing my systems for no discernible reason. Here are the details, all from the same system (currently running 3 VMs):
[root@ovirt3 ~]# w
 22:26:53 up 36 days, 4:34, 1 user, load average: 42.78, 55.98, 53.31
USER     TTY      FROM             LOGIN@   IDLE   JCPU   PCPU WHAT
root     pts/0    192.168.8.90     22:26    2.00s  0.12s  0.11s w
bwm-ng reports the highest data usage was about 6MB/s during this test (and that was combined; I have two different gig networks. One gluster network (primary VM storage) runs on one, the other network handles everything else).
[root@ovirt3 ~]# free -m
              total        used        free      shared  buff/cache   available
Mem:          31996       13236         232          18       18526       18195
Swap:         16383        1475       14908
top - 22:32:56 up 36 days, 4:41, 1 user, load average: 17.99, 39.69, 47.66
That is indeed a high load average. How many CPUs do you have, btw?
Tasks: 407 total, 1 running, 405 sleeping, 1 stopped, 0 zombie
%Cpu(s): 8.6 us, 2.1 sy, 0.0 ni, 87.6 id, 1.6 wa, 0.0 hi, 0.1 si, 0.0 st
KiB Mem : 32764284 total, 228296 free, 13541952 used, 18994036 buff/cache
KiB Swap: 16777212 total, 15246200 free, 1531012 used. 18643960 avail Mem
Can you check what's swapping here? (a tweak to top output will show that)
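For the record, two generic ways to see per-process swap use; neither is specific to this cluster, and the PID used in the last command is just the glusterfsd PID from the listing that follows:

    # 1) interactively in top: press 'f', enable the SWAP column, 'q' to return;
    #    'M' sorts the display by memory use

    # 2) from /proc, list the processes using the most swap (values in kB)
    grep VmSwap /proc/[0-9]*/status 2>/dev/null | sort -k2 -n | tail

    # map an interesting PID back to its command and memory footprint
    ps -p 2694 -o pid,comm,rss,vsz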
  PID USER  PR  NI    VIRT    RES   SHR S  %CPU %MEM     TIME+ COMMAND
30036 qemu  20   0 6872324   5.2g 13532 S 144.6 16.5 216:14.55 /usr/libexec/qemu-kvm -name guest=BillingWin,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/v+
28501 qemu  20   0 5034968   3.6g 12880 S  16.2 11.7  73:44.99 /usr/libexec/qemu-kvm -name guest=FusionPBX,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/va+
 2694 root  20   0 2169224  12164  3108 S   5.0  0.0   3290:42 /usr/sbin/glusterfsd -s ovirt3.nwfiber.com --volfile-id data.ovirt3.nwfiber.com.gluster-brick2-data -p /var/run/+
This one's certainly taking quite a bit of your CPU usage overall.
14293 root 15 -5 944700 13356 4436 S 4.0 0.0 16:32.15 /usr/sbin/glusterfs --volfile-server=192.168.8.11 --volfile-server=192.168.8.12 --volfile-server=192.168.8.13 --+
I'm not sure what the sorting order is, but doesn't look like Gluster is taking a lot of memory?
25100 vdsm   0 -20 6747440 107868 12836 S   2.3  0.3  21:35.20 /usr/bin/python2 /usr/share/vdsm/vdsmd
28971 qemu  20   0 2842592   1.5g 13548 S   1.7  4.7 241:46.49 /usr/libexec/qemu-kvm -name guest=unifi.palousetech.com,debug-threads=on -S -object secret,id=masterKey0,format=+
12095 root  20   0  162276   2836  1868 R   1.3  0.0   0:00.25 top
 2708 root  20   0 1906040  12404  3080 S   1.0  0.0   1083:33 /usr/sbin/glusterfsd -s ovirt3.nwfiber.com --volfile-id engine.ovirt3.nwfiber.com.gluster-brick1-engine -p /var/+
28623 qemu  20   0 4749536   1.7g 12896 S   0.7  5.5   4:30.64 /usr/libexec/qemu-kvm -name guest=billing.nwfiber.com,debug-threads=on -S -object secret,id=masterKey0,format=ra+
The VMs I see here and above together account for most? (5.2+3.6+1.5+1.7 = 12GB) - still plenty of memory left.
   10 root     20   0       0      0     0 S   0.3  0.0 215:54.72 [rcu_sched]
 1030 sanlock  rt   0  773804  27908  2744 S   0.3  0.1  35:55.61 /usr/sbin/sanlock daemon
 1890 zabbix   20   0   83904   1696  1612 S   0.3  0.0  24:30.63 /usr/sbin/zabbix_agentd: collector [idle 1 sec]
 2722 root     20   0 1298004   6148  2580 S   0.3  0.0  38:10.82 /usr/sbin/glusterfsd -s ovirt3.nwfiber.com --volfile-id iso.ovirt3.nwfiber.com.gluster-brick4-iso -p /var/run/gl+
 6340 root     20   0       0      0     0 S   0.3  0.0   0:04.30 [kworker/7:0]
10652 root     20   0       0      0     0 S   0.3  0.0   0:00.23 [kworker/u64:2]
14724 root     20   0 1076344  17400  3200 S   0.3  0.1  10:04.13 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/run/gluster/glustershd/glustershd.pid -+
22011 root     20   0       0      0     0 S   0.3  0.0   0:05.04 [kworker/10:1]
Not sure why the system load dropped other than I was trying to take a picture of it :)
In any case, it appears that at this time I have plenty of swap, RAM, and network capacity, and yet things are still running very sluggishly; I'm still getting e-mails from servers complaining about loss of communication with something or another; I still get e-mails from the engine about bad engine status, then recovery, etc.
1G isn't good enough for Gluster. It doesn't help that you have SSDs, because the network is certainly your bottleneck even for regular performance, not to mention when you are healing. Jumbo frames would give you an additional 5% or so - nothing to write home about.
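For reference, turning on jumbo frames is a small change; the interface name below is an assumption, every host and the switch ports between them must carry the same MTU, and on oVirt hosts the MTU is normally set on the logical network in the engine rather than by hand:

    # temporary, on each host carrying gluster traffic
    ip link set dev eth1 mtu 9000

    # verify the path really passes 9000-byte frames without fragmenting
    # (8972 bytes of payload + 28 bytes of headers = 9000)
    ping -M do -s 8972 192.168.8.12

    # persistent on EL7, if the NIC is not managed by oVirt/VDSM:
    # add  MTU=9000  to /etc/sysconfig/network-scripts/ifcfg-eth1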
I've shut down 2/3 of my VMs, too....just trying to keep the critical ones operating.
At this point, I don't believe the problem is the memory leak itself, but it seems to have been triggered by it: all my problems started when I got low RAM warnings from one of my 3 nodes and began recovery efforts from that.
I do really like the idea / concept behind glusterfs, but I really have to figure out why it's performed so poorly from day one and why it's caused 95% of my outages (including several large ones lately). If I can get it stable, reliable, and well performing, then I'd love to keep it. If I can't, then perhaps NFS is the way to go? I don't like the single point of failure aspect of it, but the other NAS boxes I run for clients (central storage for Windows boxes) have been very solid; if I could get that kind of reliability for my ovirt stack, it would be a substantial improvement. Currently, it seems about every other month I have a gluster-induced outage.
Sometimes I wonder if hyperconverged itself is the issue, but my infrastructure doesn't justify three servers at the same location... I might be able to do two, but even that seems like it's pushing it.
We have many happy users running Gluster and hyperconverged setups. We need to understand where the failure is in your setup.
Looks like I can upgrade to 10G for about $900. I can order a dual-Xeon Supermicro 12-disk server, loaded with 2TB WD enterprise disks and a pair of SSDs for the OS, 32GB RAM, 2.67GHz CPUs, for about $720 delivered. I've got to do something to improve my reliability; I can't keep going the way I have been....
Agreed. Thanks for continuing to look into this; we'll probably need some Gluster logs to understand what's going on. Y.
--Jim
On Fri, Jul 6, 2018 at 9:13 PM, Johan Bernhardsson <johan@kafit.se> wrote:
Load like that is mostly I/O based: either the machine is swapping or the network is too slow. Check I/O wait in top.
And the problem where you get the OOM killer killing off gluster: that means you don't monitor RAM usage on the servers? Either something is eating all your RAM, swap gets really I/O intensive, and the process is then killed off, or you have the wrong swap settings in sysctl.conf. (There are tons of broken guides that recommend setting swappiness to 0, but that disables swap on newer kernels. The proper swappiness for only swapping when necessary is 1, or a sufficiently low number like 10; the default is 60.)
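A minimal sketch of checking and fixing that on an EL7 host; the value 10 follows the suggestion above, and the sysctl.d file name is arbitrary:

    # current value (the CentOS/RHEL default is 60)
    sysctl vm.swappiness

    # change it immediately
    sysctl -w vm.swappiness=10

    # make it survive reboots
    echo 'vm.swappiness = 10' > /etc/sysctl.d/99-swappiness.conf
    sysctl -p /etc/sysctl.d/99-swappiness.conf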
Moving to NFS will not improve things. You will get more memory back since gluster isn't running, and that is good. But you will have a single node that can fail with all your storage, it would still be on 1 gigabit only, and your three-node cluster would easily saturate that link.
On July 7, 2018 04:13:13 Jim Kusznir <jim@palousetech.com> wrote:
So far it does not appear to be helping much. I'm still getting VMs locking up and all kinds of notices from the ovirt engine about non-responsive hosts. I'm still seeing load averages in the 20-30 range.
Jim
On Fri, Jul 6, 2018, 3:13 PM Jim Kusznir <jim@palousetech.com> wrote:
Thank you for the advice and help.
I do plan on going 10Gbps networking; haven't quite jumped off that cliff yet, though.

Thank you for your help.
After more troubleshooting and host reboots, I accidentally discovered that the backing disk on ovirt2 (host) had suffered a failure. On reboot, the raid card refused to see it at all. It said it had cache waiting to be written to disk, and in the end, as it couldn't (wouldn't) see that disk, I had no choice but to discard that cache and boot up without the physical disk. Since doing so (and running a gluster volume remove for the affected host), things are running normally.
I don't understand why the one bad disk wasn't simply failed, or, if one underlying process was having such a problem, why the other hosts didn't take it offline and continue (much like RAID would have done). Instead, everything broke (including gluster volumes on unaffected disks that are fully functional across all hosts).
I'm seeing the need to go multi-spindle for each storage volume, and I don't want to do that with the ovirt hosts due to hardware concerns/issues (I have to use the PERC6i, which I am also learning to distrust), and I would have to use 2.5in disks (I want to use 3.5"). As such, I will be going to a dedicated storage server with 12 spindles in a RAID6 configuration. I'm debating whether it's worth setting it up as a gluster replica 1 system (so I can easily migrate later), or just building it as NFS with FreeNAS. I'm leaning toward the latter, as it seems pointless to run gluster on a single node.
--Jim
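If it helps the "replica 1 now, migrate later" side of that debate: a single-brick Gluster volume is created as a plain distribute volume and can later be converted to a replicated one by adding bricks, so the choice is not one-way. A rough sketch with made-up hostnames and brick paths:

    # single-brick volume on the new storage box
    gluster volume create data-new storage1.nwfiber.com:/bricks/data/brick
    gluster volume start data-new

    # later, with more nodes available, convert it to replica 3 in place
    gluster peer probe storage2.nwfiber.com
    gluster peer probe storage3.nwfiber.com
    gluster volume add-brick data-new replica 3 \
        storage2.nwfiber.com:/bricks/data/brick \
        storage3.nwfiber.com:/bricks/data/brick

    # existing data then heals onto the new bricks
    gluster volume heal data-new info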
participants (11)
- Alex K
- Benjamin Selinger
- Darrell Budic
- Doug Ingham
- Edward Clay
- Greg Sheremeta
- Jamie Lawrence
- Jim Kusznir
- Johan Bernhardsson
- Sahina Bose
- Yaniv Kaul