[ovirt-users] Re: Ovirt cluster unstable; gluster to blame (again)

9 Jul 2018

      Thank you for your help.

After more troubleshooting and host reboots, I accidentally discovered that
the backing disk on ovirt2 (host) had suffered a failure.  On reboot, the
RAID card refused to see it at all.  It said it had cache waiting to be
written to disk, and in the end, as it couldn't (wouldn't) see that disk, I
had no choice but to discard that cache and boot up without the physical
disk.  Since doing so (and running a gluster volume remove for the affected
host), things are running normally again, although the failure appears to
have corrupted two disks (I've now lost 5 VMs to gluster-induced disk
failures during poorly handled failures).

I don't understand why the one bad disk wasn't simply failed, or why, if one
underlying process was having such a problem, the other hosts didn't take
it offline and continue (much like RAID would have done).  Instead,
everything broke (including gluster volumes on unaffected disks that
are fully functional across all hosts), the affected machine performed very
poorly, AND there were no diagnostic reports that would point to a failing
hard drive.  Is this expected behavior?

--Jim

On Sun, Jul 8, 2018 at 3:54 AM, Yaniv Kaul <ykaul@redhat.com> wrote:
...
On Sat, Jul 7, 2018 at 8:45 AM, Jim Kusznir <jim@palousetech.com> wrote:
...
So, I'm still at a loss... It sounds like it's either insufficient
RAM/swap, or insufficient network.  It seems to be neither now.  At this
point, it appears that gluster is just "broke" and killing my systems for
no discernible reason.  Here are details, all from the same system (currently
running 3 VMs):
[root@ovirt3 ~]# w
 22:26:53 up 36 days,  4:34,  1 user,  load average: 42.78, 55.98, 53.31
USER     TTY      FROM             LOGIN@   IDLE   JCPU   PCPU WHAT
root     pts/0    192.168.8.90     22:26    2.00s  0.12s  0.11s w
bwm-ng reports the highest data usage was about 6MB/s during this test
(and that was combined; I have two different gig networks.  One gluster
network (primary VM storage) runs on one, the other network handles
everything else).
[root@ovirt3 ~]# free -m
              total        used        free      shared  buff/cache   available
Mem:          31996       13236         232          18       18526       18195
Swap:         16383        1475       14908
top - 22:32:56 up 36 days,  4:41,  1 user,  load average: 17.99, 39.69, 47.66
That is indeed a high load average. How many CPUs do you have, btw?
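
A bare load average is hard to judge without the core count, which is why the question above matters. The following is a minimal Python sketch, assuming a Linux host; `load_per_cpu` is a hypothetical helper name, not something from this thread, and it only divides the numbers that `w` and `top` already print by the number of logical CPUs:

```python
import os


def load_per_cpu(load_avg: float, cpu_count: int) -> float:
    """Return the load average divided by the number of logical CPUs."""
    if cpu_count <= 0:
        raise ValueError("cpu_count must be positive")
    return load_avg / cpu_count


if __name__ == "__main__":
    # os.getloadavg() returns the same 1/5/15 minute figures shown by `w` and `top`.
    one_min, five_min, fifteen_min = os.getloadavg()
    cpus = os.cpu_count() or 1  # cpu_count() can return None
    for label, value in (("1m", one_min), ("5m", five_min), ("15m", fifteen_min)):
        print(f"{label}: {value:.2f} total, {load_per_cpu(value, cpus):.2f} per CPU")
```

On Linux the load average also counts tasks blocked in uninterruptible I/O, so a per-CPU value well above 1.0 alongside a mostly idle CPU usually points at storage or network waits rather than compute.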
...
Tasks: 407 total,   1 running, 405 sleeping,   1 stopped,   0 zombie
%Cpu(s):  8.6 us,  2.1 sy,  0.0 ni, 87.6 id,  1.6 wa,  0.0 hi,  0.1 si,  0.0 st
KiB Mem : 32764284 total,   228296 free, 13541952 used, 18994036 buff/cache
KiB Swap: 16777212 total, 15246200 free,  1531012 used. 18643960 avail Mem
Can you check what's swapping here? (a tweak to top output will show that)
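
As an alternative to tweaking the top display, per-process swap usage can be read from the `VmSwap` field of `/proc/<pid>/status` on Linux. This is only a sketch under that assumption; the function names (`parse_vmswap_kb`, `parse_name`, `top_swappers`) are made up for illustration:

```python
import glob
import re


def parse_vmswap_kb(status_text: str) -> int:
    """Extract the VmSwap value (in kB) from the text of /proc/<pid>/status."""
    match = re.search(r"^VmSwap:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    # Kernel threads have no VmSwap line at all, so treat a miss as zero.
    return int(match.group(1)) if match else 0


def parse_name(status_text: str) -> str:
    """Extract the process name from the text of /proc/<pid>/status."""
    match = re.search(r"^Name:\s+(.*)$", status_text, re.MULTILINE)
    return match.group(1) if match else "?"


def top_swappers(limit: int = 10):
    """Return up to `limit` (swap_kb, pid, name) tuples, largest swap user first."""
    usage = []
    for path in glob.glob("/proc/[0-9]*/status"):
        try:
            with open(path) as handle:
                text = handle.read()
        except OSError:
            continue  # the process exited while we were scanning
        swap_kb = parse_vmswap_kb(text)
        if swap_kb > 0:
            pid = int(path.split("/")[2])
            usage.append((swap_kb, pid, parse_name(text)))
    return sorted(usage, reverse=True)[:limit]


if __name__ == "__main__":
    for swap_kb, pid, name in top_swappers():
        print(f"{swap_kb:>10} kB  pid {pid:<7} {name}")
```

Run as root so that the status files of other users' processes (qemu, vdsm) are readable.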
...
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
30036 qemu      20   0 6872324   5.2g  13532 S 144.6 16.5 216:14.55
/usr/libexec/qemu-kvm -name guest=BillingWin,debug-threads=on -S -object
secret,id=masterKey0,format=raw,file=/v+
28501 qemu      20   0 5034968   3.6g  12880 S  16.2 11.7  73:44.99
/usr/libexec/qemu-kvm -name guest=FusionPBX,debug-threads=on -S -object
secret,id=masterKey0,format=raw,file=/va+
 2694 root      20   0 2169224  12164   3108 S   5.0  0.0   3290:42
/usr/sbin/glusterfsd -s ovirt3.nwfiber.com --volfile-id
data.ovirt3.nwfiber.com.gluster-brick2-data -p /var/run/+
This one's certainly taking quite a bit of your CPU usage overall.
...
14293 root      15  -5  944700  13356   4436 S   4.0  0.0  16:32.15
/usr/sbin/glusterfs --volfile-server=192.168.8.11
--volfile-server=192.168.8.12 --volfile-server=192.168.8.13 --+
I'm not sure what the sorting order is, but doesn't look like Gluster is
taking a lot of memory?
...
25100 vdsm       0 -20 6747440 107868  12836 S   2.3  0.3  21:35.20
/usr/bin/python2 /usr/share/vdsm/vdsmd
28971 qemu      20   0 2842592   1.5g  13548 S   1.7  4.7 241:46.49
/usr/libexec/qemu-kvm -name guest=unifi.palousetech.com,debug-threads=on
-S -object secret,id=masterKey0,format=+
12095 root      20   0  162276   2836   1868 R   1.3  0.0   0:00.25 top
2708 root      20   0 1906040  12404   3080 S   1.0  0.0   1083:33
/usr/sbin/glusterfsd -s ovirt3.nwfiber.com --volfile-id
engine.ovirt3.nwfiber.com.gluster-brick1-engine -p /var/+
28623 qemu      20   0 4749536   1.7g  12896 S   0.7  5.5   4:30.64
/usr/libexec/qemu-kvm -name guest=billing.nwfiber.com,debug-threads=on
-S -object secret,id=masterKey0,format=ra+
The VMs I see here and above together account for most? (5.2+3.6+1.5+1.7 =
12GB) - still plenty of memory left.
...
10 root      20   0       0      0      0 S   0.3  0.0 215:54.72
[rcu_sched]
1030 sanlock   rt   0  773804  27908   2744 S   0.3  0.1  35:55.61
/usr/sbin/sanlock daemon
1890 zabbix    20   0   83904   1696   1612 S   0.3  0.0  24:30.63
/usr/sbin/zabbix_agentd: collector [idle 1 sec]
2722 root      20   0 1298004   6148   2580 S   0.3  0.0  38:10.82
/usr/sbin/glusterfsd -s ovirt3.nwfiber.com --volfile-id
iso.ovirt3.nwfiber.com.gluster-brick4-iso -p /var/run/gl+
 6340 root      20   0       0      0      0 S   0.3  0.0   0:04.30
[kworker/7:0]
10652 root      20   0       0      0      0 S   0.3  0.0   0:00.23
[kworker/u64:2]
14724 root      20   0 1076344  17400   3200 S   0.3  0.1  10:04.13
/usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p
/var/run/gluster/glustershd/glustershd.pid -+
22011 root      20   0       0      0      0 S   0.3  0.0   0:05.04
[kworker/10:1]
Not sure why the system load dropped other than I was trying to take a
picture of it :)
In any case, it appears that at this time, I have plenty of swap, ram,
and network capacity, and yet things are still running very sluggish; I'm
still getting e-mails from servers complaining about loss of communication
with something or another; I still get e-mails from the engine about bad
engine status, then recovery, etc.
1G isn't good enough for Gluster. It doesn't help that you have SSDs,
because the network is certainly your bottleneck even for regular performance,
not to mention when you are healing. Jumbo frames would give you an additional
5% or so - nothing to write home about.
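
To put the bandwidth point in numbers, here is a rough back-of-the-envelope sketch in Python. It uses raw line rate only and ignores protocol overhead and replication traffic, so real figures will be worse; the 100 GB size is an illustrative assumption, not a number from this thread, and both function names are hypothetical:

```python
def link_mb_per_s(gbps: float) -> float:
    """Raw line rate in megabytes per second (1 Gbit/s = 125 MB/s)."""
    return gbps * 1000 / 8


def transfer_minutes(size_gb: float, gbps: float) -> float:
    """Best-case minutes to move size_gb gigabytes over a link of the given speed."""
    return size_gb * 1000 / link_mb_per_s(gbps) / 60


if __name__ == "__main__":
    for speed in (1, 10):
        print(f"{speed:>2} Gbit/s: {link_mb_per_s(speed):.0f} MB/s, "
              f"100 GB in about {transfer_minutes(100, speed):.1f} min")
```

Even in this best case a single 1G link tops out at 125 MB/s, which is less than one SSD can typically deliver locally.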
...
I've shut down 2/3 of my VMs, too....just trying to keep the critical
ones operating.
At this point, I don't believe the problem is the memory leak, but it
seems to be triggered by the memory leak, as in all my problems started
when I got low ram warnings from one of my 3 nodes and began recovery
efforts from that.
I do really like the idea / concept behind glusterfs, but I really have
to figure out why it's been so poor performing from day one, and it's caused
95% of my outages (including several large ones lately).  If I can get it
stable, reliable, and well performing, then I'd love to keep it.  If I
can't, then perhaps NFS is the way to go?  I don't like the single point of
failure aspect of it, but my other NAS boxes I run for clients (central
storage for Windows boxes) have been very solid; if I could get that kind
of reliability for my ovirt stack, it would be a substantial improvement.
Currently, it seems about every other month I have a gluster-induced outage.
Sometimes I wonder if hyperconverged itself is the issue, but my
infrastructure doesn't justify three servers at the same location...I might
be able to do two, but even that seems like it's pushing it.
We have many happy users running Gluster and hyperconverged. We need to
understand where's the failure in your setup.
...
Looks like I can upgrade to 10G for about $900.  I can order a dual-Xeon
Supermicro 12-disk server, loaded with 2TB WD Enterprise disks and a pair
of SSDs for the OS, 32GB RAM, 2.67GHz CPUs for about $720 delivered.  I've
got to do something to improve my reliability; I can't keep going the way I
have been....
Agreed. Thanks for continuing looking into this, we'll probably need some
Gluster logs to understand what's going on.
Y.
...
--Jim
On Fri, Jul 6, 2018 at 9:13 PM, Johan Bernhardsson <johan@kafit.se>
wrote:
...
Load like that is mostly I/O based: either the machine is swapping or the
network is too slow. Check I/O wait in top.
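
The I/O wait figure that top shows as `wa` comes from the aggregate `cpu` line of `/proc/stat`. A minimal Python sketch, assuming a Linux host, that samples it twice and reports the share of time spent in iowait (the helper names are illustrative only):

```python
import time
from typing import List


def read_cpu_times(stat_text: str) -> List[int]:
    """Return the counters from the aggregate 'cpu' line of /proc/stat."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            return [int(value) for value in fields[1:]]
    raise ValueError("no aggregate cpu line found")


def iowait_percent(before: List[int], after: List[int]) -> float:
    """Share of elapsed CPU time spent in iowait between two samples."""
    deltas = [b - a for a, b in zip(before, after)]
    # Sum only user..steal (first 8 fields); guest time is already part of user/nice.
    total = sum(deltas[:8])
    return 100.0 * deltas[4] / total if total else 0.0  # index 4 is iowait


def sample() -> List[int]:
    with open("/proc/stat") as handle:
        return read_cpu_times(handle.read())


if __name__ == "__main__":
    first = sample()
    time.sleep(1)
    print(f"iowait over the last second: {iowait_percent(first, sample()):.1f}%")
```

Because the value is averaged over all CPUs, a few percent on a many-core host can still mean several cores are fully stalled on I/O.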
And the problem where the OOM killer kills off gluster: does that mean
that you don't monitor RAM usage on the servers? Either it's eating all
your RAM, swap gets really I/O intensive, and then it is killed off, or you
have the wrong swap settings in sysctl.conf (there are tons of broken
guides that recommend setting swappiness to 0, but that disables swap on
newer kernels. The proper swappiness for only swapping when necessary is 1
or a sufficiently low number like 10; the default is 60).
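
The swappiness advice above can be checked against the running kernel by reading `/proc/sys/vm/swappiness`. The sketch below only reports the value and restates the guidance from this message; the thresholds mirror the numbers quoted above, and `describe_swappiness` is a hypothetical name:

```python
from pathlib import Path

SWAPPINESS_PATH = Path("/proc/sys/vm/swappiness")


def describe_swappiness(value: int) -> str:
    """Map a vm.swappiness value onto the advice given in the message above."""
    if value == 0:
        return "0: the setting the message above warns against"
    if value <= 10:
        return f"{value}: low, swap only under memory pressure"
    return f"{value}: more eager swapping (60 is the usual kernel default)"


if __name__ == "__main__":
    if SWAPPINESS_PATH.exists():
        current = int(SWAPPINESS_PATH.read_text().strip())
        print("vm.swappiness =", describe_swappiness(current))
    else:
        print("no swappiness knob found (not a Linux host?)")
```

To change the value persistently, the usual route is a `vm.swappiness = 10` line in sysctl.conf, as the message suggests.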
Moving to nfs will not improve things. You will get more memory since
gluster isn't running and that is good. But you will have a single node
that can fail with all your storage and it would still be on 1 gigabit only
and your three node cluster would easily saturate that link.
On July 7, 2018 04:13:13 Jim Kusznir <jim@palousetech.com> wrote:
...
So far it does not appear to be helping much. I'm still getting VMs
locking up and all kinds of notices from the oVirt engine about non-responsive
hosts.  I'm still seeing load averages in the 20-30 range.
Jim
On Fri, Jul 6, 2018, 3:13 PM Jim Kusznir <jim@palousetech.com> wrote:
...
Thank you for the advice and help
I do plan on going 10Gbps networking; haven't quite jumped off that
cliff yet, though.
I did put my data-hdd (main VM storage volume) onto a dedicated 1Gbps
network, and I've watched throughput on that and never seen more than
60MB/s achieved (as reported by bwm-ng).  I have a separate 1Gbps network
for communication and ovirt migration, but I wanted to break that up
further (separate out VM traffic from migration/mgmt traffic).  My three
SSD-backed gluster volumes run on the main network too, as I haven't been
able to get them to move to the new network (which I was trying to use for
all gluster traffic).  I tried bonding, but that seemed to reduce performance
rather than improve it.
--Jim
On Fri, Jul 6, 2018 at 2:52 PM, Jamie Lawrence <
jlawrence@squaretrade.com> wrote:
...
Hi Jim,
I don't have any targeted suggestions, because there isn't much to
latch on to. I can say Gluster replica three (no arbiters) on dedicated
servers serving a couple of oVirt VM clusters here has not had these sorts
of issues.
I suspect your long heal times (and the resultant long periods of
high load) are at least partly related to 1G networking. That is just a
matter of IO - heals of VMs involve moving a lot of bits. My cluster uses
10G bonded NICs on the gluster and ovirt boxes for storage traffic and
separate bonded 1G for ovirtmgmt and communication with other
machines/people, and we're occasionally hitting the bandwidth ceiling on
the storage network. I'm starting to think about 40/100G, different ways of
splitting up intensive systems, and considering iSCSI for specific volumes,
although I really don't want to go there.
I don't run FreeNAS[1], but I do run FreeBSD as storage servers for
their excellent ZFS implementation, mostly for backups. ZFS will make your
`heal` problem go away, but not your bandwidth problems, which become worse
(because of fewer NICs pushing traffic). 10G hardware is not exactly in the
impulse-buy territory, but if you can, I'd recommend doing some testing
using it. I think at least some of your problems are related.
If that's not possible, my next stops would be optimizing everything
I could about sharding, healing and optimizing for serving the shard size
to squeeze as much performance out of 1G as I could, but that will only go
so far.
-j
[1] FreeNAS is just a storage-tuned FreeBSD with a GUI.
> On Jul 6, 2018, at 1:19 PM, Jim Kusznir <jim@palousetech.com> wrote:
>
> hi all:
>
> Once again my production ovirt cluster is collapsing in on itself.
My servers are intermittently unavailable or degrading, customers are
noticing and calling in.  This seems to be yet another gluster failure that
I haven't been able to pin down.
>
> I posted about this a while ago, but didn't get anywhere (no
replies that I found).  The problem started out as a glusterfsd process
consuming large amounts of ram (up to the point where ram and swap were
exhausted and the kernel OOM killer killed off the glusterfsd process).
For reasons not clear to me at this time, that resulted in any VMs running
on that host and that gluster volume to be paused with I/O error (the
glusterfs process is usually unharmed; why it didn't continue I/O with
other servers is confusing to me).
>
> I have 3 servers and a total of 4 gluster volumes (engine, iso,
data, and data-hdd).  The first 3 are replica 2+arb; the 4th (data-hdd) is
replica 3.  The first 3 are backed by an LVM partition (some thin
provisioned) on an SSD; the 4th is on a seagate hybrid disk (hdd + some
internal flash for acceleration).  data-hdd is the only thing on the disk.
Servers are Dell R610 with the PERC/6i raid card, with the disks
individually passed through to the OS (no raid enabled).
>
> The above RAM usage issue came from the data-hdd volume.
Yesterday, I caught one of the glusterfsd processes at high RAM usage before the
OOM-Killer had to run.  I was able to migrate the VMs off the machine and
for good measure, reboot the entire machine (after taking this opportunity
to run the software updates that ovirt said were pending).  Upon booting
back up, the necessary volume healing began.  However, this time, the
healing caused all three servers to go to very, very high load averages (I
saw just under 200 on one server; typically they've been 40-70) with top
reporting IO Wait at 7-20%.  Network for this volume is a dedicated gig
network.  According to bwm-ng, initially the network bandwidth would hit
50MB/s (yes, bytes), but tailed off to mostly in the kB/s for a while.  All
machines' load averages were still 40+ and gluster volume heal data-hdd
info reported 5 items needing healing.  Servers were intermittently
experiencing IO issues, even on the 3 gluster volumes that appeared largely
unaffected.  Even OS activities on the hosts themselves (logging in,
running commands) would often be very delayed.  The ovirt engine was
seemingly randomly throwing engine down / engine up / engine failed
notifications.  Responsiveness on ANY VM was horrific most of the time,
with random VMs being inaccessible.
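
To track whether the heal backlog is shrinking over time rather than reading the output by eye, the `gluster volume heal data-hdd info` command mentioned above can be polled and its per-brick counts summed. This is a sketch only: it assumes the command prints one `Number of entries: N` line per brick, which may differ between Gluster versions, and both function names are hypothetical:

```python
import re
import subprocess


def count_heal_entries(heal_info_output: str) -> int:
    """Sum the per-brick 'Number of entries: N' lines from heal info output.

    A file pending on several bricks is counted once per brick, so treat the
    result as a rough backlog indicator, not an exact file count.
    """
    counts = re.findall(r"^Number of entries:\s*(\d+)", heal_info_output, re.MULTILINE)
    return sum(int(count) for count in counts)


def pending_heals(volume: str) -> int:
    """Run `gluster volume heal <volume> info` and return the summed entry count."""
    result = subprocess.run(
        ["gluster", "volume", "heal", volume, "info"],
        capture_output=True, text=True, check=True,
    )
    return count_heal_entries(result.stdout)
```

Calling `pending_heals("data-hdd")` every few minutes and logging the result would show whether the count is actually stuck, as it appeared to be here.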
>
> I let the gluster heal run overnight.  By morning, there were still
5 items needing healing, all three servers were still experiencing high
load, and servers were still largely unstable.
>
> I've noticed that all of my ovirt outages (and I've had a lot, way
more than is acceptable for a production cluster) have come from gluster.
I still have 3 VMs whose hard disk images have become corrupted by my last
gluster crash that I haven't had time to repair / rebuild yet (I believe
this crash was caused by the OOM issue previously mentioned, but I didn't
know it at the time).
>
> Is gluster really ready for production yet?  It seems so unstable
to me....  I'm looking at replacing gluster with a dedicated NFS server,
likely FreeNAS.  Any suggestions?  What is the "right" way to do production
storage on this (3 node cluster)?  Can I get this gluster volume stable
enough to get my VMs to run reliably again until I can deploy another
storage solution?
>
> --Jim
> _______________________________________________
> Users mailing list -- users@ovirt.org
> To unsubscribe send an email to users-leave@ovirt.org
> Privacy Statement: https://www.ovirt.org/site/privacy-policy/
> oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
> List Archives: https://lists.ovirt.org/archives/list/users@ovirt.org/message/YQX3LQFQQPW4JTCB7B6FY2LLR6NA2CB3/