I think I should throw one more thing out there: the current batch of
problems started essentially today, and I did apply the updates waiting in
the ovirt repos (through the ovirt management interface: install updates).
Perhaps something from that update is now breaking things.
On Fri, Jul 6, 2018 at 10:51 PM, Jim Kusznir <jim(a)palousetech.com> wrote:
In case it matters, the data-hdd gluster volume uses these hard
drives:
https://www.amazon.com/gp/product/B01M1NHCZT/ref=oh_aui_detailpage_o05_s00?ie=UTF8&psc=1
This is in a Dell R610 with PERC6/i (one drive per server, configured as a
single-drive volume to pass it through as its own /dev/sd* device). Inside
the OS, it's partitioned with lvm_thin, then an LVM volume formatted with
XFS and mounted as /gluster/brick3, with the data-hdd volume created inside
that.
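For reference, that layout would be built roughly along these lines (a
sketch only; the device, volume group, and size names below are
illustrative placeholders, not the exact commands run):

    # the PERC6/i single-drive volume shows up as its own disk (e.g. /dev/sdb)
    pvcreate /dev/sdb
    vgcreate gluster_vg /dev/sdb
    # carve out a thin pool, then a thin LV for the brick
    lvcreate -l 95%FREE --thinpool tp_brick3 gluster_vg
    lvcreate -V 1.8T --thin -n brick3 gluster_vg/tp_brick3
    # XFS with 512-byte inodes is the usual recommendation for gluster bricks
    mkfs.xfs -i size=512 /dev/gluster_vg/brick3
    mkdir -p /gluster/brick3
    mount /dev/gluster_vg/brick3 /gluster/brick3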
--Jim
On Fri, Jul 6, 2018 at 10:45 PM, Jim Kusznir <jim(a)palousetech.com> wrote:
> So, I'm still at a loss... It sounds like it's either insufficient
> ram/swap or insufficient network, and it seems to be neither now. At this
> point, it appears that gluster is just "broke" and killing my systems for
> no discernible reason. Here are details, all from the same system
> (currently running 3 VMs):
>
> [root@ovirt3 ~]# w
> 22:26:53 up 36 days, 4:34, 1 user, load average: 42.78, 55.98, 53.31
> USER TTY FROM LOGIN@ IDLE JCPU PCPU WHAT
> root pts/0 192.168.8.90 22:26 2.00s 0.12s 0.11s w
>
> bwm-ng reports the highest data usage was about 6MB/s during this test
> (and that was combined; I have two different gig networks: the gluster
> network (primary VM storage) runs on one, and the other network handles
> everything else).
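> (For reference, I was watching with something like the following; the
> interface names here are examples, not necessarily my exact ones:)
>
>     # poll both NICs every second, reporting rates in bytes
>     bwm-ng -u bytes -t 1000 -I em1,em2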
>
> [root@ovirt3 ~]# free -m
>               total        used        free      shared  buff/cache   available
> Mem:          31996       13236         232          18       18526       18195
> Swap:         16383        1475       14908
>
> top - 22:32:56 up 36 days, 4:41, 1 user, load average: 17.99, 39.69, 47.66
> Tasks: 407 total, 1 running, 405 sleeping, 1 stopped, 0 zombie
> %Cpu(s): 8.6 us, 2.1 sy, 0.0 ni, 87.6 id, 1.6 wa, 0.0 hi, 0.1 si, 0.0 st
> KiB Mem : 32764284 total, 228296 free, 13541952 used, 18994036 buff/cache
> KiB Swap: 16777212 total, 15246200 free, 1531012 used. 18643960 avail Mem
>
>   PID USER      PR  NI    VIRT    RES   SHR S  %CPU %MEM     TIME+ COMMAND
> 30036 qemu      20   0 6872324   5.2g 13532 S 144.6 16.5 216:14.55 /usr/libexec/qemu-kvm -name guest=BillingWin,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/v+
> 28501 qemu      20   0 5034968   3.6g 12880 S  16.2 11.7  73:44.99 /usr/libexec/qemu-kvm -name guest=FusionPBX,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/va+
>  2694 root      20   0 2169224  12164  3108 S   5.0  0.0   3290:42 /usr/sbin/glusterfsd -s ovirt3.nwfiber.com --volfile-id data.ovirt3.nwfiber.com.gluster-brick2-data -p /var/run/+
> 14293 root      15  -5  944700  13356  4436 S   4.0  0.0  16:32.15 /usr/sbin/glusterfs --volfile-server=192.168.8.11 --volfile-server=192.168.8.12 --volfile-server=192.168.8.13 --+
> 25100 vdsm       0 -20 6747440 107868 12836 S   2.3  0.3  21:35.20 /usr/bin/python2 /usr/share/vdsm/vdsmd
> 28971 qemu      20   0 2842592   1.5g 13548 S   1.7  4.7 241:46.49 /usr/libexec/qemu-kvm -name guest=unifi.palousetech.com,debug-threads=on -S -object secret,id=masterKey0,format=+
> 12095 root      20   0  162276   2836  1868 R   1.3  0.0   0:00.25 top
>  2708 root      20   0 1906040  12404  3080 S   1.0  0.0   1083:33 /usr/sbin/glusterfsd -s ovirt3.nwfiber.com --volfile-id engine.ovirt3.nwfiber.com.gluster-brick1-engine -p /var/+
> 28623 qemu      20   0 4749536   1.7g 12896 S   0.7  5.5   4:30.64 /usr/libexec/qemu-kvm -name guest=billing.nwfiber.com,debug-threads=on -S -object secret,id=masterKey0,format=ra+
>    10 root      20   0       0      0     0 S   0.3  0.0 215:54.72 [rcu_sched]
>  1030 sanlock   rt   0  773804  27908  2744 S   0.3  0.1  35:55.61 /usr/sbin/sanlock daemon
>  1890 zabbix    20   0   83904   1696  1612 S   0.3  0.0  24:30.63 /usr/sbin/zabbix_agentd: collector [idle 1 sec]
>  2722 root      20   0 1298004   6148  2580 S   0.3  0.0  38:10.82 /usr/sbin/glusterfsd -s ovirt3.nwfiber.com --volfile-id iso.ovirt3.nwfiber.com.gluster-brick4-iso -p /var/run/gl+
>  6340 root      20   0       0      0     0 S   0.3  0.0   0:04.30 [kworker/7:0]
> 10652 root      20   0       0      0     0 S   0.3  0.0   0:00.23 [kworker/u64:2]
> 14724 root      20   0 1076344  17400  3200 S   0.3  0.1  10:04.13 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/run/gluster/glustershd/glustershd.pid -+
> 22011 root      20   0       0      0     0 S   0.3  0.0   0:05.04 [kworker/10:1]
>
>
> Not sure why the system load dropped other than I was trying to take a
> picture of it :)
>
> In any case, it appears that at this time I have plenty of swap, RAM,
> and network capacity, and yet things are still running very sluggishly. I'm
> still getting e-mails from servers complaining about loss of communication
> with something or other, and I still get e-mails from the engine about bad
> engine status, then recovery, etc.
>
> I've shut down 2/3 of my VMs, too....just trying to keep the critical
> ones operating.
>
> At this point, I don't believe the problem is the memory leak itself, but
> it seems to have been triggered by it: all my problems started when I got
> low-RAM warnings from one of my 3 nodes and began recovery efforts in
> response.
>
> I do really like the idea / concept behind glusterfs, but I really have
> to figure out why it's performed so poorly from day one and why it's caused
> 95% of my outages (including several large ones lately). If I can get it
> stable, reliable, and well performing, then I'd love to keep it. If I
> can't, then perhaps NFS is the way to go? I don't like the single point of
> failure aspect of it, but my other NAS boxes I run for clients (central
> storage for windows boxes) have been very solid; if I could get that kind
> of reliability for my ovirt stack, it would be a substantial improvement.
> Currently, it seems about every other month I have a gluster-induced outage.
>
> Sometimes I wonder if hyperconverged itself is the issue, but my
> infrastructure doesn't justify three servers at the same location...I might
> be able to do two, but even that seems like it's pushing it.
>
> Looks like I can upgrade to 10G for about $900. I can order a dual-Xeon
> Supermicro 12-disk server, loaded with 2TB WD Enterprise disks and a pair
> of SSDs for the OS, 32GB RAM, and 2.67GHz CPUs, for about $720 delivered. I've
> got to do something to improve my reliability; I can't keep going the way I
> have been....
>
> --Jim
>
>
> On Fri, Jul 6, 2018 at 9:13 PM, Johan Bernhardsson <johan(a)kafit.se> wrote:
>
>> Load like that is mostly I/O-based: either the machine is swapping or
>> the network is too slow. Check I/O wait in top.
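>> For example, a quick way to see whether a disk is the bottleneck (a
>> sketch; iostat is part of the sysstat package):
>>
>>     # per-device stats every second; high %util and await alongside low
>>     # MB/s usually means the disk, not the network, is the limit
>>     iostat -x 1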
>>
>> And about the problem where the OOM killer kills off gluster: does that
>> mean you don't monitor RAM usage on the servers? Either something is
>> eating all your RAM, swap becomes really I/O-intensive, and the process
>> gets killed off, or you have the wrong swap settings in sysctl.conf.
>> (There are tons of broken guides that recommend setting swappiness to 0,
>> but that disables swap on newer kernels. The proper swappiness for
>> swapping only when necessary is 1, or a sufficiently low number like 10;
>> the default is 60.)
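>> Something along these lines, for example (the file name is just a
>> convention):
>>
>>     # /etc/sysctl.d/99-swappiness.conf
>>     vm.swappiness = 10
>>
>>     # or apply immediately without a reboot:
>>     sysctl -w vm.swappiness=10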
>>
>>
>> Moving to NFS will not improve things. You will get more memory back
>> since gluster isn't running, and that is good, but you will have a single
>> node that can fail with all your storage, it would still be on 1 gigabit
>> only, and your three-node cluster would easily saturate that link.
>>
>> On July 7, 2018 04:13:13 Jim Kusznir <jim(a)palousetech.com> wrote:
>>
>>> So far it does not appear to be helping much. I'm still getting VMs
>>> locking up and all kinds of notices from the oVirt engine about
>>> non-responsive hosts, and I'm still seeing load averages in the 20-30
>>> range.
>>>
>>> Jim
>>>
>>> On Fri, Jul 6, 2018, 3:13 PM Jim Kusznir <jim(a)palousetech.com> wrote:
>>>
>>>> Thank you for the advice and help.
>>>>
>>>> I do plan on going 10Gbps networking; haven't quite jumped off that
>>>> cliff yet, though.
>>>>
>>>> I did put my data-hdd (main VM storage volume) onto a dedicated 1Gbps
>>>> network, and I've watched throughput on that and never seen more than
>>>> 60MB/s achieved (as reported by bwm-ng). I have a separate 1Gbps
>>>> network for communication and ovirt migration, but I wanted to break
>>>> that up further (separate out VM traffic from migration/mgmt traffic).
>>>> My three SSD-backed gluster volumes run on the main network too, as I
>>>> haven't been able to get them to move to the new network (which I was
>>>> trying to use as all gluster). I tried bonding, but that seemed to
>>>> reduce performance rather than improve it.
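>>>> (For what it's worth, the bonding mode matters a lot there: balance-rr
>>>> can reorder TCP segments and hurt single-stream throughput, while
>>>> 802.3ad needs matching switch configuration. The active mode can be
>>>> checked like this; the bond name is an example:)
>>>>
>>>>     grep "Bonding Mode" /proc/net/bonding/bond0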
>>>>
>>>> --Jim
>>>>
>>>> On Fri, Jul 6, 2018 at 2:52 PM, Jamie Lawrence <jlawrence(a)squaretrade.com> wrote:
>>>>
>>>>> Hi Jim,
>>>>>
>>>>> I don't have any targeted suggestions, because there isn't much to
>>>>> latch on to. I can say Gluster replica three (no arbiters) on dedicated
>>>>> servers serving a couple Ovirt VM clusters here have not had these
>>>>> sorts of issues.
>>>>>
>>>>> I suspect your long heal times (and the resultant long periods of
>>>>> high load) are at least partly related to 1G networking. That is just a
>>>>> matter of IO - heals of VMs involve moving a lot of bits. My cluster
>>>>> uses 10G bonded NICs on the gluster and ovirt boxes for storage traffic
>>>>> and separate bonded 1G for ovirtmgmt and communication with other
>>>>> machines/people, and we're occasionally hitting the bandwidth ceiling
>>>>> on the storage network. I'm starting to think about 40/100G, different
>>>>> ways of splitting up intensive systems, and considering iSCSI for
>>>>> specific volumes, although I really don't want to go there.
>>>>>
>>>>> I don't run FreeNAS[1], but I do run FreeBSD as storage servers for
>>>>> their excellent ZFS implementation, mostly for backups. ZFS will make
>>>>> your `heal` problem go away, but not your bandwidth problems, which
>>>>> become worse (because of fewer NICs pushing traffic). 10G hardware is
>>>>> not exactly in impulse-buy territory, but if you can, I'd recommend
>>>>> doing some testing using it. I think at least some of your problems
>>>>> are related.
>>>>>
>>>>> If that's not possible, my next stops would be optimizing everything
>>>>> I could about sharding, healing and optimizing for serving the shard
>>>>> size to squeeze as much performance out of 1G as I could, but that
>>>>> will only go so far.
>>>>>
>>>>> -j
>>>>>
>>>>> [1] FreeNAS is just a storage-tuned FreeBSD with a GUI.
>>>>>
>>>>> > On Jul 6, 2018, at 1:19 PM, Jim Kusznir <jim(a)palousetech.com> wrote:
>>>>> >
>>>>> > Hi all:
>>>>> >
>>>>> > Once again my production ovirt cluster is collapsing in on itself.
>>>>> > My servers are intermittently unavailable or degrading, customers are
>>>>> > noticing and calling in. This seems to be yet another gluster failure
>>>>> > that I haven't been able to pin down.
>>>>> >
>>>>> > I posted about this a while ago, but didn't get anywhere (no
>>>>> > replies that I found). The problem started out as a glusterfsd
>>>>> > process consuming large amounts of RAM (up to the point where RAM and
>>>>> > swap were exhausted and the kernel OOM killer killed off the
>>>>> > glusterfsd process). For reasons not clear to me at this time, that
>>>>> > resulted in any VMs running on that host and that gluster volume
>>>>> > being paused with I/O errors (the glusterfs process is usually
>>>>> > unharmed; why it didn't continue I/O with the other servers is
>>>>> > confusing to me).
>>>>> >
>>>>> > I have 3 servers and a total of 4 gluster volumes (engine, iso,
>>>>> > data, and data-hdd). The first 3 are replica 2+arb; the 4th
>>>>> > (data-hdd) is replica 3. The first 3 are backed by an LVM partition
>>>>> > (some thin provisioned) on an SSD; the 4th is on a Seagate hybrid
>>>>> > disk (hdd + some internal flash for acceleration). data-hdd is the
>>>>> > only thing on that disk. Servers are Dell R610 with the PERC/6i raid
>>>>> > card, with the disks individually passed through to the OS (no raid
>>>>> > enabled).
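>>>>> > For reference, the volume layout can be confirmed from any node with:
>>>>> >
>>>>> >     gluster volume info data-hdd
>>>>> >     gluster volume status data-hdd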
>>>>> >
>>>>> > The above RAM usage issue came from the data-hdd volume.
>>>>> > Yesterday, I caught one of the glusterfsd high-RAM-usage episodes
>>>>> > before the OOM killer had to run. I was able to migrate the VMs off
>>>>> > the machine and, for good measure, reboot the entire machine (after
>>>>> > taking this opportunity to run the software updates that ovirt said
>>>>> > were pending). Upon booting back up, the necessary volume healing
>>>>> > began. However, this time, the healing caused all three servers to
>>>>> > go to very, very high load averages (I saw just under 200 on one
>>>>> > server; typically they've been 40-70) with top reporting IO wait at
>>>>> > 7-20%. Network for this volume is a dedicated gig network. According
>>>>> > to bwm-ng, initially the network bandwidth would hit 50MB/s (yes,
>>>>> > bytes), but tailed off to mostly in the kB/s range for a while. All
>>>>> > machines' load averages were still 40+ and "gluster volume heal
>>>>> > data-hdd info" reported 5 items needing healing. Servers were
>>>>> > intermittently experiencing IO issues, even on the 3 gluster volumes
>>>>> > that appeared largely unaffected. Even OS activities on the hosts
>>>>> > themselves (logging in, running commands) would often be very
>>>>> > delayed. The ovirt engine was seemingly randomly throwing engine
>>>>> > down / engine up / engine failed notifications. Responsiveness on
>>>>> > ANY VM was horrific most of the time, with random VMs being
>>>>> > inaccessible.
>>>>> >
>>>>> > I let the gluster heal run overnight. By morning, there were still
>>>>> > 5 items needing healing, all three servers were still experiencing
>>>>> > high load, and servers were still largely unstable.
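>>>>> > (I was checking the heal state with commands like these;
>>>>> > "statistics heal-count" is available on recent gluster releases:)
>>>>> >
>>>>> >     gluster volume heal data-hdd info
>>>>> >     gluster volume heal data-hdd statistics heal-count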
>>>>> >
>>>>> > I've noticed that all of my ovirt outages (and I've had a lot, way
>>>>> > more than is acceptable for a production cluster) have come from
>>>>> > gluster. I still have 3 VMs whose hard disk images were corrupted by
>>>>> > my last gluster crash and that I haven't had time to repair / rebuild
>>>>> > yet (I believe that crash was caused by the OOM issue previously
>>>>> > mentioned, but I didn't know it at the time).
>>>>> >
>>>>> > Is gluster really ready for production yet? It seems so unstable
>>>>> > to me.... I'm looking at replacing gluster with a dedicated NFS
>>>>> > server, likely FreeNAS. Any suggestions? What is the "right" way to
>>>>> > do production storage on this (3-node cluster)? Can I get this
>>>>> > gluster volume stable enough to get my VMs to run reliably again
>>>>> > until I can deploy another storage solution?
>>>>> >
>>>>> > --Jim
>>>>> > _______________________________________________
>>>>> > Users mailing list -- users(a)ovirt.org
>>>>> > To unsubscribe send an email to users-leave(a)ovirt.org
>>>>> > Privacy Statement: https://www.ovirt.org/site/privacy-policy/
>>>>> > oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
>>>>> > List Archives: https://lists.ovirt.org/archives/list/users(a)ovirt.org/message/YQX3LQFQQPW4JTCB7B6FY2LLR6NA2CB3/
>>>>>
>>>>>
>>>> _______________________________________________
>>> Users mailing list -- users(a)ovirt.org
>>> To unsubscribe send an email to users-leave(a)ovirt.org
>>> Privacy Statement: https://www.ovirt.org/site/privacy-policy/
>>> oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
>>> List Archives: https://lists.ovirt.org/archives/list/users(a)ovirt.org/message/O2HIECLFMYGKH3KSZHHSMDUVGOEBI7GQ/
>>>
>>
>>
>