Latest version is:
vdsm-cli-4.17.32-1.el7.noarch.rpm 08-Aug-2016 17:36
On Fri, Oct 14, 2016 at 1:11 AM, Francesco Romani <fromani(a)redhat.com>
wrote:
----- Original Message -----
> From: "Simone Tiraboschi" <stirabos(a)redhat.com>
> To: "Steve Dainard" <sdainard(a)spd1.com>, "Francesco Romani" <fromani(a)redhat.com>
> Cc: "users" <users(a)ovirt.org>
> Sent: Friday, October 14, 2016 9:59:49 AM
> Subject: Re: [ovirt-users] Ovirt Hypervisor vdsm.Scheduler logs fill partition
>
> On Fri, Oct 14, 2016 at 1:12 AM, Steve Dainard <sdainard(a)spd1.com> wrote:
>
> > Hello,
> >
> > I had a hypervisor semi-crash this week, 4 of ~10 VMs continued to run,
> > but the others were killed off somehow and all VMs running on this host
> > had '?' status in the ovirt UI.
> >
> > This appears to have been caused by vdsm logs filling up disk space on the
> > logging partition.
> >
> > I've attached the log file vdsm.log.27.xz which shows this error:
> >
> > vdsm.Scheduler::DEBUG::2016-10-11
> > 16:42:09,318::executor::216::Executor::(_discard)
> > Worker discarded: <Worker name=periodic/3017 running <Operation
> > action=<VmDispatcher operation=<class
> > 'virt.periodic.DriveWatermarkMonitor'>
> > at 0x7f8e90021210> at 0x7f8e90021250> discarded at 0x7f8dd123e850>
> >
> > which happens more and more frequently throughout the log.
> >
> > It was a bit difficult to understand what caused the failure, but the logs
> > were getting really large, then being xz'd which compressed 11G+ into a few
> > MB. Once this happened the disk space would be freed, and nagios wouldn't
> > hit the 3rd check to throw a warning, until pretty much right at the crash.
> >
> > I was able to restart vdsmd to resolve the issue, but I still need to know
> > why these logs started to stack up so I can avoid this issue in the future.
> >
>
> We had this one: https://bugzilla.redhat.com/show_bug.cgi?id=1383259
> but in your case the logs are rotating.
> Francesco?
Hi,
yes, it is a different issue. Here the log messages are caused by the Worker
threads of the periodic subsystem, which are leaking[1].
This was a bug in Vdsm (insufficient protection against rogue domains), but the
real problem is that some of your domains are unresponsive at the hypervisor
level. The most likely cause, in turn, is unresponsive storage.
Fixes have been committed and shipped with Vdsm 4.17.34.
See:
https://bugzilla.redhat.com/1364925
HTH,
+++
[1] Actually, they are replaced too quickly, leading to unbounded growth.
So those aren't actually "leaking"; Vdsm is just overzealous in handling one
error condition, making things worse than before.
Still a serious issue, no doubt, but with quite a different cause.
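
To see how quickly the "Worker discarded" messages pile up before the
partition fills, something like the following could be run against the
rotated logs. This is only a minimal sketch, assuming each entry sits on one
line in the format quoted earlier in this thread (timestamp between "::"
separators). The names discards_per_hour and open_log are made up for
illustration and are not part of Vdsm.

```python
import lzma
import re
import sys
from collections import Counter

# Assumed line format, taken from the log excerpt quoted in this thread:
#   vdsm.Scheduler::DEBUG::2016-10-11 16:42:09,318::executor::216::...
#   ...::(_discard) Worker discarded: <Worker name=periodic/3017 ...>
DISCARD_RE = re.compile(
    r"::(\d{4}-\d{2}-\d{2}) (\d{2}):\d{2}:\d{2},\d+::.*Worker discarded"
)


def discards_per_hour(lines):
    """Count 'Worker discarded' entries, bucketed by date and hour."""
    counts = Counter()
    for line in lines:
        match = DISCARD_RE.search(line)
        if match:
            counts[f"{match.group(1)} {match.group(2)}:00"] += 1
    return counts


def open_log(path):
    """Open a plain or xz-compressed log file as text (hypothetical helper)."""
    if path.endswith(".xz"):
        return lzma.open(path, "rt", errors="replace")
    return open(path, errors="replace")


if __name__ == "__main__":
    # Log file paths are passed on the command line.
    for path in sys.argv[1:]:
        with open_log(path) as handle:
            for hour, count in sorted(discards_per_hour(handle).items()):
                print(f"{path}\t{hour}\t{count}")
```

A count that climbs steeply from one hour to the next would match the
unbounded growth described above, and alerting on it should give earlier
warning than a disk-usage check that resets whenever the logs are rotated
and compressed.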
--
Francesco Romani
Red Hat Engineering Virtualization R & D
Phone: 8261328
IRC: fromani