On Fri, Oct 14, 2016 at 1:11 AM, Francesco Romani <fromani@redhat.com> wrote:

----- Original Message -----
> From: "Simone Tiraboschi" <stirabos@redhat.com>
> To: "Steve Dainard" <sdainard@spd1.com>, "Francesco Romani" <fromani@redhat.com>
> Cc: "users" <users@ovirt.org>
> Sent: Friday, October 14, 2016 9:59:49 AM
> Subject: Re: [ovirt-users] Ovirt Hypervisor vdsm.Scheduler logs fill partition
>
> On Fri, Oct 14, 2016 at 1:12 AM, Steve Dainard <sdainard@spd1.com> wrote:
>
> > Hello,
> >
> > I had a hypervisor semi-crash this week, 4 of ~10 VM's continued to run,
> > but the others were killed off somehow and all VM's running on this host
> > had '?' status in the ovirt UI.
> >
> > This appears to have been caused by vdsm logs filling up disk space on the
> > logging partition.
> >
> > I've attached the log file vdsm.log.27.xz which shows this error:
> >
> > vdsm.Scheduler::DEBUG::2016-10-11
> > 16:42:09,318::executor::216::Executor::(_discard)
> > Worker discarded: <Worker name=periodic/3017 running <Operation
> > action=<VmDispatcher operation=<class
> > 'virt.periodic.DriveWatermarkMonitor'>
> > at 0x7f8e90021210> at 0x7f8e90021250> discarded at 0x7f8dd123e850>
> >
> > which happens more and more frequently throughout the log.
> >
> > It was a bit difficult to understand what caused the failure, but the logs
> > were getting really large, then being xz'd which compressed 11G+ into a few
> > MB. Once this happened the disk space would be freed, and nagios wouldn't
> > hit the 3rd check to throw a warning, until pretty much right at the crash.
> >
> > I was able to restart vdsmd to resolve the issue, but I still need to know
> > why these logs started to stack up so I can avoid this issue in the future.
> >
>
> We had this one: https://bugzilla.redhat.com/show_bug.cgi?id=1383259
> but in your case the logs are rotating.
> Francesco?

Hi,

yes, it is a different issue. Here the log messages are caused by the Worker threads
of the periodic subsystem, which are leaking[1].
This was a bug in Vdsm (insufficient protection against rogue domains), but the
real problem is that some of your domains are unresponsive at the hypervisor level.
The most likely cause, in turn, is unresponsive storage.

Fixes have been committed and shipped with Vdsm 4.17.34.

See: https://bugzilla.redhat.com/1364925
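
To check how quickly the "Worker discarded" messages pile up on a host, something like the following could be used. It is only a rough sketch: the regular expression is inferred from the single log line quoted above, and the function names are made up for illustration, they are not part of Vdsm.

```python
import lzma
import re
from collections import Counter

# Pattern inferred from the quoted sample line, e.g.
#   vdsm.Scheduler::DEBUG::2016-10-11 16:42:09,318::executor::216::...
# Captures the date and hour; this format is an assumption.
_DISCARD = re.compile(
    r"::(\d{4}-\d{2}-\d{2} \d{2}):\d{2}:\d{2},\d+::.*Worker discarded"
)


def count_discards(lines):
    """Count 'Worker discarded' messages per hour (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        match = _DISCARD.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def count_discards_in_file(path):
    """Same as count_discards, reading a plain or xz-compressed log file."""
    opener = lzma.open if path.endswith(".xz") else open
    with opener(path, "rt", errors="replace") as log:
        return count_discards(log)
```

Calling count_discards_in_file("vdsm.log.27.xz") and printing the sorted items would show whether the rate keeps climbing hour after hour, which is what the original report describes.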

HTH,

+++

[1] Actually, they are replaced too quickly, leading to unbounded growth.
So they aren't really "leaking"; Vdsm is just overzealous in handling one error condition,
which makes things worse than before.
It is still a serious issue, no doubt, but with a quite different cause.
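
To make the footnote more concrete, here is a toy model of the effect. It is not Vdsm's executor code, just a simplified simulation: it assumes that on every periodic tick one worker blocks on an unresponsive domain, gets discarded and is replaced, while the discarded thread stays alive because Python threads cannot be killed from outside. The `cap` parameter is a purely hypothetical mitigation, not a description of the actual fix.

```python
def simulate(ticks, blocked, cap=None):
    """Return the number of discarded-but-alive workers after each tick.

    Toy model only: 'blocked' means the periodic operation never returns,
    so the worker running it is discarded and replaced on every tick.
    'cap' is a hypothetical upper bound on replacements.
    """
    stuck = 0
    history = []
    for _ in range(ticks):
        if blocked and (cap is None or stuck < cap):
            # The stuck worker is discarded and a fresh one is started,
            # but the old thread is still there, waiting on the domain.
            stuck += 1
        history.append(stuck)
    return history


# Without any bound, the number of stuck workers grows with every tick.
print(simulate(5, blocked=True))
# With a (hypothetical) cap, replacement stops once the bound is reached.
print(simulate(5, blocked=True, cap=2))
```

Each discarded worker also produces one "Worker discarded" debug line, which would explain why the log grows faster and faster as long as the domain stays unresponsive.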

--
Francesco Romani
Red Hat Engineering Virtualization R & D
Phone: 8261328
IRC: fromani