2 Oct 2017

Subject: [ovirt-users] xfs fragmentation problem caused data domain to hang
List-Id: Main users mailing list for oVirt <users.ovirt.org>

Hi.

For my data domain, I have one NFS server with a large RAID filesystem (9 TB).
I'm only using 2 TB of that at the moment. Today, my NFS server hung with
the following error:
...
xfs: possible memory allocation deadlock in kmem_alloc
All 4 virtualization hosts of course had problems since there was no
longer any storage.

In the end, it seems like the problem is related to XFS fragmentation...

I read this great blog here:

https://blog.codecentric.de/en/2017/04/xfs-possible-memory-allocation-deadlo...

In short, I tried this:

# xfs_db -r -c "frag -f" /dev/sdb1
actual 4314253, ideal 43107, fragmentation factor 99.00%

Apparently the fragmentation factor by itself doesn't mean much, but the fact
that the "actual" number of extents is roughly 100 times the "ideal" number
suggests that fragmentation may be the problem.
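
For reference, the percentage that xfs_db prints appears to be simply
(actual - ideal) / actual, which is why it sits near 99% once the actual count
is much larger than the ideal one. A small sketch of that arithmetic follows;
the function names frag_factor and extent_ratio are made up for illustration
and are not part of any XFS tool:

```shell
# Fragmentation factor as a percentage: (actual - ideal) * 100 / actual.
# Assumption: this mirrors what "xfs_db -c frag" reports.
frag_factor() {
  awk -v a="$1" -v i="$2" 'BEGIN { printf "%.2f\n", (a - i) * 100 / a }'
}

# Ratio of actual to ideal extents, which is easier to interpret.
extent_ratio() {
  awk -v a="$1" -v i="$2" 'BEGIN { printf "%.1f\n", a / i }'
}

frag_factor 4314253 43107
extent_ratio 4314253 43107
```

The ratio is the more telling number here: about 100 actual extents for every
ideal one.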

I saw that many of my virtual disks that are written to a lot have, of course,
a lot of extents...

For example, on our main web server disk image, there were 247,597
extents alone!  I took the web server down, and ran the XFS defrag
command on the disk...

# xfs_fsr -v 9a634692-1302-471f-a92e-c978b2b67fd0
9a634692-1302-471f-a92e-c978b2b67fd0
extents before:247597 after:429 DONE 9a634692-1302-471f-a92e-c978b2b67fd0

247,597 before and 429 after!  WOW!
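
To find which images are worst without checking them one at a time, the
extent counts can be ranked. This is only a sketch: it assumes filefrag (from
e2fsprogs) is installed and prints its usual "<path>: <N> extents found"
lines, and rank_by_extents is a hypothetical helper name:

```shell
# Turn filefrag output lines of the form
#   "<path>: <N> extents found"   (or "1 extent found")
# into "<N> <path>" and sort with the most fragmented file first.
rank_by_extents() {
  sed -E 's/^(.*): ([0-9]+) extents? found$/\2 \1/' | sort -rn
}

# Example with canned input instead of a real filefrag run:
printf '%s\n' \
  'image-a: 429 extents found' \
  'image-b: 247597 extents found' | rank_by_extents
```

On a real system the input would come from something like
"filefrag /path/to/images/* | rank_by_extents | head", where the path is a
placeholder for wherever the disk images live.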

Are virtual disks a problem with XFS?  Why isn't this memory allocation
deadlock issue more prevalent?  I do see this article mentioned in many
web posts, but I don't specifically see any recommendation to *not* use
XFS for the data domain.

I was running CentOS 7.3 on the file server, but before rebooting it,
I upgraded to CentOS 7.4 and the latest kernel, in the hope that if this
was a kernel issue, the upgrade would solve it.

I took a few virtual systems down, and ran the defrag on the disks.  However,
with over 30 virtual systems, I don't really want to do this individually.
I was wondering if I could run xfs_fsr on all the disks LIVE?  It says in the
manual that you can run it live, but I can't see how this would be good when
a system is using that disk, and I don't want to deal with major
corruption across the board. Any thoughts?
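
To make the batch idea concrete, here is a sketch that only prints the
xfs_fsr commands it would run, for images above some extent threshold. It
deliberately does not run anything, since whether this is safe on images that
are in use is exactly the open question. plan_fsr and the threshold of 1000
are hypothetical, and the input is assumed to be "<extents> <path>" lines:

```shell
# Read "<extents> <path>" lines on stdin and print (not execute) an
# xfs_fsr command for each path whose extent count exceeds the threshold.
# Paths containing spaces would need extra quoting before being run.
plan_fsr() {
  threshold="$1"
  while read -r extents path; do
    if [ "$extents" -gt "$threshold" ]; then
      printf 'xfs_fsr -v %s\n' "$path"
    fi
  done
}

# Example with canned input:
printf '%s\n' '247597 image-b' '429 image-a' | plan_fsr 1000
```

Only image-b crosses the threshold in this example, so a single command is
printed for it.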

Thanks,

Jason.

Jason Keltz