<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<div class="moz-cite-prefix">On 10/02/2017 11:00 AM, Yaniv Kaul
wrote:<br>
</div>
<blockquote type="cite"
cite="mid:CAJgorsb2ctuEaTpNkzvixsDSjF-_ABH6JDMgw5X03WUgZgbo2A@mail.gmail.com">
<div dir="ltr"><br>
<div class="gmail_extra"><br>
<div class="gmail_quote">On Mon, Oct 2, 2017 at 5:57 PM, Jason
Keltz <span dir="ltr"><<a href="mailto:jas@cse.yorku.ca"
target="_blank" moz-do-not-send="true">jas@cse.yorku.ca</a>></span>
wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<div text="#000000" bgcolor="#FFFFFF"><span class=""> <br>
<div class="m_3456688468548054330moz-cite-prefix">On
10/02/2017 10:51 AM, Yaniv Kaul wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr"><br>
<div class="gmail_extra"><br>
<div class="gmail_quote">On Mon, Oct 2, 2017 at
5:14 PM, Jason Keltz <span dir="ltr"><<a
href="mailto:jas@cse.yorku.ca"
target="_blank" moz-do-not-send="true">jas@cse.yorku.ca</a>></span>
wrote:<br>
<blockquote class="gmail_quote"
style="margin:0 0 0 .8ex;border-left:1px
#ccc solid;padding-left:1ex">
<div text="#000000" bgcolor="#FFFFFF"><span>
<br>
<div
class="m_3456688468548054330m_-6564063642909371047moz-cite-prefix">On
10/02/2017 01:22 AM, Yaniv Kaul wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr"><br>
<div class="gmail_extra"><br>
<div class="gmail_quote">On Mon,
Oct 2, 2017 at 5:11 AM, Jason
Keltz <span dir="ltr"><<a
href="mailto:jas@cse.yorku.ca"
target="_blank"
moz-do-not-send="true">jas@cse.yorku.ca</a>></span>
wrote:<br>
<blockquote class="gmail_quote"
style="margin:0 0 0
.8ex;border-left:1px #ccc
solid;padding-left:1ex">Hi.<br>
<br>
For my data domain, I have one
NFS server with a large RAID
filesystem (9 TB).<br>
I'm only using 2 TB of that at
the moment. Today, my NFS
server hung with<br>
the following error:<br>
<br>
<blockquote
class="gmail_quote"
style="margin:0 0 0
.8ex;border-left:1px #ccc
solid;padding-left:1ex">
xfs: possible memory
allocation deadlock in
kmem_alloc<br>
</blockquote>
</blockquote>
<div><br>
</div>
<div>Can you share more of the
log so we'll see what happened
before and after?</div>
<div>Y.</div>
</div>
</div>
</div>
</blockquote>
</span><span class="">
<blockquote type="cite">
<div dir="ltr">
<div class="gmail_extra">
<div class="gmail_quote">
<blockquote class="gmail_quote"
style="margin:0 0 0
.8ex;border-left:1px #ccc
solid;padding-left:1ex">
<div text="#000000"
bgcolor="#FFFFFF"> <br>
Here is engine-log from
yesterday.. the problem
started around 14:29 PM.<br>
<a
class="m_3456688468548054330m_-6564063642909371047moz-txt-link-freetext"
href="http://www.eecs.yorku.ca/%7Ejas/ovirt-debug/10012017/engine-log.txt"
target="_blank"
moz-do-not-send="true">http://www.eecs.yorku.ca/~jas/<wbr>ovirt-debug/10012017/engine-lo<wbr>g.txt</a><br>
<br>
Here is the vdsm log on one
of the virtualization hosts,
virt01:<br>
<a
class="m_3456688468548054330m_-6564063642909371047moz-txt-link-freetext"
href="http://www.eecs.yorku.ca/%7Ejas/ovirt-debug/10012017/vdsm.log.2"
target="_blank"
moz-do-not-send="true">http://www.eecs.yorku.ca/~jas/<wbr>ovirt-debug/10012017/vdsm.log.<wbr>2</a><br>
<br>
Doing further investigation,
I found that the XFS error
messages didn't start
yesterday. You'll see they
started at the very end of
the day on September 23.
See:<br>
<br>
<a
class="m_3456688468548054330m_-6564063642909371047moz-txt-link-freetext"
href="http://www.eecs.yorku.ca/%7Ejas/ovirt-debug/messages-20170924"
target="_blank"
moz-do-not-send="true">http://www.eecs.yorku.ca/~jas/<wbr>ovirt-debug/messages-20170924</a>
<br>
</div>
</blockquote>
<div><br>
</div>
<div>Our storage guys do NOT
think it's an XFS
fragmentation issue, but we'll
be looking at it.</div>
<div> </div>
</div>
</div>
</div>
</blockquote>
</span> Hmmm... almost sorry to hear that
because that would be easy to "fix"... <br>
<span class=""> <br>
<blockquote type="cite">
<div dir="ltr">
<div class="gmail_extra">
<div class="gmail_quote">
<blockquote class="gmail_quote"
style="margin:0 0 0
.8ex;border-left:1px #ccc
solid;padding-left:1ex">
<div text="#000000"
bgcolor="#FFFFFF"> <br>
They continued on the 24th,
then on the 26th... I think
there were a few "hangs" on
those times that people were
complaining about, but we
didn't catch the problem.
However, the errors hit big
time yesterday at 14:27
PM... see here:<br>
<br>
<a
class="m_3456688468548054330m_-6564063642909371047moz-txt-link-freetext"
href="http://www.eecs.yorku.ca/%7Ejas/ovirt-debug/messages-20171001"
target="_blank"
moz-do-not-send="true">http://www.eecs.yorku.ca/~jas/<wbr>ovirt-debug/messages-20171001</a><br>
<br>
If you want any other logs,
I'm happy to provide them.
I just don't know exactly
what to provide.<br>
<br>
Do you know if I can run the
XFS defrag command live?
Rather than on a disk by
disk, I'd rather just do it
on the whole filesystem.
There really aren't that
many files since it's just
ovirt disk images. However,
I don't understand the
implications to running
VMs. I wouldn't want to do
anything to create more
downtime.<br>
</div>
</blockquote>
<div><br>
</div>
<div>Should be enough to copy
the disks to make them less
fragmented.</div>
<div> </div>
</div>
</div>
</div>
</blockquote>
</span> Yes, but this requires downtime..
but there's plenty of additional storage,
so this would fix things well.</div>
</blockquote>
</div>
</div>
</div>
</blockquote>
</span></div>
</blockquote>
<div><br>
</div>
<div>Live storage migration could be used.</div>
<div>Y.</div>
<div> </div>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<div text="#000000" bgcolor="#FFFFFF"><span class=""><br>
<br>
<blockquote type="cite">
<div dir="ltr">
<div class="gmail_extra">
<div class="gmail_quote">
<blockquote class="gmail_quote"
style="margin:0 0 0 .8ex;border-left:1px
#ccc solid;padding-left:1ex">
<div text="#000000" bgcolor="#FFFFFF"> <br>
I had upgraded the engine server + 4
virtualization hosts from 4.1.1 to current
on September 20 along with upgrading them
from CentOS 7.3 to CentOS 7.4. virtfs,
the NFS file server, was running CentOS
7.3 and kernel
vmlinuz-3.10.0-514.16.1.el7.x8<wbr>6_64.
Only yesterday, did I upgrade it to CentOS
7.4 and hence kernel
vmlinuz-3.10.0-693.2.2.el7.x86<wbr>_64.<br>
<br>
I believe the problem is fully XFS
related, and not ovirt at all. Although,
I must admit, ovirt didn't help either.
When I rebooted the file server, the iso
and export domains were immediately
active, but the data domain took quite a
long time. I kept trying to activate it,
and it couldn't do it. I couldn't make a
host an SPM. I found that the data domain
directory on the virtualization host was a
"stale NFS file handle". I rebooted one
of the virtualization hosts (virt1), and
tried to make it the SPM. Again, it
wouldn't work. Finally, I ended up
turning everything into maintenance mode,
then activating just it, and I was able to
make it the SPM. I was then able to bring
everything up. I would have expected
ovirt to handle the problem a little more
gracefully, and give me more information
because I was sweating thinking I had to
restore all the VMs!<br>
</div>
</blockquote>
<div><br>
</div>
<div>Stale NFS is on our todo list to handle.
Quite challenging.</div>
<div> </div>
</div>
</div>
</div>
</blockquote>
</span> Thanks..<span class=""><br>
<br>
<blockquote type="cite">
<div dir="ltr">
<div class="gmail_extra">
<div class="gmail_quote">
<blockquote class="gmail_quote"
style="margin:0 0 0 .8ex;border-left:1px
#ccc solid;padding-left:1ex">
<div text="#000000" bgcolor="#FFFFFF"> <br>
I didn't think when I chose XFS as the
filesystem for my virtualization NFS
server that I would have to defragment the
filesystem manually. This is like the old
days of running Norton SpeedDisk to defrag
my 386...<br>
</div>
</blockquote>
<div><br>
</div>
<div>We are still not convinced it's an issue
- but we'll look into it (and perhaps ask
for more stats and data).</div>
</div>
</div>
</div>
</blockquote>
</span> Thanks!
<div>
<div class="h5"><br>
<br>
<blockquote type="cite">
<div dir="ltr">
<div class="gmail_extra">
<div class="gmail_quote">
<div>Y.</div>
<div> </div>
<blockquote class="gmail_quote"
style="margin:0 0 0 .8ex;border-left:1px
#ccc solid;padding-left:1ex">
<div text="#000000" bgcolor="#FFFFFF"> <br>
Thanks for any help you can provide...<span
class="m_3456688468548054330HOEnZb"><font
color="#888888"><br>
<br>
Jason.</font></span>
<div>
<div class="m_3456688468548054330h5"><br>
<br>
<blockquote type="cite">
<div dir="ltr">
<div class="gmail_extra">
<div class="gmail_quote">
<div> </div>
<blockquote
class="gmail_quote"
style="margin:0 0 0
.8ex;border-left:1px #ccc
solid;padding-left:1ex">
<blockquote
class="gmail_quote"
style="margin:0 0 0
.8ex;border-left:1px
#ccc
solid;padding-left:1ex">
</blockquote>
<br>
All 4 virtualization hosts
of course had problems
since there was no<br>
longer any storage.<br>
<br>
In the end, it seems like
the problem is related to
XFS fragmentation...<br>
<br>
I read this great blog
here:<br>
<br>
<a
href="https://blog.codecentric.de/en/2017/04/xfs-possible-memory-allocation-deadlock-kmem_alloc/"
rel="noreferrer"
target="_blank"
moz-do-not-send="true">https://blog.codecentric.de/en<wbr>/2017/04/xfs-possible-memory-a<wbr>llocation-deadlock-kmem_alloc/</a><br>
<br>
In short, I tried this:<br>
<br>
# xfs_db -r -c "frag -f"
/dev/sdb1<br>
actual 4314253, ideal
43107, fragmentation
factor 99.00%<br>
<br>
Apparently the
fragmentation factor
doesn't mean much, but the
fact that<br>
"actual" number of extents
is considerably higher
than "ideal" extents seems
that it<br>
may be the problem.<br>
<br>
I saw that many of my
virtual disks that are
written to a lot have, of
course,<br>
a lot of extents...<br>
<br>
For example, on our main
web server disk image,
there were 247,597<br>
extents alone! I took the
web server down, and ran
the XFS defrag<br>
command on the disk...<br>
<br>
# xfs_fsr -v
9a634692-1302-471f-a92e-c978b2<wbr>b67fd0<br>
9a634692-1302-471f-a92e-c978b2<wbr>b67fd0<br>
extents before:247597
after:429 DONE
9a634692-1302-471f-a92e-c978b2<wbr>b67fd0<br>
<br>
247,597 before and 429
after! WOW!<br>
<br>
Are virtual disks a
problem with XFS? Why
isn't this memory
allocation<br>
deadlock issue more
prevalent. I do see this
article mentioned on many<br>
web posts. I don't
specifically see any
recommendation to *not*
use<br>
XFS for the data domain
though.<br>
<br>
I was running CentOS 7.3
on the file server, but
before rebooting the
server,<br>
I upgraded to the latest
kernel and CentOS 7.4 in
the hopes that if there<br>
was a kernel issue, that
this would solve it.<br>
<br>
I took a few virtual
systems down, and ran the
defrag on the disks.
However,<br>
with over 30 virtual
systems, I don't really
want to do this
individually.<br>
I was wondering if I could
run xfs_fsr on all the
disks LIVE? It says in
the<br>
manual that you can run it
live, but I can't see how
this would be good when<br>
a system is using that
disk, and I don't want to
deal with major<br>
corruption across the
board. Any thoughts?<br>
<br>
Thanks,<br>
<br>
Jason.<br>
<br>
______________________________<wbr>_________________<br>
Users mailing list<br>
<a
href="mailto:Users@ovirt.org"
target="_blank"
moz-do-not-send="true">Users@ovirt.org</a><br>
<a
href="http://lists.ovirt.org/mailman/listinfo/users"
rel="noreferrer"
target="_blank"
moz-do-not-send="true">http://lists.ovirt.org/mailman<wbr>/listinfo/users</a><br>
</blockquote>
</div>
<br>
</div>
</div>
</blockquote>
<br>
</div>
</div>
</div>
</blockquote>
</div>
<br>
</div>
</div>
</blockquote>
<br>
</div>
</div>
</div>
</blockquote>
</div>
<br>
</div>
</div>
</blockquote>
<br>
</body>
</html>