Re: [ovirt-users] xfs fragmentation problem caused data domain to hang
by Jason Keltz
On 10/02/2017 11:00 AM, Yaniv Kaul wrote:
>
>
> On Mon, Oct 2, 2017 at 5:57 PM, Jason Keltz <jas(a)cse.yorku.ca
> <mailto:jas@cse.yorku.ca>> wrote:
>
>
> On 10/02/2017 10:51 AM, Yaniv Kaul wrote:
>>
>>
>> On Mon, Oct 2, 2017 at 5:14 PM, Jason Keltz <jas(a)cse.yorku.ca
>> <mailto:jas@cse.yorku.ca>> wrote:
>>
>>
>> On 10/02/2017 01:22 AM, Yaniv Kaul wrote:
>>>
>>>
>>> On Mon, Oct 2, 2017 at 5:11 AM, Jason Keltz
>>> <jas(a)cse.yorku.ca <mailto:jas@cse.yorku.ca>> wrote:
>>>
>>> Hi.
>>>
>>> For my data domain, I have one NFS server with a large
>>> RAID filesystem (9 TB).
>>> I'm only using 2 TB of that at the moment. Today, my NFS
>>> server hung with
>>> the following error:
>>>
>>> xfs: possible memory allocation deadlock in kmem_alloc
>>>
>>>
>>> Can you share more of the log so we'll see what happened
>>> before and after?
>>> Y.
>>>
>>>
>>> Here is engine-log from yesterday.. the problem started
>>> around 14:29 PM.
>>> http://www.eecs.yorku.ca/~jas/ovirt-debug/10012017/engine-log.txt
>>> <http://www.eecs.yorku.ca/%7Ejas/ovirt-debug/10012017/engine-log.txt>
>>>
>>> Here is the vdsm log on one of the virtualization hosts,
>>> virt01:
>>> http://www.eecs.yorku.ca/~jas/ovirt-debug/10012017/vdsm.log.2
>>> <http://www.eecs.yorku.ca/%7Ejas/ovirt-debug/10012017/vdsm.log.2>
>>>
>>> Doing further investigation, I found that the XFS error
>>> messages didn't start yesterday. You'll see they
>>> started at the very end of the day on September 23. See:
>>>
>>> http://www.eecs.yorku.ca/~jas/ovirt-debug/messages-20170924
>>> <http://www.eecs.yorku.ca/%7Ejas/ovirt-debug/messages-20170924>
>>>
>>>
>>>
>>> Our storage guys do NOT think it's an XFS fragmentation
>>> issue, but we'll be looking at it.
>> Hmmm... almost sorry to hear that because that would be easy
>> to "fix"...
>>
>>>
>>> They continued on the 24th, then on the 26th... I think
>>> there were a few "hangs" on those times that people were
>>> complaining about, but we didn't catch the problem.
>>> However, the errors hit big time yesterday at 14:27
>>> PM... see here:
>>>
>>> http://www.eecs.yorku.ca/~jas/ovirt-debug/messages-20171001
>>> <http://www.eecs.yorku.ca/%7Ejas/ovirt-debug/messages-20171001>
>>>
>>> If you want any other logs, I'm happy to provide them. I
>>> just don't know exactly what to provide.
>>>
>>> Do you know if I can run the XFS defrag command live?
>>> Rather than on a disk by disk, I'd rather just do it on
>>> the whole filesystem. There really aren't that many
>>> files since it's just ovirt disk images. However, I
>>> don't understand the implications to running VMs. I
>>> wouldn't want to do anything to create more downtime.
>>>
>>>
>>> Should be enough to copy the disks to make them less fragmented.
>> Yes, but this requires downtime.. but there's plenty of
>> additional storage, so this would fix things well.
>>
>
> Live storage migration could be used.
> Y.
>
>
>
>>
>> I had upgraded the engine server + 4 virtualization hosts
>> from 4.1.1 to current on September 20 along with upgrading
>> them from CentOS 7.3 to CentOS 7.4. virtfs, the NFS file
>> server, was running CentOS 7.3 and kernel
>> vmlinuz-3.10.0-514.16.1.el7.x86_64. Only yesterday, did I
>> upgrade it to CentOS 7.4 and hence kernel
>> vmlinuz-3.10.0-693.2.2.el7.x86_64.
>>
>> I believe the problem is fully XFS related, and not ovirt at
>> all. Although, I must admit, ovirt didn't help either. When
>> I rebooted the file server, the iso and export domains were
>> immediately active, but the data domain took quite a long
>> time. I kept trying to activate it, and it couldn't do it.
>> I couldn't make a host an SPM. I found that the data domain
>> directory on the virtualization host was a "stale NFS file
>> handle". I rebooted one of the virtualization hosts (virt1),
>> and tried to make it the SPM. Again, it wouldn't work.
>> Finally, I ended up turning everything into maintenance mode,
>> then activating just it, and I was able to make it the SPM.
>> I was then able to bring everything up. I would have
>> expected ovirt to handle the problem a little more
>> gracefully, and give me more information because I was
>> sweating thinking I had to restore all the VMs!
>>
>>
>> Stale NFS is on our todo list to handle. Quite challenging.
> Thanks..
>
>>
>> I didn't think when I chose XFS as the filesystem for my
>> virtualization NFS server that I would have to defragment the
>> filesystem manually. This is like the old days of running
>> Norton SpeedDisk to defrag my 386...
>>
>>
>> We are still not convinced it's an issue - but we'll look into it
>> (and perhaps ask for more stats and data).
> Thanks!
>
>
>> Y.
>>
>>
>> Thanks for any help you can provide...
>>
>> Jason.
>>
>>
>>>
>>> All 4 virtualization hosts of course had problems since
>>> there was no
>>> longer any storage.
>>>
>>> In the end, it seems like the problem is related to XFS
>>> fragmentation...
>>>
>>> I read this great blog here:
>>>
>>> https://blog.codecentric.de/en/2017/04/xfs-possible-memory-allocation-deadlock-kmem_alloc/
>>>
>>> In short, I tried this:
>>>
>>> # xfs_db -r -c "frag -f" /dev/sdb1
>>> actual 4314253, ideal 43107, fragmentation factor 99.00%
>>>
>>> Apparently the fragmentation factor doesn't mean much,
>>> but the fact that
>>> "actual" number of extents is considerably higher than
>>> "ideal" extents seems that it
>>> may be the problem.
>>>
>>> I saw that many of my virtual disks that are written to
>>> a lot have, of course,
>>> a lot of extents...
>>>
>>> For example, on our main web server disk image, there
>>> were 247,597
>>> extents alone! I took the web server down, and ran the
>>> XFS defrag
>>> command on the disk...
>>>
>>> # xfs_fsr -v 9a634692-1302-471f-a92e-c978b2b67fd0
>>> 9a634692-1302-471f-a92e-c978b2b67fd0
>>> extents before:247597 after:429 DONE
>>> 9a634692-1302-471f-a92e-c978b2b67fd0
>>>
>>> 247,597 before and 429 after! WOW!
>>>
>>> Are virtual disks a problem with XFS? Why isn't this
>>> memory allocation
>>> deadlock issue more prevalent? I do see this article
>>> mentioned on many
>>> web posts. I don't specifically see any recommendation
>>> to *not* use
>>> XFS for the data domain though.
>>>
>>> I was running CentOS 7.3 on the file server, but before
>>> rebooting the server,
>>> I upgraded to the latest kernel and CentOS 7.4 in the
>>> hopes that if there
>>> was a kernel issue, that this would solve it.
>>>
>>> I took a few virtual systems down, and ran the defrag on
>>> the disks. However,
>>> with over 30 virtual systems, I don't really want to do
>>> this individually.
>>> I was wondering if I could run xfs_fsr on all the disks
>>> LIVE? It says in the
>>> manual that you can run it live, but I can't see how
>>> this would be good when
>>> a system is using that disk, and I don't want to deal
>>> with major
>>> corruption across the board. Any thoughts?
>>>
>>> Thanks,
>>>
>>> Jason.
>>>
>>> _______________________________________________
>>> Users mailing list
>>> Users(a)ovirt.org <mailto:Users@ovirt.org>
>>> http://lists.ovirt.org/mailman/listinfo/users
>>> <http://lists.ovirt.org/mailman/listinfo/users>
>>>
>>>
>>
>>
>
>
3 years, 5 months
libvirt: XML-RPC error : authentication failed: Failed to start SASL
by Ozan Uzun
Hello,
Today I updated my ovirt engine v3.5 and all my hosts in one datacenter
(the centos 7.4 ones),
and suddenly my vdsm and vdsm-network services stopped working.
btw: My other DC is centos 6 based (managed from the same ovirt engine),
and everything works just fine there.
vdsm fails because it depends on the vdsm-network service, which fails with lots of RPC errors.
I tried running vdsm-tool configure --force, then deleted everything
(vdsm-libvirt) and reinstalled.
I could not make it work.
My logs are filled with the following:
Sep 18 23:06:01 node6 python[5340]: GSSAPI Error: Unspecified GSS failure.
Minor code may provide more information (No Kerberos credentials available
(default cache: KEYRING:persistent:0))
Sep 18 23:06:01 node6 vdsm-tool[5340]: libvirt: XML-RPC error :
authentication failed: Failed to start SASL negotiation: -1 (SASL(-1):
generic failure: GSSAPI Error: Unspecified GSS failure. Minor code may
provide more information (No Kerberos credent
Sep 18 23:06:01 node6 libvirtd[4312]: 2017-09-18 20:06:01.954+0000: 4312:
error : virNetSocketReadWire:1808 : End of file while reading data:
Input/output error
-------
journalctl -xe output for vdsm-network
Sep 18 23:06:02 node6 vdsm-tool[5340]: libvirt: XML-RPC error :
authentication failed: Failed to start SASL negotiation: -1 (SASL(-1):
generic failure: GSSAPI Error: Unspecified GSS failure. Minor code may
provide more information (No Kerberos credent
Sep 18 23:06:02 node6 vdsm-tool[5340]: Traceback (most recent call last):
Sep 18 23:06:02 node6 vdsm-tool[5340]: File "/usr/bin/vdsm-tool", line 219,
in main
Sep 18 23:06:02 node6 libvirtd[4312]: 2017-09-18 20:06:02.558+0000: 4312:
error : virNetSocketReadWire:1808 : End of file while reading data:
Input/output error
Sep 18 23:06:02 node6 vdsm-tool[5340]: return
tool_command[cmd]["command"](*args)
Sep 18 23:06:02 node6 vdsm-tool[5340]: File
"/usr/lib/python2.7/site-packages/vdsm/tool/upgrade_300_networks.py", line
83, in upgrade_networks
Sep 18 23:06:02 node6 vdsm-tool[5340]: networks = netinfo.networks()
Sep 18 23:06:02 node6 vdsm-tool[5340]: File
"/usr/lib/python2.7/site-packages/vdsm/netinfo.py", line 112, in networks
Sep 18 23:06:02 node6 vdsm-tool[5340]: conn = libvirtconnection.get()
Sep 18 23:06:02 node6 vdsm-tool[5340]: File
"/usr/lib/python2.7/site-packages/vdsm/libvirtconnection.py", line 159, in
get
Sep 18 23:06:02 node6 vdsm-tool[5340]: conn = _open_qemu_connection()
Sep 18 23:06:02 node6 vdsm-tool[5340]: File
"/usr/lib/python2.7/site-packages/vdsm/libvirtconnection.py", line 95, in
_open_qemu_connection
Sep 18 23:06:02 node6 vdsm-tool[5340]: return utils.retry(libvirtOpen,
timeout=10, sleep=0.2)
Sep 18 23:06:02 node6 vdsm-tool[5340]: File
"/usr/lib/python2.7/site-packages/vdsm/utils.py", line 1108, in retry
Sep 18 23:06:02 node6 vdsm-tool[5340]: return func()
Sep 18 23:06:02 node6 vdsm-tool[5340]: File
"/usr/lib64/python2.7/site-packages/libvirt.py", line 105, in openAuth
Sep 18 23:06:02 node6 vdsm-tool[5340]: if ret is None:raise
libvirtError('virConnectOpenAuth() failed')
Sep 18 23:06:02 node6 vdsm-tool[5340]: libvirtError: authentication failed:
Failed to start SASL negotiation: -1 (SASL(-1): generic failure: GSSAPI
Error: Unspecified GSS failure. Minor code may provide more information
(No Kerberos credentials availa
Sep 18 23:06:02 node6 systemd[1]: vdsm-network.service: control process
exited, code=exited status=1
Sep 18 23:06:02 node6 systemd[1]: Failed to start Virtual Desktop Server
Manager network restoration.
-----
libvirt is running but throws some errors.
[root@node6 ~]# systemctl status libvirtd
● libvirtd.service - Virtualization daemon
Loaded: loaded (/usr/lib/systemd/system/libvirtd.service; enabled;
vendor preset: enabled)
Drop-In: /etc/systemd/system/libvirtd.service.d
└─unlimited-core.conf
Active: active (running) since Mon 2017-09-18 23:15:47 +03; 19min ago
Docs: man:libvirtd(8)
http://libvirt.org
Main PID: 6125 (libvirtd)
CGroup: /system.slice/libvirtd.service
└─6125 /usr/sbin/libvirtd --listen
Sep 18 23:15:56 node6 libvirtd[6125]: 2017-09-18 20:15:56.195+0000: 6125:
error : virNetSocketReadWire:1808 : End of file while reading data:
Input/output error
Sep 18 23:15:56 node6 libvirtd[6125]: 2017-09-18 20:15:56.396+0000: 6125:
error : virNetSocketReadWire:1808 : End of file while reading data:
Input/output error
Sep 18 23:15:56 node6 libvirtd[6125]: 2017-09-18 20:15:56.597+0000: 6125:
error : virNetSocketReadWire:1808 : End of file while reading data:
Input/output error
----------------
[root@node6 ~]# virsh
Welcome to virsh, the virtualization interactive terminal.
Type: 'help' for help with commands
'quit' to quit
virsh # list
error: failed to connect to the hypervisor
error: authentication failed: Failed to start SASL negotiation: -1
(SASL(-1): generic failure: GSSAPI Error: Unspecified GSS failure. Minor
code may provide more information (No Kerberos credentials available
(default cache: KEYRING:persistent:0)))
=================
I do not want to lose all my virtual servers. Is there any way to recover
them? Currently everything is down. I am OK with installing a new ovirt engine if
I can somehow restore my virtual servers. I can also split the centos 6 and
centos 7 hosts across separate ovirt engines.
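Since every error above mentions GSSAPI and missing Kerberos credentials, one thing that may be worth checking is which SASL mechanisms libvirt is configured to offer. The Python sketch below only reads the `mech_list` setting from a SASL configuration file. The path `/etc/sasl2/libvirt.conf` and the idea that a `gssapi` entry is connected to this failure are assumptions, not something confirmed in this thread.

```python
from typing import List


def parse_mech_list(conf_text: str) -> List[str]:
    """Return the mechanisms from the last uncommented 'mech_list:' line."""
    mechs: List[str] = []
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        key, sep, value = line.partition(":")
        if sep and key.strip() == "mech_list":
            mechs = value.split()
    return [m.lower() for m in mechs]


def uses_gssapi(conf_text: str) -> bool:
    """True if GSSAPI (Kerberos) is among the configured mechanisms."""
    return "gssapi" in parse_mech_list(conf_text)


if __name__ == "__main__":
    path = "/etc/sasl2/libvirt.conf"  # assumed location of libvirt's SASL config
    try:
        with open(path) as handle:
            text = handle.read()
    except OSError as exc:
        print(f"cannot read {path}: {exc}")
    else:
        print("mech_list:", parse_mech_list(text) or "(not set)")
        if uses_gssapi(text):
            print("GSSAPI is offered; clients would need a Kerberos ticket.")
```

If GSSAPI turns out to be the only mechanism listed, that would at least be consistent with the "No Kerberos credentials available" messages, but the sketch itself changes nothing on the host.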
3 years, 5 months
SPM recovery after disaster
by Alexander Vrublevskiy
Hello Community!
Recently we had a disaster with our oVirt 4.1 three-node cluster
with HE and a GlusterFS (RF=3) storage domain. We had moved one node to
maintenance, and during the actual maintenance one of the working nodes,
the one with the SPM role, went down. It was a hardware failure, so we had
to remove it from the cluster.
After tinkering around, we now have an almost working cluster with two
nodes and GlusterFS RF=2. But the problem is that oVirt can't find an SPM
and is spamming the web interface logs with the
"HSMGetAllTasksStatusesVDS failed: Not SPM" error.
After some time operating in this configuration, we somehow lost the
contents of dom_md.
It looks like these two problems are related and the second one is a
consequence of the first.
Please suggest how to recover the SPM and dom_md. Is there a way to
recreate both?
TIA
Regards
Alex
3 years, 5 months
Having issue with external IPA
by Yan Naing Myint
Hello guys,
I'm having a problem adding users from my FreeIPA server to oVirt.
1. ovirt-engine-extension-aaa-ldap-setup completes successfully with RHDS
2. I cannot add IPA users in the oVirt webadmin panel
3. The oVirt webadmin panel says "Error while executing action AddUser:
Internal Engine Error"
What could the problem be, or is it a bug?
Are there any suggestions on how to make it work?
The engine.log says:
2017-10-01 17:30:52,436+06 ERROR
[org.ovirt.engine.core.bll.aaa.AddUserCommand] (default task-113)
[bf5822eb-39da-49e5-b2ab-9865f71346a3] Transaction rolled-back for command
'org.ovirt.engine.core.bll.aaa.AddUserCommand'.
2017-10-01 17:30:52,459+06 WARN
[org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
(default task-113) [bf5822eb-39da-49e5-b2ab-9865f71346a3] EVENT_ID:
USER_FAILED_ADD_ADUSER(327), Correlation ID:
bf5822eb-39da-49e5-b2ab-9865f71346a3, Call Stack: null, Custom ID: null,
Custom Event ID: -1, Message: Failed to add User 'mgorca' to the system.
in cyberwings.local.properties
ovirt.engine.extension.name = cyberwings.local
ovirt.engine.extension.bindings.method = jbossmodule
ovirt.engine.extension.binding.jbossmodule.module = org.ovirt.engine-extensions.aaa.ldap
ovirt.engine.extension.binding.jbossmodule.class = org.ovirt.engineextensions.aaa.ldap.AuthzExtension
ovirt.engine.extension.provides = org.ovirt.engine.api.extensions.aaa.Authz
config.profile.file.1 = ../aaa/cyberwings.local.properties
config.globals.baseDN.simple_baseDN = dc=cyberwings,dc=local
in cyberwings.local-authn.properties
ovirt.engine.extension.name = cyberwings.local-authn
ovirt.engine.extension.bindings.method = jbossmodule
ovirt.engine.extension.binding.jbossmodule.module = org.ovirt.engine-extensions.aaa.ldap
ovirt.engine.extension.binding.jbossmodule.class = org.ovirt.engineextensions.aaa.ldap.AuthnExtension
ovirt.engine.extension.provides = org.ovirt.engine.api.extensions.aaa.Authn
ovirt.engine.aaa.authn.profile.name = cyberwings.local
ovirt.engine.aaa.authn.authz.plugin = cyberwings.local
config.profile.file.1 = ../aaa/cyberwings.local.properties
config.globals.baseDN.simple_baseDN = dc=cyberwings,dc=local
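One quick sanity check on a pair of files like these is that the authn extension points at the authz extension by name. The Python sketch below parses simple `key = value` lines and compares `ovirt.engine.aaa.authn.authz.plugin` with `ovirt.engine.extension.name`. The helper names are hypothetical, and the check only covers this one relationship; it does not explain the "Internal Engine Error" by itself.

```python
from typing import Dict, List, Optional


def parse_properties(text: str) -> Dict[str, str]:
    """Parse 'key = value' lines. A key whose value is empty takes the next
    non-blank line as its value (mail clients sometimes wrap long lines)."""
    props: Dict[str, str] = {}
    pending_key: Optional[str] = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if pending_key is not None:
            props[pending_key] = line
            pending_key = None
            continue
        key, sep, value = line.partition("=")  # split on the first '=' only
        if not sep:
            continue
        key, value = key.strip(), value.strip()
        if value:
            props[key] = value
        else:
            pending_key = key
    return props


def check_authn_authz(authz: Dict[str, str], authn: Dict[str, str]) -> List[str]:
    """Return a list of mismatches between the two extension files."""
    problems: List[str] = []
    authz_name = authz.get("ovirt.engine.extension.name")
    plugin = authn.get("ovirt.engine.aaa.authn.authz.plugin")
    if plugin != authz_name:
        problems.append(
            f"authn refers to authz plugin {plugin!r}, "
            f"but the authz extension is named {authz_name!r}"
        )
    return problems


if __name__ == "__main__":
    authz = parse_properties("ovirt.engine.extension.name = cyberwings.local\n")
    authn = parse_properties(
        "ovirt.engine.extension.name = cyberwings.local-authn\n"
        "ovirt.engine.aaa.authn.authz.plugin = cyberwings.local\n"
    )
    print(check_authn_authz(authz, authn) or "authn/authz names are consistent")
```

With the values posted above the two names agree, so if the files on disk match the listing, the cause is probably elsewhere.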
--
Yan Naing Myint
CEO
Server & Network Engineer
Cyber Wings Co., Ltd
http://cyberwings.asia
09799950510
3 years, 5 months
None
by Jason Keltz
Subject: [ovirt-users] xfs fragmentation problem caused data domain to hang
X-List-Received-Date: Mon, 02 Oct 2017 02:42:38 -0000
Hi.
For my data domain, I have one NFS server with a large RAID filesystem (9 TB).
I'm only using 2 TB of that at the moment. Today, my NFS server hung with
the following error:
> xfs: possible memory allocation deadlock in kmem_alloc
All 4 virtualization hosts of course had problems since there was no
longer any storage.
In the end, it seems like the problem is related to XFS fragmentation...
I read this great blog here:
https://blog.codecentric.de/en/2017/04/xfs-possible-memory-allocation-deadlock-kmem_alloc/
In short, I tried this:
# xfs_db -r -c "frag -f" /dev/sdb1
actual 4314253, ideal 43107, fragmentation factor 99.00%
Apparently the fragmentation factor doesn't mean much, but the fact that the
"actual" number of extents is considerably higher than the "ideal" number
suggests that this may be the problem.
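As a rough check on those numbers, the short Python sketch below recomputes the percentage from the two extent counts. It assumes the factor is derived as (actual - ideal) / actual, which is consistent with the output above; the function names are only illustrative.

```python
def fragmentation_factor(actual: int, ideal: int) -> float:
    """Percentage of extents beyond the ideal count (assumed formula)."""
    if actual <= 0:
        return 0.0
    return (actual - ideal) * 100.0 / actual


def extents_per_ideal(actual: int, ideal: int) -> float:
    """How many extents exist for each extent an ideal layout would need."""
    return actual / ideal


if __name__ == "__main__":
    actual, ideal = 4314253, 43107  # values reported by xfs_db above
    print(f"fragmentation factor: {fragmentation_factor(actual, ideal):.2f}%")
    print(f"actual/ideal ratio:   {extents_per_ideal(actual, ideal):.1f}")
```

The percentage saturates close to 100% once "actual" is a large multiple of "ideal", which is probably why it says so little. The ratio, roughly 100 extents for every ideal one here, is the more telling figure.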
I saw that many of my virtual disks that are written to a lot have, of course,
a lot of extents...
For example, on our main web server disk image, there were 247,597
extents alone! I took the web server down, and ran the XFS defrag
command on the disk...
# xfs_fsr -v 9a634692-1302-471f-a92e-c978b2b67fd0
9a634692-1302-471f-a92e-c978b2b67fd0
extents before:247597 after:429 DONE 9a634692-1302-471f-a92e-c978b2b67fd0
247,597 before and 429 after! WOW!
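If this is repeated for many disk images, the before/after numbers can be collected from the verbose output. A rough Python sketch, assuming every result line looks like the "extents before:... after:... DONE name" line above (the helper name is hypothetical, and other xfs_fsr output variants are not handled):

```python
import re

# Matches lines such as: extents before:247597 after:429 DONE <name>
FSR_LINE = re.compile(r"extents before:(\d+) after:(\d+)\s+(\S+)\s+(\S+)")

def summarize_fsr(output):
    """Return (name, before, after) tuples found in xfs_fsr -v style output."""
    results = []
    for line in output.splitlines():
        m = FSR_LINE.search(line)
        if m:
            before, after = int(m.group(1)), int(m.group(2))
            results.append((m.group(4), before, after))
    return results

sample = """9a634692-1302-471f-a92e-c978b2b67fd0
extents before:247597 after:429 DONE 9a634692-1302-471f-a92e-c978b2b67fd0
"""
for name, before, after in summarize_fsr(sample):
    print("%s: %d -> %d extents" % (name, before, after))
```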
Are virtual disks a problem with XFS? Why isn't this memory allocation
deadlock issue more prevalent? I do see this article mentioned on many
web posts. I don't specifically see any recommendation to *not* use
XFS for the data domain though.
I was running CentOS 7.3 on the file server, but before rebooting the server,
I upgraded to the latest kernel and CentOS 7.4 in the hope that, if there
was a kernel issue, this would solve it.
I took a few virtual systems down, and ran the defrag on the disks. However,
with over 30 virtual systems, I don't really want to do this individually.
I was wondering if I could run xfs_fsr on all the disks LIVE? It says in the
manual that you can run it live, but I can't see how this would be good when
a system is using that disk, and I don't want to deal with major
corruption across the board. Any thoughts?
Thanks,
Jason.
3 years, 5 months
iSCSI VLAN host connections - bond or multipath & IPv6
by Ben Bradley
Hi All
I'm looking to add a new host to my oVirt lab installation.
I'm going to share out some LVs from a separate box over iSCSI and will
hook the new host up to that.
I have 2 NICs on the storage host and 2 NICs on the new oVirt host to
dedicate to the iSCSI traffic.
I also have 2 separate switches so I'm looking for redundancy here. Both
iSCSI host and oVirt host plugged into both switches.
If this was non-iSCSI traffic and without oVirt I would create bonded
interfaces in active-backup mode and layer the VLANs on top of that.
But for iSCSI traffic without oVirt involved I wouldn't bother with a
bond and just use multipath.
From scanning the oVirt docs it looks like there is an option to have
oVirt configure iSCSI multipathing.
So what's the best/most-supported option for oVirt?
Manually create active-backup bonds so oVirt just sees a single storage
link between host and storage?
Or leave them as separate interfaces on each side and use oVirt's
multipath/bonding?
Also I quite like the idea of using IPv6 for the iSCSI VLAN, purely down
to the fact I could use link-local addressing and not have to worry
about setting up static IPv4 addresses or DHCP. Is IPv6 iSCSI supported
by oVirt?
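One practical point about link-local addressing: fe80::/10 addresses are only meaningful per interface, so a target address of this kind usually has to carry a zone (interface) suffix such as %eth1. A small Python sketch of that check, using only the standard ipaddress module (the helper names and interface name are hypothetical, and this says nothing about what oVirt itself accepts):

```python
import ipaddress

def split_scoped(addr):
    """Split 'fe80::1%eth1' into (IPv6Address, zone); zone is None if absent."""
    host, _, zone = addr.partition("%")
    return ipaddress.IPv6Address(host), (zone or None)

def needs_zone(addr):
    """True if addr is link-local but carries no interface (zone) suffix."""
    ip, zone = split_scoped(addr)
    return ip.is_link_local and zone is None

for candidate in ("fe80::1", "fe80::1%eth1", "2001:db8::1"):
    print(candidate, "needs zone:", needs_zone(candidate))
```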
Thanks, Ben
3 years, 5 months
Re: [ovirt-users] 4.2 downgrade
by Yaniv Kaul
On Sep 30, 2017 8:09 AM, "Ryan Mahoney" <ryan(a)beaconhillentertainment.com>
wrote:
Accidentally upgraded a 4.0 environment to 4.2 (didn't realize the "master"
repo was the development repo). What are my chances of rolling back to 4.0
(or 4.1 for that matter), and what's the best way to do it, if possible?
There is no rollback for an oVirt installation.
That being said, I believe the Alpha quality is good. It is not feature
complete and we of course have more polishing to do, but it's very usable
and we will continue to ship updates to it. Let us know promptly what
issues you encounter.
Y.
_______________________________________________
Users mailing list
Users(a)ovirt.org
http://lists.ovirt.org/mailman/listinfo/users
3 years, 5 months
DR with oVirt: no data on OVF_STORE
by Luca 'remix_tj' Lorenzetto
Hello,
I'm experimenting with a DR architecture that involves block storage
replication from the storage side (EMC VNX 8000).
Our idea is to import the replicated data storage domain into another
datacenter, managed by another engine, with the option "Import Domain"
and then import all the VMs it contains.
The idea works, but we encountered an issue that we don't want to have
again: we imported an SD and *no* VMs were listed in the tab "VM
Import". Disks were available, but there was no VM information.
What we did:
- on storage side: split the replica between the main disk in Site A
and the secondary disk Site B
- on storage side: added the disk to the storage group of the
"recovery" cluster
- from engine UI: imported the storage domain, confirming that I want to
activate it even if it seems to be attached to another DC
- from engine UI: moved the storage domain out of maintenance and
clicked on the "VM Import" tab of the new SD.
What happened: *no* VMs were listed
To better identify what's happening, I found here some indications
on how LVM for block storage works, and I identified the command to
find and read the OVF_STORE.
Looking inside the OVF_STORE showed why no VMs were listed: it was
empty (no .ovf file listed with tar tvf).
So, without the possibility of importing VMs, I did a rollback, detaching
the storage domain and re-establishing the replication between the
main site and the DR site.
Then, after a day of replication (the secondary volume is aligned every
30 minutes), I tried again and was able to import the VMs as well
(OVF_STORE was populated).
So my question is: how do I force OVF_STORE to be aligned at
least as frequently as the replication? I want to have the VM disks
replicated to the remote site along with the VM OVF information.
Is it possible to have the OVF_STORE information aligned when a VM is
created/edited, or with a scheduled task? Is this very I/O expensive?
Thank you,
Luca
--
"It is absurd to employ men of excellent intelligence to perform
calculations that could be entrusted to anyone if machines
were used"
Gottfried Wilhelm von Leibnitz, Philosopher and Mathematician (1646-1716)
"The Internet is the largest library in the world.
But the problem is that the books are all scattered on the floor"
John Allen Paulos, Mathematician (1945-living)
Luca 'remix_tj' Lorenzetto, http://www.remixtj.net , <lorenzetto.luca(a)gmail.com>
3 years, 5 months
oVirt 4.2 hosted-engine command damaged
by Julián Tete
I updated my lab environment from oVirt 4.1.x to oVirt 4.2 Alpha.
Since then, the hosted-engine command has been broken.
An example:
hosted-engine --vm-status
Traceback (most recent call last):
  File "/usr/lib64/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib64/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_setup/vm_status.py", line 213, in <module>
    if not status_checker.print_status():
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_setup/vm_status.py", line 110, in print_status
    all_host_stats = self._get_all_host_stats()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_setup/vm_status.py", line 75, in _get_all_host_stats
    all_host_stats = ha_cli.get_all_host_stats()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 154, in get_all_host_stats
    return self.get_all_stats(self.StatModes.HOST)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 99, in get_all_stats
    stats = broker.get_stats_from_storage(service)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 147, in get_stats_from_storage
    for host_id, data in six.iteritems(result):
  File "/usr/lib/python2.7/site-packages/six.py", line 599, in iteritems
    return d.iteritems(**kw)
AttributeError: 'NoneType' object has no attribute 'iteritems'
hosted-engine --set-maintenance --mode=none
Traceback (most recent call last):
  File "/usr/lib64/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib64/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_setup/set_maintenance.py", line 88, in <module>
    if not maintenance.set_mode(sys.argv[1]):
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_setup/set_maintenance.py", line 76, in set_mode
    value=m_global,
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 240, in set_maintenance_mode
    str(value))
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 187, in set_global_md_flag
    all_stats = broker.get_stats_from_storage(service)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 147, in get_stats_from_storage
    for host_id, data in six.iteritems(result):
  File "/usr/lib/python2.7/site-packages/six.py", line 599, in iteritems
    return d.iteritems(**kw)
AttributeError: 'NoneType' object has no attribute 'iteritems'
hosted-engine --vm-start
VM exists and its status is Up
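Both tracebacks above end the same way: the value being iterated in get_stats_from_storage is None rather than a dictionary, so the failure is in what came back from the broker, not in the command-line parsing. A minimal Python 3 sketch of that failure mode and of a defensive check (the function name and error message are hypothetical and are not taken from the oVirt code, which is Python 2 and uses six.iteritems):

```python
def iter_host_stats(result):
    """Return (host_id, data) pairs, failing clearly if no stats came back."""
    if result is None:
        # Hypothetical guard: report the missing reply instead of letting
        # None.items() raise a less informative AttributeError.
        raise RuntimeError("no stats returned from storage")
    return sorted(result.items())

# Reproduce the kind of error seen in the tracebacks.
try:
    None.items()
except AttributeError as exc:
    print("unguarded:", exc)

# With the guard, the cause is stated explicitly.
try:
    iter_host_stats(None)
except RuntimeError as exc:
    print("guarded:", exc)

print(iter_host_stats({2: "host-b", 1: "host-a"}))
```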
Hardware
Manufacturer: HP
Family: ProLiant
Product Name: ProLiant BL460c Gen8
CPU Model Name: Intel (R) Xeon (R) CPU E5-2667 v2 @ 3.30GHz
CPU Type: Intel SandyBridge Family
CPU Sockets: 2
CPU Cores per Socket: 8
CPU Threads per Core: 2 (SMT Enabled)
Software:
OS Version: RHEL - 7 - 4.1708.el7.centos
OS Description: CentOS Linux 7 (Core)
Kernel Version: 4.12.0 - 1.el7.elrepo.x86_64
KVM Version: 2.9.0 - 16.el7_4.5.1
LIBVIRT Version: libvirt-3.2.0-14.el7_4.3
VDSM Version: vdsm-4.20.3-95.git0813890.el7.centos
SPICE Version: 0.12.8 - 2.el7.1
GlusterFS Version: glusterfs-3.12.1-2.el7
CEPH Version: librbd1-0.94.5-2.el7
3 years, 5 months