This is a multi-part message in MIME format.
--------------050103010101010406090008
Content-Type: text/plain; charset=windows-1252; format=flowed
Content-Transfer-Encoding: 7bit
Our current NFS settings:
listen_backlog=64
protocol=ALL
servers=1024
lockd_listen_backlog=64
lockd_servers=1024
lockd_retransmit_timeout=5
grace_period=90
server_versmin=2
server_versmax=4
client_versmin=2
client_versmax=4
server_delegation=on
nfsmapid_domain=
max_connections=-1
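
To see where these differ from the example settings quoted below, here is a small sketch that compares two saved "sharectl get nfs" dumps. The function name and the file names are made up for illustration, and it assumes one property=value pair per line:

```shell
# Print every property whose value differs between two key=value dumps.
# Usage: nfs_prop_diff ours.txt theirs.txt   (file names are hypothetical)
nfs_prop_diff() {
    awk -F '=' '
        NR == FNR { a[$1] = $2; next }      # first file: remember each value
        ($1 in a) && a[$1] != $2 {          # second file: report mismatches
            printf "%s: %s -> %s\n", $1, a[$1], $2
        }
    ' "$1" "$2"
}
```

Properties that exist in only one of the two files (such as device= or the mountd_* entries in the quoted example) are not reported. For the two lists in this mail, the properties that differ are listen_backlog, lockd_listen_backlog and server_versmax.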
On 04/22/2015 11:32 AM, InterNetX - Juergen Gotteswinter wrote:
On 22.04.2015 at 11:12, Maikel vd Mosselaar wrote:
> Our pool is configured as Z1 with ZIL (normal SSD). The sync parameter
> is at its default setting (standard), so "sync" is on.
For testing, I would give "zfs set sync=disabled pool/vol" a shot. But as
I already said, that's nothing you should keep for production.
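
A minimal sketch of such a test window, assuming a POSIX shell on the filer. The helper name with_sync_disabled is made up, and pool/vol is a placeholder for the exported dataset. It records the current sync value, disables sync while a command runs, and then puts the old value back:

```shell
# Run a command with sync disabled on a dataset, then restore the old value.
# Usage: with_sync_disabled pool/vol <command> [args...]
with_sync_disabled() {
    dataset=$1
    shift
    # Remember the current setting (e.g. "standard") so it can be restored.
    prev=$(zfs get -H -o value sync "$dataset") || return 1
    zfs set sync=disabled "$dataset" || return 1
    "$@"
    rc=$?
    # Always try to restore the previous setting once the command returns.
    zfs set sync="$prev" "$dataset"
    return "$rc"
}
```

For example, "with_sync_disabled pool/vol sleep 3600" would leave sync off for an hour while you watch the oVirt events. If the shell gets interrupted, the restore may not run, so check afterwards with "zfs get sync pool/vol".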
What I have also seen in the past: the filer saturated the maximum number of
lockd/nfs processes (the defaults are quite low; don't hesitate to
push the NFS threads up to 512+, and the same goes for lockd).
To get your current values:
sharectl get nfs
For example, one of my filers, which is hammered pretty heavily most of the
time through NFS, uses these settings:
servers=1024
lockd_listen_backlog=32
lockd_servers=1024
lockd_retransmit_timeout=5
grace_period=90
server_versmin=2
server_versmax=3
client_versmin=2
client_versmax=4
server_delegation=on
nfsmapid_domain=
max_connections=-1
protocol=ALL
listen_backlog=32
device=
mountd_listen_backlog=64
mountd_max_threads=16
To change them, use sharectl, or put lines like these into /etc/system:
set rpcmod:clnt_max_conns = 8
set rpcmod:maxdupreqs=8192
set rpcmod:cotsmaxdupreqs=8192
set nfs:nfs3_max_threads=1024
set nfs:nfs3_nra=128
set nfs:nfs3_bsize=1048576
set nfs:nfs3_max_transfer_size=1048576
-> reboot
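
For the sharectl route, a small dry-run sketch. The helper name nfs_set_cmds is made up; it only prints the "sharectl set -p property=value nfs" commands so they can be reviewed before anything is run on the filer:

```shell
# Dry run: print one sharectl command per property=value argument.
# Nothing is changed; paste the output into a root shell to apply it.
nfs_set_cmds() {
    for kv in "$@"; do
        printf 'sharectl set -p %s nfs\n' "$kv"
    done
}

nfs_set_cmds servers=1024 lockd_servers=1024
```

Afterwards, "sharectl get nfs" should show the new values.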
> When the issue happens, the oVirt event viewer does indeed show latency warnings.
> Not always, but most of the time, this is followed by an I/O storage
> error linked to random VMs, and they are paused when that happens.
>
> All the nodes use mode 4 bonding. The interfaces on the nodes don't show
> any drops or errors. I checked 2 of the VMs that got paused the last
> time it happened, and they have dropped packets on their interfaces.
>
> We don't have a subscription with Nexenta (anymore).
>
> On 04/21/2015 04:41 PM, InterNetX - Juergen Gotteswinter wrote:
>> On 21.04.2015 at 16:19, Maikel vd Mosselaar wrote:
>>> Hi Juergen,
>>>
>>> The load on the nodes rises to well over 200 during the event. The load on the
>>> Nexenta stays normal, and there is nothing strange in the logs.
>> ZFS + NFS could still be the root of this. Is your pool configuration
>> RaidzX or mirror, with or without a ZIL? Is the sync parameter of the ZFS
>> subvolume that gets exported left at its default, "standard"?
>>
>>
>> http://christopher-technicalmusings.blogspot.de/2010/09/zfs-and-nfs-performance-with-zil.html
>>
>>
>> Since oVirt is very sensitive to storage latency (it throws VMs into an
>> unresponsive or unknown state), it might be worth trying "zfs set
>> sync=disabled pool/volume" to see if this changes things. But be aware
>> that this makes the NFS export vulnerable to data loss in case of
>> power loss etc., comparable to async NFS on Linux.
>>
>> If disabling the sync setting helps, and you don't use a separate ZIL
>> flash drive yet -> adding one would very likely help to get rid of this.
>>
>> Also, if you run Nexenta with a subscription, it might be helpful to
>> involve them.
>>
>> Do you see any messages about high latency in the oVirt Events panel?
>>
>>> For the storage interfaces on our nodes we use bonding in mode 4
>>> (802.3ad), 2x 1Gb. The Nexenta also has a 4x 1Gb bond in mode 4.
>> This should be fine, as long as no node uses mode 0 / round robin, which
>> would lead to out-of-order TCP packets. The interfaces themselves don't
>> show any drops or errors, on the VM hosts as well as on the switch
>> itself?
>>
>> Jumbo Frames?
>>
>>> Kind regards,
>>>
>>> Maikel
>>>
>>>
>>> On 04/21/2015 02:51 PM, InterNetX - Juergen Gotteswinter wrote:
>>>> Hi,
>>>>
>>>> How about load, latency, or strange dmesg messages on the Nexenta? Are you
>>>> using bonded Gbit networking? If yes, which mode?
>>>>
>>>> Cheers,
>>>>
>>>> Juergen
>>>>
>>>> On 20.04.2015 at 14:25, Maikel vd Mosselaar wrote:
>>>>> Hi,
>>>>>
>>>>> We are running oVirt 3.5.1 with 3 nodes and a separate engine.
>>>>>
>>>>> All on CentOS 6.6:
>>>>> 3 x nodes
>>>>> 1 x engine
>>>>>
>>>>> 1 x storage nexenta with NFS
>>>>>
>>>>> For multiple weeks we have been experiencing issues where our nodes
>>>>> cannot access the storage at random moments (at least that's what the
>>>>> nodes think).
>>>>>
>>>>> When the nodes complain about unavailable storage, the load
>>>>> rises to 200+ on all three nodes, which makes all running VMs
>>>>> inaccessible. During this process the oVirt event viewer shows some I/O
>>>>> storage error messages. When this happens, random VMs get paused and are
>>>>> not resumed anymore (this happens almost every time, but not all the
>>>>> VMs get paused).
>>>>>
>>>>> During the event we tested the accessibility of the storage from the
>>>>> nodes, and it looks like it is working normally; at least we can do a
>>>>> normal "ls" on the storage without any delay in showing the contents.
>>>>>
>>>>> We tried multiple things that we thought might be causing this issue, but
>>>>> nothing has worked so far.
>>>>> * rebooting storage / nodes / engine.
>>>>> * disabling offsite rsync backups.
>>>>> * moving the biggest VMs with the highest load to a different platform
>>>>>   outside of oVirt.
>>>>> * checking the wsize and rsize on the NFS mounts; storage and nodes are
>>>>>   correct according to the "NFS troubleshooting page" on ovirt.org.
>>>>>
>>>>> The environment is running in production so we are not free to test
>>>>> everything.
>>>>>
>>>>> I can provide log files if needed.
>>>>>
>>>>> Kind Regards,
>>>>>
>>>>> Maikel
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Users mailing list
>>>>> Users(a)ovirt.org
>>>>>
http://lists.ovirt.org/mailman/listinfo/users
--------------050103010101010406090008--