storage issues with oVirt 3.5.1 + Nexenta NFS

Hi,

We are running oVirt 3.5.1 with 3 nodes and a separate engine, all on CentOS 6.6:
3 x nodes
1 x engine
1 x storage (Nexenta with NFS)

For multiple weeks we have been experiencing issues where our nodes cannot access the storage at random moments (at least that's what the nodes think).

When the nodes complain about unavailable storage, the load rises to over 200 on all three nodes, which makes all running VMs inaccessible. During this the oVirt event viewer shows some storage I/O error messages; random VMs get paused and are not resumed anymore (this happens almost every time, but not all VMs get paused).

During such an event we tested the accessibility of the storage from the nodes and it looks like it is working normally; at least we can do a normal "ls" on the storage without any delay in showing the contents.

We tried multiple things that we thought might cause this issue, but nothing has worked so far:
* rebooting storage / nodes / engine.
* disabling offsite rsync backups.
* moving the biggest VMs with the highest load to a different platform outside of oVirt.
* checking the wsize and rsize on the NFS mounts; storage and nodes are correct according to the "NFS troubleshooting page" on ovirt.org.

The environment is running in production, so we are not free to test everything.

I can provide log files if needed.

Kind Regards,

Maikel
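For reference, the negotiated NFS mount options (including rsize and wsize) can be double-checked on each node with the standard Linux NFS client tools; a minimal sketch, where "nexenta" is only a placeholder for the storage server name in the mount path:

# list mounted NFS filesystems together with their effective mount options (rsize, wsize, proto, timeo)
nfsstat -m

# or inspect the storage domain mount directly
grep nexenta /proc/mounts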

Hi,

Can you please attach the VDSM logs?

Thanks,

Fred

----- Original Message -----
From: "Maikel vd Mosselaar" <m.vandemosselaar@smoose.nl> To: users@ovirt.org Sent: Monday, April 20, 2015 3:25:38 PM Subject: [ovirt-users] storage issue's with oVirt 3.5.1 + Nexenta NFS

Hi Fred,

This is one of the nodes from yesterday around 01:00 (20-04-15). The issue started around 01:00.
https://bpaste.net/raw/67542540a106

The VDSM logs are very big, so I am unable to paste a bigger part of the logfile. I don't know what the maximum allowed attachment size of the mailing list is?

dmesg on one of the nodes (despite this message the storage is still accessible):
https://bpaste.net/raw/67da167aa300

Kind regards,

Maikel

On 04/21/2015 02:32 PM, Fred Rolland wrote:
Hi,
Can you please attach the VDSM logs?
Thanks,
Fred

On 21.04.2015 at 16:09, Maikel vd Mosselaar wrote:
Hi Fred,
This is one of the nodes from yesterday around 01:00 (20-04-15). The issue started around 01:00. https://bpaste.net/raw/67542540a106
The VDSM logs are very big, so I am unable to paste a bigger part of the logfile. I don't know what the maximum allowed attachment size of the mailing list is?
dmesg on one of the nodes (despite this message the storage is still accessible): https://bpaste.net/raw/67da167aa300
A flaky network? Or NFS / lockd processes saturated on the Nexenta?

Hi,

How about load, latency, or strange dmesg messages on the Nexenta? Are you using bonded Gbit networking? If yes, which mode?

Cheers,

Juergen

Hi Juergen,

The load on the nodes rises far over 200 during the event. The load on the Nexenta stays normal and there is nothing strange in its logging.

For the storage interfaces on our nodes we use bonding in mode 4 (802.3ad), 2x 1 Gb. The Nexenta has a 4x 1 Gb bond in mode 4 as well.

Kind regards,

Maikel

On 04/21/2015 02:51 PM, InterNetX - Juergen Gotteswinter wrote:
Hi,
How about load, latency, or strange dmesg messages on the Nexenta? Are you using bonded Gbit networking? If yes, which mode?
Cheers,
Juergen

On 21.04.2015 at 16:19, Maikel vd Mosselaar wrote:
Hi Juergen,
The load on the nodes rises far over 200 during the event. The load on the Nexenta stays normal and there is nothing strange in its logging.
ZFS + NFS could still be the root of this. Is your pool configuration RAID-Zx or mirrored, with or without a ZIL? Is the sync parameter of the exported ZFS dataset kept at its default of "standard"?

http://christopher-technicalmusings.blogspot.de/2010/09/zfs-and-nfs-performa...

Since oVirt is very sensitive to storage latency (it throws VMs into an unresponsive or unknown state), it might be worth trying "zfs set sync=disabled pool/volume" to see if this changes things. But be aware that this makes the NFS export vulnerable to data loss in case of power loss etc., comparable to async NFS on Linux.

If disabling the sync setting helps and you don't use a separate ZIL flash drive yet, adding one would very likely help to get rid of this.

Also, if you run a subscribed version of Nexenta it might be helpful to involve them.

Do you see any messages about high latency in the oVirt Events panel?
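A minimal sketch of that test, assuming the exported dataset is called pool/volume (a placeholder); sync=disabled should only stay in place for the duration of the test:

# show the current sync setting of the exported dataset
zfs get sync pool/volume

# disable synchronous writes for the test only
zfs set sync=disabled pool/volume

# revert to the default behaviour afterwards
zfs set sync=standard pool/volume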
For the storage interfaces on our nodes we use bonding in mode 4 (802.3ad), 2x 1 Gb. The Nexenta has a 4x 1 Gb bond in mode 4 as well.
This should be fine, as long as no node uses mode 0 / round-robin, which would lead to out-of-order TCP packets. The interfaces themselves don't show any drops or errors, on the VM hosts as well as on the switch itself? Jumbo frames?
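A quick sketch of how those points can be checked on the CentOS nodes (bond0 and eth0 are placeholders for the actual interface names):

# bond mode, LACP partner state and per-slave link status
cat /proc/net/bonding/bond0

# per-interface counters including drops and errors, plus the MTU in use (9000 would indicate jumbo frames)
ip -s link show eth0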

Our pool is configured as RAID-Z1 with a ZIL (a normal SSD); the sync parameter is on the default setting (standard), so sync is on.

When the issue happens the oVirt event viewer does indeed show latency warnings. Not always, but most of the time this is followed by a storage I/O error linked to random VMs, and they are paused when that happens.

All the nodes use mode 4 bonding. The interfaces on the nodes don't show any drops or errors. I checked 2 of the VMs that got paused the last time it happened; they have dropped packets on their interfaces.

We don't have a subscription with Nexenta (anymore).

On Wed, 2015-04-22 at 11:12 +0200, Maikel vd Mosselaar wrote:
Our pool is configured as RAID-Z1 with a ZIL (a normal SSD); the sync parameter is on the default setting (standard), so sync is on.
# zpool status ?

/K

  pool: z2pool
 state: ONLINE
  scan: scrub canceled on Sun Apr 12 16:33:38 2015
config:

        NAME                       STATE     READ WRITE CKSUM
        z2pool                     ONLINE       0     0     0
          raidz1-0                 ONLINE       0     0     0
            c0t5000C5004172A87Bd0  ONLINE       0     0     0
            c0t5000C50041A59027d0  ONLINE       0     0     0
            c0t5000C50041A592AFd0  ONLINE       0     0     0
            c0t5000C50041A660D7d0  ONLINE       0     0     0
            c0t5000C50041A69223d0  ONLINE       0     0     0
            c0t5000C50041A6ADF3d0  ONLINE       0     0     0
        logs
          c0t5001517BB2845595d0    ONLINE       0     0     0
        cache
          c0t5001517BB2847892d0    ONLINE       0     0     0
        spares
          c0t5000C50041A6B737d0    AVAIL
          c0t5000C50041AC3F07d0    AVAIL
          c0t5000C50041AD48DBd0    AVAIL
          c0t5000C50041ADD727d0    AVAIL

errors: No known data errors

I expect you are aware of the fact that you only get the write performance of a single disk in that configuration? I would drop that pool configuration, drop the spare drives and go for a mirror pool.
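For illustration only, a striped-mirror layout built from the four current spares might look roughly like this (newpool is a hypothetical name, and this is a sketch of the target layout rather than a migration plan); each additional mirror vdev adds write IOPS:

# two mirror vdevs striped together, built from the current spare disks (sketch only)
zpool create newpool \
  mirror c0t5000C50041A6B737d0 c0t5000C50041AC3F07d0 \
  mirror c0t5000C50041AD48DBd0 c0t5000C50041ADD727d0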

Yes, we are aware of that. The problem is that it's running production, so it is not very easy to change the pool.

You have 4 spare disks and could take one disk out of your raidz to create a temporary pool alongside the existing one. Then use zfs send/receive to migrate the data; this shouldn't take much time if you are not using huge drives?
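A rough sketch of such a zfs send/receive migration, assuming the VM images live on a dataset named z2pool/vmstore and the temporary pool is called newpool (both names are hypothetical); the second, incremental send would be done with the storage domain in maintenance:

# initial full copy while the VMs keep running
zfs snapshot -r z2pool/vmstore@migrate1
zfs send -R z2pool/vmstore@migrate1 | zfs receive -F newpool/vmstore

# later, during a short downtime: send only the changes since the first snapshot
zfs snapshot -r z2pool/vmstore@migrate2
zfs send -R -i @migrate1 z2pool/vmstore@migrate2 | zfs receive -F newpool/vmstore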

On Wed, 2015-04-22 at 11:54 +0200, Maikel vd Mosselaar wrote:
Yes, we are aware of that. The problem is that it's running production, so it is not very easy to change the pool.
On 04/22/2015 11:48 AM, InterNetX - Juergen Gotteswinter wrote:
I expect you are aware of the fact that you only get the write performance of a single disk in that configuration? I would drop that pool configuration, drop the spare drives and go for a mirror pool.
^ What he said :)

That, or if you have room to add another 2 disks, use them plus the spare drives to add a second raidz(1|2|3) vdev. What drives do you use for data, log and cache?

/K
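A hedged sketch of that second option, reusing the current spare disk IDs plus two new disks (the <new-disk-*> names are placeholders); the spares would have to be released from the pool first:

# release the disks from their hot-spare role
zpool remove z2pool c0t5000C50041A6B737d0 c0t5000C50041AC3F07d0 c0t5000C50041AD48DBd0 c0t5000C50041ADD727d0

# stripe a second raidz1 vdev next to the existing one
zpool add z2pool raidz1 \
  c0t5000C50041A6B737d0 c0t5000C50041AC3F07d0 \
  c0t5000C50041AD48DBd0 c0t5000C50041ADD727d0 \
  <new-disk-1> <new-disk-2>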

On 22.04.2015 at 11:12, Maikel vd Mosselaar wrote:
Our pool is configured as RAID-Z1 with a ZIL (a normal SSD); the sync parameter is on the default setting (standard), so sync is on.
For testing, I would give "zfs set sync=disabled pool/vol" a shot. But as I already said, that's nothing you should keep for production.

Something I have seen in the past, too: the filer saturated the maximum number of lockd/nfsd processes (which is quite low in the default setting; don't worry about pushing the NFS threads up to 512+, the same goes for lockd).

To get your current values:

sharectl get nfs

For example, one of my filers, which is hammered pretty heavily over NFS most of the time, uses these settings:

servers=1024
lockd_listen_backlog=32
lockd_servers=1024
lockd_retransmit_timeout=5
grace_period=90
server_versmin=2
server_versmax=3
client_versmin=2
client_versmax=4
server_delegation=on
nfsmapid_domain=
max_connections=-1
protocol=ALL
listen_backlog=32
device=
mountd_listen_backlog=64
mountd_max_threads=16

To change them, use sharectl or throw the following into /etc/system:

set rpcmod:clnt_max_conns = 8
set rpcmod:maxdupreqs=8192
set rpcmod:cotsmaxdupreqs=8192
set nfs:nfs3_max_threads=1024
set nfs:nfs3_nra=128
set nfs:nfs3_bsize=1048576
set nfs:nfs3_max_transfer_size=1048576

-> reboot
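As a sketch of applying the thread-count part of this with sharectl (property names taken from the settings above; exact sharectl behaviour on the Nexenta appliance should be verified, so treat this as an assumption):

# raise the NFS server and lockd thread limits, then re-read the configuration
sharectl set -p servers=1024 nfs
sharectl set -p lockd_servers=1024 nfs
sharectl get nfs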

Our current nfs settings:

listen_backlog=64
protocol=ALL
servers=1024
lockd_listen_backlog=64
lockd_servers=1024
lockd_retransmit_timeout=5
grace_period=90
server_versmin=2
server_versmax=4
client_versmin=2
client_versmax=4
server_delegation=on
nfsmapid_domain=
max_connections=-1

On 04/22/2015 11:32 AM, InterNetX - Juergen Gotteswinter wrote:
On 22.04.2015 at 11:12, Maikel vd Mosselaar wrote:
Our pool is configured as Z1 with ZIL (a normal SSD); the sync parameter is on the default setting (standard), so "sync" is on.

For testing, I would give "zfs set sync=disabled pool/vol" a shot. But as I already said, that's nothing you should keep in production.
What I had in the past, too: the filer saturated the max lockd/nfs processes (which are quite low in their default settings; don't worry about pushing the nfs threads up to 512+, same goes for lockd).

To get your current values:
sharectl get nfs
For example, one of my filers, which is pretty heavily hammered through nfs most of the time, uses these settings:

servers=1024
lockd_listen_backlog=32
lockd_servers=1024
lockd_retransmit_timeout=5
grace_period=90
server_versmin=2
server_versmax=3
client_versmin=2
client_versmax=4
server_delegation=on
nfsmapid_domain=
max_connections=-1
protocol=ALL
listen_backlog=32
device=
mountd_listen_backlog=64
mountd_max_threads=16
To change them, use sharectl or put the settings into /etc/system:
set rpcmod:clnt_max_conns = 8
set rpcmod:maxdupreqs=8192
set rpcmod:cotsmaxdupreqs=8192

set nfs:nfs3_max_threads=1024
set nfs:nfs3_nra=128
set nfs:nfs3_bsize=1048576
set nfs:nfs3_max_transfer_size=1048576
-> reboot
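(If using sharectl instead of /etc/system, a sketch with the values suggested above; property names can be confirmed with "sharectl get nfs":)

sharectl set -p servers=1024 nfs
sharectl set -p lockd_servers=1024 nfs
sharectl set -p lockd_listen_backlog=32 nfs
sharectl get nfs                     # verify the new values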
When the issue happens, the oVirt event viewer does indeed show latency warnings. Not always, but most of the time this is followed by an i/o storage error linked to random VMs, and they get paused when that happens.
participants (5):
- Fred Rolland
- InterNetX - Juergen Gotteswinter
- InterNetX - Juergen Gotteswinter
- Karli Sjöberg
- Maikel vd Mosselaar