[ovirt-users] storage issue's with oVirt 3.5.1 + Nexenta NFS

InterNetX - Juergen Gotteswinter juergen.gotteswinter at internetx.com
Wed Apr 22 09:59:14 UTC 2015


You have got 4 spare disks and could take one disk out of your raidz to build a
temporary pool that exists in parallel. Then zfs send/receive to migrate the
data; this shouldn't take much time if you are not using huge drives?
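
A rough sketch of what I mean (untested; the spare-disk device names and the
"data" dataset are placeholders for whatever you actually export):

# zpool create temppool mirror <spare1> <spare2> mirror <spare3> <spare4>
# zfs snapshot -r z2pool/data@migrate
# zfs send -R z2pool/data@migrate | zfs receive -F temppool/data

followed by a final incremental send during a short downtime window.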

Am 22.04.2015 um 11:54 schrieb Maikel vd Mosselaar:
> Yes, we are aware of that; the problem is that it's running production, so it
> is not very easy to change the pool.
> 
> On 04/22/2015 11:48 AM, InterNetX - Juergen Gotteswinter wrote:
>> I expect you are aware of the fact that you only get the write performance
>> of a single disk in that configuration? I would drop that pool configuration,
>> drop the spare drives, and go for a mirror pool.
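>>
>> Something like this, for example (just a sketch; device names are placeholders):
>>
>> # zpool create newpool mirror <disk1> <disk2> mirror <disk3> <disk4> \
>>     mirror <disk5> <disk6>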
>>
>> Am 22.04.2015 um 11:39 schrieb Maikel vd Mosselaar:
>>>    pool: z2pool
>>>   state: ONLINE
>>>   scan: scrub canceled on Sun Apr 12 16:33:38 2015
>>> config:
>>>
>>>          NAME                       STATE     READ WRITE CKSUM
>>>          z2pool                     ONLINE       0     0     0
>>>            raidz1-0                 ONLINE       0     0     0
>>>              c0t5000C5004172A87Bd0  ONLINE       0     0     0
>>>              c0t5000C50041A59027d0  ONLINE       0     0     0
>>>              c0t5000C50041A592AFd0  ONLINE       0     0     0
>>>              c0t5000C50041A660D7d0  ONLINE       0     0     0
>>>              c0t5000C50041A69223d0  ONLINE       0     0     0
>>>              c0t5000C50041A6ADF3d0  ONLINE       0     0     0
>>>          logs
>>>            c0t5001517BB2845595d0    ONLINE       0     0     0
>>>          cache
>>>            c0t5001517BB2847892d0    ONLINE       0     0     0
>>>          spares
>>>            c0t5000C50041A6B737d0    AVAIL
>>>            c0t5000C50041AC3F07d0    AVAIL
>>>            c0t5000C50041AD48DBd0    AVAIL
>>>            c0t5000C50041ADD727d0    AVAIL
>>>
>>> errors: No known data errors
>>>
>>>
>>> On 04/22/2015 11:17 AM, Karli Sjöberg wrote:
>>>> On Wed, 2015-04-22 at 11:12 +0200, Maikel vd Mosselaar wrote:
>>>>> Our pool is configured as raidz1 with a ZIL (a regular SSD); the sync
>>>>> parameter is on the default setting (standard), so sync is on.
>>>> # zpool status ?
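>>>>
>>>> And, to double-check the sync setting on the exported dataset (names are
>>>> placeholders):
>>>>
>>>> # zfs get sync <pool>/<dataset>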
>>>>
>>>> /K
>>>>
>>>>> When the issue happens, the oVirt event viewer does indeed show latency
>>>>> warnings. Not always, but most of the time this is followed by an I/O
>>>>> storage error linked to random VMs, which are then paused.
>>>>>
>>>>> All the nodes use mode 4 bonding. The interfaces on the nodes don't show
>>>>> any drops or errors. I checked 2 of the VMs that got paused the last time
>>>>> it happened; they do have dropped packets on their interfaces.
>>>>>
>>>>> We don't have a subscription with Nexenta (anymore).
>>>>>
>>>>> On 04/21/2015 04:41 PM, InterNetX - Juergen Gotteswinter wrote:
>>>>>> Am 21.04.2015 um 16:19 schrieb Maikel vd Mosselaar:
>>>>>>> Hi Juergen,
>>>>>>>
>>>>>>> The load on the nodes rises to well over 200 during the event. Load on
>>>>>>> the Nexenta stays normal, and there is nothing strange in its logs.
>>>>>> ZFS + NFS could still be the root of this. Is your pool configuration
>>>>>> raidzX or mirror, with or without a ZIL? Is the sync parameter of the
>>>>>> exported ZFS dataset left at its default of "standard"?
>>>>>>
>>>>>> http://christopher-technicalmusings.blogspot.de/2010/09/zfs-and-nfs-performance-with-zil.html
>>>>>>
>>>>>>
>>>>>>
>>>>>> Since oVirt is very sensitive to storage latency (it throws VMs into an
>>>>>> unresponsive or unknown state), it might be worth trying "zfs set
>>>>>> sync=disabled pool/volume" to see if this changes things. But be aware
>>>>>> that this makes the NFS export vulnerable to data loss in case of a
>>>>>> power failure etc., comparable to an async NFS export on Linux.
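>>>>>>
>>>>>> For example (pool/volume is a placeholder; remember to switch it back
>>>>>> after the test):
>>>>>>
>>>>>> # zfs get sync pool/volume
>>>>>> # zfs set sync=disabled pool/volume
>>>>>> ... observe whether the latency warnings go away ...
>>>>>> # zfs set sync=standard pool/volume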
>>>>>>
>>>>>> If disabling the sync setting helps and you don't use a separate ZIL
>>>>>> flash drive yet, adding one would very likely help get rid of this.
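>>>>>>
>>>>>> Adding one later is straightforward, roughly (the SSD device name is a
>>>>>> placeholder):
>>>>>>
>>>>>> # zpool add pool log <ssd-device>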
>>>>>>
>>>>>> Also, if you run a subscribed version of Nexenta, it might be helpful
>>>>>> to involve their support.
>>>>>>
>>>>>> Do you see any messages about high latency in the oVirt events panel?
>>>>>>
>>>>>>> For the storage interfaces on our nodes we use bonding in mode 4
>>>>>>> (802.3ad), 2x 1 Gbit. The Nexenta has a 4x 1 Gbit bond in mode 4 as well.
>>>>>> This should be fine, as long as no node uses mode 0 / round-robin, which
>>>>>> would lead to out-of-order TCP packets. The interfaces themselves don't
>>>>>> show any drops or errors, on the VM hosts as well as on the switch
>>>>>> itself?
>>>>>>
>>>>>> Jumbo Frames?
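>>>>>>
>>>>>> On the CentOS hosts something like this should show the bond mode plus
>>>>>> the per-slave error counters, drops and MTU (bond0 is a placeholder name):
>>>>>>
>>>>>> # cat /proc/net/bonding/bond0
>>>>>> # ip -s link show bond0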
>>>>>>
>>>>>>> Kind regards,
>>>>>>>
>>>>>>> Maikel
>>>>>>>
>>>>>>>
>>>>>>> On 04/21/2015 02:51 PM, InterNetX - Juergen Gotteswinter wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> how about load, latency, or strange dmesg messages on the Nexenta? Are
>>>>>>>> you using bonded Gbit networking? If yes, which mode?
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>>
>>>>>>>> Juergen
>>>>>>>>
>>>>>>>> Am 20.04.2015 um 14:25 schrieb Maikel vd Mosselaar:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> We are running oVirt 3.5.1 with 3 nodes and a separate engine.
>>>>>>>>>
>>>>>>>>> All on CentOS 6.6:
>>>>>>>>> 3 x nodes
>>>>>>>>> 1 x engine
>>>>>>>>>
>>>>>>>>> 1 x storage nexenta with NFS
>>>>>>>>>
>>>>>>>>> For multiple weeks we have been experiencing issues where our nodes
>>>>>>>>> cannot access the storage at random moments (at least, that is what
>>>>>>>>> the nodes think).
>>>>>>>>>
>>>>>>>>> When the nodes complain about unavailable storage, the load rises to
>>>>>>>>> over 200 on all three nodes, which makes all running VMs inaccessible.
>>>>>>>>> During this the oVirt event viewer shows some I/O storage error
>>>>>>>>> messages; when that happens, random VMs get paused and are not resumed
>>>>>>>>> anymore (this happens almost every time, but not all the VMs get
>>>>>>>>> paused).
>>>>>>>>>
>>>>>>>>> During the event we tested the accessibility of the storage from the
>>>>>>>>> nodes, and it looks like it is working normally; at least we can do a
>>>>>>>>> normal "ls" on the storage mount without any delay in showing the
>>>>>>>>> contents.
>>>>>>>>>
>>>>>>>>> We tried multiple things that we thought might be causing this issue,
>>>>>>>>> but nothing has worked so far:
>>>>>>>>> * rebooting storage / nodes / engine.
>>>>>>>>> * disabling offsite rsync backups.
>>>>>>>>> * moving the biggest VMs with the highest load to a different platform
>>>>>>>>> outside of oVirt.
>>>>>>>>> * checking the wsize and rsize on the NFS mounts; storage and nodes are
>>>>>>>>> correct according to the "NFS troubleshooting page" on ovirt.org (see
>>>>>>>>> the command sketched below).
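>>>>>>>>>
>>>>>>>>> For reference, the effective rsize/wsize on a node can be listed with,
>>>>>>>>> for example:
>>>>>>>>>
>>>>>>>>> # nfsstat -m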
>>>>>>>>>
>>>>>>>>> The environment is running in production, so we are not free to test
>>>>>>>>> everything.
>>>>>>>>>
>>>>>>>>> I can provide log files if needed.
>>>>>>>>>
>>>>>>>>> Kind Regards,
>>>>>>>>>
>>>>>>>>> Maikel
>>>>>>>>>
>>>>>>>>>
> 


