[ovirt-users] Ovirt vm's paused due to storage error

Darrell Budic budic at onholyground.com
Fri Mar 30 17:30:23 UTC 2018


Found (and caused) my problem. 

I’d been evaluating different values for these options (defaults shown):
cluster.shd-max-threads                 1                                       
cluster.shd-wait-qlength                1024                                    

and had forgotten to reset them after testing. I had them at max-threads 8 and qlength 10000.

It worked, in that the cluster healed in approximately half the time, and it was a total failure, in that my cluster experienced IO pauses and at least one abnormal VM shutdown.

I have 6-core processors in these boxes, and it looks like I simply overloaded them to the point that normal IO wasn’t getting serviced because the self-heal was getting too much priority. I’ve reverted these settings to the defaults, and things are now behaving normally, with no pauses during healing at all.
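
For anyone who wants to check or revert these, something along these lines should do it (the volume name “data” here is just an example, substitute your own):

    gluster volume get data cluster.shd-max-threads
    gluster volume get data cluster.shd-wait-qlength
    # what I had been testing (too aggressive for 6-core hosts):
    #   gluster volume set data cluster.shd-max-threads 8
    #   gluster volume set data cluster.shd-wait-qlength 10000
    # back to the defaults (1 and 1024):
    gluster volume reset data cluster.shd-max-threads
    gluster volume reset data cluster.shd-wait-qlength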

The moral of the story: don’t forget to undo test settings when you’re done, and really don’t test extreme settings in production!

Back to upgrading my test cluster so I can properly abuse things like this.

  -Darrell
> From: Darrell Budic <budic at onholyground.com>
> Subject: Re: [ovirt-users] Ovirt vm's paused due to storage error
> Date: March 22, 2018 at 1:23:29 PM CDT
> To: users
> 
> I’ve also encountered something similar on my setup, ovirt 3.1.9 with a gluster 3.12.3 storage cluster. All the storage domains in question are set up as sharded gluster volumes, and I’ve enabled libgfapi support in the engine. It’s happened primarily to VMs that haven’t yet been restarted to switch over to gfapi (these still use fuse mounts), but also to one or two VMs that have already been switched to gfapi mounts.
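> 
> (As an aside, for anyone wondering how that’s done: libgfapi support is normally switched on in the engine with something like the following plus a restart of ovirt-engine; the --cver value below is only an example and should match your cluster compatibility level.)
> 
>     engine-config -s LibgfApiSupported=true --cver=4.1
>     systemctl restart ovirt-engine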
> 
> I started updating the storage cluster to gluster 3.12.6 yesterday and ran into more annoying/bad behavior as well. Many of the “high disk use” VMs experienced hangs, but not as storage-related pauses. Instead, they hung and their watchdogs eventually reported CPU hangs. All of them did eventually resume normal operation, but it was annoying, to be sure. The Ovirt Engine also lost contact with all of my VMs (unknown status, ? in the GUI), even though it still had contact with the hosts. My gluster cluster reported no errors, volume status was normal, and all peers and bricks were connected. I didn’t see anything in the gluster logs that indicated problems, though there were reports of failed heals that eventually went away.
> 
> It seems like something in vdsm and/or libgfapi isn’t handling the gfapi mounts well during healing and the related locks, but I can’t tell what it is. I’ve still got two more servers in the cluster to upgrade to 3.12.6; I’ll keep an eye on the logs while I’m doing it and report back once I have more info.
> 
>   -Darrell
>> From: Sahina Bose <sabose at redhat.com>
>> Subject: Re: [ovirt-users] Ovirt vm's paused due to storage error
>> Date: March 22, 2018 at 4:56:13 AM CDT
>> To: Endre Karlson
>> Cc: users
>> 
>> Can you provide "gluster volume info" and the mount logs of the data volume (I assume this is the volume that hosts the vdisks for the VMs with storage errors)?
>> 
>> Also the vdsm.log from the corresponding time.
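>> 
>> (Roughly, on one of the hosts that means something like the following; the mount-log file name below is only illustrative, as the exact name depends on the mount path:)
>> 
>>     gluster volume info data
>>     less /var/log/glusterfs/rhev-data-center-mnt-glusterSD-<server>:_data.log
>>     less /var/log/vdsm/vdsm.log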
>> 
>> On Fri, Mar 16, 2018 at 3:45 AM, Endre Karlson <endre.karlson at gmail.com> wrote:
>> Hi, this issue is here again: we are getting several VMs going into storage error in our 4-node cluster running CentOS 7.4 with gluster and ovirt 4.2.1.
>> 
>> Gluster version: 3.12.6
>> 
>> volume status
>> [root at ovirt3 ~]# gluster volume status
>> Status of volume: data
>> Gluster process                             TCP Port  RDMA Port  Online  Pid
>> ------------------------------------------------------------------------------
>> Brick ovirt0:/gluster/brick3/data           49152     0          Y       9102 
>> Brick ovirt2:/gluster/brick3/data           49152     0          Y       28063
>> Brick ovirt3:/gluster/brick3/data           49152     0          Y       28379
>> Brick ovirt0:/gluster/brick4/data           49153     0          Y       9111 
>> Brick ovirt2:/gluster/brick4/data           49153     0          Y       28069
>> Brick ovirt3:/gluster/brick4/data           49153     0          Y       28388
>> Brick ovirt0:/gluster/brick5/data           49154     0          Y       9120 
>> Brick ovirt2:/gluster/brick5/data           49154     0          Y       28075
>> Brick ovirt3:/gluster/brick5/data           49154     0          Y       28397
>> Brick ovirt0:/gluster/brick6/data           49155     0          Y       9129 
>> Brick ovirt2:/gluster/brick6_1/data         49155     0          Y       28081
>> Brick ovirt3:/gluster/brick6/data           49155     0          Y       28404
>> Brick ovirt0:/gluster/brick7/data           49156     0          Y       9138 
>> Brick ovirt2:/gluster/brick7/data           49156     0          Y       28089
>> Brick ovirt3:/gluster/brick7/data           49156     0          Y       28411
>> Brick ovirt0:/gluster/brick8/data           49157     0          Y       9145 
>> Brick ovirt2:/gluster/brick8/data           49157     0          Y       28095
>> Brick ovirt3:/gluster/brick8/data           49157     0          Y       28418
>> Brick ovirt1:/gluster/brick3/data           49152     0          Y       23139
>> Brick ovirt1:/gluster/brick4/data           49153     0          Y       23145
>> Brick ovirt1:/gluster/brick5/data           49154     0          Y       23152
>> Brick ovirt1:/gluster/brick6/data           49155     0          Y       23159
>> Brick ovirt1:/gluster/brick7/data           49156     0          Y       23166
>> Brick ovirt1:/gluster/brick8/data           49157     0          Y       23173
>> Self-heal Daemon on localhost               N/A       N/A        Y       7757 
>> Bitrot Daemon on localhost                  N/A       N/A        Y       7766 
>> Scrubber Daemon on localhost                N/A       N/A        Y       7785 
>> Self-heal Daemon on ovirt2                  N/A       N/A        Y       8205 
>> Bitrot Daemon on ovirt2                     N/A       N/A        Y       8216 
>> Scrubber Daemon on ovirt2                   N/A       N/A        Y       8227 
>> Self-heal Daemon on ovirt0                  N/A       N/A        Y       32665
>> Bitrot Daemon on ovirt0                     N/A       N/A        Y       32674
>> Scrubber Daemon on ovirt0                   N/A       N/A        Y       32712
>> Self-heal Daemon on ovirt1                  N/A       N/A        Y       31759
>> Bitrot Daemon on ovirt1                     N/A       N/A        Y       31768
>> Scrubber Daemon on ovirt1                   N/A       N/A        Y       31790
>>  
>> Task Status of Volume data
>> ------------------------------------------------------------------------------
>> Task                 : Rebalance           
>> ID                   : 62942ba3-db9e-4604-aa03-4970767f4d67
>> Status               : completed           
>>  
>> Status of volume: engine
>> Gluster process                             TCP Port  RDMA Port  Online  Pid
>> ------------------------------------------------------------------------------
>> Brick ovirt0:/gluster/brick1/engine         49158     0          Y       9155 
>> Brick ovirt2:/gluster/brick1/engine         49158     0          Y       28107
>> Brick ovirt3:/gluster/brick1/engine         49158     0          Y       28427
>> Self-heal Daemon on localhost               N/A       N/A        Y       7757 
>> Self-heal Daemon on ovirt1                  N/A       N/A        Y       31759
>> Self-heal Daemon on ovirt0                  N/A       N/A        Y       32665
>> Self-heal Daemon on ovirt2                  N/A       N/A        Y       8205 
>>  
>> Task Status of Volume engine
>> ------------------------------------------------------------------------------
>> There are no active volume tasks
>>  
>> Status of volume: iso
>> Gluster process                             TCP Port  RDMA Port  Online  Pid
>> ------------------------------------------------------------------------------
>> Brick ovirt0:/gluster/brick2/iso            49159     0          Y       9164 
>> Brick ovirt2:/gluster/brick2/iso            49159     0          Y       28116
>> Brick ovirt3:/gluster/brick2/iso            49159     0          Y       28436
>> NFS Server on localhost                     2049      0          Y       7746 
>> Self-heal Daemon on localhost               N/A       N/A        Y       7757 
>> NFS Server on ovirt1                        2049      0          Y       31748
>> Self-heal Daemon on ovirt1                  N/A       N/A        Y       31759
>> NFS Server on ovirt0                        2049      0          Y       32656
>> Self-heal Daemon on ovirt0                  N/A       N/A        Y       32665
>> NFS Server on ovirt2                        2049      0          Y       8194 
>> Self-heal Daemon on ovirt2                  N/A       N/A        Y       8205 
>>  
>> Task Status of Volume iso
>> ------------------------------------------------------------------------------
>> There are no active volume tasks
>> 
>> 
