I did eventually get ovirt1 to come up. I lost track of the specific
combination of things, but I know the last command was "reinstall host"
from the gui. I had tried that several times prior and it failed, but this
time it worked. As I was running on almost no sleep, I don't remember the
final things that allowed the reinstall to work. It did take my gluster
volumes down again in the process, though.
Once I got ovirt1 back up and operational and got my gluster volumes online
again, the gui recognized my storage domains and brought the data
center up, and then I could start doing things in the gui again (like
starting VMs).
I will perform more in-depth analysis and troubleshooting of ovirt1's
low-level storage performance today. Everything I've checked so far shows
nothing wrong with it, though.
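For reference, the kind of low-level check I have in mind is a direct-I/O
write test against the brick filesystem itself (not the fuse mount) on each
host, since the profile points at FSYNC/WRITE latency. Something like this,
for example (test file names are made up; remove them afterwards, and fio may
need to be installed):

# dd if=/dev/zero of=/gluster/brick1/ddtest.tmp bs=1M count=1024 oflag=direct,sync
# fio --name=bricktest --filename=/gluster/brick1/fiotest.tmp --rw=randwrite \
      --bs=4k --size=256M --direct=1 --fsync=1

If ovirt1 comes out much slower than ovirt2 on the same test, the problem is
below gluster (controller, disk, or LVM layer).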
On the gluster side, I am using LVM with thin LVs (lvmthin) for my gluster
backing volumes... Is that expected to be a problem?
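(In the meantime I'll also check the thin pool utilization, e.g.:

# lvs -a -o lv_name,vg_name,data_percent,metadata_percent,lv_attr

since my understanding is that a thin pool running low on data or metadata
space can cause exactly this kind of stalling.)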
--Jim
On Thu, May 31, 2018 at 11:25 PM, Sahina Bose <sabose(a)redhat.com> wrote:
On Thu, May 31, 2018 at 3:16 AM, Jim Kusznir <jim(a)palousetech.com> wrote:
> I've been back at it, and still am unable to get more than one of my
> physical nodes to come online in ovirt, nor am I able to get more than the
> two gluster volumes (storage domains) to show online within ovirt.
>
> In Storage -> Volumes, they all show offline (many with one brick down,
> which is correct: I have one server off)
> However, in Storage -> domains, they all show down (although somehow, I
> got machines from the data domain to start), and there's no obvious way
> I've seen to tell them to bring it online. It claims none of the servers
> can see that volume, but I've quadruple-checked that the volumes are
> mounted on the engines and are fully functional there. I have some more
> VMs I need to get back up and running. How do I fix this?
>
> ovirt1, for unknown reasons, will not work. Attempts to bring it online
> fail, and I haven't figured out what log file to look in yet for more
> details.
>
As Krutika mentioned before, the storage performance issues seem to stem
from the slow brick on ovirt1 - ovirt1.nwfiber.com:/gluster/brick1/engine.
Have you checked this?
When you say ovirt1 will not work - do you mean the host cannot be
activated in oVirt engine? Can you look at/share the engine.log (on the HE
VM) to understand why it won't activate. Also, if the storage domain is not
accessible by any of the servers, check/share the vdsm.log to understand
the error.
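For example (these are the default log locations; adjust the patterns as
needed):

grep -iE 'error|fail' /var/log/ovirt-engine/engine.log | tail -n 50    (on the HE VM)
grep -iE 'error|warn' /var/log/vdsm/vdsm.log | tail -n 50              (on each host)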
> On Wed, May 30, 2018 at 9:36 AM, Jim Kusznir <jim(a)palousetech.com> wrote:
>
>> Hi all again:
>>
>> I'm now subscribed to gluster-users as well, so I should get any replies
>> from that side too.
>>
>> At this point, I am seeing acceptable (although slower than I expect)
>> performance much of the time, with periodic massive spikes in latency
>> (occasionally so bad as to cause ovirt to detect a bad engine health
>> status). Often, if I check the logs just then, I'll see those call traces
>> in xfs_log_worker or other gluster processes, as well as hung task timeout
>> messages.
>>
>> As to the profile suggesting ovirt1 had poorer performance than ovirt2,
>> I don't have an explanation. "gluster volume info engine" on both hosts is
>> identical. The computers and drives are identical (Dell R610 with PERC 6/i
>> controller configured to pass through the drive). ovirt1 and ovirt2's
>> partition scheme/map do vary somewhat, but I figured that wouldn't be a big
>> deal and gluster just uses the least common denominator. Is that accurate?
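>>
>> (For comparison, something like this on each host should show the relevant
>> differences -- mount point assumed from the brick path above:
>>
>> # lsblk
>> # xfs_info /gluster/brick1
>>
>> in particular the XFS sector size and stripe unit/width, plus whatever LVM
>> layout sits underneath.)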
>>
>> In case it was missed, here are the dmesg errors being thrown by gluster
>> (as far as I can tell). They definitely started after I upgraded from
>> gluster 3.8 to 3.12:
>>
>> [ 6604.536156] perf: interrupt took too long (9593 > 9565), lowering
>> kernel.perf_event_max_sample_rate to 20000
>> [ 8280.188734] INFO: task xfsaild/dm-10:1061 blocked for more than 120
>> seconds.
>> [ 8280.188787] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
>> disables this message.
>> [ 8280.188837] xfsaild/dm-10 D ffff93203a2eeeb0 0 1061 2
>> 0x00000000
>> [ 8280.188843] Call Trace:
>> [ 8280.188857] [<ffffffff960a3a2e>] ? try_to_del_timer_sync+0x5e/0x90
>> [ 8280.188864] [<ffffffff96713f79>] schedule+0x29/0x70
>> [ 8280.188932] [<ffffffffc049fe36>] _xfs_log_force+0x1c6/0x2c0 [xfs]
>> [ 8280.188939] [<ffffffff960cf1b0>] ? wake_up_state+0x20/0x20
>> [ 8280.188972] [<ffffffffc04abfec>] ? xfsaild+0x16c/0x6f0 [xfs]
>> [ 8280.189003] [<ffffffffc049ff5c>] xfs_log_force+0x2c/0x70 [xfs]
>> [ 8280.189035] [<ffffffffc04abe80>] ?
xfs_trans_ail_cursor_first+0x90/0x90
>> [xfs]
>> [ 8280.189067] [<ffffffffc04abfec>] xfsaild+0x16c/0x6f0 [xfs]
>> [ 8280.189100] [<ffffffffc04abe80>] ?
xfs_trans_ail_cursor_first+0x90/0x90
>> [xfs]
>> [ 8280.189105] [<ffffffff960bb161>] kthread+0xd1/0xe0
>> [ 8280.189109] [<ffffffff960bb090>] ? insert_kthread_work+0x40/0x40
>> [ 8280.189117] [<ffffffff96720677>] ret_from_fork_nospec_begin+0x2
>> 1/0x21
>> [ 8280.189121] [<ffffffff960bb090>] ? insert_kthread_work+0x40/0x40
>> [ 8280.189161] INFO: task glusterclogfsyn:6421 blocked for more than 120
>> seconds.
>> [ 8280.189207] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
>> disables this message.
>> [ 8280.189256] glusterclogfsyn D ffff93203a2edee0 0 6421 1
>> 0x00000080
>> [ 8280.189260] Call Trace:
>> [ 8280.189265] [<ffffffff96713f79>] schedule+0x29/0x70
>> [ 8280.189300] [<ffffffffc04a0388>] _xfs_log_force_lsn+0x2e8/0x340 [xfs]
>> [ 8280.189305] [<ffffffff960cf1b0>] ? wake_up_state+0x20/0x20
>> [ 8280.189333] [<ffffffffc0480b97>] xfs_file_fsync+0x107/0x1e0 [xfs]
>> [ 8280.189340] [<ffffffff9624f0e7>] do_fsync+0x67/0xb0
>> [ 8280.189345] [<ffffffff9672076f>] ? system_call_after_swapgs+0xbc/
>> 0x160
>> [ 8280.189348] [<ffffffff9624f3d0>] SyS_fsync+0x10/0x20
>> [ 8280.189352] [<ffffffff9672082f>] system_call_fastpath+0x1c/0x21
>> [ 8280.189356] [<ffffffff9672077b>] ? system_call_after_swapgs+0xc8/
>> 0x160
>> [ 8280.189360] INFO: task glusteriotwr2:766 blocked for more than 120
>> seconds.
>> [ 8280.189404] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
>> disables this message.
>> [ 8280.189453] glusteriotwr2 D ffff931cfa510000 0 766 1
>> 0x00000080
>> [ 8280.189457] Call Trace:
>> [ 8280.189476] [<ffffffffc01a3839>] ? __split_and_process_bio+0x2e9/0x520
>> [dm_mod]
>> [ 8280.189480] [<ffffffff96713f79>] schedule+0x29/0x70
>> [ 8280.189484] [<ffffffff967118e9>] schedule_timeout+0x239/0x2c0
>> [ 8280.189494] [<ffffffffc01a3d98>] ? dm_make_request+0x128/0x1a0
>> [dm_mod]
>> [ 8280.189498] [<ffffffff9671348d>] io_schedule_timeout+0xad/0x130
>> [ 8280.189502] [<ffffffff967145ad>] wait_for_completion_io+0xfd/0x140
>> [ 8280.189507] [<ffffffff960cf1b0>] ? wake_up_state+0x20/0x20
>> [ 8280.189513] [<ffffffff9631e574>] blkdev_issue_flush+0xb4/0x110
>> [ 8280.189546] [<ffffffffc04984b9>] xfs_blkdev_issue_flush+0x19/0x20
>> [xfs]
>> [ 8280.189588] [<ffffffffc0480c40>] xfs_file_fsync+0x1b0/0x1e0 [xfs]
>> [ 8280.189593] [<ffffffff9624f0e7>] do_fsync+0x67/0xb0
>> [ 8280.189597] [<ffffffff9672076f>] ? system_call_after_swapgs+0xbc/
>> 0x160
>> [ 8280.189600] [<ffffffff9624f3d0>] SyS_fsync+0x10/0x20
>> [ 8280.189604] [<ffffffff9672082f>] system_call_fastpath+0x1c/0x21
>> [ 8280.189608] [<ffffffff9672077b>] ? system_call_after_swapgs+0xc8/
>> 0x160
>> [ 8400.186024] INFO: task xfsaild/dm-10:1061 blocked for more than 120
>> seconds.
>> [ 8400.186077] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
>> disables this message.
>> [ 8400.186127] xfsaild/dm-10 D ffff93203a2eeeb0 0 1061 2
>> 0x00000000
>> [ 8400.186133] Call Trace:
>> [ 8400.186146] [<ffffffff960a3a2e>] ? try_to_del_timer_sync+0x5e/0x90
>> [ 8400.186153] [<ffffffff96713f79>] schedule+0x29/0x70
>> [ 8400.186221] [<ffffffffc049fe36>] _xfs_log_force+0x1c6/0x2c0 [xfs]
>> [ 8400.186227] [<ffffffff960cf1b0>] ? wake_up_state+0x20/0x20
>> [ 8400.186260] [<ffffffffc04abfec>] ? xfsaild+0x16c/0x6f0 [xfs]
>> [ 8400.186292] [<ffffffffc049ff5c>] xfs_log_force+0x2c/0x70 [xfs]
>> [ 8400.186324] [<ffffffffc04abe80>] ?
xfs_trans_ail_cursor_first+0x90/0x90
>> [xfs]
>> [ 8400.186356] [<ffffffffc04abfec>] xfsaild+0x16c/0x6f0 [xfs]
>> [ 8400.186388] [<ffffffffc04abe80>] ?
xfs_trans_ail_cursor_first+0x90/0x90
>> [xfs]
>> [ 8400.186394] [<ffffffff960bb161>] kthread+0xd1/0xe0
>> [ 8400.186398] [<ffffffff960bb090>] ? insert_kthread_work+0x40/0x40
>> [ 8400.186405] [<ffffffff96720677>] ret_from_fork_nospec_begin+0x2
>> 1/0x21
>> [ 8400.186409] [<ffffffff960bb090>] ? insert_kthread_work+0x40/0x40
>> [ 8400.186450] INFO: task glusterclogfsyn:6421 blocked for more than 120
>> seconds.
>> [ 8400.186496] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
>> disables this message.
>> [ 8400.186545] glusterclogfsyn D ffff93203a2edee0 0 6421 1
>> 0x00000080
>> [ 8400.186549] Call Trace:
>> [ 8400.186554] [<ffffffff96713f79>] schedule+0x29/0x70
>> [ 8400.186589] [<ffffffffc04a0388>] _xfs_log_force_lsn+0x2e8/0x340 [xfs]
>> [ 8400.186593] [<ffffffff960cf1b0>] ? wake_up_state+0x20/0x20
>> [ 8400.186622] [<ffffffffc0480b97>] xfs_file_fsync+0x107/0x1e0 [xfs]
>> [ 8400.186629] [<ffffffff9624f0e7>] do_fsync+0x67/0xb0
>> [ 8400.186634] [<ffffffff9672076f>] ? system_call_after_swapgs+0xbc/
>> 0x160
>> [ 8400.186637] [<ffffffff9624f3d0>] SyS_fsync+0x10/0x20
>> [ 8400.186641] [<ffffffff9672082f>] system_call_fastpath+0x1c/0x21
>> [ 8400.186645] [<ffffffff9672077b>] ? system_call_after_swapgs+0xc8/
>> 0x160
>> [ 8400.186649] INFO: task glusteriotwr2:766 blocked for more than 120
>> seconds.
>> [ 8400.186693] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
>> disables this message.
>> [ 8400.186741] glusteriotwr2 D ffff931cfa510000 0 766 1
>> 0x00000080
>> [ 8400.186746] Call Trace:
>> [ 8400.186764] [<ffffffffc01a3839>] ? __split_and_process_bio+0x2e9/0x520
>> [dm_mod]
>> [ 8400.186768] [<ffffffff96713f79>] schedule+0x29/0x70
>> [ 8400.186772] [<ffffffff967118e9>] schedule_timeout+0x239/0x2c0
>> [ 8400.186782] [<ffffffffc01a3d98>] ? dm_make_request+0x128/0x1a0
>> [dm_mod]
>> [ 8400.186786] [<ffffffff9671348d>] io_schedule_timeout+0xad/0x130
>> [ 8400.186790] [<ffffffff967145ad>] wait_for_completion_io+0xfd/0x140
>> [ 8400.186795] [<ffffffff960cf1b0>] ? wake_up_state+0x20/0x20
>> [ 8400.186800] [<ffffffff9631e574>] blkdev_issue_flush+0xb4/0x110
>> [ 8400.186833] [<ffffffffc04984b9>] xfs_blkdev_issue_flush+0x19/0x20
>> [xfs]
>> [ 8400.186862] [<ffffffffc0480c40>] xfs_file_fsync+0x1b0/0x1e0 [xfs]
>> [ 8400.186879] [<ffffffff9624f0e7>] do_fsync+0x67/0xb0
>> [ 8400.186884] [<ffffffff9672076f>] ? system_call_after_swapgs+0xbc/
>> 0x160
>> [ 8400.186887] [<ffffffff9624f3d0>] SyS_fsync+0x10/0x20
>> [ 8400.186891] [<ffffffff9672082f>] system_call_fastpath+0x1c/0x21
>> [ 8400.186895] [<ffffffff9672077b>] ? system_call_after_swapgs+0xc8/
>> 0x160
>> [ 8400.186912] INFO: task kworker/3:4:3191 blocked for more than 120
>> seconds.
>> [ 8400.186957] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
>> disables this message.
>> [ 8400.187006] kworker/3:4 D ffff931faf598fd0 0 3191 2
>> 0x00000080
>> [ 8400.187041] Workqueue: xfs-sync/dm-10 xfs_log_worker [xfs]
>> [ 8400.187043] Call Trace:
>> [ 8400.187050] [<ffffffff9614e45f>] ? delayacct_end+0x8f/0xb0
>> [ 8400.187055] [<ffffffff960dbbc8>] ? update_group_power+0x28/0x280
>> [ 8400.187059] [<ffffffff96713f79>] schedule+0x29/0x70
>> [ 8400.187062] [<ffffffff967118e9>] schedule_timeout+0x239/0x2c0
>> [ 8400.187066] [<ffffffff9671432d>] wait_for_completion+0xfd/0x140
>> [ 8400.187071] [<ffffffff960cf1b0>] ? wake_up_state+0x20/0x20
>> [ 8400.187075] [<ffffffff960b418d>] flush_work+0xfd/0x190
>> [ 8400.187079] [<ffffffff960b06e0>] ? move_linked_works+0x90/0x90
>> [ 8400.187110] [<ffffffffc04a1d8a>] xlog_cil_force_lsn+0x8a/0x210 [xfs]
>> [ 8400.187142] [<ffffffffc049fcf5>] _xfs_log_force+0x85/0x2c0 [xfs]
>> [ 8400.187174] [<ffffffffc049ffd6>] ? xfs_log_worker+0x36/0x100 [xfs]
>> [ 8400.187206] [<ffffffffc049ff5c>] xfs_log_force+0x2c/0x70 [xfs]
>> [ 8400.187237] [<ffffffffc049ffd6>] xfs_log_worker+0x36/0x100 [xfs]
>> [ 8400.187241] [<ffffffff960b312f>] process_one_work+0x17f/0x440
>> [ 8400.187245] [<ffffffff960b3df6>] worker_thread+0x126/0x3c0
>> [ 8400.187249] [<ffffffff960b3cd0>] ? manage_workers.isra.24+0x2a0/0
>> x2a0
>> [ 8400.187253] [<ffffffff960bb161>] kthread+0xd1/0xe0
>> [ 8400.187257] [<ffffffff960bb090>] ? insert_kthread_work+0x40/0x40
>> [ 8400.187261] [<ffffffff96720677>] ret_from_fork_nospec_begin+0x2
>> 1/0x21
>> [ 8400.187265] [<ffffffff960bb090>] ? insert_kthread_work+0x40/0x40
>> [13493.643624] perf: interrupt took too long (12026 > 11991), lowering
>> kernel.perf_event_max_sample_rate to 16000
>>
>>
>> On Wed, May 30, 2018 at 2:44 AM, Krutika Dhananjay <kdhananj(a)redhat.com>
>> wrote:
>>
>>> The profile seems to suggest very high latencies on the brick at
>>> ovirt1.nwfiber.com:/gluster/brick1/engine
>>> ovirt2.* shows decent numbers. Is everything OK with the brick on
>>> ovirt1?
>>> Are the bricks of engine volume on both these servers identical in
>>> terms of their config?
>>>
>>> -Krutika
>>>
>>>
>>> On Wed, May 30, 2018 at 3:07 PM, Jim Kusznir <jim(a)palousetech.com>
>>> wrote:
>>>
>>>> Hi:
>>>>
>>>> Thank you. I was finally able to get my cluster minimally functional
>>>> again (it's 2am local here; had to be up by 6am). I set
>>>> cluster.eager-lock off, but I'm still seeing performance issues. Here's
>>>> the engine volume:
>>>>
>>>> Brick: ovirt1.nwfiber.com:/gluster/brick1/engine
>>>> ------------------------------------------------
>>>> Cumulative Stats:
>>>> Block Size: 256b+ 512b+
>>>> 4096b+
>>>> No. of Reads: 702 24
>>>> 772
>>>> No. of Writes: 0 939
>>>> 4802
>>>>
>>>> Block Size: 8192b+ 16384b+
>>>> 32768b+
>>>> No. of Reads: 0 51
>>>> 49
>>>> No. of Writes: 4211 858
>>>> 291
>>>>
>>>> Block Size: 65536b+ 131072b+
>>>> No. of Reads: 49 5232
>>>> No. of Writes: 333 588
>>>> %-latency Avg-latency Min-Latency Max-Latency No. of calls
>>>> Fop
>>>> --------- ----------- ----------- ----------- ------------
>>>> ----
>>>> 0.00 0.00 us 0.00 us 0.00 us 3
>>>> FORGET
>>>> 0.00 0.00 us 0.00 us 0.00 us 1227
>>>> RELEASE
>>>> 0.00 0.00 us 0.00 us 0.00 us 1035
>>>> RELEASEDIR
>>>> 0.00 827.00 us 619.00 us 1199.00 us 10
>>>> READDIRP
>>>> 0.00 98.38 us 30.00 us 535.00 us 144
>>>> ENTRYLK
>>>> 0.00 180.11 us 121.00 us 257.00 us 94
>>>> REMOVEXATTR
>>>> 0.00 208.03 us 23.00 us 1980.00 us 182
>>>> GETXATTR
>>>> 0.00 1212.71 us 35.00 us 29459.00 us 38
>>>> READDIR
>>>> 0.01 260.39 us 172.00 us 552.00 us 376
>>>> XATTROP
>>>> 0.01 305.10 us 45.00 us 144983.00 us 727
>>>> STATFS
>>>> 0.03 1267.23 us 22.00 us 418399.00 us 451
>>>> FLUSH
>>>> 0.04 1164.14 us 86.00 us 437638.00 us 600
>>>> OPEN
>>>> 0.07 2253.75 us 3.00 us 645037.00 us 598
>>>> OPENDIR
>>>> 0.08 17360.97 us 149.00 us 962156.00 us 94
>>>> SETATTR
>>>> 0.12 2733.36 us 50.00 us 1683945.00 us 877
>>>> FSTAT
>>>> 0.21 3041.55 us 29.00 us 1021732.00 us 1368
>>>> INODELK
>>>> 0.24 389.93 us 107.00 us 550203.00 us 11840
>>>> FXATTROP
>>>> 0.34 14820.49 us 38.00 us 1935527.00 us 444
>>>> STAT
>>>> 0.88 3765.77 us 42.00 us 1341978.00 us 4581
>>>> LOOKUP
>>>> 15.06 19624.04 us 26.00 us 6412511.00 us 14971
>>>> FINODELK
>>>> 21.63 59817.26 us 45.00 us 4008832.00 us 7054
>>>> FSYNC
>>>> 23.19 263941.14 us 86.00 us 4130892.00 us 1714
>>>> READ
>>>> 38.10 125578.64 us 207.00 us 7682338.00 us 5919
>>>> WRITE
>>>>
>>>> Duration: 6957 seconds
>>>> Data Read: 694832857 bytes
>>>> Data Written: 192751104 bytes
>>>>
>>>> Interval 0 Stats:
>>>> Block Size: 256b+ 512b+
>>>> 4096b+
>>>> No. of Reads: 702 24
>>>> 772
>>>> No. of Writes: 0 939
>>>> 4802
>>>>
>>>> Block Size: 8192b+ 16384b+
>>>> 32768b+
>>>> No. of Reads: 0 51
>>>> 49
>>>> No. of Writes: 4211 858
>>>> 291
>>>>
>>>> Block Size: 65536b+ 131072b+
>>>> No. of Reads: 49 5232
>>>> No. of Writes: 333 588
>>>> %-latency Avg-latency Min-Latency Max-Latency No. of calls
>>>> Fop
>>>> --------- ----------- ----------- ----------- ------------
>>>> ----
>>>> 0.00 0.00 us 0.00 us 0.00 us 3
>>>> FORGET
>>>> 0.00 0.00 us 0.00 us 0.00 us 1227
>>>> RELEASE
>>>> 0.00 0.00 us 0.00 us 0.00 us 1035
>>>> RELEASEDIR
>>>> 0.00 827.00 us 619.00 us 1199.00 us 10
>>>> READDIRP
>>>> 0.00 98.38 us 30.00 us 535.00 us 144
>>>> ENTRYLK
>>>> 0.00 180.11 us 121.00 us 257.00 us 94
>>>> REMOVEXATTR
>>>> 0.00 208.03 us 23.00 us 1980.00 us 182
>>>> GETXATTR
>>>> 0.00 1212.71 us 35.00 us 29459.00 us 38
>>>> READDIR
>>>> 0.01 260.39 us 172.00 us 552.00 us 376
>>>> XATTROP
>>>> 0.01 305.10 us 45.00 us 144983.00 us 727
>>>> STATFS
>>>> 0.03 1267.23 us 22.00 us 418399.00 us 451
>>>> FLUSH
>>>> 0.04 1164.14 us 86.00 us 437638.00 us 600
>>>> OPEN
>>>> 0.07 2253.75 us 3.00 us 645037.00 us 598
>>>> OPENDIR
>>>> 0.08 17360.97 us 149.00 us 962156.00 us 94
>>>> SETATTR
>>>> 0.12 2733.36 us 50.00 us 1683945.00 us 877
>>>> FSTAT
>>>> 0.21 3041.55 us 29.00 us 1021732.00 us 1368
>>>> INODELK
>>>> 0.24 389.93 us 107.00 us 550203.00 us 11840
>>>> FXATTROP
>>>> 0.34 14820.49 us 38.00 us 1935527.00 us 444
>>>> STAT
>>>> 0.91 3859.75 us 42.00 us 1341978.00 us 4581
>>>> LOOKUP
>>>> 15.05 19622.80 us 26.00 us 6412511.00 us 14971
>>>> FINODELK
>>>> 21.62 59809.04 us 45.00 us 4008832.00 us 7054
>>>> FSYNC
>>>> 23.18 263941.14 us 86.00 us 4130892.00 us 1714
>>>> READ
>>>> 38.09 125562.74 us 207.00 us 7682338.00 us 5919
>>>> WRITE
>>>>
>>>> Duration: 6957 seconds
>>>> Data Read: 694832857 bytes
>>>> Data Written: 192751104 bytes
>>>>
>>>> Brick: ovirt2.nwfiber.com:/gluster/brick1/engine
>>>> ------------------------------------------------
>>>> Cumulative Stats:
>>>> Block Size: 256b+ 512b+
>>>> 1024b+
>>>> No. of Reads: 704 167
>>>> 216
>>>> No. of Writes: 0 939
>>>> 0
>>>>
>>>> Block Size: 2048b+ 4096b+
>>>> 8192b+
>>>> No. of Reads: 424 34918
>>>> 9866
>>>> No. of Writes: 0 4802
>>>> 4211
>>>>
>>>> Block Size: 16384b+ 32768b+
>>>> 65536b+
>>>> No. of Reads: 12695 3290
>>>> 3880
>>>> No. of Writes: 858 291
>>>> 333
>>>>
>>>> Block Size: 131072b+
>>>> No. of Reads: 13718
>>>> No. of Writes: 596
>>>> %-latency Avg-latency Min-Latency Max-Latency No. of calls
>>>> Fop
>>>> --------- ----------- ----------- ----------- ------------
>>>> ----
>>>> 0.00 0.00 us 0.00 us 0.00 us 3
>>>> FORGET
>>>> 0.00 0.00 us 0.00 us 0.00 us 1228
>>>> RELEASE
>>>> 0.00 0.00 us 0.00 us 0.00 us 1041
>>>> RELEASEDIR
>>>> 0.06 74.26 us 21.00 us 915.00 us 144
>>>> ENTRYLK
>>>> 0.07 141.97 us 87.00 us 268.00 us 94
>>>> REMOVEXATTR
>>>> 0.08 370.68 us 22.00 us 1311.00 us 38
>>>> READDIR
>>>> 0.09 184.88 us 108.00 us 316.00 us 94
>>>> SETATTR
>>>> 0.14 56.41 us 21.00 us 540.00 us 451
>>>> FLUSH
>>>> 0.14 139.42 us 18.00 us 3251.00 us 184
>>>> GETXATTR
>>>> 0.31 96.06 us 2.00 us 2293.00 us 598
>>>> OPENDIR
>>>> 0.35 88.76 us 31.00 us 716.00 us 727
>>>> STATFS
>>>> 0.43 131.96 us 59.00 us 784.00 us 600
>>>> OPEN
>>>> 0.44 214.77 us 110.00 us 851.00 us 376
>>>> XATTROP
>>>> 0.44 81.02 us 21.00 us 3029.00 us 998
>>>> STAT
>>>> 0.51 68.57 us 17.00 us 3039.00 us 1368
>>>> INODELK
>>>> 1.26 429.06 us 274.00 us 988.00 us 539
>>>> READDIRP
>>>> 1.89 109.48 us 38.00 us 2492.00 us 3175
>>>> FSTAT
>>>> 5.09 62.48 us 19.00 us 4578.00 us 14967
>>>> FINODELK
>>>> 5.72 229.43 us 32.00 us 979.00 us 4581
>>>> LOOKUP
>>>> 11.48 356.40 us 144.00 us 16222.00 us 5917
>>>> WRITE
>>>> 12.40 192.55 us 87.00 us 4372.00 us 11834
>>>> FXATTROP
>>>> 12.48 325.11 us 33.00 us 10702.00 us 7051
>>>> FSYNC
>>>> 46.61 2063.11 us 131.00 us 193833.00 us 4151
>>>> READ
>>>>
>>>> Duration: 6958 seconds
>>>> Data Read: 2782908225 bytes
>>>> Data Written: 193799680 bytes
>>>>
>>>> Interval 0 Stats:
>>>> Block Size: 256b+ 512b+
>>>> 1024b+
>>>> No. of Reads: 704 167
>>>> 216
>>>> No. of Writes: 0 939
>>>> 0
>>>>
>>>> Block Size: 2048b+ 4096b+
>>>> 8192b+
>>>> No. of Reads: 424 34918
>>>> 9866
>>>> No. of Writes: 0 4802
>>>> 4211
>>>>
>>>> Block Size: 16384b+ 32768b+
>>>> 65536b+
>>>> No. of Reads: 12695 3290
>>>> 3880
>>>> No. of Writes: 858 291
>>>> 333
>>>>
>>>> Block Size: 131072b+
>>>> No. of Reads: 13718
>>>> No. of Writes: 596
>>>> %-latency Avg-latency Min-Latency Max-Latency No. of calls
>>>> Fop
>>>> --------- ----------- ----------- ----------- ------------
>>>> ----
>>>> 0.00 0.00 us 0.00 us 0.00 us 3
>>>> FORGET
>>>> 0.00 0.00 us 0.00 us 0.00 us 1228
>>>> RELEASE
>>>> 0.00 0.00 us 0.00 us 0.00 us 1041
>>>> RELEASEDIR
>>>> 0.06 74.26 us 21.00 us 915.00 us 144
>>>> ENTRYLK
>>>> 0.07 141.97 us 87.00 us 268.00 us 94
>>>> REMOVEXATTR
>>>> 0.08 370.68 us 22.00 us 1311.00 us 38
>>>> READDIR
>>>> 0.09 184.88 us 108.00 us 316.00 us 94
>>>> SETATTR
>>>> 0.14 56.41 us 21.00 us 540.00 us 451
>>>> FLUSH
>>>> 0.14 139.42 us 18.00 us 3251.00 us 184
>>>> GETXATTR
>>>> 0.31 96.06 us 2.00 us 2293.00 us 598
>>>> OPENDIR
>>>> 0.35 88.76 us 31.00 us 716.00 us 727
>>>> STATFS
>>>> 0.43 131.96 us 59.00 us 784.00 us 600
>>>> OPEN
>>>> 0.44 214.77 us 110.00 us 851.00 us 376
>>>> XATTROP
>>>> 0.44 81.02 us 21.00 us 3029.00 us 998
>>>> STAT
>>>> 0.51 68.57 us 17.00 us 3039.00 us 1368
>>>> INODELK
>>>> 1.26 429.06 us 274.00 us 988.00 us 539
>>>> READDIRP
>>>> 1.89 109.51 us 38.00 us 2492.00 us 3175
>>>> FSTAT
>>>> 5.09 62.48 us 19.00 us 4578.00 us 14967
>>>> FINODELK
>>>> 5.72 229.43 us 32.00 us 979.00 us 4581
>>>> LOOKUP
>>>> 11.48 356.40 us 144.00 us 16222.00 us 5917
>>>> WRITE
>>>> 12.40 192.55 us 87.00 us 4372.00 us 11834
>>>> FXATTROP
>>>> 12.48 325.11 us 33.00 us 10702.00 us 7051
>>>> FSYNC
>>>> 46.62 2063.91 us 131.00 us 193833.00 us 4151
>>>> READ
>>>>
>>>> Duration: 6958 seconds
>>>> Data Read: 2782908225 bytes
>>>> Data Written: 193799680 bytes
>>>>
>>>>
>>>>
>>>> gluster> volume info engine
>>>>
>>>> Volume Name: engine
>>>> Type: Replicate
>>>> Volume ID: 87ad86b9-d88b-457e-ba21-5d3173c612de
>>>> Status: Started
>>>> Snapshot Count: 0
>>>> Number of Bricks: 1 x (2 + 1) = 3
>>>> Transport-type: tcp
>>>> Bricks:
>>>> Brick1: ovirt1.nwfiber.com:/gluster/brick1/engine
>>>> Brick2: ovirt2.nwfiber.com:/gluster/brick1/engine
>>>> Brick3: ovirt3.nwfiber.com:/gluster/brick1/engine (arbiter)
>>>> Options Reconfigured:
>>>> diagnostics.count-fop-hits: on
>>>> diagnostics.latency-measurement: on
>>>> performance.strict-o-direct: on
>>>> nfs.disable: on
>>>> user.cifs: off
>>>> network.ping-timeout: 30
>>>> cluster.shd-max-threads: 6
>>>> cluster.shd-wait-qlength: 10000
>>>> cluster.locking-scheme: granular
>>>> cluster.data-self-heal-algorithm: full
>>>> performance.low-prio-threads: 32
>>>> features.shard-block-size: 512MB
>>>> features.shard: on
>>>> storage.owner-gid: 36
>>>> storage.owner-uid: 36
>>>> cluster.server-quorum-type: server
>>>> cluster.quorum-type: auto
>>>> network.remote-dio: off
>>>> cluster.eager-lock: off
>>>> performance.stat-prefetch: off
>>>> performance.io-cache: off
>>>> performance.read-ahead: off
>>>> performance.quick-read: off
>>>> performance.readdir-ahead: on
>>>> geo-replication.indexing: on
>>>> geo-replication.ignore-pid-check: on
>>>> changelog.changelog: on
>>>>
>>>>
>>>> On Tue, May 29, 2018 at 9:17 PM, Krutika Dhananjay <
>>>> kdhananj(a)redhat.com> wrote:
>>>>
>>>>> Adding Ravi to look into the heal issue.
>>>>>
>>>>> As for the fsync hang and subsequent IO errors, it seems a lot like
>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1497156 and Paolo
>>>>> Bonzini from qemu had pointed out that this would be fixed by the
>>>>> following commit:
>>>>>
>>>>> commit e72c9a2a67a6400c8ef3d01d4c461dbbbfa0e1f0
>>>>> Author: Paolo Bonzini <pbonzini(a)redhat.com>
>>>>> Date: Wed Jun 21 16:35:46 2017 +0200
>>>>>
>>>>> scsi: virtio_scsi: let host do exception handling
>>>>>
>>>>>     virtio_scsi tries to do exception handling after the default 30 seconds
>>>>>     timeout expires. However, it's better to let the host control the
>>>>>     timeout, otherwise with a heavy I/O load it is likely that an abort will
>>>>>     also timeout. This leads to fatal errors like filesystems going
>>>>>     offline.
>>>>>
>>>>>     Disable the 'sd' timeout and allow the host to do exception handling,
>>>>>     following the precedent of the storvsc driver.
>>>>>
>>>>>     Hannes has a proposal to introduce timeouts in virtio, but this provides
>>>>>     an immediate solution for stable kernels too.
>>>>>
>>>>>     [mkp: fixed typo]
>>>>>
>>>>>     Reported-by: Douglas Miller <dougmill(a)linux.vnet.ibm.com>
>>>>>     Cc: "James E.J. Bottomley" <jejb(a)linux.vnet.ibm.com>
>>>>>     Cc: "Martin K. Petersen" <martin.petersen(a)oracle.com>
>>>>>     Cc: Hannes Reinecke <hare(a)suse.de>
>>>>>     Cc: linux-scsi(a)vger.kernel.org
>>>>>     Cc: stable(a)vger.kernel.org
>>>>>     Signed-off-by: Paolo Bonzini <pbonzini(a)redhat.com>
>>>>>     Signed-off-by: Martin K. Petersen <martin.petersen(a)oracle.com>
>>>>>
>>>>>
>>>>> Adding Paolo/Kevin to comment.
>>>>>
>>>>> As for the poor gluster performance, could you disable
>>>>> cluster.eager-lock and see if that makes any difference:
>>>>>
>>>>> # gluster volume set <VOL> cluster.eager-lock off
>>>>>
>>>>> Do also capture the volume profile again if you still see performance
>>>>> issues after disabling eager-lock.
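>>>>>
>>>>> For reference, the profile can be captured with:
>>>>>
>>>>> # gluster volume profile <VOL> start
>>>>> # gluster volume profile <VOL> info
>>>>>
>>>>> (let the workload run for a few minutes before running "info"; each "info"
>>>>> also resets the interval counters, so a later "info" shows only the new
>>>>> interval).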
>>>>>
>>>>> -Krutika
>>>>>
>>>>>
>>>>> On Wed, May 30, 2018 at 6:55 AM, Jim Kusznir
<jim(a)palousetech.com>
>>>>> wrote:
>>>>>
>>>>>> I also finally found the following in my system log on one
server:
>>>>>>
>>>>>> [10679.524491] INFO: task glusterclogro:14933 blocked for more
than
>>>>>> 120 seconds.
>>>>>> [10679.525826] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs"
>>>>>> disables this message.
>>>>>> [10679.527144] glusterclogro D ffff97209832bf40 0 14933
1
>>>>>> 0x00000080
>>>>>> [10679.527150] Call Trace:
>>>>>> [10679.527161] [<ffffffffb9913f79>] schedule+0x29/0x70
>>>>>> [10679.527218] [<ffffffffc060e388>]
_xfs_log_force_lsn+0x2e8/0x340
>>>>>> [xfs]
>>>>>> [10679.527225] [<ffffffffb92cf1b0>] ?
wake_up_state+0x20/0x20
>>>>>> [10679.527254] [<ffffffffc05eeb97>]
xfs_file_fsync+0x107/0x1e0 [xfs]
>>>>>> [10679.527260] [<ffffffffb944f0e7>] do_fsync+0x67/0xb0
>>>>>> [10679.527268] [<ffffffffb992076f>] ?
system_call_after_swapgs+0xbc/
>>>>>> 0x160
>>>>>> [10679.527271] [<ffffffffb944f3d0>] SyS_fsync+0x10/0x20
>>>>>> [10679.527275] [<ffffffffb992082f>]
system_call_fastpath+0x1c/0x21
>>>>>> [10679.527279] [<ffffffffb992077b>] ?
system_call_after_swapgs+0xc8/
>>>>>> 0x160
>>>>>> [10679.527283] INFO: task glusterposixfsy:14941 blocked for more
>>>>>> than 120 seconds.
>>>>>> [10679.528608] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs"
>>>>>> disables this message.
>>>>>> [10679.529956] glusterposixfsy D ffff972495f84f10 0 14941
1
>>>>>> 0x00000080
>>>>>> [10679.529961] Call Trace:
>>>>>> [10679.529966] [<ffffffffb9913f79>] schedule+0x29/0x70
>>>>>> [10679.530003] [<ffffffffc060e388>]
_xfs_log_force_lsn+0x2e8/0x340
>>>>>> [xfs]
>>>>>> [10679.530008] [<ffffffffb92cf1b0>] ?
wake_up_state+0x20/0x20
>>>>>> [10679.530038] [<ffffffffc05eeb97>]
xfs_file_fsync+0x107/0x1e0 [xfs]
>>>>>> [10679.530042] [<ffffffffb944f0e7>] do_fsync+0x67/0xb0
>>>>>> [10679.530046] [<ffffffffb992076f>] ?
system_call_after_swapgs+0xbc/
>>>>>> 0x160
>>>>>> [10679.530050] [<ffffffffb944f3f3>]
SyS_fdatasync+0x13/0x20
>>>>>> [10679.530054] [<ffffffffb992082f>]
system_call_fastpath+0x1c/0x21
>>>>>> [10679.530058] [<ffffffffb992077b>] ?
system_call_after_swapgs+0xc8/
>>>>>> 0x160
>>>>>> [10679.530062] INFO: task glusteriotwr13:15486 blocked for more
than
>>>>>> 120 seconds.
>>>>>> [10679.531805] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs"
>>>>>> disables this message.
>>>>>> [10679.533732] glusteriotwr13 D ffff9720a83f0000 0 15486
1
>>>>>> 0x00000080
>>>>>> [10679.533738] Call Trace:
>>>>>> [10679.533747] [<ffffffffb9913f79>] schedule+0x29/0x70
>>>>>> [10679.533799] [<ffffffffc060e388>]
_xfs_log_force_lsn+0x2e8/0x340
>>>>>> [xfs]
>>>>>> [10679.533806] [<ffffffffb92cf1b0>] ?
wake_up_state+0x20/0x20
>>>>>> [10679.533846] [<ffffffffc05eeb97>]
xfs_file_fsync+0x107/0x1e0 [xfs]
>>>>>> [10679.533852] [<ffffffffb944f0e7>] do_fsync+0x67/0xb0
>>>>>> [10679.533858] [<ffffffffb992076f>] ?
system_call_after_swapgs+0xbc/
>>>>>> 0x160
>>>>>> [10679.533863] [<ffffffffb944f3f3>]
SyS_fdatasync+0x13/0x20
>>>>>> [10679.533868] [<ffffffffb992082f>]
system_call_fastpath+0x1c/0x21
>>>>>> [10679.533873] [<ffffffffb992077b>] ?
system_call_after_swapgs+0xc8/
>>>>>> 0x160
>>>>>> [10919.512757] INFO: task glusterclogro:14933 blocked for more
than
>>>>>> 120 seconds.
>>>>>> [10919.514714] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs"
>>>>>> disables this message.
>>>>>> [10919.516663] glusterclogro D ffff97209832bf40 0 14933
1
>>>>>> 0x00000080
>>>>>> [10919.516677] Call Trace:
>>>>>> [10919.516690] [<ffffffffb9913f79>] schedule+0x29/0x70
>>>>>> [10919.516696] [<ffffffffb99118e9>]
schedule_timeout+0x239/0x2c0
>>>>>> [10919.516703] [<ffffffffb951cc04>] ?
blk_finish_plug+0x14/0x40
>>>>>> [10919.516768] [<ffffffffc05e9224>] ?
_xfs_buf_ioapply+0x334/0x460
>>>>>> [xfs]
>>>>>> [10919.516774] [<ffffffffb991432d>]
wait_for_completion+0xfd/0x140
>>>>>> [10919.516782] [<ffffffffb92cf1b0>] ?
wake_up_state+0x20/0x20
>>>>>> [10919.516821] [<ffffffffc05eb0a3>] ?
_xfs_buf_read+0x23/0x40 [xfs]
>>>>>> [10919.516859] [<ffffffffc05eafa9>]
xfs_buf_submit_wait+0xf9/0x1d0
>>>>>> [xfs]
>>>>>> [10919.516902] [<ffffffffc061b279>] ?
xfs_trans_read_buf_map+0x199/0x400
>>>>>> [xfs]
>>>>>> [10919.516940] [<ffffffffc05eb0a3>]
_xfs_buf_read+0x23/0x40 [xfs]
>>>>>> [10919.516977] [<ffffffffc05eb1b9>]
xfs_buf_read_map+0xf9/0x160
>>>>>> [xfs]
>>>>>> [10919.517022] [<ffffffffc061b279>]
xfs_trans_read_buf_map+0x199/0x400
>>>>>> [xfs]
>>>>>> [10919.517057] [<ffffffffc05c8d04>]
xfs_da_read_buf+0xd4/0x100 [xfs]
>>>>>> [10919.517091] [<ffffffffc05c8d53>]
xfs_da3_node_read+0x23/0xd0
>>>>>> [xfs]
>>>>>> [10919.517126] [<ffffffffc05c9fee>]
xfs_da3_node_lookup_int+0x6e/0x2f0
>>>>>> [xfs]
>>>>>> [10919.517160] [<ffffffffc05d5a1d>]
xfs_dir2_node_lookup+0x4d/0x170
>>>>>> [xfs]
>>>>>> [10919.517194] [<ffffffffc05ccf5d>]
xfs_dir_lookup+0x1bd/0x1e0 [xfs]
>>>>>> [10919.517233] [<ffffffffc05fd8d9>] xfs_lookup+0x69/0x140
[xfs]
>>>>>> [10919.517271] [<ffffffffc05fa018>]
xfs_vn_lookup+0x78/0xc0 [xfs]
>>>>>> [10919.517278] [<ffffffffb9425cf3>] lookup_real+0x23/0x60
>>>>>> [10919.517283] [<ffffffffb9426702>]
__lookup_hash+0x42/0x60
>>>>>> [10919.517288] [<ffffffffb942d519>]
SYSC_renameat2+0x3a9/0x5a0
>>>>>> [10919.517296] [<ffffffffb94d3753>] ?
selinux_file_free_security+0x2
>>>>>> 3/0x30
>>>>>> [10919.517304] [<ffffffffb992077b>] ?
system_call_after_swapgs+0xc8/
>>>>>> 0x160
>>>>>> [10919.517309] [<ffffffffb992076f>] ?
system_call_after_swapgs+0xbc/
>>>>>> 0x160
>>>>>> [10919.517313] [<ffffffffb992077b>] ?
system_call_after_swapgs+0xc8/
>>>>>> 0x160
>>>>>> [10919.517318] [<ffffffffb992076f>] ?
system_call_after_swapgs+0xbc/
>>>>>> 0x160
>>>>>> [10919.517323] [<ffffffffb942e58e>]
SyS_renameat2+0xe/0x10
>>>>>> [10919.517328] [<ffffffffb942e5ce>] SyS_rename+0x1e/0x20
>>>>>> [10919.517333] [<ffffffffb992082f>]
system_call_fastpath+0x1c/0x21
>>>>>> [10919.517339] [<ffffffffb992077b>] ?
system_call_after_swapgs+0xc8/
>>>>>> 0x160
>>>>>> [11159.496095] INFO: task glusteriotwr9:15482 blocked for more
than
>>>>>> 120 seconds.
>>>>>> [11159.497546] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs"
>>>>>> disables this message.
>>>>>> [11159.498978] glusteriotwr9 D ffff971fa0fa1fa0 0 15482
1
>>>>>> 0x00000080
>>>>>> [11159.498984] Call Trace:
>>>>>> [11159.498995] [<ffffffffb9911f00>] ? bit_wait+0x50/0x50
>>>>>> [11159.498999] [<ffffffffb9913f79>] schedule+0x29/0x70
>>>>>> [11159.499003] [<ffffffffb99118e9>]
schedule_timeout+0x239/0x2c0
>>>>>> [11159.499056] [<ffffffffc05dd9b7>] ?
>>>>>> xfs_iext_bno_to_ext+0xa7/0x1a0 [xfs]
>>>>>> [11159.499082] [<ffffffffc05dd43e>] ?
>>>>>> xfs_iext_bno_to_irec+0x8e/0xd0 [xfs]
>>>>>> [11159.499090] [<ffffffffb92f7a12>] ?
ktime_get_ts64+0x52/0xf0
>>>>>> [11159.499093] [<ffffffffb9911f00>] ? bit_wait+0x50/0x50
>>>>>> [11159.499097] [<ffffffffb991348d>]
io_schedule_timeout+0xad/0x130
>>>>>> [11159.499101] [<ffffffffb9913528>] io_schedule+0x18/0x20
>>>>>> [11159.499104] [<ffffffffb9911f11>] bit_wait_io+0x11/0x50
>>>>>> [11159.499107] [<ffffffffb9911ac1>]
__wait_on_bit_lock+0x61/0xc0
>>>>>> [11159.499113] [<ffffffffb9393634>] __lock_page+0x74/0x90
>>>>>> [11159.499118] [<ffffffffb92bc210>] ?
wake_bit_function+0x40/0x40
>>>>>> [11159.499121] [<ffffffffb9394154>]
__find_lock_page+0x54/0x70
>>>>>> [11159.499125] [<ffffffffb9394e85>]
grab_cache_page_write_begin+0x
>>>>>> 55/0xc0
>>>>>> [11159.499130] [<ffffffffb9484b76>]
iomap_write_begin+0x66/0x100
>>>>>> [11159.499135] [<ffffffffb9484edf>]
iomap_write_actor+0xcf/0x1d0
>>>>>> [11159.499140] [<ffffffffb9484e10>] ?
iomap_write_end+0x80/0x80
>>>>>> [11159.499144] [<ffffffffb94854e7>]
iomap_apply+0xb7/0x150
>>>>>> [11159.499149] [<ffffffffb9485621>]
iomap_file_buffered_write+0xa1
>>>>>> /0xe0
>>>>>> [11159.499153] [<ffffffffb9484e10>] ?
iomap_write_end+0x80/0x80
>>>>>> [11159.499182] [<ffffffffc05f025d>]
xfs_file_buffered_aio_write+0x12d/0x2c0
>>>>>> [xfs]
>>>>>> [11159.499213] [<ffffffffc05f057d>]
xfs_file_aio_write+0x18d/0x1b0
>>>>>> [xfs]
>>>>>> [11159.499217] [<ffffffffb941a533>]
do_sync_write+0x93/0xe0
>>>>>> [11159.499222] [<ffffffffb941b010>] vfs_write+0xc0/0x1f0
>>>>>> [11159.499225] [<ffffffffb941c002>]
SyS_pwrite64+0x92/0xc0
>>>>>> [11159.499230] [<ffffffffb992076f>] ?
system_call_after_swapgs+0xbc/
>>>>>> 0x160
>>>>>> [11159.499234] [<ffffffffb992082f>]
system_call_fastpath+0x1c/0x21
>>>>>> [11159.499238] [<ffffffffb992077b>] ?
system_call_after_swapgs+0xc8/
>>>>>> 0x160
>>>>>> [11279.488720] INFO: task xfsaild/dm-10:1134 blocked for more
than
>>>>>> 120 seconds.
>>>>>> [11279.490197] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs"
>>>>>> disables this message.
>>>>>> [11279.491665] xfsaild/dm-10 D ffff9720a8660fd0 0 1134
2
>>>>>> 0x00000000
>>>>>> [11279.491671] Call Trace:
>>>>>> [11279.491682] [<ffffffffb92a3a2e>] ?
try_to_del_timer_sync+0x5e/0x9
>>>>>> 0
>>>>>> [11279.491688] [<ffffffffb9913f79>] schedule+0x29/0x70
>>>>>> [11279.491744] [<ffffffffc060de36>]
_xfs_log_force+0x1c6/0x2c0 [xfs]
>>>>>> [11279.491750] [<ffffffffb92cf1b0>] ?
wake_up_state+0x20/0x20
>>>>>> [11279.491783] [<ffffffffc0619fec>] ? xfsaild+0x16c/0x6f0
[xfs]
>>>>>> [11279.491817] [<ffffffffc060df5c>]
xfs_log_force+0x2c/0x70 [xfs]
>>>>>> [11279.491849] [<ffffffffc0619e80>] ?
xfs_trans_ail_cursor_first+0x90/0x90
>>>>>> [xfs]
>>>>>> [11279.491880] [<ffffffffc0619fec>] xfsaild+0x16c/0x6f0
[xfs]
>>>>>> [11279.491913] [<ffffffffc0619e80>] ?
xfs_trans_ail_cursor_first+0x90/0x90
>>>>>> [xfs]
>>>>>> [11279.491919] [<ffffffffb92bb161>] kthread+0xd1/0xe0
>>>>>> [11279.491926] [<ffffffffb92bb090>] ?
insert_kthread_work+0x40/0x40
>>>>>> [11279.491932] [<ffffffffb9920677>]
ret_from_fork_nospec_begin+0x2
>>>>>> 1/0x21
>>>>>> [11279.491936] [<ffffffffb92bb090>] ?
insert_kthread_work+0x40/0x40
>>>>>> [11279.491976] INFO: task glusterclogfsyn:14934 blocked for more
>>>>>> than 120 seconds.
>>>>>> [11279.493466] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs"
>>>>>> disables this message.
>>>>>> [11279.494952] glusterclogfsyn D ffff97209832af70 0 14934
1
>>>>>> 0x00000080
>>>>>> [11279.494957] Call Trace:
>>>>>> [11279.494979] [<ffffffffc0309839>] ?
__split_and_process_bio+0x2e9/0x520
>>>>>> [dm_mod]
>>>>>> [11279.494983] [<ffffffffb9913f79>] schedule+0x29/0x70
>>>>>> [11279.494987] [<ffffffffb99118e9>]
schedule_timeout+0x239/0x2c0
>>>>>> [11279.494997] [<ffffffffc0309d98>] ?
dm_make_request+0x128/0x1a0
>>>>>> [dm_mod]
>>>>>> [11279.495001] [<ffffffffb991348d>]
io_schedule_timeout+0xad/0x130
>>>>>> [11279.495005] [<ffffffffb99145ad>]
wait_for_completion_io+0xfd/0x
>>>>>> 140
>>>>>> [11279.495010] [<ffffffffb92cf1b0>] ?
wake_up_state+0x20/0x20
>>>>>> [11279.495016] [<ffffffffb951e574>]
blkdev_issue_flush+0xb4/0x110
>>>>>> [11279.495049] [<ffffffffc06064b9>]
xfs_blkdev_issue_flush+0x19/0x20
>>>>>> [xfs]
>>>>>> [11279.495079] [<ffffffffc05eec40>]
xfs_file_fsync+0x1b0/0x1e0 [xfs]
>>>>>> [11279.495086] [<ffffffffb944f0e7>] do_fsync+0x67/0xb0
>>>>>> [11279.495090] [<ffffffffb992076f>] ?
system_call_after_swapgs+0xbc/
>>>>>> 0x160
>>>>>> [11279.495094] [<ffffffffb944f3d0>] SyS_fsync+0x10/0x20
>>>>>> [11279.495098] [<ffffffffb992082f>]
system_call_fastpath+0x1c/0x21
>>>>>> [11279.495102] [<ffffffffb992077b>] ?
system_call_after_swapgs+0xc8/
>>>>>> 0x160
>>>>>> [11279.495105] INFO: task glusterposixfsy:14941 blocked for more
>>>>>> than 120 seconds.
>>>>>> [11279.496606] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs"
>>>>>> disables this message.
>>>>>> [11279.498114] glusterposixfsy D ffff972495f84f10 0 14941
1
>>>>>> 0x00000080
>>>>>> [11279.498118] Call Trace:
>>>>>> [11279.498134] [<ffffffffc0309839>] ?
__split_and_process_bio+0x2e9/0x520
>>>>>> [dm_mod]
>>>>>> [11279.498138] [<ffffffffb9913f79>] schedule+0x29/0x70
>>>>>> [11279.498142] [<ffffffffb99118e9>]
schedule_timeout+0x239/0x2c0
>>>>>> [11279.498152] [<ffffffffc0309d98>] ?
dm_make_request+0x128/0x1a0
>>>>>> [dm_mod]
>>>>>> [11279.498156] [<ffffffffb991348d>]
io_schedule_timeout+0xad/0x130
>>>>>> [11279.498160] [<ffffffffb99145ad>]
wait_for_completion_io+0xfd/0x
>>>>>> 140
>>>>>> [11279.498165] [<ffffffffb92cf1b0>] ?
wake_up_state+0x20/0x20
>>>>>> [11279.498169] [<ffffffffb951e574>]
blkdev_issue_flush+0xb4/0x110
>>>>>> [11279.498202] [<ffffffffc06064b9>]
xfs_blkdev_issue_flush+0x19/0x20
>>>>>> [xfs]
>>>>>> [11279.498231] [<ffffffffc05eec40>]
xfs_file_fsync+0x1b0/0x1e0 [xfs]
>>>>>> [11279.498238] [<ffffffffb944f0e7>] do_fsync+0x67/0xb0
>>>>>> [11279.498242] [<ffffffffb992076f>] ?
system_call_after_swapgs+0xbc/
>>>>>> 0x160
>>>>>> [11279.498246] [<ffffffffb944f3f3>]
SyS_fdatasync+0x13/0x20
>>>>>> [11279.498250] [<ffffffffb992082f>]
system_call_fastpath+0x1c/0x21
>>>>>> [11279.498254] [<ffffffffb992077b>] ?
system_call_after_swapgs+0xc8/
>>>>>> 0x160
>>>>>> [11279.498257] INFO: task glusteriotwr1:14950 blocked for more
than
>>>>>> 120 seconds.
>>>>>> [11279.499789] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs"
>>>>>> disables this message.
>>>>>> [11279.501343] glusteriotwr1 D ffff97208b6daf70 0 14950
1
>>>>>> 0x00000080
>>>>>> [11279.501348] Call Trace:
>>>>>> [11279.501353] [<ffffffffb9913f79>] schedule+0x29/0x70
>>>>>> [11279.501390] [<ffffffffc060e388>]
_xfs_log_force_lsn+0x2e8/0x340
>>>>>> [xfs]
>>>>>> [11279.501396] [<ffffffffb92cf1b0>] ?
wake_up_state+0x20/0x20
>>>>>> [11279.501428] [<ffffffffc05eeb97>]
xfs_file_fsync+0x107/0x1e0 [xfs]
>>>>>> [11279.501432] [<ffffffffb944ef3f>]
generic_write_sync+0x4f/0x70
>>>>>> [11279.501461] [<ffffffffc05f0545>]
xfs_file_aio_write+0x155/0x1b0
>>>>>> [xfs]
>>>>>> [11279.501466] [<ffffffffb941a533>]
do_sync_write+0x93/0xe0
>>>>>> [11279.501471] [<ffffffffb941b010>] vfs_write+0xc0/0x1f0
>>>>>> [11279.501475] [<ffffffffb941c002>]
SyS_pwrite64+0x92/0xc0
>>>>>> [11279.501479] [<ffffffffb992076f>] ?
system_call_after_swapgs+0xbc/
>>>>>> 0x160
>>>>>> [11279.501483] [<ffffffffb992082f>]
system_call_fastpath+0x1c/0x21
>>>>>> [11279.501489] [<ffffffffb992077b>] ?
system_call_after_swapgs+0xc8/
>>>>>> 0x160
>>>>>> [11279.501493] INFO: task glusteriotwr4:14953 blocked for more
than
>>>>>> 120 seconds.
>>>>>> [11279.503047] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs"
>>>>>> disables this message.
>>>>>> [11279.504630] glusteriotwr4 D ffff972499f2bf40 0 14953
1
>>>>>> 0x00000080
>>>>>> [11279.504635] Call Trace:
>>>>>> [11279.504640] [<ffffffffb9913f79>] schedule+0x29/0x70
>>>>>> [11279.504676] [<ffffffffc060e388>]
_xfs_log_force_lsn+0x2e8/0x340
>>>>>> [xfs]
>>>>>> [11279.504681] [<ffffffffb92cf1b0>] ?
wake_up_state+0x20/0x20
>>>>>> [11279.504710] [<ffffffffc05eeb97>]
xfs_file_fsync+0x107/0x1e0 [xfs]
>>>>>> [11279.504714] [<ffffffffb944f0e7>] do_fsync+0x67/0xb0
>>>>>> [11279.504718] [<ffffffffb992076f>] ?
system_call_after_swapgs+0xbc/
>>>>>> 0x160
>>>>>> [11279.504722] [<ffffffffb944f3d0>] SyS_fsync+0x10/0x20
>>>>>> [11279.504725] [<ffffffffb992082f>]
system_call_fastpath+0x1c/0x21
>>>>>> [11279.504730] [<ffffffffb992077b>] ?
system_call_after_swapgs+0xc8/
>>>>>> 0x160
>>>>>> [12127.466494] perf: interrupt took too long (8263 > 8150),
lowering
>>>>>> kernel.perf_event_max_sample_rate to 24000
>>>>>>
>>>>>> --------------------
>>>>>> I think this is the cause of the massive ovirt performance issues
>>>>>> irrespective of gluster volume. At the time this happened, I was also
>>>>>> ssh'ed into the host, and was doing some rpm query commands. I had just
>>>>>> run rpm -qa | grep glusterfs (to verify what version was actually
>>>>>> installed), and that command took almost 2 minutes to return! Normally it
>>>>>> takes less than 2 seconds. That is all pure local SSD IO, too....
>>>>>>
>>>>>> I'm no expert, but it's my understanding that any time a piece of software
>>>>>> causes these kinds of issues, it's a serious bug in the software, even if
>>>>>> it's mis-handled exceptions. Is this correct?
>>>>>>
>>>>>> --Jim
>>>>>>
>>>>>> On Tue, May 29, 2018 at 3:01 PM, Jim Kusznir
<jim(a)palousetech.com>
>>>>>> wrote:
>>>>>>
>>>>>>> I think this is the profile information for one of the
volumes that
>>>>>>> lives on the SSDs and is fully operational with no
down/problem disks:
>>>>>>>
>>>>>>> [root@ovirt2 yum.repos.d]# gluster volume profile data info
>>>>>>> Brick: ovirt2.nwfiber.com:/gluster/brick2/data
>>>>>>> ----------------------------------------------
>>>>>>> Cumulative Stats:
>>>>>>> Block Size: 256b+ 512b+
>>>>>>> 1024b+
>>>>>>> No. of Reads: 983 2696
>>>>>>> 1059
>>>>>>> No. of Writes: 0 1113
>>>>>>> 302
>>>>>>>
>>>>>>> Block Size: 2048b+ 4096b+
>>>>>>> 8192b+
>>>>>>> No. of Reads: 852 88608
>>>>>>> 53526
>>>>>>> No. of Writes: 522 812340
>>>>>>> 76257
>>>>>>>
>>>>>>> Block Size: 16384b+ 32768b+
>>>>>>> 65536b+
>>>>>>> No. of Reads: 54351 241901
>>>>>>> 15024
>>>>>>> No. of Writes: 21636 8656
>>>>>>> 8976
>>>>>>>
>>>>>>> Block Size: 131072b+
>>>>>>> No. of Reads: 524156
>>>>>>> No. of Writes: 296071
>>>>>>> %-latency Avg-latency Min-Latency Max-Latency No.
of
>>>>>>> calls Fop
>>>>>>> --------- ----------- ----------- -----------
>>>>>>> ------------ ----
>>>>>>> 0.00 0.00 us 0.00 us 0.00 us
>>>>>>> 4189 RELEASE
>>>>>>> 0.00 0.00 us 0.00 us 0.00 us
>>>>>>> 1257 RELEASEDIR
>>>>>>> 0.00 46.19 us 12.00 us 187.00 us
>>>>>>> 69 FLUSH
>>>>>>> 0.00 147.00 us 78.00 us 367.00 us
86
>>>>>>> REMOVEXATTR
>>>>>>> 0.00 223.46 us 24.00 us 1166.00 us
>>>>>>> 149 READDIR
>>>>>>> 0.00 565.34 us 76.00 us 3639.00 us
>>>>>>> 88 FTRUNCATE
>>>>>>> 0.00 263.28 us 20.00 us 28385.00 us
>>>>>>> 228 LK
>>>>>>> 0.00 98.84 us 2.00 us 880.00 us
>>>>>>> 1198 OPENDIR
>>>>>>> 0.00 91.59 us 26.00 us 10371.00 us
>>>>>>> 3853 STATFS
>>>>>>> 0.00 494.14 us 17.00 us 193439.00 us
>>>>>>> 1171 GETXATTR
>>>>>>> 0.00 299.42 us 35.00 us 9799.00 us
>>>>>>> 2044 READDIRP
>>>>>>> 0.00 1965.31 us 110.00 us 382258.00 us
>>>>>>> 321 XATTROP
>>>>>>> 0.01 113.40 us 24.00 us 61061.00 us
>>>>>>> 8134 STAT
>>>>>>> 0.01 755.38 us 57.00 us 607603.00 us
>>>>>>> 3196 DISCARD
>>>>>>> 0.05 2690.09 us 58.00 us 2704761.00 us
>>>>>>> 3206 OPEN
>>>>>>> 0.10 119978.25 us 97.00 us 9406684.00 us
>>>>>>> 154 SETATTR
>>>>>>> 0.18 101.73 us 28.00 us 700477.00 us
>>>>>>> 313379 FSTAT
>>>>>>> 0.23 1059.84 us 25.00 us 2716124.00 us
>>>>>>> 38255 LOOKUP
>>>>>>> 0.47 1024.11 us 54.00 us 6197164.00 us
>>>>>>> 81455 FXATTROP
>>>>>>> 1.72 2984.00 us 15.00 us 37098954.00 us
>>>>>>> 103020 FINODELK
>>>>>>> 5.92 44315.32 us 51.00 us 24731536.00 us
>>>>>>> 23957 FSYNC
>>>>>>> 13.27 2399.78 us 25.00 us 22089540.00 us
>>>>>>> 991005 READ
>>>>>>> 37.00 5980.43 us 52.00 us 22099889.00 us
>>>>>>> 1108976 WRITE
>>>>>>> 41.04 5452.75 us 13.00 us 22102452.00 us
>>>>>>> 1349053 INODELK
>>>>>>>
>>>>>>> Duration: 10026 seconds
>>>>>>> Data Read: 80046027759 bytes
>>>>>>> Data Written: 44496632320 bytes
>>>>>>>
>>>>>>> Interval 1 Stats:
>>>>>>> Block Size: 256b+ 512b+
>>>>>>> 1024b+
>>>>>>> No. of Reads: 983 2696
>>>>>>> 1059
>>>>>>> No. of Writes: 0 838
>>>>>>> 185
>>>>>>>
>>>>>>> Block Size: 2048b+ 4096b+
>>>>>>> 8192b+
>>>>>>> No. of Reads: 852 85856
>>>>>>> 51575
>>>>>>> No. of Writes: 382 705802
>>>>>>> 57812
>>>>>>>
>>>>>>> Block Size: 16384b+ 32768b+
>>>>>>> 65536b+
>>>>>>> No. of Reads: 52673 232093
>>>>>>> 14984
>>>>>>> No. of Writes: 13499 4908
>>>>>>> 4242
>>>>>>>
>>>>>>> Block Size: 131072b+
>>>>>>> No. of Reads: 460040
>>>>>>> No. of Writes: 6411
>>>>>>> %-latency Avg-latency Min-Latency Max-Latency No.
of
>>>>>>> calls Fop
>>>>>>> --------- ----------- ----------- -----------
>>>>>>> ------------ ----
>>>>>>> 0.00 0.00 us 0.00 us 0.00 us
>>>>>>> 2093 RELEASE
>>>>>>> 0.00 0.00 us 0.00 us 0.00 us
>>>>>>> 1093 RELEASEDIR
>>>>>>> 0.00 53.38 us 26.00 us 111.00 us
>>>>>>> 16 FLUSH
>>>>>>> 0.00 145.14 us 78.00 us 367.00 us
71
>>>>>>> REMOVEXATTR
>>>>>>> 0.00 190.96 us 114.00 us 298.00 us
>>>>>>> 71 SETATTR
>>>>>>> 0.00 213.38 us 24.00 us 1145.00 us
>>>>>>> 90 READDIR
>>>>>>> 0.00 263.28 us 20.00 us 28385.00 us
>>>>>>> 228 LK
>>>>>>> 0.00 101.76 us 2.00 us 880.00 us
>>>>>>> 1093 OPENDIR
>>>>>>> 0.01 93.60 us 27.00 us 10371.00 us
>>>>>>> 3090 STATFS
>>>>>>> 0.02 537.47 us 17.00 us 193439.00 us
>>>>>>> 1038 GETXATTR
>>>>>>> 0.03 297.44 us 35.00 us 9799.00 us
>>>>>>> 1990 READDIRP
>>>>>>> 0.03 2357.28 us 110.00 us 382258.00 us
>>>>>>> 253 XATTROP
>>>>>>> 0.04 385.93 us 58.00 us 47593.00 us
>>>>>>> 2091 OPEN
>>>>>>> 0.04 114.86 us 24.00 us 61061.00 us
>>>>>>> 7715 STAT
>>>>>>> 0.06 444.59 us 57.00 us 333240.00 us
>>>>>>> 3053 DISCARD
>>>>>>> 0.42 316.24 us 25.00 us 290728.00 us
>>>>>>> 29823 LOOKUP
>>>>>>> 0.73 257.92 us 54.00 us 344812.00 us
>>>>>>> 63296 FXATTROP
>>>>>>> 1.37 98.30 us 28.00 us 67621.00 us
>>>>>>> 313172 FSTAT
>>>>>>> 1.58 2124.69 us 51.00 us 849200.00 us
>>>>>>> 16717 FSYNC
>>>>>>> 5.73 162.46 us 52.00 us 748492.00 us
>>>>>>> 794079 WRITE
>>>>>>> 7.19 2065.17 us 16.00 us 37098954.00 us
>>>>>>> 78381 FINODELK
>>>>>>> 36.44 886.32 us 25.00 us 2216436.00 us
>>>>>>> 925421 READ
>>>>>>> 46.30 1178.04 us 13.00 us 1700704.00 us
>>>>>>> 884635 INODELK
>>>>>>>
>>>>>>> Duration: 7485 seconds
>>>>>>> Data Read: 71250527215 bytes
>>>>>>> Data Written: 5119903744 bytes
>>>>>>>
>>>>>>> Brick: ovirt3.nwfiber.com:/gluster/brick2/data
>>>>>>> ----------------------------------------------
>>>>>>> Cumulative Stats:
>>>>>>> Block Size: 1b+
>>>>>>> No. of Reads: 0
>>>>>>> No. of Writes: 3264419
>>>>>>> %-latency Avg-latency Min-Latency Max-Latency No.
of
>>>>>>> calls Fop
>>>>>>> --------- ----------- ----------- -----------
>>>>>>> ------------ ----
>>>>>>> 0.00 0.00 us 0.00 us 0.00 us
>>>>>>> 90 FORGET
>>>>>>> 0.00 0.00 us 0.00 us 0.00 us
>>>>>>> 9462 RELEASE
>>>>>>> 0.00 0.00 us 0.00 us 0.00 us
>>>>>>> 4254 RELEASEDIR
>>>>>>> 0.00 50.52 us 13.00 us 190.00 us
>>>>>>> 71 FLUSH
>>>>>>> 0.00 186.97 us 87.00 us 713.00 us
86
>>>>>>> REMOVEXATTR
>>>>>>> 0.00 79.32 us 33.00 us 189.00 us
>>>>>>> 228 LK
>>>>>>> 0.00 220.98 us 129.00 us 513.00 us
>>>>>>> 86 SETATTR
>>>>>>> 0.01 259.30 us 26.00 us 2632.00 us
>>>>>>> 137 READDIR
>>>>>>> 0.02 322.76 us 145.00 us 2125.00 us
>>>>>>> 321 XATTROP
>>>>>>> 0.03 109.55 us 2.00 us 1258.00 us
>>>>>>> 1193 OPENDIR
>>>>>>> 0.05 70.21 us 21.00 us 431.00 us
>>>>>>> 3196 DISCARD
>>>>>>> 0.05 169.26 us 21.00 us 2315.00 us
>>>>>>> 1545 GETXATTR
>>>>>>> 0.12 176.85 us 63.00 us 2844.00 us
>>>>>>> 3206 OPEN
>>>>>>> 0.61 303.49 us 90.00 us 3085.00 us
>>>>>>> 9633 FSTAT
>>>>>>> 2.44 305.66 us 28.00 us 3716.00 us
>>>>>>> 38230 LOOKUP
>>>>>>> 4.52 266.22 us 55.00 us 53424.00 us
>>>>>>> 81455 FXATTROP
>>>>>>> 6.96 1397.99 us 51.00 us 64822.00 us
>>>>>>> 23889 FSYNC
>>>>>>> 16.48 84.74 us 25.00 us 6917.00 us
>>>>>>> 932592 WRITE
>>>>>>> 30.16 106.90 us 13.00 us 3920189.00 us
>>>>>>> 1353046 INODELK
>>>>>>> 38.55 1794.52 us 14.00 us 16210553.00 us
>>>>>>> 103039 FINODELK
>>>>>>>
>>>>>>> Duration: 66562 seconds
>>>>>>> Data Read: 0 bytes
>>>>>>> Data Written: 3264419 bytes
>>>>>>>
>>>>>>> Interval 1 Stats:
>>>>>>> Block Size: 1b+
>>>>>>> No. of Reads: 0
>>>>>>> No. of Writes: 794080
>>>>>>> %-latency Avg-latency Min-Latency Max-Latency No.
of
>>>>>>> calls Fop
>>>>>>> --------- ----------- ----------- -----------
>>>>>>> ------------ ----
>>>>>>> 0.00 0.00 us 0.00 us 0.00 us
>>>>>>> 2093 RELEASE
>>>>>>> 0.00 0.00 us 0.00 us 0.00 us
>>>>>>> 1093 RELEASEDIR
>>>>>>> 0.00 70.31 us 26.00 us 125.00 us
>>>>>>> 16 FLUSH
>>>>>>> 0.00 193.10 us 103.00 us 713.00 us
71
>>>>>>> REMOVEXATTR
>>>>>>> 0.01 227.32 us 133.00 us 513.00 us
>>>>>>> 71 SETATTR
>>>>>>> 0.01 79.32 us 33.00 us 189.00 us
>>>>>>> 228 LK
>>>>>>> 0.01 259.83 us 35.00 us 1138.00 us
>>>>>>> 89 READDIR
>>>>>>> 0.03 318.26 us 145.00 us 2047.00 us
>>>>>>> 253 XATTROP
>>>>>>> 0.04 112.67 us 3.00 us 1258.00 us
>>>>>>> 1093 OPENDIR
>>>>>>> 0.06 167.98 us 23.00 us 1951.00 us
>>>>>>> 1014 GETXATTR
>>>>>>> 0.08 70.97 us 22.00 us 431.00 us
>>>>>>> 3053 DISCARD
>>>>>>> 0.13 183.78 us 66.00 us 2844.00 us
>>>>>>> 2091 OPEN
>>>>>>> 1.01 303.82 us 90.00 us 3085.00 us
>>>>>>> 9610 FSTAT
>>>>>>> 3.27 316.59 us 30.00 us 3716.00 us
>>>>>>> 29820 LOOKUP
>>>>>>> 5.83 265.79 us 59.00 us 53424.00 us
>>>>>>> 63296 FXATTROP
>>>>>>> 7.95 1373.89 us 51.00 us 64822.00 us
>>>>>>> 16717 FSYNC
>>>>>>> 23.17 851.99 us 14.00 us 16210553.00 us
>>>>>>> 78555 FINODELK
>>>>>>> 24.04 87.44 us 27.00 us 6917.00 us
>>>>>>> 794081 WRITE
>>>>>>> 34.36 111.91 us 14.00 us 984871.00 us
>>>>>>> 886790 INODELK
>>>>>>>
>>>>>>> Duration: 7485 seconds
>>>>>>> Data Read: 0 bytes
>>>>>>> Data Written: 794080 bytes
>>>>>>>
>>>>>>>
>>>>>>> -----------------------
>>>>>>> Here is the data from the volume that is backed by the SSHDs and
>>>>>>> has one failed disk:
>>>>>>> [root@ovirt2 yum.repos.d]# gluster volume profile data-hdd
info
>>>>>>> Brick: 172.172.1.12:/gluster/brick3/data-hdd
>>>>>>> --------------------------------------------
>>>>>>> Cumulative Stats:
>>>>>>> Block Size: 256b+ 512b+
>>>>>>> 1024b+
>>>>>>> No. of Reads: 1702 86
>>>>>>> 16
>>>>>>> No. of Writes: 0 767
>>>>>>> 71
>>>>>>>
>>>>>>> Block Size: 2048b+ 4096b+
>>>>>>> 8192b+
>>>>>>> No. of Reads: 19 51841
>>>>>>> 2049
>>>>>>> No. of Writes: 76 60668
>>>>>>> 35727
>>>>>>>
>>>>>>> Block Size: 16384b+ 32768b+
>>>>>>> 65536b+
>>>>>>> No. of Reads: 1744 639
>>>>>>> 1088
>>>>>>> No. of Writes: 8524 2410
>>>>>>> 1285
>>>>>>>
>>>>>>> Block Size: 131072b+
>>>>>>> No. of Reads: 771999
>>>>>>> No. of Writes: 29584
>>>>>>> %-latency Avg-latency Min-Latency Max-Latency No.
of
>>>>>>> calls Fop
>>>>>>> --------- ----------- ----------- -----------
>>>>>>> ------------ ----
>>>>>>> 0.00 0.00 us 0.00 us 0.00 us
>>>>>>> 2902 RELEASE
>>>>>>> 0.00 0.00 us 0.00 us 0.00 us
>>>>>>> 1517 RELEASEDIR
>>>>>>> 0.00 197.00 us 197.00 us 197.00 us
>>>>>>> 1 FTRUNCATE
>>>>>>> 0.00 70.24 us 16.00 us 758.00 us
>>>>>>> 51 FLUSH
>>>>>>> 0.00 143.93 us 82.00 us 305.00 us
57
>>>>>>> REMOVEXATTR
>>>>>>> 0.00 178.63 us 105.00 us 712.00 us
>>>>>>> 60 SETATTR
>>>>>>> 0.00 67.30 us 19.00 us 572.00 us
>>>>>>> 555 LK
>>>>>>> 0.00 322.80 us 23.00 us 4673.00 us
>>>>>>> 138 READDIR
>>>>>>> 0.00 336.56 us 106.00 us 11994.00 us
>>>>>>> 237 XATTROP
>>>>>>> 0.00 84.70 us 28.00 us 1071.00 us
>>>>>>> 3469 STATFS
>>>>>>> 0.01 387.75 us 2.00 us 146017.00 us
>>>>>>> 1467 OPENDIR
>>>>>>> 0.01 148.59 us 21.00 us 64374.00 us
>>>>>>> 4454 STAT
>>>>>>> 0.02 783.02 us 16.00 us 93502.00 us
>>>>>>> 1902 GETXATTR
>>>>>>> 0.03 1516.10 us 17.00 us 210690.00 us
>>>>>>> 1364 ENTRYLK
>>>>>>> 0.03 2555.47 us 300.00 us 674454.00 us
>>>>>>> 1064 READDIRP
>>>>>>> 0.07 85.74 us 19.00 us 68340.00 us
>>>>>>> 62849 FSTAT
>>>>>>> 0.07 1978.12 us 59.00 us 202596.00 us
>>>>>>> 2729 OPEN
>>>>>>> 0.22 708.57 us 15.00 us 394799.00 us
>>>>>>> 25447 LOOKUP
>>>>>>> 5.94 2331.74 us 15.00 us 1099530.00 us
>>>>>>> 207534 FINODELK
>>>>>>> 7.31 8311.75 us 58.00 us 1800216.00 us
>>>>>>> 71668 FXATTROP
>>>>>>> 12.49 7735.19 us 51.00 us 3595513.00 us
>>>>>>> 131642 WRITE
>>>>>>> 17.70 957.08 us 16.00 us 13700466.00 us
>>>>>>> 1508160 INODELK
>>>>>>> 24.55 2546.43 us 26.00 us 5077347.00 us
>>>>>>> 786060 READ
>>>>>>> 31.56 49699.15 us 47.00 us 3746331.00 us
>>>>>>> 51777 FSYNC
>>>>>>>
>>>>>>> Duration: 10101 seconds
>>>>>>> Data Read: 101562897361 bytes
>>>>>>> Data Written: 4834450432 bytes
>>>>>>>
>>>>>>> Interval 0 Stats:
>>>>>>> Block Size: 256b+ 512b+
>>>>>>> 1024b+
>>>>>>> No. of Reads: 1702 86
>>>>>>> 16
>>>>>>> No. of Writes: 0 767
>>>>>>> 71
>>>>>>>
>>>>>>> Block Size: 2048b+ 4096b+
>>>>>>> 8192b+
>>>>>>> No. of Reads: 19 51841
>>>>>>> 2049
>>>>>>> No. of Writes: 76 60668
>>>>>>> 35727
>>>>>>>
>>>>>>> Block Size: 16384b+ 32768b+
>>>>>>> 65536b+
>>>>>>> No. of Reads: 1744 639
>>>>>>> 1088
>>>>>>> No. of Writes: 8524 2410
>>>>>>> 1285
>>>>>>>
>>>>>>> Block Size: 131072b+
>>>>>>> No. of Reads: 771999
>>>>>>> No. of Writes: 29584
>>>>>>> %-latency Avg-latency Min-Latency Max-Latency No.
of
>>>>>>> calls Fop
>>>>>>> --------- ----------- ----------- -----------
>>>>>>> ------------ ----
>>>>>>> 0.00 0.00 us 0.00 us 0.00 us
>>>>>>> 2902 RELEASE
>>>>>>> 0.00 0.00 us 0.00 us 0.00 us
>>>>>>> 1517 RELEASEDIR
>>>>>>> 0.00 197.00 us 197.00 us 197.00 us
>>>>>>> 1 FTRUNCATE
>>>>>>> 0.00 70.24 us 16.00 us 758.00 us
>>>>>>> 51 FLUSH
>>>>>>> 0.00 143.93 us 82.00 us 305.00 us
57
>>>>>>> REMOVEXATTR
>>>>>>> 0.00 178.63 us 105.00 us 712.00 us
>>>>>>> 60 SETATTR
>>>>>>> 0.00 67.30 us 19.00 us 572.00 us
>>>>>>> 555 LK
>>>>>>> 0.00 322.80 us 23.00 us 4673.00 us
>>>>>>> 138 READDIR
>>>>>>> 0.00 336.56 us 106.00 us 11994.00 us
>>>>>>> 237 XATTROP
>>>>>>> 0.00 84.70 us 28.00 us 1071.00 us
>>>>>>> 3469 STATFS
>>>>>>> 0.01 387.75 us 2.00 us 146017.00 us
>>>>>>> 1467 OPENDIR
>>>>>>> 0.01 148.59 us 21.00 us 64374.00 us
>>>>>>> 4454 STAT
>>>>>>> 0.02 783.02 us 16.00 us 93502.00 us
>>>>>>> 1902 GETXATTR
>>>>>>> 0.03 1516.10 us 17.00 us 210690.00 us
>>>>>>> 1364 ENTRYLK
>>>>>>> 0.03 2555.47 us 300.00 us 674454.00 us
>>>>>>> 1064 READDIRP
>>>>>>> 0.07 85.73 us 19.00 us 68340.00 us
>>>>>>> 62849 FSTAT
>>>>>>> 0.07 1978.12 us 59.00 us 202596.00 us
>>>>>>> 2729 OPEN
>>>>>>> 0.22 708.57 us 15.00 us 394799.00 us
>>>>>>> 25447 LOOKUP
>>>>>>> 5.94 2334.57 us 15.00 us 1099530.00 us
>>>>>>> 207534 FINODELK
>>>>>>> 7.31 8311.49 us 58.00 us 1800216.00 us
>>>>>>> 71668 FXATTROP
>>>>>>> 12.49 7735.32 us 51.00 us 3595513.00 us
>>>>>>> 131642 WRITE
>>>>>>> 17.71 957.08 us 16.00 us 13700466.00 us
>>>>>>> 1508160 INODELK
>>>>>>> 24.56 2546.42 us 26.00 us 5077347.00 us
>>>>>>> 786060 READ
>>>>>>> 31.54 49651.63 us 47.00 us 3746331.00 us
>>>>>>> 51777 FSYNC
>>>>>>>
>>>>>>> Duration: 10101 seconds
>>>>>>> Data Read: 101562897361 bytes
>>>>>>> Data Written: 4834450432 bytes
>>>>>>>
>>>>>>>
>>>>>>> On Tue, May 29, 2018 at 2:55 PM, Jim Kusznir
<jim(a)palousetech.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Thank you for your response.
>>>>>>>>
>>>>>>>> I have 4 gluster volumes. 3 are replica 2 + arbitrator.
replica
>>>>>>>> bricks are on ovirt1 and ovirt2, arbitrator on ovirt3.
The 4th volume is
>>>>>>>> replica 3, with a brick on all three ovirt machines.
>>>>>>>>
>>>>>>>> The first 3 volumes are on an SSD disk; the 4th is on a
Seagate
>>>>>>>> SSHD (same in all three machines). On ovirt3, the SSHD
has reported hard
>>>>>>>> IO failures, and that brick is offline. However, the
other two replicas
>>>>>>>> are fully operational (although they still show contents
in the heal info
>>>>>>>> command that won't go away, but that may be the case
until I replace the
>>>>>>>> failed disk).
>>>>>>>>
>>>>>>>> What is bothering me is that ALL 4 gluster volumes are
showing
>>>>>>>> horrible performance issues. At this point, as the bad
disk has been
>>>>>>>> completely offlined, I would expect gluster to perform at
normal speed, but
>>>>>>>> that is definitely not the case.
>>>>>>>>
>>>>>>>> I've also noticed that the performance hits seem to come in waves:
>>>>>>>> things seem to work acceptably (but slow) for a while, then suddenly it's
>>>>>>>> as if all disk IO on all volumes (including non-gluster local OS disk
>>>>>>>> volumes for the hosts) pauses for about 30 seconds, then IO resumes again.
>>>>>>>> During those times, I start getting VM not responding and
host not
>>>>>>>> responding notices as well as the applications having
major issues.
>>>>>>>>
>>>>>>>> I've shut down most of my VMs and am down to just my
essential
>>>>>>>> core VMs (shedded about 75% of my VMs). I still am
experiencing the same
>>>>>>>> issues.
>>>>>>>>
>>>>>>>> Am I correct in believing that once the failed disk was
brought
>>>>>>>> offline that performance should return to normal?
>>>>>>>>
>>>>>>>> On Tue, May 29, 2018 at 1:27 PM, Alex K
<rightkicktech(a)gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> I would check disks status and accessibility of mount
points
>>>>>>>>> where your gluster volumes reside.
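>>>>>>>>>
>>>>>>>>> For example, something like smartctl -a /dev/sdX for each disk (device
>>>>>>>>> name is a placeholder), plus df -h and mount, to confirm the brick
>>>>>>>>> filesystems are healthy and still mounted read-write.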
>>>>>>>>>
>>>>>>>>> On Tue, May 29, 2018, 22:28 Jim Kusznir
<jim(a)palousetech.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> On one ovirt server, I'm now seeing these
messages:
>>>>>>>>>> [56474.239725] blk_update_request: 63 callbacks suppressed
>>>>>>>>>> [56474.239732] blk_update_request: I/O error, dev dm-2, sector 0
>>>>>>>>>> [56474.240602] blk_update_request: I/O error, dev dm-2, sector 3905945472
>>>>>>>>>> [56474.241346] blk_update_request: I/O error, dev dm-2, sector 3905945584
>>>>>>>>>> [56474.242236] blk_update_request: I/O error, dev dm-2, sector 2048
>>>>>>>>>> [56474.243072] blk_update_request: I/O error, dev dm-2, sector 3905943424
>>>>>>>>>> [56474.243997] blk_update_request: I/O error, dev dm-2, sector 3905943536
>>>>>>>>>> [56474.247347] blk_update_request: I/O error, dev dm-2, sector 0
>>>>>>>>>> [56474.248315] blk_update_request: I/O error, dev dm-2, sector 3905945472
>>>>>>>>>> [56474.249231] blk_update_request: I/O error, dev dm-2, sector 3905945584
>>>>>>>>>> [56474.250221] blk_update_request: I/O error, dev dm-2, sector 2048
>>>>>>>>>>
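Since dm-2 is a device-mapper node, the usual next step is to map it back to the LV and physical disk behind it; a short sketch (no particular volume layout assumed):

    lsblk                      # shows dm-2 in the device tree under its parent disk
    ls -l /dev/mapper          # symlink targets reveal which VG/LV dm-2 belongs to
    dmsetup ls --tree          # device-mapper dependency tree
    lvs -a -o +devices         # which physical device backs each LV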
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Tue, May 29, 2018 at 11:59 AM, Jim Kusznir <jim(a)palousetech.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> I see in messages on ovirt3 (my 3rd machine, the one upgraded to 4.2):
>>>>>>>>>>>
>>>>>>>>>>> May 29 11:54:41 ovirt3 ovs-vsctl: ovs|00001|db_ctl_base|ERR|unix:/var/run/openvswitch/db.sock: database connection failed (No such file or directory)
>>>>>>>>>>> May 29 11:54:51 ovirt3 ovs-vsctl: ovs|00001|db_ctl_base|ERR|unix:/var/run/openvswitch/db.sock: database connection failed (No such file or directory)
>>>>>>>>>>> May 29 11:55:01 ovirt3 ovs-vsctl: ovs|00001|db_ctl_base|ERR|unix:/var/run/openvswitch/db.sock: database connection failed (No such file or directory)
>>>>>>>>>>> (appears a lot).
>>>>>>>>>>>
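Those ovs-vsctl messages generally just mean the openvswitch daemons are not running on that host, which is often harmless noise if the cluster is not using OVS-based networking; a quick check/restart sketch (not oVirt-specific guidance):

    systemctl status ovsdb-server openvswitch
    systemctl restart openvswitch    # only if OVS networking is actually in use
    ovs-vsctl show                   # should connect once the daemon is up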
>>>>>>>>>>> I also found, in the ssh session on that machine, some sysv warnings
>>>>>>>>>>> about the backing disk for one of the gluster volumes (straight
>>>>>>>>>>> replica 3). The glusterfs process for that disk on that machine went
>>>>>>>>>>> offline. It's my understanding that it should continue to work with
>>>>>>>>>>> the other two machines while I attempt to replace that disk, right?
>>>>>>>>>>> Attempted writes (touching an empty file) can take 15 seconds, though
>>>>>>>>>>> repeating it later is much faster.
>>>>>>>>>>>
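One way to put numbers on that is to time small writes directly on the fuse mount; the path below is only an illustration of oVirt's usual glusterSD mount location, so substitute the real mount point:

    MNT=/rhev/data-center/mnt/glusterSD/<server>:_<volume>   # hypothetical placeholder path
    time touch $MNT/latency-probe
    time dd if=/dev/zero of=$MNT/ddtest bs=1M count=100 oflag=direct
    rm -f $MNT/latency-probe $MNT/ddtest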
>>>>>>>>>>> Gluster generates a bunch of different log files; I don't know which
>>>>>>>>>>> ones you want, or from which machine(s).
>>>>>>>>>>>
>>>>>>>>>>> How do I do "volume profiling"?
>>>>>>>>>>>
>>>>>>>>>>> Thanks!
>>>>>>>>>>>
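On the volume-profiling question: gluster ships a per-volume profiler that is started, left running while the slowness is happening, and then dumped; a minimal sketch using data-hdd as the example volume:

    gluster volume profile data-hdd start
    # ...reproduce the slow behaviour, then:
    gluster volume profile data-hdd info > /tmp/data-hdd-profile.txt
    gluster volume profile data-hdd stop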
>>>>>>>>>>> On Tue, May 29, 2018 at 11:53 AM, Sahina Bose <sabose(a)redhat.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Do you see errors reported in the mount logs for the volume? If so,
>>>>>>>>>>>> could you attach the logs? Any issues with your underlying disks?
>>>>>>>>>>>> Can you also attach the output of volume profiling?
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, May 30, 2018 at 12:13 AM, Jim Kusznir <jim(a)palousetech.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Ok, things have gotten MUCH worse this morning. I'm getting random
>>>>>>>>>>>>> errors from VMs; right now, about a third of my VMs have been
>>>>>>>>>>>>> paused due to storage issues, and most of the remaining VMs are not
>>>>>>>>>>>>> performing well.
>>>>>>>>>>>>>
>>>>>>>>>>>>> At this point, I am in full EMERGENCY mode, as my production
>>>>>>>>>>>>> services are now impacted, and I'm getting calls coming in with
>>>>>>>>>>>>> problems...
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'd greatly appreciate help...VMs are running VERY slowly (when
>>>>>>>>>>>>> they run), and they are steadily getting worse. I don't know why. I
>>>>>>>>>>>>> was seeing CPU peaks (to 100%) on several VMs, in perfect sync, for
>>>>>>>>>>>>> a few minutes at a time (while the VM became unresponsive, and any
>>>>>>>>>>>>> Linux VMs I was logged into were giving me the CPU-stuck messages
>>>>>>>>>>>>> from my original post). Is all this storage related?
>>>>>>>>>>>>>
>>>>>>>>>>>>> I also have two different gluster volumes for VM storage, and only
>>>>>>>>>>>>> one had the issues, but now VMs in both are being affected at the
>>>>>>>>>>>>> same time and in the same way.
>>>>>>>>>>>>>
>>>>>>>>>>>>> --Jim
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, May 28, 2018 at 10:50 PM, Sahina Bose <sabose(a)redhat.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> [Adding gluster-users to look at the heal issue]
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, May 29, 2018 at 9:17 AM, Jim Kusznir <jim(a)palousetech.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hello:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I've been having some cluster and gluster performance issues
>>>>>>>>>>>>>>> lately. I also found that my cluster was out of date and was
>>>>>>>>>>>>>>> trying to apply updates (hoping to fix some of these), and
>>>>>>>>>>>>>>> discovered that the ovirt 4.1 repos were taken completely
>>>>>>>>>>>>>>> offline. So, I was forced to begin an upgrade to 4.2. According
>>>>>>>>>>>>>>> to the docs I found/read, I needed only to add the new repo, do a
>>>>>>>>>>>>>>> yum update, reboot, and be good on my hosts (I did the yum
>>>>>>>>>>>>>>> update, then engine-setup on my hosted engine). Things seemed to
>>>>>>>>>>>>>>> work relatively well, except for a gluster sync issue that showed
>>>>>>>>>>>>>>> up.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> My cluster is a 3-node hyperconverged cluster. I upgraded the
>>>>>>>>>>>>>>> hosted engine first, then engine 3. When engine 3 came back up,
>>>>>>>>>>>>>>> for some reason one of my gluster volumes would not sync. Here's
>>>>>>>>>>>>>>> sample output:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> [root@ovirt3 ~]# gluster volume heal data-hdd info
>>>>>>>>>>>>>>> Brick 172.172.1.11:/gluster/brick3/data-hdd
>>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/48d7ecb8-7ac5-4725-bca5-b3519681cf2f/0d6080b0-7018-4fa3-bb82-1dd9ef07d9b9
>>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/647be733-f153-4cdc-85bd-ba72544c2631/b453a300-0602-4be1-8310-8bd5abe00971
>>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/6da854d1-b6be-446b-9bf0-90a0dbbea830/3c93bd1f-b7fa-4aa2-b445-6904e31839ba
>>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/7f647567-d18c-44f1-a58e-9b8865833acb/f9364470-9770-4bb1-a6b9-a54861849625
>>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/f3c8e7aa-6ef2-42a7-93d4-e0a4df6dd2fa/2eb0b1ad-2606-44ef-9cd3-ae59610a504b
>>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/b1ea3f62-0f05-4ded-8c82-9c91c90e0b61/d5d6bf5a-499f-431d-9013-5453db93ed32
>>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/8c8b5147-e9d6-4810-b45b-185e3ed65727/16f08231-93b0-489d-a2fd-687b6bf88eaa
>>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/12924435-b9c2-4aab-ba19-1c1bc31310ef/07b3db69-440e-491e-854c-bbfa18a7cff2
>>>>>>>>>>>>>>> Status: Connected
>>>>>>>>>>>>>>> Number of entries: 8
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Brick 172.172.1.12:/gluster/brick3/data-hdd
>>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/48d7ecb8-7ac5-4725-bca5-b3519681cf2f/0d6080b0-7018-4fa3-bb82-1dd9ef07d9b9
>>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/647be733-f153-4cdc-85bd-ba72544c2631/b453a300-0602-4be1-8310-8bd5abe00971
>>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/b1ea3f62-0f05-4ded-8c82-9c91c90e0b61/d5d6bf5a-499f-431d-9013-5453db93ed32
>>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/6da854d1-b6be-446b-9bf0-90a0dbbea830/3c93bd1f-b7fa-4aa2-b445-6904e31839ba
>>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/7f647567-d18c-44f1-a58e-9b8865833acb/f9364470-9770-4bb1-a6b9-a54861849625
>>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/8c8b5147-e9d6-4810-b45b-185e3ed65727/16f08231-93b0-489d-a2fd-687b6bf88eaa
>>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/12924435-b9c2-4aab-ba19-1c1bc31310ef/07b3db69-440e-491e-854c-bbfa18a7cff2
>>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/f3c8e7aa-6ef2-42a7-93d4-e0a4df6dd2fa/2eb0b1ad-2606-44ef-9cd3-ae59610a504b
>>>>>>>>>>>>>>> Status: Connected
>>>>>>>>>>>>>>> Number of entries: 8
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Brick 172.172.1.13:/gluster/brick3/data-hdd
>>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/b1ea3f62-0f05-4ded-8c82-9c91c90e0b61/d5d6bf5a-499f-431d-9013-5453db93ed32
>>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/8c8b5147-e9d6-4810-b45b-185e3ed65727/16f08231-93b0-489d-a2fd-687b6bf88eaa
>>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/12924435-b9c2-4aab-ba19-1c1bc31310ef/07b3db69-440e-491e-854c-bbfa18a7cff2
>>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/f3c8e7aa-6ef2-42a7-93d4-e0a4df6dd2fa/2eb0b1ad-2606-44ef-9cd3-ae59610a504b
>>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/647be733-f153-4cdc-85bd-ba72544c2631/b453a300-0602-4be1-8310-8bd5abe00971
>>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/48d7ecb8-7ac5-4725-bca5-b3519681cf2f/0d6080b0-7018-4fa3-bb82-1dd9ef07d9b9
>>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/6da854d1-b6be-446b-9bf0-90a0dbbea830/3c93bd1f-b7fa-4aa2-b445-6904e31839ba
>>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/7f647567-d18c-44f1-a58e-9b8865833acb/f9364470-9770-4bb1-a6b9-a54861849625
>>>>>>>>>>>>>>> Status: Connected
>>>>>>>>>>>>>>> Number of entries: 8
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> ---------
>>>>>>>>>>>>>>> It's been in this state for a couple of days now, and bandwidth
>>>>>>>>>>>>>>> monitoring shows no appreciable data moving. I've tried
>>>>>>>>>>>>>>> repeatedly commanding a full heal from all three nodes in the
>>>>>>>>>>>>>>> cluster. It's always the same files that need healing.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> When running gluster volume heal data-hdd statistics, I
>>>>>>>>>>>>>>> sometimes see different information, but always some number of
>>>>>>>>>>>>>>> "heal failed" entries. It shows 0 for split-brain.
>>>>>>>>>>>>>>>
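For completeness, the heal commands being discussed here, plus a couple of ways to watch progress and rule out split-brain, sketched against the same data-hdd volume:

    gluster volume heal data-hdd full                    # queue a full heal
    gluster volume heal data-hdd info                    # entries still pending
    gluster volume heal data-hdd statistics heal-count   # pending count per brick
    gluster volume heal data-hdd info split-brain        # should stay empty
    less /var/log/glusterfs/glustershd.log               # self-heal daemon log ('heal failed' reasons)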
>>>>>>>>>>>>>>> I'm not quite sure what to do. I suspect it may be due to nodes
>>>>>>>>>>>>>>> 1 and 2 still being on the older ovirt/gluster release, but I'm
>>>>>>>>>>>>>>> afraid to upgrade and reboot them until I have a good gluster
>>>>>>>>>>>>>>> sync (I don't need to create a split-brain issue). How do I
>>>>>>>>>>>>>>> proceed with this?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Second issue: I've been experiencing VERY POOR performance on
>>>>>>>>>>>>>>> most of my VMs. To the point that logging into a Windows 10 VM
>>>>>>>>>>>>>>> via remote desktop can take 5 minutes, and launching QuickBooks
>>>>>>>>>>>>>>> inside said VM can easily take 10 minutes. On some Linux VMs, I
>>>>>>>>>>>>>>> get random messages like this:
>>>>>>>>>>>>>>> Message from syslogd@unifi at May 28 20:39:23 ...
>>>>>>>>>>>>>>> kernel:[6171996.308904] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [mongod:14766]
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> (the process and PID are often different)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I'm not quite sure what to do about this either. My initial
>>>>>>>>>>>>>>> thought was to upgrade everything to current and see if it's
>>>>>>>>>>>>>>> still there, but I cannot move forward with that until my gluster
>>>>>>>>>>>>>>> is healed...
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>>>> --Jim
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
> _______________________________________________
> Users mailing list -- users(a)ovirt.org
> To unsubscribe send an email to users-leave(a)ovirt.org
> Privacy Statement: https://www.ovirt.org/site/privacy-policy/
> oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
> List Archives: https://lists.ovirt.org/archives/list/users(a)ovirt.org/message/27P5WCXLKQ6T3UPN7BGXJWHRJWA2E54S/
>
>