[ovirt-users] VM randomly unresponsive

Tuesday, 13 November 2018

Hi all,
I am still trying to understand my problem between (I suppose) oVirt and Gluster.
My recent posts titled 'VMs unexpectidly restarted' produced neither a solution
nor a lead to investigate, so I am submitting another (related?) problem.
Alongside the problem of VMs going down (which has not recurred since Oct 16), I
randomly get events in the GUI saying "VM xxxxx is not responding." For
example, VM "patjoub1" on 2018-11-11 14:34. It never happens at the same hour and not
every day; it is often this VM patjoub1, but not always: I have seen it on two others. All VM disks are on
a volume DATA02 (with leases on the same volume).

Searching in engine.log, I found:
2018-11-11 14:34:32,953+01 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer]
(EE-ManagedThreadFactory-engineScheduled-Thread-28) [] VM
'6116fb07-096b-4c7e-97fe-01ecc9a6bd9b'(patjoub1) moved from 'Up' -->
'NotResponding'
2018-11-11 14:34:33,116+01 WARN
 [org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerObjectsBuilder]
(EE-ManagedThreadFactory-engineScheduled-Thread-1) [] Invalid or unknown guest
architecture type '' received from guest agent
2018-11-11 14:34:33,176+01 WARN
 [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
(EE-ManagedThreadFactory-engineScheduled-Thread-28) [] EVENT_ID: VM_NOT_RESPONDING(126),
VM patjoub1 is not responding.
...
...
2018-11-11 14:34:48,278+01 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer]
(EE-ManagedThreadFactory-engineScheduled-Thread-48) [] VM
'6116fb07-096b-4c7e-97fe-01ecc9a6bd9b'(patjoub1) moved from
'NotResponding' --> 'Up'

So it comes back up 15 s later, and neither the VM nor the monitoring sees any
downtime.
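
To measure these windows over a whole engine.log rather than by eye, a minimal Python sketch along these lines might help. It assumes each VmAnalyzer entry sits on a single line in the real file (the mail client wrapped the excerpts above) and that the timestamp occupies the first 23 characters; the names TRANSITION, SAMPLE and not_responding_windows are made up for illustration and are not part of oVirt.

```python
import re
from datetime import datetime

# Assumed shape of a VmAnalyzer transition entry once the mail wrapping is undone:
#   <timestamp>+01 INFO [...] VM '<uuid>'(<name>) moved from '<old>' --> '<new>'
TRANSITION = re.compile(
    r"VM '(?P<uuid>[0-9a-f-]+)'\((?P<name>[^)]+)\) moved from "
    r"'(?P<old>\w+)' --> '(?P<new>\w+)'"
)

# The two transition entries quoted in this mail, rejoined onto single lines.
SAMPLE = [
    "2018-11-11 14:34:32,953+01 INFO  "
    "[org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] "
    "(EE-ManagedThreadFactory-engineScheduled-Thread-28) [] VM "
    "'6116fb07-096b-4c7e-97fe-01ecc9a6bd9b'(patjoub1) moved from "
    "'Up' --> 'NotResponding'",
    "2018-11-11 14:34:48,278+01 INFO  "
    "[org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] "
    "(EE-ManagedThreadFactory-engineScheduled-Thread-48) [] VM "
    "'6116fb07-096b-4c7e-97fe-01ecc9a6bd9b'(patjoub1) moved from "
    "'NotResponding' --> 'Up'",
]


def not_responding_windows(lines):
    """Return (vm_name, start, seconds) for each NotResponding episode."""
    pending = {}   # VM name -> time it entered NotResponding
    windows = []
    for line in lines:
        match = TRANSITION.search(line)
        if match is None:
            continue
        # The first 23 characters hold "YYYY-MM-DD HH:MM:SS,mmm".
        when = datetime.strptime(line[:23], "%Y-%m-%d %H:%M:%S,%f")
        name = match.group("name")
        if match.group("new") == "NotResponding":
            pending[name] = when
        elif match.group("old") == "NotResponding" and name in pending:
            start = pending.pop(name)
            windows.append((name, start, (when - start).total_seconds()))
    return windows


if __name__ == "__main__":
    for name, start, seconds in not_responding_windows(SAMPLE):
        print(name, start.isoformat(), seconds)
```

Feeding it the lines of the real engine.log instead of SAMPLE would list every episode per VM, which makes it easier to check whether they cluster around particular hours or VMs.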
At that time, I see in vdsm.log on the nodes:
2018-11-11 14:33:49,450+0100 ERROR (check/loop) [storage.Monitor] Error checking path
/rhev/data-center/mnt/glusterSD/victor.local.systea.fr:_DATA02/ffc53fd8-c5d1-4070-ae51-2e91835cd937/dom_md/metadata
(monitor:498)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/monitor.py", line 496, in
_pathChecked
    delay = result.delay()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/check.py", line 391, in
delay
    raise exception.MiscFileReadException(self.path, self.rc, self.err)
MiscFileReadException: Internal file read failure:
(u'/rhev/data-center/mnt/glusterSD/victor.local.systea.fr:_DATA02/ffc53fd8-c5d1-4070-ae51-2e91835cd937/dom_md/metadata',
1, 'Read timeout')
2018-11-11 14:33:49,450+0100 INFO  (check/loop) [storage.Monitor] Domain
ffc53fd8-c5d1-4070-ae51-2e91835cd937 became INVALID (monitor:469)

2018-11-11 14:33:59,451+0100 WARN  (check/loop) [storage.check] Checker
u'/rhev/data-center/mnt/glusterSD/victor.local.systea.fr:_DATA02/ffc53fd8-c5d1-4070-ae51-2e91835cd937/dom_md/metadata'
is blocked for 20.00 seconds (check:282)

2018-11-11 14:34:09,480+0100 INFO  (event/37) [storage.StoragePool] Linking
/rhev/data-center/mnt/glusterSD/victor.local.systea.fr:_DATA02/ffc53fd8-c5d1-4070-ae51-2e91835cd937
to
/rhev/data-center/6efda7f8-b62f-11e8-9d16-00163e263d21/ffc53fd8-c5d1-4070-ae51-2e91835cd937
(sp:1230)

OK: so DATA02 was marked as blocked for 20 s? Then I definitely have a problem with
Gluster, and I will inevitably find the reason in the Gluster logs? Uh: not at all.
Please see the Gluster logs here:
https://seafile.systea.fr/d/65df86cca9d34061a1e4/
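
To line up the storage stalls with the "not responding" events, one could pull every "is blocked for" warning out of vdsm.log. The sketch below is only an illustration: the regular expression is inferred from the single warning quoted above (unwrapped onto one line), and BLOCKED, SAMPLE_LINE and blocked_checks are hypothetical names, not vdsm APIs.

```python
import re

# Pattern inferred from the one [storage.check] warning quoted in this mail;
# other vdsm versions may format the entry differently.
BLOCKED = re.compile(
    r"^(?P<stamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})[+-]\d{4}\s+WARN\s+"
    r".*\[storage\.check\] Checker u?'(?P<path>[^']+)' "
    r"is blocked for (?P<seconds>\d+(?:\.\d+)?) seconds"
)

# The warning from this mail, rejoined onto a single line.
SAMPLE_LINE = (
    "2018-11-11 14:33:59,451+0100 WARN  (check/loop) [storage.check] Checker "
    "u'/rhev/data-center/mnt/glusterSD/victor.local.systea.fr:_DATA02/"
    "ffc53fd8-c5d1-4070-ae51-2e91835cd937/dom_md/metadata' "
    "is blocked for 20.00 seconds (check:282)"
)


def blocked_checks(lines, min_seconds=0.0):
    """Return (timestamp, path, seconds) for each blocked-checker warning."""
    hits = []
    for line in lines:
        match = BLOCKED.search(line)
        if match is None:
            continue
        seconds = float(match.group("seconds"))
        if seconds >= min_seconds:
            hits.append((match.group("stamp"), match.group("path"), seconds))
    return hits


if __name__ == "__main__":
    for stamp, path, seconds in blocked_checks([SAMPLE_LINE], min_seconds=10):
        print(stamp, seconds, path)
```

Run over the vdsm.log of each node, this would show whether all hosts see the metadata path blocked at the same moment or only the one running the affected VM.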

Unfortunately, I discovered this morning that I do not have the sanlock.log for that date. I
don't understand why: the log rotation seems OK with "rotate 3", but I have no
backup files :(.
But, as luck would have it, the same event occurred again this morning! Same VM patjoub1, 2018-11-13
08:01:37. So I have added today's sanlock.log; maybe it can help.

IMPORTANT NOTE: don't forget that the Gluster logs are shifted by one hour. For this event at
14:34, search at 13:34 in the Gluster logs.
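
The shift can be computed rather than remembered. This small sketch assumes the one-hour difference comes from Gluster writing its logs in UTC while the hosts log in local time with an explicit offset (+0100 in the vdsm.log excerpts); to_gluster_time is a hypothetical helper name.

```python
from datetime import datetime, timezone


def to_gluster_time(local_stamp):
    """Convert a stamp such as '2018-11-11 14:34:32+0100' to UTC.

    Assumption: the Gluster logs use UTC, which would explain the
    one-hour shift seen against the +0100 host logs.
    """
    local = datetime.strptime(local_stamp, "%Y-%m-%d %H:%M:%S%z")
    return local.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")


if __name__ == "__main__":
    print(to_gluster_time("2018-11-11 14:34:32+0100"))
```

Because the offset is read from the stamp itself, the same helper would stay correct for events logged during summer time (+0200).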

As a reminder, my configuration:
Gluster 3.12.13
oVirt 4.2.3
3 nodes, the third being the arbiter (replica 2 + arbiter volumes)

The nodes are never overloaded (CPU average 5%, no peak detected at the time of the event,
memory 128 GB used at 15%, only 10 VMs on this cluster). The network is underused: Gluster is on a
separate network on a bond (2 NICs, 1+1 Gb, mode 4 = 2 Gb), used at 10% at peak.

Here is the configuration of this volume:
# gluster volume status DATA02
Status of volume: DATA02
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick victorstorage.local.systea.fr:/home/d
ata02/data02/brick                          49158     0          Y       4990 
Brick gingerstorage.local.systea.fr:/home/d
ata02/data02/brick                          49153     0          Y       8460 
Brick eskarinastorage.local.systea.fr:/home
/data01/data02/brick                        49158     0          Y       2470 
Self-heal Daemon on localhost               N/A       N/A        Y       8771 
Self-heal Daemon on eskarinastorage.local.s
ystea.fr                                    N/A       N/A        Y       11745
Self-heal Daemon on victorstorage.local.sys
tea.fr                                      N/A       N/A        Y       17055

Task Status of Volume DATA02
------------------------------------------------------------------------------
There are no active volume tasks

# gluster volume info DATA02

Volume Name: DATA02
Type: Replicate
Volume ID: 48bf5871-339b-4f39-bea5-9b5848809c83
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: victorstorage.local.systea.fr:/home/data02/data02/brick
Brick2: gingerstorage.local.systea.fr:/home/data02/data02/brick
Brick3: eskarinastorage.local.systea.fr:/home/data01/data02/brick (arbiter)
Options Reconfigured:
network.ping-timeout: 30
server.allow-insecure: on
cluster.granular-entry-heal: enable
features.shard-block-size: 64MB
performance.stat-prefetch: on
server.event-threads: 3
client.event-threads: 8
performance.io-thread-count: 32
storage.owner-gid: 36
storage.owner-uid: 36
user.cifs: off
features.shard: on
cluster.shd-wait-qlength: 10000
cluster.shd-max-threads: 8
cluster.locking-scheme: granular
cluster.data-self-heal-algorithm: full
cluster.server-quorum-type: server
cluster.quorum-type: auto
cluster.eager-lock: enable
network.remote-dio: enable
performance.low-prio-threads: 32
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: off
cluster.server-quorum-ratio: 51%

So: is there anyone around who can help me understand what happened? Pleeease :/

--

Regards,

Frank
