Hi all,
I am still trying to understand my problem between (I suppose) oVirt and Gluster.
Following my recent posts titled 'VMs unexpectedly restarted', which brought neither a solution nor a lead to investigate, I submit another (related?) problem to you.
In parallel with the problem of VMs going down (which has not reproduced since Oct 16), I randomly get events in the GUI saying "VM xxxxx is not responding." For example, for the VM "patjoub1" on 2018-11-11 at 14:34. Never at the same hour, not every day, often this VM patjoub1 but not always: I have seen it on two others as well. All VM disks are on one volume, DATA02 (with leases on the same volume).

Searching engine.log, I found:

2018-11-11 14:34:32,953+01 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-28) [] VM '6116fb07-096b-4c7e-97fe-01ecc9a6bd9b'(patjoub1) moved from 'Up' --> 'NotResponding'
2018-11-11 14:34:33,116+01 WARN  [org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerObjectsBuilder] (EE-ManagedThreadFactory-engineScheduled-Thread-1) [] Invalid or unknown guest architecture type '' received from guest agent
2018-11-11 14:34:33,176+01 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-28) [] EVENT_ID: VM_NOT_RESPONDING(126), VM patjoub1 is not responding.
...
...
2018-11-11 14:34:48,278+01 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-48) [] VM '6116fb07-096b-4c7e-97fe-01ecc9a6bd9b'(patjoub1) moved from 'NotResponding' --> 'Up'
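
(For anyone who wants to reproduce my search: I simply grep the state transitions and the audit events out of the engine log; this assumes the default location /var/log/ovirt-engine/engine.log.)

# grep -E "NotResponding|VM_NOT_RESPONDING" /var/log/ovirt-engine/engine.log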

So it comes back Up 15 seconds later, and the VM (and the monitoring) sees no downtime.
At the same time, I see this in vdsm.log on the nodes:

2018-11-11 14:33:49,450+0100 ERROR (check/loop) [storage.Monitor] Error checking path /rhev/data-center/mnt/glusterSD/victor.local.systea.fr:_DATA02/ffc53fd8-c5d1-4070-ae51-2e91835cd937/dom_md/metadata (monitor:498)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/monitor.py", line 496, in _pathChecked
    delay = result.delay()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/check.py", line 391, in delay
    raise exception.MiscFileReadException(self.path, self.rc, self.err)
MiscFileReadException: Internal file read failure: (u'/rhev/data-center/mnt/glusterSD/victor.local.systea.fr:_DATA02/ffc53fd8-c5d1-4070-ae51-2e91835cd937/dom_md/metadata', 1, 'Read timeout')
2018-11-11 14:33:49,450+0100 INFO  (check/loop) [storage.Monitor] Domain ffc53fd8-c5d1-4070-ae51-2e91835cd937 became INVALID (monitor:469)

2018-11-11 14:33:59,451+0100 WARN  (check/loop) [storage.check] Checker u'/rhev/data-center/mnt/glusterSD/victor.local.systea.fr:_DATA02/ffc53fd8-c5d1-4070-ae51-2e91835cd937/dom_md/metadata' is blocked for 20.00 seconds (check:282)

2018-11-11 14:34:09,480+0100 INFO  (event/37) [storage.StoragePool] Linking /rhev/data-center/mnt/glusterSD/victor.local.systea.fr:_DATA02/ffc53fd8-c5d1-4070-ae51-2e91835cd937 to /rhev/data-center/6efda7f8-b62f-11e8-9d16-00163e263d21/ffc53fd8-c5d1-4070-ae51-2e91835cd937 (sp:1230)

OK: so DATA02 is marked as blocked for 20 s? Then I definitely have a problem with Gluster? And I will inevitably find the reason in the Gluster logs? Well: not at all.
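
(If I read vdsm's check.py correctly, the monitor just does a small direct-I/O read of the domain metadata file, so the stall should be reproducible by hand on a node with something like the following; the exact dd flags are my assumption based on that code, not something I captured from vdsm itself:)

# time dd if=/rhev/data-center/mnt/glusterSD/victor.local.systea.fr:_DATA02/ffc53fd8-c5d1-4070-ae51-2e91835cd937/dom_md/metadata of=/dev/null bs=4096 count=1 iflag=direct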
Please see the Gluster logs here:
https://seafile.systea.fr/d/65df86cca9d34061a1e4/

Unfortunately, I discovered this morning that I do not have the sanlock.log for this date. I don't understand why: the logrotate configuration looks OK with "rotate 3", but I have no backup files :(.
But, as luck would have it, the same event occurred this morning! Same VM patjoub1, at 2018-11-13 08:01:37. So I have added today's sanlock.log; maybe it can help.
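
(About the missing rotated files: a dry run of logrotate should at least show what it would do; I assume the relevant config is the one shipped under /etc/logrotate.d/, the exact file name may differ on your system:)

# logrotate -d /etc/logrotate.d/sanlock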

IMPORTANT NOTE: don't forget that Gluster logs with a one-hour shift. For this event at 14:34, search around 13:34 in the Gluster logs.
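(To avoid mistakes when jumping between logs, I convert the local timestamp with GNU date, assuming the shift is simply CET local time vs. UTC in the Gluster logs:)

# date -u -d "2018-11-11 14:34 CET" "+%Y-%m-%d %H:%M UTC"
2018-11-11 13:34 UTC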
I recall my configuration :
Gluster 3.12.13
oVirt 4.2.3
3 nodes, where the third is an arbiter (so the volumes are replica 2 + arbiter, i.e. 1 x (2 + 1) bricks)

The nodes are never overloaded: CPU averages 5% with no peak detected at the time of the event, and only 15% of the 128 GB of memory is used (there are only 10 VMs on this cluster). The network is underused too: Gluster runs on a separate network over a bond of 2 NICs (1+1 Gb in mode 4 = 2 Gb), which peaks at 10% utilization.

Here are the status and configuration of the volume in question:

# gluster volume status DATA02
Status of volume: DATA02
Gluster process                                                  TCP Port  RDMA Port  Online  Pid
--------------------------------------------------------------------------------------------------
Brick victorstorage.local.systea.fr:/home/data02/data02/brick    49158     0          Y       4990
Brick gingerstorage.local.systea.fr:/home/data02/data02/brick    49153     0          Y       8460
Brick eskarinastorage.local.systea.fr:/home/data01/data02/brick  49158     0          Y       2470
Self-heal Daemon on localhost                                    N/A       N/A        Y       8771
Self-heal Daemon on eskarinastorage.local.systea.fr              N/A       N/A        Y       11745
Self-heal Daemon on victorstorage.local.systea.fr                N/A       N/A        Y       17055

Task Status of Volume DATA02
------------------------------------------------------------------------------
There are no active volume tasks

# gluster volume info DATA02
 
Volume Name: DATA02
Type: Replicate
Volume ID: 48bf5871-339b-4f39-bea5-9b5848809c83
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: victorstorage.local.systea.fr:/home/data02/data02/brick
Brick2: gingerstorage.local.systea.fr:/home/data02/data02/brick
Brick3: eskarinastorage.local.systea.fr:/home/data01/data02/brick (arbiter)
Options Reconfigured:
network.ping-timeout: 30
server.allow-insecure: on
cluster.granular-entry-heal: enable
features.shard-block-size: 64MB
performance.stat-prefetch: on
server.event-threads: 3
client.event-threads: 8
performance.io-thread-count: 32
storage.owner-gid: 36
storage.owner-uid: 36
user.cifs: off
features.shard: on
cluster.shd-wait-qlength: 10000
cluster.shd-max-threads: 8
cluster.locking-scheme: granular
cluster.data-self-heal-algorithm: full
cluster.server-quorum-type: server
cluster.quorum-type: auto
cluster.eager-lock: enable
network.remote-dio: enable
performance.low-prio-threads: 32
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: off
cluster.server-quorum-ratio: 51%
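
(One thing that strikes me: the checker was blocked for 20 s while network.ping-timeout is 30 s, so if I understand ping-timeout correctly, a brick stall shorter than that would not even show up as a disconnect in the client logs. Individual options and pending heals can be checked with the standard commands:)

# gluster volume get DATA02 network.ping-timeout
# gluster volume heal DATA02 info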


So: is there anyone around who can help me understand what is happening? Pleeease :/

--

Regards,

Frank