Hi everyone,

Until today my environment was fully updated (oVirt 3.6.5 + CentOS 7.2) with 3 nodes (kvm1, kvm2 and kvm3 hosts). I also have 3 external Gluster nodes (gluster-root1, gluster1 and gluster2 hosts), replica 3, on top of which the engine storage domain is sitting (Gluster 3.7.11, fully updated, + CentOS 7.2).

For some weird reason I've been receiving emails from oVirt with EngineUnexpectedDown (picture attached) more or less on a daily basis, but the engine seems to be working fine and my VMs are up and running normally. I've never had any issue accessing the user interface to manage the VMs.

Today I ran "yum update" on the nodes and realised that vdsm was outdated, so I updated the kvm hosts and they are now, again, fully updated.


Reviewing the logs, it seems to be an intermittent connectivity issue when trying to access the Gluster engine storage domain, as you can see below. I'm 100% sure there is no network issue in place: I have another oVirt cluster on the same network, with an engine storage domain on top of an iSCSI storage array, and it has no issues.

Here is what seems to be the issue:

Thread-1111::INFO::2016-04-27 23:01:27,864::fileSD::357::Storage.StorageDomain::(validate) sdUUID=03926733-1872-4f85-bb21-18dc320560db
Thread-1111::DEBUG::2016-04-27 23:01:27,865::persistentDict::234::Storage.PersistentDict::(refresh) read lines (FileMetadataRW)=[]
Thread-1111::DEBUG::2016-04-27 23:01:27,865::persistentDict::252::Storage.PersistentDict::(refresh) Empty metadata
Thread-1111::ERROR::2016-04-27 23:01:27,865::task::866::Storage.TaskManager.Task::(_setError) Task=`d2acf575-1a60-4fa0-a5bb-cd4363636b94`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 873, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 49, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 2835, in getStorageDomainInfo
    dom = self.validateSdUUID(sdUUID)
  File "/usr/share/vdsm/storage/hsm.py", line 278, in validateSdUUID
    sdDom.validate()
  File "/usr/share/vdsm/storage/fileSD.py", line 360, in validate
    raise se.StorageDomainAccessError(self.sdUUID)
StorageDomainAccessError: Domain is either partially accessible or entirely inaccessible: (u'03926733-1872-4f85-bb21-18dc320560db',)
Thread-1111::DEBUG::2016-04-27 23:01:27,865::task::885::Storage.TaskManager.Task::(_run) Task=`d2acf575-1a60-4fa0-a5bb-cd4363636b94`::Task._run: d2acf575-1a60-4fa0-a5bb-cd4363636b94 ('03926733-1872-4f85-bb21-18dc320560db',) {} failed - stopping task
Thread-1111::DEBUG::2016-04-27 23:01:27,865::task::1246::Storage.TaskManager.Task::(stop) Task=`d2acf575-1a60-4fa0-a5bb-cd4363636b94`::stopping in state preparing (force False)
Thread-1111::DEBUG::2016-04-27 23:01:27,865::task::993::Storage.TaskManager.Task::(_decref) Task=`d2acf575-1a60-4fa0-a5bb-cd4363636b94`::ref 1 aborting True
Thread-1111::INFO::2016-04-27 23:01:27,865::task::1171::Storage.TaskManager.Task::(prepare) Task=`d2acf575-1a60-4fa0-a5bb-cd4363636b94`::aborting: Task is aborted: 'Domain is either partially accessible or entirely inaccessible' - code 379
Thread-1111::DEBUG::2016-04-27 23:01:27,866::task::1176::Storage.TaskManager.Task::(prepare) Task=`d2acf575-1a60-4fa0-a5bb-cd4363636b94`::Prepare: aborted: Domain is either partially accessible or entirely inaccessible
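
The traceback is basically fileSD.validate() failing to read the domain metadata on the mounted Gluster volume. To rule vdsm itself out, I can reproduce roughly the same check by hand with a probe like the one below. This is only a rough sketch: the mount path and the dom_md/metadata layout are assumptions based on the standard oVirt glusterSD mount, so adjust them to whatever "mount | grep glusterfs" shows on the host.

#!/usr/bin/env python
# One-shot probe of the engine storage domain mount, roughly what
# fileSD.validate() boils down to: can we stat the domain directory and
# read its metadata file?
import os
import sys

SD_UUID = "03926733-1872-4f85-bb21-18dc320560db"   # engine domain UUID from the log
MOUNT = "/rhev/data-center/mnt/glusterSD/gluster1.xyz.com:_engine"  # assumed path, adjust

domain_dir = os.path.join(MOUNT, SD_UUID)
metadata = os.path.join(domain_dir, "dom_md", "metadata")

try:
    os.statvfs(domain_dir)                 # does the mount answer at all?
    with open(metadata) as f:
        lines = [l for l in f if l.strip()]
    if not lines:
        print("metadata file is EMPTY -> matches the 'Empty metadata' log line")
        sys.exit(1)
    print("domain reachable, %d metadata lines" % len(lines))
except (OSError, IOError) as e:
    print("domain NOT accessible: %s" % e)
    sys.exit(1)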


Question: does anyone know what might be happening? I have several Gluster options configured, as you can see below. All the storage domains use essentially the same options (the only difference is that vmos1 also sets network.ping-timeout: 60).


More information:

I have the "engine", "vmos1" and "master" storage domains, so everything looks good:

[root@kvm1 vdsm]# vdsClient -s 0 getStorageDomainsList
03926733-1872-4f85-bb21-18dc320560db
35021ff4-fb95-43d7-92a3-f538273a3c2e
e306e54e-ca98-468d-bb04-3e8900f8840c
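
In case it helps to map these UUIDs back to the domain names (03926733-... is the engine domain from the error above), a quick helper like this should do it. The parsing of the vdsClient output is a best-effort assumption, since the exact key/value format can vary between vdsm versions.

#!/usr/bin/env python
# Map storage-domain UUIDs to names by calling
# "vdsClient -s 0 getStorageDomainInfo <uuid>" for each one.
import subprocess

UUIDS = [
    "03926733-1872-4f85-bb21-18dc320560db",
    "35021ff4-fb95-43d7-92a3-f538273a3c2e",
    "e306e54e-ca98-468d-bb04-3e8900f8840c",
]

for uuid in UUIDS:
    try:
        out = subprocess.check_output(
            ["vdsClient", "-s", "0", "getStorageDomainInfo", uuid])
    except (OSError, subprocess.CalledProcessError) as e:
        print("%s: query failed (%s)" % (uuid, e))
        continue
    name = "?"
    for line in out.decode("utf-8", "replace").splitlines():
        # assumed "name = <something>" style output; adjust if your vdsm differs
        if line.strip().startswith("name"):
            name = line.split("=", 1)[-1].strip()
            break
    print("%s -> %s" % (uuid, name))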


Gluster config:

[root@gluster-root1 ~]# gluster volume info

Volume Name: engine
Type: Replicate
Volume ID: 64b413d2-c42e-40fd-b356-3e6975e941b0
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: gluster1.xyz.com:/gluster/engine/brick1
Brick2: gluster2.xyz.com:/gluster/engine/brick1
Brick3: gluster-root1.xyz.com:/gluster/engine/brick1
Options Reconfigured:
performance.cache-size: 1GB
performance.write-behind-window-size: 4MB
performance.write-behind: off
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
cluster.eager-lock: enable
cluster.quorum-type: auto
network.remote-dio: enable
cluster.server-quorum-type: server
cluster.data-self-heal-algorithm: full
performance.low-prio-threads: 32
features.shard-block-size: 512MB
features.shard: on
storage.owner-gid: 36
storage.owner-uid: 36
performance.readdir-ahead: on

Volume Name: master
Type: Replicate
Volume ID: 20164808-7bbe-4eeb-8770-d222c0e0b830
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: gluster1.xyz.com:/home/storage/master/brick1
Brick2: gluster2.xyz.com:/home/storage/master/brick1
Brick3: gluster-root1.xyz.com:/home/storage/master/brick1
Options Reconfigured:
performance.readdir-ahead: on
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
cluster.eager-lock: enable
network.remote-dio: enable
cluster.quorum-type: auto
cluster.server-quorum-type: server
storage.owner-uid: 36
storage.owner-gid: 36
features.shard: on
features.shard-block-size: 512MB
performance.low-prio-threads: 32
cluster.data-self-heal-algorithm: full
performance.write-behind: off
performance.write-behind-window-size: 4MB
performance.cache-size: 1GB

Volume Name: vmos1
Type: Replicate
Volume ID: ea8fb50e-7bc8-4de3-b775-f3976b6b4f13
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: gluster1.xyz.com:/gluster/vmos1/brick1
Brick2: gluster2.xyz.com:/gluster/vmos1/brick1
Brick3: gluster-root1.xyz.com:/gluster/vmos1/brick1
Options Reconfigured:
network.ping-timeout: 60
performance.readdir-ahead: on
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
cluster.eager-lock: enable
network.remote-dio: enable
cluster.quorum-type: auto
cluster.server-quorum-type: server
storage.owner-uid: 36
storage.owner-gid: 36
features.shard: on
features.shard-block-size: 512MB
performance.low-prio-threads: 32
cluster.data-self-heal-algorithm: full
performance.write-behind: off
performance.write-behind-window-size: 4MB
performance.cache-size: 1GB
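
To try to correlate the EngineUnexpectedDown mails with actual mount stalls, I'm thinking of leaving a small watchdog like the one below running on one of the kvm hosts. Again, the mount path is an assumption (same as in the probe above); it just stats the engine mount in a child process so a hung glusterfs mount can't wedge the loop, and logs anything slow or failed.

#!/usr/bin/env python
# Periodically stat the engine domain mount and log failures/stalls, so the
# timestamps can be lined up with the EngineUnexpectedDown notifications.
import multiprocessing
import os
import time

MOUNT = "/rhev/data-center/mnt/glusterSD/gluster1.xyz.com:_engine"  # assumed path, adjust
INTERVAL = 5        # seconds between probes
TIMEOUT = 10        # consider the mount hung after this many seconds

def _probe(path):
    # Runs in a child process; an unreachable mount makes statvfs block or fail here.
    os.statvfs(path)

if __name__ == "__main__":
    while True:
        start = time.time()
        p = multiprocessing.Process(target=_probe, args=(MOUNT,))
        p.start()
        p.join(TIMEOUT)
        elapsed = time.time() - start
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        if p.is_alive():
            p.terminate()
            print("%s mount HUNG for > %ss" % (stamp, TIMEOUT))
        elif p.exitcode != 0:
            print("%s statvfs FAILED (child exit %s)" % (stamp, p.exitcode))
        elif elapsed > 1.0:
            print("%s slow statvfs: %.1fs" % (stamp, elapsed))
        time.sleep(INTERVAL)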



All the logs are attached...



Thanks

-Luiz