Fwd: Having issues with Hosted Engine

28 Apr 2016

      Hi everyone,

Until today my environment was fully updated (3.6.5+centos7.2) with 3 nodes
(kvm1,kvm2 and kvm3 hosts) . I also have 3 external gluster nodes
(gluster-root1,gluster1 and gluster2 hosts ) , replica 3, which the engine
storage domain is sitting on top (3.7.11 fully updated+centos7.2)

For some weird reason i've been receiving emails from oVirt with
EngineUnexpectedDown (attached picture) on a daily basis more or less, but
the engine seems to be working fine and my vm's are up and running
normally. I've never had any issue to access the User Interface to manage
the vm's

Today I run "yum update" on the nodes and realised that vdsm was outdated,
so I updated the kvm hosts and they are now , again, fully updated.

Reviewing the logs It seems to be an intermittent connectivity issue when
trying to access the gluster engine storage domain as you can see below. I
don't have any network issue in place and I'm 100% sure about it. I have
another oVirt Cluster using the same network and using a engine storage
domain on top of an iSCSI Storage Array with no issues.

*Here seems to be the issue:*

Thread-1111::INFO::2016-04-27
23:01:27,864::fileSD::357::Storage.StorageDomain::(validate)
sdUUID=03926733-1872-4f85-bb21-18dc320560db

Thread-1111::DEBUG::2016-04-27
23:01:27,865::persistentDict::234::Storage.PersistentDict::(refresh) read
lines (FileMetadataRW)=[]

Thread-1111::DEBUG::2016-04-27
23:01:27,865::persistentDict::252::Storage.PersistentDict::(refresh) Empty
metadata

Thread-1111::ERROR::2016-04-27
23:01:27,865::task::866::Storage.TaskManager.Task::(_setError)
Task=`d2acf575-1a60-4fa0-a5bb-cd4363636b94`::Unexpected error

Traceback (most recent call last):

  File "/usr/share/vdsm/storage/task.py", line 873, in _run

    return fn(*args, **kargs)

  File "/usr/share/vdsm/logUtils.py", line 49, in wrapper

    res = f(*args, **kwargs)

  File "/usr/share/vdsm/storage/hsm.py", line 2835, in getStorageDomainInfo

    dom = self.validateSdUUID(sdUUID)

  File "/usr/share/vdsm/storage/hsm.py", line 278, in validateSdUUID

    sdDom.validate()

  File "/usr/share/vdsm/storage/fileSD.py", line 360, in validate

    raise se.StorageDomainAccessError(self.sdUUID)

StorageDomainAccessError: Domain is either partially accessible or entirely
inaccessible: (u'03926733-1872-4f85-bb21-18dc320560db',)

Thread-1111::DEBUG::2016-04-27
23:01:27,865::task::885::Storage.TaskManager.Task::(_run)
Task=`d2acf575-1a60-4fa0-a5bb-cd4363636b94`::Task._run:
d2acf575-1a60-4fa0-a5bb-cd4363636b94
('03926733-1872-4f85-bb21-18dc320560db',) {} failed - stopping task

Thread-1111::DEBUG::2016-04-27
23:01:27,865::task::1246::Storage.TaskManager.Task::(stop)
Task=`d2acf575-1a60-4fa0-a5bb-cd4363636b94`::stopping in state preparing
(force False)

Thread-1111::DEBUG::2016-04-27
23:01:27,865::task::993::Storage.TaskManager.Task::(_decref)
Task=`d2acf575-1a60-4fa0-a5bb-cd4363636b94`::ref 1 aborting True

Thread-1111::INFO::2016-04-27
23:01:27,865::task::1171::Storage.TaskManager.Task::(prepare)
Task=`d2acf575-1a60-4fa0-a5bb-cd4363636b94`::aborting: Task is aborted:
'Domain is either partially accessible or entirely inaccessible' - code 379
Thread-1111::DEBUG::2016-04-27
23:01:27,866::task::1176::Storage.TaskManager.Task::(prepare)
Task=`d2acf575-1a60-4fa0-a5bb-cd4363636b94`::Prepare: aborted: Domain is
either partially accessible or entirely inaccessible

*Question: Anyone know what might be happening? I have several gluster
config's, as you can see below. All the storage domain are using the same
config's*

*More information:*

I have the "engine" storage domain, "vmos1" storage domain and "master"
storage domain, so everything looks good.

[root@kvm1 vdsm]# vdsClient -s 0 getStorageDomainsList

03926733-1872-4f85-bb21-18dc320560db

35021ff4-fb95-43d7-92a3-f538273a3c2e

e306e54e-ca98-468d-bb04-3e8900f8840c

*Gluster config:*

[root@gluster-root1 ~]# gluster volume info

Volume Name: engine

Type: Replicate

Volume ID: 64b413d2-c42e-40fd-b356-3e6975e941b0

Status: Started

Number of Bricks: 1 x 3 = 3

Transport-type: tcp

Bricks:

Brick1: gluster1.xyz.com:/gluster/engine/brick1

Brick2: gluster2.xyz.com:/gluster/engine/brick1

Brick3: gluster-root1.xyz.com:/gluster/engine/brick1

Options Reconfigured:

performance.cache-size: 1GB

performance.write-behind-window-size: 4MB

performance.write-behind: off

performance.quick-read: off

performance.read-ahead: off

performance.io-cache: off

performance.stat-prefetch: off

cluster.eager-lock: enable

cluster.quorum-type: auto

network.remote-dio: enable

cluster.server-quorum-type: server

cluster.data-self-heal-algorithm: full

performance.low-prio-threads: 32

features.shard-block-size: 512MB

features.shard: on

storage.owner-gid: 36

storage.owner-uid: 36

performance.readdir-ahead: on

Volume Name: master

Type: Replicate

Volume ID: 20164808-7bbe-4eeb-8770-d222c0e0b830

Status: Started

Number of Bricks: 1 x 3 = 3

Transport-type: tcp

Bricks:

Brick1: gluster1.xyz.com:/home/storage/master/brick1

Brick2: gluster2.xyz.com:/home/storage/master/brick1

Brick3: gluster-root1.xyz.com:/home/storage/master/brick1

Options Reconfigured:

performance.readdir-ahead: on

performance.quick-read: off

performance.read-ahead: off

performance.io-cache: off

performance.stat-prefetch: off

cluster.eager-lock: enable

network.remote-dio: enable

cluster.quorum-type: auto

cluster.server-quorum-type: server

storage.owner-uid: 36

storage.owner-gid: 36

features.shard: on

features.shard-block-size: 512MB

performance.low-prio-threads: 32

cluster.data-self-heal-algorithm: full

performance.write-behind: off

performance.write-behind-window-size: 4MB

performance.cache-size: 1GB

Volume Name: vmos1

Type: Replicate

Volume ID: ea8fb50e-7bc8-4de3-b775-f3976b6b4f13

Status: Started

Number of Bricks: 1 x 3 = 3

Transport-type: tcp

Bricks:

Brick1: gluster1.xyz.com:/gluster/vmos1/brick1

Brick2: gluster2.xyz.com:/gluster/vmos1/brick1

Brick3: gluster-root1.xyz.com:/gluster/vmos1/brick1

Options Reconfigured:

network.ping-timeout: 60

performance.readdir-ahead: on

performance.quick-read: off

performance.read-ahead: off

performance.io-cache: off

performance.stat-prefetch: off

cluster.eager-lock: enable

network.remote-dio: enable

cluster.quorum-type: auto

cluster.server-quorum-type: server

storage.owner-uid: 36

storage.owner-gid: 36

features.shard: on

features.shard-block-size: 512MB

performance.low-prio-threads: 32

cluster.data-self-heal-algorithm: full

performance.write-behind: off

performance.write-behind-window-size: 4MB

performance.cache-size: 1GB

Attached goes all the logs...

Thanks

-Luiz

Luiz Claudio Prazeres Goncalves

Sahina Bose

Simone Tiraboschi

Luiz Claudio Prazeres Goncalves

Luiz Claudio Prazeres Goncalves

Simone Tiraboschi

Luiz Claudio Prazeres Goncalves

tags

participants (3)