[ovirt-users] Fwd: Having issues with Hosted Engine
Sahina Bose
sabose at redhat.com
Thu Apr 28 06:32:09 UTC 2016
This looks like the issue reported in
https://bugzilla.redhat.com/show_bug.cgi?id=1327121
Nir, Simone?
On 04/28/2016 05:35 AM, Luiz Claudio Prazeres Goncalves wrote:
>
> Hi everyone,
>
> Until today my environment was fully updated (oVirt 3.6.5 + CentOS 7.2) with 3
> nodes (the kvm1, kvm2 and kvm3 hosts). I also have 3 external Gluster nodes
> (gluster-root1, gluster1 and gluster2), replica 3, on top of which the
> engine storage domain sits (Gluster 3.7.11, fully updated + CentOS 7.2).
>
> For some weird reason I've been receiving emails from oVirt with
> EngineUnexpectedDown (picture attached) more or less daily, but the
> engine seems to be working fine and my VMs are up and running
> normally. I've never had any issue accessing the User Interface to
> manage the VMs.
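>
> When those alerts arrive, the state the HA agents see can be checked
> directly on the hosts (a minimal sketch, assuming the standard
> ovirt-hosted-engine-ha setup and its default log locations):
>
> [root at kvm1 ~]# hosted-engine --vm-status   # HA score and engine state per host
> [root at kvm1 ~]# tail -n 50 /var/log/ovirt-hosted-engine-ha/agent.log
> [root at kvm1 ~]# tail -n 50 /var/log/ovirt-hosted-engine-ha/broker.log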
>
> Today I ran "yum update" on the nodes and realised that vdsm was
> outdated, so I updated the kvm hosts and they are now, again, fully
> updated.
>
>
> Reviewing the logs, it seems to be an intermittent connectivity issue
> when trying to access the Gluster engine storage domain, as you can
> see below. I'm 100% sure there is no network problem: another oVirt
> cluster on the same network, with its engine storage domain on top of
> an iSCSI storage array, runs with no issues.
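>
> As a sanity check, the Gluster mount can be verified from the kvm hosts
> and the client log searched for disconnects around the time of the
> errors (a rough sketch; the mount point and log file names follow the
> usual oVirt glusterSD convention and may differ on other setups):
>
> [root at kvm1 ~]# mount | grep glusterSD        # is the engine domain still mounted?
> [root at kvm1 ~]# df -h | grep rhev             # does the mount still respond?
> [root at kvm1 ~]# grep -iE "disconnect|ping timer" \
>       /var/log/glusterfs/rhev-data-center-mnt-glusterSD-*engine*.log | tail -n 20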
>
> *Here is what seems to be the issue:*
>
> Thread-1111::INFO::2016-04-27
> 23:01:27,864::fileSD::357::Storage.StorageDomain::(validate)
> sdUUID=03926733-1872-4f85-bb21-18dc320560db
>
> Thread-1111::DEBUG::2016-04-27
> 23:01:27,865::persistentDict::234::Storage.PersistentDict::(refresh)
> read lines (FileMetadataRW)=[]
>
> Thread-1111::DEBUG::2016-04-27
> 23:01:27,865::persistentDict::252::Storage.PersistentDict::(refresh)
> Empty metadata
>
> Thread-1111::ERROR::2016-04-27
> 23:01:27,865::task::866::Storage.TaskManager.Task::(_setError)
> Task=`d2acf575-1a60-4fa0-a5bb-cd4363636b94`::Unexpected error
>
> Traceback (most recent call last):
>
> File "/usr/share/vdsm/storage/task.py", line 873, in _run
>
> return fn(*args, **kargs)
>
> File "/usr/share/vdsm/logUtils.py", line 49, in wrapper
>
> res = f(*args, **kwargs)
>
> File "/usr/share/vdsm/storage/hsm.py", line 2835, in
> getStorageDomainInfo
>
> dom = self.validateSdUUID(sdUUID)
>
> File "/usr/share/vdsm/storage/hsm.py", line 278, in validateSdUUID
>
> sdDom.validate()
>
> File "/usr/share/vdsm/storage/fileSD.py", line 360, in validate
>
> raise se.StorageDomainAccessError(self.sdUUID)
>
> StorageDomainAccessError: Domain is either partially accessible or
> entirely inaccessible: (u'03926733-1872-4f85-bb21-18dc320560db',)
>
> Thread-1111::DEBUG::2016-04-27
> 23:01:27,865::task::885::Storage.TaskManager.Task::(_run)
> Task=`d2acf575-1a60-4fa0-a5bb-cd4363636b94`::Task._run:
> d2acf575-1a60-4fa0-a5bb-cd4363636b94
> ('03926733-1872-4f85-bb21-18dc320560db',) {} failed - stopping task
>
> Thread-1111::DEBUG::2016-04-27
> 23:01:27,865::task::1246::Storage.TaskManager.Task::(stop)
> Task=`d2acf575-1a60-4fa0-a5bb-cd4363636b94`::stopping in state
> preparing (force False)
>
> Thread-1111::DEBUG::2016-04-27
> 23:01:27,865::task::993::Storage.TaskManager.Task::(_decref)
> Task=`d2acf575-1a60-4fa0-a5bb-cd4363636b94`::ref 1 aborting True
>
> Thread-1111::INFO::2016-04-27
> 23:01:27,865::task::1171::Storage.TaskManager.Task::(prepare)
> Task=`d2acf575-1a60-4fa0-a5bb-cd4363636b94`::aborting: Task is
> aborted: 'Domain is either partially accessible or entirely
> inaccessible' - code 379
>
> Thread-1111::DEBUG::2016-04-27
> 23:01:27,866::task::1176::Storage.TaskManager.Task::(prepare)
> Task=`d2acf575-1a60-4fa0-a5bb-cd4363636b94`::Prepare: aborted: Domain
> is either partially accessible or entirely inaccessible
>
>
> *Question: Does anyone know what might be happening? I have several
> Gluster volume configs, as you can see below. All the storage domains
> use the same settings.*
>
>
> *More information:*
>
> I have the "engine", "vmos1" and "master" storage domains, so
> everything looks good:
>
> [root at kvm1 vdsm]# vdsClient -s 0 getStorageDomainsList
>
> 03926733-1872-4f85-bb21-18dc320560db
>
> 35021ff4-fb95-43d7-92a3-f538273a3c2e
>
> e306e54e-ca98-468d-bb04-3e8900f8840c
>
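> To map those UUIDs to domain names, and to confirm whether vdsm can
> read the engine domain metadata at a given moment, each one can be
> queried directly (same vdsClient as above; the UUID here is the one
> from the error):
>
> [root at kvm1 vdsm]# vdsClient -s 0 getStorageDomainInfo 03926733-1872-4f85-bb21-18dc320560db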
>
> *Gluster config:*
>
> [root at gluster-root1 ~]# gluster volume info
>
> Volume Name: engine
>
> Type: Replicate
>
> Volume ID: 64b413d2-c42e-40fd-b356-3e6975e941b0
>
> Status: Started
>
> Number of Bricks: 1 x 3 = 3
>
> Transport-type: tcp
>
> Bricks:
>
> Brick1: gluster1.xyz.com:/gluster/engine/brick1
>
> Brick2: gluster2.xyz.com:/gluster/engine/brick1
>
> Brick3: gluster-root1.xyz.com:/gluster/engine/brick1
>
> Options Reconfigured:
>
> performance.cache-size: 1GB
>
> performance.write-behind-window-size: 4MB
>
> performance.write-behind: off
>
> performance.quick-read: off
>
> performance.read-ahead: off
>
> performance.io-cache: off
>
> performance.stat-prefetch: off
>
> cluster.eager-lock: enable
>
> cluster.quorum-type: auto
>
> network.remote-dio: enable
>
> cluster.server-quorum-type: server
>
> cluster.data-self-heal-algorithm: full
>
> performance.low-prio-threads: 32
>
> features.shard-block-size: 512MB
>
> features.shard: on
>
> storage.owner-gid: 36
>
> storage.owner-uid: 36
>
> performance.readdir-ahead: on
>
>
> Volume Name: master
>
> Type: Replicate
>
> Volume ID: 20164808-7bbe-4eeb-8770-d222c0e0b830
>
> Status: Started
>
> Number of Bricks: 1 x 3 = 3
>
> Transport-type: tcp
>
> Bricks:
>
> Brick1: gluster1.xyz.com:/home/storage/master/brick1
>
> Brick2: gluster2.xyz.com:/home/storage/master/brick1
>
> Brick3: gluster-root1.xyz.com:/home/storage/master/brick1
>
> Options Reconfigured:
>
> performance.readdir-ahead: on
>
> performance.quick-read: off
>
> performance.read-ahead: off
>
> performance.io-cache: off
>
> performance.stat-prefetch: off
>
> cluster.eager-lock: enable
>
> network.remote-dio: enable
>
> cluster.quorum-type: auto
>
> cluster.server-quorum-type: server
>
> storage.owner-uid: 36
>
> storage.owner-gid: 36
>
> features.shard: on
>
> features.shard-block-size: 512MB
>
> performance.low-prio-threads: 32
>
> cluster.data-self-heal-algorithm: full
>
> performance.write-behind: off
>
> performance.write-behind-window-size: 4MB
>
> performance.cache-size: 1GB
>
>
> Volume Name: vmos1
>
> Type: Replicate
>
> Volume ID: ea8fb50e-7bc8-4de3-b775-f3976b6b4f13
>
> Status: Started
>
> Number of Bricks: 1 x 3 = 3
>
> Transport-type: tcp
>
> Bricks:
>
> Brick1: gluster1.xyz.com:/gluster/vmos1/brick1
>
> Brick2: gluster2.xyz.com:/gluster/vmos1/brick1
>
> Brick3: gluster-root1.xyz.com:/gluster/vmos1/brick1
>
> Options Reconfigured:
>
> network.ping-timeout: 60
>
> performance.readdir-ahead: on
>
> performance.quick-read: off
>
> performance.read-ahead: off
>
> performance.io-cache: off
>
> performance.stat-prefetch: off
>
> cluster.eager-lock: enable
>
> network.remote-dio: enable
>
> cluster.quorum-type: auto
>
> cluster.server-quorum-type: server
>
> storage.owner-uid: 36
>
> storage.owner-gid: 36
>
> features.shard: on
>
> features.shard-block-size: 512MB
>
> performance.low-prio-threads: 32
>
> cluster.data-self-heal-algorithm: full
>
> performance.write-behind: off
>
> performance.write-behind-window-size: 4MB
>
> performance.cache-size: 1GB
>
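> On the Gluster side, it is also worth confirming that no brick of the
> engine volume dropped and that no heals were pending around the time
> of the errors (standard gluster CLI, run from any of the Gluster nodes):
>
> [root at gluster-root1 ~]# gluster volume status engine     # are all bricks and self-heal daemons online?
> [root at gluster-root1 ~]# gluster volume heal engine info  # any entries pending heal?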
>
>
> All the logs are attached...
>
>
>
> Thanks
>
> -Luiz
>
>
>
>
> _______________________________________________
> Users mailing list
> Users at ovirt.org
> http://lists.ovirt.org/mailman/listinfo/users