
I recently removed a host from my cluster to upgrade it to 4.4. After I removed the host from the datacenter, VMs started to pause on the second system they had all migrated to. Investigating via the engine showed the storage domain as "unknown"; when I try to activate it via the engine it cycles to "locked" and then back to "unknown".

/var/log/sanlock.log contains a repeating:

add_lockspace e1270474-108c-4cae-83d6-51698cffebbf:1:/dev/e1270474-108c-4cae-83d6-51698cffebbf/ids:0 conflicts with name of list1 s1 e1270474-108c-4cae-83d6-51698cffebbf:3:/dev/e1270474-108c-4cae-83d6-51698cffebbf/ids:0

vdsm.log contains these (maybe related) snippets:

---
2020-09-03 20:19:53,483+0000 INFO  (jsonrpc/6) [vdsm.api] FINISH getAllTasksStatuses error=Secured object is not in safe state from=::ffff:137.79.52.43,36326, flow_id=18031a91, task_id=8e92f059-743a-48c8-aa9d-e7c4c836337b (api:52)
2020-09-03 20:19:53,483+0000 ERROR (jsonrpc/6) [storage.TaskManager.Task] (Task='8e92f059-743a-48c8-aa9d-e7c4c836337b') Unexpected error (task:875)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 882, in _run
    return fn(*args, **kargs)
  File "<string>", line 2, in getAllTasksStatuses
  File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 50, in method
    ret = func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/hsm.py", line 2201, in getAllTasksStatuses
    allTasksStatus = self._pool.getAllTasksStatuses()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/securable.py", line 77, in wrapper
    raise SecureError("Secured object is not in safe state")
SecureError: Secured object is not in safe state
2020-09-03 20:19:53,483+0000 INFO  (jsonrpc/6) [storage.TaskManager.Task] (Task='8e92f059-743a-48c8-aa9d-e7c4c836337b') aborting: Task is aborted: u'Secured object is not in safe state' - code 100 (task:1181)
2020-09-03 20:19:53,483+0000 ERROR (jsonrpc/6) [storage.Dispatcher] FINISH getAllTasksStatuses error=Secured object is not in safe state (dispatcher:87)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/dispatcher.py", line 74, in wrapper
    result = ctask.prepare(func, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 108, in wrapper
    return m(self, *a, **kw)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 1189, in prepare
    raise self.error
SecureError: Secured object is not in safe state
---
2020-09-03 20:44:23,252+0000 INFO  (tasks/2) [storage.ThreadPool.WorkerThread] START task 76415a77-9d29-4b72-ade1-53207cfc503b (cmd=<bound method Task.commit of <vdsm.storage.task.Task instance at 0x7fe99c27dea8>>, args=None) (threadPool:208)
2020-09-03 20:44:23,266+0000 INFO  (tasks/2) [storage.SANLock] Acquiring host id for domain e1270474-108c-4cae-83d6-51698cffebbf (id=1, wait=True) (clusterlock:313)
2020-09-03 20:44:23,267+0000 ERROR (tasks/2) [storage.TaskManager.Task] (Task='76415a77-9d29-4b72-ade1-53207cfc503b') Unexpected error (task:875)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 882, in _run
    return fn(*args, **kargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 336, in run
    return self.cmd(*self.argslist, **self.argsdict)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sp.py", line 317, in startSpm
    self.masterDomain.acquireHostId(self.id)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sd.py", line 957, in acquireHostId
    self._manifest.acquireHostId(hostId, wait)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sd.py", line 501, in acquireHostId
    self._domainLock.acquireHostId(hostId, wait)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/clusterlock.py", line 344, in acquireHostId
    raise se.AcquireHostIdFailure(self._sdUUID, e)
AcquireHostIdFailure: Cannot acquire host id: ('e1270474-108c-4cae-83d6-51698cffebbf', SanlockException(22, 'Sanlock lockspace add failure', 'Invalid argument'))
---

Another symptom: in the hosts view of the engine, SPM bounces between "Normal" and "Contending". When it is Normal, if I select Management -> Select as SPM I get "Error while executing action: Cannot force select SPM. Unknown Data Center status."

I've tried rebooting the one remaining host in the cluster to no avail; hosted-engine --reinitialize-lockspace also does not solve the issue. I'm stumped as to what else to try and would appreciate any guidance on how to resolve this.

Thank You

On Thursday, September 3, 2020, 22:49:17 CEST, Gillingham, Eric J (US 393D) via Users wrote:
> I recently removed a host from my cluster to upgrade it to 4.4, after I removed the host from the datacenter VMs started to pause on the second system they all migrated to. Investigating via the engine showed the storage domain was showing as "unknown", when I try to activate it via the engine it cycles to locked then to unknown again.
> /var/log/sanlock.log contains a repeating:
> add_lockspace e1270474-108c-4cae-83d6-51698cffebbf:1:/dev/e1270474-108c-4cae-83d6-51698cffebbf/ids:0 conflicts with name of list1 s1 e1270474-108c-4cae-83d6-51698cffebbf:3:/dev/e1270474-108c-4cae-83d6-51698cffebbf/ids:0
How did you remove the first host? Did you put it into maintenance first? I wonder how this situation (two lockspaces with conflicting names) can occur. You can try to re-initialize the lockspace directly using the sanlock command (see man sanlock), but it would be good to understand the situation first.
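[Editorial note: a minimal sketch of what "re-initialize the lockspace directly using the sanlock command" could look like, using the domain UUID and ids path quoted in this thread. The sanlock subcommands referenced in the comments exist, but check "man sanlock" before running anything; the rem_lockspace and direct init calls are destructive and are therefore left commented out.]

```shell
# Hypothetical values taken from the sanlock.log message in this thread.
SD_UUID="e1270474-108c-4cae-83d6-51698cffebbf"
IDS_PATH="/dev/${SD_UUID}/ids"
HOST_ID=1
# sanlock lockspace specs have the form NAME:HOST_ID:PATH:OFFSET
spec="${SD_UUID}:${HOST_ID}:${IDS_PATH}:0"

# Read-only inspection of current lockspaces and resources:
#   sanlock client status
# Drop the stale in-memory lockspace registration (stop vdsm first):
#   sanlock client rem_lockspace -s "$spec"
# Re-initialize the lockspace on disk (wipes all host registrations):
#   sanlock direct init -s "${SD_UUID}:0:${IDS_PATH}:0"
echo "$spec"
```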
_______________________________________________
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-leave@ovirt.org
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
List Archives: https://lists.ovirt.org/archives/list/users@ovirt.org/message/FMJZV2OEKHPTSTROSPLCQ3WJUIPB6CKL/

On 9/4/20, 4:50 AM, "Vojtech Juranek" <vjuranek@redhat.com> wrote:
> How did you remove the first host? Did you put it into maintenance first? I wonder how this situation (two lockspaces with conflicting names) can occur.
> You can try to re-initialize the lockspace directly using the sanlock command (see man sanlock), but it would be good to understand the situation first.

Just as you said: I put it into maintenance mode, shut it down, and removed it via the engine UI.

On Fri, Sep 4, 2020 at 5:43 PM Gillingham, Eric J (US 393D) via Users <users@ovirt.org> wrote:
On 9/4/20, 4:50 AM, "Vojtech Juranek" <vjuranek@redhat.com> wrote:
On Thursday, September 3, 2020, 22:49:17 CEST, Gillingham, Eric J (US 393D) via Users wrote: [...]
> /var/log/sanlock.log contains a repeating:
> add_lockspace e1270474-108c-4cae-83d6-51698cffebbf:1:/dev/e1270474-108c-4cae-83d6-51698cffebbf/ids:0 conflicts with name of list1 s1 e1270474-108c-4cae-83d6-51698cffebbf:3:/dev/e1270474-108c-4cae-83d6-51698cffebbf/ids:0
David, what does this message mean? It is clear that there is a conflict, but not what the conflicting item is. The host id in the request is 1, and in the conflicting item it is 3. No conflicting data is displayed in the error message.
> How did you remove the first host? Did you put it into maintenance first? I wonder how this situation (two lockspaces with conflicting names) can occur.
> You can try to re-initialize the lockspace directly using the sanlock command (see man sanlock), but it would be good to understand the situation first.
> Just as you said: put it into maintenance mode, shut it down, removed it via the engine UI.
Eric, is it possible that you shut down the host too quickly, before it had actually disconnected from the lockspace? When the engine moves a host to maintenance, it does not wait until the host has actually moved into maintenance. This is actually a bug, so it would be a good idea to file a bug about this.

Nir

On Sat, Sep 05, 2020 at 12:25:45AM +0300, Nir Soffer wrote:
> > /var/log/sanlock.log contains a repeating:
> > add_lockspace e1270474-108c-4cae-83d6-51698cffebbf:1:/dev/e1270474-108c-4cae-83d6-51698cffebbf/ids:0 conflicts with name of list1 s1 e1270474-108c-4cae-83d6-51698cffebbf:3:/dev/e1270474-108c-4cae-83d6-51698cffebbf/ids:0
> David, what does this message mean?
> It is clear that there is a conflict, but not what the conflicting item is. The host id in the request is 1, and in the conflicting item it is 3. No conflicting data is displayed in the error message.
The lockspace being added is already being managed by sanlock, but using host_id 3. sanlock.log should show when lockspace e1270474 with host_id 3 was added.

Dave
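[Editorial note: the conflicting items in the quoted message are lockspace specs of the form NAME:HOST_ID:PATH:OFFSET; the second field is the host_id David refers to. A quick sketch pulling it out of the spec quoted in this thread:]

```shell
# The conflicting lockspace spec from this thread; field 2 is the host_id.
spec="e1270474-108c-4cae-83d6-51698cffebbf:3:/dev/e1270474-108c-4cae-83d6-51698cffebbf/ids:0"
host_id=$(printf '%s' "$spec" | cut -d: -f2)
echo "$host_id"
```

The request in the error carries host_id 1, while the spec above shows the lockspace was already registered with host_id 3, which is exactly the mismatch being discussed.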

On Sat, Sep 5, 2020 at 12:45 AM David Teigland <teigland@redhat.com> wrote:
On Sat, Sep 05, 2020 at 12:25:45AM +0300, Nir Soffer wrote:
> > /var/log/sanlock.log contains a repeating:
> > add_lockspace e1270474-108c-4cae-83d6-51698cffebbf:1:/dev/e1270474-108c-4cae-83d6-51698cffebbf/ids:0 conflicts with name of list1 s1 e1270474-108c-4cae-83d6-51698cffebbf:3:/dev/e1270474-108c-4cae-83d6-51698cffebbf/ids:0
> David, what does this message mean?
> It is clear that there is a conflict, but not what the conflicting item is. The host id in the request is 1, and in the conflicting item, 3. No conflicting data is displayed in the error message.
> The lockspace being added is already being managed by sanlock, but using host_id 3. sanlock.log should show when lockspace e1270474 with host_id 3 was added.
Do you mean that the host reporting this already joined the lockspace with host_id=3, and then tried to join again with host_id=1?

On 9/4/20, 2:26 PM, "Nir Soffer" <nsoffer@redhat.com> wrote:
> Eric, is it possible that you shut down the host too quickly, before it actually disconnected from the lockspace?
> When the engine moves a host to maintenance, it does not wait until the host actually moves into maintenance. This is actually a bug, so it would be a good idea to file a bug about this.

That is a possibility. From the UI it usually takes a while for the host to show as in maintenance, so I assumed that was an accurate representation of the state. Unfortunately all hosts have since been completely wiped and re-installed; this issue brought down the entire cluster for over a day, so I needed to get everything up again ASAP.

I did not archive/backup the sanlock logs beforehand, so I can't check for the sanlock events David mentioned. When I cleared the sanlock there were no "s" or "r" entries listed in "sanlock client status", and there were no other running hosts to obtain other locks, but I don't fully grok sanlock; maybe some lock existed only on the iSCSI storage, separate from any current or past host.

On Sat, Sep 5, 2020 at 1:49 AM Gillingham, Eric J (US 393D) <eric.j.gillingham@jpl.nasa.gov> wrote:
> > Eric, is it possible that you shut down the host too quickly, before it actually disconnected from the lockspace?
> > When the engine moves a host to maintenance, it does not wait until the host actually moves into maintenance. This is actually a bug, so it would be a good idea to file a bug about this.
> That is a possibility, from the UI view it usually takes a while for the host to show as in maintenance, so I assumed it was an accurate representation of the state. Unfortunately all hosts have since been completely wiped and re-installed; this issue brought down the entire cluster for over a day, so I needed to get everything up again ASAP.
> I did not archive/backup the sanlock logs beforehand, so I can't check for the sanlock events David mentioned. When I cleared the sanlock there were no "s" or "r" entries listed in "sanlock client status", and there were no other running hosts to obtain other locks, but I don't fully grok sanlock; maybe some lock existed only on the iSCSI storage, separate from any current or past host.
Looks like we lost all the evidence. If this happens again, please file a bug and attach the logs.

Nir
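[Editorial note: the "no s or r entries" above refers to the output of "sanlock client status", which prefixes lockspace lines with "s" and resource lines with "r". A small sketch counting them from captured output; the sample lines are illustrative, not real output from this cluster:]

```shell
# Count lockspace ("s") and resource ("r") lines in sanlock client status
# output. On a live host you would pipe the real output instead:
#   sanlock client status | awk '$1=="s"{s++} $1=="r"{r++} END{print s+0, r+0}'
sample='daemon 0146a1a2
p -1 helper
p -1 listener
s e1270474-108c-4cae-83d6-51698cffebbf:3:/dev/e1270474-108c-4cae-83d6-51698cffebbf/ids:0'
counts=$(printf '%s\n' "$sample" | awk '$1=="s"{s++} $1=="r"{r++} END{print s+0, r+0}')
echo "$counts"
```

An empty result ("0 0") would match Eric's observation that no lockspaces or resources remained registered on the host.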

Is this an HCI setup? If yes, check gluster status (I prefer the CLI, but it is also available in the UI):

gluster pool list
gluster volume status
gluster volume heal <VOL> info summary

Best Regards,
Strahil Nikolov

On Friday, September 4, 2020, 00:38:13 GMT+3, Gillingham, Eric J (US 393D) via Users <users@ovirt.org> wrote:
> I recently removed a host from my cluster to upgrade it to 4.4, after I removed the host from the datacenter VMs started to pause on the second system they all migrated to. [...]

This is using iSCSI storage.

I stopped the ovirt broker/agents/vdsm and used sanlock to remove the locks it was complaining about, but as soon as I started the ovirt tools up and the engine came online again, the same messages reappeared.

After spending more than a day trying to resolve this nicely I gave up: I installed ovirt-node on the host I originally removed, added that to the cluster, then removed and nuked the misbehaving host and did a clean install there. I did run into an issue where the first host had an empty hosted-engine.conf (it only had the cert and the id settings in it), so it wouldn't connect properly, but I worked around that by copying the fully populated one from the semi-working host and changing the id to match. No idea if this is the right solution, but it _seems_ to be working and my VMs are back to running; I just got too frustrated trying to debug through normal methods and to find solutions via the ovirt tools and documentation.

- Eric

On 9/4/20, 10:59 AM, "Strahil Nikolov" <hunter86_bg@yahoo.com> wrote:
> Is this an HCI setup? If yes, check gluster status (I prefer the CLI, but it is also available in the UI):
> gluster pool list
> gluster volume status
> gluster volume heal <VOL> info summary
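[Editorial note: a minimal sketch of the hosted-engine.conf workaround Eric describes. The path and the host_id key are the usual hosted-engine defaults, but verify both on your installation; the copy step is left commented out and the edit is demonstrated on a throwaway file rather than the live config:]

```shell
# Workaround sketch: copy a fully populated hosted-engine.conf from a
# working host, then give this host its own unique id. Path and key name
# are assumptions based on typical hosted-engine setups.
CONF=/etc/ovirt-hosted-engine/hosted-engine.conf
#   scp root@working-host:"$CONF" "$CONF"
# Change only the host_id line; demonstrated here on a temporary copy:
tmp=$(mktemp)
printf 'host_id=1\n' > "$tmp"
sed -i 's/^host_id=.*/host_id=2/' "$tmp"
new_id=$(grep '^host_id=' "$tmp")
echo "$new_id"
rm -f "$tmp"
```

Leaving two hosts with the same host_id would recreate exactly the lockspace conflict discussed earlier in this thread, which is why only the id line is changed.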
participants (5)
- David Teigland
- Gillingham, Eric J (US 393D)
- Nir Soffer
- Strahil Nikolov
- Vojtech Juranek