Managed Block Storage: ceph detach_volume failing after migration
by Dan Poltawski
On oVirt 4.3.5 we are seeing various problems related to the rbd device staying mapped after a guest has been live migrated. This causes problems migrating the guest back, as well as rebooting the guest when it starts back up on the original host. The error returned is 'rbd: unmap failed: (16) Device or resource busy'. I've pasted the full vdsm log below.
As far as I can tell this isn’t happening 100% of the time, and seems to be more prevalent on busy guests.
(Not sure if I should create a bug for this, so thought I’d start here first)
Thanks,
Dan
Sep 24 19:26:18 mario vdsm[5485]: ERROR FINISH detach_volume error=Managed Volume Helper failed.: ('Error executing helper: Command ['/usr/libexec/vdsm/managedvolume-helper', 'detach'] failed with rc=1 out='' err='
oslo.privsep.daemon: Running privsep helper: ['sudo', 'privsep-helper', '--privsep_context', 'os_brick.privileged.default', '--privsep_sock_path', '/tmp/tmptQzb10/privsep.sock']
oslo.privsep.daemon: Spawned new privsep daemon via rootwrap
oslo.privsep.daemon: privsep daemon starting
oslo.privsep.daemon: privsep process running with uid/gid: 0/0
oslo.privsep.daemon: privsep process running with capabilities (eff/prm/inh): CAP_SYS_ADMIN/CAP_SYS_ADMIN/none
oslo.privsep.daemon: privsep daemon running as pid 76076
Traceback (most recent call last):
  File "/usr/libexec/vdsm/managedvolume-helper", line 154, in <module>
    sys.exit(main(sys.argv[1:]))
  File "/usr/libexec/vdsm/managedvolume-helper", line 77, in main
    args.command(args)
  File "/usr/libexec/vdsm/managedvolume-helper", line 149, in detach
    ignore_errors=False)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/nos_brick.py", line 121, in disconnect_volume
    run_as_root=True)
  File "/usr/lib/python2.7/site-packages/os_brick/executor.py", line 52, in _execute
    result = self.__execute(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/os_brick/privileged/rootwrap.py", line 169, in execute
    return execute_root(*cmd, **kwargs)
  File "/usr/lib/python2.7/site-packages/oslo_privsep/priv_context.py", line 241, in _wrap
    return self.channel.remote_call(name, args, kwargs)
  File "/usr/lib/python2.7/site-packages/oslo_privsep/daemon.py", line 203, in remote_call
    raise exc_type(*result[2])
oslo_concurrency.processutils.ProcessExecutionError: Unexpected error while running command.
Command: rbd unmap /dev/rbd/rbd/volume-0e8c1056-45d6-4740-934d-eb07a9f73160 --conf /tmp/brickrbd_LCKezP --id ovirt --mon_host 172.16.10.13:3300 --mon_host 172.16.10.14:3300 --mon_host 172.16.10.12:6789
Exit code: 16
Stdout: u''
Stderr: u'rbd: sysfs write failed\nrbd: unmap failed: (16) Device or resource busy\n'
',)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 124, in method
    ret = func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/API.py", line 1766, in detach_volume
    return managedvolume.detach_volume(vol_id)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/managedvolume.py", line 67, in wrapper
    return func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/managedvolume.py", line 135, in detach_volume
    run_helper("detach", vol_info)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/managedvolume.py", line 179, in run_helper
    sub_cmd, cmd_input=cmd_input)
  File "/usr/lib/python2.7/site-packages/vdsm/common/supervdsm.py", line 56, in __call__
    return callMethod()
  File "/usr/lib/python2.7/site-packages/vdsm/common/supervdsm.py", line 54, in <lambda>
    **kwargs)
  File "<string>", line 2, in managedvolume_run_helper
  File "/usr/lib64/python2.7/multiprocessing/managers.py", line 773, in _callmethod
    raise convert_to_error(kind, result)
ManagedVolumeHelperFailed: Managed Volume Helper failed.: ('Error executing helper: Command ['/usr/libexec/vdsm/managedvolume-helper', 'detach'] failed with rc=1 out='' err=' [same privsep output, helper traceback and rbd unmap error as above] ',)
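For anyone hitting the same thing: the EBUSY means the kernel still thinks something holds the RBD device on the source host. Below is a minimal diagnostic sketch (assuming root access on the host and the standard rbd CLI; the device paths are illustrative, and the krbd force option is a last resort that can disrupt I/O if the image really is still in use):

# Sketch: list RBD devices that stayed mapped, show what still holds them,
# and optionally retry the unmap with the krbd "force" map option.
# Assumes the stock 'rbd' CLI and sysfs layout; run as root on the host.
import glob
import os
import subprocess

def mapped_rbd_devices():
    # 'rbd showmapped' prints the pool/image/device table for this host
    return subprocess.check_output(["rbd", "showmapped"])

def holders(device):
    # Kernel holders (e.g. a leftover device-mapper entry) referencing /dev/rbdN
    name = os.path.basename(device)
    return [os.path.basename(h) for h in glob.glob("/sys/block/%s/holders/*" % name)]

def try_unmap(device, force=False):
    cmd = ["rbd", "unmap"]
    if force:
        cmd += ["-o", "force"]          # last resort, see note above
    cmd.append(device)
    return subprocess.call(cmd)

if __name__ == "__main__":
    print(mapped_rbd_devices())
    for dev in glob.glob("/dev/rbd[0-9]*"):
        print("%s holders: %s" % (dev, holders(dev)))

If the holders list (or lsof against the device) shows something left over from the migration, that is what keeps 'rbd unmap' returning (16) Device or resource busy.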
Urgent help needed / Snapshot deletion failure
by smirta@gmx.net
Dear all, we are desperate.
We have tried to delete a snapshot and it hangs while merging the snapshots. We've found out that this is a known bug: the merge process is called with wrong parameters, according to the bug report https://bugzilla.redhat.com/show_bug.cgi?id=1601212. The snapshot's disks are in an illegal state and the VM is locked. Since it is a very important VM for our non-profit organization, we need to have this machine back online as soon as possible. Is there a way to fix this without updating to qemu-kvm-ev-2.12.0? At least getting back to the state before the deletion attempt would be fantastic.
Our vdsm.log:
2019-09-25 14:01:41,283+0200 WARN (vdsm.Scheduler) [Executor] Worker blocked: <Worker name=jsonrpc/0 running <Task <JsonRpcTask {'params': {u'topVolUUID': u'8a9f190f-2725-4535-a69d-c74e4e57d372', u'vmID': u'1422899a-2151-4d5d-9d66-e74f19084542', u'drive': {u'imageID': u'eb2bce92-e758-4bea-93fa-02a56574b932', u'volumeID': u'8a9f190f-2725-4535-a69d-c74e4e57d372', u'domainID': u'022f39ee-eeb8-4b51-9549-9d7e3c88d4a8', u'poolID': u'00000001-0001-0001-0001-000000000307'}, u'bandwidth': u'0', u'jobUUID': u'c3fdc4a9-9d6d-424a-b9df-b96be5622e0a', u'baseVolUUID': u'3e15121b-0795-4056-bafe-448068c9ec71'}, 'jsonrpc': '2.0', 'method': u'VM.merge', 'id': u'9cd540b7-a32f-4f95-9fe2-9ce70d5b6478'} at 0x7f674fec5710> timeout=60, duration=6420 at 0x7f674fec58d0> task#=1896299 at 0x7f674c06ae90>, traceback:
File: "/usr/lib64/python2.7/threading.py", line 785, in __bootstrap
self.__bootstrap_inner()
File: "/usr/lib64/python2.7/threading.py", line 812, in __bootstrap_inner
self.run()
File: "/usr/lib64/python2.7/threading.py", line 765, in run
self.__target(*self.__args, **self.__kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/common/concurrent.py", line 194, in run
ret = func(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 301, in _run
self._execute_task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 315, in _execute_task
task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 391, in __call__
self._callable()
File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 523, in __call__
self._handler(self._ctx, self._req)
File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 566, in _serveRequest
response = self._handle_request(req, ctx)
File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 606, in _handle_request
res = method(**params)
File: "/usr/lib/python2.7/site-packages/vdsm/rpc/Bridge.py", line 197, in _dynamicMethod
result = fn(*methodArgs)
File: "<string>", line 2, in merge
File: "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 48, in method
ret = func(*args, **kwargs)
File: "<string>", line 2, in merge
File: "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 122, in method
ret = func(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/API.py", line 739, in merge
drive, baseVolUUID, topVolUUID, bandwidth, jobUUID)
File: "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 6041, in merge
self.updateVmJobs()
File: "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 5818, in updateVmJobs
self._vmJobs = self.queryBlockJobs()
File: "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 5832, in queryBlockJobs
with self._jobsLock:
File: "/usr/lib/python2.7/site-packages/pthreading.py", line 60, in __enter__
self.acquire()
File: "/usr/lib/python2.7/site-packages/pthreading.py", line 68, in acquire
rc = self.lock() if blocking else self.trylock()
File: "/usr/lib/python2.7/site-packages/pthread.py", line 96, in lock
return _libpthread.pthread_mutex_lock(self._mutex) (executor:363)
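A minimal sketch of how one might first check what is actually still locked or running before attempting any cleanup. It assumes the unlock_entity.sh helper shipped with ovirt-engine (run on the engine host) and read-only virsh on the hypervisor; the VM and disk names are placeholders, and nothing should be unlocked until the merge job is confirmed to be gone:

# Sketch: inspect locked entities and the state of the merge block job.
# The unlock_entity.sh path is the stock ovirt-engine dbutils location;
# "important-vm" / "vda" below are placeholders.
import subprocess

UNLOCK_TOOL = "/usr/share/ovirt-engine/setup/dbutils/unlock_entity.sh"

def list_locked_entities():
    # Query mode (-q) only reports locked VMs/templates/disks/snapshots,
    # it does not change anything in the engine database.
    return subprocess.check_output([UNLOCK_TOOL, "-q", "-t", "all"])

def merge_job_info(vm_name, disk_target):
    # Ask libvirt (read-only) whether the block commit/merge is still active.
    return subprocess.check_output(
        ["virsh", "-r", "blockjob", vm_name, disk_target, "--info"])

if __name__ == "__main__":
    print(list_locked_entities())                    # run on the engine host
    # print(merge_job_info("important-vm", "vda"))   # run on the hypervisor

Only once the block job is no longer reported on the host does it make sense to clear the illegal/locked state on the engine side (for example with the same tool's unlock mode); doing it while the merge is still in flight risks corrupting the volume chain.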
Kind regards
Simon
How to delete obsolete Data Centers with no hosts, but with domains inside
by Claudio Soprano
Hi to all,
We are using oVirt to manage 6 Data Centers; 3 of them are old Data Centers with no hosts inside, but they still contain storage domains and VMs that are not running.
We left them because we wanted to keep some backups in case the newly created Data Centers failed.
Time has passed and now we would like to remove these Data Centers, but so far we have found no way to do it.
If we try to remove the Storage Domains (using Remove or Destroy) we get:
"Error while executing action: Cannot destroy the master Storage Domain from the Data Center without another active Storage Domain to take its place.
-Either activate another Storage Domain in the Data Center, or remove the Data Center.
-If you have problems with the master Data Domain, consider following the recovery process described in the documentation, or contact your system administrator."
If we try to remove the Data Center directly we get:
"Error while executing action: Cannot remove Data Center. There is no active Host in the Data Center."
How can we solve this problem? Can it be done via ovirt-shell, a script, or the oVirt management interface?
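In case it helps, here is a minimal sketch of the script route using the Python SDK (ovirt-engine-sdk-python 4); the engine URL, credentials and data-center name are placeholders. The REST API's force flag removes a data center even when it has no active host, and the storage domains inside are left unattached afterwards, so only do this for data you really no longer need:

# Sketch: force-remove an obsolete data center with the oVirt Python SDK v4.
# URL, credentials and the data-center name below are placeholders.
import ovirtsdk4 as sdk

connection = sdk.Connection(
    url="https://engine.example.com/ovirt-engine/api",
    username="admin@internal",
    password="password",
    ca_file="/etc/pki/ovirt-engine/ca.pem",
)
try:
    dcs_service = connection.system_service().data_centers_service()
    dc = dcs_service.list(search="name=OldDC1")[0]
    # force=True maps to the REST API's force parameter and bypasses the
    # "no active Host in the Data Center" check.
    dcs_service.data_center_service(dc.id).remove(force=True)
finally:
    connection.close()

The Administration Portal offers the same thing as "Force Remove" on the data center, after which the leftover storage domains can be removed or destroyed individually.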
Thanks in advance
Claudio
Add External Provider - OpenStack Glance - Test Failed - "Failed to communicate with the External Provider"
by Pravin Mohandass
We have installed oVirt Manager 4.3 with a couple of KVM compute hosts added to a cluster.
When we add OpenStack Glance as an External Provider, the test fails with the following error message: "Failed to communicate with the External Provider".
We are able to add external providers for OpenStack Neutron and OpenStack Cinder, and to import their networks and storage into oVirt Manager and use them for VMs.
The engine.log file shows "Failed with error PROVIDER_FAILURE and code 5050", yet we are able to fetch the images from the same Glance endpoint using Postman.
Could you provide some insight into adding the Glance provider?
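A minimal sketch of what the provider test roughly boils down to, which can be run from the engine machine itself to compare against what Postman sees (the Keystone and Glance endpoints, project and credentials below are placeholders); differences in CA certificates or proxy settings between the engine host and the Postman client are a common reason the engine reports PROVIDER_FAILURE while Postman succeeds:

# Sketch: request a Keystone v3 token and list Glance v2 images from the
# engine host. All endpoints and credentials below are placeholders.
import requests

KEYSTONE = "http://keystone.example.com:5000/v3/auth/tokens"
GLANCE = "http://glance.example.com:9292/v2/images"

auth = {
    "auth": {
        "identity": {
            "methods": ["password"],
            "password": {"user": {"name": "admin",
                                  "domain": {"id": "default"},
                                  "password": "secret"}},
        },
        "scope": {"project": {"name": "admin",
                              "domain": {"id": "default"}}},
    }
}

resp = requests.post(KEYSTONE, json=auth, timeout=10)
resp.raise_for_status()
token = resp.headers["X-Subject-Token"]   # Keystone v3 returns the token here

images = requests.get(GLANCE, headers={"X-Auth-Token": token}, timeout=10)
print("%s, %d images visible" % (images.status_code,
                                 len(images.json().get("images", []))))

If this works from the engine host with the exact URL configured in the provider dialog, the next place to look is the engine's trust store or proxy configuration rather than Glance itself.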
[ANN] oVirt 4.3.6 Sixth Release Candidate is now available for testing
by Sandro Bonazzola
The oVirt Project is pleased to announce the availability of the oVirt
4.3.6 Sixth Release Candidate for testing, as of September 25th, 2019.
This update is a release candidate of the sixth in a series of
stabilization updates to the 4.3 series.
This is pre-release software. This pre-release should not be used in
production.
This release is available now on x86_64 architecture for:
* Red Hat Enterprise Linux 7.7 or later (but <8)
* CentOS Linux (or similar) 7.7 or later (but <8)
This release supports Hypervisor Hosts on x86_64 and ppc64le architectures
for:
* Red Hat Enterprise Linux 7.7 or later (but <8)
* CentOS Linux (or similar) 7.7 or later (but <8)
* oVirt Node 4.3 (available for x86_64 only) has been built on the
CentOS 7.7 release
See the release notes [1] for known issues, new features and bugs fixed.
Notes:
- oVirt Appliance is already available
- oVirt Node is already available
Additional Resources:
* Read more about the oVirt 4.3.6 release highlights:
http://www.ovirt.org/release/4.3.6/
* Get more oVirt Project updates on Twitter: https://twitter.com/ovirt
* Check out the latest project news on the oVirt blog:
http://www.ovirt.org/blog/
[1] http://www.ovirt.org/release/4.3.6/
[2] http://resources.ovirt.org/pub/ovirt-4.3-pre/iso/
--
Sandro Bonazzola
MANAGER, SOFTWARE ENGINEERING, EMEA R&D RHV
Red Hat EMEA <https://www.redhat.com/>
sbonazzo(a)redhat.com
Red Hat respects your work life balance. Therefore there is no need to
answer this email out of your office hours.
Messed up 4.2.3.1 installation - SSL handshake ERROR
by souvaliotimaria@mail.com
Hello, everyone!
So, I have an experimental installation of oVirt 4.2.3.1 with 3 nodes and GlusterFS storage. Recently I deployed a new installation with oVirt 4.3.5.2, 3 nodes and GlusterFS storage here as well. The thing is, in my enthusiasm I thought "hey! what if I can import the experimental nodes as hosts in the new installation in a new cluster and see what happens? Will the 4.3.5.2 engine see them? Probably yeah. But will it see the VMs I have there?"
And so I imported the experimental nodes, without detaching them from their hosted engine. I could see the only VM that was active at the moment, but none of the suspended ones, and of course I could not see the 4.2.3.1 HE VM.
I have removed the hosts from the new installation and I have tried reconnecting the old engine and its nodes. Passwordless ssh works just fine, but the problem persists.
hosted-engine --vm-status reports stale-data on node 2 and node 3
The thing is, I know I messed up the experimental installation (and I blame only my curiosity): the SSL handshake is no longer feasible and I can't remove the hosts from the initial cluster to import them again. Basically everything is either stuck activating without ever succeeding, or down, or non-responsive.
I would like to find a way around this, as I have seen in other posts on the oVirt forum that the SSL handshake error appears in other cases too, and I would like to know how to handle it if a situation like this occurs in production in the future.
Is it possible to re-deploy the engine on the nodes without losing the Gluster storage or the existing VMs? Can the HE be destroyed and then deployed from scratch? What about the Gluster storage and the VMs' disks? Will the VMs just take up space, with no way to either bring them up or destroy them?
I know I'm asking a lot, and it was my fault to begin with, but I am really curious whether we can see this through.
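Not an answer to the redeploy question, but a minimal sketch of how one might at least see which certificate the broken side presents during the failing handshake (the host name and CA path are placeholders; vdsm listens on port 54321 by default), wrapping openssl via subprocess:

# Sketch: dump the certificate chain a host's vdsm presents on port 54321,
# so it can be compared/verified against the engine CA.
# Host name and CA file path below are placeholders.
import subprocess

HOST = "node1.example.com"
ENGINE_CA = "/etc/pki/ovirt-engine/ca.pem"   # on the engine machine

def vdsm_server_cert(host, port=54321):
    proc = subprocess.Popen(
        ["openssl", "s_client", "-connect", "%s:%d" % (host, port),
         "-showcerts"],
        stdin=subprocess.PIPE, stdout=subprocess.PIPE,
        stderr=subprocess.PIPE)
    out, err = proc.communicate(b"")
    return out, err

if __name__ == "__main__":
    out, err = vdsm_server_cert(HOST)
    print(out)
    print(err)   # handshake errors (e.g. unknown CA) show up here

Comparing the issuer and expiry of that certificate with the CA of the engine that is supposed to manage the host usually shows whether the hosts are still enrolled against the old 4.2 engine's CA, which would explain why the new engine (or the re-added old one) cannot complete the handshake.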
Thanks in advance
oVirt node - post-upgrade tasks
by dan.munteanu@mdc-berlin.de
Dear oVirt community,
recently I've started to use oVirt as a replacement for KVM + virt-manager, and I've decided to use oVirt Node installed on Dell EMC servers with a self-hosted engine. So far everything works fine, except that each time I update the nodes I must reinstall Dell OMSA, which is needed by the monitoring system (via SNMP). Is there any way to automate the OMSA installation as a post-upgrade task/hook?
Thank you
Dan
Got a RequestError: status: 409 reason: Conflict
by smidhunraj@gmail.com
I'm trying to clone a snapshot into a new VM. The tool I am using is oVirtBackup from GitHub (wefixit-AT); the link is https://github.com/wefixit-AT/oVirtBackup
This piece of code throws the error:
if not config.get_dry_run():
    # Clone the VM from the snapshot; this add() call is what returns the 409 Conflict
    api.vms.add(params.VM(name=vm_clone_name, memory=vm.get_memory(), cluster=api.clusters.get(config.get_cluster_name()), snapshots=snapshots_param))
    VMTools.wait_for_vm_operation(api, config, "Cloning", vm_from_list)
    print 'hellooooo'
    logger.info("Cloning finished")
The above lines are from around line 325 of backup.py in that repository.
I am getting the following error:
!!! Got a RequestError:
status: 409
reason: Conflict
detail: Cannot add VM. The VM is performing an operation on a Snapshot. Please wait for the operation to finish, and try again.
How can I further debug the code to find out what is going wrong in my program? I am new to Python; please help me.
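The 409 usually just means the snapshot the backup created is still locked (being committed or removed) at the moment api.vms.add() is called. Below is a minimal sketch of a wait loop that could be dropped in before the clone step; it assumes the old SDK objects backup.py already uses and that the snapshot state is readable via get_snapshot_status() (adjust the accessor if your SDK version names it differently):

# Sketch: block until no snapshot of the source VM is still locked.
# Uses the same ovirt-engine-sdk (v3-style) 'api' object as backup.py;
# get_snapshot_status() is the assumed status accessor.
import time

def wait_for_snapshots_ok(api, vm_name, timeout=1800, interval=10):
    waited = 0
    while waited < timeout:
        vm = api.vms.get(vm_name)
        states = [s.get_snapshot_status() for s in vm.snapshots.list()]
        if all(state == "ok" for state in states):
            return
        time.sleep(interval)
        waited += interval
    raise RuntimeError("snapshots still not ok after %s seconds" % timeout)

Calling something like wait_for_snapshots_ok(api, vm_from_list) right before the api.vms.add(...) line (and printing the snapshot states there) should at least show whether the conflict comes from the backup's own snapshot or from some other snapshot operation running on that VM at the same time.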
Super Low VM disk IO via Shared Storage
by Vrgotic, Marko
Dear oVirt,
I have run some tests of disk IO speed on VMs running on shared storage and on local storage in oVirt.
Results of the tests on local storage domains:
avlocal2:
[root@mpollocalcheck22 ~]# dd if=/dev/zero of=/tmp/test2.img bs=512 count=100000 oflag=dsync
100000+0 records in
100000+0 records out
51200000 bytes (51 MB) copied, 45.9756 s, 1.1 MB/s
avlocal3:
[root@mpollocalcheck3 ~]# dd if=/dev/zero of=/tmp/test2.img bs=512 count=100000 oflag=dsync
100000+0 records in
100000+0 records out
51200000 bytes (51 MB) copied, 43.6179 s, 1.2 MB/s
Results of the test on shared storage domain:
avshared:
[root@mpoludctest4udc-1 ~]# dd if=/dev/zero of=/tmp/test2.img bs=512 count=100000 oflag=dsync
100000+0 records in
100000+0 records out
51200000 bytes (51 MB) copied, 283.499 s, 181 kB/s
Why is it so low? Is there anything I can tune or configure in VDSM or another service to speed this up?
Any advice is appreciated.
The shared storage is a NetApp volume reached over a 20 Gbps LACP path from the hypervisor, with MTU 9000. The protocol used is NFS 4.0.
oVirt is 4.3.4.3 SHE.
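One thing worth keeping in mind before tuning: with bs=512 and oflag=dsync, dd issues 100,000 synchronous 512-byte writes, so the result measures per-write commit latency rather than bandwidth. A quick back-of-the-envelope from the numbers above (plain arithmetic, no assumptions about the storage itself):

# Back-of-the-envelope: what the dd results imply in IOPS and latency.
writes = 100000      # dd count
block = 512          # bytes per write (bs=512)

for name, seconds in [("avlocal2", 45.9756),
                      ("avlocal3", 43.6179),
                      ("avshared", 283.499)]:
    iops = writes / seconds
    latency_ms = 1000.0 * seconds / writes
    throughput_kb = writes * block / seconds / 1000.0
    print("%-9s ~%4.0f sync writes/s  ~%.2f ms/write  ~%.0f kB/s"
          % (name, iops, latency_ms, throughput_kb))

That works out to roughly half a millisecond per synchronous write locally versus about 2.8 ms over NFS, which is in the range one would expect for a network round trip plus a stable-storage commit on the filer. Rerunning the test with a larger block size (or without dsync on every write) would say much more about the actual bandwidth of the 20 Gbps path than this latency-bound test does.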