----- Original Message -----
From: "Nir Soffer" <nsoffer(a)redhat.com>
To: "Trey Dockendorf" <treydock(a)gmail.com>
Cc: "users" <users(a)ovirt.org>, "Michal Skrivanek"
<mskrivan(a)redhat.com>
Sent: Wednesday, February 12, 2014 10:04:04 AM
Subject: Re: [Users] Host Non-Operational from sanlock and VM fails to migrate
[...]
The vm was starting a migration to the other host:
Thread-26::DEBUG::2014-02-03 07:49:18,067::BindingXMLRPC::965::vds::(wrapper) client [192.168.202.99]::call vmMigrate with ({'tunneled': 'false', 'dstqemu': '192.168.202.103', 'src': 'vm01.brazos.tamu.edu', 'dst': 'vm02.brazos.tamu.edu:54321', 'vmId': '741f9811-db68-4dc4-a88a-7cb9be576e57', 'method': 'online'},) {} flowID [7829ae2a]
Thread-26::DEBUG::2014-02-03 07:49:18,067::API::463::vds::(migrate) {'tunneled': 'false', 'dstqemu': '192.168.202.103', 'src': 'vm01.brazos.tamu.edu', 'dst': 'vm02.brazos.tamu.edu:54321', 'vmId': '741f9811-db68-4dc4-a88a-7cb9be576e57', 'method': 'online'}
Thread-26::DEBUG::2014-02-03 07:49:18,068::BindingXMLRPC::972::vds::(wrapper) return vmMigrate with {'status': {'message': 'Migration in progress', 'code': 0}, 'progress': 0}
The migration was almost complete after 20 seconds:
Thread-29::INFO::2014-02-03 07:49:38,329::vm::815::vm.Vm::(run) vmId=`741f9811-db68-4dc4-a88a-7cb9be576e57`::Migration Progress: 20 seconds elapsed, 99% of data processed, 99% of mem processed
But it never completed:
Thread-29::WARNING::2014-02-03 07:54:38,383::vm::792::vm.Vm::(run) vmId=`741f9811-db68-4dc4-a88a-7cb9be576e57`::Migration is stuck: Hasn't progressed in 300.054134846 seconds. Aborting.
CCing Michal to inspect why the migration has failed.
Hi,
I had a look at the logs, and this looks like another libvirt/QEMU I/O-related issue.
If QEMU on the source host cannot reliably access storage, the migration may get stuck.
That seems to be the case given the information provided: VDSM detected that the migration
was not progressing and aborted it.
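Just to illustrate the kind of check VDSM performs here, a rough sketch of such a
progress watchdog using the libvirt Python bindings could look like the following.
This is not the actual VDSM code; the 300-second timeout and the use of
jobInfo()/abortJob() are only illustrative.

import time
import libvirt

STALL_TIMEOUT = 300  # seconds without progress before giving up (illustrative)

def watch_migration(dom):
    """Abort the migration job if the amount of remaining data
    stops shrinking for STALL_TIMEOUT seconds."""
    last_remaining = None
    last_change = time.time()
    while True:
        # jobInfo() returns: [type, timeElapsed, timeRemaining,
        #   dataTotal, dataProcessed, dataRemaining,
        #   memTotal, memProcessed, memRemaining,
        #   fileTotal, fileProcessed, fileRemaining]
        info = dom.jobInfo()
        if info[0] == libvirt.VIR_DOMAIN_JOB_NONE:
            return  # migration finished (or is no longer running)
        data_remaining = info[5]
        if data_remaining != last_remaining:
            last_remaining = data_remaining
            last_change = time.time()
        elif time.time() - last_change > STALL_TIMEOUT:
            # No progress for too long: abort, much like the log above shows.
            dom.abortJob()
            return
        time.sleep(10)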
libvirt has an option (which we already use) to detect those scenarios, the
VIR_MIGRATE_ABORT_ON_ERROR flag, but unfortunately it is not yet 100% reliable,
for the reasons outlined below in this mail.
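For reference, the flag is simply OR-ed into the migration flags when the migration
is started. A minimal sketch with the libvirt Python bindings (the VM name and the
destination URI are made up for the example):

import libvirt

flags = (libvirt.VIR_MIGRATE_LIVE |
         libvirt.VIR_MIGRATE_PEER2PEER |
         libvirt.VIR_MIGRATE_ABORT_ON_ERROR)  # abort if an I/O error is reported

conn = libvirt.open('qemu:///system')
dom = conn.lookupByName('examplevm')  # hypothetical VM name

# Ask libvirt to give up as soon as an I/O error is *reported*; if QEMU is
# stuck in the kernel and never reports one, this flag cannot help.
dom.migrateToURI('qemu+tls://vm02.example.com/system', flags, None, 0)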
We are aware of this issue and are actively working to improve the handling of such
scenarios, but most of that work is actually on the QEMU side.
The core issue here is that when we use NFS (or iSCSI) and there is an I/O error,
QEMU can get blocked inside the kernel, waiting for the faulty I/O operation to complete,
and thus never report the error.
It really depends on which specific operation fails, and there are many possible cases
and error scenarios.
Of course, if QEMU is blocked and fails to report the I/O error, libvirt can do nothing
to report or recover from it, and VDSM can do even less.
This is known and acknowledged by both the libvirt and QEMU developers.
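For what it's worth, one way to confirm this situation from the host, outside of
libvirt, is to check whether the qemu process is sitting in uninterruptible sleep
('D' state), which is the typical sign of being stuck in the kernel on a faulty
mount. A rough diagnostic sketch; the PID is made up, in practice you would look it
up (e.g. with pgrep):

import os

def process_state(pid):
    """Return the scheduler state letter of a process, e.g. 'D' for
    uninterruptible sleep (stuck in the kernel, typically on I/O)."""
    with open('/proc/%d/stat' % pid) as f:
        data = f.read()
    # /proc/<pid>/stat: pid (comm) state ...
    # The comm field may contain spaces, so split after the closing ')'.
    return data.rsplit(')', 1)[1].split()[0]

# Illustrative usage: 12345 stands for the PID of the qemu process owning the VM.
if process_state(12345) == 'D':
    print("qemu is blocked in the kernel, probably on a faulty I/O operation")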
But there is some good news, because newer versions of QEMU have improvements in this
area: QEMU recently gained native block drivers[1], which, among other things, make it
more robust in the presence of I/O errors and should improve error reporting as well.
RHEL7 should ship a QEMU version with the native iSCSI driver; hopefully NFS will follow
soon enough.
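To give an idea of what "native" means in practice: with the in-QEMU iSCSI driver the
disk can be described to libvirt as a network disk, so QEMU talks to the target directly
(through libiscsi) instead of going through a kernel block device. A hypothetical
example, with made-up target, host and VM names, hot-plugging such a disk through the
Python bindings:

import libvirt

# Hypothetical network disk using QEMU's native iSCSI driver; the target,
# host and device names below are invented for illustration only.
disk_xml = """
<disk type='network' device='disk'>
  <driver name='qemu' type='raw'/>
  <source protocol='iscsi' name='iqn.2014-02.com.example:storage/1'>
    <host name='iscsi.example.com' port='3260'/>
  </source>
  <target dev='vdb' bus='virtio'/>
</disk>
"""

conn = libvirt.open('qemu:///system')
dom = conn.lookupByName('examplevm')  # hypothetical VM name
dom.attachDeviceFlags(disk_xml, libvirt.VIR_DOMAIN_AFFECT_LIVE)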
HTH,
+++
[1] For example, the iSCSI driver, recently merged:
http://comments.gmane.org/gmane.comp.emulators.qemu/92599
Work on NFS is ongoing.
--
Francesco Romani
Red Hat Engineering Virtualization R & D
IRC: fromani