Hey guys,
I have tried asking about this issue on the IRC channel, but I'm not sure whether I'm
doing it wrong or whether there simply aren't many people watching the #ovirt channel.
We have deployed a 6-node oVirt cluster and have moved roughly 100 VMs onto it. We
started the cluster with NFS storage while we were building and testing our Ceph cluster.
Ceph has been ready for a few months now, and we have tested running oVirt VMs on it
using RBD, CephFS, and the NFS Ganesha server that can be enabled on Ceph.
Initially I tested with RBD using the cinderlib functionality in oVirt. This appears to be
working fine, and I can live with the fact that live storage migrations from POSIX storage
are not possible. The biggest hurdle we encountered is that the backup APIs in oVirt
cannot be used to create and download backups. This breaks our RHV Veeam backup solution,
which we have grown quite fond of, and even my older homegrown backup scripts don't work,
as they use the same APIs.
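For reference, the call that fails on the cinderlib-managed disks is the backup API itself;
my scripts hit it roughly like this (the engine hostname, credentials and UUIDs below are
placeholders):

    # Start a VM backup through the oVirt REST API (the call our backup tooling relies on;
    # it works on file-based storage domains but not on the RBD/cinderlib disks):
    curl -k -u admin@internal:PASSWORD \
         -H 'Content-Type: application/xml' \
         -d '<backup><disks><disk id="DISK-UUID"/></disks></backup>' \
         https://engine.example.com/ovirt-engine/api/vms/VM-UUID/backups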
For this reason we have now changed to CephFS. It has a different set of problems: we can
only supply one monitor when mounting the CephFS storage, making it less robust than it
should be, and it makes metadata access dependent on the MDS. As far as I understand,
data access still goes directly to the OSDs where the data resides. I have multiple MDS
containers on the Ceph cluster for load balancing and redundancy, but it still feels less
tidy than RBD with native Ceph support. The good thing is that, as CephFS is a POSIX
filesystem, the backup APIs work and so does the Veeam backup.
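To illustrate what I mean by "less robust": a plain kernel CephFS mount can be pointed at
the whole monitor quorum, roughly like this (monitor addresses and the cephx user are made
up), whereas in the oVirt POSIX-compliant FS storage domain I only get a single monitor
into the path:

    # Kernel CephFS mount with the full monitor quorum in the device string:
    mount -t ceph 10.0.0.1:6789,10.0.0.2:6789,10.0.0.3:6789:/ /mnt/cephfs \
          -o name=ovirt,secretfile=/etc/ceph/ovirt.secret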
The biggest problem I am struggling with is the unexplained pausing of VMs, and it affects
exclusively VMs running the Windows operating system. This is why I didn't notice it
initially, as I had been testing the storage with Linux VMs. The machines pause at a random
moment in time with a "Storage Error". I cannot resume the machine, not even using virsh
on the host it is running on. The only thing I can do is power it off and reboot it, and in
many cases it then pauses again during the boot process. I have been watching the VDSM logs
and couldn't work out why this happens. When I move the disk to plain NFS storage (not on
Ceph), this never happens. The result is that most Windows-based VMs have not been moved to
CephFS yet. I have zero occurrences of this with Linux VMs, of which I have many (I think
we have 80 Linux VMs vs 20 Windows VMs).
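For anyone willing to dig into this with me, these are roughly the commands I have been
using on the host when a VM pauses (the VM name is a placeholder):

    # Show the libvirt pause reason for the guest (e.g. "paused (I/O error)"):
    virsh -r domstate Windows-VM-01 --reason
    # Check the qemu log for block I/O errors around the time of the pause:
    less /var/log/libvirt/qemu/Windows-VM-01.log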
The Ceph cluster does not show any problems before, during or after this happens. Is there
anyone with experience with oVirt and Ceph who can share their experiences or help me find
the root cause of this problem? I have a couple of things I could try:
1. Move a couple of Windows VMs to RBD and install a Veeam agent on the servers instead.
This just opens a new can of worms, as it requires work on each server every time I add a
new one, and I also need to open the network from the servers to the backup repositories.
It is simply a little less neat than the RHV Veeam backup solution that goes through the
hypervisor.
2. Move a couple of Windows VMs to the NFS Ganesha implementation. This means ALL traffic
goes through the NFS containers created on the Ceph cluster, and I lose the distributed
nature of the oVirt hosts talking directly to the OSDs on the Ceph cluster. If I were to go
this way, I should probably create some dedicated NFS Ganesha servers that connect to Ceph
natively on one end and provide NFS services to oVirt on the other (a rough sketch of such
an export follows below).
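If I go that route, the exports on those dedicated gateways would look something like the
sketch below (an nfs-ganesha export backed by FSAL_CEPH; the export ID, paths and cephx
user are made up):

    EXPORT {
        Export_Id = 10;
        Path = "/ovirt";            # directory inside CephFS to export
        Pseudo = "/ovirt";          # pseudo path the oVirt hosts would mount
        Access_Type = RW;
        Squash = No_Root_Squash;
        FSAL {
            Name = CEPH;            # Ganesha talks to Ceph natively via libcephfs
            User_Id = "ganesha";    # cephx user for the gateway
        }
    }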
Both tests would still exercise Ceph, just through an alternative access method to CephFS.
My preferred solution really is option 1, were it not for the backup APIs being rendered
useless. Is development still being carried out in these areas, or has oVirt/Ceph
development ceased?
Thoughts and comments are welcome. Looking forward to sparring with someone who has
experience with this :-)
With kind regards
Jelle