XCP-ng, first impressions as an oVirt HCI alternative


Comments & motivational stuff were moved to the end...
Source/license: Xen, the hypervisor, has moved to the Linux Foundation. Perpetual open source, free to use. XCP-ng is a distribution of Xen, produced by a small French company, currently built on a Linux 4.19 LTS kernel and an EL7 userland frozen in July 2020: they promise open source and free to use forever.
Mode of operation: You install XCP-ng on your (bare metal) hardware. You manage nodes via Xen Orchestra, which is a big Node.js application you can run in pretty much any way you want.
Business model: You can buy support for XCP-ng at different levels of quality. Xen Orchestra, as an appliance, exposes different levels of functionality depending on the level of support you buy. But you get the full source code of the appliance and can compile it yourself to unlock the full feature set. There is a script out there that auto-generates the appliance with a single command. In short, you are never forced to pay, but better help can be purchased.
How does it feel to a CentOS/RHEL user? The userland on the nodes is EL7, but you shouldn't touch that. The CLI is classic Xen, nothing like KVM or oVirt. I guess libvirt and virsh should feel similar, if they live up to their cross-hypervisor promise at all. The standard userland on the Orchestra appliance is Debian, but you can build it on pretty much any Linux with that script: all interaction is meant to happen via the web UI.
Installation/setup: There is an image/ISO much like the oVirt node image, based on a Linux 4.19 LTS kernel, an EL7 userland frozen in July 2020, and a freshly maintained Xen with tools. Installation on bare metal or in VMs (e.g. for nested experimentation) is a snap; the HCL isn't extraordinary. I'm still fighting to get the 2.5/5 Gbit USB3 NIC working that I like to use for my smallest test systems.
A single command on one node will download the "free" Orchestra appliance (aka XOA) and install it as a VM on that node. It is set to auto-launch; just point your browser at its IP to start with the GUI.
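At the time of writing, the documented quick-deploy one-liner, run on a node, was roughly this (check the current XCP-ng/XOA docs before trusting it):

    bash -c "$(wget -qO- https://xoa.io/deploy)"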
There are various other ways to build or run the GUI, which can run on anything remotely Linux, inside the nodes or outside: more on this in the next section.
The management appliance (XOA) will run with only 2GB of RAM and 10GB of disk for a couple of hosts. If you grow to dozens of hosts, give it a little more RAM and it will be fine. Compared to the oVirt management engine it is very, very light and seems to have vastly fewer parts that can break.
And if something does break, it hardly matters, because the appliance is pretty much stateless. E.g. pool membership and configuration live on the nodes, so if you connect from another XOA they just carry over. Ditto storage: the configuration that oVirt keeps in the management engine's Postgres database is kept on the nodes in XCP-ng and can be changed from any connected XOA.
Operation: Xen nodes are much more autonomous than oVirt hosts. They use whatever storage they might have locally, or attached via SAN/NAS/Gluster[!!!] and others. They will operate without a management engine, much like "single node HCI oVirt", or they can be joined into a pool, which opens up live migration and HA. A pool is created by telling a node that it's the master now and then having other nodes join in. The master can be changed and nodes can be moved to other pools. Adding nodes to and removing them from a pool is very quick and easy, and the same goes for additional storage repositories: any shared storage added to any node is immediately visible to the pool, and disks can be flipped between local and shared storage very easily (I haven't tried live disk moves, but they might work).
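For the CLI-inclined, the pool mechanics boil down to a handful of xe commands. A rough sketch from memory of the xe tooling (placeholders are mine; the web UI does the same with a few clicks):

    # on each node that should join an existing master's pool:
    xe pool-join master-address=<master-ip> master-username=root master-password=<secret>
    # hand the master role to another pool member:
    xe pool-designate-new-master host-uuid=<host-uuid>
    # attach an NFS export as a pool-wide shared storage repository:
    xe sr-create name-label=shared-nfs type=nfs shared=true \
        device-config:server=<nas-ip> device-config:serverpath=/export/vms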
Having nodes in a pool qualifies them for live migration (CPU architecture caveats apply). If storage is local, it will move with the VM; if storage is shared, only RAM will move.
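From the CLI that is a one-liner; a sketch assuming the usual xe syntax (names are placeholders, and the cross-pool storage-motion variant takes additional remote-*/destination parameters I won't vouch for here):

    # live-migrate a running VM to another pool member:
    xe vm-migrate vm=<vm-name-or-uuid> host=<destination-host> live=true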
You can also move VMs between pools, and even across different x86 variants (e.g. AMD and Intel), when the VMs are down. If you've ever dabbled with "export domains" or "backup domains" in oVirt, you just can't believe how quick and easy these things are in XCP-ng. VMs and their disks can be moved, copied, cloned, backed up and restored with a minimum of fuss, including continuous backups of running machines.
You can label machines as "HA" so they'll always be restarted elsewhere, should a host go down. You can define policies for how to balance workloads across hosts and ensure that HA pairs won't share a host, pretty similar to oVirt.
The "free" Xoa has plenty of "upgrade!" buttons all over the place. So I went ahead and build an appliance from source, that doesn't have these restrictions, just to see what that would get me.
With this script here: https://github.com/ronivay/XenOrchestraInstallerUpdater you can build the XOA on any machine/VM you happen to be running, with one of the many supported Linux variants.
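From memory of that repo's README, the procedure is roughly this (consult the README for the current steps and file names):

    git clone https://github.com/ronivay/XenOrchestraInstallerUpdater
    cd XenOrchestraInstallerUpdater
    cp sample.xo-install.cfg xo-install.cfg   # adjust to taste
    sudo ./xo-install.sh                      # interactive install/update menu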
I built one variant to run as a VM on XCP-ng and used another to run on an external machine. All three XOAs don't step on each other's toes more than you let them: very, very cool!
The built-in hypervisor interfaces and paravirtualized drivers of almost any modern 64-bit Linux make Linux VMs very easy. For Windows guests there are drivers available, which make things as easy as with KVM. Unfortunately they don't seem to be exactly the same, so you may do better to remove guest drivers from machines you want to move. BTW, good luck trying to import OVAs exported from oVirt: those contain tags no other hypervisor wants to accept, sometimes not even oVirt itself. Articles on the forum suggest running Clonezilla at both ends for a storage-less migration of VMs.
I have not yet tested nested virtualization inside XCP-ng (XCP-ng itself works quite well in a nested environment), nor have I done tests with pass-through devices and GPUs. All of that is officially supported, with the typical caveats around older Nvidia drivers.
So far every operation was quick and every button did what I expected it to do. When things failed, I didn't have to go through endless log files to find out what was wrong: the error messages were helpful enough to find the issues so far.
Caveats: I haven't done extensive failure testing yet, except one case: I've been running four host nodes as VMs on VMware Workstation, with XCP-ng VMs nested inside. At one point I managed to run out of storage and had to shut down all "bare metal VMs" hard. Everything came back clean; only the migration job that had tripped the storage failure had to be restarted, after I created some space.
There is no VDO support in the provided kernel. It should be easy enough to add, given the sources and build scripts. But the benefit and the future of VDO are increasingly debated.
The VMs use the VDI disk format (VHD-based), which is less powerful/flexible than qcow2 with regard to thin allocation and trimming. Improvements are on the roadmap, not yet in the product.
Hyperconverged Storage: XCP-ng so far doesn't come with a hyperconverged storage solution built in. Like so many clusters, it moves that responsibility to your storage layer (which could be Gluster...).
They are in the process of making LINSTOR a directly supported HCI option, even with features such as dispersed (erasure-coded) volumes to manage write amplification/resilience as node counts grow beyond three. That's not ready today, and they seem to be taking their sweet time about it. They label it "XOASAN" and want to make it a paid option. But again, the full source code is there, and with the self-compiled XOA appliance you should be able to use it for free.
The current beta release only supports replicated mode and only up to four nodes, but it seems to work reliably. Write amplification is 4x, so write bandwidth drops to 25% and is limited to the network speed, while reads are served by the local node at storage hardware bandwidth.
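To put that 4x figure in perspective (my own back-of-the-envelope arithmetic, assuming a 10 Gbit/s storage network):

    10 Gbit/s network  ~= 1.2 GB/s line rate
    1.2 GB/s / 4 replicas ~= 300 MB/s of sustained guest writes

So guest writes top out well below a single local NVMe drive, while reads stay local and fast.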
The 2-, 3- and 4-node replicated setups work today with the scripts they provide. That's not quite as efficient as the 2R+1A setup (two data replicas plus an arbiter) in oVirt, but it seems rock solid and just works, which was never that easy in oVirt.
Hyperconverged is very attractive at 3 nodes, because good fault resilience can't be had any cheaper. But the industry agrees that it tends to lose its financial attraction as you grow to dozens of machines, and nobody in their right mind would operate a real cloud using HCI. Still, going from 3 to, say, two dozen should be doable and easy: it never was with oVirt.
XCP-ng, or rather LINSTOR, won't support arbitrary numbers of storage nodes or easy linear growth like Gluster could (in theory). So far it's just much, much better at getting things going.
Forum and community: The documentation is rather good, but can be light on things that are "native Xen", as covering those would repeat a lot of the effort already expended by Citrix. It helps that I've been around the block a couple of times since VM/370, but there are holes or missing details where you'll need to ask questions.
The community isn't giant, but comfortably big enough. The forum's user interface is vastly better than this one, but then I haven't seen anything as slow as ovirt.org in a long time.
Technical questions are answered extremely quickly, mostly by the staff of the small French company themselves. More often, though, it's even easier to find answers to questions already asked, which is the most typical case.
The general impression is that there are far fewer moving parts and things that can go wrong. There is no Ansible, not a single extra daemon on the nodes, and a management engine that is very light, with so little state that it doesn't need a DBA to hold and manage it. The Xen hypervisor seems much smarter than KVM on its own, and XOA both has a rich API for doing things and offers an API to the next management layer up.
It may have fewer features overall than oVirt, but I haven't found anything I really missed. It's much easier and quicker to install and operate with nothing but the GUI, which is a godsend: I want to use the farm, not spend my time managing it.
Motivational rant originally at the top... tl;dr
My original priorities for choosing oVirt were:
1. CentOS as RHEL downstream -> stable platform, full vendor vulnerability management included, a big benefit in compliance
2. Integrated HCI -> just slap a couple (3/6/9) of leftover servers together for something fault resilient, no other components required, a quick-start out-of-the-box solution with a few clicks in a GUI
3. Fully open source (& logs), can always read the source to understand what's going on, better than your typical support engineer
4. No support or license contract unless you want/need it, but the ability to switch that on once it paid for itself
The more famous competitors, vSphere and Nutanix didn't offer any of that.
(Citrix) Xen I excluded, because a) Xen seemed "old school + niche" compared to KVM and b) Citrix had reduced "free" to "useless".
I fell in love with Gluster for its design: it felt like a really smart solution. I fell out with Gluster over its operation and performance: I can't count how many times I had to restart daemons and issue "gluster heal" commands to resettle things after little more than a node update.
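For anyone who hasn't lived through it, the ritual was typically some variation of this (from memory; the volume name is a placeholder):

    systemctl restart glusterd            # restart the management daemon
    gluster volume heal <volname>         # kick off self-heal on the volume
    gluster volume heal <volname> info    # watch the list of unhealed entries shrink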
I rediscovered XCP-ng when I discovered that HCI and RHV had been EOL'd.

On Monday, February 14th, 2022 at 3:03 PM, Nathanaël Blanchet <blanchet@abes.fr> wrote:
If any beginners read my post, I want to tell them they are welcome, and they can be sure to find quality in the code, in the updates, in the innovation, in the enterprise features and in the mailing list support, and they are welcome to contribute to make virtualization greater and greater!
I'm new and I found this post very helpful. And, thank you for the welcome! :) Cheers, Glen

On Tue, Feb 15, 2022 at 00:04, Nathanaël Blanchet <blanchet@abes.fr> wrote:
Hello,
I have read several pessimistic posts from you, each one arguing against decisions of the oVirt community that you disagree with. In general, my impression is that you want the community to take responsibility for Red Hat's decisions.
Like you, I find the end of RHV support very sad, but unlike you I believe oVirt is incredible software and an example of an open-source success, a project that, I do believe, will survive thanks to the awesome people who have been contributing to it for 10 years. Other downstream projects, like OLVM, decided to switch from Xen to KVM.
I've been working with oVirt since the very beginning, in a large and successful production setting. Everyone in my IT team is convinced that oVirt makes our IT very stable and flexible (more than 300 VMs), and we have convinced some partners to adopt it as well. My pain is that the oVirt project is underrated compared to the quality of its code, but you and I are actors in its popularity. I was not initially a developer, but thanks to oVirt I'm now able to write complex playbooks for automatic deployments, and I'm now able to debug Python code as well. What I mean is that the project depends on community members' contributions, each according to their own capacity. For my own part, I can help many beginners on the mailing list with simple tips, just as others can translate into other languages.
Yes, I am aware of competing projects like Proxmox, Xen and now XCP-ng. There is no perfect project. Everybody should get involved in a project that matches their expectations.
In reality, I wonder about the goal of your posts; it seems that nothing goes in the right direction from your point of view... Did you contribute to change that? Did you pay anything to be so demanding?
Thank you to the whole community for providing such wonderful software, and a special mention to the community leaders (Sandro?) and other contributors; we need positive attitudes.
If any beginners read my post, I want to tell them they are welcome, and they can be sure to find quality in the code, in the updates, in the innovation, in the enterprise features and in the mailing list support, and they are welcome to contribute to make virtualization greater and greater!
Thanks Nathanaël, this is a great post!
-- Nathanaël Blanchet
Supervision réseau SIRE 227 avenue Professeur-Jean-Louis-Viala 34193 MONTPELLIER CEDEX 5 Tél. 33 (0)4 67 54 84 55 Fax 33 (0)4 67 54 84 14 blanchet@abes.fr
-- Sandro Bonazzola, Manager, Software Engineering, EMEA R&D RHV, Red Hat EMEA <https://www.redhat.com/>, sbonazzo@redhat.com *Red Hat respects your work life balance. Therefore there is no need to answer this email out of your office hours.*

I think that Nathanaël is right. Yes, the project can be better (which project can't?), and despite the hardship of a major contributor leaving oVirt, I believe that the community can keep it going. I personally picked oVirt not to become proficient in RHV, nor because it has a leading market share, but because it's open source and efficient. If I had more time, I would have contributed more to the Ansible roles, which still need some polishing, but I hope I will be able to do that in the coming months. Best regards, Strahil Nikolov
On Tue, Feb 15, 2022 at 12:24, Sandro Bonazzola <sbonazzo@redhat.com> wrote:


On Tue, Feb 15, 2022 at 8:50 PM Thomas Hoberg <thomas@hoberg.net> wrote:
Am I pessimistic about the future of oVirt? Quite honestly, yes.
Do I want it to fail? Absolutely not! In fact I wanted it to be a viable and reliable product and live up to its motto "designed to manage your entire enterprise infrastructure".
It turned out to be very mixed: it has bugs I consider rather debilitating. It has also survived critical hardware failures that would have resulted in data loss, and didn't, because Gluster provided replicas that survived.
I have reported on both success and failure. If you look through my posts, you will find both.
My impression is that leaders in the oVirt community have not been transparent enough about the quality of the support they provide to the various parts. E.g. it was only very recently that Nir wrote that HCI was only ever supported by Gluster contributors and not tested by the oVirt core teams.
For quite some time, ovirt-system-tests also tested HCI, routinely. Admittedly, this flow never had the breadth of the "plain" (separate storage) flows. The oVirt and Gluster teams did work in cooperation. I am not sure that Nir's statement is actually that significant. It clarifies more the organizational structure within Red Hat than the intended quality of all relevant projects (and products). You can check Red Hat's lifecycle pages for the relevant products to see its plans for them. I think it was already clear, but if not, clarifying: from Red Hat's POV, the replacement for Gluster is Ceph, and the replacement for oVirt is kubevirt/OpenShift Virtualization/OKD Virtualization. This definitely does not mean that oVirt is intended to die - on the contrary - quite a lot of what we did in recent months was in order to make it easier for oVirt to survive after Red Hat's involvement diminishes.
I believe that the EOL of the downstream products will accelerate the dwindling of the community Sandro has described in his post. I honestly want to be wrong; after all, I am losing years of work and expertise.
So I posted this report on XCP-ng, because it may be an option for those who, like me, cannot operate on hope alone, but need to provide a service their users can trust to have a future without an EOL already announced.
I would recommend that you all do a proper assessment of both platforms and potentially others out there and learn from each other.
With factionalism the state of the art cannot progress.
Agreed. Best regards, -- Didi

On Tue, Feb 15, 2022 at 8:50 PM Thomas Hoberg <thomas@hoberg.net> wrote:
For quite some time, ovirt-system-tests also tested HCI, routinely. Admittedly, this flow never had the breadth of the "plain" (separate storage) flows.
I've known virtualization since the days of VM/370. And I immediately ran out and bought the very first VMware [1.0] product, when they did that nice trick of using SMM and binary translation to implement a hypervisor on an architecture that [IMHO] excluded VM on purpose (except for the virtual 8086 mode). I followed the hypervisor wars, but settled on OpenVZ because it delivered compliance, which wasn't firmly established for hypervisors at the time, and much better consolidation. Red Hat then took a very arrogant stance against containers, perhaps because Parallels (OpenVZ) and LXC (Canonical) were competitors, or simply because it had just "won" with Qumranet/KVM against Citrix/Xen: I could not convince my Red Hat contacts that it wasn't an either/or choice, but that both approaches were complementary. Well, SuSE didn't listen either, nor did Oracle for that matter. It took Docker to break that loose, and it took Google brushing up their let-me-containerize-that into Kubernetes to spur Red Hat into action and ditch their first OpenShift (sorry if I don't remember every detail). Nobody had a perfect grasp of where things were going, nor the perfect product offer, and there was quite a lot of politics involved as well: not every decision was based on technical merit. I came back to VM orchestration because I went from compliance-driven production architecture to supporting our research labs, which needed GPU support for ML and the ability to run sets of PoC machines, both stable and without (compliance-oriented) production support teams. They also needed transportable setups that could be shipped around the world and operated without local support. oVirt offered a turn-key HCI solution with a reasonable minimum of three nodes, which could be scaled up or out within one or two orders of magnitude of computing power. Any single fault could reasonably be survived for a day or a week-end, until a spare box could be obtained locally and put in place. Even a double fault wouldn't necessarily result in a complete loss or shutdown, just a defined minimal-operations mode, ready for self-recovery once the hardware was replaced: so perfect, in theory, that I was positively in love... That love is at the root of my current disillusionment, for sure, but it was you who showed the "entire enterprise" t*ts! At the same time it offered the ability to grow into something much bigger with dedicated storage (and skills), because I understood well enough that you won't run a cloud on HCI. oVirt was like a hybrid car that could run on batteries or fuel. Most importantly, it offered running things in a "labs mode" using oVirt and CentOS and in a "production mode" using RHEL and RHV, so if one of our prototypes got customers, we could just switch it over to our production teams, who know how to handle auditors looking into every nook and cranny. So the hybrid car could have airbags and ESP all around for the Autobahn, or be a sand buggy in the dunes. These four options are gone now, rather suddenly and with little warning; there is only one variant left, and it is in full metamorphosis. It's no longer hybrid, HCI is gone. And there is no longer a choice between a compliant, safe version and the devil-may-care agile variant that won't always run after every change and close every vulnerability before the cyberwar reaches you.
OpenShift with kubevirt may well point much better in the direction the industry is going, but it's far from the turnkey, fault-resilient set of boxes you could just put anywhere and have fixed by people who'd simply unpack a replacement box and rewire three cables, letting the HCI cluster heal itself. It's also a complete inversion, where Kubernetes runs VMs as if they were containers, while oVirt ran VMs ...that might have containers in them. And that may be a good thing to have, for those who want it. That even includes me and my company, which is running OpenShift on RHEL in the big data centers. It's just that for the labs and in the field this doesn't fit; an HCI setup is what we need below, while we'll happily run OpenShift and (nested) kubevirt on top, to ensure our software keeps up with the times as it evolves. Now that it's finally in the open that three of the four oVirt/RHV/HCI options are gone, I'd like to help those who, like me, need one or more of them for their business. You should help and support that, because you shouldn't want happy oVirt'lers to be stranded and potentially angry. There isn't even an issue of competition any more, because the Xen guys aren't selling Kubernetes, just running it, with CoreOS, RHEL or whatnot inside VMs.
The oVirt and Gluster teams did work in cooperation. I am not sure that Nir's statement is actually that significant. It clarifies more the organizational structure within Red Hat than the intended quality of all relevant projects (and products). You can check Red Hat's lifecycle pages for the relevant products to see its plans for them.
I think it was already clear, but if not, clarifying: from Red Hat's POV, the replacement for Gluster is Ceph, and the replacement for oVirt is kubevirt/OpenShift Virtualization/OKD Virtualization. This definitely does not mean that oVirt is intended to die - on the contrary - quite a lot of what we did in recent months was in order to make it easier for oVirt to survive after Red Hat's involvement diminishes.
Please put that on the home page, update the wikis, tell the story. It's not a bad story, it's just a complete rewrite of the original book, which was a vSphere-inspired open-source blend.
And I'd love to have an open discussion about why oVirt couldn't also be *the* HCI solution that everybody has at home, running on three Raspberry Pis for a bullet-proof home edge. Better yet: how can it become that going forward?
Agreed.
Best regards,

On Mon, 14 Feb 2022, Thomas Hoberg wrote:
Xen nodes are much more autonomous than oVirt hosts. They use whatever storage they might have locally, or attached via SAN/NAS/Gluster[!!!] and others. They will operate without a management engine [...] Any shared storage added to any node is immediately visible to the pool and disks can be flipped between local and shared storage very easily
It may be worth pointing out that these are not limitations of libvirt (which can even live-migrate VMs between hosts without shared storage), but only of oVirt. The impression I've got from this mailing list is that they are intentional design decisions to enforce "correctness" of the cluster. It does come up often enough on the mailing list that it would be nice if there were a way to designate a multi-host cluster as having both local and shared storage. I think the flexibility would help those running dev or small prod clusters. Maybe someone just needs to step up and write a patch for it?
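For reference, the libvirt capability mentioned looks roughly like this (a sketch; the domain and host names are placeholders):

    # live-migrate a domain, copying its disks to the destination's local storage:
    virsh migrate --live --copy-storage-all <domain> qemu+ssh://<dest-host>/system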

The impression I've got from this mailing list is that they are intentional design decisions to enforce "correctness" of the cluster.
My understanding of a cluster (ever since the VAX) is that it's a fault-tolerance mechanism, and that was originally one of the major selling points of these hypervisor orchestration suites like vSphere, RHV etc. Since that doesn't quite work with local storage, I understand why they left it out of live migration. AFAIK it works just fine for VMs that are down or disks that are not attached.
participants (7)
- Glen Jarvis
- Nathanaël Blanchet
- Sandro Bonazzola
- Sketch
- Strahil Nikolov
- Thomas Hoberg
- Yedidyah Bar David