[Users] General challenges w/ Ovirt 3.1

Hans Lellelid hans at velum.net
Sat Sep 29 11:37:00 UTC 2012


Hi -

I apologize in advance that this email is less about a specific
problem and more a general inquiry as to the most recommended /
most-likely-to-succeed path.

I am on my third attempt to get an oVirt system up and running (using
a collection of spare servers that meet the requirements set out in
the user guide).  I'm looking to implement a viable evolution of our
unscalable, stove-piped (i.e. free) ESXi servers.  And while I'm
happy to learn more about the underpinnings, I recognize that to
really be a replacement for these VMware solutions, this has to mostly
"Just Work" -- and that's ultimately why I've given up on previous
occasions (after a couple of days of work) and decided to revisit in 6
months.

My basic server setup is:
 - oVirt Mgmt (engine)
 - oVirt HV1 (hypervisor node)
 - oVirt HV2 (hypervisor node)
 - oVirt Disk (NFS share)

1st attempt: I installed the latest stable Node image (2.5.1) on the
HV1 and HV2 machines and re-installed the Mgmt server w/ Fedora 17
(64-bit) and all the latest stable engine packages.  For the first
time, Node installation and Engine setup all went flawlessly.  But I
could not mount the NFS shares.  Upon deeper research, this appeared
to be the previously mentioned NFS kernel bug; I was *sure* that the
official stable Node image would have shipped with a downgraded
kernel, but apparently not :)  I have no idea if there is an
officially supported way to downgrade the kernel on the Node images;
the warnings say that any changes will not persist, so I assume there
is not.  (I am frankly a little surprised that the official stable
packages & ISO won't actually mount NFS shares, which is the
recommended storage strategy and kinda critical to this thing!?)
FWIW, the oVirt Disk system is running CentOS 6.2.
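
In case the problem is on my end rather than in the kernel, here is
roughly what the NFS side looks like.  The export options are what I
gathered from the oVirt install guide (vdsm:kvm mapping to uid/gid
36); the path and hostname are just placeholders from my setup, so
treat this as a sketch of my configuration rather than a
recommendation:

    # /etc/exports on the oVirt Disk box (CentOS 6.2)
    /exports/data  *(rw,sync,no_subtree_check,all_squash,anonuid=36,anongid=36)

    # quick sanity check by hand from one of the hypervisor nodes
    mkdir -p /mnt/nfstest
    mount -t nfs -o vers=3 ovirt-disk:/exports/data /mnt/nfstest
    sudo -u vdsm touch /mnt/nfstest/write-test   # the vdsm user must be able to write
    umount /mnt/nfstest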

2nd attempt: I re-installed the nodes as Fedora 17 boxes and
downgraded the kernels to 3.4.6-2.  Then I connected them from the
Engine (specifying the root password) and watched the logs while
everything installed.  After rebooting, neither of the servers was
reachable.  Sitting in front of the console, I realized that
networking was refusing to start; several errors printed to the
console looked like:

device-mapper: table: 253:??: multipath: error getting device (I don't
remember exactly what was after the "253:")

calling "multipath -ll" yielded no output, calling "multipath -r"
re-issued the above errors
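
My (untested) plan for the next pass is to just blacklist the local
SATA disks in multipath, since these boxes have no SAN storage at all
-- something along these lines (the devnode/wwid values are obviously
placeholders, and I'm not sure whether vdsm will simply rewrite this
file on the next host install):

    # /etc/multipath.conf -- keep multipath away from the local disks
    blacklist {
        devnode "^sda"
        # or blacklist by wwid instead, e.g.:
        # wwid "<local disk wwid here>"
    }

    # then flush and reload the maps and re-check
    multipath -F
    multipath -r
    multipath -ll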

Obviously the Engine did a lot of work there, setting up the bridge,
etc.  I did not spend a long time trying to untangle this.  (In
retrospect I probably should go back and spend more time tracking
this down, but it's difficult since I lose the network and have to
stand at the console in the server room :))
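
For completeness, this is roughly what I expect the engine to have
left behind on the nodes for the bridge -- my reconstruction from
poking at the console rather than anything from the docs, with the
NIC name and addressing just examples from my hardware:

    # /etc/sysconfig/network-scripts/ifcfg-em1  (physical NIC, enslaved)
    DEVICE=em1
    ONBOOT=yes
    BRIDGE=ovirtmgmt
    NM_CONTROLLED=no

    # /etc/sysconfig/network-scripts/ifcfg-ovirtmgmt  (management bridge)
    DEVICE=ovirtmgmt
    TYPE=Bridge
    ONBOOT=yes
    BOOTPROTO=dhcp        # or static, depending on the node
    NM_CONTROLLED=no
    DELAY=0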

3rd attempt: I re-installed the nodes with Fedora 17 and attempted to
install VDSM manually from RPMs.  Despite following the instructions
to turn off SSL (ssl=false in /etc/vdsm/vdsm.conf), I am seeing SSL
"unknown cert" errors from the Python socket server every time the
engine tries to talk to the node.  I added the CA from the master into
/etc/pki/vdsm (since that was the commented-out path given as the
trust store in the config file) and added the server's cert there too,
but I have no idea what form these files should take to be respected
by the Python server -- or whether they are respected at all.  I
couldn't find this documented anywhere, so I left the servers spewing
logs for the weekend, figuring that I'll either give up or try another
strategy on Monday.
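
In case my config is simply wrong, this is what I have on the node
side, plus the engine-side change I suspect is also required.  I have
not confirmed that the engine-config option name below is exactly
right, so please treat it as a guess:

    # /etc/vdsm/vdsm.conf on each node
    [vars]
    ssl = false

    # restart vdsm so it picks up the change
    service vdsmd restart

    # on the engine -- option name unverified, may differ in 3.1
    engine-config -s SSLEnabled=false
    service ovirt-engine restart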

So is there a general strategy that should get me to a working system
here?  I suspect that the Node image is not a good path, since it
appears to be incompatible with NFS mounting.  The
Fedora-17-host-installed-by-the-engine approach sounds promising, but
there's a lot of magic there & it obviously completely broke my
systems.  Is that where I should focus my efforts?  Should I ditch
NFS storage and just try to get something working with local-only
storage on the nodes?  (Shared storage would be a primary motivation
for moving to oVirt, though.)

I am very excited for this to work for me someday, but it has been
frustrating to run into such sparse (or outdated?) documentation and
such fundamental problems/bugs/configuration challenges.  I'm using
pretty standard (Dell) commodity servers (SATA drives, simple RAID
setups, etc.).

Sorry for the lack of log output; I can provide more of that when I'm
back at work on Monday, but this was more of a general inquiry about
where I should plan to take this.

Thanks in advance!
Hans


