[Users] General challenges w/ oVirt 3.1

Hi - I apologize in advance that this email is less about a specific problem and more a general inquiry as to the most recommended / likely-to-be-successful path. I am on my third attempt to get an oVirt system up and running (using a collection of spare servers that meet the requirements set out in the user guide). I'm looking to implement a viable evolution to our unscalable, stove-piped ESXi servers (which are free). And while I'm happy to learn more about the underpinnings, I recognize that to really be a replacement for these VMware solutions, this has to mostly "Just Work" -- and that's ultimately why I've given up on previous occasions (after a couple of days of work) and decided to revisit in 6 months.

My basic server setup is:
- oVirt Mgmt (engine)
- oVirt HV1 (hypervisor node)
- oVirt HV2 (hypervisor node)
- oVirt Disk (NFS share)

1st attempt: I installed the latest stable Node image (2.5.1) on the HV1 and HV2 machines and re-installed the Mgmt server w/ Fedora 17 (64-bit) and all the latest stable engine packages. For the first time, Node installation and Engine setup all went flawlessly. But I could not mount the NFS shares. Upon deeper research, this appeared to be the known NFS bug; I was *sure* that the official stable Node image would have had a downgraded kernel, but apparently not :) I have no idea if there is an officially supported way to downgrade the kernel on the Node images; the warnings say that any changes will not persist, so I assume there is not. (I am frankly a little surprised that the official stable packages & ISO won't actually mount NFS shares, which is the recommended storage strategy and kinda critical to this thing!?) FWIW, the oVirt Disk system is a CentOS 6.2 system.

2nd attempt: I re-installed the nodes as Fedora 17 boxes and downgraded the kernels to 3.4.6-2. Then I connected these from the Engine (specifying the root pw) and watched the logs while things installed.
After the reboot, neither of the servers was reachable. Sitting in front of the console, I realized that networking was refusing to start; several errors printed to the console looked like:

device-mapper: table: 253:??: multipath: error getting device

(I don't remember exactly what was after the "253:".) Calling "multipath -ll" yielded no output; calling "multipath -r" re-issued the above errors. Obviously the Engine did a lot of work there, setting up the bridge, etc. I did not spend a long time trying to untangle this. (In retrospect, I will probably go back and spend more time trying to track this down, but it's difficult since I lose the network & have to stand at the console in the server room :))

3rd attempt: I re-installed the nodes with Fedora 17 and attempted to install VDSM manually by RPM. Despite following the instructions to turn off SSL (ssl=false in /etc/vdsm/vdsm.conf), I am seeing SSL "unknown cert" errors from the python socket server with every attempt of the engine to talk to the node. I added the CA from the master into /etc/pki/vdsm (since that was the commented-out path given in the config file as the trust store) and added the server's cert there too, but I have no idea what form these files should take to be respected by the python server -- or if they are respected at all. I couldn't find this documented anywhere, so I left the servers spewing logs for the weekend, figuring that I'll either give up or try another strategy on Monday.

So is there a general strategy that should get me to a working system here? I suspect that the Node image is not a good path, since it appears to be incompatible with NFS mounting. The Fedora-17-installed-by-engine approach sounds good, but there's a lot of magic there & it obviously completely broke my systems. Is that where I should focus my efforts? Should I ditch NFS storage and just try to get something working with local-only storage on the nodes? (Shared storage would be a primary motivation for moving to oVirt, though.)
I am very excited for this to work for me someday. It has been frustrating to have such sparse (or outdated?) documentation and such fundamental problems/bugs/configuration challenges. I'm using pretty standard (Dell) commodity servers (SATA drives, simple RAID setups, etc.). Sorry for the lack of log output; I can provide more of that when back at work on Monday, but this was more of a general inquiry as to where I should plan to take this. Thanks in advance! Hans
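As an aside, an NFS export can be sanity-checked from a node by hand before involving the engine at all. A sketch, assuming a root shell on the node; the "ovirt-disk" hostname and /exports/data path are placeholders for the setup described above, and the 36:36 note assumes oVirt's usual requirement that exports be owned by vdsm:kvm (uid/gid 36):

```shell
# "ovirt-disk" and /exports/data are placeholders for your NFS server/export
showmount -e ovirt-disk                                    # is anything exported at all?
mkdir -p /mnt/nfstest
mount -t nfs -o vers=3 ovirt-disk:/exports/data /mnt/nfstest
touch /mnt/nfstest/writetest && rm /mnt/nfstest/writetest  # basic write test
ls -ldn /mnt/nfstest                                       # oVirt expects owner 36:36 (vdsm:kvm)
umount /mnt/nfstest
```

If the mount itself hangs or fails here, the problem is between kernel and NFS server, not anything the engine did.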

Hi,

On 09/29/2012 01:37 PM, Hans Lellelid wrote:
I apologize in advance that this email is less about a specific problem and more a general inquiry as to the most recommended / likely-to-be-successful path.
Having just gone through the process, I hope I can help a little! You might want to check (and add to) the Troubleshooting page where I documented the various hiccups I had, and how I addressed them:

http://wiki.ovirt.org/wiki/Troubleshooting

There are also "Node Troubleshooting" and "Troubleshooting NFS Storage Issues" pages which might help you:

http://wiki.ovirt.org/wiki/Node_Troubleshooting
http://wiki.ovirt.org/wiki/Troubleshooting_NFS_Storage_Issues

Also, Jason Brooks's "Up and running with oVirt 3.1" article is useful, I think:

http://blog.jebpages.com/archives/up-and-running-with-ovirt-3-1-edition/
2nd attempt: I re-installed the nodes as Fedora 17 boxes and downgraded the kernels to 3.4.6-2. Then I connected these from the Engine (specifying the root pw) and watched the logs while things installed. After reboot neither of the servers were reachable. Sitting in front of the console, I realized that networking was refusing to start; several errors printed to the console looked like:
When you say that they are not reachable, what do you mean? By default, installing F17 as a node sets the iptables settings to:

# oVirt default firewall configuration. Automatically generated by vdsm bootstrap script.
*filter
:INPUT ACCEPT [0:0]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
-A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
-A INPUT -p icmp -j ACCEPT
-A INPUT -i lo -j ACCEPT
# vdsm
-A INPUT -p tcp --dport 54321 -j ACCEPT
# libvirt tls
-A INPUT -p tcp --dport 16514 -j ACCEPT
# SSH
-A INPUT -p tcp --dport 22 -j ACCEPT
# guest consoles
-A INPUT -p tcp -m multiport --dports 5634:6166 -j ACCEPT
# migration
-A INPUT -p tcp -m multiport --dports 49152:49216 -j ACCEPT
# snmp
-A INPUT -p udp --dport 161 -j ACCEPT
# Reject any other input traffic
-A INPUT -j REJECT --reject-with icmp-host-prohibited
-A FORWARD -m physdev ! --physdev-is-bridged -j REJECT --reject-with icmp-host-prohibited
COMMIT

So if you're trying to ping the nodes, you should see nothing, but ssh, snmp and vdsm should be available. If you have local console access to the nodes, you should check the iptables config. I don't understand why you would lose your network connection entirely, though. I don't think that the network config for the nodes is changed by the installer.
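Given rules like those, reachability is easier to judge per-port than with ping. A sketch, run from the engine host; "node1" is a placeholder for the node's hostname:

```shell
# ports taken from the vdsm bootstrap iptables rules quoted above
nc -zv -w3 node1 22      # ssh
nc -zv -w3 node1 54321   # vdsm
nc -zv -w3 node1 16514   # libvirt tls
```

If all three time out, the node's network stack (or bridge) is down rather than the firewall filtering you out.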
3rd attempt: I re-installed the nodes with Fedora 17 and attempted to install VDSM manually by RPM. Despite following the instructions to turn off ssl (ssl=false in /etc/vdsm/vdsm.conf), I am seeing SSL "unknown cert" errors from the python socket server with every attempt of the engine to talk to the node.
Hopefully the "Node Troubleshooting" page (or somebody else) can help you here, I'm afraid I can't.
The Fedora-17-installed-by-engine sounds good, but there's a lot of magic there & it obviously completely broke my systems. Is that where I should focus my efforts? Should I ditch NFS storage and just try to get something working with local-only storage on the nodes? (Shared storage would be a primary motivation for moving to ovirt, though.)
I would focus on this approach, and would continue to aim to use NFS storage. It works fine as long as you are on the 3.4.x kernels.
I am very excited for this to work for me someday. I think it has been frustrating to have such sparse (or outdated?) documentation and such fundamental problems/bugs/configuration challenges. I'm using pretty standard (Dell) commodity servers (SATA drives, simple RAID setups, etc.).
The "Quick Start Guide" was useful to me, as long as everything went well:

http://wiki.ovirt.org/wiki/Quick_Start_Guide

Hope some of that is helpful!

Cheers,
Dave.

--
Dave Neary
Community Action and Impact
Open Source and Standards, Red Hat
Ph: +33 9 50 71 55 62 / Cell: +33 6 77 01 92 13

Dave Neary wrote:
Hi,
On 09/29/2012 01:37 PM, Hans Lellelid wrote:
I apologize in advance that this email is less about a specific problem and more a general inquiry as to the most recommended / likely-to-be-successful path.

Having been there, I know the difficulty of setting up oVirt, but with the release of 3.1 things have improved immensely. One problem remains, and you have found it already: the NFS problem with the 3.5.x kernel. I have installed several Fedora 17 nodes and have had no problems so far. I install from the Fedora 17 KDE 64-bit LiveCD, then add the oVirt repo and install vdsm and its dependencies. After that I add the node through the engine, and once the installation is finished and the node has rebooted, it functions as one would expect. If you want, I can paste my bash history. One thing to look out for: switch off NetworkManager beforehand, and configure and activate 'network'.
Joop
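Joop's NetworkManager tip translates to roughly these commands on Fedora 17, run as root on each node before adding it to the engine ('network' is the legacy SysV initscript, so it is toggled via chkconfig rather than systemctl):

```shell
systemctl stop NetworkManager.service
systemctl disable NetworkManager.service
chkconfig network on      # enable the legacy network initscript at boot
service network start     # bring interfaces up via ifcfg-* files now
```

This assumes the interfaces already have working /etc/sysconfig/network-scripts/ifcfg-* files, since 'network' only applies what those files describe.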

On Sat, Sep 29, 2012 at 2:10 PM, Joop <jvdwege@xs4all.nl> wrote:
Dave Neary wrote:
Hi,
On 09/29/2012 01:37 PM, Hans Lellelid wrote:
I apologize in advance that this email is less about a specific problem and more a general inquiry as to the most recommended / likely-to-be-successful path.
Having been there, I know the difficulty of setting up oVirt, but with the release of 3.1 things have improved immensely. One problem remains, and you have found it already: the NFS problem with the 3.5.x kernel. I have installed several Fedora 17 nodes and have had no problems so far. I install from the Fedora 17 KDE 64-bit LiveCD, then add the oVirt repo and install vdsm and its dependencies. After that I add the node through the engine, and once the installation is finished and the node has rebooted, it functions as one would expect. If you want, I can paste my bash history. One thing to look out for: switch off NetworkManager beforehand, and configure and activate 'network'.
Yeah, I did have success with network configuration when I did it manually. However, I did not switch off NetworkManager when I was doing the installation from the engine ... so maybe that is something to try next. Thanks for the tips! I'll report back. Hans

Just to follow up, the following worked for me:

1. Install stock Fedora 17 (x86_64) on the nodes. (We do some basic puppet-based configuration such as LDAP auth for the machine, etc., but otherwise a pretty plain server config.)
2. Downgrade the kernel to the 3.4.6-2 version (and make this the default).
3. Stop and disable the NetworkManager service.
4. Start and enable the network service.
5. Reboot.

When doing the remote install (from the engine) of the vdsm stuff on this blank system, the network/bridge will *not* be automatically set up. (I think it will be set up if you leave NetworkManager running; at least that was the experience last time.) This worked out for the best, since it meant that my network didn't get broken this time. I manually set up the ovirtmgmt bridge (which you must do, or it will not activate the new server) and everything is happy (talking to the NFS share, running VMs, etc.).

Lingering issues are minor:
- Not sure how to get connection (port) info for the Spice client. I kinda prefer VNC anyway, so that's not a big deal.
- Guest VMs appear to always prioritize PXE over hard disk, even though that is the second option in the startup order. (I have not looked into this yet, since it's not a big deal.)

Thanks to the responses that urged me to continue in this direction. Hans
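For reference, step 2 in the recipe above looks roughly like this on Fedora 17. The exact package name is an assumption based on the 3.4.6-2 version mentioned in the thread, and making the older kernel the default relies on Fedora's usual GRUB_DEFAULT=saved setting:

```shell
# install the older kernel alongside the current one (yum keeps both)
yum install kernel-3.4.6-2.fc17.x86_64

# list the boot entries grub2 knows about, then pin the 3.4.6 one
grep ^menuentry /boot/grub2/grub.cfg | cut -d"'" -f2
grub2-set-default "Fedora Linux, with Linux 3.4.6-2.fc17.x86_64"

reboot
uname -r   # verify the running kernel after the reboot
```

The menuentry title passed to grub2-set-default must match your machine's actual entry, hence listing them first.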

Hi Hans,

On 10/01/2012 06:26 PM, Hans Lellelid wrote:
Just to follow-up, the following worked for me:
<snip>
Thanks to the responses that urged me to continue in this direction.
Thanks for documenting this and coming back to the list! Would you mind adding a "Symptom/Cure" entry to the Troubleshooting page for the ovirtmgmt bridge issue, please? I think others could find it really useful.

Thanks,
Dave.

--
Dave Neary
Community Action and Impact
Open Source and Standards, Red Hat
Ph: +33 9 50 71 55 62 / Cell: +33 6 77 01 92 13

On Mon, Oct 1, 2012 at 1:39 PM, Dave Neary <dneary@redhat.com> wrote:
Hi Hans,
On 10/01/2012 06:26 PM, Hans Lellelid wrote:
Just to follow-up, the following worked for me:
<snip>
Thanks to the responses that urged me to continue in this direction.
Thanks for documenting this and coming back to the list! Would you mind adding a "Symptom/Cure" entry to the Troubleshooting page for the ovirtmgmt bridge issue, please? I think others could find it really useful.
Sure, will do. Thanks- Hans

Thanks for the response!

On Sat, Sep 29, 2012 at 8:21 AM, Dave Neary <dneary@redhat.com> wrote:
Hi,
On 09/29/2012 01:37 PM, Hans Lellelid wrote:
I apologize in advance that this email is less about a specific problem and more a general inquiry as to the most recommended / likely-to-be-successful path.
Having just gone through the process, I hope I can help a little! You might want to check (and add to) the Troubleshooting page where I documented the various hiccups I had, and how I addressed them:
http://wiki.ovirt.org/wiki/Troubleshooting
There's also "Node Troubleshooting" and "Troubleshooting NFS Storage Issues" which might help you: http://wiki.ovirt.org/wiki/Node_Troubleshooting and http://wiki.ovirt.org/wiki/Troubleshooting_NFS_Storage_Issues
Also Jason Brooks's "Up and running with oVirt 3.1" article is useful I think: http://blog.jebpages.com/archives/up-and-running-with-ovirt-3-1-edition/
I have read a few of those resources, but not the main "Troubleshooting" page, so I will scour the wiki to see if something might help me out.
2nd attempt: I re-installed the nodes as Fedora 17 boxes and
downgraded the kernels to 3.4.6-2. Then I connected these from the Engine (specifying the root pw) and watched the logs while things installed. After reboot neither of the servers were reachable. Sitting in front of the console, I realized that networking was refusing to start; several errors printed to the console looked like:
When you say that they are not reachable, what do you mean? By default, installing F17 as a node sets the iptables settings to: <snip>
I mean that the network interfaces cannot be brought up; it's not an iptables issue. Sitting (well, standing, they're rack-mounted) in front of the servers yields the multipath errors I mentioned when trying to start networking. What I started doing (and will likely continue to pursue) is putting /etc under source control and combing through the changes that are introduced when I do the remote setup from the engine, to see if I can pick apart where it's going south.
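The /etc-under-source-control idea is straightforward with git. A sketch, run as root on the node before letting the engine at it:

```shell
cd /etc
git init
git add -A
git commit -m "baseline before engine-driven vdsm install"

# ...let the engine run its install, then inspect what it touched:
git status --short   # files the installer added, removed, or changed
git diff             # the exact modifications, line by line
```

This catches the ifcfg rewrites, vdsm.conf, and iptables changes in one place, which is exactly the "where did it go south" question.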
So if you're trying to ping the nodes, you should see nothing, but ssh, snmp and vdsm should be available. If you have a local console access to the nodes, you should check the IPTables config.
I don't understand why you would lose your network connection entirely, though. I don't think that the network config for the nodes is changed by the installer.
Yeah, it's definitely changed by the installer. The installer sets up the ovirt bridge (I think that is what it was called) and changes the primary interfaces to reference the bridge, etc. I didn't see anything obviously wrong with the setup, but clearly it was not working. (I also didn't know exactly what I was starting from, so that is my mistake; I should be able to approach it next time with more confidence.) I did the bridge setup manually myself for attempt #3 and didn't have any problems.
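For what it's worth, the manual bridge setup from attempt #3 amounts to two ifcfg files like the following, plus a network restart. This is a sketch: the NIC name em1 and the addresses are examples, and the bridge name ovirtmgmt is what the engine expects on the node:

```shell
# the bridge itself carries the node's management IP
cat > /etc/sysconfig/network-scripts/ifcfg-ovirtmgmt <<'EOF'
DEVICE=ovirtmgmt
TYPE=Bridge
ONBOOT=yes
BOOTPROTO=static
IPADDR=192.168.1.11
NETMASK=255.255.255.0
GATEWAY=192.168.1.1
DELAY=0
EOF

# the physical NIC is enslaved to the bridge and loses its own IP
cat > /etc/sysconfig/network-scripts/ifcfg-em1 <<'EOF'
DEVICE=em1
ONBOOT=yes
BRIDGE=ovirtmgmt
EOF

service network restart
```

Run this from a console session, not over ssh, since the restart drops the interface while the bridge comes up.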
3rd attempt: I re-installed the nodes with Fedora 17 and attempted to
install VDSM manually by RPM. Despite following the instructions to turn off ssl (ssl=false in /etc/vdsm/vdsm.conf), I am seeing SSL "unknown cert" errors from the python socket server with every attempt of the engine to talk to the node.
Hopefully the "Node Troubleshooting" page (or somebody else) can help you here, I'm afraid I can't.
The
Fedora-17-installed-by-engine sounds good, but there's a lot of magic there & it obviously completely broke my systems. Is that where I should focus my efforts? Should I ditch NFS storage and just try to get something working with local-only storage on the nodes? (Shared storage would be a primary motivation for moving to ovirt, though.)
I would focus on this approach, and would continue to aim to use NFS storage. It works fine as long as you are on the 3.4.x kernels.
I am very excited for this to work for me someday. I think it has
been frustrating to have such sparse (or outdated?) documentation and such fundamental problems/bugs/configuration challenges. I'm using pretty standard (Dell) commodity servers (SATA drives, simple RAID setups, etc.).
The "Quick Setup Guide" was useful to me, as long as everything went well: http://wiki.ovirt.org/wiki/Quick_Start_Guide
Hope some of that is helpful!
I will take a look at that guide -- not sure if I've read that one yet. I will follow up with what I learn / what works, so it might help others. If there's a way that I can update the wiki to help those in my specific predicament, I will do that too. (It's possible there is something about my [Dell] hardware that is not compatible with oVirt's default installer, etc.) Thanks, Hans
participants (3)
- Dave Neary
- Hans Lellelid
- Joop