On Wed, Oct 10, 2012 at 4:35 AM, Juan Hernandez <jhernand@redhat.com> wrote:
On 10/09/2012 11:36 PM, Itamar Heim wrote:
> well, hacking the db will work, but reproducing this and logs for us to
> fix the actual bug would also help.

Unfortunately, it would be very difficult to reproduce, since I have replaced the boot drive that I believe was causing the failures and I have no idea what state the host was in when it went down (details below). Plus, our testing folks will become very angry if I keep crashing their VMs. =)  Still, I can probably come up with some spare hardware eventually and try to reproduce if needed, but let's see what the logs show first, yeah?
 

We have seen this before, and we thought it was fixed.

Was that very recently? Perhaps I was not running the fixed version, and I might still not be. I'm running on CentOS 6.3 using this repo: http://www.dreyou.org/ovirt/ovirt-dre.repo

I used these instructions to set up the nodes and these instructions to set up the engine. (I have not configured any glusterfs volumes, but I will likely play with that soon.) When setting up the engine, I had to fix a bunch of broken symlinks to jar files that were included with the various packages. Both the symlinks and the jar files were there, but many of the symlinks to the jars were broken. I'll be reporting that to the package maintainers soon, but I mention it here just in case it turns out to be relevant.

Here are my ovirt package versions:
[root@admin ~]# rpm -qa | fgrep -i ovirt
ovirt-log-collector-3.1.0-16.el6.noarch
ovirt-image-uploader-3.1.0-16.el6.noarch
ovirt-engine-userportal-3.1.0-3.19.el6.noarch
ovirt-engine-setup-3.1.0-3.19.el6.noarch
ovirt-engine-restapi-3.1.0-3.19.el6.noarch
ovirt-engine-config-3.1.0-3.19.el6.noarch
ovirt-engine-notification-service-3.1.0-3.19.el6.noarch
ovirt-engine-backend-3.1.0-3.19.el6.noarch
ovirt-engine-sdk-3.1.0.5-1.el6.noarch
ovirt-iso-uploader-3.1.0-16.el6.noarch
ovirt-engine-jbossas711-1-0.x86_64
ovirt-engine-webadmin-portal-3.1.0-3.19.el6.noarch
ovirt-engine-dbscripts-3.1.0-3.19.el6.noarch
ovirt-engine-genericapi-3.1.0-3.19.el6.noarch
ovirt-engine-tools-common-3.1.0-3.19.el6.noarch
ovirt-engine-3.1.0-3.19.el6.noarch

[root@cloudhost03 ~]# rpm -qa | fgrep -i vdsm
vdsm-xmlrpc-4.10.0-0.42.13.el6.noarch
vdsm-gluster-4.10.0-0.42.13.el6.noarch
vdsm-python-4.10.0-0.42.13.el6.x86_64
vdsm-4.10.0-0.42.13.el6.x86_64
vdsm-cli-4.10.0-0.42.13.el6.noarch

 

Alan, can you describe exactly the sequence of events that led to
this problem? When you say that the host died while going to maintenance
what do you mean exactly? It crashed, rebooted, hung, was fenced?

I'll do my best. =)  First, some background. I have 3 hosts in oVirt currently: cloudhost0{2,3,4}. (1 is out there but not in oVirt just yet, so it is of no consequence. It is currently running the VM hosting oVirt engine, but I doubt that is relevant either.) I originally built these using USB sticks as the boot drives, hoping to leave the drive bays dedicated to VM storage. I bought what seemed to be the best ones for the job, but the root file system recently kept going read-only due to device errors. It only happened once before I installed oVirt a few weeks ago, but I brought two new hosts online with oVirt, so maybe I was just lucky with the first 2 hosts.

After converting to oVirt, the USB drives started going read-only more often, and I think they might not have been fast enough to keep up with the write requests from a host OS running a full load of VMs (~10), causing performance problems with some of our VMs. So, I started replacing the USB drives, putting one host into maintenance at a time.

I started with cloudhost04 not because it had gone read-only, but in hopes of improving performance. While it was migrating VMs for maintenance mode, the engine marked it as being in an unknown state. vdsmd had crashed and would not start because it could not write logs or pid files. The / mount had gone read-only.

I'm not 100% sure now, but I think a couple of VMs were successfully migrated and several were not. libvirtd was up and the VMs were none the wiser, but I could not figure out the virsh password to tell it to migrate VMs, despite finding a thread on this list with some pointers. So, we logged into each VM and shut it down manually. Once virsh listed no running VMs, I shut down cloudhost04, clicked Confirm 'Host has been Rebooted' in the GUI, and the engine let me start the VMs on another host. (I.e., they went from status "unknown" to status "down".)

I yanked the USB drive, removed the host in the GUI, and then rebuilt the node on a SATA SSD using the instructions linked above. All has been fine with that node since. I followed the same procedure on cloudhost03 with no trouble at all. It went into maintenance mode without a hitch.

Now, for the case in question. I put cloudhost02 into maintenance mode and went to lunch. After lunch and some other distractions, I found it had successfully migrated all but 2 of the VMs to other nodes before becoming unresponsive. Unlike cloudhost04, which I could ssh into and manipulate, 02 had become completely unresponsive. Even the console was just a black screen with no response to mouse or keyboard. The rest is covered previously in this thread. In short, no amount of node rebooting, confirming host reboots, engine restarting, or removing and adding hosts back in convinced the engine to take the VMs out of unknown status. Also, the host column was blank for the entire time they were in unknown status, as far as I know.

(In case anyone is curious, I have had no confirmation on performance improvements since converting all nodes to SSD boot drives, but I have had no complaints since either.)
 
It would be very helpful if you have the engine logs at the time the host
went down.

Attached.  Also, I have the USB stick from cloudhost02, so if there are logs on there that might help, just point me to them.  The migration started around 2012-10-09 12:36:28.  I started fighting with it again after 2PM.
 

Once you have those logs in a safe place, the only way to get the VM out
of that status is to update the database manually. First make completely
sure that the VM is not running in any host, then do the following:

  # psql -U engine
  psql (9.1.6)
  Type "help" for help.

  engine=> update vm_dynamic set status = 0 where vm_guid = (select
vm_gui from vm_static where vm_name = 'myvm');
  UPDATE 1

(Assuming that the name of your VM is "myvm").

That did the trick! Thank you so much, Juan! I had to fix a typo by changing 'vm_gui' to 'vm_guid' after 'select', but the error from the first try made that clear. Regardless, it was all I needed. Here is the corrected syntax all on one line for others who may be looking for the same fix:

engine=> update vm_dynamic set status = 0 where vm_guid = (select vm_guid from vm_static where vm_name = 'myvm');
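
For anyone who would rather rehearse that statement before touching a live engine database, here is a minimal sketch in Python. It is only a sandbox under stated assumptions: it uses an in-memory SQLite database instead of the engine's PostgreSQL, the two tables are cut down to just the columns named in this thread, the value 7 for "unknown" is a placeholder I picked for the demo (check what your own rows actually contain), and the helper names are made up for illustration. What it shows is the check-then-update pattern: look at the status first, run the same update with the subquery, and refuse to commit unless exactly one row changed.

```python
import sqlite3

DOWN = 0     # the value written by the update statement in this thread
UNKNOWN = 7  # placeholder for "unknown" in this sandbox (an assumption)


def make_sandbox():
    """Build a throwaway in-memory database with cut-down tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        create table vm_static (vm_guid text primary key, vm_name text);
        create table vm_dynamic (vm_guid text primary key, status integer);
        """
    )
    conn.executemany(
        "insert into vm_static values (?, ?)",
        [("guid-1", "myvm"), ("guid-2", "othervm")],
    )
    conn.executemany(
        "insert into vm_dynamic values (?, ?)",
        [("guid-1", UNKNOWN), ("guid-2", UNKNOWN)],
    )
    conn.commit()
    return conn


def status_of(conn, vm_name):
    """Return the status stored for vm_name, or None if there is no such VM."""
    row = conn.execute(
        "select d.status from vm_dynamic d "
        "join vm_static s on s.vm_guid = d.vm_guid "
        "where s.vm_name = ?",
        (vm_name,),
    ).fetchone()
    return None if row is None else row[0]


def mark_down(conn, vm_name):
    """Run the update from this thread; commit only if exactly one row changed."""
    cur = conn.execute(
        "update vm_dynamic set status = ? where vm_guid = "
        "(select vm_guid from vm_static where vm_name = ?)",
        (DOWN, vm_name),
    )
    if cur.rowcount != 1:
        conn.rollback()
        raise RuntimeError(
            "expected to update 1 row for %r, got %d" % (vm_name, cur.rowcount)
        )
    conn.commit()
    return cur.rowcount


if __name__ == "__main__":
    conn = make_sandbox()
    print("before:", status_of(conn, "myvm"))
    mark_down(conn, "myvm")
    print("after:", status_of(conn, "myvm"))
```

The row-count check is the part worth carrying over to the real thing: psql prints "UPDATE 1" for the same reason, and anything other than 1 means the VM name did not match what you expected.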