<html>
<head>
<meta content="text/html; charset=ISO-8859-1"
http-equiv="Content-Type">
</head>
<body bgcolor="#FFFFFF" text="#000000">
<div class="moz-cite-prefix"><br>
Since the VM's run_on_vds was empty, the "confirm host..." action didn't
clear its status, because it's not selected from the DB as one of
the host's VMs. <br>
I'll try to dig in and see at what point this value was cleared -
probably around the failed migration.<br>
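<br>
For reference, a quick way to check that on a live setup (just a sketch,
assuming the 3.1 schema where run_on_vds lives in vm_dynamic) is
something like:<br>
<br>
<font face="courier new, monospace">engine=> select status, run_on_vds
from vm_dynamic where vm_guid = (select vm_guid from vm_static where
vm_name = 'myvm');</font><br>
<br>
An empty run_on_vds there would explain why "confirm host..." skips the
VM.<br>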
<br>
<br>
On 10/10/2012 07:25 PM, Alan Johnson wrote:<br>
</div>
<blockquote
cite="mid:CAAhwQij0zfoxwO0hzywG_bWTNs+VbCw7KTQyrk+Vbqy2UNNRmQ@mail.gmail.com"
type="cite">On Wed, Oct 10, 2012 at 4:35 AM, Juan Hernandez <span
dir="ltr"><<a moz-do-not-send="true"
href="mailto:jhernand@redhat.com" target="_blank">jhernand@redhat.com</a>></span>
wrote:<br>
<div class="gmail_quote">
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<div class="HOEnZb">
<div class="h5">On 10/09/2012 11:36 PM, Itamar Heim wrote:<br>
> well, hacking the db will work, but reproducing this
and logs for us to<br>
> fix the actually bug would also help.<br>
</div>
</div>
</blockquote>
<div><br>
</div>
<div>Unfortunately, it would be very difficult to reproduce
since I have replaced the boot drive that I believe was
causing the failures and I have no idea what state the host
was in when it went down (details below). Plus, our testing
folks will become very angry if I keep crashing their VMs. =)
Still, I can probably come up with some spare hardware
eventually and try to reproduce if needed, but let's see what
we see with the logs, etc., first, yeah?</div>
<div> </div>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<div class="HOEnZb">
<div class="h5">
<br>
</div>
</div>
We have seen this before, and we thought it was fixed.<br>
</blockquote>
<div><br>
</div>
<div>Was that very recently? Perhaps I was not running the
fixed version; I might still not be. I'm running on CentOS
6.3 using this repo: <a moz-do-not-send="true"
href="http://www.dreyou.org/ovirt/ovirt-dre.repo">http://www.dreyou.org/ovirt/ovirt-dre.repo</a></div>
<div><br>
</div>
<div>I used these <a moz-do-not-send="true"
href="http://middleswarth.net/content/installing-ovirt-31-and-glusterfs-using-either-nfs-or-posix-native-file-system-node-install">instructions
to setup the nodes</a> and these <a moz-do-not-send="true"
href="http://middleswarth.net/content/installing-ovirt-31-and-glusterfs-using-either-nfs-or-posix-native-file-system-engine">instructions
to setup the engine</a>. (I have not configured any
glusterfs volumes, but I will likely play with that soon.)
When setting up the engine, I had to fix a bunch of broken
symlinks to jar files that were included with the various
packages. Both the symlinks and the jar files were there, but
many of the symlinks to the jars were broken. I'll be
reporting that to the package maintainers soon, but mention it
here just in case it turns out to be relevant.</div>
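<div><br>
</div>
<div>(In case it helps anyone hitting the same packaging issue, a quick
way to list the broken symlinks is something like the find call below;
the path is only a guess, so point it at wherever the engine jars
actually live on your install.)</div>
<div><br>
</div>
<div><font face="courier new, monospace">[root@admin ~]# find
/usr/share/ovirt-engine -xtype l</font></div>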
<div><br>
</div>
<div>Here are my ovirt package versions:</div>
<div>
<div><font face="courier new, monospace">[root@admin ~]# rpm
-qa | fgrep -i ovirt</font></div>
<div><font face="courier new, monospace">ovirt-log-collector-3.1.0-16.el6.noarch</font></div>
<div><font face="courier new, monospace">ovirt-image-uploader-3.1.0-16.el6.noarch</font></div>
<div><font face="courier new, monospace">ovirt-engine-userportal-3.1.0-3.19.el6.noarch</font></div>
<div><font face="courier new, monospace">ovirt-engine-setup-3.1.0-3.19.el6.noarch</font></div>
<div><font face="courier new, monospace">ovirt-engine-restapi-3.1.0-3.19.el6.noarch</font></div>
<div><font face="courier new, monospace">ovirt-engine-config-3.1.0-3.19.el6.noarch</font></div>
<div><font face="courier new, monospace">ovirt-engine-notification-service-3.1.0-3.19.el6.noarch</font></div>
<div><font face="courier new, monospace">ovirt-engine-backend-3.1.0-3.19.el6.noarch</font></div>
<div><font face="courier new, monospace">ovirt-engine-sdk-3.1.0.5-1.el6.noarch</font></div>
<div><font face="courier new, monospace">ovirt-iso-uploader-3.1.0-16.el6.noarch</font></div>
<div><font face="courier new, monospace">ovirt-engine-jbossas711-1-0.x86_64</font></div>
<div><font face="courier new, monospace">ovirt-engine-webadmin-portal-3.1.0-3.19.el6.noarch</font></div>
<div><font face="courier new, monospace">ovirt-engine-dbscripts-3.1.0-3.19.el6.noarch</font></div>
<div><font face="courier new, monospace">ovirt-engine-genericapi-3.1.0-3.19.el6.noarch</font></div>
<div><font face="courier new, monospace">ovirt-engine-tools-common-3.1.0-3.19.el6.noarch</font></div>
<div><font face="courier new, monospace">ovirt-engine-3.1.0-3.19.el6.noarch</font></div>
</div>
<div><font face="courier new, monospace"><br>
</font></div>
<div>
<div><font face="courier new, monospace">[root@cloudhost03 ~]#
rpm -qa | fgrep -i vdsm</font></div>
<div><font face="courier new, monospace">vdsm-xmlrpc-4.10.0-0.42.13.el6.noarch</font></div>
<div><font face="courier new, monospace">vdsm-gluster-4.10.0-0.42.13.el6.noarch</font></div>
<div><font face="courier new, monospace">vdsm-python-4.10.0-0.42.13.el6.x86_64</font></div>
<div><font face="courier new, monospace">vdsm-4.10.0-0.42.13.el6.x86_64</font></div>
<div><font face="courier new, monospace">vdsm-cli-4.10.0-0.42.13.el6.noarch</font></div>
</div>
<div><br>
</div>
<div> </div>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<br>
Alan, can you describe exactly the sequence of events that
led to<br>
this problem? When you say that the host died while going to
maintenance</blockquote>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
what do you mean exactly? It crashed, rebooted, hung, was
fenced? </blockquote>
<div><br>
</div>
<div>
<div>I'll do my best. =) First, some background. I have 3
hosts in oVirt currently: cloudhost0{2,3,4}. (There is one more
out there but not in oVirt just yet, so it is of no consequence.
It is currently running the VM hosting the oVirt engine, but I
doubt that is relevant either.) I originally built these using
USB sticks as the boot drives, hoping to leave the drive bays
dedicated to VM storage. I bought what seemed to be the best
ones for the job, but the root file system recently kept going
into read-only due to device errors. It only happened once
before I installed oVirt a few weeks ago, but I brought two new
hosts online with oVirt, so maybe I was just lucky with the
first 2 hosts.</div>
<div><br>
</div>
<div>After converting to oVirt, the USB drives started going
read-only more often, and I think they might not have been
fast enough to keep up with the write requests from a host
OS running a full load of VMs (~10), causing performance
problems with some of our VMs. So, I started replacing the
USB drives, putting one host into maintenance at a time.</div>
</div>
<div><br>
</div>
<div>I started with cloudhost04 not because it had gone
read-only, but in hopes of improving performance. While it
was migrating VMs for maintenance mode, the engine marked it
in an unknown state. vdsmd had crashed and would not start
because it could not write logs or pid files. The / mount had
gone ro. </div>
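<div><br>
</div>
<div>(For what it's worth, the quickest check I know of for that state
is to look at the mount options for /, e.g. <font face="courier new, monospace">grep
' / ' /proc/mounts</font>, and see whether the options field says ro
or rw.)</div>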
<div><br>
</div>
<div>I'm not 100% sure now, but I think a couple of VMs were
successfully migrated, while several were not. libvirtd was up
and the VMs were none the wiser, but I could not figure out the
virsh password to tell it to migrate the VMs, despite finding a
thread on this list with some pointers. So, we logged into
each VM and shut them down manually. Once virsh listed no
running VMs, I shut down clh04, confirmed 'Host has been
Rebooted' in the GUI, and the engine let me start the VMs on
another host. (I.e., they went from status "unknown" to
status "down".)</div>
<div><br>
</div>
<div>I yanked the USB drive, removed the host in the GUI,
and then rebuilt the node on a SATA SSD using the instructions
linked above. All has been fine with that node since. I
followed the same procedure on cloudhost03 with no trouble at
all. It went into maintenance mode without a hitch.</div>
<div><br>
</div>
<div>Now, for the case in question. I put cloudhost02 into
maintenance mode and went to lunch. After lunch and some
other distractions, I found it had successfully migrated all but
2 of the VMs to other nodes before becoming unresponsive.
Unlike cloudhost04, which I could ssh into and manipulate, 02
had become completely unresponsive. Even the console was just
a black screen with no response to mouse or keyboard. The
rest is covered previously in this thread. In short, no
amount of node rebooting, confirming host reboots, engine
restarting, or removing and adding hosts back in convinced the
engine to take the VMs out of unknown status. Also, the host
column was blank for the entire time they were in unknown
status, as far as I know.</div>
<div><br>
</div>
<div>(In case anyone is curious, I have had no confirmation on
performance improvements since converting all nodes to SSD
boot drives, but I have had no complaints since either.)</div>
<div> </div>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
It would be very helpful if you have the engine logs at the
time the host<br>
went down.<br>
</blockquote>
<div><br>
</div>
<div>Attached. Also, I have the USB stick from cloudhost02, so
if there are logs on there that might help, just point me to
them. The migration started around 2012-10-09 12:36:28. I
started fighting with it again after 2PM.</div>
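<div><br>
</div>
<div>(My plan for pulling anything useful off that stick is just to
mount it read-only on another box and copy the vdsm logs out, roughly
as below; the device name is whatever it shows up as, and I'm assuming
the logs are under the usual /var/log/vdsm/ location.)</div>
<div><br>
</div>
<div><font face="courier new, monospace">[root@admin ~]# mount -o ro
/dev/sdb1 /mnt</font></div>
<div><font face="courier new, monospace">[root@admin ~]# cp -a
/mnt/var/log/vdsm /tmp/cloudhost02-vdsm-logs</font></div>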
<div> </div>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<br>
Once you have those logs in a safe place, the only way to get
the VM out<br>
of that status is to update the database manually. First make
completely<br>
sure that the VM is not running in any host, then do the
following:<br>
<br>
# psql -U engine<br>
psql (9.1.6)<br>
Type "help" for help.<br>
<br>
engine=> update vm_dynamic set status = 0 where vm_guid =
(select<br>
vm_gui from vm_static where vm_name = 'myvm');<br>
UPDATE 1<br>
<br>
(Assuming that the name of your VM is "myvm").<br>
</blockquote>
<div><br>
</div>
<div>That did the trick! Thank you so much, Juan! I had to fix
a typo by changing 'vm_gui' to 'vm_guid' after 'select', but
the error from the first try made that clear. Regardless, it
was all I needed. Here is the corrected syntax all on one
line for others that may be looking for the same fix:</div>
<div><br>
</div>
<div>engine=> update vm_dynamic set status = 0 where vm_guid
= (select vm_guid from vm_static where vm_name = 'myvm');</div>
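<div><br>
</div>
<div>And a quick sanity check afterwards (same caveat that this assumes
the 3.1 schema shown above) to confirm the VM really is back to status
0, i.e. down:</div>
<div><br>
</div>
<div>engine=> select s.vm_name, d.status from vm_dynamic d join
vm_static s on s.vm_guid = d.vm_guid where s.vm_name = 'myvm';</div>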
<div><br>
</div>
</div>
<br>
<fieldset class="mimeAttachmentHeader"></fieldset>
<br>
<pre wrap="">_______________________________________________
Users mailing list
<a class="moz-txt-link-abbreviated" href="mailto:Users@ovirt.org">Users@ovirt.org</a>
<a class="moz-txt-link-freetext" href="http://lists.ovirt.org/mailman/listinfo/users">http://lists.ovirt.org/mailman/listinfo/users</a></pre>
</blockquote>
<br>
</body>
</html>