Re: [ovirt-users] vms in paused state

29 Apr 2016

      --------------010504060305050806040805
Content-Type: text/plain; charset="UTF-8"; format=flowed
Content-Transfer-Encoding: 8bit

Yes, they are still showing the "paused" state.
No, bouncing libvirt didn't help.

I noticed the errors about the ISO domain. Didn't think that was related.
I have been migrating a lot of VMs to ovirt lately, and recently added 
another node.
Also had some problems with /etc/exports for a while, but I think those 
issues are all resolved.

Last "unresponsive" message in vdsm.log was:

vdsm.log.49.xz:jsonrpc.Executor/0::WARNING::*2016-04-21* 
11:00:54,703::vm::5067::virt.vm::(_setUnresponsiveIfTimeout) 
vmId=`b6a13808-9552-401b-840b-4f7022e8293d`::monitor become unresponsive 
(command timeout, age=310323.97)
vdsm.log.49.xz:jsonrpc.Executor/0::WARNING::2016-04-21 
11:00:54,703::vm::5067::virt.vm::(_setUnresponsiveIfTimeout) 
vmId=`5bfb140a-a971-4c9c-82c6-277929eb45d4`::monitor become unresponsive 
(command timeout, age=310323.97)
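
To pull these warnings out of a larger set of vdsm logs, something like the following might help. It is only a sketch: it assumes each entry sits on a single line in the real log (the wrapping above comes from the mail client), and the names `parse_unresponsive` and `Unresponsive` are made up for illustration. The unit of `age` is not stated in this thread, so the value is kept as a plain number.

```python
import re
from typing import NamedTuple, Optional

# Matches the "monitor become unresponsive" warnings quoted above.
# Assumption: one log entry per line, mail line-wrapping already removed.
UNRESPONSIVE_RE = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}),\d+::"
    r".*?vmId=`(?P<vm_id>[0-9a-f-]+)`::monitor become unresponsive "
    r"\(command timeout, age=(?P<age>[\d.]+)\)"
)


class Unresponsive(NamedTuple):
    timestamp: str   # "YYYY-MM-DD HH:MM:SS", milliseconds dropped
    vm_id: str       # VM UUID taken from the vmId field
    age: float       # value of the age field; unit not stated in the thread


def parse_unresponsive(line: str) -> Optional[Unresponsive]:
    """Return the parsed warning, or None if the line is something else."""
    m = UNRESPONSIVE_RE.search(line)
    if m is None:
        return None
    return Unresponsive(f"{m['date']} {m['time']}", m["vm_id"], float(m["age"]))


# One of the entries quoted above, joined back into a single line.
sample = (
    "vdsm.log.49.xz:jsonrpc.Executor/0::WARNING::2016-04-21 "
    "11:00:54,703::vm::5067::virt.vm::(_setUnresponsiveIfTimeout) "
    "vmId=`5bfb140a-a971-4c9c-82c6-277929eb45d4`::monitor become unresponsive "
    "(command timeout, age=310323.97)"
)
```

Calling `parse_unresponsive(sample)` gives back the timestamp, the VM UUID, and the age, so the entries can be sorted or grouped per VM.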

Thanks.

On 4/29/16 1:40 AM, Michal Skrivanek wrote:
...
...
On 28 Apr 2016, at 19:40, Bill James <bill.james@j2.com> wrote:
Thank you for the response.
I bolded the ones that are listed as "paused".
[root@ovirt1 test vdsm]# virsh -r list --all
 Id    Name                           State
----------------------------------------------------
...
Looks like problem started around 2016-04-17 20:19:34,822, based on 
engine.log attached.
Yes, that time looks correct. Any idea what might have been the trigger?
Did anything interesting happen at that time (power outage of some host,
some maintenance action, anything)?
The logs indicate a problem when vdsm talks to libvirt (all those "monitor
become unresponsive" warnings).
It does seem that at that time you started to have some storage
connectivity issues, the first one at 2016-04-17 20:06:53,929. And it
doesn't look temporary, because such errors are still there a couple of
hours later (in the most recent file you attached I can see one at 23:00:54).
When I/O gets blocked, the VMs may experience issues (the VM then gets
Paused), or their qemu process gets stuck (resulting in libvirt either
reporting an error or getting stuck as well, which is what vdsm
sees as "monitor unresponsive").
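
The timeout idea behind a "monitor unresponsive" report can be illustrated roughly as below. This is not vdsm's code. `MonitorWatch` and its methods are hypothetical, and the sketch assumes that "age" means the time since the last successful reply from the VM's monitor.

```python
import time
from typing import Callable


class MonitorWatch:
    """Hypothetical tracker, not vdsm's implementation.

    Flags a VM monitor as unresponsive when no reply has arrived
    within `timeout` seconds.
    """

    def __init__(self, timeout: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.timeout = timeout
        self._clock = clock
        self._last_reply = clock()

    def reply_received(self) -> None:
        # Call whenever a monitor command completes successfully.
        self._last_reply = self._clock()

    def age(self) -> float:
        # Seconds since the last successful reply.
        return self._clock() - self._last_reply

    def is_unresponsive(self) -> bool:
        return self.age() > self.timeout
```

With blocked I/O the monitor command never returns, `reply_received()` is never called, and the age keeps growing until something restores the connection.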
Since you now bounced libvirtd - did it help? Do you still see wrong 
status for those VMs and still those "monitor unresponsive" errors in 
vdsm.log?
If not... then I would suspect the "vm recovery" code not working
correctly. Milan is looking at that.
Thanks,
michal
...
There are a lot of vdsm logs!
FYI, the storage domain for these VMs is a "local" NFS share,
7e566f55-e060-47b7-bfa4-ac3c48d70dda.
Attached more logs.
On 04/28/2016 12:53 AM, Michal Skrivanek wrote:
...
...
On 27 Apr 2016, at 19:16, Bill James <bill.james@j2.com> wrote:
virsh # list --all
error: failed to connect to the hypervisor
error: no valid connection
error: Failed to connect socket to '/var/run/libvirt/libvirt-sock': No such file or directory
You need to run virsh in read-only mode:
virsh -r list --all
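
To compare what libvirt reports with what the engine shows, the output of that command can be grouped by state. The snippet below is a sketch, not a tested tool: `parse_virsh_list` and `list_domains` are illustrative names, the sample rows are made up (the real output is elided in this thread), and it assumes domain names contain no whitespace.

```python
import subprocess
from typing import Dict, List


def parse_virsh_list(output: str) -> Dict[str, List[str]]:
    """Group domain names by the State column of `virsh list --all` output."""
    by_state: Dict[str, List[str]] = {}
    for line in output.splitlines():
        # Columns: Id, Name, State. The state may contain a space ("shut off").
        parts = line.split(None, 2)
        if len(parts) < 3 or parts[0] == "Id":
            continue  # header, separator, or blank line
        _dom_id, name, state = parts
        by_state.setdefault(state.strip(), []).append(name)
    return by_state


def list_domains() -> Dict[str, List[str]]:
    """Run virsh in read-only mode, as suggested above, and parse the result."""
    result = subprocess.run(
        ["virsh", "-r", "list", "--all"],
        capture_output=True, text=True, check=True,
    )
    return parse_virsh_list(result.stdout)


# Made-up sample in the layout virsh prints; not the output from this host.
SAMPLE = """\
 Id    Name                           State
----------------------------------------------------
 1     vm-a                           running
 2     vm-b                           paused
 -     vm-c                           shut off
"""
```

`parse_virsh_list(SAMPLE)["paused"]` then lists the domains libvirt itself considers paused, which is the set to check against the engine's view.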
...
[root@ovirt1 test vdsm]# systemctl status libvirtd
● libvirtd.service - Virtualization daemon
   Loaded: loaded (/usr/lib/systemd/system/libvirtd.service; enabled; vendor preset: enabled)
  Drop-In: /etc/systemd/system/libvirtd.service.d
           └─unlimited-core.conf
   Active: active (running) since Thu 2016-04-21 16:00:03 PDT; 5 days ago
tried systemctl restart libvirtd.
No change.
Attached vdsm.log and supervdsm.log.
[root@ovirt1 test vdsm]# systemctl status vdsmd
● vdsmd.service - Virtual Desktop Server Manager
   Loaded: loaded (/usr/lib/systemd/system/vdsmd.service; enabled; vendor preset: enabled)
   Active: active (running) since Wed 2016-04-27 10:09:14 PDT; 3min 46s ago
vdsm-4.17.18-0.el7.centos.noarch
The attached vdsm.log is good, but it covers too short an interval; it only shows the recovery (vdsm restart) phase, when the VMs are identified as paused... Can you add earlier logs? Did you restart vdsm yourself, or did it crash?
...
libvirt-daemon-1.2.17-13.el7_2.4.x86_64
Thanks.
On 04/26/2016 11:35 PM, Michal Skrivanek wrote:
...
...
On 27 Apr 2016, at 02:04, Nir Soffer <nsoffer@redhat.com> wrote:
On Wed, Apr 27, 2016 at 2:03 AM, Bill James <bill.james@j2.com> wrote:
> I have a hardware node that has 26 VMs.
> 9 are listed as "running", 17 are listed as "paused".
>
> In truth all VMs are up and running fine.
>
> I tried telling the db they are up:
>
> engine=> update vm_dynamic set status = 1 where vm_guid =(select
> vm_guid from vm_static where vm_name = 'api1.test.j2noc.com');
>
> GUI then shows it up for a short while,
>
> then puts it back in paused state.
>
> 2016-04-26 15:16:46,095 INFO [org.ovirt.engine.core.vdsbroker.VmAnalyzer] (DefaultQuartzScheduler_Worker-16) [157cc21e] VM '242ca0af-4ab2-4dd6-b515-5d435e6452c4'(api1.test.j2noc.com) moved from 'Up' --> 'Paused'
> 2016-04-26 15:16:46,221 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-16) [157cc21e] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM api1.test.j2noc.com has been paused.
>
>
> Why does the engine think the VMs are paused?
> Attached engine.log.
>
> I can fix the problem by powering off the VM then starting it back up.
> But the VM is working fine! How do I get ovirt to realize that?
If this is an issue in engine, restarting engine may fix this.
but having this problem only with one node, I don't think this is the issue.
If this is an issue in vdsm, restarting vdsm may fix this.
If this does not help, maybe this is a libvirt issue? Did you try to check
the VM status using virsh?
This looks more likely, as it seems such a status is being reported.
Logs would help, vdsm.log at the very least.
...
If virsh thinks that the vms are paused, you can try to restart libvirtd.
Please file a bug about this in any case with engine and vdsm logs.
Adding Michal in case he has better idea how to proceed.
Nir
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users
<engine.log-20160421.gz><vdsm.logs.tar.gz>

--------------010504060305050806040805--