
--Apple-Mail=_FC29514C-FC8B-44E8-A9D8-B78C0939837D
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset=utf-8

> On 28 Apr 2016, at 19:40, Bill James <bill.james@j2.com> wrote:
>
> thank you for response.
> I bold-ed the ones that are listed as "paused".
>
>
> [root@ovirt1 test vdsm]# virsh -r list --all
>  Id    Name                           State
> ----------------------------------------------------
> *2     puppet.test.j2noc.com          running*
> *4     sftp2.test.j2noc.com           running*
> *5     oct.test.j2noc.com             running*
> *6     sftp2.dev.j2noc.com            running*
> *10    darmaster1.test.j2noc.com      running*
> *14    api1.test.j2noc.com            running*
>  25    ftp1.frb.test.j2noc.com        running
>  26    auto7.test.j2noc.com           running
> *32    epaymv02.j2noc.com             running*
>  34    media2.frb.test.j2noc.com      running
>  36    auto2.j2noc.com                running
>  44    nfs.testhvy2.colo.j2noc.com    running
> *53    billapp-zuma1.dev.j2noc.com    running*
> *54    billing-ci.dev.j2noc.com       running*
>  60    log2.test.j2noc.com            running
>  63    log1.test.j2noc.com            running
> *69    sonar.dev.j2noc.com            running*
> *73    billapp-ui1.dev.j2noc.com      running*
> *74    billappvm01.dev.j2noc.com      running*
>  75    db2.frb.test.j2noc.com         running
>  83    billapp-ui1.test.j2noc.com     running
> *84    epayvm01.test.j2noc.com        running*
> *87    billappvm01.test.j2noc.com     running*
> *89    etapi1.test.j2noc.com          running*
> *93    billapp-zuma2.test.j2noc.com   running*
> *94    git.dev.j2noc.com              running*
>
> Yes I did "systemctl restart libvirtd" which apparently also restart vdsm?

yes, it does.
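
For a quick cross-check of what libvirt reports against what the engine shows, something like the following might help. It is only a sketch: it parses the text output of `virsh -r list --all` as pasted above and compares it with a name-to-status mapping that you would have to fill in yourself from the GUI or the database. The helper names and the `ENGINE_TO_LIBVIRT` table are made up for illustration, and the table covers only the two engine states that appear in this thread ('Up' and 'Paused').

```python
from __future__ import annotations

# Assumption: only the two engine states seen in this thread are mapped.
ENGINE_TO_LIBVIRT = {"Up": "running", "Paused": "paused"}


def parse_virsh_list(output: str) -> dict[str, str]:
    """Map domain name -> state from the text output of `virsh -r list --all`."""
    states: dict[str, str] = {}
    for line in output.splitlines():
        # Columns are: Id, Name, State. The state may contain a space ("shut off").
        parts = line.split(None, 2)
        if len(parts) < 3 or parts[0] == "Id":
            continue  # blank line, dashed separator, or header row
        _dom_id, name, state = parts
        states[name] = state.strip()
    return states


def find_mismatches(
    libvirt_states: dict[str, str], engine_states: dict[str, str]
) -> dict[str, tuple[str, str]]:
    """Return {vm_name: (engine_state, libvirt_state)} where the two views disagree."""
    mismatches: dict[str, tuple[str, str]] = {}
    for name, engine_state in engine_states.items():
        expected = ENGINE_TO_LIBVIRT.get(engine_state)
        actual = libvirt_states.get(name)
        if expected is None or actual is None:
            continue  # unmapped engine state or VM not known to libvirt
        if actual != expected:
            mismatches[name] = (engine_state, actual)
    return mismatches


# Two rows taken from the listing in this thread.
SAMPLE_OUTPUT = """\
 Id    Name                           State
----------------------------------------------------
 14    api1.test.j2noc.com            running
 25    ftp1.frb.test.j2noc.com        running
"""

# Hand-entered engine view: api1 was one of the VMs shown as paused.
ENGINE_VIEW = {"api1.test.j2noc.com": "Paused", "ftp1.frb.test.j2noc.com": "Up"}

if __name__ == "__main__":
    for vm, (engine, libvirt) in find_mismatches(
        parse_virsh_list(SAMPLE_OUTPUT), ENGINE_VIEW
    ).items():
        print(f"{vm}: engine says {engine}, libvirt says {libvirt}")
```

A VM that shows up here is one where the engine and libvirt disagree, which is the situation described for the 17 "paused" VMs.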

> Looks like problem started around 2016-04-17 20:19:34,822, based on engine.log attached.

Yes, that time looks correct. Any idea what might have been a trigger? Did anything interesting happen at that time (power outage of some host, some maintenance action, anything)?
The logs indicate a problem when vdsm talks to libvirt (all those "monitor become unresponsive" messages).

It does seem that at that time you started to have some storage connectivity issues - the first one at 2016-04-17 20:06:53,929. And it doesn't look temporary, because such errors are still there a couple of hours later (in the most recent file you attached I can see one at 23:00:54).
When I/O gets blocked the VMs may experience issues (then the VM gets Paused), or their qemu process gets stuck (resulting in libvirt either reporting an error or getting stuck as well -> resulting in what vdsm sees as "monitor unresponsive").

Since you now bounced libvirtd - did it help? Do you still see wrong status for those VMs and still those "monitor unresponsive" errors in vdsm.log?
If not…then I would suspect the "vm recovery" code not working correctly. Milan is looking at that.

Thanks,
michal

> There's a lot of vdsm logs!
>
> fyi, the storage domain for these Vms is a "local" nfs share, 7e566f55-e060-47b7-bfa4-ac3c48d70dda.
>
> attached more logs.
>
>
> On 04/28/2016 12:53 AM, Michal Skrivanek wrote:
>>> On 27 Apr 2016, at 19:16, Bill James <bill.james@j2.com> wrote:
>>>
>>> virsh # list --all
>>> error: failed to connect to the hypervisor
>>> error: no valid connection
>>> error: Failed to connect socket to '/var/run/libvirt/libvirt-sock': No such file or directory
>>>
>> you need to run virsh in read-only mode
>> virsh -r list --all
>>
>>> [root@ovirt1 test vdsm]# systemctl status libvirtd
>>> ● libvirtd.service - Virtualization daemon
>>>    Loaded: loaded (/usr/lib/systemd/system/libvirtd.service; enabled; vendor preset: enabled)
>>>   Drop-In: /etc/systemd/system/libvirtd.service.d
>>>            └─unlimited-core.conf
>>>    Active: active (running) since Thu 2016-04-21 16:00:03 PDT; 5 days ago
>>>
>>>
>>> tried systemctl restart libvirtd.
>>> No change.
>>>
>>> Attached vdsm.log and supervdsm.log.
>>>
>>>
>>> [root@ovirt1 test vdsm]# systemctl status vdsmd
>>> ● vdsmd.service - Virtual Desktop Server Manager
>>>    Loaded: loaded (/usr/lib/systemd/system/vdsmd.service; enabled; vendor preset: enabled)
>>>    Active: active (running) since Wed 2016-04-27 10:09:14 PDT; 3min 46s ago
>>>
>>>
>>> vdsm-4.17.18-0.el7.centos.noarch
>> the vdsm.log attach is good, but it's too short interval, it only shows recovery (vdsm restart) phase when the VMs are identified as paused….can you add earlier logs? Did you restart vdsm yourself or did it crash?
>>
>>> libvirt-daemon-1.2.17-13.el7_2.4.x86_64
>>>
>>>
>>> Thanks.
>>>
>>>
>>> On 04/26/2016 11:35 PM, Michal Skrivanek wrote:
>>>>> On 27 Apr 2016, at 02:04, Nir Soffer <nsoffer@redhat.com> wrote:
>>>>>
>>>>> On Wed, Apr 27, 2016 at 2:03 AM, Bill James <bill.james@j2.com> wrote:
>>>>>> I have a hardware node that has 26 VMs.
>>>>>> 9 are listed as "running", 17 are listed as "paused".
>>>>>>
>>>>>> In truth all VMs are up and running fine.
>>>>>>
>>>>>> I tried telling the db they are up:
>>>>>>
>>>>>> engine=> update vm_dynamic set status = 1 where vm_guid =(select vm_guid from vm_static where vm_name = 'api1.test.j2noc.com');
>>>>>>
>>>>>> GUI then shows it up for a short while,
>>>>>>
>>>>>> then puts it back in paused state.
>>>>>>
>>>>>> 2016-04-26 15:16:46,095 INFO [org.ovirt.engine.core.vdsbroker.VmAnalyzer] (DefaultQuartzScheduler_Worker-16) [157cc21e] VM '242ca0af-4ab2-4dd6-b515-5d435e6452c4'(api1.test.j2noc.com) moved from 'Up' --> 'Paused'
>>>>>> 2016-04-26 15:16:46,221 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-16) [157cc21e] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM api1.test.j2noc.com has been paused.
>>>>>>
>>>>>>
>>>>>> Why does the engine think the VMs are paused?
>>>>>> Attached engine.log.
>>>>>>
>>>>>> I can fix the problem by powering off the VM then starting it back up.
>>>>>> But the VM is working fine! How do I get ovirt to realize that?
>>>>> If this is an issue in engine, restarting engine may fix this.
>>>>> but having this problem only with one node, I don't think this is the issue.
>>>>>
>>>>> If this is an issue in vdsm, restarting vdsm may fix this.
>>>>>
>>>>> If this does not help, maybe this is libvirt issue? did you try to check vm status using virsh?
>>>> this looks more likely as it seems such status is being reported
>>>> logs would help, vdsm.log at the very least.
>>>>
>>>>> If virsh thinks that the vms are paused, you can try to restart libvirtd.
>>>>>
>>>>> Please file a bug about this in any case with engine and vdsm logs.
>>>>>
>>>>> Adding Michal in case he has better idea how to proceed.
>>>>>
>>>>> Nir
>>>
>>> Cloud Services for Business www.j2.com
>>> j2 | eFax | eVoice | FuseMail | Campaigner | KeepItSafe | Onebox
>>>
>>>
>>> This email, its contents and attachments contain information from j2 Global, Inc. and/or its affiliates which may be privileged, confidential or otherwise protected from disclosure. The information is intended to be for the addressee(s) only. If you are not an addressee, any disclosure, copy, distribution, or use of the contents of this message is prohibited. If you have received this email in error please notify the sender by reply e-mail and delete the original message and any copies. (c) 2015 j2 Global, Inc. All rights reserved. eFax, eVoice, Campaigner, FuseMail, KeepItSafe, and Onebox are registered trademarks of j2 Global, Inc. and its affiliates.
>>> <supervdsm.log.gz><vdsm.log.gz>_______________________________________________
>>> Users mailing list
>>> Users@ovirt.org
>>> http://lists.ovirt.org/mailman/listinfo/users
>>
>
> <engine.log-20160421.gz><vdsm.logs.tar.gz>
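
The question above about whether the "monitor unresponsive" errors are still showing up in vdsm.log can be checked mechanically. The following is a rough sketch, not anything shipped with vdsm: it scans log lines for a marker string and reports the first and last timestamp at which it appears, plus a count. It assumes each relevant line carries a timestamp of the form `YYYY-MM-DD HH:MM:SS,mmm` (the form seen in the engine.log excerpt in this thread) and that the marker text matches what the log actually prints. The function name and the sample lines are made up for illustration.

```python
import re
from datetime import datetime

# Assumption: timestamps look like "2016-04-26 15:16:46,095" somewhere in the line.
TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}")


def marker_window(lines, marker):
    """Return (first_seen, last_seen, count) for lines containing `marker`.

    Lines without a recognizable timestamp are ignored. If nothing matches,
    the result is (None, None, 0).
    """
    first = last = None
    count = 0
    for line in lines:
        if marker not in line:
            continue
        match = TIMESTAMP.search(line)
        if match is None:
            continue
        stamp = datetime.strptime(match.group(0), "%Y-%m-%d %H:%M:%S,%f")
        if first is None or stamp < first:
            first = stamp
        if last is None or stamp > last:
            last = stamp
        count += 1
    return first, last, count


# Made-up sample lines, NOT real vdsm.log output.
SAMPLE_LINES = [
    "2000-01-01 00:00:01,000 WARNING sample: monitor become unresponsive",
    "2000-01-01 01:30:00,000 INFO sample: unrelated line",
    "2000-01-01 03:00:00,000 WARNING sample: monitor become unresponsive",
]

if __name__ == "__main__":
    first, last, count = marker_window(SAMPLE_LINES, "monitor become unresponsive")
    print(f"{count} hits, first at {first}, last at {last}")
```

For a real log you would pass an open file object instead of `SAMPLE_LINES`. If the last hit is older than the libvirtd restart, the errors stopped after the restart.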
--Apple-Mail=_FC29514C-FC8B-44E8-A9D8-B78C0939837D--