Re: [ovirt-users] VM running on multiple hosts

On 20 Sep 2017, at 21:08, Yaniv Kaul <ykaul@redhat.com> wrote:

> On Sep 20, 2017 9:50 PM, <support@jac-properties.com> wrote:
>
>> This matches about with what we were thinking, thank you!
>>
>> To answer your questions
>>
>> We do not have power management configured due to it causing a cascading failure early in our deployment. The host was not fenced and "confirm host rebooted" was never used. The VMs were powered on via virsh (this shouldn't have happened)
>>
>> The way they were powered on is most likely why they were corrupted is our thought

yes.
That's why we put a basic password protection to the plain virsh access. Easy to circumvent, but then you're on your own…

Hm, how exactly were they powered on by virsh? Normally this is not possible for oVirt VMs at all due to initial set up of host-specific things (disk paths), we also use transient libvirt domains so stopped VMs are not defined in libvirt once they stop. So I wonder how exactly was this done?
Unless they were in Paused state where you indeed can simply continue the execution.

> We'd be happy if you could share both engine and host logs, including vdsm.log, engine.log and /var/log/messages from both.
> Y.
>
>> Logan
>>
>> On September 20, 2017 at 12:03 PM Michal Skrivanek <michal.skrivanek@redhat.com> wrote:
>>
>>> On 20 Sep 2017, at 18:06, Logan Kuhn <support@jac-properties.com> wrote:
>>>
>>>> We had an incident where a VM hosts' disk filled up, the VMs all went unknown in the web console, but were fully functional if you were to login or use the services of one.
>>>
>>> Hi,
>>> yes, that can happen since the VM's storage is on NAS whereas the server itself is non-functional as the management and all other local processes are using local resources
>>>
>>>> We couldn't migrate them so we powered them down on that host and powered them up and let ovirt choose the host for it, same as always.
>>>
>>> that's a mistake. The host should be fenced in that case, you likely do not have a power management configured, do you? Even when you do not have a fencing device available it should have been resolved manually by rebooting it manually (after fixing the disk problem), or in case of permanent damage (e.g. server needs to be replaced, that takes a week, you need to run those VMs in the meantime elsewhere) it should have been powered off and VM states should be reset by "confirm host has been rebooted" manual action.
>>>
>>> Normally you should now be able to run those VMs while the status of the host is still Not Responding - was it not the case? How exactly you get to the situation that you were able to power up the VMs?

sorry, I meant "Normally you should not be able to run those VMs"

Thanks,
michal

>>>> However the disk image on a few of them were corrupted because once we fixed the host with the full disk, it still thought it should be running the VM. Which promptly corrupted the disk, the error seems to be this in the logs:
>>>
>>> this can only happen for VMs flagged as HA, is it a case?
>>>
>>> Thanks,
>>> michal
>>>
>>>> 2017-09-19 21:59:11,058 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (DefaultQuartzScheduler3) [36c806f6] VM '70cf75c7-0fc2-4bbe-958e-7d0095f70960'(testhub) is running in db and not running on VDS 'ef6dc2a3-af6e-4e00-aa40-493b31263417'(vm-int7)
>>>>
>>>> We upgraded to 4.1.6 from 4.0.6 earlier in the day, I don't really think it's anything more than coincidence, but it's worrying enough to send to the community.
>>>>
>>>> Regards,
>>>> Logan
>>>> _______________________________________________
>>>> Users mailing list
>>>> Users@ovirt.org
>>>> http://lists.ovirt.org/mailman/listinfo/users
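For anyone collecting the requested engine.log excerpts, the VmAnalyzer message quoted in this thread can be picked apart with a small script. This is only a sketch: the pattern is written from the single log line quoted here (with the host ID that was wrapped across two lines rejoined), the function and group names are made up for illustration, and other engine versions may word the message differently.

```python
import re

# Pattern derived from the one VmAnalyzer line quoted in this thread.
# It is an assumption that other occurrences follow the same layout.
VM_MISMATCH = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"(?P<level>\w+)\s+\[(?P<logger>[^\]]+)\] "
    r"\((?P<thread>[^)]+)\) \[(?P<correlation>[^\]]*)\] "
    r"VM '(?P<vm_id>[^']+)'\((?P<vm_name>[^)]+)\) is running in db "
    r"and not running on VDS '(?P<host_id>[^']+)'\((?P<host_name>[^)]+)\)"
)


def parse_vm_mismatch(line):
    """Return the fields of a 'running in db and not running on VDS' line, or None."""
    match = VM_MISMATCH.search(line.strip())
    return match.groupdict() if match else None


# The log line from the thread, with the wrapped host ID rejoined.
sample = (
    "2017-09-19 21:59:11,058 INFO "
    "[org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] "
    "(DefaultQuartzScheduler3) [36c806f6] VM "
    "'70cf75c7-0fc2-4bbe-958e-7d0095f70960'(testhub) is running in db and not "
    "running on VDS 'ef6dc2a3-af6e-4e00-aa40-493b31263417'(vm-int7)"
)

info = parse_vm_mismatch(sample)
if info:
    print(f"{info['timestamp']}: engine DB has {info['vm_name']} on "
          f"{info['host_name']}, but the host does not report it")
```

Running this over a whole engine.log (one call per line) would list every VM the engine believed was running on a host that no longer reported it, which is the state that preceded the corruption described above.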
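The condition in the subject line, one VM live on more than one host, can be checked mechanically once you have a list of running VMs per host. The sketch below is not an oVirt or libvirt API; the function name, the input shape (host name mapped to the VM names it reports as running) and the second host and VM in the example are all hypothetical, and how you gather the lists is left to you.

```python
from collections import defaultdict


def find_multi_host_vms(running_by_host):
    """Map each VM seen on more than one host to the sorted list of those hosts.

    running_by_host: dict of host name -> iterable of VM names that host
    reports as running (hypothetical input format).
    """
    hosts_by_vm = defaultdict(set)
    for host, vms in running_by_host.items():
        for vm in vms:
            hosts_by_vm[vm].add(host)
    return {vm: sorted(hosts) for vm, hosts in hosts_by_vm.items() if len(hosts) > 1}


# Hypothetical snapshot: "vm-int7" and "testhub" appear in the thread,
# "vm-int8" and "buildbox" are invented for the example.
snapshot = {
    "vm-int7": ["testhub"],
    "vm-int8": ["testhub", "buildbox"],
}

duplicates = find_multi_host_vms(snapshot)
print(duplicates)  # {'testhub': ['vm-int7', 'vm-int8']}
```

Any VM that shows up in the result is writing to the same disk image from two places, so it should be stopped on all but one host before anything else is attempted.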
participants (2)
- Michal Skrivanek
- Yaniv Kaul