VM running on multiple hosts

We had an incident where a VM host's disk filled up. The VMs all went unknown in the web console, but were fully functional if you logged in to one or used its services.

We couldn't migrate them, so we powered them down on that host, powered them up again and let oVirt choose the host, same as always.

However, the disk images of a few of them were corrupted, because once we fixed the host with the full disk, it still thought it should be running the VMs, which promptly corrupted the disks. The error seems to be this in the logs:

2017-09-19 21:59:11,058 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (DefaultQuartzScheduler3) [36c806f6] VM '70cf75c7-0fc2-4bbe-958e-7d0095f70960'(testhub) is running in db and not running on VDS 'ef6dc2a3-af6e-4e00-aa40-493b31263417'(vm-int7)

We upgraded to 4.1.6 from 4.0.6 earlier in the day. I don't really think it's anything more than coincidence, but it's worrying enough to send to the community.

Regards,
Logan
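For anyone wanting to scan their own engine log for the same message, here is a minimal Python sketch. It assumes the message format is exactly as in the line quoted above, with the host UUID written without the space that line wrapping introduced. The function name and regex group names are made up for this example and are not part of oVirt.

```python
import re

# Pattern for the VmAnalyzer message quoted above. The group names are
# labels chosen for this sketch, not oVirt terminology.
MISMATCH = re.compile(
    r"VM '(?P<vm_id>[0-9a-f-]+)'\((?P<vm_name>[^)]+)\) is running in db "
    r"and not running on VDS '(?P<host_id>[0-9a-f-]+)'\((?P<host_name>[^)]+)\)"
)


def find_mismatches(lines):
    """Return (vm_name, host_name) pairs for every matching log line."""
    pairs = []
    for line in lines:
        m = MISMATCH.search(line)
        if m:
            pairs.append((m.group("vm_name"), m.group("host_name")))
    return pairs


sample = (
    "2017-09-19 21:59:11,058 INFO "
    "[org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] "
    "(DefaultQuartzScheduler3) [36c806f6] VM "
    "'70cf75c7-0fc2-4bbe-958e-7d0095f70960'(testhub) is running in db "
    "and not running on VDS 'ef6dc2a3-af6e-4e00-aa40-493b31263417'(vm-int7)"
)
print(find_mismatches([sample]))  # [('testhub', 'vm-int7')]
```

Passing the lines of a log file to `find_mismatches` gives a quick list of which VMs the engine believes are running on a host that does not report them.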

On 20 Sep 2017, at 18:06, Logan Kuhn <support@jac-properties.com> wrote:

> We had an incident where a VM hosts' disk filled up, the VMs all went unknown in the web console, but were fully functional if you were to login or use the services of one.

Hi,
yes, that can happen, since the VM’s storage is on NAS whereas the server itself is non-functional, as the management and all other local processes are using local resources.

> We couldn't migrate them so we powered them down on that host and powered them up and let ovirt choose the host for it, same as always.

That’s a mistake. The host should be fenced in that case; you likely do not have power management configured, do you? Even when you do not have a fencing device available, it should have been resolved by rebooting the host manually (after fixing the disk problem), or in case of permanent damage (e.g. the server needs to be replaced, that takes a week, and you need to run those VMs elsewhere in the meantime) it should have been powered off and the VM states should have been reset by the “confirm host has been rebooted” manual action.

Normally you should not be able to run those VMs while the status of the host is still Not Responding - was it not the case? How exactly did you get to the situation that you were able to power up the VMs?

> However the disk image on a few of them were corrupted because once we fixed the host with the full disk, it still thought it should be running the VM. Which promptly corrupted the disk, the error seems to be this in the logs:

This can only happen for VMs flagged as HA, is that the case?

Thanks,
michal

> 2017-09-19 21:59:11,058 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (DefaultQuartzScheduler3) [36c806f6] VM '70cf75c7-0fc2-4bbe-958e-7d0095f70960'(testhub) is running in db and not running on VDS 'ef6dc2a3-af6e-4e00-aa40-493b31263417'(vm-int7)
>
> We upgraded to 4.1.6 from 4.0.6 earlier in the day, I don't really think it's anything more than coincidence, but it's worrying enough to send to the community.
>
> Regards,
> Logan
> _______________________________________________
> Users mailing list
> Users@ovirt.org
> http://lists.ovirt.org/mailman/listinfo/users

This roughly matches what we were thinking, thank you!

To answer your questions:

We do not have power management configured, because it caused a cascading failure early in our deployment. The host was not fenced and "confirm host rebooted" was never used. The VMs were powered on via virsh (this shouldn't have happened).

Our thinking is that the way they were powered on is most likely why they were corrupted.

Logan
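The failure mode in this thread is one VM running on two hosts at once. The following Python sketch shows one way to check for it. It assumes you have already collected, per host, the names of the VMs libvirt reports as running (for example with `virsh -r list --name` on each host). The function name and the inventory below are hypothetical and purely illustrative; only `testhub` and `vm-int7` appear in the thread.

```python
from collections import defaultdict


def vms_on_multiple_hosts(running_by_host):
    """Return {vm_name: sorted host names} for VMs seen on more than one host."""
    hosts_by_vm = defaultdict(set)
    for host, vms in running_by_host.items():
        for vm in vms:
            hosts_by_vm[vm].add(host)
    return {
        vm: sorted(hosts)
        for vm, hosts in hosts_by_vm.items()
        if len(hosts) > 1
    }


# Hypothetical inventory: host name -> names of VMs running there.
running = {
    "vm-int7": {"testhub", "db1"},
    "vm-int8": {"testhub", "web1"},
}
print(vms_on_multiple_hosts(running))  # {'testhub': ['vm-int7', 'vm-int8']}
```

Any VM in the result has two processes writing to the same disk image and should be treated as an emergency. The check only reports the condition. It does not replace fencing the host or using "confirm host has been rebooted".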
participants (3)
- Logan Kuhn
- Michal Skrivanek
- support@jac-properties.com