On Sep 20, 2017 9:50 PM, <support(a)jac-properties.com> wrote:
This roughly matches what we were thinking, thank you!
To answer your questions:
We do not have power management configured because it caused a cascading
failure early in our deployment. The host was not fenced and "confirm host
has been rebooted" was never used. The VMs were powered on via virsh (this
shouldn't have happened).
Our thought is that the way they were powered on is most likely why they
were corrupted.
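(For reference, a minimal, read-only sketch of checking what libvirt itself thinks is running on a host. It only lists domains, it does not start anything, which is exactly what should be avoided here; "virsh -r" uses the read-only connection, so no vdsm credentials are needed. Assumes Python 3.7+ on the host.)

import subprocess

# Read-only query of libvirt's view of the host; "-r" connects read-only,
# avoiding the credential-protected read-write socket that vdsm owns.
result = subprocess.run(
    ["virsh", "-r", "list", "--all"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)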
We'd be happy if you could share both engine and host logs, including
vdsm.log, engine.log and /var/log/messages from both.
Y.
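(A minimal sketch of bundling those logs into one archive, assuming the default locations: /var/log/ovirt-engine/engine.log on the engine, /var/log/vdsm/vdsm.log and /var/log/messages on the host. Adjust the paths if your layout differs.)

import os
import tarfile
import time

# Default oVirt log locations; collect whichever exist on this machine.
LOG_PATHS = [
    "/var/log/ovirt-engine/engine.log",  # engine only
    "/var/log/vdsm/vdsm.log",            # host only
    "/var/log/messages",                 # both
]

archive = "ovirt-logs-%s.tar.gz" % time.strftime("%Y%m%d-%H%M%S")
with tarfile.open(archive, "w:gz") as tar:
    for path in LOG_PATHS:
        if os.path.exists(path):
            tar.add(path)
print("wrote", archive)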
Logan
On September 20, 2017 at 12:03 PM Michal Skrivanek <michal.skrivanek(a)redhat.com> wrote:
On 20 Sep 2017, at 18:06, Logan Kuhn <support(a)jac-properties.com> wrote:
We had an incident where a VM host's disk filled up. The VMs all went to
Unknown in the web console, but were fully functional if you logged in or
used the services of one.
Hi,
yes, that can happen, since the VMs' storage is on NAS, whereas the server
itself becomes non-functional because the management agent and all other
local processes depend on local resources
We couldn't migrate them, so we powered them down on that host, powered them
back up, and let oVirt choose the host for them, same as always.
that's a mistake. The host should be fenced in that case; you likely do not
have power management configured, do you? Even when you do not have a
fencing device available, it should have been resolved by rebooting the host
manually (after fixing the disk problem), or, in case of permanent damage
(e.g. the server needs to be replaced, that takes a week, and you need to run
those VMs elsewhere in the meantime), it should have been powered off and the
VM states reset with the "confirm host has been rebooted" manual action.
Normally you should then be able to run those VMs while the status of the
host is still Not Responding - was that not the case? How exactly did you get
into the situation where you were able to power up the VMs?
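(A minimal sketch of driving that "confirm host has been rebooted" step through the Python SDK, assuming ovirtsdk4 is installed and that fence_type="manual" corresponds to that manual action; the engine URL, credentials, CA path, and host name are placeholders. Only do this once you are certain the host is really down, otherwise you risk exactly the double-run corruption discussed in this thread.)

import ovirtsdk4 as sdk

# Placeholder connection details - replace with your engine and credentials.
connection = sdk.Connection(
    url="https://engine.example.com/ovirt-engine/api",
    username="admin@internal",
    password="secret",
    ca_file="/etc/pki/ovirt-engine/ca.pem",
)
try:
    hosts_service = connection.system_service().hosts_service()
    host = hosts_service.list(search="name=vm-int7")[0]
    # "manual" is assumed here to map to the UI's
    # "Confirm 'Host has been Rebooted'" action; verify against your
    # engine's API documentation before relying on it.
    hosts_service.host_service(host.id).fence(fence_type="manual")
finally:
    connection.close()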
However, the disk images on a few of them were corrupted, because once we
fixed the host with the full disk, it still thought it should be running
those VMs, which promptly corrupted the disks. The error in the logs seems
to be this:
this can only happen for VMs flagged as HA - is that the case?
Thanks,
michal
2017-09-19 21:59:11,058 INFO
[org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer]
(DefaultQuartzScheduler3) [36c806f6] VM
'70cf75c7-0fc2-4bbe-958e-7d0095f70960'(testhub)
is running in db and not running on VDS
'ef6dc2a3-af6e-4e00-aa40-493b31263417'(vm-int7)
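(Regarding the HA question above, a minimal sketch of checking whether a VM - e.g. testhub from the log line - is flagged as highly available, assuming ovirtsdk4 and placeholder engine details.)

import ovirtsdk4 as sdk

# Placeholder connection details - replace with your engine and credentials.
connection = sdk.Connection(
    url="https://engine.example.com/ovirt-engine/api",
    username="admin@internal",
    password="secret",
    ca_file="/etc/pki/ovirt-engine/ca.pem",
)
try:
    vms_service = connection.system_service().vms_service()
    vm = vms_service.list(search="name=testhub")[0]
    ha = vm.high_availability.enabled if vm.high_availability else False
    print("%s: status=%s, highly_available=%s" % (vm.name, vm.status, ha))
finally:
    connection.close()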
We upgraded from 4.0.6 to 4.1.6 earlier in the day. I don't really think
it's anything more than a coincidence, but it's worrying enough to bring to
the community.
Regards,
Logan
_______________________________________________
Users mailing list
Users(a)ovirt.org
http://lists.ovirt.org/mailman/listinfo/users