I do use the automatic migration policy.
The main questions you have to answer are:
1. Why did the nodes become 'Non-Operational'? Usually this happens when the
engine (in your case the HostedEngine VM) cannot reach the nodes over the
management network.
By default, management traffic goes over the ovirtmgmt network. I guess you created the
new network, marked it as the management network, and then the switch was powered off,
causing the 'Non-Operational' state.
2. Migrating VMs is usually a safe approach, but this behavior is quite strange. If the nodes
are Non-Operational, there can be no successful migration.
3. Some of the VMs got paused due to a storage issue. Are you using GlusterFS, NFS or
iSCSI? If yes, you need to find out why you lost access to your storage.
For now, I guess you can set each VM to allow manual migration only (VM -> Edit) and, if
they are critical VMs, enable High Availability in each VM's Edit options.
In that case, if a node fails, the VMs will be restarted on another node.
4. Have you set up node fencing? Fencing mechanisms such as APC, iLO, iDRAC and others
allow the HostedEngine to use another host as a fencing proxy to reset the problematic
hypervisor.
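If you have many critical VMs, clicking through VM -> Edit for each one gets tedious. Below is a minimal sketch of doing the same change (manual migration only plus High Availability) through the engine REST API (v4). It assumes the `/ovirt-engine/api/vms/<id>` endpoint with the `placement_policy/affinity` value `user_migratable` and the `high_availability` element; `ENGINE_URL`, `ENGINE_USER`, `ENGINE_PASSWORD`, `ENGINE_CA`, the VM id and the helper function names are placeholders I made up, so please verify against the API reference for your oVirt version before using it.

```shell
#!/usr/bin/env bash
# Sketch only: hypothetical helpers for the oVirt REST API (v4).
# ENGINE_URL, ENGINE_USER, ENGINE_PASSWORD and ENGINE_CA are assumed
# environment variables, not values taken from this thread.

# Print the XML body that enables HA and allows manual migration only.
#   $1 = HA priority (integer, defaults to 1)
vm_policy_xml() {
  local priority="${1:-1}"
  cat <<EOF
<vm>
  <high_availability>
    <enabled>true</enabled>
    <priority>${priority}</priority>
  </high_availability>
  <placement_policy>
    <affinity>user_migratable</affinity>
  </placement_policy>
</vm>
EOF
}

# Send the update for one VM.
#   $1 = VM id (UUID as shown by the engine), $2 = HA priority
apply_vm_policy() {
  local vm_id="$1" priority="$2"
  vm_policy_xml "$priority" | curl --silent --show-error --fail \
    --user "${ENGINE_USER}:${ENGINE_PASSWORD}" \
    --cacert "${ENGINE_CA:-ca.pem}" \
    --header 'Content-Type: application/xml' \
    --header 'Accept: application/xml' \
    --request PUT --data-binary @- \
    "${ENGINE_URL}/ovirt-engine/api/vms/${vm_id}"
}
```

Usage would be something like `apply_vm_policy <vm-uuid> 50` after exporting the variables above. Keeping the XML generation in its own function lets you inspect the request body before sending anything to the engine.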
P.S.: You can define the following alias in '~/.bashrc':
alias virsh='virsh -c qemu:///system?authfile=/etc/ovirt-hosted-engine/virsh_auth.conf'
Then you can check your VMs even when the HostedEngine is down:
'virsh list --all'
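To spot the VMs that were paused (for example after losing storage), a small filter over that output can help. This is a sketch that assumes the usual three-column `virsh list --all` table (Id, Name, State) and VM names without spaces; the function name `paused_vms` is hypothetical. Recent virsh versions can also do this directly with `virsh list --state-paused --name`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `virsh list --all` output on stdin and print
# the name of every VM whose state is "paused", one per line.
paused_vms() {
  awk 'NR > 2 && NF >= 3 {
         # State may be several words (e.g. "shut off"), so join fields 3..NF
         state = $3
         for (i = 4; i <= NF; i++) state = state " " $i
         if (state == "paused") print $2
       }'
}

# Example usage on a host (assumption: read-only access via "virsh -r"):
#   virsh -r list --all | paused_vms
#   for vm in $(virsh -r list --all | paused_vms); do
#     virsh -r domstate --reason "$vm"   # shows why the VM was paused
#   done
```

The `domstate --reason` output should tell you whether a VM was paused by an I/O error, which would point back to the storage question above.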
Best Regards,
Strahil Nikolov

On Dec 27, 2019 08:40, zhouhao(a)vip.friendtimes.net wrote:
I had a crash yesterday in my oVirt cluster, which is made up of 3 nodes.
I had just tried to add a new network, but the whole cluster crashed.
I added a new network to my cluster, but while I was debugging the new switch, the
switch was powered off; the nodes detected the network card status as down and moved to
the Non-Operational state.
At that point all 3 nodes were in the Non-Operational state.
All virtual machines started automatic migration. When I received the alert email,
all virtual machines were suspended.
Within 15 minutes my new switch was powered up again. The 3 oVirt nodes became active again, but
many virtual machines became unresponsive or suspended due to the forced migration, and only a
few virtual machines came back up because their migration was cancelled.
I tried to terminate the migration tasks and restart the ovirt-engine service, but I was
still unable to restore most of the virtual machines, so I had to restart all 3 oVirt nodes to
restore them.
I didn't recover all the virtual machines until an hour later.
Then I changed my migration policy to "Do Not Migrate Virtual Machines".
Which migration policy do you recommend?
I'm afraid to use the cluster now...
________________________________
zhouhao(a)vip.friendtimes.net