
Hi Haim,

I wanted to wait before sending this mail until the problem occurred again. I first disabled live migration for the cluster to make sure the second node wouldn't run into the same problem once a migration is started. It seems the problem isn't caused by migration, as I ran into the same error again today.

Log snippet from the web GUI:

2012-Oct-19,04:28:13 "Host deovn-a01 cannot access one of the Storage Domains attached to it, or the Data Center object. Setting Host state to Non-Operational."

--> All VMs are running properly, although the engine reports otherwise. Even the VM status in the engine GUI is wrong: it shows "<vmname> reboot in progress", but no reboot was initiated (ssh/rdp connections and file operations are working fine).

The engine log says for this period:

cat /var/log/ovirt-engine/engine.log | grep 04:2*
2012-10-19 04:23:13,773 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-94) domain ccaa4e7a-fa89-46a6-a6e0-07dfe78d1bd5 in problem. vds: deovn-a01
2012-10-19 04:28:13,775 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-1) starting ProcessDomainRecovery for domain ccaa4e7a-fa89-46a6-a6e0-07dfe78d1bd5
2012-10-19 04:28:13,799 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-1) vds deovn-a01 reported domain ccaa4e7a-fa89-46a6-a6e0-07dfe78d1bd5:DE-VM-SYSTEM as in problem, moving the vds to status NonOperational
2012-10-19 04:28:13,882 INFO [org.ovirt.engine.core.bll.SetNonOperationalVdsCommand] (QuartzScheduler_Worker-1) Running command: SetNonOperationalVdsCommand internal: true. Entities affected : ID: 66b546c2-ae62-11e1-b734-5254005cbe44 Type: VDS
2012-10-19 04:28:13,884 INFO [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (QuartzScheduler_Worker-1) START, SetVdsStatusVDSCommand(vdsId = 66b546c2-ae62-11e1-b734-5254005cbe44, status=NonOperational, nonOperationalReason=STORAGE_DOMAIN_UNREACHABLE), log id: daad8bd
2012-10-19 04:28:13,888 INFO [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (QuartzScheduler_Worker-1) FINISH, SetVdsStatusVDSCommand, log id: daad8bd
2012-10-19 04:28:19,690 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-38) domain ccaa4e7a-fa89-46a6-a6e0-07dfe78d1bd5 in problem. vds: deovn-a01

I think the first entry is the important one:

2012-10-19 04:23:13,773 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-94) domain ccaa4e7a-fa89-46a6-a6e0-07dfe78d1bd5 in problem. vds: deovn-a01

--> Which problem? There is no debug info for that time period to work out where the problem could come from :/

On the affected node I grepped /var/log/vdsm for ERROR:

Thread-254302::ERROR::2012-10-12 16:01:11,359::vm::950::vm.Vm::(getStats) vmId=`537eea7c-d12c-461f-adfb-6a1f2ebff4fb`::Error fetching vm stats

plus about 20 more of the same type with the same vmId. I'm sure this is an aftereffect, since the engine can't tell the status of the VMs.

Can you give me advice on where I can find more information to solve this issue? Or perhaps a scenario I can try?
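In case it helps, this is what I plan to capture on the affected node the next time the domain is flagged "in problem" (just a rough sketch; I'm assuming the vdsClient on this vdsm build exposes the repoStats and getStorageDomainInfo verbs, please correct me if not):

# vdsm's own view of the monitored storage domains (delay / valid flags)
vdsClient -s 0 repoStats

# details of the affected domain
vdsClient -s 0 getStorageDomainInfo ccaa4e7a-fa89-46a6-a6e0-07dfe78d1bd5

# storage path health on the node at that moment
multipath -ll
iscsiadm -m session -P 3 | grep -i state

# vdsm log context around the event
grep ccaa4e7a-fa89-46a6-a6e0-07dfe78d1bd5 /var/log/vdsm/vdsm.log | tail -n 50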
I have another curiosity I wanted to ask about in a separate mail, but perhaps it has something to do with my issue: the elected SPM is not part of this cluster and has only 2 storage paths (multipath) to the SAN. The problematic cluster has 4 storage paths (bigger hypervisors), and all storage paths are connected successfully.

Does the SPM detect this difference, or is that irrelevant because each host detects its possible paths on its own (which is what I assume)?

Currently in use: oVirt-engine 3.0, oVirt-node 2.3.0 --> is there any problem mixing node versions with regard to the ovirt-engine version?

Sorry for the amount of questions, I really want to understand the "oVirt mechanism" completely in order to build up a fail-safe virtual environment :)

Thanks in advance.

Best,
Sven.

-----Original Message-----
From: Haim Ateya [mailto:hateya@redhat.com]
Sent: Tuesday, 16 October 2012 14:38
To: Sven Knohsalla
Cc: users@ovirt.org; Itamar Heim; Omer Frenkel
Subject: Re: [Users] ITA-2967 URGENT: ovirt Node turns status to "non operational" STORAGE_DOMAIN_UNREACHABLE

Hi Sven,

can you attach full logs from the second host (the problematic one)? I guess it's "deovn-a01".

2012-10-15 11:13:38,197 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-33) domain ccaa4e7a-fa89-46a6-a6e0-07dfe78d1bd5 in problem. vds: deovn-a01

----- Original Message -----
From: "Omer Frenkel" <ofrenkel@redhat.com>
To: "Itamar Heim" <iheim@redhat.com>, "Sven Knohsalla" <s.knohsalla@netbiscuits.com>
Cc: users@ovirt.org
Sent: Tuesday, October 16, 2012 2:02:50 PM
Subject: Re: [Users] ITA-2967 URGENT: ovirt Node turns status to "non operational" STORAGE_DOMAIN_UNREACHABLE
----- Original Message -----
From: "Itamar Heim" <iheim@redhat.com>
To: "Sven Knohsalla" <s.knohsalla@netbiscuits.com>
Cc: users@ovirt.org
Sent: Monday, October 15, 2012 8:36:07 PM
Subject: Re: [Users] ITA-2967 URGENT: ovirt Node turns status to "non operational" STORAGE_DOMAIN_UNREACHABLE
On 10/15/2012 03:56 PM, Sven Knohsalla wrote:
Hi,
sometimes one hypervisor's status turns to "Non-operational" with the error "STORAGE_DOMAIN_UNREACHABLE" and live migration (activated for all VMs) starts.
I don't currently know why the ovirt-node turns to this status, because the connected iSCSI SAN is available the whole time (checked via iscsi session and lsblk), and I'm also able to read/write on the SAN during that time.
We can simply activate this ovirt-node and it comes up again. The migration process then runs from scratch and hits the same error --> a reboot of the ovirt-node is necessary!
When a hypervisor turns to "non-operational" status, live migration starts and tries to migrate ~25 VMs (~100 GB of RAM to migrate).
During that process the network load goes to 100%, some VMs get migrated, and then the destination host also turns to "non-operational" status with the error "STORAGE_DOMAIN_UNREACHABLE".
Many VMs are still running on their origin host, some are paused, some are showing “migration from” status.
After a reboot of the origin host, the VMs of course turn into an unknown state.
So the whole cluster is down :/
For this problem I have some questions:
-Does ovirt engine just use the ovirt-mgmt network for migration/HA?
yes.
-If so, is there any possibility to *add*/switch a network for migration/HA?
you can bond, not yet add another one.
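For reference, a rough sketch of what a bonded ovirtmgmt uplink could look like in ifcfg terms (device names and bond mode here are purely illustrative, and on oVirt Node the bond is normally created through the engine's host network setup rather than by editing files):

# /etc/sysconfig/network-scripts/ifcfg-bond0 (illustrative)
DEVICE=bond0
BONDING_OPTS="mode=4 miimon=100"   # 802.3ad needs switch support; mode=1 (active-backup) works without it
BRIDGE=ovirtmgmt
ONBOOT=yes

# /etc/sysconfig/network-scripts/ifcfg-eth0 (same for the second slave NIC)
DEVICE=eth0
MASTER=bond0
SLAVE=yes
ONBOOT=yes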
-Is the way we are using live migration not recommended?
-Which engine module checks the availability of the storage domain for the ovirt-nodes?
the engine.
-Is there any timeout/cache option we can set/increase to avoid this problem?
well, it's not clear what the problem is. also, vdsm is supposed to throttle live migration to 3 VMs in parallel, iirc. also, you can configure at the cluster level not to live migrate VMs on non-operational status.
-Is there any known problem with the versions we are using? (Migration to ovirt-engine 3.1 is not possible atm)
oh, the cluster level migration policy on non operational may be a 3.1 feature, not sure.
AFAIR, it's in 3.0
-Is it possible to modify the migration queue to just migrate a max. of 4 VMs at the same time for example?
yes, there is a vdsm config for that. i am pretty sure 3 is the default though?
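For reference, a sketch of what that could look like in /etc/vdsm/vdsm.conf on the node (the option names are taken from vdsm's configuration defaults, so please verify them against the vdsm version actually installed; vdsmd has to be restarted for a change to take effect):

[vars]
# cap concurrent outgoing live migrations per host
max_outgoing_migrations = 4
# optionally cap per-migration bandwidth in MiB/s so ovirtmgmt is not saturated
migration_max_bandwidth = 30

# afterwards, on the node:
service vdsmd restart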
_ovirt-engine:_
FC 16: 3.3.6-3.fc16.x86_64
Engine: 3.0.0_0001-1.6.fc16
KVM based VM: 2 vCPU, 4 GB RAM
1 NIC for ssh/https access
1 NIC for ovirtmgmt network access
Engine source: dreyou repo
_ovirt-node:_
Node: 2.3.0
2 bonded NICs -> frontend network
4 multipath NICs -> SAN connection
Attached some relevant logfiles.
Thanks in advance, I really appreciate your help!
Best,
Sven Knohsalla | System Administration
Office +49 631 68036 433 | Fax +49 631 68036 111 | E-Mail: s.knohsalla@netbiscuits.com | Skype: Netbiscuits.admin
Netbiscuits GmbH | Europaallee 10 | 67657 | GERMANY
_______________________________________________
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users