On 10/19/2012 06:43 PM, Sven Knohsalla wrote:
Hi Haim,
I wanted to wait with this mail until the problem occurred again.
I disabled live migration for the cluster first, to make sure the second node wouldn't
run into the same problem when migration is started.
It seems the problem isn't caused by migration, as I ran into the same error again
today.
Log snippet from the web GUI:
2012-Oct-19,04:28:13 "Host deovn-a01 cannot access one of the Storage Domains
attached to it, or the Data Center object. Setting Host state to Non-Operational."
--> All VMs are running properly, although the engine says otherwise.
Even the VM status in the engine GUI is wrong: it shows
"<vmname> reboot in progress", but no reboot was initiated (ssh/rdp
connections and file operations are working fine).
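For what it's worth, the actual VM state can also be checked directly on the node, independent of the engine. This is only a sketch; the vdsClient verb below is what I would expect on a node of this version, so please verify it against your vdsm build:

  # What vdsm itself reports for the VMs on this node
  vdsClient -s 0 list table

  # Read-only view of the running domains straight from libvirt
  virsh -r list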
The engine log for this period says:
cat /var/log/ovirt-engine/engine.log | grep '04:2'
2012-10-19 04:23:13,773 WARN
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-94)
domain ccaa4e7a-fa89-46a6-a6e0-07dfe78d1bd5 in problem. vds: deovn-a01
2012-10-19 04:28:13,775 INFO
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-1)
starting ProcessDomainRecovery for domain ccaa4e7a-fa89-46a6-a6e0-07dfe78d1bd5
2012-10-19 04:28:13,799 WARN
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-1)
vds deovn-a01 reported domain ccaa4e7a-fa89-46a6-a6e0-07dfe78d1bd5:DE-VM-SYSTEM as in
problem, moving the vds to status NonOperational
2012-10-19 04:28:13,882 INFO [org.ovirt.engine.core.bll.SetNonOperationalVdsCommand]
(QuartzScheduler_Worker-1) Running command: SetNonOperationalVdsCommand internal: true.
Entities affected : ID: 66b546c2-ae62-11e1-b734-5254005cbe44 Type: VDS
2012-10-19 04:28:13,884 INFO [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand]
(QuartzScheduler_Worker-1) START, SetVdsStatusVDSCommand(vdsId =
66b546c2-ae62-11e1-b734-5254005cbe44, status=NonOperational,
nonOperationalReason=STORAGE_DOMAIN_UNREACHABLE), log id: daad8bd
2012-10-19 04:28:13,888 INFO [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand]
(QuartzScheduler_Worker-1) FINISH, SetVdsStatusVDSCommand, log id: daad8bd
2012-10-19 04:28:19,690 WARN
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-38)
domain ccaa4e7a-fa89-46a6-a6e0-07dfe78d1bd5 in problem. vds: deovn-a01
I think the first entry is the important one:
2012-10-19 04:23:13,773 WARN
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-94)
domain ccaa4e7a-fa89-46a6-a6e0-07dfe78d1bd5 in problem. vds: deovn-a01
--> Which problem? There's no debug info for that time period to work out where
the problem could come from :/
Look at the lines above:
2012-10-19 04:28:13,799 WARN
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand]
(QuartzScheduler_Worker-1) vds deovn-a01 reported domain
ccaa4e7a-fa89-46a6-a6e0-07dfe78d1bd5:DE-VM-SYSTEM as in problem, moving
the vds to status NonOperational
2012-10-19 04:28:13,882 INFO
[org.ovirt.engine.core.bll.SetNonOperationalVdsCommand]
(QuartzScheduler_Worker-1) Running command: SetNonOperationalVdsCommand
internal: true. Entities affected : ID:
66b546c2-ae62-11e1-b734-5254005cbe44 Type: VDS
the problem was with the storage domain.
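To see what the domain monitoring on the host itself reports, vdsm can be queried directly. Just a sketch; the repoStats verb is what I would expect from vdsm of this vintage, so please check it against your node:

  # Per-domain monitoring state as vdsm sees it (lastCheck delay, valid flag)
  vdsClient -s 0 repoStats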
On the affected node I grepped /var/log/vdsm for ERROR:
Thread-254302::ERROR::2012-10-12 16:01:11,359::vm::950::vm.Vm::(getStats)
vmId=`537eea7c-d12c-461f-adfb-6a1f2ebff4fb`::Error fetching vm stats
There are about 20 more of the same type with the same vmId; I'm sure this is an aftereffect, as the
engine can't tell the status of the VMs.
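A broader sweep of the vdsm log around the incident window might turn up more context than the ERROR lines alone; roughly something like this (log path and timestamp format assumed from the vdsm line above):

  # Everything vdsm logged between 04:20 and 04:30 on the day of the incident
  grep '2012-10-19 04:2' /var/log/vdsm/vdsm.log

  # Messages mentioning the affected storage domain
  grep 'ccaa4e7a-fa89-46a6-a6e0-07dfe78d1bd5' /var/log/vdsm/vdsm.log | grep -iE 'warn|error|timeout'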
Can you give me advice on where I can find more information to solve this issue?
Or perhaps suggest a scenario I can try?
I have another question I wanted to ask in a separate mail, but perhaps it has
something to do with my issue:
The elected SPM is not part of this cluster and only has 2 storage paths (multipath) to the
SAN.
The problematic cluster has 4 storage paths (bigger hypervisors), and all storage paths
are connected successfully.
Does the SPM detect this difference, or is that unnecessary because each host detects
its possible paths on its own (which is what I assume)?
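For reference, the path layout can be compared on the SPM host and on the cluster hosts with the standard iSCSI/multipath tooling (nothing oVirt-specific; the exact output of course depends on the SAN):

  # Active iSCSI sessions on this host
  iscsiadm -m session

  # Multipath topology: number of paths and their state per LUN
  multipath -ll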
Currently in use:
oVirt-engine 3.0
oVirt-node 2.3.0
--> Is there any problem with mixing node versions, with regard to the ovirt-engine version?
Sorry for the number of questions; I really want to understand the
"oVirt mechanisms" completely,
in order to build a fail-safe virtual environment :)
Thanks in advance.
Best,
Sven.
-----Original Message-----
From: Haim Ateya [mailto:hateya@redhat.com]
Sent: Tuesday, October 16, 2012 14:38
To: Sven Knohsalla
Cc: users(a)ovirt.org; Itamar Heim; Omer Frenkel
Subject: Re: [Users] ITA-2967 URGENT: ovirt Node turns status to "non
operational" STORAGE_DOMAIN_UNREACHABLE
Hi Sven,
Can you attach the full logs from the second host (the problematic one)? I guess it's
"deovn-a01".
2012-10-15 11:13:38,197 WARN
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-33)
domain ccaa4e7a-fa89-46a6-a6e0-07dfe78d1bd5 in problem. vds: deovn-a01
----- Original Message -----
> From: "Omer Frenkel" <ofrenkel(a)redhat.com>
> To: "Itamar Heim" <iheim(a)redhat.com>, "Sven Knohsalla"
<s.knohsalla(a)netbiscuits.com>
> Cc: users(a)ovirt.org
> Sent: Tuesday, October 16, 2012 2:02:50 PM
> Subject: Re: [Users] ITA-2967 URGENT: ovirt Node turns status to "non
operational" STORAGE_DOMAIN_UNREACHABLE
>
>
>
> ----- Original Message -----
>> From: "Itamar Heim" <iheim(a)redhat.com>
>> To: "Sven Knohsalla" <s.knohsalla(a)netbiscuits.com>
>> Cc: users(a)ovirt.org
>> Sent: Monday, October 15, 2012 8:36:07 PM
>> Subject: Re: [Users] ITA-2967 URGENT: ovirt Node turns status to
>> "non operational" STORAGE_DOMAIN_UNREACHABLE
>>
>> On 10/15/2012 03:56 PM, Sven Knohsalla wrote:
>>> Hi,
>>>
>>> Sometimes one hypervisor's status turns to "Non-operational" with the
>>> error
>>> "STORAGE_DOMAIN_UNREACHABLE" and live migration (activated
>>> for
>>> all
>>> VMs) starts.
>>>
>>> I don’t currently know why the ovirt-node turns to this status,
>>> because
>>> the connected iSCSI SAN is available all the time (checked via the
>>> iscsi
>>> session and lsblk), and I’m also able to read/write on the SAN during that
>>> time.
>>>
>>> We can simply activate this ovirt-node and it comes up again. The
>>> migration process then runs from scratch and hits the same
>>> error
>>> -> a reboot of the ovirt-node is necessary!
>>>
>>> When a hypervisor turns to “non-operational” status, the live
>>> migration
>>> is starting and tries to migrate ~25 VMs (~ 100 GB RAM to
>>> migrate).
>>>
>>> During that process the network workload goes to 100%, some VMs get
>>> migrated, and then the destination host also turns to
>>> “non-operational”
>>> status with the error “STORAGE_DOMAIN_UNREACHABLE”.
>>>
>>> Many VMs are still running on their origin host, some are
>>> paused,
>>> some
>>> are showing “migration from” status.
>>>
>>> After a reboot of the origin host, the VMs of course turn into an
>>> unknown
>>> state.
>>>
>>> So the whole cluster is down :/
>>>
>>> For this problem I have some questions:
>>>
>>> -Does ovirt engine just use the ovirt-mgmt network for
>>> migration/HA?
>>
>> yes.
>>
>>>
>>> -If so, is there any possibility to *add*/switch a network for
>>> migration/HA?
>>
>> you can bond, not yet add another one.
>>
>>>
>>> -Is the way we are using live migration not
>>> recommended?
>>>
>>> -Which engine module checks the availability of the storage
>>> domain
>>> for
>>> the ovirt-nodes?
>>
>> the engine.
>>
>>>
>>> -Is there any timeout/cache option we can set/increase to avoid
>>> this
>>> problem?
>>
>> well, not clear what the problem is.
>> also, vdsm is supposed to throttle live migration to 3 vm's in
>> parallel
>> iirc.
>> also, you can at cluster level configure to not live migrate VMs on
>> non-operational status.
>>
>>>
>>> -Is there any known problem with the versions we are using?
>>> (Migration
>>> to ovirt-engine 3.1 is not possible atm)
>>
>> oh, the cluster level migration policy on non operational may be a
>> 3.1
>> feature, not sure.
>>
>
> AFAIR, it's in 3.0
>
>>>
>>> -Is it possible to modify the migration queue to just migrate a
>>> max. of
>>> 4 VMs at the same time for example?
>>
>> yes, there is a vdsm config for that. i am pretty sure 3 is the
>> default
>> though?
>>
>>>
>>> _ovirt-engine: _
>>>
>>> FC 16: 3.3.6-3.fc16.x86_64
>>>
>>> Engine: 3.0.0_0001-1.6.fc16
>>>
>>> KVM based VM: 2 vCPU, 4 GB RAM
>>>
>>> 1 NIC for ssh/https access
>>> 1 NIC for ovirtmgmt network access
>>> engine source: dreyou repo
>>>
>>> _ovirt-node:_
>>> Node: 2.3.0
>>> 2 bonded NICs -> Frontend Network
>>> 4 Multipath NICs -> SAN connection
>>>
>>> Attached some relevant logfiles.
>>>
>>> Thanks in advance, I really appreciate your help!
>>>
>>> Best,
>>>
>>> Sven Knohsalla |System Administration
>>>
>>> Office +49 631 68036 433 | Fax +49 631 68036 111
>>> | E-Mail s.knohsalla(a)netbiscuits.com |
>>> Skype: Netbiscuits.admin
>>>
>>> Netbiscuits GmbH | Europaallee 10 | 67657 | GERMANY
>>>
>>>
>>>