[Users] ITA-2967 URGENT: ovirt Node turns status to "non operational" STORAGE_DOMAIN_UNREACHABLE
Itamar Heim
iheim at redhat.com
Sun Oct 21 09:05:56 UTC 2012
On 10/19/2012 06:43 PM, Sven Knohsalla wrote:
> Hi Haim,
>
> I wanted to wait to send this mail, until the problem occurs again.
> Disabled live migration for the cluster first, to make sure the second node wouldn't run into the same problem when migration starts.
>
> It seems the problem isn't caused by migration, as I ran into the same error again today.
>
> Log snippet Webgui:
> 2012-Oct-19,04:28:13 "Host deovn-a01 cannot access one of the Storage Domains attached to it, or the Data Center object. Setting Host state to Non-Operational."
>
> --> all VMs are running properly, although the engine reports otherwise.
> Even the VM status in the engine GUI is wrong: it shows "<vmname> reboot in progress", but no reboot was initiated (ssh/rdp connections and file operations are working fine)
>
> Engine log says for this period:
> grep '04:2' /var/log/ovirt-engine/engine.log
> 2012-10-19 04:23:13,773 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-94) domain ccaa4e7a-fa89-46a6-a6e0-07dfe78d1bd5 in problem. vds: deovn-a01
> 2012-10-19 04:28:13,775 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-1) starting ProcessDomainRecovery for domain ccaa4e7a-fa89-46a6-a6e0-07dfe78d1bd5
> 2012-10-19 04:28:13,799 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-1) vds deovn-a01 reported domain ccaa4e7a-fa89-46a6-a6e0-07dfe78d1bd5:DE-VM-SYSTEM as in problem, moving the vds to status NonOperational
> 2012-10-19 04:28:13,882 INFO [org.ovirt.engine.core.bll.SetNonOperationalVdsCommand] (QuartzScheduler_Worker-1) Running command: SetNonOperationalVdsCommand internal: true. Entities affected : ID: 66b546c2-ae62-11e1-b734-5254005cbe44 Type: VDS
> 2012-10-19 04:28:13,884 INFO [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (QuartzScheduler_Worker-1) START, SetVdsStatusVDSCommand(vdsId = 66b546c2-ae62-11e1-b734-5254005cbe44, status=NonOperational, nonOperationalReason=STORAGE_DOMAIN_UNREACHABLE), log id: daad8bd
> 2012-10-19 04:28:13,888 INFO [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (QuartzScheduler_Worker-1) FINISH, SetVdsStatusVDSCommand, log id: daad8bd
> 2012-10-19 04:28:19,690 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-38) domain ccaa4e7a-fa89-46a6-a6e0-07dfe78d1bd5 in problem. vds: deovn-a01
>
> I think the first output is important:
> 2012-10-19 04:23:13,773 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-94) domain ccaa4e7a-fa89-46a6-a6e0-07dfe78d1bd5 in problem. vds: deovn-a01
> --> which problem? There's no debug info during that time period to tell where the problem could come from :/
look at the lines above:
2012-10-19 04:28:13,799 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-1) vds deovn-a01 reported domain ccaa4e7a-fa89-46a6-a6e0-07dfe78d1bd5:DE-VM-SYSTEM as in problem, moving the vds to status NonOperational
2012-10-19 04:28:13,882 INFO [org.ovirt.engine.core.bll.SetNonOperationalVdsCommand] (QuartzScheduler_Worker-1) Running command: SetNonOperationalVdsCommand internal: true. Entities affected : ID: 66b546c2-ae62-11e1-b734-5254005cbe44 Type: VDS
the problem was with the storage domain.
>
> On affected node side I did grep /var/log/vdsm for ERROR:
> Thread-254302::ERROR::2012-10-12 16:01:11,359::vm::950::vm.Vm::(getStats) vmId=`537eea7c-d12c-461f-adfb-6a1f2ebff4fb`::Error fetching vm stats
> And 20 more of the same type with the same vmId; I'm sure this is an aftereffect, as the engine can't tell the status of the VMs.
>
> Can you give me an advice where I can find more information to solve this issue?
> Or perhaps a scenario I can try?
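One place to look for more context is vdsm.log on the affected node: the root cause of a domain-monitor problem is often recorded at WARNING level, which a plain `grep ERROR` misses. A minimal sketch of the filter, run here against a canned sample in the vdsm.log line format (the sample lines are illustrative; on the node you would point the same grep at /var/log/vdsm/vdsm.log):

```shell
# Build a small sample in the vdsm.log record format (illustrative lines
# only), then filter it for ERROR and WARNING records. On a real node,
# run the same grep against /var/log/vdsm/vdsm.log instead.
cat > /tmp/vdsm-sample.log <<'EOF'
Thread-254302::ERROR::2012-10-12 16:01:11,359::vm::950::vm.Vm::(getStats) Error fetching vm stats
Thread-21::WARNING::2012-10-12 16:01:05,101::sp::123::Storage.StoragePool::(check) domain monitor timed out
Thread-22::DEBUG::2012-10-12 16:01:06,000::misc::84::Storage.Misc::(run) routine noise
EOF
# Keep ERROR and WARNING records, drop the DEBUG noise:
grep -E '::(ERROR|WARNING)::' /tmp/vdsm-sample.log
```

This prints the ERROR and WARNING lines and suppresses the DEBUG line, narrowing the log to the records worth reading around the event timestamp.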
>
> I have another curiosity I wanted to ask for in a new mail, but perhaps this has something to do with my issue:
> The elected SPM is not part of this cluster and has only 2 storage paths (multipath) to the SAN.
> The problematic cluster has 4 storage paths (bigger hypervisors), and all storage paths are connected successfully.
>
> Does the SPM detect this difference, or is it unnecessary because the executing command detects the possible paths on its own (which is what I assume)?
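One way to compare the path counts between the SPM and the cluster nodes is to count the active paths that `multipath -ll` reports per LUN. A minimal sketch on a canned sample (the WWID and device names below are hypothetical; on a node you would pipe the live `multipath -ll` output into the same grep):

```shell
# Sample 'multipath -ll' output (hypothetical WWID and sdX devices); on a
# node, replace the heredoc with the live command:  multipath -ll
cat > /tmp/mpath-sample.txt <<'EOF'
mpatha (36000d31000a5b2000000000000000001) dm-0 EQLOGIC,100E-00
size=1.0T features='0' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=1 status=active
  |- 3:0:0:0 sdb 8:16 active ready running
  |- 4:0:0:0 sdc 8:32 active ready running
  |- 5:0:0:0 sdd 8:48 active ready running
  `- 6:0:0:0 sde 8:64 active ready running
EOF
# Each 'active ready running' line is one usable path; this sample has 4.
grep -c 'active ready running' /tmp/mpath-sample.txt
```

Running the count on both the SPM host and a cluster node makes the 2-path vs. 4-path difference visible at a glance.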
>
> Currently in use:
> oVirt-engine 3.0
> oVirt Node 2.3.0
> --> is there any problem mixing node versions, regarding the ovirt-engine version?
>
> Sorry for the amount of questions, I really want to understand the "oVirt-mechanism" completely,
> to build up a fail-safe virtual environment :)
>
> Thanks in advance.
>
> Best,
> Sven.
>
> -----Original Message-----
> From: Haim Ateya [mailto:hateya at redhat.com]
> Sent: Tuesday, 16 October 2012 14:38
> To: Sven Knohsalla
> Cc: users at ovirt.org; Itamar Heim; Omer Frenkel
> Subject: Re: [Users] ITA-2967 URGENT: ovirt Node turns status to "non operational" STORAGE_DOMAIN_UNREACHABLE
>
> Hi Sven,
>
> can you attach full logs from the second host (the problematic one)? I guess it's "deovn-a01".
>
> 2012-10-15 11:13:38,197 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-33) domain ccaa4e7a-fa89-46a6-a6e0-07dfe78d1bd5 in problem. vds: deovn-a01
>
>
> ----- Original Message -----
>> From: "Omer Frenkel" <ofrenkel at redhat.com>
>> To: "Itamar Heim" <iheim at redhat.com>, "Sven Knohsalla" <s.knohsalla at netbiscuits.com>
>> Cc: users at ovirt.org
>> Sent: Tuesday, October 16, 2012 2:02:50 PM
>> Subject: Re: [Users] ITA-2967 URGENT: ovirt Node turns status to "non operational" STORAGE_DOMAIN_UNREACHABLE
>>
>>
>>
>> ----- Original Message -----
>>> From: "Itamar Heim" <iheim at redhat.com>
>>> To: "Sven Knohsalla" <s.knohsalla at netbiscuits.com>
>>> Cc: users at ovirt.org
>>> Sent: Monday, October 15, 2012 8:36:07 PM
>>> Subject: Re: [Users] ITA-2967 URGENT: ovirt Node turns status to
>>> "non operational" STORAGE_DOMAIN_UNREACHABLE
>>>
>>> On 10/15/2012 03:56 PM, Sven Knohsalla wrote:
>>>> Hi,
>>>>
>>>> sometimes one hypervisor's status turns to "Non-operational" with
>>>> error "STORAGE_DOMAIN_UNREACHABLE" and live migration (activated for
>>>> all VMs) starts.
>>>>
>>>> I currently don't know why the ovirt-node turns to this status,
>>>> because the connected iSCSI SAN is available the whole time (checked
>>>> via iscsi session and lsblk); I'm also able to read/write on the SAN
>>>> during that time.
>>>>
>>>> We can simply activate this ovirt-node and it comes up again. The
>>>> migration process then runs from scratch and hits the same error
>>>> --> a reboot of the ovirt-node is necessary!
>>>>
>>>> When a hypervisor turns to "non-operational" status, live migration
>>>> starts and tries to migrate ~25 VMs (~100 GB RAM to migrate).
>>>>
>>>> During that process the network load goes to 100%, some VMs get
>>>> migrated, then the destination host also turns to "non-operational"
>>>> status with error "STORAGE_DOMAIN_UNREACHABLE".
>>>>
>>>> Many VMs are still running on their origin host, some are paused,
>>>> some show "migrating from" status.
>>>>
>>>> After a reboot of the origin host, the VMs of course end up in an
>>>> unknown state.
>>>>
>>>> So the whole cluster is down :/
>>>>
>>>> For this problem I have some questions:
>>>>
>>>> -Does ovirt engine just use the ovirt-mgmt network for
>>>> migration/HA?
>>>
>>> yes.
>>>
>>>>
>>>> -If so, is there any possibility to *add*/switch a network for
>>>> migration/HA?
>>>
>>> you can bond, not yet add another one.
>>>
>>>>
>>>> -Is the kind of way we are using the live-migration not
>>>> recommended?
>>>>
>>>> -Which engine module checks the availability of the storage
>>>> domain
>>>> for
>>>> the ovirt-nodes?
>>>
>>> the engine.
>>>
>>>>
>>>> -Is there any timeout/cache option we can set/increase to avoid
>>>> this
>>>> problem?
>>>
>>> well, not clear what the problem is.
>>> also, vdsm is supposed to throttle live migration to 3 vm's in
>>> parallel
>>> iirc.
>>> also, you can at cluster level configure to not live migrate VMs on
>>> non-operational status.
>>>
>>>>
>>>> -Is there any known problem with the versions we are using?
>>>> (Migration
>>>> to ovirt-engine 3.1 is not possible atm)
>>>
>>> oh, the cluster level migration policy on non operational may be a
>>> 3.1
>>> feature, not sure.
>>>
>>
>> AFAIR, it's in 3.0
>>
>>>>
>>>> -Is it possible to modify the migration queue to just migrate a
>>>> max. of
>>>> 4 VMs at the same time for example?
>>>
>>> yes, there is a vdsm config for that. i am pretty sure 3 is the
>>> default
>>> though?
>>>
>>>>
>>>> _ovirt-engine: _
>>>>
>>>> FC 16: 3.3.6-3.fc16.x86_64
>>>>
>>>> Engine: 3.0.0_0001-1.6.fc16
>>>>
>>>> KVM based VM: 2 vCPU, 4 GB RAM
>>>>
>>>> 1 NIC for ssh/https access
>>>> 1 NIC for ovirtmgmt network access
>>>> engine source: dreyou repo
>>>>
>>>> _ovirt-node:_
>>>> Node: 2.3.0
>>>> 2 bonded NICs -> Frontend Network
>>>> 4 Multipath NICs -> SAN connection
>>>>
>>>> Attached some relevant logfiles.
>>>>
>>>> Thanks in advance, I really appreciate your help!
>>>>
>>>> Best,
>>>>
>>>> Sven Knohsalla |System Administration
>>>>
>>>> Office +49 631 68036 433 | Fax +49 631 68036 111 |
>>>> E-Mail s.knohsalla at netbiscuits.com |
>>>> Skype: Netbiscuits.admin
>>>>
>>>> Netbiscuits GmbH | Europaallee 10 | 67657 | GERMANY
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> Users mailing list
>>>> Users at ovirt.org
>>>> http://lists.ovirt.org/mailman/listinfo/users
>>>>
>>>
>>>
>>