[Users] ITA-2967 URGENT: ovirt Node turns status to "non operational" STORAGE_DOMAIN_UNREACHABLE

Haim Ateya hateya at redhat.com
Sun Oct 21 11:22:56 UTC 2012



----- Original Message -----
> From: "Itamar Heim" <iheim at redhat.com>
> To: "Sven Knohsalla" <s.knohsalla at netbiscuits.com>
> Cc: "Haim Ateya" <hateya at redhat.com>, users at ovirt.org, "Omer Frenkel" <ofrenkel at redhat.com>
> Sent: Sunday, October 21, 2012 11:05:56 AM
> Subject: Re: AW: [Users] ITA-2967 URGENT: ovirt Node turns status to "non operational" STORAGE_DOMAIN_UNREACHABLE
> 
> On 10/19/2012 06:43 PM, Sven Knohsalla wrote:
> > Hi Haim,
> >
> > I wanted to wait to send this mail until the problem occurred again.
> > I disabled live migration for the cluster first, to make sure the
> > second node wouldn't run into the same problem when migration is
> > started.
> >
> > It seems the problem isn't caused by migration, as I ran into the
> > same error again today.
> >
> > Log snippet Webgui:
> > 2012-Oct-19,04:28:13 "Host deovn-a01 cannot access one of the
> > Storage Domains attached to it, or the Data Center object. Setting
> > Host state to Non-Operational."
> >
> > --> all VMs are running properly, although the engine reports
> > something different.
> >         Even the VM status in the engine GUI is wrong, as it's showing
> >         "<vmname> reboot in progress", but no reboot has been
> >         initiated (ssh/rdp connections and file operations are
> >         working fine)
> >
> > Engine log says for this period:
> > cat /var/log/ovirt-engine/engine.log | grep 04:2*
> > 2012-10-19 04:23:13,773 WARN
> >  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand]
> > (QuartzScheduler_Worker-94) domain
> > ccaa4e7a-fa89-46a6-a6e0-07dfe78d1bd5 in problem. vds: deovn-a01
> > 2012-10-19 04:28:13,775 INFO
> >  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand]
> > (QuartzScheduler_Worker-1) starting ProcessDomainRecovery for
> > domain ccaa4e7a-fa89-46a6-a6e0-07dfe78d1bd5
> > 2012-10-19 04:28:13,799 WARN
> >  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand]
> > (QuartzScheduler_Worker-1) vds deovn-a01 reported domain
> > ccaa4e7a-fa89-46a6-a6e0-07dfe78d1bd5:DE-VM-SYSTEM as in problem,
> > moving the vds to status NonOperational
> > 2012-10-19 04:28:13,882 INFO
> >  [org.ovirt.engine.core.bll.SetNonOperationalVdsCommand]
> > (QuartzScheduler_Worker-1) Running command:
> > SetNonOperationalVdsCommand internal: true. Entities affected :
> >  ID: 66b546c2-ae62-11e1-b734-5254005cbe44 Type: VDS
> > 2012-10-19 04:28:13,884 INFO
> >  [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand]
> > (QuartzScheduler_Worker-1) START, SetVdsStatusVDSCommand(vdsId =
> > 66b546c2-ae62-11e1-b734-5254005cbe44, status=NonOperational,
> > nonOperationalReason=STORAGE_DOMAIN_UNREACHABLE), log id: daad8bd
> > 2012-10-19 04:28:13,888 INFO
> >  [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand]
> > (QuartzScheduler_Worker-1) FINISH, SetVdsStatusVDSCommand, log id:
> > daad8bd
> > 2012-10-19 04:28:19,690 WARN
> >  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand]
> > (QuartzScheduler_Worker-38) domain
> > ccaa4e7a-fa89-46a6-a6e0-07dfe78d1bd5 in problem. vds: deovn-a01
> >
> > I think the first output is important:
> > 2012-10-19 04:23:13,773 WARN
> >  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand]
> > (QuartzScheduler_Worker-94) domain
> > ccaa4e7a-fa89-46a6-a6e0-07dfe78d1bd5 in problem. vds: deovn-a01
> > --> which problem? There's no debug info during that time period to
> > consider where the problem could come from :/
> 
> look at the lines above:
>   2012-10-19 04:28:13,799 WARN
> [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand]
> (QuartzScheduler_Worker-1) vds deovn-a01 reported domain
> ccaa4e7a-fa89-46a6-a6e0-07dfe78d1bd5:DE-VM-SYSTEM as in problem,
> moving
> the vds to status NonOperational
>   2012-10-19 04:28:13,882 INFO
> [org.ovirt.engine.core.bll.SetNonOperationalVdsCommand]
> (QuartzScheduler_Worker-1) Running command:
> SetNonOperationalVdsCommand
> internal: true. Entities affected :  ID:
> 66b546c2-ae62-11e1-b734-5254005cbe44 Type: VDS
> 
> the problem was with the storage domain.
> 
> 
> >
> > On the affected node's side I grepped /var/log/vdsm for ERROR:
> > Thread-254302::ERROR::2012-10-12
> > 16:01:11,359::vm::950::vm.Vm::(getStats)
> > vmId=`537eea7c-d12c-461f-adfb-6a1f2ebff4fb`::Error fetching vm
> > stats
> > And 20 more of the same type with the same vmId; I'm sure this is an
> > aftereffect, as the engine can't tell the status of the VMs.
> >
> > Can you give me advice on where I can find more information to
> > solve this issue?
> > Or perhaps suggest a scenario I can try?
What's the status of the VMs right now? Can you please provide the output of the following commands:

virsh -r list
vdsClient -s 0 list table

please attach full engine, vdsm and libvirt logs (and if possible, the qemu log files under /var/log/libvirt/qemu/).
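
For reference, a minimal way to collect everything in one go could look like this (the paths are the usual defaults and are an assumption; the libvirt log may live at /var/log/libvirtd.log or /var/log/libvirt/libvirtd.log depending on how vdsm configured it):

  # on the problematic node (deovn-a01)
  tar czf deovn-a01-logs.tar.gz \
      /var/log/vdsm/vdsm.log* \
      /var/log/libvirtd.log* \
      /var/log/libvirt/qemu/*.log

  # on the engine machine
  tar czf engine-logs.tar.gz /var/log/ovirt-engine/engine.log*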
> >
> > I have another question I wanted to ask in a separate mail, but
> > perhaps it has something to do with my issue:
> > The elected SPM is not part of this cluster and only has 2 storage
> > paths (multipath) to the SAN.
> > The problematic cluster has 4 storage paths (bigger hypervisors),
> > and all storage paths are connected successfully.
I would like to see repoStats reports within the node logs (vdsm.log).
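Something along these lines on the node should surface the recent repoStats entries (assuming the default vdsm log location):

  grep -i repostats /var/log/vdsm/vdsm.log | tail -n 40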
> >
> > Does the SPM detect this difference, or is it unnecessary, as the
> > executing command detects the possible paths on its own (which is
> > what I assume)?
it's not related; a cluster is just a logical entity in that sense (a kind of hierarchy on the engine side). In our topology,
within a given data-center all hosts see all storage domains; they can operate in different clusters, while one of the hosts has the role of SPM.
As for your multipath question, we use the device-mapper-multipath component, which operates according to the configured policy (active:active, round-robin), and yes, it detects the available paths on its own.
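For illustration only (the exact stanza depends on your SAN vendor and is an assumption, not taken from your hosts), the per-path state can be inspected on the node, and a round-robin active/active layout typically corresponds to something like:

  # show every multipath device and the state of each path
  multipath -ll

  # illustrative /etc/multipath.conf fragment
  defaults {
      path_grouping_policy    multibus
      path_selector           "round-robin 0"
      failback                immediate
  }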

> >
> > Currently in use:
> > oVirt-engine 3.0
> > oVirt Node 2.3.0
> > --> is there any problem mixing node versions, with regard to the
> > ovirt-engine version?
> >
> > Sorry for the number of questions; I really want to understand the
> > "oVirt mechanism" completely,
> > to build up a fail-safe virtual environment :)
> >
> > Thanks in advance.
> >
> > Best,
> > Sven.
> >
> > -----Original Message-----
> > From: Haim Ateya [mailto:hateya at redhat.com]
> > Sent: Tuesday, October 16, 2012 14:38
> > To: Sven Knohsalla
> > Cc: users at ovirt.org; Itamar Heim; Omer Frenkel
> > Subject: Re: [Users] ITA-2967 URGENT: ovirt Node turns status to
> > "non operational" STORAGE_DOMAIN_UNREACHABLE
> >
> > Hi Sven,
> >
> > can you attach full logs from the second host (the problematic one)?
> > I guess it's "deovn-a01".
> >
> > 2012-10-15 11:13:38,197 WARN
> >  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand]
> > (QuartzScheduler_Worker-33) domain
> > ccaa4e7a-fa89-46a6-a6e0-07dfe78d1bd5 in problem. vds: deovn-a01
> >
> >
> > ----- Original Message -----
> >> From: "Omer Frenkel" <ofrenkel at redhat.com>
> >> To: "Itamar Heim" <iheim at redhat.com>, "Sven Knohsalla"
> >> <s.knohsalla at netbiscuits.com>
> >> Cc: users at ovirt.org
> >> Sent: Tuesday, October 16, 2012 2:02:50 PM
> >> Subject: Re: [Users] ITA-2967 URGENT: ovirt Node turns status to
> >> "non operational" STORAGE_DOMAIN_UNREACHABLE
> >>
> >>
> >>
> >> ----- Original Message -----
> >>> From: "Itamar Heim" <iheim at redhat.com>
> >>> To: "Sven Knohsalla" <s.knohsalla at netbiscuits.com>
> >>> Cc: users at ovirt.org
> >>> Sent: Monday, October 15, 2012 8:36:07 PM
> >>> Subject: Re: [Users] ITA-2967 URGENT: ovirt Node turns status to
> >>> "non operational" STORAGE_DOMAIN_UNREACHABLE
> >>>
> >>> On 10/15/2012 03:56 PM, Sven Knohsalla wrote:
> >>>> Hi,
> >>>>
> >>>> sometimes one hypervisor's status turns to "Non-operational" with
> >>>> error "STORAGE_DOMAIN_UNREACHABLE" and the live migration
> >>>> (activated for all VMs) starts.
> >>>>
> >>>> I don't currently know why the ovirt-node turns to this status,
> >>>> because the connected iSCSI SAN is available all the time
> >>>> (checked via iscsi session and lsblk); I'm also able to r/w on
> >>>> the SAN during that time.
> >>>>
> >>>> We can simply activate this ovirt-node and it comes up again.
> >>>> The migration process then runs from scratch and hits the same
> >>>> error -> a reboot of the ovirt-node is necessary!
> >>>>
> >>>> When a hypervisor turns to "non-operational" status, the live
> >>>> migration starts and tries to migrate ~25 VMs (~100 GB of RAM to
> >>>> migrate).
> >>>>
> >>>> During that process the network load goes to 100%, some VMs get
> >>>> migrated, then the destination host also turns to
> >>>> "non-operational" status with error "STORAGE_DOMAIN_UNREACHABLE".
> >>>>
> >>>> Many VMs are still running on their origin host, some are
> >>>> paused, some are showing "migration from" status.
> >>>>
> >>>> After a reboot of the origin host, the VMs of course turn into
> >>>> an unknown state.
> >>>>
> >>>> So the whole cluster is down :/
> >>>>
> >>>> For this problem I have some questions:
> >>>>
> >>>> -Does the ovirt engine only use the ovirtmgmt network for
> >>>> migration/HA?
> >>>
> >>> yes.
> >>>
> >>>>
> >>>> -If so, is there any possibility to *add*/switch a network for
> >>>> migration/HA?
> >>>
> >>> you can bond, not yet add another one.
> >>>
> >>>>
> >>>> -Is the way we are using live migration not recommended?
> >>>>
> >>>> -Which engine module checks the availability of the storage
> >>>> domain
> >>>> for
> >>>> the ovirt-nodes?
> >>>
> >>> the engine.
> >>>
> >>>>
> >>>> -Is there any timeout/cache option we can set/increase to avoid
> >>>> this
> >>>> problem?
> >>>
> >>> well, it's not clear what the problem is.
> >>> also, vdsm is supposed to throttle live migration to 3 VMs in
> >>> parallel iirc.
> >>> also, at the cluster level you can configure not to live migrate
> >>> VMs on non-operational status.
> >>>
> >>>>
> >>>> -Is there any known problem with the versions we are using?
> >>>> (Migration
> >>>> to ovirt-engine 3.1 is not possible atm)
> >>>
> >>> oh, the cluster level migration policy on non operational may be
> >>> a
> >>> 3.1
> >>> feature, not sure.
> >>>
> >>
> >> AFAIR, it's in 3.0
> >>
> >>>>
> >>>> -Is it possible to modify the migration queue to just migrate a
> >>>> max. of
> >>>> 4 VMs at the same time for example?
> >>>
> >>> yes, there is a vdsm config for that. I am pretty sure 3 is the
> >>> default though?
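As a sketch only (the option name and its default may differ between vdsm versions, so treat this as an assumption and verify on your node), the cap usually lives in the vdsm configuration:

  # /etc/vdsm/vdsm.conf (illustrative)
  [vars]
  # maximum number of concurrent outgoing live migrations per host
  max_outgoing_migrations = 3

and restart vdsm (service vdsmd restart) afterwards for the change to take effect.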
> >>>
> >>>>
> >>>> _ovirt-engine: _
> >>>>
> >>>> FC 16:  3.3.6-3.fc16.x86_64
> >>>>
> >>>> Engine: 3.0.0_0001-1.6.fc16
> >>>>
> >>>> KVM based VM: 2 vCPU, 4 GB RAM
> >>>>
> >>>> 1 NIC for ssh/https access
> >>>> 1 NIC for ovirtmgmt network access
> >>>> engine source: dreyou repo
> >>>>
> >>>> _ovirt-node:_
> >>>> Node: 2.3.0
> >>>> 2 bonded NICs -> Frontend Network
> >>>> 4 Multipath NICs -> SAN connection
> >>>>
> >>>> Attached some relevant logfiles.
> >>>>
> >>>> Thanks in advance, I really appreciate your help!
> >>>>
> >>>> Best,
> >>>>
> >>>> Sven Knohsalla |System Administration
> >>>>
> >>>> Office +49 631 68036 433 | Fax +49 631 68036 111
> >>>> | E-Mail s.knohsalla at netbiscuits.com |
> >>>> Skype: Netbiscuits.admin
> >>>>
> >>>> Netbiscuits GmbH | Europaallee 10 | 67657 | GERMANY
> >>>>
> >>>>
> >>>>


