[ovirt-users] hosted engine health check issues

Martin Sivak msivak at redhat.com
Wed Apr 23 15:27:09 UTC 2014


Hi Kevin,

> same pb.

Are you missing the lockspace file as well while running on top of GlusterFS?
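
If it helps, here is a rough sketch for checking that quickly. It assumes the file should be exactly 1048576 bytes, as in the healthy listing quoted further down. check_lockspace is just a name I made up, and the code targets a current Python 3, so it may need small changes on the Python 2.6 that CentOS 6 ships.

```python
import os

# Size of hosted-engine.lockspace in the healthy listing quoted below (1 MB).
# Assumption: a correctly initialized lockspace always has this size.
EXPECTED_SIZE = 1048576


def check_lockspace(path, expected_size=EXPECTED_SIZE):
    """Return 'missing', 'wrong-size' or 'ok' for the lockspace file at path."""
    if not os.path.isfile(path):
        return "missing"
    if os.path.getsize(path) != expected_size:
        return "wrong-size"
    return "ok"
```

Call it with the full path to ha_agent/hosted-engine.lockspace under your storage domain mount.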

> ovirt-ha-broker have 400% cpu and is defunct. I can't kill with -9.

Defunct process eating full four cores? I wonder how that is possible. What are the status flags of that process when you run ps axwu?
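
To be clear about which field I mean, here is a small sketch that pulls the STAT column out of one line of ps axwu output. It assumes the usual procps column order (USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND); the function names are made up for illustration and the code targets Python 3.

```python
# Meaning of the first character of the STAT column, per the ps(1) man page.
STATE_NAMES = {
    "R": "running",
    "S": "interruptible sleep",
    "D": "uninterruptible sleep (usually waiting on I/O)",
    "Z": "defunct (zombie)",
    "T": "stopped",
}


def stat_flags(ps_line):
    """Return the STAT field from one line of `ps axwu` output.

    Assumes the column order:
    USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
    """
    # Split into at most 11 fields so the command (which may contain
    # spaces) stays in one piece at the end.
    fields = ps_line.split(None, 10)
    return fields[7]


def describe(ps_line):
    """Return (flags, human-readable meaning of the first flag character)."""
    flags = stat_flags(ps_line)
    return flags, STATE_NAMES.get(flags[0], "other")
```

A leading Z there means defunct, and a leading D means uninterruptible sleep, which on NFS or Gluster usually points at storage I/O that never returned.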

Can you attach the log files please?

--
Martin Sivák
msivak at redhat.com
Red Hat Czech
RHEV-M SLA / Brno, CZ

----- Original Message -----
> same pb. ovirt-ha-broker have 400% cpu and is defunct. I can't kill with -9.
> 
> 
> 2014-04-23 13:55 GMT+02:00 Martin Sivak <msivak at redhat.com>:
> 
> > Hi,
> >
> > > Isn't this file created when hosted engine is started?
> >
> > The file is created by the setup script. If it got lost then there was
> > probably something bad happening in your NFS or Gluster storage.
> >
> > > Or how can I create this file manually?
> >
> > I can give you an experimental treatment for this. We do not have any
> > official way, as this is something that should never happen :)
> >
> > !! But before you do that make sure you do not have any nodes running
> > properly. This will destroy and reinitialize the lockspace database for the
> > whole hosted-engine environment (which you apparently lack, but..). !!
> >
> > You have to create the ha_agent/hosted-engine.lockspace file with the
> > expected size (1MB) and then tell sanlock to initialize it as a lockspace
> > using:
> >
> > # python
> > >>> import sanlock
> > >>> sanlock.write_lockspace(lockspace="hosted-engine",
> > ... path="/rhev/data-center/mnt/<nfs>/<hosted engine storage domain>/ha_agent/hosted-engine.lockspace",
> > ... offset=0)
> > >>>
> >
> > Then try starting the services (both broker and agent) again.
> >
> > --
> > Martin Sivák
> > msivak at redhat.com
> > Red Hat Czech
> > RHEV-M SLA / Brno, CZ
> >
> >
> > ----- Original Message -----
> > > On 04/23/2014 11:08 AM, Martin Sivak wrote:
> > > > Hi René,
> > > >
> > > >>>> libvirtError: Failed to acquire lock: No space left on device
> > > >
> > > >>>> 2014-04-22 12:38:17+0200 654 [3093]: r2 cmd_acquire 2,9,5733 invalid
> > > >>>> lockspace found -1 failed 0 name
> > 2851af27-8744-445d-9fb1-a0d083c8dc82
> > > >
> > > > Can you please check the contents of /rhev/data-center/<your nfs
> > > > mount>/<nfs domain uuid>/ha_agent/?
> > > >
> > > > > This is how it should look:
> > > >
> > > > > [root@dev-03 ~]# ls -al /rhev/data-center/mnt/euryale\:_home_ovirt_he/e16de6a2-53f5-4ab3-95a3-255d08398824/ha_agent/
> > > > total 2036
> > > > drwxr-x---. 2 vdsm kvm    4096 Mar 19 18:46 .
> > > > drwxr-xr-x. 6 vdsm kvm    4096 Mar 19 18:46 ..
> > > > -rw-rw----. 1 vdsm kvm 1048576 Apr 23 11:05 hosted-engine.lockspace
> > > > -rw-rw----. 1 vdsm kvm 1028096 Mar 19 18:46 hosted-engine.metadata
> > > >
> > > > The errors seem to indicate that you somehow lost the lockspace file.
> > >
> > > True :)
> > > Isn't this file created when hosted engine is started? Or how can I
> > > create this file manually?
> > >
> > > >
> > > > --
> > > > Martin Sivák
> > > > msivak at redhat.com
> > > > Red Hat Czech
> > > > RHEV-M SLA / Brno, CZ
> > > >
> > > > ----- Original Message -----
> > > >> On 04/23/2014 12:28 AM, Doron Fediuck wrote:
> > > >>> Hi Rene,
> > > >>> any idea what closed your ovirtmgmt bridge?
> > > >>> as long as it is down vdsm may have issues starting up properly
> > > >>> and this is why you see the complaints on the rpc server.
> > > >>>
> > > >>> Can you try manually fixing the network part first and then
> > > >>> restart vdsm?
> > > >>> Once vdsm is happy hosted engine VM will start.
> > > >>
> > > >> Thanks for your feedback, Doron.
> > > >>
> > > > >> My ovirtmgmt bridge seems to be up, or isn't it?
> > > >> # brctl show ovirtmgmt
> > > >> bridge name        bridge id               STP enabled     interfaces
> > > >> ovirtmgmt          8000.0025907587c2       no              eth0.200
> > > >>
> > > >> # ip a s ovirtmgmt
> > > >> 7: ovirtmgmt: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue
> > > >> state UNKNOWN
> > > >>       link/ether 00:25:90:75:87:c2 brd ff:ff:ff:ff:ff:ff
> > > >>       inet 10.0.200.102/24 brd 10.0.200.255 scope global ovirtmgmt
> > > >>       inet6 fe80::225:90ff:fe75:87c2/64 scope link
> > > >>          valid_lft forever preferred_lft forever
> > > >>
> > > >> # ip a s eth0.200
> > > > >> 6: eth0.200@eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc
> > > >> noqueue state UP
> > > >>       link/ether 00:25:90:75:87:c2 brd ff:ff:ff:ff:ff:ff
> > > >>       inet6 fe80::225:90ff:fe75:87c2/64 scope link
> > > >>          valid_lft forever preferred_lft forever
> > > >>
> > > >> I tried the following yesterday:
> > > >> Copy virtual disk from GlusterFS storage to local disk of host and
> > > >> create a new vm with virt-manager which loads ovirtmgmt disk. I could
> > > >> reach my engine over the ovirtmgmt bridge (so bridge must be working).
> > > >>
> > > >> I also started libvirtd with Option -v and I saw the following in
> > > >> libvirtd.log when trying to start ovirt engine:
> > > >> 2014-04-22 14:18:25.432+0000: 8901: debug : virCommandRunAsync:2250 :
> > > >> Command result 0, with PID 11491
> > > >> 2014-04-22 14:18:25.478+0000: 8901: debug : virCommandRun:2045 :
> > Result
> > > >> exit status 255, stdout: '' stderr: 'iptables v1.4.7: goto 'FO-vnet0'
> > is
> > > >> not a chain
> > > >>
> > > > >> So it could be that something is broken in my hosted-engine network. Do
> > > > >> you have any clue how I can troubleshoot this?
> > > >>
> > > >>
> > > >> Thanks,
> > > >> René
> > > >>
> > > >>
> > > >>>
> > > >>> ----- Original Message -----
> > > >>>> From: "René Koch" <rkoch at linuxland.at>
> > > >>>> To: "Martin Sivak" <msivak at redhat.com>
> > > >>>> Cc: users at ovirt.org
> > > >>>> Sent: Tuesday, April 22, 2014 1:46:38 PM
> > > >>>> Subject: Re: [ovirt-users] hosted engine health check issues
> > > >>>>
> > > >>>> Hi,
> > > >>>>
> > > >>>> I rebooted one of my ovirt hosts today and the result is now that I
> > > >>>> can't start hosted-engine anymore.
> > > >>>>
> > > >>>> ovirt-ha-agent isn't running because the lockspace file is missing
> > > >>>> (sanlock complains about it).
> > > >>>> So I tried to start hosted-engine with --vm-start and I get the
> > > >>>> following errors:
> > > >>>>
> > > >>>> ==> /var/log/sanlock.log <==
> > > >>>> 2014-04-22 12:38:17+0200 654 [3093]: r2 cmd_acquire 2,9,5733 invalid
> > > >>>> lockspace found -1 failed 0 name
> > 2851af27-8744-445d-9fb1-a0d083c8dc82
> > > >>>>
> > > >>>> ==> /var/log/messages <==
> > > > >>>> Apr 22 12:38:17 ovirt-host02 sanlock[3079]: 2014-04-22 12:38:17+0200 654
> > > > >>>> [3093]: r2 cmd_acquire 2,9,5733 invalid lockspace found -1 failed 0 name
> > > > >>>> 2851af27-8744-445d-9fb1-a0d083c8dc82
> > > > >>>> Apr 22 12:38:17 ovirt-host02 kernel: ovirtmgmt: port 2(vnet0) entering
> > > > >>>> disabled state
> > > > >>>> Apr 22 12:38:17 ovirt-host02 kernel: device vnet0 left promiscuous mode
> > > > >>>> Apr 22 12:38:17 ovirt-host02 kernel: ovirtmgmt: port 2(vnet0) entering
> > > > >>>> disabled state
> > > >>>>
> > > >>>> ==> /var/log/vdsm/vdsm.log <==
> > > >>>> Thread-21::DEBUG::2014-04-22
> > > >>>> 12:38:17,563::libvirtconnection::124::root::(wrapper) Unknown
> > > >>>> libvirterror: ecode: 38 edom: 42 level: 2 message: Failed to acquire
> > > >>>> lock: No space left on device
> > > >>>> Thread-21::DEBUG::2014-04-22
> > > >>>> 12:38:17,563::vm::2263::vm.Vm::(_startUnderlyingVm)
> > > > >>>> vmId=`f26dd37e-13b5-430c-b2f2-ecd098b82a91`::_ongoingCreations released
> > > > >>>> Thread-21::ERROR::2014-04-22
> > > > >>>> 12:38:17,564::vm::2289::vm.Vm::(_startUnderlyingVm)
> > > > >>>> vmId=`f26dd37e-13b5-430c-b2f2-ecd098b82a91`::The vm start process failed
> > > >>>> Traceback (most recent call last):
> > > >>>>      File "/usr/share/vdsm/vm.py", line 2249, in _startUnderlyingVm
> > > >>>>        self._run()
> > > >>>>      File "/usr/share/vdsm/vm.py", line 3170, in _run
> > > >>>>        self._connection.createXML(domxml, flags),
> > > > >>>>      File "/usr/lib64/python2.6/site-packages/vdsm/libvirtconnection.py", line 92, in wrapper
> > > > >>>>        ret = f(*args, **kwargs)
> > > > >>>>      File "/usr/lib64/python2.6/site-packages/libvirt.py", line 2665, in createXML
> > > > >>>>        if ret is None:raise libvirtError('virDomainCreateXML() failed', conn=self)
> > > >>>> libvirtError: Failed to acquire lock: No space left on device
> > > >>>>
> > > >>>> ==> /var/log/messages <==
> > > >>>> Apr 22 12:38:17 ovirt-host02 vdsm vm.Vm ERROR
> > > >>>> vmId=`f26dd37e-13b5-430c-b2f2-ecd098b82a91`::The vm start process
> > > >>>> failed#012Traceback (most recent call last):#012  File
> > > >>>> "/usr/share/vdsm/vm.py", line 2249, in _startUnderlyingVm#012
> > > > >>>> self._run()#012  File "/usr/share/vdsm/vm.py", line 3170, in _run#012
> > > > >>>>     self._connection.createXML(domxml, flags),#012  File
> > > > >>>> "/usr/lib64/python2.6/site-packages/vdsm/libvirtconnection.py", line 92,
> > > > >>>> in wrapper#012    ret = f(*args, **kwargs)#012  File
> > > > >>>> "/usr/lib64/python2.6/site-packages/libvirt.py", line 2665, in
> > > > >>>> createXML#012    if ret is None:raise libvirtError('virDomainCreateXML()
> > > > >>>> failed', conn=self)#012libvirtError: Failed to acquire lock: No space
> > > > >>>> left on device
> > > >>>>
> > > >>>> ==> /var/log/vdsm/vdsm.log <==
> > > >>>> Thread-21::DEBUG::2014-04-22
> > > >>>> 12:38:17,569::vm::2731::vm.Vm::(setDownStatus)
> > > >>>> vmId=`f26dd37e-13b5-430c-b2f2-ecd098b82a91`::Changed state to Down:
> > > >>>> Failed to acquire lock: No space left on device
> > > >>>>
> > > >>>>
> > > > >>>> No space left on device is nonsense as there is enough space (I had this
> > > > >>>> issue last time as well, where I had to patch machine.py, but this file
> > > > >>>> is now Python 2.6.6 compatible).
> > > >>>>
> > > >>>> Any idea what prevents hosted-engine from starting?
> > > >>>> ovirt-ha-broker, vdsmd and sanlock are running btw.
> > > >>>>
> > > >>>> Btw, I can see in log that json rpc server module is missing - which
> > > >>>> package is required for CentOS 6.5?
> > > > >>>> Apr 22 12:37:14 ovirt-host02 vdsm vds WARNING Unable to load the json
> > > > >>>> rpc server module. Please make sure it is installed.
> > > >>>>
> > > >>>>
> > > >>>> Thanks,
> > > >>>> René
> > > >>>>
> > > >>>>
> > > >>>>
> > > >>>> On 04/17/2014 10:02 AM, Martin Sivak wrote:
> > > >>>>> Hi,
> > > >>>>>
> > > >>>>>>>> How can I disable notifications?
> > > >>>>>
> > > > >>>>> The notification is configured in
> > > > >>>>> /etc/ovirt-hosted-engine-ha/broker.conf, in the notification section.
> > > > >>>>> The email is sent when the key state_transition exists and the string
> > > > >>>>> OldState-NewState contains a match for the (case-insensitive) regexp
> > > > >>>>> from the value.
> > > >>>>>
> > > > >>>>>>>> Is it intended to send out these messages and detect that ovirt
> > > > >>>>>>>> engine is down (which is false anyway), but not to restart the vm?
> > > >>>>>
> > > > >>>>> Forget about emails for now and check the
> > > > >>>>> /var/log/ovirt-hosted-engine-ha/agent.log and broker.log (and attach
> > > > >>>>> them as well btw).
> > > >>>>>
> > > > >>>>>>>> oVirt hosts think that hosted engine is down because it seems that
> > > > >>>>>>>> hosts can't write to hosted-engine.lockspace due to glusterfs issues
> > > > >>>>>>>> (or at least I think so).
> > > >>>>>
> > > > >>>>> The hosts think so or can't really write there? The lockspace is
> > > > >>>>> managed by sanlock and our HA daemons do not touch it at all. We only
> > > > >>>>> ask sanlock to make sure we have a unique server id.
> > > >>>>>
> > > > >>>>>>>> Is it possible or planned to make the whole ha feature optional?
> > > >>>>>
> > > > >>>>> Well the system won't perform any automatic actions if you put the
> > > > >>>>> hosted engine into global maintenance and only start/stop/migrate the
> > > > >>>>> VM manually. I would discourage you from stopping agent/broker, because
> > > > >>>>> the engine itself has some logic based on the reporting.
> > > >>>>>
> > > >>>>> Regards
> > > >>>>>
> > > >>>>> --
> > > >>>>> Martin Sivák
> > > >>>>> msivak at redhat.com
> > > >>>>> Red Hat Czech
> > > >>>>> RHEV-M SLA / Brno, CZ
> > > >>>>>
> > > >>>>> ----- Original Message -----
> > > >>>>>> On 04/15/2014 04:53 PM, Jiri Moskovcak wrote:
> > > >>>>>>> On 04/14/2014 10:50 AM, René Koch wrote:
> > > >>>>>>>> Hi,
> > > >>>>>>>>
> > > >>>>>>>> I have some issues with hosted engine status.
> > > >>>>>>>>
> > > > >>>>>>>> oVirt hosts think that hosted engine is down because it seems that
> > > > >>>>>>>> hosts can't write to hosted-engine.lockspace due to glusterfs issues
> > > > >>>>>>>> (or at least I think so).
> > > >>>>>>>>
> > > >>>>>>>> Here's the output of vm-status:
> > > >>>>>>>>
> > > >>>>>>>> # hosted-engine --vm-status
> > > >>>>>>>>
> > > >>>>>>>>
> > > >>>>>>>> --== Host 1 status ==--
> > > >>>>>>>>
> > > >>>>>>>> Status up-to-date                  : False
> > > >>>>>>>> Hostname                           : 10.0.200.102
> > > >>>>>>>> Host ID                            : 1
> > > >>>>>>>> Engine status                      : unknown stale-data
> > > >>>>>>>> Score                              : 2400
> > > >>>>>>>> Local maintenance                  : False
> > > >>>>>>>> Host timestamp                     : 1397035677
> > > >>>>>>>> Extra metadata (valid at timestamp):
> > > >>>>>>>>         metadata_parse_version=1
> > > >>>>>>>>         metadata_feature_version=1
> > > >>>>>>>>         timestamp=1397035677 (Wed Apr  9 11:27:57 2014)
> > > >>>>>>>>         host-id=1
> > > >>>>>>>>         score=2400
> > > >>>>>>>>         maintenance=False
> > > >>>>>>>>         state=EngineUp
> > > >>>>>>>>
> > > >>>>>>>>
> > > >>>>>>>> --== Host 2 status ==--
> > > >>>>>>>>
> > > >>>>>>>> Status up-to-date                  : True
> > > >>>>>>>> Hostname                           : 10.0.200.101
> > > >>>>>>>> Host ID                            : 2
> > > > >>>>>>>> Engine status                      : {'reason': 'vm not running on
> > > > >>>>>>>> this host', 'health': 'bad', 'vm': 'down', 'detail': 'unknown'}
> > > >>>>>>>> Score                              : 0
> > > >>>>>>>> Local maintenance                  : False
> > > >>>>>>>> Host timestamp                     : 1397464031
> > > >>>>>>>> Extra metadata (valid at timestamp):
> > > >>>>>>>>         metadata_parse_version=1
> > > >>>>>>>>         metadata_feature_version=1
> > > >>>>>>>>         timestamp=1397464031 (Mon Apr 14 10:27:11 2014)
> > > >>>>>>>>         host-id=2
> > > >>>>>>>>         score=0
> > > >>>>>>>>         maintenance=False
> > > >>>>>>>>         state=EngineUnexpectedlyDown
> > > >>>>>>>>         timeout=Mon Apr 14 10:35:05 2014
> > > >>>>>>>>
> > > > >>>>>>>> oVirt engine is sending me 2 emails every 10 minutes with the
> > > > >>>>>>>> following subjects:
> > > >>>>>>>> - ovirt-hosted-engine state transition EngineDown-EngineStart
> > > >>>>>>>> - ovirt-hosted-engine state transition EngineStart-EngineUp
> > > >>>>>>>>
> > > >>>>>>>> In oVirt webadmin I can see the following message:
> > > > >>>>>>>> VM HostedEngine is down. Exit message: internal error Failed to
> > > > >>>>>>>> acquire lock: error -243.
> > > >>>>>>>>
> > > > >>>>>>>> These messages are really annoying as oVirt isn't doing anything
> > > > >>>>>>>> with hosted engine - I have an uptime of 9 days in my engine vm.
> > > >>>>>>>>
> > > >>>>>>>> So my questions are now:
> > > > >>>>>>>> Is it intended to send out these messages and detect that ovirt
> > > > >>>>>>>> engine is down (which is false anyway), but not to restart the vm?
> > > >>>>>>>>
> > > > >>>>>>>> How can I disable notifications? I'm planning to write a Nagios
> > > > >>>>>>>> plugin which parses the output of hosted-engine --vm-status and only
> > > > >>>>>>>> Nagios should notify me, not hosted-engine script.
> > > >>>>>>>>
> > > > >>>>>>>> Is it possible or planned to make the whole ha feature optional? I
> > > > >>>>>>>> really really really hate cluster software as it causes more
> > > > >>>>>>>> trouble than standalone machines, and in my case the hosted-engine
> > > > >>>>>>>> ha feature really causes trouble (and I haven't had a hardware or
> > > > >>>>>>>> network outage yet, only issues with hosted-engine ha agent). I
> > > > >>>>>>>> don't need any ha feature for hosted engine. I just want to run
> > > > >>>>>>>> engine virtualized on oVirt and if engine vm fails (e.g. because of
> > > > >>>>>>>> issues with a host) I'll restart it on another node.
> > > >>>>>>>
> > > >>>>>>> Hi, you can:
> > > > >>>>>>> 1. edit /etc/ovirt-hosted-engine-ha/{agent,broker}-log.conf and tweak
> > > > >>>>>>> the logger as you like
> > > >>>>>>> 2. or kill ovirt-ha-broker & ovirt-ha-agent services
> > > >>>>>>
> > > >>>>>> Thanks for the information.
> > > > >>>>>> So engine is able to run when ovirt-ha-broker and ovirt-ha-agent
> > > > >>>>>> aren't running?
> > > >>>>>>
> > > >>>>>>
> > > >>>>>> Regards,
> > > >>>>>> René
> > > >>>>>>
> > > >>>>>>>
> > > >>>>>>> --Jirka
> > > >>>>>>>>
> > > >>>>>>>> Thanks,
> > > >>>>>>>> René
> > > >>>>>>>>
> > > >>>>>>>>
> > > >>>>>>>
> > > >>>>>> _______________________________________________
> > > >>>>>> Users mailing list
> > > >>>>>> Users at ovirt.org
> > > >>>>>> http://lists.ovirt.org/mailman/listinfo/users
> > > >>>>>>
> > > >>>>
> > > >>
> > >
> >
> 