[ovirt-users] hosted engine health check issues
Kevin Tibi
kevintibi at hotmail.com
Thu Apr 24 10:57:19 UTC 2014
OK, I mounted the hosted engine storage domain manually and the agent came up.
But hosted-engine --vm-status still shows:
--== Host 2 status ==--
Status up-to-date : False
Hostname : 192.168.99.103
Host ID : 2
Engine status : unknown stale-data
Score : 0
Local maintenance : False
Host timestamp : 1398333438
And in my engine, HA on host02 is not active.
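
For monitoring, here is a minimal sketch of how the vm-status text above could be parsed and checked for stale hosts. It assumes Python 3 and the "Key : value" layout shown in this thread; parse_vm_status and stale_hosts are hypothetical helper names, not part of the hosted-engine tooling.

```python
import re

HOST_HEADER = re.compile(r"\s*--== Host (\d+) status ==--")
# Keys such as "Status up-to-date" contain only letters, spaces and hyphens,
# so the "Extra metadata" lines (key=value) are skipped by this pattern.
FIELD = re.compile(r"\s*([A-Za-z][A-Za-z -]*?)\s*:\s*(.*)")


def parse_vm_status(text):
    """Parse vm-status text into {host_number: {field_name: value}}."""
    hosts = {}
    current = None
    for line in text.splitlines():
        header = HOST_HEADER.match(line)
        if header:
            current = hosts.setdefault(int(header.group(1)), {})
            continue
        field = FIELD.match(line)
        if current is not None and field:
            # Keep the first occurrence of each key within a host section.
            current.setdefault(field.group(1), field.group(2).strip())
    return hosts


def stale_hosts(hosts):
    """Return the host numbers whose status is not reported as up to date."""
    return sorted(n for n, fields in hosts.items()
                  if fields.get("Status up-to-date") != "True")
```

In practice the text would come from running hosted-engine --vm-status on a host; any host whose "Status up-to-date" field is not True is reported as stale.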
2014-04-24 12:48 GMT+02:00 Kevin Tibi <kevintibi at hotmail.com>:
> Hi,
>
> I tried rebooting my hosts and now [supervdsmServer] is <defunct>.
>
> /var/log/vdsm/supervdsm.log
>
>
> MainProcess|Thread-120::DEBUG::2014-04-24
> 12:22:19,955::supervdsmServer::103::SuperVdsm.ServerCallback::(wrapper)
> return validateAccess with None
> MainProcess|Thread-120::DEBUG::2014-04-24
> 12:22:20,010::supervdsmServer::96::SuperVdsm.ServerCallback::(wrapper) call
> validateAccess with ('qemu', ('qemu', 'kvm'),
> '/rhev/data-center/mnt/host01.ovirt.lan:_home_export', 5) {}
> MainProcess|Thread-120::DEBUG::2014-04-24
> 12:22:20,014::supervdsmServer::103::SuperVdsm.ServerCallback::(wrapper)
> return validateAccess with None
> MainProcess|Thread-120::DEBUG::2014-04-24
> 12:22:20,059::supervdsmServer::96::SuperVdsm.ServerCallback::(wrapper) call
> validateAccess with ('qemu', ('qemu', 'kvm'),
> '/rhev/data-center/mnt/host01.ovirt.lan:_home_iso', 5) {}
> MainProcess|Thread-120::DEBUG::2014-04-24
> 12:22:20,063::supervdsmServer::103::SuperVdsm.ServerCallback::(wrapper)
> return validateAccess with None
>
> and one host doesn't mount the NFS share used for hosted engine.
>
> MainThread::CRITICAL::2014-04-24
> 12:36:16,603::agent::103::ovirt_hosted_engine_ha.agent.agent.Agent::(run)
> Could not start ha-agent
> Traceback (most recent call last):
> File
> "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/agent/agent.py",
> line 97, in run
> self._run_agent()
> File
> "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/agent/agent.py",
> line 154, in _run_agent
> hosted_engine.HostedEngine(self.shutdown_requested).start_monitoring()
> File
> "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py",
> line 299, in start_monitoring
> self._initialize_vdsm()
> File
> "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py",
> line 418, in _initialize_vdsm
> self._sd_path = env_path.get_domain_path(self._config)
> File
> "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/env/path.py", line
> 40, in get_domain_path
> .format(sd_uuid, parent))
> Exception: path to storage domain aea040f8-ab9d-435b-9ecf-ddd4272e592f not
> found in /rhev/data-center/mnt
>
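
The exception above says that no directory named after the storage domain UUID was found below /rhev/data-center/mnt. Here is a rough sketch of that kind of lookup, useful for checking by hand whether the NFS mount is really there. It is only an illustration, assuming Python 3 and a single mount directory level as in the paths in this thread; find_domain_path is a hypothetical name and this is not the code from ovirt_hosted_engine_ha/env/path.py.

```python
import os


def find_domain_path(sd_uuid, parent="/rhev/data-center/mnt"):
    """Return <parent>/<mount>/<sd_uuid> if such a directory exists, else None."""
    if not os.path.isdir(parent):
        return None
    for mount in sorted(os.listdir(parent)):
        candidate = os.path.join(parent, mount, sd_uuid)
        if os.path.isdir(candidate):
            return candidate
    return None
```

A result of None for aea040f8-ab9d-435b-9ecf-ddd4272e592f would match the traceback: the mount that carries the hosted engine domain is missing on that host.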
>
>
> 2014-04-23 17:40 GMT+02:00 Kevin Tibi <kevintibi at hotmail.com>:
>
> top
>> 1729 vdsm 20 0 0 0 0 Z 373.8 0.0 252:08.51
>> ovirt-ha-broker <defunct>
>>
>>
>> [root at host01 ~]# ps axwu | grep 1729
>> vdsm 1729 0.7 0.0 0 0 ? Zl Apr02 240:24
>> [ovirt-ha-broker] <defunct>
>>
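
The "Zl" in the ps output means a zombie that is multi-threaded. As far as I know, a process can show up that way when its main thread has exited while other threads are still running, which could explain a defunct process that still uses CPU. Here is a small sketch, assuming Python 3 on Linux, that reads the per-thread states from /proc; the function names are made up for illustration.

```python
import os


def parse_proc_stat(stat_line):
    """Return (pid, comm, state) from the content of a /proc/<pid>/stat file."""
    # comm is wrapped in parentheses and may itself contain spaces or ")",
    # so split on the first "(" and the last ")".
    left = stat_line.index("(")
    right = stat_line.rindex(")")
    pid = int(stat_line[:left])
    comm = stat_line[left + 1:right]
    state = stat_line[right + 1:].split()[0]
    return pid, comm, state


def thread_states(pid):
    """Map thread id -> state letter (R, S, D, Z, ...) for each thread of pid."""
    states = {}
    task_dir = "/proc/%d/task" % pid
    for tid in os.listdir(task_dir):
        try:
            with open(os.path.join(task_dir, tid, "stat")) as f:
                states[int(tid)] = parse_proc_stat(f.read())[2]
        except OSError:
            continue  # the thread exited while we were looking
    return states
```

Calling thread_states(1729) on the affected host would show whether only the main thread is in state Z while the others are still running.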
>> [root at host01 ~]# ll
>> /rhev/data-center/mnt/host01.ovirt.lan\:_home_NFS01/aea040f8-ab9d-435b-9ecf-ddd4272e592f/ha_agent/
>> total 2028
>> -rw-rw----. 1 vdsm kvm 1048576 23 avril 17:35 hosted-engine.lockspace
>> -rw-rw----. 1 vdsm kvm 1028096 23 avril 17:35 hosted-engine.metadata
>>
>> cat /var/log/vdsm/vdsm.log
>>
>> Thread-120518::DEBUG::2014-04-23
>> 17:38:02,299::task::1185::TaskManager.Task::(prepare)
>> Task=`f13e71f1-ac7c-49ab-8079-8f099ebf72b6`::finished:
>> {'aea040f8-ab9d-435b-9ecf-ddd4272e592f': {'code': 0, 'version': 3,
>> 'acquired': True, 'delay': '0.000410963', 'lastCheck': '3.4', 'valid':
>> True}, '5ae613a4-44e4-42cb-89fc-7b5d34c1f30f': {'code': 0, 'version': 3,
>> 'acquired': True, 'delay': '0.000412357', 'lastCheck': '6.8', 'valid':
>> True}, 'cc51143e-8ad7-4b0b-a4d2-9024dffc1188': {'code': 0, 'version': 0,
>> 'acquired': True, 'delay': '0.000455292', 'lastCheck': '1.2', 'valid':
>> True}, 'ff98d346-4515-4349-8437-fb2f5e9eaadf': {'code': 0, 'version': 0,
>> 'acquired': True, 'delay': '0.00817113', 'lastCheck': '1.7', 'valid': True}}
>> Thread-120518::DEBUG::2014-04-23
>> 17:38:02,300::task::595::TaskManager.Task::(_updateState)
>> Task=`f13e71f1-ac7c-49ab-8079-8f099ebf72b6`::moving from state preparing ->
>> state finished
>> Thread-120518::DEBUG::2014-04-23
>> 17:38:02,300::resourceManager::940::ResourceManager.Owner::(releaseAll)
>> Owner.releaseAll requests {} resources {}
>> Thread-120518::DEBUG::2014-04-23
>> 17:38:02,300::resourceManager::977::ResourceManager.Owner::(cancelAll)
>> Owner.cancelAll requests {}
>> Thread-120518::DEBUG::2014-04-23
>> 17:38:02,300::task::990::TaskManager.Task::(_decref)
>> Task=`f13e71f1-ac7c-49ab-8079-8f099ebf72b6`::ref 0 aborting False
>> Thread-120518::ERROR::2014-04-23
>> 17:38:02,302::brokerlink::72::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(connect)
>> Failed to connect to broker: [Errno 2] No such file or directory
>> Thread-120518::ERROR::2014-04-23
>> 17:38:02,302::API::1612::vds::(_getHaInfo) failed to retrieve Hosted Engine
>> HA info
>> Traceback (most recent call last):
>> File "/usr/share/vdsm/API.py", line 1603, in _getHaInfo
>> stats = instance.get_all_stats()
>> File
>> "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/client/client.py",
>> line 83, in get_all_stats
>> with broker.connection():
>> File "/usr/lib64/python2.6/contextlib.py", line 16, in __enter__
>> return self.gen.next()
>> File
>> "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py",
>> line 96, in connection
>> self.connect()
>> File
>> "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py",
>> line 64, in connect
>> self._socket.connect(constants.BROKER_SOCKET_FILE)
>> File "<string>", line 1, in connect
>> error: [Errno 2] No such file or directory
>> Thread-78::DEBUG::2014-04-23
>> 17:38:05,490::fileSD::225::Storage.Misc.excCmd::(getReadDelay) '/bin/dd
>> iflag=direct
>> if=/rhev/data-center/mnt/host01.ovirt.lan:_home_DATA/5ae613a4-44e4-42cb-89fc-7b5d34c1f30f/dom_md/metadata
>> bs=4096 count=1' (cwd None)
>> Thread-78::DEBUG::2014-04-23
>> 17:38:05,523::fileSD::225::Storage.Misc.excCmd::(getReadDelay) SUCCESS:
>> <err> = '0+1 records in\n0+1 records out\n545 bytes (545 B) copied,
>> 0.000412209 s, 1.3 MB/s\n'; <rc> = 0
>>
>>
>>
>>
>> 2014-04-23 17:27 GMT+02:00 Martin Sivak <msivak at redhat.com>:
>>
>> Hi Kevin,
>>>
>>> > Same problem.
>>>
>>> Are you missing the lockspace file as well while running on top of
>>> GlusterFS?
>>>
>>> > ovirt-ha-broker is using 400% CPU and is defunct. I can't kill it with -9.
>>>
>>> A defunct process eating four full cores? I wonder how that is possible.
>>> What are the status flags of that process when you do ps axwu?
>>>
>>> Can you attach the log files please?
>>>
>>> --
>>> Martin Sivák
>>> msivak at redhat.com
>>> Red Hat Czech
>>> RHEV-M SLA / Brno, CZ
>>>
>>> ----- Original Message -----
>>> > Same problem. ovirt-ha-broker is using 400% CPU and is defunct. I can't
>>> > kill it with -9.
>>> >
>>> >
>>> > 2014-04-23 13:55 GMT+02:00 Martin Sivak <msivak at redhat.com>:
>>> >
>>> > > Hi,
>>> > >
>>> > > > Isn't this file created when hosted engine is started?
>>> > >
>>> > > The file is created by the setup script. If it got lost then there was
>>> > > probably something bad happening in your NFS or Gluster storage.
>>> > >
>>> > > > Or how can I create this file manually?
>>> > >
>>> > > I can give you experimental treatment for this. We do not have any
>>> > > official way as this is something that should not ever happen :)
>>> > >
>>> > > !! But before you do that make sure you do not have any nodes running
>>> > > properly. This will destroy and reinitialize the lockspace database for
>>> > > the whole hosted-engine environment (which you apparently lack, but..). !!
>>> > >
>>> > > You have to create the ha_agent/hosted-engine.lockspace file with the
>>> > > expected size (1MB) and then tell sanlock to initialize it as a lockspace
>>> > > using:
>>> > >
>>> > > # python
>>> > > >>> import sanlock
>>> > > >>> sanlock.write_lockspace(lockspace="hosted-engine",
>>> > > ... path="/rhev/data-center/mnt/<nfs>/<hosted engine storage domain>/ha_agent/hosted-engine.lockspace",
>>> > > ... offset=0)
>>> > > >>>
>>> > >
>>> > > Then try starting the services (both broker and agent) again.
>>> > >
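
For the "create the file with the expected size" step, here is a minimal sketch, assuming Python 3. It writes 1048576 zero bytes (the size shown in the ls output in this thread) and refuses to overwrite an existing file. create_empty_lockspace is a hypothetical helper; the file still has to be owned by vdsm:kvm and initialized with sanlock.write_lockspace as described in the quoted instructions.

```python
import os

LOCKSPACE_SIZE = 1048576  # bytes, taken from the ls output quoted in this thread


def create_empty_lockspace(path, size=LOCKSPACE_SIZE):
    """Create a zero-filled file of `size` bytes and return its final size.

    O_EXCL makes this fail with FileExistsError instead of truncating
    a lockspace file that is already there.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o660)
    with os.fdopen(fd, "wb") as f:
        f.write(b"\0" * size)
    # The mode passed to os.open is filtered by the umask, so set it explicitly
    # to match the -rw-rw---- permissions seen in the listing.
    os.chmod(path, 0o660)
    return os.path.getsize(path)
```

The same warning applies as for the sanlock step: only do this when no host in the hosted-engine environment is running properly.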
>>> > > --
>>> > > Martin Sivák
>>> > > msivak at redhat.com
>>> > > Red Hat Czech
>>> > > RHEV-M SLA / Brno, CZ
>>> > >
>>> > >
>>> > > ----- Original Message -----
>>> > > > On 04/23/2014 11:08 AM, Martin Sivak wrote:
>>> > > > > Hi René,
>>> > > > >
>>> > > > >>>> libvirtError: Failed to acquire lock: No space left on device
>>> > > > >
>>> > > > >>>> 2014-04-22 12:38:17+0200 654 [3093]: r2 cmd_acquire 2,9,5733
>>> invalid
>>> > > > >>>> lockspace found -1 failed 0 name
>>> > > 2851af27-8744-445d-9fb1-a0d083c8dc82
>>> > > > >
>>> > > > > Can you please check the contents of /rhev/data-center/<your nfs
>>> > > > > mount>/<nfs domain uuid>/ha_agent/?
>>> > > > >
>>> > > > > This is how it should look:
>>> > > > >
>>> > > > > [root at dev-03 ~]# ls -al
>>> > > > >
>>> > >
>>> /rhev/data-center/mnt/euryale\:_home_ovirt_he/e16de6a2-53f5-4ab3-95a3-255d08398824/ha_agent/
>>> > > > > total 2036
>>> > > > > drwxr-x---. 2 vdsm kvm 4096 Mar 19 18:46 .
>>> > > > > drwxr-xr-x. 6 vdsm kvm 4096 Mar 19 18:46 ..
>>> > > > > -rw-rw----. 1 vdsm kvm 1048576 Apr 23 11:05
>>> hosted-engine.lockspace
>>> > > > > -rw-rw----. 1 vdsm kvm 1028096 Mar 19 18:46
>>> hosted-engine.metadata
>>> > > > >
>>> > > > > The errors seem to indicate that you somehow lost the lockspace
>>> file.
>>> > > >
>>> > > > True :)
>>> > > > Isn't this file created when hosted engine is started? Or how can I
>>> > > > create this file manually?
>>> > > >
>>> > > > >
>>> > > > > --
>>> > > > > Martin Sivák
>>> > > > > msivak at redhat.com
>>> > > > > Red Hat Czech
>>> > > > > RHEV-M SLA / Brno, CZ
>>> > > > >
>>> > > > > ----- Original Message -----
>>> > > > >> On 04/23/2014 12:28 AM, Doron Fediuck wrote:
>>> > > > >>> Hi Rene,
>>> > > > >>> any idea what closed your ovirtmgmt bridge?
>>> > > > >>> as long as it is down vdsm may have issues starting up properly
>>> > > > >>> and this is why you see the complaints on the rpc server.
>>> > > > >>>
>>> > > > >>> Can you try manually fixing the network part first and then
>>> > > > >>> restart vdsm?
>>> > > > >>> Once vdsm is happy hosted engine VM will start.
>>> > > > >>
>>> > > > >> Thanks for your feedback, Doron.
>>> > > > >>
>>> > > > >> My ovirtmgmt bridge seems to be on or isn't it:
>>> > > > >> # brctl show ovirtmgmt
>>> > > > >> bridge name bridge id STP enabled
>>> interfaces
>>> > > > >> ovirtmgmt 8000.0025907587c2 no
>>> eth0.200
>>> > > > >>
>>> > > > >> # ip a s ovirtmgmt
>>> > > > >> 7: ovirtmgmt: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc
>>> noqueue
>>> > > > >> state UNKNOWN
>>> > > > >> link/ether 00:25:90:75:87:c2 brd ff:ff:ff:ff:ff:ff
>>> > > > >> inet 10.0.200.102/24 brd 10.0.200.255 scope global
>>> ovirtmgmt
>>> > > > >> inet6 fe80::225:90ff:fe75:87c2/64 scope link
>>> > > > >> valid_lft forever preferred_lft forever
>>> > > > >>
>>> > > > >> # ip a s eth0.200
>>> > > > >> 6: eth0.200 at eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500
>>> qdisc
>>> > > > >> noqueue state UP
>>> > > > >> link/ether 00:25:90:75:87:c2 brd ff:ff:ff:ff:ff:ff
>>> > > > >> inet6 fe80::225:90ff:fe75:87c2/64 scope link
>>> > > > >> valid_lft forever preferred_lft forever
>>> > > > >>
>>> > > > >> I tried the following yesterday:
>>> > > > >> Copy virtual disk from GlusterFS storage to local disk of host
>>> and
>>> > > > >> create a new vm with virt-manager which loads ovirtmgmt disk. I
>>> could
>>> > > > >> reach my engine over the ovirtmgmt bridge (so bridge must be
>>> working).
>>> > > > >>
>>> > > > >> I also started libvirtd with Option -v and I saw the following
>>> in
>>> > > > >> libvirtd.log when trying to start ovirt engine:
>>> > > > >> 2014-04-22 14:18:25.432+0000: 8901: debug :
>>> virCommandRunAsync:2250 :
>>> > > > >> Command result 0, with PID 11491
>>> > > > >> 2014-04-22 14:18:25.478+0000: 8901: debug : virCommandRun:2045 :
>>> > > Result
>>> > > > >> exit status 255, stdout: '' stderr: 'iptables v1.4.7: goto
>>> 'FO-vnet0'
>>> > > is
>>> > > > >> not a chain
>>> > > > >>
>>> > > > >> So it could be that something is broken in my hosted-engine
>>> network.
>>> > > Do
>>> > > > >> you have any clue how I can troubleshoot this?
>>> > > > >>
>>> > > > >>
>>> > > > >> Thanks,
>>> > > > >> René
>>> > > > >>
>>> > > > >>
>>> > > > >>>
>>> > > > >>> ----- Original Message -----
>>> > > > >>>> From: "René Koch" <rkoch at linuxland.at>
>>> > > > >>>> To: "Martin Sivak" <msivak at redhat.com>
>>> > > > >>>> Cc: users at ovirt.org
>>> > > > >>>> Sent: Tuesday, April 22, 2014 1:46:38 PM
>>> > > > >>>> Subject: Re: [ovirt-users] hosted engine health check issues
>>> > > > >>>>
>>> > > > >>>> Hi,
>>> > > > >>>>
>>> > > > >>>> I rebooted one of my ovirt hosts today and the result is now
>>> that I
>>> > > > >>>> can't start hosted-engine anymore.
>>> > > > >>>>
>>> > > > >>>> ovirt-ha-agent isn't running because the lockspace file is
>>> missing
>>> > > > >>>> (sanlock complains about it).
>>> > > > >>>> So I tried to start hosted-engine with --vm-start and I get
>>> the
>>> > > > >>>> following errors:
>>> > > > >>>>
>>> > > > >>>> ==> /var/log/sanlock.log <==
>>> > > > >>>> 2014-04-22 12:38:17+0200 654 [3093]: r2 cmd_acquire 2,9,5733
>>> invalid
>>> > > > >>>> lockspace found -1 failed 0 name
>>> > > 2851af27-8744-445d-9fb1-a0d083c8dc82
>>> > > > >>>>
>>> > > > >>>> ==> /var/log/messages <==
>>> > > > >>>> Apr 22 12:38:17 ovirt-host02 sanlock[3079]: 2014-04-22
>>> > > 12:38:17+0200 654
>>> > > > >>>> [3093]: r2 cmd_acquire 2,9,5733 invalid lockspace found -1
>>> failed 0
>>> > > name
>>> > > > >>>> 2851af27-8744-445d-9fb1-a0d083c8dc82
>>> > > > >>>> Apr 22 12:38:17 ovirt-host02 kernel: ovirtmgmt: port 2(vnet0)
>>> > > entering
>>> > > > >>>> disabled state
>>> > > > >>>> Apr 22 12:38:17 ovirt-host02 kernel: device vnet0 left
>>> promiscuous
>>> > > mode
>>> > > > >>>> Apr 22 12:38:17 ovirt-host02 kernel: ovirtmgmt: port 2(vnet0)
>>> > > entering
>>> > > > >>>> disabled state
>>> > > > >>>>
>>> > > > >>>> ==> /var/log/vdsm/vdsm.log <==
>>> > > > >>>> Thread-21::DEBUG::2014-04-22
>>> > > > >>>> 12:38:17,563::libvirtconnection::124::root::(wrapper) Unknown
>>> > > > >>>> libvirterror: ecode: 38 edom: 42 level: 2 message: Failed to
>>> acquire
>>> > > > >>>> lock: No space left on device
>>> > > > >>>> Thread-21::DEBUG::2014-04-22
>>> > > > >>>> 12:38:17,563::vm::2263::vm.Vm::(_startUnderlyingVm)
>>> > > > >>>> vmId=`f26dd37e-13b5-430c-b2f2-ecd098b82a91`::_ongoingCreations
>>> > > released
>>> > > > >>>> Thread-21::ERROR::2014-04-22
>>> > > > >>>> 12:38:17,564::vm::2289::vm.Vm::(_startUnderlyingVm)
>>> > > > >>>> vmId=`f26dd37e-13b5-430c-b2f2-ecd098b82a91`::The vm start
>>> process
>>> > > failed
>>> > > > >>>> Traceback (most recent call last):
>>> > > > >>>> File "/usr/share/vdsm/vm.py", line 2249, in
>>> _startUnderlyingVm
>>> > > > >>>> self._run()
>>> > > > >>>> File "/usr/share/vdsm/vm.py", line 3170, in _run
>>> > > > >>>> self._connection.createXML(domxml, flags),
>>> > > > >>>> File
>>> > > > >>>>
>>> "/usr/lib64/python2.6/site-packages/vdsm/libvirtconnection.py",
>>> > > > >>>> line 92, in wrapper
>>> > > > >>>> ret = f(*args, **kwargs)
>>> > > > >>>> File "/usr/lib64/python2.6/site-packages/libvirt.py",
>>> line
>>> > > 2665, in
>>> > > > >>>> createXML
>>> > > > >>>> if ret is None:raise libvirtError('virDomainCreateXML()
>>> > > failed',
>>> > > > >>>> conn=self)
>>> > > > >>>> libvirtError: Failed to acquire lock: No space left on device
>>> > > > >>>>
>>> > > > >>>> ==> /var/log/messages <==
>>> > > > >>>> Apr 22 12:38:17 ovirt-host02 vdsm vm.Vm ERROR
>>> > > > >>>> vmId=`f26dd37e-13b5-430c-b2f2-ecd098b82a91`::The vm start
>>> process
>>> > > > >>>> failed#012Traceback (most recent call last):#012 File
>>> > > > >>>> "/usr/share/vdsm/vm.py", line 2249, in _startUnderlyingVm#012
>>> > > > >>>> self._run()#012 File "/usr/share/vdsm/vm.py", line 3170, in
>>> > > _run#012
>>> > > > >>>> self._connection.createXML(domxml, flags),#012 File
>>> > > > >>>>
>>> "/usr/lib64/python2.6/site-packages/vdsm/libvirtconnection.py",
>>> > > line 92,
>>> > > > >>>> in wrapper#012 ret = f(*args, **kwargs)#012 File
>>> > > > >>>> "/usr/lib64/python2.6/site-packages/libvirt.py", line 2665, in
>>> > > > >>>> createXML#012 if ret is None:raise
>>> > > libvirtError('virDomainCreateXML()
>>> > > > >>>> failed', conn=self)#012libvirtError: Failed to acquire lock:
>>> No
>>> > > space
>>> > > > >>>> left on device
>>> > > > >>>>
>>> > > > >>>> ==> /var/log/vdsm/vdsm.log <==
>>> > > > >>>> Thread-21::DEBUG::2014-04-22
>>> > > > >>>> 12:38:17,569::vm::2731::vm.Vm::(setDownStatus)
>>> > > > >>>> vmId=`f26dd37e-13b5-430c-b2f2-ecd098b82a91`::Changed state to
>>> Down:
>>> > > > >>>> Failed to acquire lock: No space left on device
>>> > > > >>>>
>>> > > > >>>>
>>> > > > >>>> No space left on device is nonsense as there is enough space
>>> (I had
>>> > > this
>>> > > > >>>> issue last time as well where I had to patch machine.py, but
>>> this
>>> > > file
>>> > > > >>>> is now Python 2.6.6 compatible).
>>> > > > >>>>
>>> > > > >>>> Any idea what prevents hosted-engine from starting?
>>> > > > >>>> ovirt-ha-broker, vdsmd and sanlock are running btw.
>>> > > > >>>>
>>> > > > >>>> Btw, I can see in log that json rpc server module is missing
>>> - which
>>> > > > >>>> package is required for CentOS 6.5?
>>> > > > >>>> Apr 22 12:37:14 ovirt-host02 vdsm vds WARNING Unable to load
>>> the
>>> > > json
>>> > > > >>>> rpc server module. Please make sure it is installed.
>>> > > > >>>>
>>> > > > >>>>
>>> > > > >>>> Thanks,
>>> > > > >>>> René
>>> > > > >>>>
>>> > > > >>>>
>>> > > > >>>>
>>> > > > >>>> On 04/17/2014 10:02 AM, Martin Sivak wrote:
>>> > > > >>>>> Hi,
>>> > > > >>>>>
>>> > > > >>>>>>>> How can I disable notifications?
>>> > > > >>>>>
>>> > > > >>>>> The notification is configured in
>>> > > > >>>>> /etc/ovirt-hosted-engine-ha/broker.conf
>>> > > > >>>>> section notification.
>>> > > > >>>>> The email is sent when the key state_transition exists and
>>> the
>>> > > string
>>> > > > >>>>> OldState-NewState contains the (case insensitive) regexp
>>> from the
>>> > > > >>>>> value.
>>> > > > >>>>>
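
The matching rule described above can be illustrated with a short sketch, assuming Python 3. It only mirrors the description (a case-insensitive regexp searched in the string OldState-NewState); should_notify is a hypothetical name and the broker's real implementation is not shown in this thread.

```python
import re


def should_notify(pattern, old_state, new_state):
    """Return True if `pattern` is found (ignoring case) in 'Old-New'."""
    transition = "%s-%s" % (old_state, new_state)
    return re.search(pattern, transition, re.IGNORECASE) is not None
```

With a value of "enginestart", the transition EngineDown-EngineStart would match, while EngineStart-EngineUp would not match a value of "maintenance".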
>>> > > > >>>>>>>> Is it intended to send out these messages and detect that
>>> ovirt
>>> > > > >>>>>>>> engine
>>> > > > >>>>>>>> is down (which is false anyway), but not to restart the
>>> vm?
>>> > > > >>>>>
>>> > > > >>>>> Forget about emails for now and check the
>>> > > > >>>>> /var/log/ovirt-hosted-engine-ha/agent.log and broker.log (and
>>> > > attach
>>> > > > >>>>> them
>>> > > > >>>>> as well btw).
>>> > > > >>>>>
>>> > > > >>>>>>>> oVirt hosts think that hosted engine is down because it
>>> seems
>>> > > that
>>> > > > >>>>>>>> hosts
>>> > > > >>>>>>>> can't write to hosted-engine.lockspace due to glusterfs
>>> issues
>>> > > (or
>>> > > > >>>>>>>> at
>>> > > > >>>>>>>> least I think so).
>>> > > > >>>>>
>>> > > > >>>>> The hosts think so or can't really write there? The
>>> lockspace is
>>> > > > >>>>> managed
>>> > > > >>>>> by
>>> > > > >>>>> sanlock and our HA daemons do not touch it at all. We only
>>> ask
>>> > > sanlock
>>> > > > >>>>> to
>>> > > > >>>>> make sure we have a unique server id.
>>> > > > >>>>>
>>> > > > >>>>>>>> Is it possible or planned to make the whole ha feature
>>> optional?
>>> > > > >>>>>
>>> > > > >>>>> Well the system won't perform any automatic actions if you
>>> put the
>>> > > > >>>>> hosted
>>> > > > >>>>> engine to global maintenance and only start/stop/migrate the
>>> VM
>>> > > > >>>>> manually.
>>> > > > >>>>> I would discourage you from stopping agent/broker, because
>>> the
>>> > > engine
>>> > > > >>>>> itself has some logic based on the reporting.
>>> > > > >>>>>
>>> > > > >>>>> Regards
>>> > > > >>>>>
>>> > > > >>>>> --
>>> > > > >>>>> Martin Sivák
>>> > > > >>>>> msivak at redhat.com
>>> > > > >>>>> Red Hat Czech
>>> > > > >>>>> RHEV-M SLA / Brno, CZ
>>> > > > >>>>>
>>> > > > >>>>> ----- Original Message -----
>>> > > > >>>>>> On 04/15/2014 04:53 PM, Jiri Moskovcak wrote:
>>> > > > >>>>>>> On 04/14/2014 10:50 AM, René Koch wrote:
>>> > > > >>>>>>>> Hi,
>>> > > > >>>>>>>>
>>> > > > >>>>>>>> I have some issues with hosted engine status.
>>> > > > >>>>>>>>
>>> > > > >>>>>>>> oVirt hosts think that hosted engine is down because it
>>> seems
>>> > > that
>>> > > > >>>>>>>> hosts
>>> > > > >>>>>>>> can't write to hosted-engine.lockspace due to glusterfs
>>> issues
>>> > > (or
>>> > > > >>>>>>>> at
>>> > > > >>>>>>>> least I think so).
>>> > > > >>>>>>>>
>>> > > > >>>>>>>> Here's the output of vm-status:
>>> > > > >>>>>>>>
>>> > > > >>>>>>>> # hosted-engine --vm-status
>>> > > > >>>>>>>>
>>> > > > >>>>>>>>
>>> > > > >>>>>>>> --== Host 1 status ==--
>>> > > > >>>>>>>>
>>> > > > >>>>>>>> Status up-to-date : False
>>> > > > >>>>>>>> Hostname : 10.0.200.102
>>> > > > >>>>>>>> Host ID : 1
>>> > > > >>>>>>>> Engine status : unknown stale-data
>>> > > > >>>>>>>> Score : 2400
>>> > > > >>>>>>>> Local maintenance : False
>>> > > > >>>>>>>> Host timestamp : 1397035677
>>> > > > >>>>>>>> Extra metadata (valid at timestamp):
>>> > > > >>>>>>>> metadata_parse_version=1
>>> > > > >>>>>>>> metadata_feature_version=1
>>> > > > >>>>>>>> timestamp=1397035677 (Wed Apr 9 11:27:57 2014)
>>> > > > >>>>>>>> host-id=1
>>> > > > >>>>>>>> score=2400
>>> > > > >>>>>>>> maintenance=False
>>> > > > >>>>>>>> state=EngineUp
>>> > > > >>>>>>>>
>>> > > > >>>>>>>>
>>> > > > >>>>>>>> --== Host 2 status ==--
>>> > > > >>>>>>>>
>>> > > > >>>>>>>> Status up-to-date : True
>>> > > > >>>>>>>> Hostname : 10.0.200.101
>>> > > > >>>>>>>> Host ID : 2
>>> > > > >>>>>>>> Engine status : {'reason': 'vm not
>>> running
>>> > > on
>>> > > > >>>>>>>> this
>>> > > > >>>>>>>> host', 'health': 'bad', 'vm': 'down', 'detail': 'unknown'}
>>> > > > >>>>>>>> Score : 0
>>> > > > >>>>>>>> Local maintenance : False
>>> > > > >>>>>>>> Host timestamp : 1397464031
>>> > > > >>>>>>>> Extra metadata (valid at timestamp):
>>> > > > >>>>>>>> metadata_parse_version=1
>>> > > > >>>>>>>> metadata_feature_version=1
>>> > > > >>>>>>>> timestamp=1397464031 (Mon Apr 14 10:27:11 2014)
>>> > > > >>>>>>>> host-id=2
>>> > > > >>>>>>>> score=0
>>> > > > >>>>>>>> maintenance=False
>>> > > > >>>>>>>> state=EngineUnexpectedlyDown
>>> > > > >>>>>>>> timeout=Mon Apr 14 10:35:05 2014
>>> > > > >>>>>>>>
>>> > > > >>>>>>>> oVirt engine is sending me 2 emails every 10 minutes with
>>> the
>>> > > > >>>>>>>> following
>>> > > > >>>>>>>> subjects:
>>> > > > >>>>>>>> - ovirt-hosted-engine state transition
>>> EngineDown-EngineStart
>>> > > > >>>>>>>> - ovirt-hosted-engine state transition
>>> EngineStart-EngineUp
>>> > > > >>>>>>>>
>>> > > > >>>>>>>> In oVirt webadmin I can see the following message:
>>> > > > >>>>>>>> VM HostedEngine is down. Exit message: internal error
>>> Failed to
>>> > > > >>>>>>>> acquire
>>> > > > >>>>>>>> lock: error -243.
>>> > > > >>>>>>>>
>>> > > > >>>>>>>> These messages are really annoying as oVirt isn't doing
>>> anything
>>> > > > >>>>>>>> with
>>> > > > >>>>>>>> hosted engine - I have an uptime of 9 days in my engine
>>> vm.
>>> > > > >>>>>>>>
>>> > > > >>>>>>>> So my questions are now:
>>> > > > >>>>>>>> Is it intended to send out these messages and detect that
>>> ovirt
>>> > > > >>>>>>>> engine
>>> > > > >>>>>>>> is down (which is false anyway), but not to restart the
>>> vm?
>>> > > > >>>>>>>>
>>> > > > >>>>>>>> How can I disable notifications? I'm planning to write a
>>> Nagios
>>> > > > >>>>>>>> plugin
>>> > > > >>>>>>>> which parses the output of hosted-engine --vm-status and
>>> only
>>> > > Nagios
>>> > > > >>>>>>>> should notify me, not hosted-engine script.
>>> > > > >>>>>>>>
>>> > > > >>>>>>>> Is it possible or planned to make the whole ha feature
>>> > > optional? I
>>> > > > >>>>>>>> really really really hate cluster software as it causes
>>> more
>>> > > > >>>>>>>> troubles
>>> > > > >>>>>>>> than standalone machines and in my case the hosted-engine
>>> ha
>>> > > feature
>>> > > > >>>>>>>> really causes troubles (and I didn't have a hardware or
>>> network
>>> > > > >>>>>>>> outage
>>> > > > >>>>>>>> yet only issues with hosted-engine ha agent). I don't
>>> need any
>>> > > ha
>>> > > > >>>>>>>> feature for hosted engine. I just want to run engine
>>> > > virtualized on
>>> > > > >>>>>>>> oVirt and if engine vm fails (e.g. because of issues with
>>> a
>>> > > host)
>>> > > > >>>>>>>> I'll
>>> > > > >>>>>>>> restart it on another node.
>>> > > > >>>>>>>
>>> > > > >>>>>>> Hi, you can:
>>> > > > >>>>>>> 1. edit
>>> /etc/ovirt-hosted-engine-ha/{agent,broker}-log.conf and
>>> > > tweak
>>> > > > >>>>>>> the logger as you like
>>> > > > >>>>>>> 2. or kill ovirt-ha-broker & ovirt-ha-agent services
>>> > > > >>>>>>
>>> > > > >>>>>> Thanks for the information.
>>> > > > >>>>>> So engine is able to run when ovirt-ha-broker and
>>> ovirt-ha-agent
>>> > > aren't
>>> > > > >>>>>> running?
>>> > > > >>>>>>
>>> > > > >>>>>>
>>> > > > >>>>>> Regards,
>>> > > > >>>>>> René
>>> > > > >>>>>>
>>> > > > >>>>>>>
>>> > > > >>>>>>> --Jirka
>>> > > > >>>>>>>>
>>> > > > >>>>>>>> Thanks,
>>> > > > >>>>>>>> René
>>> > > > >>>>>>>>
>>> > > > >>>>>>>>
>>> > > > >>>>>>>
>>> > > > >>>>>> _______________________________________________
>>> > > > >>>>>> Users mailing list
>>> > > > >>>>>> Users at ovirt.org
>>> > > > >>>>>> http://lists.ovirt.org/mailman/listinfo/users
>>> > > > >>>>>>
>>> > > > >>>>
>>> > > > >>
>>> > > >
>>> > >
>>> >
>>>
>>
>>
>