[ovirt-users] HE in bad status, will not start following storage issue - HELP

Ian Neilsen ian.neilsen at gmail.com
Sun Mar 12 09:24:50 UTC 2017


I've checked the ids file in /rhev/data-center/mnt/glusterSD/*...../dom_md/

# -rw-rw----. 1 vdsm kvm  1048576 Mar 12 05:14 ids

seems ok

sanlock.log is showing:
---------------------------
r14 acquire_token open error -13
r14 cmd_acquire 2,11,89283 acquire_token -13
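
If I'm reading it right, error -13 is EACCES (permission denied), so sanlock itself can't open the ids file even though the vdsm:kvm ownership above looks right. A few checks I'm planning to run next (the path below is just illustrative for my setup, and I'm assuming the daemon runs as the sanlock user):

# id sanlock
# ls -l /rhev/data-center/mnt/glusterSD/<server>:_<volume>/<sd-uuid>/dom_md/ids
# ausearch -m avc -ts recent | grep -i sanlock     (in case SELinux is the one returning the denial)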

Now I'm not quite sure which direction to take.

Lockspace
---------------
"hosted-engine --reinitialize-lockspace" is throwing an exception;

Exception("Lockfile reset cannot be performed with"
Exception: Lockfile reset cannot be performed with an active agent.
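
If I'm reading that exception right, the reset refuses to run while ovirt-ha-agent is still active somewhere. What I'm thinking of trying (happy to be corrected) is stopping the HA services on every hosted-engine host first, then re-running it:

# systemctl stop ovirt-ha-agent ovirt-ha-broker     (on all hosted-engine hosts)
# hosted-engine --reinitialize-lockspace
# systemctl start ovirt-ha-broker ovirt-ha-agent    (again on all hosts)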


@didi - I am in "Global Maintenance".
I just noticed that host 1 now shows:
Engine status: unknown stale-data
state= AgentStopped
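
(I believe that's from "hosted-engine --vm-status" while in global maintenance. I take "AgentStopped" to mean ovirt-ha-agent simply isn't running on host 1, which I'll confirm with:

# systemctl status ovirt-ha-agent

before reading too much into the stale-data part.)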

I'm pretty sure I've been able to start the Engine VM while in Global
Maintenance. But you raise a good question: I don't see why you would be
restricted from running the engine, or even starting the VM, while in
Global Maintenance. If so, that seems a little backwards.






On 12 March 2017 at 16:28, Yedidyah Bar David <didi at redhat.com> wrote:

> On Fri, Mar 10, 2017 at 12:39 PM, Martin Sivak <msivak at redhat.com> wrote:
> > Hi Ian,
> >
> > it is normal that VDSMs are competing for the lock, one should win
> > though. If that is not the case then the lockspace might be corrupted
> > or the sanlock daemons can't reach it.
> >
> > I would recommend putting the cluster to global maintenance and
> > attempting a manual start using:
> >
> > # hosted-engine --set-maintenance --mode=global
> > # hosted-engine --vm-start
>
> Is that possible? See also:
>
> http://lists.ovirt.org/pipermail/users/2016-January/036993.html
>
> >
> > You will need to check your storage connectivity and sanlock status on
> > all hosts if that does not work.
> >
> > # sanlock client status
> >
> > There are couple of locks I would expect to be there (ha_agent, spm),
> > but no lock for hosted engine disk should be visible.
> >
> > Next steps depend on whether you have important VMs running on the
> > cluster and on the Gluster status (I can't help you there
> > unfortunately).
> >
> > Best regards
> >
> > --
> > Martin Sivak
> > SLA / oVirt
> >
> >
> > On Fri, Mar 10, 2017 at 7:37 AM, Ian Neilsen <ian.neilsen at gmail.com>
> wrote:
> >> I just noticed this in the vdsm.logs.  The agent looks like it is
> trying to
> >> start hosted engine on both machines??
> >>
> >> <on_poweroff>destroy</on_poweroff><on_reboot>destroy</
> on_reboot><on_crash>destroy</on_crash></domain>
> >> Thread-7517::ERROR::2017-03-10
> >> 01:26:13,053::vm::773::virt.vm::(_startUnderlyingVm)
> >> vmId=`2419f9fe-4998-4b7a-9fe9-151571d20379`::The vm start process
> failed
> >> Traceback (most recent call last):
> >>   File "/usr/share/vdsm/virt/vm.py", line 714, in _startUnderlyingVm
> >> self._run()
> >>   File "/usr/share/vdsm/virt/vm.py", line 2026, in _run
> >> self._connection.createXML(domxml, flags),
> >>   File "/usr/lib/python2.7/site-packages/vdsm/libvirtconnection.py",
> line
> >> 123, in wrapper ret = f(*args, **kwargs)
> >>   File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 917, in
> >> wrapper return func(inst, *args, **kwargs)
> >>   File "/usr/lib64/python2.7/site-packages/libvirt.py", line 3782, in
> >> createXML if ret is None:raise libvirtError('virDomainCreateXML()
> failed',
> >> conn=self)
> >>
> >> libvirtError: Failed to acquire lock: Permission denied
> >>
> >> INFO::2017-03-10 01:26:13,054::vm::1330::virt.vm::(setDownStatus)
> >> vmId=`2419f9fe-4998-4b7a-9fe9-151571d20379`::Changed state to Down:
> Failed
> >> to acquire lock: Permission denied (code=1)
> >> INFO::2017-03-10 01:26:13,054::guestagent::430::virt.vm::(stop)
> >> vmId=`2419f9fe-4998-4b7a-9fe9-151571d20379`::Stopping connection
> >>
> >> DEBUG::2017-03-10 01:26:13,054::vmchannels::238::vds::(unregister)
> Delete
> >> fileno 56 from listener.
> >> DEBUG::2017-03-10 01:26:13,055::vmchannels::66::vds::(_unregister_fd)
> Failed
> >> to unregister FD from epoll (ENOENT): 56
> >> DEBUG::2017-03-10 01:26:13,055::__init__::209::
> jsonrpc.Notification::(emit)
> >> Sending event {"params": {"2419f9fe-4998-4b7a-9fe9-151571d20379":
> {"status":
> >> "Down", "exitReason": 1, "exitMessage": "Failed to acquire lock:
> Permission
> >> denied", "exitCode": 1}, "notify_time": 4308740560}, "jsonrpc": "2.0",
> >> "method": "|virt|VM_status|2419f9fe-4998-4b7a-9fe9-151571d20379"}
> >> VM Channels Listener::DEBUG::2017-03-10
> >> 01:26:13,475::vmchannels::142::vds::(_do_del_channels) fileno 56 was
> removed
> >> from listener.
> >> DEBUG::2017-03-10 01:26:14,430::check::296::storage.check::(_start_
> process)
> >> START check
> >> u'/rhev/data-center/mnt/glusterSD/192.168.3.10:_data/
> a08822ec-3f5b-4dba-ac2d-5510f0b4b6a2/dom_md/metadata'
> >> cmd=['/usr/bin/taskset', '--cpu-list', '0-39', '/usr/bin/dd',
> >> u'if=/rhev/data-center/mnt/glusterSD/192.168.3.10:_data/
> a08822ec-3f5b-4dba-ac2d-5510f0b4b6a2/dom_md/metadata',
> >> 'of=/dev/null', 'bs=4096', 'count=1', 'iflag=direct'] delay=0.00
> >> DEBUG::2017-03-10 01:26:14,481::asyncevent::564:
> :storage.asyncevent::(reap)
> >> Process <cpopen.CPopen object at 0x3ba6550> terminated (count=1)
> >> DEBUG::2017-03-10
> >> 01:26:14,481::check::327::storage.check::(_check_completed) FINISH
> check
> >> u'/rhev/data-center/mnt/glusterSD/192.168.3.10:_data/
> a08822ec-3f5b-4dba-ac2d-5510f0b4b6a2/dom_md/metadata'
> >> rc=0 err=bytearray(b'0+1 records in\n0+1 records out\n300 bytes (300 B)
> >> copied, 8.7603e-05 s, 3.4 MB/s\n') elapsed=0.06
> >>
> >>
> >> On 10 March 2017 at 10:40, Ian Neilsen <ian.neilsen at gmail.com> wrote:
> >>>
> >>> Hi All
> >>>
> >>> I had a storage issue with my gluster volumes running under ovirt
> hosted.
> >>> I now cannot start the hosted engine manager vm from "hosted-engine
> >>> --vm-start".
> >>> I've scoured the net to find a way, but can't seem to find anything
> >>> concrete.
> >>>
> >>> Running Centos7, ovirt 4.0 and gluster 3.8.9
> >>>
> >>> How do I recover the engine manager. Im at a loss!
> >>>
> >>> Engine Status = score between nodes was 0 for all, now node 1 is
> reading
> >>> 3400, but all others are 0
> >>>
> >>> {"reason": "bad vm status", "health": "bad", "vm": "down", "detail":
> >>> "down"}
> >>>
> >>>
> >>> Logs from agent.log
> >>> ==================
> >>>
> >>> INFO::2017-03-09
> >>> 19:32:52,600::state_decorators::51::ovirt_hosted_
> engine_ha.agent.hosted_engine.HostedEngine::(check)
> >>> Global maintenance detected
> >>> INFO::2017-03-09
> >>> 19:32:52,603::hosted_engine::612::ovirt_hosted_engine_ha.
> agent.hosted_engine.HostedEngine::(_initialize_vdsm)
> >>> Initializing VDSM
> >>> INFO::2017-03-09
> >>> 19:32:54,820::hosted_engine::639::ovirt_hosted_engine_ha.
> agent.hosted_engine.HostedEngine::(_initialize_storage_images)
> >>> Connecting the storage
> >>> INFO::2017-03-09
> >>> 19:32:54,821::storage_server::219::ovirt_hosted_engine_ha.
> lib.storage_server.StorageServer::(connect_storage_server)
> >>> Connecting storage server
> >>> INFO::2017-03-09
> >>> 19:32:59,194::storage_server::226::ovirt_hosted_engine_ha.
> lib.storage_server.StorageServer::(connect_storage_server)
> >>> Connecting storage server
> >>> INFO::2017-03-09
> >>> 19:32:59,211::storage_server::233::ovirt_hosted_engine_ha.
> lib.storage_server.StorageServer::(connect_storage_server)
> >>> Refreshing the storage domain
> >>> INFO::2017-03-09
> >>> 19:32:59,328::hosted_engine::666::ovirt_hosted_engine_ha.
> agent.hosted_engine.HostedEngine::(_initialize_storage_images)
> >>> Preparing images
> >>> INFO::2017-03-09
> >>> 19:32:59,328::image::126::ovirt_hosted_engine_ha.lib.
> image.Image::(prepare_images)
> >>> Preparing images
> >>> INFO::2017-03-09
> >>> 19:33:01,748::hosted_engine::669::ovirt_hosted_engine_ha.
> agent.hosted_engine.HostedEngine::(_initialize_storage_images)
> >>> Reloading vm.conf from the shared storage domain
> >>> INFO::2017-03-09
> >>> 19:33:01,748::config::206::ovirt_hosted_engine_ha.agent.
> hosted_engine.HostedEngine.config::(refresh_local_conf_file)
> >>> Trying to get a fresher copy of vm configuration from the OVF_STORE
> >>> WARNING::2017-03-09
> >>> 19:33:04,056::ovf_store::107::ovirt_hosted_engine_ha.lib.
> ovf.ovf_store.OVFStore::(scan)
> >>> Unable to find OVF_STORE
> >>> ERROR::2017-03-09
> >>> 19:33:04,058::config::235::ovirt_hosted_engine_ha.agent.
> hosted_engine.HostedEngine.config::(refresh_local_conf_file)
> >>> Unable to get vm.conf from OVF_STORE, falling back to initial vm.conf
> >>>
> >>> ovirt-ha-agent logs
> >>> ================
> >>>
> >>> ovirt-ha-agent
> >>> ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config ERROR
> Unable
> >>> to get vm.conf from OVF_STORE, falling back to initial vm.conf
> >>>
> >>> vdsm
> >>> ======
> >>>
> >>> vdsm vds.dispatcher ERROR SSL error during reading data: unexpected eof
> >>>
> >>> ovirt-ha-broker
> >>> ============
> >>>
> >>> ovirt-ha-broker cpu_load_no_engine.EngineHealth ERROR Failed to
> >>> getVmStats: 'pid'
> >>>
> >>> --
> >>> Ian Neilsen
> >>>
> >>> Mobile: 0424 379 762
> >>> Linkedin: http://au.linkedin.com/in/ianneilsen
> >>> Twitter : ineilsen
> >>
> >>
> >>
> >>
> >> --
> >> Ian Neilsen
> >>
> >> Mobile: 0424 379 762
> >> Linkedin: http://au.linkedin.com/in/ianneilsen
> >> Twitter : ineilsen
> >>
> >> _______________________________________________
> >> Users mailing list
> >> Users at ovirt.org
> >> http://lists.ovirt.org/mailman/listinfo/users
> >>
> > _______________________________________________
> > Users mailing list
> > Users at ovirt.org
> > http://lists.ovirt.org/mailman/listinfo/users
>
>
>
> --
> Didi
>



-- 
Ian Neilsen

Mobile: 0424 379 762
Linkedin: http://au.linkedin.com/in/ianneilsen
Twitter : ineilsen