Hi Nardus,
I'm assuming that your setup was stable and that you were able to run your
VMs without problems. If so, then what is described below is not a solution
to your problem; you should really check the engine and VDSM logs for the
reasons why your hosts become NonResponsive. Most probably there is an
underlying storage or network issue which prevents correct engine <-> host
communication and which made your hosts NonResponsive. The solution below
will just hide the issues you currently have.
If your problem started suddenly when you significantly increased the
number of running VMs or decreased the number of available hosts, then you
are hitting these issues because you do not have enough resources.
Regards,
Martin
On Thu, Aug 6, 2020 at 4:51 PM Artur Socha <asocha(a)redhat.com> wrote:
Thanks Nardus,
After a quick look I found what I suspected: there are way too many
threads in the Blocked state. I don't know the reason yet, but this is very
helpful. I'll let you know about the findings/investigation. Meanwhile, you
may try restarting the engine as a (very brute and ugly) workaround.
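A rough way to get a count of blocked threads out of a dump, as a sketch only:
it assumes the dump was taken with plain `jstack <pid>`, which prints a
"java.lang.Thread.State: ..." line for each thread. Dumps taken with
`jstack -F` use a different layout, so this pattern will most likely not match
them. The function name count_blocked is made up for illustration.

```shell
# Count threads reported as BLOCKED in a thread dump file.
# Assumes the standard "java.lang.Thread.State: BLOCKED" line of plain jstack output.
count_blocked() {
    grep -c 'java.lang.Thread.State: BLOCKED' "$1"
}

# Usage (file names follow the pattern used elsewhere in this thread):
#   for f in your_engine_thread_dump_*.txt; do
#       printf '%s: %s blocked\n' "$f" "$(count_blocked "$f")"
#   done
```

Comparing the count across dumps taken a few minutes apart shows whether the
number of blocked threads is growing or stable.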
You may also try to set up a slightly bigger thread pool; it may buy you
some time until the next hiccup. However, please be aware that this may come
at the cost of higher memory and CPU usage (due to increased context
switching).
Here are some docs:
# Specify the thread pool size for jboss managed scheduled executor service
# used by commands to periodically execute methods. It is generally not
# necessary to increase the number of threads in this thread pool. To change
# the value permanently create a conf file
# 99-engine-scheduled-thread-pool.conf in /etc/ovirt-engine/engine.conf.d/
ENGINE_SCHEDULED_THREAD_POOL_SIZE=100
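As a minimal sketch of the step described in that comment: the helper below
writes the drop-in file into a given directory. The function name
write_pool_conf and the value 150 are only illustrative assumptions, not a
recommended setting.

```shell
# Write the scheduled-thread-pool drop-in file into the given conf directory.
# $1 = conf directory, $2 = desired pool size (both supplied by the caller).
write_pool_conf() {
    conf_dir="$1"
    pool_size="$2"
    printf 'ENGINE_SCHEDULED_THREAD_POOL_SIZE=%s\n' "$pool_size" \
        > "$conf_dir/99-engine-scheduled-thread-pool.conf"
}

# Usage on the engine host (as root); 150 is an arbitrary example value:
#   write_pool_conf /etc/ovirt-engine/engine.conf.d 150
#   systemctl restart ovirt-engine
```

The restart is there on the assumption that the engine only reads these conf
files at startup.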
A.
On Thu, Aug 6, 2020 at 4:19 PM Nardus Geldenhuys <nardusg(a)gmail.com> wrote:
> Hi Artur
>
> Please find attached; let me know if I need to rerun. They were taken
> 5 min apart.
>
> [root@engine-aa-1-01 ovirt-engine]# ps -ef | grep jboss | grep -v grep | awk '{ print $2 }'
> 27390
> [root@engine-aa-1-01 ovirt-engine]# jstack -F 27390 > your_engine_thread_dump_1.txt
> [root@engine-aa-1-01 ovirt-engine]# jstack -F 27390 > your_engine_thread_dump_2.txt
> [root@engine-aa-1-01 ovirt-engine]# jstack -F 27390 > your_engine_thread_dump_3.txt
>
> Regards
>
> Nar
>
> On Thu, 6 Aug 2020 at 15:55, Artur Socha <asocha(a)redhat.com> wrote:
>
>> Sure thing.
>> On the engine host, please find the jboss pid. You can use this command:
>>
>> ps -ef | grep jboss | grep -v grep | awk '{ print $2 }'
>>
>> or the jps tool from the JDK. Sample output on my dev environment is:
>>
>> ± % jps
>> !2860
>> 64853 jboss-modules.jar
>> 196217 Jps
>>
>> Then use jstack from the JDK:
>> jstack <pid> > your_engine_thread_dump.txt
>> 2 or 3 dumps taken at approximately 5-minute intervals would be even
>> more useful.
>>
>> Here you can find even more options:
>> https://www.baeldung.com/java-thread-dump
>>
>> Artur
>>
>> On Thu, Aug 6, 2020 at 3:15 PM Nardus Geldenhuys <nardusg(a)gmail.com>
>> wrote:
>>
>>> Hi
>>>
>>> I can create a thread dump; please send details on how to do it.
>>>
>>> Regards
>>>
>>> Nardus
>>>
>>> On Thu, 6 Aug 2020 at 14:17, Artur Socha <asocha(a)redhat.com> wrote:
>>>
>>>> Hi Nardus,
>>>> You might have hit an issue I have been hunting for some time ([1]
>>>> and [2]).
>>>> [1] could not be properly resolved because at the time I was not able
>>>> to reproduce the issue on a dev setup.
>>>> I suspect [2] is related.
>>>>
>>>> Would you be able to prepare a thread dump from your engine instance?
>>>> Additionally, please check for potential libvirt errors/warnings.
>>>> Can you also paste the output of:
>>>> sudo yum list installed | grep vdsm
>>>> sudo yum list installed | grep ovirt-engine
>>>> sudo yum list installed | grep libvirt
>>>>
>>>> Usually, according to previous reports, restarting the engine helps to
>>>> restore connectivity with hosts ... at least for some time.
>>>>
>>>> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1845152
>>>> [2] https://bugzilla.redhat.com/show_bug.cgi?id=1846338
>>>>
>>>> regards,
>>>> Artur
>>>>
>>>>
>>>>
>>>> On Thu, Aug 6, 2020 at 8:01 AM Nardus Geldenhuys <nardusg(a)gmail.com>
>>>> wrote:
>>>>
>>>>> I also see this in the engine:
>>>>>
>>>>> Aug 6, 2020, 7:37:17 AM
>>>>> VDSM someserver command Get Host Capabilities failed: Message timeout
>>>>> which can be caused by communication issues
>>>>>
>>>>> On Thu, 6 Aug 2020 at 07:09, Strahil Nikolov <hunter86_bg(a)yahoo.com>
>>>>> wrote:
>>>>>
>>>>>> Can you check for errors on the affected host? Most probably you
>>>>>> need the vdsm logs.
>>>>>>
>>>>>> Best Regards,
>>>>>> Strahil Nikolov
>>>>>>
>>>>>> On 6 August 2020 at 7:40:23 GMT+03:00, Nardus Geldenhuys <
>>>>>> nardusg(a)gmail.com> wrote:
>>>>>> >Hi Strahil
>>>>>> >
>>>>>> >Hope you are well. I get the following error when I try to confirm
>>>>>> >the reboot:
>>>>>> >
>>>>>> >Error while executing action: Cannot confirm 'Host has been rebooted'
>>>>>> >Host.
>>>>>> >Valid Host statuses are "Non operational", "Maintenance" or
>>>>>> >"Connecting".
>>>>>> >
>>>>>> >And I can't put it in maintenance; the only options are "restart" or
>>>>>> >"stop".
>>>>>> >
>>>>>> >Regards
>>>>>> >
>>>>>> >Nar
>>>>>> >
>>>>>> >On Thu, 6 Aug 2020 at 06:16, Strahil Nikolov <hunter86_bg(a)yahoo.com>
>>>>>> >wrote:
>>>>>> >
>>>>>> >> After rebooting the node, have you "marked" it as rebooted?
>>>>>> >>
>>>>>> >> Best Regards,
>>>>>> >> Strahil Nikolov
>>>>>> >>
>>>>>> >> On 5 August 2020 at 21:29:04 GMT+03:00, Nardus Geldenhuys <
>>>>>> >> nardusg(a)gmail.com> wrote:
>>>>>> >> >Hi oVirt land
>>>>>> >> >
>>>>>> >> >Hope you are well. Got a bit of an issue, actually a big issue.
>>>>>> >> >We had some sort of dip. All the VMs are still running, but some
>>>>>> >> >of the hosts are showing "Unassigned" or "NonResponsive". All the
>>>>>> >> >hosts were showing UP and were fine before our dip. I increased
>>>>>> >> >vdsHeartbeatInSecond to 240, no luck.
>>>>>> >> >
>>>>>> >> >I still get a timeout on the engine lock even though I can connect
>>>>>> >> >to that host from the engine using nc to test port 54321. I also
>>>>>> >> >restarted vdsmd and rebooted the host, with no luck.
>>>>>> >> >
>>>>>> >> > nc -v someserver 54321
>>>>>> >> >Ncat: Version 7.50 ( https://nmap.org/ncat )
>>>>>> >> >Ncat: Connected to 172.40.2.172:54321.
>>>>>> >> >
>>>>>> >> >2020-08-05 20:20:34,256+02 ERROR
>>>>>> >> >[org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
>>>>>> >> >(EE-ManagedThreadFactory-engineScheduled-Thread-70) [] EVENT_ID:
>>>>>> >> >VDS_BROKER_COMMAND_FAILURE(10,802), VDSM someserver command Get Host
>>>>>> >> >Capabilities failed: Message timeout which can be caused by
>>>>>> >> >communication issues
>>>>>> >> >
>>>>>> >> >Any troubleshooting ideas will be greatly appreciated.
>>>>>> >> >
>>>>>> >> >Regards
>>>>>> >> >
>>>>>> >> >Nar
>>>>>> >>
>>>>>>
>>>>
>>>>
>>>> --
>>>> Artur Socha
>>>> Senior Software Engineer, RHV
>>>> Red Hat
>>>>
>>>
>>
>>
>