Heartbeat issue caused channel being closed and set to null which caused a surprise for a reactor.
It seems that we may want to schedule closure of channel in this situation instead of calling it explicitly.

What was the cause of the heartbeat issues?

On Thu, Oct 27, 2016 at 4:12 AM, Allon Mureinik <amureini@redhat.com> wrote:
Seems like we cleared up the engine issues related to the recent injection changes.

I am now seeing stop errors, e.g.:

{"jsonrpc": "2.0", "id": "ea0c564f-bc17-4fc2-8f1b-67c4d28257c6", "result": {"cpuStatistics": {"1": {"cpuUser": "3.07", "nodeIndex": 0, "cpuSys": "3.00", "cpuIdle": "93.93"}, "0": {"cpuUser": "1.67", "nodeIndex": 0, "cpuSys": "2.07", "cpuIdle": "96.26"}}, "numaNodeMemFree": {"0": {"memPercent": 83, "memFree": "359"}}, "memShared": 0, "thpState": "always", "ksmMergeAcrossNodes": true, "vmCount": 0, "memUsed": "20", "storageDomains": {"b2bb3220-1eb3-426a-90c2-5e236aefbe1a": {"code": 0, "actual": true, "version": 0, "acquired": true, "delay": "0.000840117", "lastCheck": "7.1", "valid": true}, "3130195a-73f9-4490-b554-98a9205cead6": {"code": 0, "actual": true, "version": 4, "acquired": true, "delay": "0.00150771", "lastCheck": "7.5", "valid": true}, "1a9e202b-83b7-4bdc-9b0c-e76b83676068": {"code": 0, "actual": true, "version": 4, "acquired": true, "delay": "0.000590956",
2016-10-26 21:51:09,878 DEBUG [org.ovirt.engine.core.utils.timer.FixedDelayJobListener] (DefaultQuartzScheduler7) [6d206bd1] Rescheduling DEFAULT.org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller.invokeCallbackMethods#-9223372036854775783 as there is no unfired trigger.
2016-10-26 21:51:28,705 DEBUG [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [383dd6a0] Heartbeat exceeded. Closing channel
2016-10-26 21:51:28,708 DEBUG [org.ovirt.vdsm.jsonrpc.client.reactors.Reactor] (SSL Stomp Reactor) [383dd6a0] Internal server error: null: java.lang.NullPointerException
	at org.ovirt.vdsm.jsonrpc.client.reactors.SSLClient.write(SSLClient.java:102) [vdsm-jsonrpc-java-client.jar:]
	at org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient.processOutgoing(ReactorClient.java:245) [vdsm-jsonrpc-java-client.jar:]
	at org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient.process(ReactorClient.java:208) [vdsm-jsonrpc-java-client.jar:]
	at org.ovirt.vdsm.jsonrpc.client.reactors.SSLClient.process(SSLClient.java:125) [vdsm-jsonrpc-java-client.jar:]
	at org.ovirt.vdsm.jsonrpc.client.reactors.Reactor.processChannels(Reactor.java:89) [vdsm-jsonrpc-java-client.jar:]
	at org.ovirt.vdsm.jsonrpc.client.reactors.Reactor.run(Reactor.java:65) [vdsm-jsonrpc-java-client.jar:]

Piotr - any idea?

On Wed, Oct 26, 2016 at 9:34 PM, Nadav Goldin <ngoldin@redhat.com> wrote:
Its running now:
http://jenkins.ovirt.org/job/ovirt-engine_master_build-artifacts-fc24-x86_64/1037/,
when it finishes it will trigger the deploy job which will trigger the
experimental job. The build_artifacts job is triggered approximately
every 2 hours(depends on the load).
Inside the build-artifacts job you can see the last
commit(https://gerrit.ovirt.org/65776, which is already after your
commit)


On Wed, Oct 26, 2016 at 9:26 PM, Allon Mureinik <amureini@redhat.com> wrote:
> @Infra - the last experimental job I saw was from ~17:30 local Israel time.
> Any idea why another one isn't being triggered (or am I just being daft)?
>
> On Wed, Oct 26, 2016 at 6:27 PM, Allon Mureinik <amureini@redhat.com> wrote:
>>
>> Yipes.
>> [1] should fix that, waiting for the CI to run to merge.
>>
>> [1] https://gerrit.ovirt.org/#/c/65768/
>>
>> On Wed, Oct 26, 2016 at 3:42 PM, Nadav Goldin <ngoldin@redhat.com> wrote:
>>>
>>> Unfortunately it is still failing, see[1], the repository used was
>>> built from commit [2]. If you want to check the logs same links
>>> apply(just replace build number 2759->2782)
>>>
>>>
>>>
>>> [1] http://jenkins.ovirt.org/job/test-repo_ovirt_experimental_master/2782
>>> [2] https://gerrit.ovirt.org/#/c/65740/
>>>
>>> On Wed, Oct 26, 2016 at 11:33 AM, Tal Nisan <tnisan@redhat.com> wrote:
>>> > Should be fixed now on master as those patches were just merged:
>>> >
>>> > https://gerrit.ovirt.org/#/c/65738/ - "Move InjectorRule to VdsBroker":
>>> > aligns the InjectorRule's package with the Injector so it can
>>> > essentially be
>>> > used wherever the Injector is used [+12, -4]
>>> > https://gerrit.ovirt.org/#/c/65739 - "InjectorRule: override injector
>>> > anyway": Fixes a bug in InjectorRule where the Injector is only mocked
>>> > if
>>> > you call InjectorRule.bind [+6, -1]
>>> > https://gerrit.ovirt.org/#/c/65740 - "core: InjectorRule for injecting
>>> > members": Fixes a bug in InjectorRule to allow using
>>> > Injector.injectMembers
>>> > when using it [+18, -5]
>>> > https://gerrit.ovirt.org/#/c/65725 - "core: Fix AuditLogging": The
>>> > actual
>>> > fix. Basically, goes over all the places that create an
>>> > AuditLoggableBase
>>> > that needs injecting and take care of it [+155, -160]
>>> >
>>> >
>>> > On Wed, Oct 26, 2016 at 10:04 AM, Nadav Goldin <ngoldin@redhat.com>
>>> > wrote:
>>> >>
>>> >> Hi,
>>> >> We have new failure on OST from patches merged to master yesterday,
>>> >> the failure started after the merge of [1], but as there were quite a
>>> >> few patches merged quickly I can't make sure it is the one causing
>>> >> it(OST aren't ran per-patch).
>>> >> The test that fails is [2] when attempting to start the VM.
>>> >>
>>> >> The error from the API side:
>>> >>
>>> >> RequestError:
>>> >> status: 500
>>> >> reason: Internal Server Error
>>> >> detail: javax.ejb.EJBException: java.lang.NullPointerException
>>> >>     at
>>> >>
>>> >> org.jboss.as.ejb3.tx.CMTTxInterceptor.handleExceptionInNoTx(CMTTxInterceptor.java:213)
>>> >>     at
>>> >>
>>> >> org.jboss.as.ejb3.tx.CMTTxInterceptor.invokeInNoTx(CMTTxInterceptor.java:265)
>>> >>     at
>>> >>
>>> >> org.jboss.as.ejb3.tx.CMTTxInterceptor.supports(CMTTxInterceptor.java:374)
>>> >>     at
>>> >>
>>> >> org.jboss.as.ejb3.tx.CMTTxInterceptor.processInvocation(CMTTxInterceptor.java:243)
>>> >>     at
>>> >>
>>> >> org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:340)
>>> >> ....
>>> >>
>>> >> In the engine logs there are a few 'java.lang.NullPointerException'
>>> >> errors:
>>> >>
>>> >> 2016-10-25 11:53:52,845 INFO
>>> >> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogableBase]
>>> >> (org.ovirt.thread.pool-6-thread-2) [5e6a88be] Failed to get vds
>>> >> 'd60db21f-95f0-487b-9f17-44861e2610a7', error: null
>>> >> 2016-10-25 11:53:52,864 DEBUG
>>> >> [org.ovirt.engine.core.utils.timer.FixedDelayJobListener]
>>> >> (DefaultQuartzScheduler5) [] Rescheduling
>>> >>
>>> >>
>>> >> DEFAULT.org.ovirt.engine.core.bll.tasks.AsyncTaskManager.timerElapsed#-9223372036854775787
>>> >> as there is no unfired trigger.
>>> >> ...
>>> >> 2016-10-25 11:53:52,845 DEBUG
>>> >> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogableBase]
>>> >> (org.ovirt.thread.pool-6-thread-2) [5e6a88be] Exception:
>>> >> java.lang.NullPointerException
>>> >>         at
>>> >>
>>> >> org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogableBase.getVdsStatic(AuditLogableBase.java:633)
>>> >> [dal.jar:]
>>> >>         at
>>> >>
>>> >> org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogableBase.getVdsName(AuditLogableBase.java:504)
>>> >> [dal.jar:]
>>> >> ...
>>> >> 2016-10-25 11:53:52,837 INFO
>>> >> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogableBase]
>>> >> (org.ovirt.thread.pool-6-thread-2) [5e6a88be] Failed to get vds
>>> >> 'd60db21f-95f0-487b-9f17-44861e2610a7', error: null
>>> >> 2016-10-25 11:53:52,837 DEBUG
>>> >> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogableBase]
>>> >> (org.ovirt.thread.pool-6-thread-2) [5e6a88be] Exception:
>>> >> java.lang.NullPointerException
>>> >>         at
>>> >>
>>> >> org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogableBase.getVdsStatic(AuditLogableBase.java:633)
>>> >> [dal.jar:]
>>> >>         at
>>> >>
>>> >> org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogableBase.getVdsName(AuditLogableBase.java:504)
>>> >> [dal.jar:]
>>> >> ...
>>> >>
>>> >> The full engine logs can be found here[3] and the entire test suite
>>> >> logs here[4].
>>> >>
>>> >> Can anyone have a look?
>>> >>
>>> >> Thanks,
>>> >> Nadav.
>>> >>
>>> >>
>>> >> [1] https://gerrit.ovirt.org/#/c/65198/
>>> >> [2]
>>> >>
>>> >> https://github.com/oVirt/ovirt-system-tests/blob/master/basic_suite_master/test-scenarios/004_basic_sanity.py#L322
>>> >> [3]
>>> >>
>>> >> http://jenkins.ovirt.org/job/test-repo_ovirt_experimental_master/2759/artifact/exported-artifacts/basic_suite_master.sh-fc24/exported-artifacts/test_logs/basic_suite_master/post-004_basic_sanity.py/*zip*/post-004_basic_sanity.py.zip
>>> >> [4]
>>> >>
>>> >> http://jenkins.ovirt.org/job/test-repo_ovirt_experimental_master/2759/artifact/exported-artifacts/basic_suite_master.sh-fc24/exported-artifacts/
>>> >> _______________________________________________
>>> >> Devel mailing list
>>> >> Devel@ovirt.org
>>> >> http://lists.ovirt.org/mailman/listinfo/devel
>>> >
>>> >
>>
>>
>