[JIRA] (OVIRT-1763) Increase entropy for hosts
by Evgheni Dereveanchin (oVirt JIRA)
This is a multi-part message in MIME format...
------------=_1510753169-31571-405
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
[ https://ovirt-jira.atlassian.net/browse/OVIRT-1763?page=com.atlassian.jir... ]
Evgheni Dereveanchin commented on OVIRT-1763:
---------------------------------------------
And I think the difference in speed is caused by this part:
1) successful build 3834 taking 5 minutes to install packages - running in RAM:
02:49:59 [upgrade-from-release-suit] +++ df --output=avail /dev/shm
02:49:59 [upgrade-from-release-suit] +++ sed 1d
02:49:59 [upgrade-from-release-suit] ++ avail_shm=24589620
02:49:59 [upgrade-from-release-suit] ++ [[ 24589620 -ge 15000000 ]]
02:49:59 [upgrade-from-release-suit] ++ mkdir -p /dev/shm/ost
02:49:59 [upgrade-from-release-suit] ++ echo /dev/shm/ost/deployment-upgrade-from-release-suite-master
02:49:59 [upgrade-from-release-suit] + run_path=/dev/shm/ost/deployment-upgrade-from-release-suite-master
2) timed out build 3795 taking 15 minutes to install packages - running on disk:
15:29:24 [upgrade-from-release-suit] ++ local avail_shm
15:29:24 [upgrade-from-release-suit] +++ df --output=avail /dev/shm
15:29:24 [upgrade-from-release-suit] +++ sed 1d
15:29:24 [upgrade-from-release-suit] ++ avail_shm=14540900
15:29:24 [upgrade-from-release-suit] ++ [[ 14540900 -ge 15000000 ]]
15:29:24 [upgrade-from-release-suit] ++ echo /home/jenkins/workspace/ovirt-master_change-queue-tester/ovirt-system-tests/deployment-upgrade-from-release-suite-master
15:29:24 [upgrade-from-release-suit] + run_path=/home/jenkins/workspace/ovirt-master_change-queue-tester/ovirt-system-tests/deployment-upgrade-from-release-suite-master
Logging into the node where build 3795 ran, I see 24GB free on the RAM drive when no jobs are running. I'm not sure whether anything should still be on the RAM drive by the time that check is made. Depending on the answer, we can either clean it up, change the check, or increase the timeout. [~bkorren(a)redhat.com], do you know any details about RAM drive specifics for OST?
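The decision visible in the two traces can be sketched as follows. This is a minimal Python illustration, not the actual OST script: the function names and parameters are hypothetical. Only the 15000000 threshold (in the 1K blocks that df reports), the /dev/shm/ost prefix and the deployment-<suite> naming are taken from the logs above.

```python
import os

# Threshold seen in the traces, in 1K blocks as reported by `df --output=avail`.
SHM_THRESHOLD_KB = 15_000_000


def avail_kb(path="/dev/shm"):
    """Free space available to unprivileged users on `path`, in 1K blocks."""
    st = os.statvfs(path)
    return st.f_bavail * st.f_frsize // 1024


def choose_run_path(avail, suite, workspace, threshold=SHM_THRESHOLD_KB):
    """Pick the RAM drive when enough space is free, else fall back to the workspace.

    `suite` and `workspace` are hypothetical parameter names for illustration.
    """
    base = "/dev/shm/ost" if avail >= threshold else workspace
    return f"{base}/deployment-{suite}"
```

With the values from the logs, 24589620 selects the RAM drive and 14540900 falls back to disk. Build 3795 was 459100 blocks (about 448 MiB) short of the threshold.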
> Increase entropy for hosts
> --------------------------
>
> Key: OVIRT-1763
> URL: https://ovirt-jira.atlassian.net/browse/OVIRT-1763
> Project: oVirt - virtualization made easy
> Issue Type: Bug
> Reporter: Dafna Ron
> Assignee: infra
>
> We had a failure in OST that was really hard to debug: http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/3795/
> There are no failures in the logs and the test itself was terminated by a timeout.
> It took the VMs a long time to download and install packages, and Didi seems to think that this is due to limited entropy on the physical host.
> We need to review this issue and increase the entropy on the hosts.
--
This message was sent by Atlassian Jira
(v1001.0.0-SNAPSHOT#100071)
------------=_1510753169-31571-405--
7 years
[JIRA] (OVIRT-1763) Increase entropy for hosts
by Barak Korren (oVirt JIRA)
This is a multi-part message in MIME format...
------------=_1510753093-27974-399
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
[ https://ovirt-jira.atlassian.net/browse/OVIRT-1763?page=com.atlassian.jir... ]
Barak Korren commented on OVIRT-1763:
-------------------------------------
{quote}
How? We have virtio-rng. What else would you like to add? rngd?
{quote}
If Lago provides virtio-rng, then it should be sufficient (and haveged should not be needed), but do we have everything configured correctly so that virtio-rng shows up as {{/dev/random}} in the VMs?
Also, maybe the physical hosts themselves are running out of entropy. [~ederevea], do we have haveged installed on them? I seem to remember we had Puppet installing it everywhere.
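One way to check both questions on a given machine is to read what the kernel itself reports. The sketch below is only an illustration with hypothetical function names. It assumes a Linux guest or host, where the entropy estimate is exposed in /proc/sys/kernel/random/entropy_avail and registered hardware RNGs (a virtio-rng device is typically listed as something like virtio_rng.0 and exposed as /dev/hwrng) appear under /sys/class/misc/hw_random. The `root` parameter exists only so the function can be pointed at a copied directory tree.

```python
from pathlib import Path


def _read(path):
    """Return the stripped file content, or None if it cannot be read."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


def entropy_report(root="/"):
    """Collect the kernel's entropy estimate and hardware RNG information."""
    base = Path(root)
    avail = _read(base / "proc/sys/kernel/random/entropy_avail")
    current = _read(base / "sys/class/misc/hw_random/rng_current")
    available = _read(base / "sys/class/misc/hw_random/rng_available")
    return {
        "entropy_avail": int(avail) if avail else None,
        "rng_current": current,
        "rng_available": available.split() if available else [],
        # True only if some registered hardware RNG name mentions virtio.
        "has_virtio_rng": bool(available) and "virtio" in available,
    }


if __name__ == "__main__":
    print(entropy_report())
```

Running this inside a Lago VM and on the physical host would show whether a virtio RNG is registered in the guest and how much entropy each side reports.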
------------=_1510753093-27974-399--
7 years
[JIRA] (OVIRT-1763) Increase entropy for hosts
by Evgheni Dereveanchin (oVirt JIRA)
This is a multi-part message in MIME format...
------------=_1510752474-30503-445
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
[ https://ovirt-jira.atlassian.net/browse/OVIRT-1763?page=com.atlassian.jir... ]
Evgheni Dereveanchin commented on OVIRT-1763:
---------------------------------------------
From the provided job log I see an upgrade suite failure during add-host.
In engine.log [1] there's a long update sequence at the end that contains 570 packages, including the kernel, systemd, glibc and rpm, so it looks like a full "yum update" is being run on the hypervisor. Is this expected? Shouldn't we just install VDSM and friends?
On the host itself [2] I see the upgrade progressing normally, with no severe hangups, so I cannot see any direct proof that a lack of entropy is causing this. On successful runs the "yum update" with 570 packages takes 5 minutes, not 15, so I will continue investigating.
In general, the timeout is not happening on the Lago hosts themselves but in the VMs where OST is running. Those should have 'haveged' installed and running inside to provide entropy. If they don't, it needs to be installed and started. [~gbenhaim(a)redhat.com], could you please confirm whether we have haveged in the Lago VMs?
[1] http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/3795/artifa...
[2] http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/3795/artifa...
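Whether haveged is actually running inside a VM could be confirmed with a check along these lines. This is a hedged sketch with hypothetical function names. It assumes the VMs use systemd and that the unit is named haveged, neither of which is confirmed in this thread. The `run` parameter is there only so the command runner can be swapped out.

```python
import subprocess


def service_state(name, run=subprocess.run):
    """Return the state printed by `systemctl is-active` for a unit."""
    try:
        result = run(
            ["systemctl", "is-active", name],
            capture_output=True,
            text=True,
        )
    except FileNotFoundError:
        return "unknown"  # systemctl is not available on this machine
    return result.stdout.strip() or "unknown"


def haveged_running(run=subprocess.run):
    """True only if the (assumed) haveged unit is reported as active."""
    return service_state("haveged", run) == "active"


if __name__ == "__main__":
    print("haveged:", service_state("haveged"))
```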
------------=_1510752474-30503-445--
7 years
[JIRA] (OVIRT-1764) add logs under /tmp to our job artifacts
by Barak Korren (oVirt JIRA)
This is a multi-part message in MIME format...
------------=_1510752029-21336-395
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
[ https://ovirt-jira.atlassian.net/browse/OVIRT-1764?page=com.atlassian.jir... ]
Barak Korren commented on OVIRT-1764:
-------------------------------------
The logs collected are declared in the LagoInitFile, so it's easy to add more by patching OST. This should probably be an OST ticket and not a general oVirt ticket.
I don't think we'd want to collect everything in /tmp, so I hope the relevant logs match some easy-to-detect file name patterns.
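To illustrate what selecting by file name pattern could look like, here is a small Python sketch. It is not how Lago or OST collects logs. The function name is hypothetical, and the pattern in the usage example is an assumed placeholder, since the real file names under /tmp are not given in this ticket.

```python
from fnmatch import fnmatch
from pathlib import Path


def select_logs(directory, patterns):
    """Return sorted files directly under `directory` whose names match any glob pattern."""
    return sorted(
        p
        for p in Path(directory).iterdir()
        if p.is_file() and any(fnmatch(p.name, pat) for pat in patterns)
    )


if __name__ == "__main__":
    # "ovirt-engine-setup-*.log" is a hypothetical pattern used for illustration.
    for path in select_logs("/tmp", ["ovirt-engine-setup-*.log"]):
        print(path)
```

Only regular files in the top level of the directory are considered, so unrelated subdirectories in /tmp are never picked up.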
> add logs under /tmp to our job artifacts
> ----------------------------------------
>
> Key: OVIRT-1764
> URL: https://ovirt-jira.atlassian.net/browse/OVIRT-1764
> Project: oVirt - virtualization made easy
> Issue Type: Bug
> Reporter: Dafna Ron
> Assignee: infra
>
> When engine fails to deploy, it does not save anything in the regular engine log.
> However, there is a copy saved under /tmp, which would make engine deployment failures easier to debug.
> Didi suggested that it would be a good idea to add the logs under /tmp to the Jenkins job artifacts to allow easier debugging when engine fails to deploy.
------------=_1510752029-21336-395--
7 years