Host Reboot Timeout of 10 Minutes


Peter H

25 Jan 2023

12:07 a.m.

I'm working in a group that maintains a large oVirt setup based on 4.4.1 which works very well. We are afraid of upgrading and prefer setting up a new installation and gradually enlisting the hosts one by one into the new installation.

We have tried 4.4.10 and 4.5.1 - 4.5.4 based on CentOS Stream 8, Rocky 8, Alma Linux 9.1 with various problems. Worst was the problem that the rpm db ended up in a catch-22 state.

Using Alma Linux 9.1 and current oVirt 4.5.4 seems promising as no rpm problems are present after installation. We have only one nuisance left, which we have seen in all installation attempts we have made since 4.4.10. When rebooting a host it takes 10 minutes before it's activated again. In 4.4.1 the hosts are activated a few seconds after they have booted up.

I have found the following in the engine log:

    2023-01-24 23:01:57,564+01 INFO [org.ovirt.engine.core.bll.SshHostRebootCommand] (EE-ManagedThreadFactory-engine-Thread-1513) [2bb08d20] Waiting 600 seconds, for server to finish reboot process.

Our ansible playbooks for deployment time out. We could increase their timeout, but why was this 10-minute delay introduced?

Does a config file exist where this timeout can be set to a lower value?

BR
Peter H.


Yedidyah Bar David

25 Jan 2023

7:22 a.m.

On Wed, Jan 25, 2023 at 2:08 AM Peter H <peter@hashbang.org> wrote:

...

Does a config file exist where this timeout can be set to a lower value?

I intended to provide a short reply just pointing out what value to change, then realized this might not be helpful, so decided to give up and not reply. Then I decided to take this opportunity and write the following.

For background, please see: https://lists.ovirt.org/archives/list/users@ovirt.org/thread/HEKKBM6MZEKBEAX... .

You do not need to be a developer to search and read source code. One of the biggest advantages of FOSS is that you can do this, even without knowing how to write/update it.

My main work in oVirt was in packaging/setup/backup/restore, not in the engine itself or vdsm - the two main parts of the project. But I know enough to guess that the error message you got is from the engine. I already have the engine source code git cloned on my laptop, so I grepped it for 'for server to finish reboot', and found this in backend/manager/modules/bll/src/main/java/org/ovirt/engine/core/bll/VdsCommand.java:

    private void sleepOnReboot(final VDSStatus status) {
        int sleepTimeInSec = Config.<Integer> getValue(ConfigValues.ServerRebootTimeout);
        log.info("Waiting {} seconds, for server to finish reboot process.", sleepTimeInSec);

Even without knowing Java, ServerRebootTimeout seems relevant. Grepping for this finds it also in:

    packaging/dbscripts/upgrade/pre_upgrade/0000_config.sql:582:select fn_db_add_config_value('ServerRebootTimeout','600','general');
    packaging/dbscripts/upgrade/pre_upgrade/0000_config.sql:1460:-- Increase default ServerRebootTimeout from 5 to 10 minutes
    packaging/dbscripts/upgrade/pre_upgrade/0000_config.sql:1461:select fn_db_update_default_config_value('ServerRebootTimeout', '300', '600', 'general', false);

where it's set and then updated, and in:

    packaging/etc/engine-config/engine-config.properties:119:ServerRebootTimeout.description="Host Reboot Timeout (in seconds)"
    packaging/etc/engine-config/engine-config.properties:120:ServerRebootTimeout.type=Integer

where it's exposed to engine-config.

So if all you want is to get this error message earlier, this should be enough.

However, I also checked the git log (or blame, if you want, but I prefer the log) for the former file, trying to understand when and why it was changed from 5 to 10 minutes. 'git log -u packaging/dbscripts/upgrade/pre_upgrade/0000_config.sql' and then searching for 'ServerRebootTimeout' finds https://github.com/oVirt/ovirt-engine/commit/d324bbdd . This links to https://bugzilla.redhat.com/1947403 . That one sadly does not provide many more details. It does show that it was done in 4.4.6. So I can only guess that one of two things happened:

1. Someone complained that hosts become non-operational e.g. because their boot sequence/POST/whatever takes more than 5 minutes. Perhaps this was rare enough to be reported and handled only recently (two years ago, and not, say, 10). (Although I personally managed machines that needed more than 5 minutes to reboot, or even just test the RAM - but that's indeed rare).

2. Something else changed, and made this less comfortable. E.g. perhaps the engine didn't move them in the past to non-operational and now does, or something like that.

Not sure which of these, if at all.

You are welcome to change it to some low value using engine-config and see if it helps. If it's "just enough", you should notice no difference from previous versions. If it's not enough, you might indeed see different behavior and then decide how to continue - I can think of a few ways:

1. Just set it to slightly more than your own machines' reboot times, and accept that you might need to manually activate a host after reboot if it took longer for some reason.

2. Set it to a much lower value, and if indeed it causes the hosts to always move to non-operational, try to create a script/daemon/whatever, e.g. in ansible, to be used for rebooting, which will try to activate the host in a loop with some delay before and in-between attempts, until it succeeds or gives up after some attempts.

3. Try to find out why the above patch was needed and think if it's important enough for you to dive deeper and provide, or at least suggest, a fix/change/whatever that will make it unnecessary.

That said, we (the RHV team) are still interested, and if you think it's a significant issue/bug, we might decide to spend the time and come up with a fix/change ourselves. In the past, I might have tried to guess who might know more details and Cced them. But at this point in time, I decided it's better to explain what users can do by themselves. I hope this is helpful, and that enough people will take the time to read, understand, and apply my explanation as applicable.

Best regards,
--
Didi
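The source search described in this message (grepping a local clone for a log string, then for the config key it leads to) can also be reproduced without grep. The following is only a minimal sketch, not part of the original reply: the function name `grep_tree`, the suffix filter, and the clone directory name `ovirt-engine` are assumptions made for illustration.

```python
import os


def grep_tree(root, needle, suffixes=(".java", ".sql", ".properties")):
    """Yield (path, line_number, line) for each line under root containing needle.

    Only files whose names end in one of `suffixes` are read; the default
    suffix list is an assumption chosen to match the files named in the thread.
    """
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(suffixes):
                continue
            path = os.path.join(dirpath, name)
            # errors="replace" keeps the walk going past oddly encoded files.
            with open(path, encoding="utf-8", errors="replace") as handle:
                for number, line in enumerate(handle, start=1):
                    if needle in line:
                        yield path, number, line.rstrip("\n")


if __name__ == "__main__":
    # "ovirt-engine" is assumed to be a local git clone of the engine source.
    for path, number, line in grep_tree("ovirt-engine", "ServerRebootTimeout"):
        print(f"{path}:{number}:{line}")
```

The output format (path:line:text) mirrors the grep output quoted in the message, so the two can be compared directly.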

Peter H

27 Feb 2023

10:30 a.m.

On Wed, Jan 25, 2023 at 8:22 AM Yedidyah Bar David <didi@redhat.com> wrote:

...

On Wed, Jan 25, 2023 at 2:08 AM Peter H <peter@hashbang.org> wrote:

...
Does a config file exist where this timeout can be set to a lower value?

I intended to provide a short reply just pointing out what value to change, then realized this might not be helpful, so decided to give up and not reply. Then I decided to take this opportunity and write the following.

For background, please see: https://lists.ovirt.org/archives/list/users@ovirt.org/thread/HEKKBM6MZEKBEAX...

Thanks for taking the time to answer me at such length. I was unaware of the tool engine-config(1) even after maintaining an oVirt cluster for 3 years... The tool was just what I was looking for.

...

You do not need to be a developer, to search and read source code. One of the biggest advantages of FOSS is that you can do this, even without knowing how to write/update it.

I installed my first Linux OS (Slackware) back in '93 so I have downloaded and compiled my fair share of Free/OpenSource projects. In the last century I also got a couple of kernel patches accepted. I actually checked the code a couple of years ago while investigating the logic behind the dropdown menu regarding VM types. I found out the UI code was written in some Java framework that was quite hard to understand for me.

...

My main work in oVirt was in packaging/setup/backup/restore, not in the engine itself or vdsm - the two main parts of the project. But I know enough to guess that the error message you got is from the engine. I already have the engine source code git cloned on my laptop, so grepped it for 'for server to finish reboot', and found this in backend/manager/modules/bll/src/main/java/org/ovirt/engine/core/bll/VdsCommand.java:

I acknowledge that I could have found the message in the code, but I'm unsure whether I would then have made the connection that led me to discover the engine-config(1) tool.

...

    private void sleepOnReboot(final VDSStatus status) {
        int sleepTimeInSec = Config.<Integer> getValue(ConfigValues.ServerRebootTimeout);
        log.info("Waiting {} seconds, for server to finish reboot process.", sleepTimeInSec);

Even without knowing Java, ServerRebootTimeout seems relevant. grepping for this, finds it also in:

    packaging/dbscripts/upgrade/pre_upgrade/0000_config.sql:582:select fn_db_add_config_value('ServerRebootTimeout','600','general');
    packaging/dbscripts/upgrade/pre_upgrade/0000_config.sql:1460:-- Increase default ServerRebootTimeout from 5 to 10 minutes
    packaging/dbscripts/upgrade/pre_upgrade/0000_config.sql:1461:select fn_db_update_default_config_value('ServerRebootTimeout', '300', '600', 'general', false);

where it's set and then updated, and in:

    packaging/etc/engine-config/engine-config.properties:119:ServerRebootTimeout.description="Host Reboot Timeout (in seconds)"
    packaging/etc/engine-config/engine-config.properties:120:ServerRebootTimeout.type=Integer

where it's exposed to engine-config. So if all you want is to get this error message earlier, this should be enough.

I can confirm that the ServerRebootTimeout is set to 300 in our current 4.4.1 installation. I have also tested that I can change it in my 4.5.4 test system using:

    engine-config -s ServerRebootTimeout=300
    systemctl restart ovirt-engine
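For deployments driven by scripts, the same change could be wrapped so that it is applied and then read back in one step. This is a sketch under assumptions, not something from the thread: the helper names are invented, and `engine-config -g` (the get option) is used for the read-back although only `-s` appears in the message above. It has to run as root on the engine machine.

```python
import subprocess

KEY = "ServerRebootTimeout"
RESTART_COMMAND = ["systemctl", "restart", "ovirt-engine"]


def get_command(key):
    """Argument list that asks engine-config for the current value of key."""
    return ["engine-config", "-g", key]


def set_command(key, value):
    """Argument list that sets key to value, as done by hand in the thread."""
    return ["engine-config", "-s", f"{key}={value}"]


def apply_timeout(seconds, run=subprocess.run):
    """Set the reboot timeout, restart the engine, then print the stored value.

    `run` is injectable so the command sequence can be inspected without
    touching a real engine; by default it is subprocess.run.
    """
    if seconds <= 0:
        raise ValueError("timeout must be a positive number of seconds")
    for command in (set_command(KEY, seconds), RESTART_COMMAND, get_command(KEY)):
        run(command, check=True)


if __name__ == "__main__":
    # Dry run: show the commands instead of executing them.
    apply_timeout(300, run=lambda command, check: print(" ".join(command)))
```

The dry run prints the three commands in order; passing no `run` argument executes them for real.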

...

However, I also checked the git log (or blame, if you want, but I prefer the log) for the former file, trying to understand when and why it was changed from 5 to 10 minutes. 'git log -u packaging/dbscripts/upgrade/pre_upgrade/0000_config.sql' and then searching for 'ServerRebootTimeout' finds https://github.com/oVirt/ovirt-engine/commit/d324bbdd . This links to https://bugzilla.redhat.com/1947403 . That one sadly does not provide many more details. It does show that it was done in 4.4.6. So I can only guess that one of two things happened:

1. Someone complained that hosts become non-operational e.g. because their boot sequence/POST/whatever takes more than 5 minutes. Perhaps this was rare enough to be reported and handled only recently (two years ago, and not, say, 10). (Although I personally managed machines that needed more than 5 minutes to reboot, or even just test the RAM - but that's indeed rare).

We have some HP servers in our 4.4.1 installation that take around 6-7 minutes to reboot after selecting Restart through the SSH Management dropdown menu. Even though this is longer than 5 minutes (300 secs), the installation never fails. The state is Reboot for 5 minutes, then it goes into NonResponsive for some minutes until the state is set to Up.

...

2. Something else changed, and made this less comfortable. E.g. perhaps the engine didn't move them in the past to non-operational and now does, or something like that.

Not sure which of these, if at all.

I'm seeing some differences between 4.4.1 and 4.5.4.

In 4.4.1, with the default timeout of 5 minutes, a host would be set to Up just after reboot. That could be within 3, 5 or 7 minutes.

In 4.5.4 this seems to have changed. With a 10 minute timeout, a host that reboots in 5 minutes will not be set to "Up" before the 10 minutes have passed. It's a long time to sit and wait when you know the host is up. If I lower the timeout to 1 minute, the host state will go from Reboot to NonResponsive and finally Up very shortly after the host has started up again.

So in 4.5.4 it seems that hosts can't connect until ServerRebootTimeout seconds after the reboot was initiated.
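The observation above can be written down as a small model. This is only a restatement of the behaviour reported in this message, not engine code: the function name and the `waits_full_timeout` flag are invented for illustration, and the short delay between a host finishing its boot and the engine noticing it is ignored.

```python
def seconds_until_up(boot_seconds, timeout_seconds, waits_full_timeout):
    """Rough time from initiating a reboot until the host is shown as Up.

    waits_full_timeout=True models the reported 4.5.4 behaviour, where the
    host is not set to Up before ServerRebootTimeout has elapsed.
    waits_full_timeout=False models the reported 4.4.1 behaviour, where the
    host is set to Up shortly after it has booted, whatever the timeout.
    """
    if waits_full_timeout:
        return max(boot_seconds, timeout_seconds)
    return boot_seconds


if __name__ == "__main__":
    # 4.5.4, host boots in 5 minutes, timeout 10 minutes: Up after 600 s.
    print(seconds_until_up(300, 600, waits_full_timeout=True))
    # 4.5.4, same host, timeout lowered to 1 minute: Up after 300 s.
    print(seconds_until_up(300, 60, waits_full_timeout=True))
    # 4.4.1, host boots in 7 minutes, timeout 5 minutes: Up after 420 s.
    print(seconds_until_up(420, 300, waits_full_timeout=False))
```

Under this model, lowering the timeout below the real boot time costs nothing in 4.5.4 except the interval spent in NonResponsive, which matches what is described above.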

...

You are welcome to change it to some low value using engine-config and see if it helps. If it's "just enough", you should notice no difference from previous versions.

Setting the timeout to 1 minute seems to work fine. We will probably just need to work on our alerting rules because the hosts will be in NonResponsive state for a while before coming "Up".
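A deployment script that has to tolerate the NonResponsive interval could poll the host status until it reports Up instead of failing on the first non-Up state. The sketch below assumes a hypothetical `get_status` callable that returns the status string shown in the engine ("Reboot", "NonResponsive", "Up"); how that status is fetched (REST API, SDK, ansible) is left open and is not taken from the thread.

```python
import time


def wait_until_up(get_status, timeout_seconds=900, poll_seconds=15,
                  sleep=time.sleep, clock=time.monotonic):
    """Poll get_status() until it returns "Up"; return the states seen in order.

    Raises TimeoutError if the host is still not Up once timeout_seconds have
    elapsed. `sleep` and `clock` are injectable so the loop can be exercised
    without real waiting.
    """
    deadline = clock() + timeout_seconds
    seen = []
    while True:
        status = get_status()
        # Record each state once, when it first appears or changes.
        if not seen or seen[-1] != status:
            seen.append(status)
        if status == "Up":
            return seen
        if clock() >= deadline:
            raise TimeoutError(f"host not Up after {timeout_seconds} s, last state {status}")
        sleep(poll_seconds)


if __name__ == "__main__":
    # Simulated status sequence standing in for a real engine query.
    states = iter(["Reboot", "Reboot", "NonResponsive", "Up"])
    print(wait_until_up(lambda: next(states), poll_seconds=0))
```

The returned list of states could also feed alerting rules, for example by ignoring NonResponsive when it directly follows Reboot.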

...

3. Try to find out why the above patch was needed and think if it's important enough for you to dive deeper and provide, or at least suggest, a fix/change/whatever that will make it unnecessary.

That said, we (the RHV team) are still interested, and if you think it's a significant issue/bug, we might decide to spend the time and come up with a fix/change ourselves. In the past, I might have tried to guess who might know more details and Cced them. But at this point in time, I decided it's better to explain what users can do by themselves. I hope this is helpful, and that enough people will take the time to read, understand, and apply my explanation as applicable.

Best regards, -- Didi

The issue is not a blocker for us as things work. It's just a matter of how long we have to wait or deal with the NonResponsive state. We can install hosts and add them to the cluster without failures.

For the versions 4.5.1 - 4.5.3 it was another matter. Here the reboot timeout caused the installations to fail. I wonder how this was not noticed by others or caught by some regression test.

Once again, thanks for your answer.

BR
Peter
