
On Wed, Jan 25, 2023 at 2:08 AM Peter H <peter@hashbang.org> wrote:
I'm working in a group that maintains a large oVirt setup based on 4.4.1, which works very well. We are afraid of upgrading and prefer to set up a new installation, gradually enlisting the hosts one by one into it.
We have tried 4.4.10 and 4.5.1-4.5.4 based on CentOS Stream 8, Rocky Linux 8 and AlmaLinux 9.1, with various problems. The worst was that the rpm db ended up in a catch-22 state.
Using AlmaLinux 9.1 and the current oVirt 4.5.4 seems promising, as no rpm problems are present after installation. We have only one nuisance left, which we have seen in all installation attempts since 4.4.10: when rebooting a host, it takes 10 minutes before it is activated again. In 4.4.1 the hosts are activated a few seconds after they have booted up.
I have found the following in the engine log:

    2023-01-24 23:01:57,564+01 INFO [org.ovirt.engine.core.bll.SshHostRebootCommand] (EE-ManagedThreadFactory-engine-Thread-1513) [2bb08d20] Waiting 600 seconds, for server to finish reboot process.

Our ansible playbooks for deployment time out. We could increase the timeout, but why has this 10-minute delay been introduced?
Does a config file exist where this timeout can be set to a lower value?
I intended to provide a short reply just pointing out which value to change, then realized this might not be helpful, so decided to give up and not reply. Then I decided to take this opportunity and write the following. For background, please see: https://lists.ovirt.org/archives/list/users@ovirt.org/thread/HEKKBM6MZEKBEAX...

You do not need to be a developer to search and read source code. One of the biggest advantages of FOSS is that you can do this, even without knowing how to write/update it. My main work in oVirt was in packaging/setup/backup/restore, not in the engine itself or vdsm - the two main parts of the project. But I know enough to guess that the message you got is from the engine.

I already have the engine source code git-cloned on my laptop, so I grepped it for 'for server to finish reboot', and found this in backend/manager/modules/bll/src/main/java/org/ovirt/engine/core/bll/VdsCommand.java:

    private void sleepOnReboot(final VDSStatus status) {
        int sleepTimeInSec = Config.<Integer> getValue(ConfigValues.ServerRebootTimeout);
        log.info("Waiting {} seconds, for server to finish reboot process.", sleepTimeInSec);

Even without knowing Java, ServerRebootTimeout seems relevant. Grepping for that finds it also in:

    packaging/dbscripts/upgrade/pre_upgrade/0000_config.sql:582:select fn_db_add_config_value('ServerRebootTimeout','600','general');
    packaging/dbscripts/upgrade/pre_upgrade/0000_config.sql:1460:-- Increase default ServerRebootTimeout from 5 to 10 minutes
    packaging/dbscripts/upgrade/pre_upgrade/0000_config.sql:1461:select fn_db_update_default_config_value('ServerRebootTimeout', '300', '600', 'general', false);

where it's set and then updated, and in:

    packaging/etc/engine-config/engine-config.properties:119:ServerRebootTimeout.description="Host Reboot Timeout (in seconds)"
    packaging/etc/engine-config/engine-config.properties:120:ServerRebootTimeout.type=Integer

where it's exposed to engine-config. So if all you want is to make the engine stop waiting sooner, that should be enough.

However, I also checked the git log (or blame, if you want, but I prefer the log) for the former file, trying to understand when and why the default was changed from 5 to 10 minutes. 'git log -u packaging/dbscripts/upgrade/pre_upgrade/0000_config.sql' and then searching for 'ServerRebootTimeout' finds https://github.com/oVirt/ovirt-engine/commit/d324bbdd . That commit links to https://bugzilla.redhat.com/1947403 , which sadly does not provide many more details. It does show that the change was made in 4.4.6. So I can only guess that one of two things happened:

1. Someone complained that hosts become non-operational, e.g. because their boot sequence/POST/whatever takes more than 5 minutes. Perhaps this was rare enough to be reported and handled only recently (two years ago, and not, say, 10). (Although I personally managed machines that needed more than 5 minutes to reboot, or even just to test the RAM - but that's indeed rare.)

2. Something else changed and made this less comfortable. E.g. perhaps the engine didn't move such hosts to non-operational in the past and now does, or something like that.

Not sure which of these, if at all.

You are welcome to change it to some lower value using engine-config and see if it helps.
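Something along these lines, on the engine machine, should do it (untested, off the top of my head; note that engine-config changes generally only take effect after restarting the engine):

    # show the current value, in seconds
    engine-config -g ServerRebootTimeout
    # set it back to the old default of 5 minutes
    engine-config -s ServerRebootTimeout=300
    systemctl restart ovirt-engine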
If it's "just enough", you should notice no difference from previous versions. If it's not enough, you might indeed see different behavior, and then decide how to continue - I can think of a few ways:

1. Just set it to slightly more than your own machines' reboot times, and accept that you might need to manually activate a host after a reboot if it took longer for some reason.

2. Set it to a much lower value, and if that indeed causes the hosts to always move to non-operational, create a script/daemon/whatever, e.g. in ansible, to be used for rebooting, which tries to activate the host in a loop with some delay before and in-between attempts, until it succeeds or gives up after some attempts (a rough sketch is at the end of this mail).

3. Try to find out why the above patch was needed, and think whether it's important enough for you to dive deeper and provide, or at least suggest, a fix/change/whatever that would make it unnecessary.

That said, we (the RHV team) are still interested, and if you think this is a significant issue/bug, we might decide to spend the time and come up with a fix/change ourselves.

In the past, I might have tried to guess who knows more details and Cced them. But at this point in time, I decided it's better to explain what users can do by themselves. I hope this is helpful, and that enough people will take the time to read, understand, and apply my explanation as applicable.

Best regards,
--
Didi
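P.S. Regarding (2) above, here is a very rough, untested sketch of such a loop, using the engine's REST API with curl. The engine URL, credentials, host id, retry count and delay are only placeholders - adjust them to your environment:

    #!/bin/bash
    # After rebooting a host, poll the engine and try to activate the host
    # until it is up, or give up after 20 attempts.
    ENGINE="https://engine.example.com/ovirt-engine/api"   # engine API URL (placeholder)
    AUTH="admin@internal:password"                          # API credentials (placeholder)
    HOST_ID="<host-uuid>"                                   # id of the rebooted host (placeholder)

    for attempt in $(seq 1 20); do
        # current host status, e.g. <status>up</status> or <status>non_operational</status>
        status=$(curl -ks -u "$AUTH" "$ENGINE/hosts/$HOST_ID" \
                 | grep -o '<status>[a-z_]*</status>' | head -1)
        if [ "$status" = "<status>up</status>" ]; then
            echo "Host is up"
            exit 0
        fi
        # try to activate; this simply fails while the host is still down or rebooting
        curl -ks -u "$AUTH" -X POST -H 'Content-Type: application/xml' \
             -d '<action/>' "$ENGINE/hosts/$HOST_ID/activate" > /dev/null
        sleep 30
    done
    echo "Gave up waiting for the host to come up" >&2
    exit 1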