On Wed, Jan 25, 2023 at 2:08 AM Peter H <peter(a)hashbang.org> wrote:
I'm working in a group that maintains a large oVirt setup based on 4.4.1, which works
very well. We are wary of upgrading it and prefer to set up a new installation and
gradually enlist the hosts one by one into the new installation.
We have tried 4.4.10 and 4.5.1 - 4.5.4 based on CentOS Stream 8, Rocky 8 and Alma Linux 9.1,
and ran into various problems. The worst was that the rpm db ended up in a catch-22
state.
Using Alma Linux 9.1 and the current oVirt 4.5.4 seems promising, as no rpm problems are
present after installation. We have only one nuisance left, which we have seen in all
installation attempts we have made since 4.4.10: when rebooting a host, it takes 10 minutes
before it's activated again. In 4.4.1 the hosts are activated a few seconds after they
have booted up.
I have found the following in the engine log:
2023-01-24 23:01:57,564+01 INFO [org.ovirt.engine.core.bll.SshHostRebootCommand]
(EE-ManagedThreadFactory-engine-Thread-1513) [2bb08d20] Waiting 600 seconds, for server to
finish reboot process.
Our ansible playbooks for deployment time out. We could increase their timeout, but why
has this 10-minute delay been introduced?
Does a config file exist where this timeout can be set to a lower value?
I intended to provide a short reply just pointing out what value to
change, then realized this might not be helpful on its own and almost
decided not to reply at all. Then I decided to take this opportunity
and write the following.
For background, please see:
https://lists.ovirt.org/archives/list/users@ovirt.org/thread/HEKKBM6MZEKB...
.
You do not need to be a developer to search and read source code. One
of the biggest advantages of FOSS is that you can do this, even
without knowing how to write/update it.
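For example, to follow along with what I describe below, something
like this should be enough (a rough sketch; the clone URL is the
upstream GitHub repository referenced further down in this mail):

    # clone the engine source and search it for the message from the log
    git clone https://github.com/oVirt/ovirt-engine.git
    cd ovirt-engine
    grep -rn 'for server to finish reboot' backend/
    grep -rn 'ServerRebootTimeout' packaging/
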
My main work in oVirt was in packaging/setup/backup/restore, not in
the engine itself or vdsm - the two main parts of the project. But I
know enough to guess that the error message you got is from the
engine. I already have the engine source code git cloned on my laptop,
so I grepped it for 'for server to finish reboot', and found this in
backend/manager/modules/bll/src/main/java/org/ovirt/engine/core/bll/VdsCommand.java:
    private void sleepOnReboot(final VDSStatus status) {
        int sleepTimeInSec = Config.<Integer>getValue(ConfigValues.ServerRebootTimeout);
        log.info("Waiting {} seconds, for server to finish reboot process.",
                sleepTimeInSec);
Even without knowing Java, ServerRebootTimeout seems relevant.
Grepping for this finds it also in:

packaging/dbscripts/upgrade/pre_upgrade/0000_config.sql:582:select fn_db_add_config_value('ServerRebootTimeout','600','general');
packaging/dbscripts/upgrade/pre_upgrade/0000_config.sql:1460:-- Increase default ServerRebootTimeout from 5 to 10 minutes
packaging/dbscripts/upgrade/pre_upgrade/0000_config.sql:1461:select fn_db_update_default_config_value('ServerRebootTimeout', '300', '600', 'general', false);

where it's set and then updated, and in:

packaging/etc/engine-config/engine-config.properties:119:ServerRebootTimeout.description="Host Reboot Timeout (in seconds)"
packaging/etc/engine-config/engine-config.properties:120:ServerRebootTimeout.type=Integer

where it's exposed to engine-config. So if all you want is to make
this wait shorter (and get this message earlier), changing this value
should be enough.
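Something like this should do, on the engine machine (I did not verify
this exact sequence; 120 is just an example value, in seconds, and the
engine service needs a restart for the change to take effect):

    # check the current value
    engine-config -g ServerRebootTimeout
    # set a lower value and restart the engine to apply it
    engine-config -s ServerRebootTimeout=120
    systemctl restart ovirt-engine
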
However, I also checked the git log (or blame, if you prefer, but I
like the log) for the 0000_config.sql file, trying to understand when
and why the default was changed from 5 to 10 minutes. Running

    git log -u packaging/dbscripts/upgrade/pre_upgrade/0000_config.sql

and then searching for 'ServerRebootTimeout' finds
https://github.com/oVirt/ovirt-engine/commit/d324bbdd . That commit
links to https://bugzilla.redhat.com/1947403 , which sadly does not
provide many more details, but it does show that the change was made
in 4.4.6. So I can only guess that one of two things happened:
1. Someone complained that hosts became non-operational, e.g. because
their boot sequence/POST/whatever takes more than 5 minutes. Perhaps
this was rare enough to be reported and handled only recently (two
years ago, and not, say, 10 years ago). (Although I personally have
managed machines that needed more than 5 minutes to reboot, or even
just to test the RAM - but that's indeed rare.)
2. Something else changed and made this less comfortable. E.g.
perhaps the engine didn't move hosts to non-operational in the past
and now does, or something like that.
I am not sure which of these it was, if at all.
You are welcome to change it to some low value using engine-config and
see if it helps. If it's "just enough", you should notice no
difference from previous versions. If it's not enough, you might
indeed see different behavior and then decide how to continue - I can
think of a few ways:
1. Just set it to slightly more than your own machines' reboot times,
and accept that you might need to manually activate a host after a
reboot if it took longer for some reason.
2. Set it to a much lower value, and if that indeed causes the hosts
to always move to non-operational, create a script/daemon/whatever,
e.g. in ansible, to be used for rebooting, which tries to activate
the host in a loop, with some delay before and between attempts,
until it succeeds or gives up after some number of attempts (see the
sketch after this list).
3. Try to find out why the above patch was needed, and decide whether
it's important enough for you to dive deeper and provide, or at least
suggest, a fix/change/whatever that would make it unnecessary.
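As a sketch of option 2, a minimal shell loop against the REST API
could look roughly like this. The engine URL, credentials and host id
below are placeholders, I did not test it, and the same logic can of
course be written as an ansible playbook instead:

    # placeholders - adjust to your environment
    ENGINE='https://engine.example.com/ovirt-engine/api'
    AUTH='admin@internal:password'
    HOST_ID='<host-uuid>'

    # reboot the host here (e.g. via ssh), then give it some time to
    # come back up and try to activate it a few times, with a delay
    # in between
    sleep 60
    for i in $(seq 1 10); do
        # POST an empty <action/> to the host's activate endpoint;
        # -f makes curl exit non-zero on HTTP errors, so the loop
        # keeps retrying until activation is accepted
        curl -ksf -u "$AUTH" -X POST \
             -H 'Content-Type: application/xml' -H 'Accept: application/xml' \
             -d '<action/>' "$ENGINE/hosts/$HOST_ID/activate" && break
        sleep 30
    done

Whether the engine accepts the activation while it still considers the
host rebooting is something you would have to check - that's part of
what the experiment above would tell you.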
That said, we (the RHV team) are still interested, and if you think
it's a significant issue/bug, we might decide to spend the time and
come up with a fix/change ourselves. In the past, I might have tried
to guess who might know more details and Cced them. But at this point
in time, I decided it's better to explain what users can do by
themselves. I hope this is helpful, and that enough people will take
the time to read, understand, and apply my explanation as applicable.
Best regards,
--
Didi