On 2021-09-05 at 9:24, Yedidyah Bar David wrote:
Hi,
On Sun, Sep 5, 2021 at 9:19 AM <erdosi.peter(a)kifu.gov.hu> wrote:
> Hi!
>
> I've just finished the process in the subject and want to share a few things
> with others, and also want to ask a question.
Thanks :-)
> Long story "short" (it's my IT infra at home; I run oVirt in bigger
> setups at work): I had 3 machines before: two desktops and one Dell R710 server. One of
> the desktops ran FreeNAS with an SSD RAID1; the other desktop and my Dell machine were
> my two hypervisors. Luckily, I just got two HP DL380 G6 servers, which became my
> hypervisors, and my Dell machine became the TrueNAS storage (redundant PSU, more cores,
> more RAM, more disks :) ).
>
> When I started the procedure I used the latest oVirt 4.3, but it was time to
> upgrade the version, and I also wanted to migrate all my data (the self-hosted engine
> included) to the TrueNAS Dell machine (but... I only had the two SSDs in my old FreeNAS,
> so they had to be moved).
>
> After I replaced the hypervisors and migrated all my VM data to new disks on the new
> storage (iSCSI->iSCSI, live and cold storage migration), it was time to shut off
> the FreeNAS and start the HE redeploy.
>
> The main steps I took:
> - Undeployed the hosted-engine from the host with ID 2 (the ID came from the
>   "hosted-engine --vm-status" command; the machine's name is Jupiter)
> - Moved all my VMs to this host; only the HE remained on the machine with ID 1
>   (name: Saturn)
> - Removed the remaining metadata with "hosted-engine --clean-metadata --host-id=2
>   --force-clean", so Saturn was the only machine capable of running the HE
> - Set global maintenance mode=true
> - Stopped the ovirt-engine in the old HE machine
> - Created a backup with "engine-backup --mode=backup --file=engine.backup
>   --log=engine-backup.log" and copied it to another machine (my desktop, actually)
> - "hosted-engine --vm-shutdown"
> - Shut down Saturn, 4.4.6 installer in (it was written to DVD before; I didn't want
>   to burn another one), complete reinstall of Saturn
> - Shut down FreeNAS, moved the SSDs to TrueNAS, created the RAID1/zvol, exported it over iSCSI...
> - After the base network was created and the backup was copied back to the new
>   Saturn, I started the new host deploy with: "hosted-engine --deploy
>   --restore-from-file=engine.backup"
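
For reference, here is the same command sequence consolidated into one minimal
sketch (host and file names are the ones from this setup; the initial HE
undeploy from Jupiter was done from the admin UI, and exact prompts may differ):

  # after undeploying hosted-engine from Jupiter (host ID 2), clean its stale HE metadata
  hosted-engine --clean-metadata --host-id=2 --force-clean

  # on Saturn (host ID 1), which keeps running the HE for now
  hosted-engine --set-maintenance --mode=global

  # inside the old HE VM: stop the engine and take the backup
  systemctl stop ovirt-engine
  engine-backup --mode=backup --file=engine.backup --log=engine-backup.log
  # copy engine.backup off the HE VM (e.g. to a desktop), then, back on Saturn:
  hosted-engine --vm-shutdown

  # after reinstalling Saturn with 4.4 and copying engine.backup onto it:
  hosted-engine --deploy --restore-from-file=engine.backup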
>
> Catch 1 and 2: If you do this, you must know the old DC and the Cluster name!
Are you sure you _must_? I think it will create a new DC/cluster for you
if you type new names. Obviously, this might not be what you want.
So probably what you mean is something like:
The deploy process asked me for them at the beginning. I don't have
screenshots of it, but it was clear. Since I only have one DC and one
Cluster, and I don't want the new HE to be put into another one, I think I
must do it, or my restore process would halt, or I might get two clusters
with one machine each, which is not what I want.
If I restore from backup, I think the tool should check the DC/cluster
names in the backup, and let me select among them, perhaps even try to
intelligently guess the one that should be default (or just pick one
at random).
The tool does not do this: this question comes at the beginning, while the
restore file is only used (at least based on the Ansible tasks I saw
running) after the local VM is created; it is just copied back into it, and
then engine-restore is started.
I agree this makes sense, you are welcome to file an RFE bug.
Not sure it will be prioritized - patches are welcome!
Maybe :)
But what happens if you have multiple DCs and/or Clusters?!
> write it down somewhere! Or you need to get the PostgreSQL DBs out of the backup and
> extract it from them... (like I had to)
It would also be useful if you detailed your commands as a workaround, if
you file such a bug.
I think it's not a bug, it's my fault, but this is what I did
(after unpacking the backup file and getting the two .db files out of it,
and installing PostgreSQL 11 on a dev machine):

pg_restore engine_backup.db > engine.sql
pg_restore dwh_backup.db > dwh.sql
createdb dwh
createdb engine
createuser engine
psql engine < engine.sql
psql dwh < dwh.sql

Then, for the cluster name:
psql engine
select name from cluster;
and for the DC name:
psql dwh
select datacenter_name, create_date from datacenter_configuration order
by history_id desc limit 1;

Caution! There are a few more records in the datacenter_configuration table,
historically everything you ever named your DC... I needed to get the last one.
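
If it helps anyone, the same extraction as a single throwaway script (a sketch;
the .db file names are the ones I found in my backup, the paths inside the
archive may differ, and it assumes a scratch PostgreSQL instance you can mess up):

  # unpack the engine-backup archive and locate the two custom-format dumps
  mkdir backup && tar -xf engine.backup -C backup
  ENGINE_DB=$(find backup -name 'engine_backup.db')
  DWH_DB=$(find backup -name 'dwh_backup.db')

  # restore them into scratch databases
  createuser engine
  createdb engine && createdb dwh
  pg_restore "$ENGINE_DB" | psql engine
  pg_restore "$DWH_DB" | psql dwh

  # cluster name(s) and the latest DC name
  psql -d engine -c 'select name from cluster;'
  psql -d dwh -c 'select datacenter_name, create_date from datacenter_configuration order by history_id desc limit 1;'

  # clean up afterwards
  dropdb engine; dropdb dwh; dropuser engine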
> The process went almost flawlessly, but at some point I got a deployment error
> (code: 519) which told me that Saturn has missing networks, and that I can connect to
> https://saturn.mope.local:6900/ovirt-engine
>
> Catch 3: this URL did not work, since it was not the HE's URL, and I could not log in
> because of the FQDN checking...
The deploy process temporarily configures the engine to allow using
the host's FQDN to connect to the engine, and forwards port 6900 for
you.
There might have been other complications I do not understand that
prevented this from working well.
What happened when you tried?
If you think this is a real bug, and can reliably reproduce it, please
file one in bugzilla. Thanks!
I can't reproduce it, at least not now, but it looks like a bug. I have
two more oVirt 4.3 setups which have to be upgraded in 2021 - if nothing
blocking happens at work - so I may be able to reproduce it later.
Actually, I got the frontend webpage after opening the link (I was on the
same subnet as the deploy host), and it showed me the FQDN problem ("this
is not a valid FQDN").
There was a link to connect to the valid one, but when I clicked on it, it
started to open the original HE domain (sun.mope.local) without a port,
which pointed to the old HE's IP address.
I tried to edit my hosts file and make sun.mope.local resolve to
Saturn's IP, but no luck. (The port 6900 was always removed from
the link, if I remember right.)
Might this have been improved in a newer version of the installer?
> This may need some further improvement ;) After some thinking, I finally made
> a SOCKS proxy with SSH (port 1080 forwarded to Saturn) and I was able to log in to the locally
> running HE and make the network changes that were required to continue the deploy
> process... Also, since the old FreeNAS box was on my desk, the HE and the two hypervisors
> were unable to connect to my old SSD iSCSI, so I had to remove it...
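
For clarity, the workaround was roughly this (from memory, so the exact flags
and URL may differ slightly on your side):

  # on my desktop: open a dynamic (SOCKS) tunnel through the deploy host
  ssh -f -N -D 1080 root@saturn.mope.local
  # then set the browser to use a SOCKS5 proxy on localhost:1080 (with DNS
  # going through the proxy) and open the temporary engine UI through it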
I understand this was somewhat annoying, but you have to realize that
there are many different backup/restore/upgrade/migration scenarios,
and it's not easy to decide what's best to do.
I understand completely ;)
No two setups are the same...
In your specific case, if you are confident that you'll not need to be
able to abort in the middle and revert to the old engine+storage, you
could have removed it from the engine before taking the backup.
How can I remove from the engine the domain which the engine itself is running
from? (And only the engine... all other VMs were copied to another place.)
> (I could not put it into maintenance, but I was able to delete it from the Storage->Domains tab.)
> After this, Saturn came up and got the "green arrow", so I removed the lock file
> which the deploy gave me, and the deploy continued...
>
> After this, I selected the new SSD RAID1 on my Dell iSCSI box, and the deploy was
> finally able to copy the HE to the iSCSI, so far so good :)
>
> Finally, I've got my new 4.4.6 setup, with a self-hosted HE on my new TrueNAS@Dell
> (and all my VMs running on Jupiter at this time, without any errors).
>
> The next step was to live migrate everything off Jupiter to Saturn, DVD in, remove Jupiter from
> the Cluster, reinstall it with 4.4.6, and re-add Jupiter (this time with HE deployed
> again).
>
> After I put Jupiter back and made the required initial network setup
> (VLANs pulled to the bonds, iSCSI IPs set, etc.), the cluster was up and running.
>
> The next step was to upgrade the hypervisors to the latest image with a rolling update;
> it worked as before, so the time came to move the cluster and DC compatibility level
> from 4.3 -> 4.6... This forced me to reboot all my VMs, as was written when I made this
> change, but this worked too. At last, I restarted the HE with hosted-engine
> --vm-shutdown / --vm-start, because it had a "!" at the BIOS version...
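
For completeness, the restart itself is roughly this sequence (a sketch, with
global maintenance around it so the HA agents don't restart the VM on their own):

  hosted-engine --set-maintenance --mode=global
  hosted-engine --vm-shutdown
  # wait until "hosted-engine --vm-status" reports the VM as down, then
  hosted-engine --vm-start
  hosted-engine --set-maintenance --mode=none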
>
> And this is my question, actually: after the restart, the BIOS version of the HE
> machine remained the old one, and it still has the "!" which states: "The
> chipset/firmware type does not match the cluster chipset/firmware type"
>
> Does anyone know how the HE BIOS can be updated after a compatibility level increase?
Sorry, not sure. Perhaps you can try to edit the VM from the web admin
ui. Adding Arik.
Thanks.
The GUI told me: "There was an attempt to change Hosted Engine VM
values that are locked."
Generally speaking:
1. The VM is created fresh, unrelated to anything in your backup, IMO.
2. So this is perhaps a result of starting from 4.4.6. To prevent
this, you could have upgraded the host to latest version *before*
starting the deploy/restore. But I might be wrong.
I think you're wrong. This did not happen because I used the 4.4.6 installer...
This issue started when I moved up the cluster compatibility level.
If I had used the latest ISO, the compatibility level would have had to be raised
from the same value (4.3) to the same value (4.6) anyway... (after the new
HE was deployed)
> Sorry if my mail should have gone to the users list, but because of "Catch
> 3" I think it's better here.
Generally speaking it should have been on the users list - most bugs
are reported there, not here. But that's ok :-). This list is more for
developers. E.g. If you decide to write a patch for one of the above
issues, you might want to discuss it here if needed.
Okay! Should I send it there too, and move the conversation to the users
list?
> Also, I wanted to write this down; maybe someone will find it useful, and
> this procedure could be used in the docs too, if you want!
I think most of what you wrote is already covered by docs, no?
I had to figure out
a few things by myself :)
You might want to also check the RHV docs (right now they tend to be more
up-to-date than oVirt's), and also open docs bugs in bugzilla. If you
think there is something concrete missing, please file a docs bug (and
if it's missing in oVirt and exists in RHV, that's not a reason to not
file the bug - from oVirt's POV, the bug exists!).
Okay, if I have some time, I'll read them again.
BTW, I started from this solution (at least the preparation steps were
done based on it):
https://access.redhat.com/solutions/2998291
The oVirt doc may need some extensions based on my workflow, but I'll
try another cluster with the latest ISO:
https://www.ovirt.org/documentation/upgrade_guide/#SHE_Upgrading_from_4-3
(the main steps are there)
Thanks and best regards,
Regards:
Peter
--
Erdősi Péter
IT specialist, IKT Fejlesztési Főosztály (ICT Development Department)
Kormányzati Informatikai Fejlesztési Ügynökség (KIFÜ)
Address: 1134 Budapest, Váci út 35.
Tel: +36 1 450 3080   E-mail: erdosi.peter(a)kifu.gov.hu
www.kifu.gov.hu