Re: Fix corrupt self-hosted engine
by Alex K
On Sun, Nov 22, 2020 at 8:57 AM Yedidyah Bar David <didi(a)redhat.com> wrote:
> On Thu, Nov 19, 2020 at 9:43 PM Alex K <rightkicktech(a)gmail.com> wrote:
>
>>
>>
>> On Thu, Nov 19, 2020 at 5:31 PM Alex K <rightkicktech(a)gmail.com> wrote:
>>
>>> Hi Didi,
>>>
>>> On Thu, Nov 19, 2020 at 5:13 PM Yedidyah Bar David <didi(a)redhat.com>
>>> wrote:
>>>
>>>> On Thu, Nov 19, 2020 at 4:37 PM Alex K <rightkicktech(a)gmail.com> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I have a corrupt self-hosted engine (with several file system errors,
>>>>> postgres not able to start) and thus it does not give access to the web UI.
>>>>> This happened following an unlucky split brain resolution (I am running 2
>>>>> nodes). The two hosts are running VMs also which I would like to keep
>>>>> running as they are needed.
>>>>>
>>>>> When trying to boot into rescue mode (using
>>>>> systemd.unit=emergency.target boot parameter) I get a cursor and nothing
>>>>> else.
>>>>>
>>>>
>>>> This means that more than just the DB is corrupt...
>>>>
>>>>
>>>>>
>>>>> I have backups of engine files with scope all (using the engine-backup
>>>>> tool).
>>>>> What is the best approach to try to fix the engine or redeploy?
>>>>>
>>>>
>>>> If you are careful, and know what you are doing, you can try something
>>>> like the following. I am not giving many details, hopefully you can find on
>>>> the net tutorials about how to use the things I suggest:
>>>>
>>>> 1. Move to global maintenance
>>>>
>>>> 2. Stop the current dead vm (if needed)
>>>>
>>>> 3. Find current vm conf, edit it to boot from a rescue iso image of
>>>> your preference or from net/PXE etc., and start the vm with '--vm-conf'
>>>> pointing to your edited file.
>>>>
>>>> 4. Connect a console (hosted-engine --console, or 'virsh console', or
>>>> use '--add-console-password' and remote viewer, if needed)
>>>>
>>>> 5. Clean the disk and install the OS, oVirt, etc.
>>>>
>>>> 6. Copy your backup into the vm and restore with engine-backup
>>>>
>>>> 7. Then cleanly stop the machine, exit global maint, and let HA start
>>>> it (or start it yourself with --vm-start).
>>>>
>>>> At the time, we had a bug [1] to document this. The result is [2]. It
>>>> does not detail how to boot/reinstall os/etc., only restore (if e.g. db is
>>>> dead but fs is ok).
>>>> For something somewhat similar to what you want, see also [3], which
>>>> uses guestfish. Might be useful, depending on how badly your disk is
>>>> corrupted.
>>>>
>>> I went with the guestfish approach. It fixed some fs issues, and now
>>> yum etc. seem fine, apart from postgres.
>>> I had previously tried to uninstall/install packages, so I ended up
>>> installing them again with yum install ovirt\*setup\*.
>>> Now I think I have to run engine-setup, but I get the error:
>>>
>>> Failed to execute stage 'Environment setup': Cannot connect to Engine
>>> database using existing credentials: engine@localhost:5432
>>>
>> It seems that I need to have psql running to be able to run engine-backup
>> --mode=restore. Are there any steps for manually preparing pgsql for
>> oVirt so as to attempt the restoration?
>>
>
> Replying again, also to conclude this part of your episode: Generally
> speaking, that's not needed. restore --provision-all-databases should do
> that for you.
>
It seems that when pgsql is down nothing can be done. You need at least pgsql
up and running (a clean state will do) to be able to proceed with the
restoration.
>
> I replied to all your interim emails in private, since you replied in
> private.
>
Did not notice I was replying in private :)
>
> Thanks for the final message to the list.
>
> It would be nice if you send another summary of the main obstacles you ran
> into, what worked and didn't work, and especially what ideas you can think
> of to improve the code/doc for the next time something similar happens
> (also to you :-) ).
>
> If you feel like that, and have time, it sounds like a nice opportunity
> for a blog post :-) (I know I (almost?) never wrote any myself, sorry, but
> I like reading them - and they are much more approachable and useful, over
> the long run, compared to just posting to the list).
>
Noted. I will look into writing this up as a blog post. Generally, the part
missing from the docs was that one cannot proceed with the restoration if
pgsql is unable to start. So I had to do a clean reinstall of pgsql and
initialize its data store before proceeding with the restoration.
>
> Best regards,
>
>
>>
>>> So I guess I need to follow [2]. What do you think?
>>>
>>>
>>>> How did you run into a split brain? There is a lock on the shared
>>>> storage that should prevent this.
>>>>
>>>> Good luck and best regards,
>>>>
>>>> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1482710
>>>> [2]
>>>> https://www.ovirt.org/documentation/administration_guide/#Overwriting_a_S...
>>>> [3] https://bugzilla.redhat.com/show_bug.cgi?id=1569827#c4
>>>> --
>>>> Didi
>>>>
>>>
>
> --
> Didi
>
Re: ovirt 4.3 cannot upload ISO to data domain
by Alex K
On Tue, Nov 24, 2020 at 12:44 PM Facundo Badaracco <varekoarfa(a)gmail.com>
wrote:
> hi alex!
>
> imageio-proxy isn't installed. Only the ovirt-imageio service is running.
> (oVirt 4.4 now; I did the update.)
>
If I run the following I get:
[root@engine ~]# systemctl | grep image
ovirt-imageio-proxy.service
Then checking the status of the service I confirm it is up:
systemctl status ovirt-imageio-proxy.service
If the service is up and OK, then you need to check the certificate issue. I
have sometimes noticed that importing the cert from the GUI did not resolve
the issue. You may try to manually get the CA cert from the engine
(/etc/pki/ovirt-engine/ca.pem) and import it into your browser. You might
first need to remove the already imported cert from your browser. Then clear
the browser cache and try again. The browser should indicate that the
connection is trusted and secure.
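Independently of the browser, a quick check from a client machine can show whether a TLS endpoint on the engine verifies against the engine CA. The following is only a minimal sketch: the host name engine.example.com is hypothetical, ca.pem is assumed to be a local copy of /etc/pki/ovirt-engine/ca.pem, and the port numbers in the usage comment (443 for the web UI, 54323 for the imageio service) are assumptions that should be checked against your setup.

```python
import socket
import ssl


def check_tls(host, port, ca_file, timeout=5.0):
    """Return (True, "") if the TLS handshake verifies against ca_file,
    otherwise (False, reason)."""
    try:
        # PROTOCOL_TLS_CLIENT enables certificate and hostname checks by default.
        context = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
        # Trust only the CA certificate(s) found in ca_file.
        context.load_verify_locations(cafile=ca_file)
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host):
                return True, ""
    except OSError as exc:
        # ssl.SSLError and FileNotFoundError are both subclasses of OSError.
        return False, str(exc)


# Example usage (hypothetical host, assumed ports):
#   print(check_tls("engine.example.com", 443, "ca.pem"))
#   print(check_tls("engine.example.com", 54323, "ca.pem"))
```

If the web UI port verifies but the imageio port does not, that would suggest the two services present different certificates; this is a diagnostic hint only, not a fix.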
> When I connect to the web UI, it says the certificate is not valid, with a
> red warning. But I have already added the certificate to Chrome.
>
> On Tue, Nov 24, 2020 at 07:22, Alex K (rightkicktech(a)gmail.com)
> wrote:
>
>>
>>
>> On Mon, Nov 23, 2020 at 4:43 PM Facundo Badaracco <varekoarfa(a)gmail.com>
>> wrote:
>>
>>> Hi everyone.
>>>
>>> I'm trying to upload an ISO to my data domain, but the GUI gives me this error:
>>> "Connection to ovirt-imageio service has failed. Ensure that ovirt-engine
>>> certificate
>>> <https://192.168.2.27/ovirt-engine/services/pki-resource?resource=ca-certi...> is
>>> registered as a valid CA in the browser.".
>>>
>> is the imageio service running ok? At engine: systemctl status
>> ovirt-imageio-proxy
>> Also, when connecting at the engine web UI, is the browser happy with the
>> certificate of the UI?
>>
>>>
>>> The image fails when trying to upload.
>>>
>>> I have added the certificate in Chrome, but it isn't working.
>>>
>>> Any help? Is there any other way to upload?
>>
oVirt 4.4 and Active directory
by Latchezar Filtchev
Hello All,
Fresh standalone installation of oVirt 4.3 (CentOS 7). Execution of ovirt-engine-extension-aaa-ldap-setup completes normally and DC is connected to AD (domain functional level: Windows Server 2008).
On the same hardware fresh standalone installation of oVirt 4.4.
Installation of engine completed with warning:
2020-11-23 14:50:46,159+0200 WARNING otopi.plugins.ovirt_engine_common.base.network.hostname hostname._validateFQDNresolvability:308 Failed to resolve 44-8.mb118.local using DNS, it can be resolved only locally
Despite the warning, the engine portal is resolvable after installation.
Execution of ovirt-engine-extension-aaa-ldap-setup ends with:
[ INFO ] Stage: Environment customization
Welcome to LDAP extension configuration program
Available LDAP implementations:
1 - 389ds
2 - 389ds RFC-2307 Schema
3 - Active Directory
4 - IBM Security Directory Server
5 - IBM Security Directory Server RFC-2307 Schema
6 - IPA
7 - Novell eDirectory RFC-2307 Schema
8 - OpenLDAP RFC-2307 Schema
9 - OpenLDAP Standard Schema
10 - Oracle Unified Directory RFC-2307 Schema
11 - RFC-2307 Schema (Generic)
12 - RHDS
13 - RHDS RFC-2307 Schema
14 - iPlanet
Please select: 3
Please enter Active Directory Forest name: mb118.local
[ INFO ] Resolving Global Catalog SRV record for mb118.local
[WARNING] Cannot resolve Global Catalog SRV record for mb118.local. Please check you have entered correct Active Directory forest name and check that forest is resolvable by your system DNS servers
[ ERROR ] Failed to execute stage 'Environment customization': Active Directory forest is not resolvable, please make sure you've entered correct forest name. If for some reason you can't use forest and you need some special configuration instead, please refer to examples directory provided by ovirt-engine-extension-aaa-ldap package.
[ INFO ] Stage: Clean up
Log file is available at /tmp/ovirt-engine-extension-aaa-ldap-setup-20201123113909-bj749k.log:
[ INFO ] Stage: Pre-termination
[ INFO ] Stage: Termination
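Since the setup tool stops at resolving the Global Catalog SRV record, the same lookup can be repeated by hand from the engine machine. This is a minimal sketch, assuming the standard Active Directory record names (_gc._tcp.<forest> for the Global Catalog, _ldap._tcp.<domain> for domain controllers) and assuming the dig utility is installed; the helper names are made up for illustration.

```python
import shutil
import subprocess


def gc_srv_name(forest):
    """Name of the Global Catalog SRV record for an AD forest."""
    return "_gc._tcp.{}".format(forest.strip().rstrip("."))


def ldap_srv_name(domain):
    """Name of the LDAP SRV record for an AD domain."""
    return "_ldap._tcp.{}".format(domain.strip().rstrip("."))


def lookup_srv(name, timeout=10):
    """Query the system DNS servers with dig.

    Returns a list of answer lines (empty if nothing resolved),
    or None if dig is not installed.
    """
    dig = shutil.which("dig")
    if dig is None:
        return None
    result = subprocess.run(
        [dig, "+short", name, "SRV"],
        capture_output=True, text=True, timeout=timeout, check=False,
    )
    return [line for line in result.stdout.splitlines() if line.strip()]


# Example usage with the forest name from the log above:
#   print(gc_srv_name("mb118.local"), lookup_srv(gc_srv_name("mb118.local")))
```

An empty result on the 4.4 machine and a non-empty one on the 4.3 machine would point at the resolver configuration of the new host rather than at the AD side.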
Can someone advise on this?
Thank you!
Best,
Latcho
Intel Cascade Lake Family supported
by Ramon Sierra
Hi,
We are planning to upgrade our cluster hardware. We would like to know
if Intel Cascade Lake CPUs are supported on ovirt 4.4.3. Any
recommendation is very welcome.
Regards,
Ramon
oVirt Engine LDAP aaa - rfc2307bis issues
by Jake R
Hi,
I have LDAP with the rfc2307bis schema: I have posixGroup entries, with
members defined as full DNs under the member attribute.
Currently, if I log in to oVirt via the AAA extension, my groups are
not enumerated. The LDAP searches (recorded on the LDAP server) are:
slapd[1503]: conn=7876 op=2 SRCH base="dc=example,dc=com" scope=2 deref=0
filter="(&(objectClass=posixGroup)(memberUid=jreynolds))"
slapd[1503]: conn=7876 op=2 SRCH attr=entryUUID cn description
slapd[1503]: conn=7871 op=2 SRCH base="dc=example,dc=com" scope=2 deref=0
filter="(&(|(objectClass=groupOfUniqueNames)(objectClass=posixGroup))(uniqueMember:uniqueMemberMatch:=cn=jreynolds,ou=users,dc=example,dc=com))"
slapd[1503]: conn=7871 op=2 SRCH attr=entryUUID cn description
These return no results, because the search needs to match the 'member'
attribute against a full DN. The issue appears to be inherited from the
simple.properties file (regardless of whether I use the rfc2307 or
rfc2307-openldap profile), with the line:
search.simple-resolve-groups-member.search-request.filter =
&${seq:simple_filterGroupObject}(${seq:simple_attrGroupMemberDN}=${seq:_simple_dn_encoded})
I can fix the issue by replacing "${seq:simple_attrGroupMemberDN}=" with
"member=", but this feels pretty hacky. I cannot find where this variable
is defined, nor how to change it. Is the correct way to do this to create a
new profile that overwrites the filter value? Or am I doing something
wrong? I don't think my LDAP schema is particularly unusual, as far as I'm
aware it complies with rfc2307bis spec.
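To make the mismatch concrete, here is a minimal sketch that builds the filter the extension sends today and the one an rfc2307bis directory would need. The function names are made up for illustration, and the escaping follows the usual LDAP filter rules (RFC 4515); this is not code from the AAA extension.

```python
def escape_filter_value(value):
    """Escape characters that are special inside an LDAP filter value."""
    replacements = {
        "\\": r"\5c",
        "*": r"\2a",
        "(": r"\28",
        ")": r"\29",
        "\x00": r"\00",
    }
    return "".join(replacements.get(ch, ch) for ch in value)


def memberuid_filter(uid):
    """rfc2307 style: groups list plain user names in memberUid."""
    return "(&(objectClass=posixGroup)(memberUid={}))".format(
        escape_filter_value(uid))


def member_dn_filter(user_dn, attr="member"):
    """rfc2307bis style: groups list full DNs in the member attribute."""
    return "(&(objectClass=posixGroup)({}={}))".format(
        attr, escape_filter_value(user_dn))


print(memberuid_filter("jreynolds"))
print(member_dn_filter("cn=jreynolds,ou=users,dc=example,dc=com"))
```

The first printed filter is the one seen in the slapd log; the second is the shape of search that would match groups storing full DNs under member.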
Thanks,
Jake
Can't find storage server connection
by francesco@shellrent.com
Hi all,
I'm using the oVirt Python SDK to retrieve info about a storage domain on several hosts (centos7/ovirt4.3 and centos8/ovirt4.4), but on some of them the script exits with the following error:
Traceback (most recent call last):
  File "get_uuid.py", line 70, in <module>
    storage_domain = sds_service.list(search='name=data-foo')[0]
  File "/root/.local/lib/python2.7/site-packages/ovirtsdk4/services.py", line 26296, in list
    return self._internal_get(headers, query, wait)
  File "/root/.local/lib/python2.7/site-packages/ovirtsdk4/service.py", line 211, in _internal_get
    return future.wait() if wait else future
  File "/root/.local/lib/python2.7/site-packages/ovirtsdk4/service.py", line 55, in wait
    return self._code(response)
  File "/root/.local/lib/python2.7/site-packages/ovirtsdk4/service.py", line 208, in callback
    self._check_fault(response)
  File "/root/.local/lib/python2.7/site-packages/ovirtsdk4/service.py", line 132, in _check_fault
    self._raise_error(response, body)
  File "/root/.local/lib/python2.7/site-packages/ovirtsdk4/service.py", line 118, in _raise_error
    raise error
ovirtsdk4.Error: Fault reason is "Operation Failed". Fault detail is "Can't find storage server connection for id '92444a95-0be7-4589-ac46-1ed6dfe7ed4c'.". HTTP response code is 500.
The portion of the script that searches for the storage domain is the following:
sds_service = connection.system_service().storage_domains_service()
storage_domain = sds_service.list(search='name={}'.format(storage_domain_name))[0]
I have no real clue what the ID '92444a95-0be7-4589-ac46-1ed6dfe7ed4c' is, but digging through the engine logs shows that it refers to a StorageServerConnections ID:
[root@ovirt-engine ovirt-engine]# zgrep 92444a95-0be7-4589-ac46-1ed6dfe7ed4c *.gz
engine.log-20201108.gz:2020-11-07 06:05:54,352+01 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-79) [3ffb810c] START, ConnectStorageServerVDSCommand(HostName = another-server.foo.com, StorageServerConnectionManagementVDSParameters:{hostId='7d202bc7-002b-4426-8446-99b6b346874e', storagePoolId='82d0b3de-0334-451c-8321-c3533de9a894', storageType='LOCALFS', connectionList='[StorageServerConnections:{id='92444a95-0be7-4589-ac46-1ed6dfe7ed4c', connection='/data', iqn='null', vfsType='null', mountOptions='null', nfsVersion='null', nfsRetrans='null', nfsTimeo='null', iface='null', netIfaceName='null'}]', sendNetworkEventOnFailure='true'}), log id: 1509fe17
As I said, I ran the script from several hosts, some with oVirt 4.3 and others with oVirt 4.4, and it may succeed or fail on either version.
When I try to manage the storage domain via the ovirt-engine GUI on the hosts where the script exits with the error above, I receive the following error:
Uncaught exception occurred. Please try reloading the page. Details: (TypeError) : Cannot read property 'a' of null
Please have your administrator check the UI logs
Any idea what is going on?
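One way to narrow this down is to ask the engine which storage connections it knows about and check whether the ID from the error is among them. The helpers below are a minimal sketch and are not part of the SDK; the usage comment assumes the SDK's storage_connections_service() on the system service, and that call may itself fail on an affected engine.

```python
def find_connection(connections, wanted_id):
    """Return the first object whose .id equals wanted_id, or None."""
    for conn in connections:
        if getattr(conn, "id", None) == wanted_id:
            return conn
    return None


def first_or_none(items):
    """Return items[0], or None for an empty result, avoiding IndexError."""
    return items[0] if items else None


# Example usage with an existing ovirtsdk4 connection object:
#   conns_service = connection.system_service().storage_connections_service()
#   conn = find_connection(conns_service.list(),
#                          "92444a95-0be7-4589-ac46-1ed6dfe7ed4c")
#   print("connection known to the engine:", conn is not None)
#
#   sds_service = connection.system_service().storage_domains_service()
#   storage_domain = first_or_none(sds_service.list(search="name=data-foo"))
```

If the ID is missing from the list while a storage domain still references it, that would be consistent with the "Can't find storage server connection" fault, though the sketch only observes the state and does not repair it.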
Thank you for your time and help,
Francesco
Best and clean way to have oVirt qemu and libvirt versions on CentOS 8
by Gianluca Cecchi
Hello,
normally with current CentOS 8.2 I get
qemu-kvm-core-15:2.12.0-99.module_el8.2.0+524+f765f7e0.4.x86_64
libvirt-daemon-4.5.0-42.module_el8.2.0+320+13f867d7.x86_64
With oVirt 4.4 having
qemu-kvm-core-4.2.0-29.el8.3.x86_64
libvirt-daemon-6.0.0-25.2.el8.x86_64
and in current Fedora 32 updates:
qemu-kvm-core-4.2.1-1.fc32.x86_64.rpm
libvirt-daemon-6.1.0-4.fc32.x86_64.rpm
I see on CentOS these groups somehow related to Virtualization:
Virtualization Host
Virtualization Client
Virtualization Hypervisor
Virtualization Platform
Virtualization Tools
What would be the best and least intrusive repo to enable on plain CentOS
8.2 to get qemu-kvm and libvirt versions close to the ones shipped with
oVirt, or the Fedora ones, without enabling all the oVirt 4.4 repos?
Thanks,
Gianluca
ovirt-imageio-proxy not working after updating SSL certificates with a wildcard cert issued by AlphaSSL (intermediate)
by Lynn Dixon
All,
I recently bought a wildcard certificate for my lab domain (shadowman.dev)
and I replaced all the certs on my RHV4.3 machine per our documentation.
The WebUI presents the certs successfully and without any issues, and
everything seemed to be fine, until I tried to upload a disk image (or an
ISO) to my storage domain. I get this error in the events tab:
https://share.getcloudapp.com/p9uPvegx
I also see that the disk is showing up in my storage domain, but it's
showing "Paused by System" and I can't do anything with it. I can't even
delete it!
I have tried following this document to fix the issue, but it didn't work:
https://access.redhat.com/solutions/4148361
I am seeing this error pop into my engine.log:
https://pastebin.com/kDLSEq1A
And I see this error in my image-proxy.log:
WARNING 2020-07-24 15:26:34,802 web:137:web:(log_error) ERROR [172.17.0.30]
PUT /tickets/ [403] Error verifying signed ticket: Invalid ovirt ticket
(data='------my_ticket_data-----', reason=Untrusted certificate)
[request=0.002946/1]
Now, when I bought my wildcard, I was given a root certificate for the CA,
as well as a separate intermediate CA certificate from the provider.
They also gave me a certificate and a private key, of course. The root
and intermediate CA certificates have been added
to /etc/pki/ca-trust/source/anchors/ and I ran update-ca-trust.
I also started experiencing issues with the ovpn network provider at the
same time I replaced the SSL certs. I disregarded it at the time, but
now I think it's related. Any advice on what to look for to fix
ovirt-imageio-proxy?
Thanks!
*Lynn Dixon* | Red Hat Certified Architect #100-006-188
*Solutions Architect* | NA Commercial
Re: Fix corrupt self-hosted engine
by Alex K
For the record,
After fixing the major fs issues with guestfish, and since the DB was still
not starting up, I removed everything from the DB data dir and recreated it
as below:
rm -rf /var/opt/rh/rh-postgresql10/lib/pgsql/data/*
/opt/rh/rh-postgresql10/root/usr/bin/postgresql-setup --initdb
systemctl restart rh-postgresql10-postgresql.service
Then proceeded with the restoration, where I requested to provision all
missing databases:
engine-backup --mode=restore --file=engine-backup.gz \
--provision-all-databases \
--log=restore.log --restore-permissions
Following this, I ran engine-setup, as instructed by the restore operation.
I regained engine web access and saw that the same running VMs were shown as
up without issues.
Only one VM was unable to start, due to an illegal volume, but that's
another story.
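Since the restore only proceeded once PostgreSQL was running again, a small pre-flight check before calling engine-backup can save a failed attempt. This is a minimal sketch, assuming the database listens on localhost:5432 as in the engine-setup error message; the function name is made up for illustration.

```python
import socket


def postgres_port_open(host="localhost", port=5432, timeout=3.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution errors.
        return False


# Example usage before running engine-backup --mode=restore:
#   if not postgres_port_open():
#       print("PostgreSQL is not accepting connections; fix that first")
```

An open port only shows that a server is listening; it does not prove that the engine credentials work, so treat it as a first sanity check.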