[ovirt-users] Hosted engine deployment error

spfma.tech at e.mail.fr spfma.tech at e.mail.fr
Wed Mar 21 16:19:55 UTC 2018


Hi,
I made some progress: by allowing my NAS to map any user to admin (not the best for security, but it is a dedicated infrastructure), this weird permissions problem disappeared. Maybe an NFS bug somewhere? I don't know.

I was able to redeploy a new hosted engine and, after a cleanup and some other manual cleaning tasks, restore my latest backup. So the new engine VM is able to start up, but there seems to be a problem communicating with the hosts. I get a lot of errors like this one:

vdsm[3008]: ERROR ssl handshake: SSLError, address: ::ffff:10.100.1.100

10.100.1.100 is the IP of the engine VM. 

vdsm.log is not much more helpful: 

2018-03-21 17:10:10,769+0100 ERROR (Reactor thread) [ProtocolDetector.SSLHandshakeDispatcher] ssl handshake: SSLError, address: ::ffff:10.100.1.100 (sslutils:258) 

Is there something to update or generate after a restore? I don't know whether keys and certificates were kept or if new ones are now in use. 
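In case it helps with the keys/certificates question, here is a minimal sketch for comparing the CA the host trusts with the engine's CA, assuming the default oVirt PKI paths (an assumption on my side, not verified here):

# On the host: fingerprint of the CA that vdsm trusts
openssl x509 -noout -fingerprint -in /etc/pki/vdsm/certs/cacert.pem
# On the engine VM: fingerprint of the engine CA
openssl x509 -noout -fingerprint -in /etc/pki/ovirt-engine/ca.pem
# Probe the handshake against vdsm (port 54321) from the engine VM:
echo | openssl s_client -connect pfm-srv-virt-1.pfm-ad.pfm.loc:54321 2>&1 | head -n 5

If the two fingerprints differ, the restore brought back certificates the redeployed host no longer trusts.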

I also tried adding the SSH public key shown in the GUI to authorized_keys on a node, and even rebooted, but no change. 

Regards 

On 20-Mar-2018 16:12:40 +0100, spfma.tech at e.mail.fr wrote:
I tried to make a cleaner install: after the cleanup, I recreated "/rhev/data-center/mnt/" and ran the installer again.

As you can see, it crashed again with the same access denied error on this file:

[ INFO ] TASK [Copy configuration archive to storage]
[ ERROR ] fatal: [localhost]: FAILED! => {"changed": true, "cmd": ["dd", "bs=20480", "count=1", "oflag=direct", "if=/var/tmp/localvmVBRLpL/b1884198-69e6-4096-939d-03c87112de10", "of=/rhev/data-center/mnt/10.100.2.132:_volume3_ovirt__engine__self__hosted/015d9546-af01-4fb2-891e-e28683db3387/images/589d0768-c935-4495-aa57-45b9b2a18526/b1884198-69e6-4096-939d-03c87112de10"], "delta": "0:00:00.004468", "end": "2018-03-20 15:57:34.199405", "msg": "non-zero return code", "rc": 1, "start": "2018-03-20 15:57:34.194937", "stderr": "dd: failed to open /rhev/data-center/mnt/10.100.2.132:_volume3_ovirt__engine__self__hosted/015d9546-af01-4fb2-891e-e28683db3387/images/589d0768-c935-4495-aa57-45b9b2a18526/b1884198-69e6-4096-939d-03c87112de10: Permission denied", "stderr_lines": ["dd: failed to open /rhev/data-center/mnt/10.100.2.132:_volume3_ovirt__engine__self__hosted/015d9546-af01-4fb2-891e-e28683db3387/images/589d0768-c935-4495-aa57-45b9b2a18526/b1884198-69e6-4096-939d-03c87112de10: Permission denied"], "stdout": "", "stdout_lines": []}
[ ERROR ] Failed to execute stage 'Closing up': Failed executing ansible-playbook
But the file permissions look OK to me:

-rw-rw----. 1 vdsm kvm 1.0G Mar 20 2018 /rhev/data-center/mnt/10.100.2.132:_volume3_ovirt__engine__self__hosted/015d9546-af01-4fb2-891e-e28683db3387/images/589d0768-c935-4495-aa57-45b9b2a18526/b1884198-69e6-4096-939d-03c87112de10

So I decided to test something: I set a shell for "vdsm" so I could log in:

su - vdsm -c "touch /rhev/data-center/mnt/10.100.2.132:_volume3_ovirt__engine__self__hosted/015d9546-af01-4fb2-891e-e28683db3387/images/589d0768-c935-4495-aa57-45b9b2a18526/b1884198-69e6-4096-939d-03c87112de10" && echo "OK"
OK

As far as I can see, still no permission problem. 

But if I try the same as "root": 

touch /rhev/data-center/mnt/10.100.2.132:_volume3_ovirt__engine__self__hosted/015d9546-af01-4fb2-891e-e28683db3387/images/589d0768-c935-4495-aa57-45b9b2a18526/b1884198-69e6-4096-939d-03c87112de10 && echo "OK"
touch: cannot touch /rhev/data-center/mnt/10.100.2.132:_volume3_ovirt__engine__self__hosted/015d9546-af01-4fb2-891e-e28683db3387/images/589d0768-c935-4495-aa57-45b9b2a18526/b1884198-69e6-4096-939d-03c87112de10: Permission denied 

Of course, "root" and "vdsm" can create, touch and delete other files flawlessly in this share. 
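One difference between these touch tests and the failing installer command is worth noting: the installer's dd opened the file with oflag=direct (O_DIRECT), while touch does not. A sketch to narrow it down, reusing the leftover image from the failed run (it gets overwritten, but the failed deployment no longer needs it):

# Buffered write as vdsm:
su -s /bin/bash vdsm -c "dd bs=20480 count=1 if=/dev/zero of=/rhev/data-center/mnt/10.100.2.132:_volume3_ovirt__engine__self__hosted/015d9546-af01-4fb2-891e-e28683db3387/images/589d0768-c935-4495-aa57-45b9b2a18526/b1884198-69e6-4096-939d-03c87112de10" && echo "buffered OK"
# Same write with direct I/O, as the installer does:
su -s /bin/bash vdsm -c "dd bs=20480 count=1 oflag=direct if=/dev/zero of=/rhev/data-center/mnt/10.100.2.132:_volume3_ovirt__engine__self__hosted/015d9546-af01-4fb2-891e-e28683db3387/images/589d0768-c935-4495-aa57-45b9b2a18526/b1884198-69e6-4096-939d-03c87112de10" && echo "direct OK"

If only the second variant fails, the NAS is rejecting direct I/O on that file rather than plain access.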

It looks like some kind of immutable file, but that is not supposed to exist on NFS, is it? 

Regards 

On 20-Mar-2018 12:22:50 +0100, stirabos at redhat.com wrote:

 On Tue, Mar 20, 2018 at 11:44 AM,  wrote:

   Hi,
In fact, it was a workaround from you, which I found in the bug tracker, that helped me:

chmod 644 /var/cache/vdsm/schema/*    
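To check whether that workaround is still needed, the schema cache just has to be readable by the "vdsm" user; a quick sketch:

ls -l /var/cache/vdsm/schema/
# Confirm vdsm itself can read the files; re-apply the chmod if not:
su -s /bin/sh vdsm -c 'cat /var/cache/vdsm/schema/* >/dev/null' && echo "readable"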

The only thing I found that looked like a weird error was:

ERROR Exception raised
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/vdsmd.py", line 156, in run
    serve_clients(log)
  File "/usr/lib/python2.7/site-packages/vdsm/vdsmd.py", line 103, in serve_clients
    cif = clientIF.getInstance(irs, log, scheduler)
  File "/usr/lib/python2.7/site-packages/vdsm/clientIF.py", line 250, in getInstance
    cls._instance = clientIF(irs, log, scheduler)
  File "/usr/lib/python2.7/site-packages/vdsm/clientIF.py", line 144, in __init__
    self._prepareJSONRPCServer()
  File "/usr/lib/python2.7/site-packages/vdsm/clientIF.py", line 307, in _prepareJSONRPCServer
    bridge = Bridge.DynamicBridge()
  File "/usr/lib/python2.7/site-packages/vdsm/rpc/Bridge.py", line 67, in __init__
    self._schema = vdsmapi.Schema(paths, api_strict_mode)
  File "/usr/lib/python2.7/site-packages/vdsm/api/vdsmapi.py", line 217, in __init__
    raise SchemaNotFound("Unable to find API schema file")
SchemaNotFound: Unable to find API schema file

Thanks, it's tracked here: https://bugzilla.redhat.com/1552565 A fix will come in the next build.
So I can go one step further, but the installation still fails in the end, with file permission problems on datastore files (I chose NFS 4.1). Even logged in as root, I can't touch them or get information about them, but I can create and delete files in the same directory. Is there a workaround for this too?

Everything should get written and read on the NFS export as vdsm:kvm (36:36); can you please ensure that everything is fine with that?
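A minimal sketch of that check; the NAS-side path below is my guess, decoded from the mount name:

# On the host, the mounted export should show numeric ownership 36:36:
ls -ldn /rhev/data-center/mnt/10.100.2.132:_volume3_ovirt__engine__self__hosted
# If not, fix it on the NFS server side, e.g.:
#   chown -R 36:36 /volume3/ovirt_engine_self_hosted
# and make sure the export options don't squash uid 36 (e.g. rw,anonuid=36,anongid=36).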
   Regards 

On 19-Mar-2018 17:48:41 +0100, stirabos at redhat.com wrote:

 On Mon, Mar 19, 2018 at 4:56 PM,  wrote:

 Hi,
I wanted to rebuild a new hosted engine setup, as the old one was corrupted (too many violent poweroffs!). The server was not reinstalled; I just ran "ovirt-hosted-engine-cleanup". The network setup generated by vdsm seemed to still be in place, so I haven't changed anything there. Then I decided to update the packages to the latest versions available, rebooted the server and ran "ovirt-hosted-engine-setup".

But the process never succeeds, as I get an error after a long time spent in "[ INFO ] TASK [Wait for the host to be up]":

[ ERROR ] fatal: [localhost]: FAILED! => {"ansible_facts": {"ovirt_hosts": [{"address": "pfm-srv-virt-1.pfm-ad.pfm.loc", "affinity_labels": [], "auto_numa_status": "unknown", "certificate": {"organization": "pfm.loc", "subject": "O=pfm.loc,CN=pfm-srv-virt-1.pfm-ad.pfm.loc"}, "cluster": {"href": "/ovirt-engine/api/clusters/d6c9358e-2b8b-11e8-bc86-00163e152701", "id": "d6c9358e-2b8b-11e8-bc86-00163e152701"}, "comment": "", "cpu": {"speed": 0.0, "topology": {}}, "device_passthrough": {"enabled": false}, "devices": [], "external_network_provider_configurations": [], "external_status": "ok", "hardware_information": {"supported_rng_sources": []}, "hooks": [], "href": "/ovirt-engine/api/hosts/542566c4-fc85-4398-9402-10c8adaa9554", "id": "542566c4-fc85-4398-9402-10c8adaa9554", "katello_errata": [], "kdump_status": "unknown", "ksm": {"enabled": false}, "max_scheduling_memory": 0, "memory": 0, "name": "pfm-srv-virt-1.pfm-ad.pfm.loc", "network_attachments": [], "nics": [], "numa_nodes": [], "numa_supported": false, "os": {"custom_kernel_cmdline": ""}, "permissions": [], "port": 54321, "power_management": {"automatic_pm_enabled": true, "enabled": false, "kdump_detection": true, "pm_proxies": []}, "protocol": "stomp", "se_linux": {}, "spm": {"priority": 5, "status": "none"}, "ssh": {"fingerprint": "SHA256:J75BVLFnmGBGFosXzaxCRnuIYcOc75HUBQZ4pOKpDg8", "port": 22}, "statistics": [], "status": "non_responsive", "storage_connection_extensions": [], "summary": {"total": 0}, "tags": [], "transparent_huge_pages": {"enabled": false}, "type": "rhel", "unmanaged_networks": [], "update_available": false}]}, "attempts": 120, "changed": false}
[ INFO ] TASK [Remove local vm dir]
[ INFO ] TASK [Notify the user about a failure]
[ ERROR ] fatal: [localhost]: FAILED! => {"changed": false, "msg": "The system may not be provisioned according to the playbook results: please check the logs for the issue, fix accordingly or re-deploy from scratch.\n"}

I made another try with Cockpit; it is the same. Am I doing something wrong, or is there a bug?

I suppose that your host was configured with DHCP; if so, it's this one: https://bugzilla.redhat.com/1549642 The fix will come with 4.2.2.
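For reference, while the playbook loops in "Wait for the host to be up", a quick sketch of what to look at on the host (ovirtmgmt is the default management bridge name):

systemctl status vdsmd
journalctl -u vdsmd --since "1 hour ago" | tail -n 50
# With the DHCP issue above, the management bridge can end up without its address:
ip addr show ovirtmgmt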
   Regards     

-------------------------------------------------------------------------------------------------
FreeMail powered by mail.fr 
_______________________________________________
 Users mailing list
Users at ovirt.org
http://lists.ovirt.org/mailman/listinfo/users
