[Users] Vdsmd is respawning trying to sample NICs

Itamar Heim iheim at redhat.com
Mon Jun 25 18:23:39 UTC 2012


On 06/25/2012 12:00 PM, jose garcia wrote:
> On 06/25/2012 04:17 PM, Dan Kenigsberg wrote:
>> On Mon, Jun 25, 2012 at 03:15:47PM +0100, jose garcia wrote:
>>> On 06/25/2012 01:24 PM, Dan Kenigsberg wrote:
>>>> On Mon, Jun 25, 2012 at 01:15:08PM +0100, jose garcia wrote:
>>>>> On 06/25/2012 12:30 PM, Dan Kenigsberg wrote:
>>>>>> On Mon, Jun 25, 2012 at 12:11:37PM +0100, jose garcia wrote:
>>>>>>> On 06/25/2012 11:37 AM, Dan Kenigsberg wrote:
>>>>>>>> On Mon, Jun 25, 2012 at 10:57:47AM +0100, jose garcia wrote:
>>>>>>>>> Good Monday morning,
>>>>>>>>>
>>>>>>>>> Installed Fedora 17 and tried to install the node to a 3.1 engine.
>>>>>>>>>
>>>>>>>>> I'm getting a VDS network exception on the engine side:
>>>>>>>>>
>>>>>>>>> in /var/log/ovirt-engine/engine:
>>>>>>>>>
>>>>>>>>> 2012-06-25 10:15:34,132 WARN
>>>>>>>>> [org.ovirt.engine.core.vdsbroker.VdsManager]
>>>>>>>>> (QuartzScheduler_Worker-96)
>>>>>>>>> ResourceManager::refreshVdsRunTimeInfo::Failed to refresh VDS ,
>>>>>>>>> vds
>>>>>>>>> = 2e9929c6-bea6-11e1-bfdd-ff11f39c80eb :
>>>>>>>>> ovirt-node2.smb.eurotux.local, VDS Network Error, continuing.
>>>>>>>>> VDSNetworkException:
>>>>>>>>> 2012-06-25 10:15:36,143 ERROR
>>>>>>>>> [org.ovirt.engine.core.vdsbroker.VdsManager]
>>>>>>>>> (QuartzScheduler_Worker-20) VDS::handleNetworkException Server
>>>>>>>>> failed to respond, vds_id = 2e9929c6-bea6-11e1-bfdd-ff11f39c80eb,
>>>>>>>>> vds_name = ovirt-node2.smb.eurotux.local, error =
>>>>>>>>> VDSNetworkException:
>>>>>>>>> 2012-06-25 10:15:36,181 INFO
>>>>>>>>> [org.ovirt.engine.core.bll.VdsEventListener] (pool-3-thread-49)
>>>>>>>>> ResourceManager::vdsNotResponding entered for Host
>>>>>>>>> 2e9929c6-bea6-11e1-bfdd-ff11f39c80eb, 10.10.30.177
>>>>>>>>> 2012-06-25 10:15:36,214 ERROR
>>>>>>>>> [org.ovirt.engine.core.bll.VdsNotRespondingTreatmentCommand]
>>>>>>>>> (pool-3-thread-49) [1afd4b89] Failed to run Fence script on
>>>>>>>>> vds:ovirt-node2.smb.eurotux.local, VMs moved to UnKnown instead.
>>>>>>>>>
>>>>>>>>> Meanwhile, on the node, vdsmd fails to sample the NICs:
>>>>>>>>>
>>>>>>>>> in /var/log/vdsm/vdsm.log:
>>>>>>>>>
>>>>>>>>> nf = netinfo.NetInfo()
>>>>>>>>> File "/usr/share/vdsm/netinfo.py", line 268, in __init__
>>>>>>>>> _netinfo = get()
>>>>>>>>> File "/usr/share/vdsm/netinfo.py", line 220, in get
>>>>>>>>> for nic in nics() ])
>>>>>>>>> KeyError: 'p36p1'
>>>>>>>>>
>>>>>>>>> MainThread::INFO::2012-06-25 10:45:09,110::vdsm::76::vds::(run)
>>>>>>>>> VDSM
>>>>>>>>> main thread ended. Waiting for 1 other threads...
>>>>>>>>> MainThread::INFO::2012-06-25 10:45:09,111::vdsm::79::vds::(run)
>>>>>>>>> <_MainThread(MainThread, started 140567823243072)>
>>>>>>>>> MainThread::INFO::2012-06-25 10:45:09,111::vdsm::79::vds::(run)
>>>>>>>>> <Thread(libvirtEventLoop, started daemon 140567752681216)>
>>>>>>>>>
>>>>>>>>> in /var/log/messages there are many "died too quickly" entries for vdsmd:
>>>>>>>>>
>>>>>>>>> Jun 25 10:45:08 ovirt-node2 respawn: slave '/usr/share/vdsm/vdsm'
>>>>>>>>> died too quickly, respawning slave
>>>>>>>>> Jun 25 10:45:08 ovirt-node2 respawn: slave '/usr/share/vdsm/vdsm'
>>>>>>>>> died too quickly, respawning slave
>>>>>>>>> Jun 25 10:45:09 ovirt-node2 respawn: slave '/usr/share/vdsm/vdsm'
>>>>>>>>> died too quickly for more than 30 seconds, master sleeping for 900
>>>>>>>>> seconds
>>>>>>>>>
>>>>>>>>> I don't know why Fedora 17 gives the name p36p1 to what was eth0 in
>>>>>>>>> Fedora 16, but I tried to configure a bridge ovirtmgmt and the only
>>>>>>>>> difference is that the KeyError becomes 'ovirtmgmt'.
>>>>>>>> The nic renaming may have happened due to biosdevname. Do you have it
>>>>>>>> installed? Do any of the /etc/sysconfig/network-scripts/ifcfg-* files
>>>>>>>> refer to an old nic name?
>>>>>>>>
>>>>>>>> Which version of vdsm are you running? It seems that it is
>>>>>>>> pre-v4.9.4-61-g24f8627, which is too old to run on f17: the output of
>>>>>>>> ifconfig has changed. Please retry with the latest beta version
>>>>>>>> https://koji.fedoraproject.org/koji/buildinfo?buildID=327015
>>>>>>>>
>>>>>>>> If the problem persists, could you run vdsm manually, with
>>>>>>>> # su - vdsm -s /bin/bash
>>>>>>>> # cd /usr/share/vdsm
>>>>>>>> # ./vdsm
>>>>>>>> maybe it would give a hint about the crash.
>>>>>>>>
>>>>>>>> regards,
>>>>>>>> Dan.
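
As a side note on the KeyError above: the traceback shows a lookup keyed by NIC name failing, and Dan attributes it to the changed ifconfig output on f17. Below is a minimal sketch of one way to enumerate NICs without parsing ifconfig, by reading /sys/class/net. This is not how vdsm's netinfo.py is implemented; list_nics and lookup_hwaddr are hypothetical names used only for illustration.

```python
import os


def list_nics(sysfs_root="/sys/class/net"):
    """Return {name: hwaddr} for interface directories under sysfs_root.

    Hypothetical helper, not part of vdsm. Entries that are not
    directories (for example a bonding_masters file) are skipped.
    """
    nics = {}
    for name in sorted(os.listdir(sysfs_root)):
        nic_dir = os.path.join(sysfs_root, name)
        if not os.path.isdir(nic_dir):
            continue
        try:
            with open(os.path.join(nic_dir, "address")) as f:
                nics[name] = f.read().strip()
        except (IOError, OSError):
            # No readable address file: keep the name, mark address unknown.
            nics[name] = None
    return nics


def lookup_hwaddr(nics, name):
    """Return the hardware address for name, or None if it is not listed.

    Using dict.get avoids a KeyError when a configured name (eth0, say)
    no longer matches what the kernel reports (p36p1, say).
    """
    return nics.get(name)
```

Calling list_nics() on the node and comparing its keys with the names used in the ifcfg-* files would show whether a stale name is still referenced.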
>>>>>>> Well, thank you. I have updated Vdsm to version 4.10. Now the
>>>>>>> problem is with SSL and XMLRPC.
>>>>>>>
>>>>>>> This is the error on the engine side:
>>>>>>>
>>>>>>> /var/log/ovirt-engine/engine.log
>>>>>>>
>>>>>>> ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand]
>>>>>>> (QuartzScheduler_Worker-52) XML RPC error in command
>>>>>>> GetCapabilitiesVDS ( Vds: ovirt-node2.smb.eurotux.local ), the error
>>>>>>> was: java.util.concurrent.ExecutionException:
>>>>>>> java.lang.reflect.InvocationTargetException,
>>>>>>> NoHttpResponseException: The server 10.10.30.177 failed to respond.
>>>>>>>
>>>>>>> On the node side, there seems to be an authentication
>>>>>>> problem:
>>>>>>>
>>>>>>> /var/log/vdsm/vdsm.log
>>>>>>>
>>>>>>> SSLError: [Errno 1] _ssl.c:504: error:1407609C:SSL
>>>>>>> routines:SSL23_GET_CLIENT_HELLO:http request
>>>>>>> Thread-810::ERROR::2012-06-25
>>>>>>> 12:02:46,351::SecureXMLRPCServer::73::root::(handle_error) client
>>>>>>> ('10.10.30.101', 58605)
>>>>>>> Traceback (most recent call last):
>>>>>>> File "/usr/lib64/python2.7/SocketServer.py", line 582, in
>>>>>>> process_request_thread
>>>>>>> self.finish_request(request, client_address)
>>>>>>> File
>>>>>>> "/usr/lib/python2.7/site-packages/vdsm/SecureXMLRPCServer.py", line
>>>>>>> 66, in finish_request
>>>>>>> request.do_handshake()
>>>>>>> File "/usr/lib64/python2.7/ssl.py", line 305, in do_handshake
>>>>>>> self._sslobj.do_handshake()
>>>>>>> SSLError: [Errno 1] _ssl.c:504: error:1407609C:SSL
>>>>>>> routines:SSL23_GET_CLIENT_HELLO:http request
>>>>>>>
>>>>>>> In /var/log/messages there is an:
>>>>>>>
>>>>>>> vdsm [5834]: vdsm root ERROR client ()
>>>>>>>
>>>>>>> with the ip address of the engine.
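
A note on reading that SSLError: the "http request" reason is what OpenSSL reports when the first bytes it receives on an SSL socket look like a plain HTTP request line instead of a ClientHello. XML-RPC calls are HTTP POSTs, so this error is consistent with a client speaking unencrypted HTTP to a vdsm that expects SSL. The sketch below only illustrates that distinction; classify_first_bytes is a hypothetical helper, not part of vdsm or OpenSSL, and the list of HTTP methods is an assumption for the example.

```python
def classify_first_bytes(data):
    """Guess whether the first bytes from a client are TLS or plain HTTP.

    Hypothetical illustration only. A TLS/SSLv3 handshake record starts
    with content type 0x16 followed by major version 0x03; a plain HTTP
    request starts with an ASCII method name and a space.
    """
    http_methods = (b"GET ", b"POST ", b"PUT ", b"HEAD ")
    if data[:2] == b"\x16\x03":
        return "tls"
    if data.startswith(http_methods):
        return "http"
    return "unknown"
```

For example, classify_first_bytes(b"POST /RPC2 HTTP/1.0\r\n") returns "http", which is the situation the error message describes.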
>>>>>> Hmm... Do you have ssl=true in your /etc/vdsm/vdsm.conf ?
>>>>>> Does vdsm respond locally to
>>>>>>
>>>>>> vdsClient -s 0 getVdsCaps
>>>>>>
>>>>>> (Maybe your local certificates and key were corrupted, and you
>>>>>> will have to re-install the host from Engine in order to create
>>>>>> a new set)
>>>>>>
>>>>> I have recreated the db and run engine-setup again. I have tried
>>>>> with ssl = true both commented and uncommented on the node.
>>>>> vdsClient -s 0 getVdsCaps works locally and provides the information
>>>>> of the host, but something seems to be preventing it from reaching
>>>>> the engine. I am still getting the same error. The installer is not
>>>>> beginning. I can ssh as root to the host and vdsmd is alive.
>>>> What do you mean by "The installer is not beginning"?
>>>> Could you review your /etc/pki/vdsm/certs/ and check that they have
>>>> been generated by *your* engine? Is the cacert the same as the one on
>>>> the Engine machine?
>>>>
>>>> Dan.
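
Dan's question about whether the cacert on the node is the same as the one on the Engine machine can be checked by comparing fingerprints. The following is a minimal sketch, assuming both certificates are available as plain PEM text (a single BEGIN/END CERTIFICATE block with nothing before or after it). cert_fingerprint and same_certificate are hypothetical helper names; which files to feed in (for example a cacert file under /etc/pki/vdsm/certs/ and a copy of the engine's CA certificate) is an assumption to verify on your own setup.

```python
import hashlib
import ssl


def cert_fingerprint(pem_text):
    """Return the SHA-256 hex digest of a PEM certificate's DER bytes.

    Decoding to DER first means differences in line wrapping or
    trailing whitespace between two PEM files do not matter.
    """
    der = ssl.PEM_cert_to_DER_cert(pem_text.strip())
    return hashlib.sha256(der).hexdigest()


def same_certificate(pem_a, pem_b):
    """True if both PEM texts encode the same certificate bytes."""
    return cert_fingerprint(pem_a) == cert_fingerprint(pem_b)
```

Reading each file with open(path).read() and passing the two strings to same_certificate would tell you whether the node still holds a CA certificate that did not come from your engine.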
>>> On the engine server I have a self-signed certificate. I set up a
>>> cacert and server.pem to avoid libvirtd complaining about gssapi.
>>> The one on the node is issued by the VDSM Certificate Authority, so
>>> I suppose it is set up by the vdsm package installation.
>> So here lies the problem. When Vdsm is first installed, it generates its
>> own self-signed certificate. This, by definition, does not help to
>> identify it for Engine.
>>
>> When you add a host to a data center, Engine logs into the host and
>> asks the host to produce a *new* key. Engine's CA then signs the key,
>> and puts the cert back under /etc/pki/vdsm/certs.
>>
>> Something has gone wrong in this process. If you re-install the host it
>> *should* overwrite the current keys. If not - it is a bug. Could you look
>> at the installation logs (they sit in a random dir under /tmp)?
>> Maybe there's a clue there why you keep the default
>> good-for-almost-nothing keys that come with vdsm.
>>
>> Dan.
> Yeah, there lies the problem. There is no installation process that I
> am aware of. The error seems to be raised in the first transaction,
> getting the capabilities of the node. As there is no do_handshake or

getting the capabilities already happens *after* the installation, which
takes place when you "add host".
please remove and re-add the host to engine to re-create the certificates.
(assuming it's a clean install of engine, and you didn't change its
config to installVds=false)

> whatever it is called by vdsm, the installation progress does not begin
> and the host is considered unresponsive and stored with the info I
> provided, hostname and IP address, no more.
>
> The host just reports:
>
> File "/usr/lib64/python2.7/ssl.py", line 305, in do_handshake
> self._sslobj.do_handshake()
> SSLError: [Errno 1] _ssl.c:504: error:1407609C:SSL
> routines:SSL23_GET_CLIENT_HELLO:http request
>
> and in /var/log/messages a series of not-very-promising messages appears:
>
> Jun 25 16:53:19 ovirt-node2 vdsm root ERROR client ('10.10.30.101', 57035)
> Jun 25 16:53:21 ovirt-node2 vdsm root ERROR client ('10.10.30.101', 35856)
> Jun 25 16:53:21 ovirt-node2 vdsm root ERROR client ('10.10.30.101', 33413)
> Jun 25 16:53:21 ovirt-node2 vdsm root ERROR client ('10.10.30.101', 60822)
> Jun 25 16:53:21 ovirt-node2 vdsm root ERROR client ('10.10.30.101', 61000)
>
> Regards.