[ovirt-users] vdsm storage problem - maybe cache problem?

ml at ohnewald.net ml at ohnewald.net
Wed May 20 10:28:53 UTC 2015


Hello List,

I really need some help here... could someone please give it a try? I 
already tried to get help on the IRC channel, but it seems my problem is 
too complicated, or maybe I am not providing useful information?

DB & vdsClient: http://fpaste.org/223483/14320588/ (I think this part is 
very interesting)

engine.log: http://paste.fedoraproject.org/223349/04494414
Node02 vdsm Log: http://paste.fedoraproject.org/223350/43204496
Node01 vdsm Log: http://paste.fedoraproject.org/223347/20448951

Why does my vdsm look for StorageDomain 
036b5575-51fa-4f14-8b05-890d7807894c? => This was an NFS export which I 
deleted from the GUI yesterday (!!!).
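
If it helps, the host side can also be queried directly. This is only a
rough sketch (assuming vdsClient talks SSL to the local vdsm, as on a
default install; drop -s otherwise):

vdsClient -s 0 getStorageDomainsList
vdsClient -s 0 getConnectedStoragePoolsList
vdsClient -s 0 getStorageDomainInfo 036b5575-51fa-4f14-8b05-890d7807894c

I would expect the last one to fail with the same "domain not found" that
the vdsm log shows.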

From the Database Log/Dump:
=============================
USER_FORCE_REMOVE_STORAGE_DOMAIN        981     0       Storage Domain 
EXPORT2 was forcibly removed by admin at internal   f 
b384b3da-02a6-44f3-a3f6-56751ce8c26d    HP_Proliant_DL180G6 
036b5575-51fa-4f14-8b05-890d7807894c    EXPORT2 
00000000-0000-0000-0000-000000000000            c1754ff 
807321f6-2043-4a26-928c-0ce6b423c381
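
To rule out a leftover on the engine side, I guess the engine database can
be checked for the removed domain as well. Only a sketch, assuming the
default "engine" database and that I remember the 3.x table names
correctly:

su - postgres
psql engine -c "select id, storage_name from storage_domain_static where id = '036b5575-51fa-4f14-8b05-890d7807894c';"
psql engine -c "select * from storage_pool_iso_map where storage_id = '036b5575-51fa-4f14-8b05-890d7807894c';"

If the force remove really went through, both queries should come back
empty.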


I already put one node into maintenance, rebooted it and activated it 
again. Still the same problem.

Some screenshots:
--------------------
http://postimg.org/image/8zo4ujgjb/
http://postimg.org/image/le918grdr/
http://postimg.org/image/wnawwhrgh/


My GlusterFS works fine and is not the problem here, I guess:
==============================================================

2015-05-19 04:09:06,292 INFO 
[org.ovirt.engine.core.vdsbroker.vdsbroker.HSMClearTaskVDSCommand] 
(DefaultQuartzScheduler_Worker-43) [3df9132b] START, 
HSMClearTaskVDSCommand(HostName = ovirt-node02.stuttgart.imos.net, 
HostId = 6948da12-0b8a-4b6d-a9af-162e6c25dad3, 
taskId=2aeec039-5b95-40f0-8410-da62b44a28e8), log id: 19b18840
2015-05-19 04:09:06,337 INFO 
[org.ovirt.engine.core.vdsbroker.vdsbroker.HSMClearTaskVDSCommand] 
(DefaultQuartzScheduler_Worker-43) [3df9132b] FINISH, 
HSMClearTaskVDSCommand, log id: 19b18840
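
For what it's worth, this is roughly how I convinced myself that the
Gluster side is healthy (standard gluster CLI on one of the nodes):

gluster peer status
gluster volume status
gluster volume info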

I already had a chat with Maor Lipchuk and he told me to add another 
host to the data center and then re-initialize it. I will migrate back to 
ESXi for now to free up those two nodes. Then we can mess with it if 
anyone is interested in helping me. Otherwise I will have to stay with ESXi :-(

Does anyone else have an idea until then? Why is my vdsm host all messed 
up with that zombie StorageDomain?
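
As far as I understand the vdsm layout, the domain list of a pool lives in
the pool metadata on the master domain, and vdsm creates links for each
domain under /rhev/data-center/<pool-uuid>/. So, if I read that right, a
place to look for the stale UUID would be something like this (sketch;
<mountpoint> and <master-sd-uuid> are placeholders for my setup):

ls -l /rhev/data-center/b384b3da-02a6-44f3-a3f6-56751ce8c26d/
grep POOL_DOMAINS /rhev/data-center/mnt/<mountpoint>/<master-sd-uuid>/dom_md/metadata

If 036b5575-51fa-4f14-8b05-890d7807894c still shows up there, that would at
least explain where the domain monitor keeps getting it from.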


Thanks for your time. I really, really appreciate it.

Mario






On 19.05.15 at 14:57, ml at ohnewald.net wrote:
> Hello List,
>
> Okay, I really need some help now. I stopped vdsmd for a little bit too
> long, fencing stepped in and rebooted node01.
>
> I now can NOT start any VM because the storage is marked "Unknown".
>
>
> Since node01 rebooted, I am wondering why the hell it is still looking for
> this StorageDomain:
>
> StorageDomainCache::(_findDomain) domain
> 036b5575-51fa-4f14-8b05-890d7807894c not found
>
> Can anyone please tell me exactly where this ID comes from?
>
>
> Thanks,
> Mario
>
>
>
>
>
> On 18.05.15 at 14:21, Maor Lipchuk wrote:
>> Hi Mario,
>>
>> Can you try to mount this directly from the host?
>> Can you please attach the VDSM and engine logs?
>>
>> Thanks,
>> Maor
>>
>>
>> ----- Original Message -----
>>> From: ml at ohnewald.net
>>> To: "Maor Lipchuk" <mlipchuk at redhat.com>
>>> Cc: users at ovirt.org
>>> Sent: Monday, May 18, 2015 2:36:38 PM
>>> Subject: Re: [ovirt-users] vdsm storage problem - maybe cache problem?
>>>
>>> Hi Maor,
>>>
>>> thanks for the quick reply.
>>>
>>> On 18.05.15 at 13:25, Maor Lipchuk wrote:
>>>
>>>>> Now my question: why does the vdsm node not know that I deleted the
>>>>> storage? Has vdsm cached this mount information? Why does it still
>>>>> try to access 036b5575-51fa-4f14-8b05-890d7807894c?
>>>>
>>>>
>>>> Yes, vdsm uses a cache for storage domains; you can try to restart
>>>> the vdsmd service instead of rebooting the host.
>>>>
>>>
>>> I am still getting the same error.
>>>
>>>
>>> [root at ovirt-node01 ~]# /etc/init.d/vdsmd stop
>>> Shutting down vdsm daemon:
>>> vdsm watchdog stop                                         [  OK  ]
>>> vdsm: Running run_final_hooks                              [  OK  ]
>>> vdsm stop                                                  [  OK  ]
>>> [root at ovirt-node01 ~]#
>>> [root at ovirt-node01 ~]#
>>> [root at ovirt-node01 ~]#
>>> [root at ovirt-node01 ~]# ps aux | grep vdsmd
>>> root      3198  0.0  0.0  11304   740 ?        S<   May07   0:00
>>> /bin/bash -e /usr/share/vdsm/respawn --minlifetime 10 --daemon
>>> --masterpid /var/run/vdsm/supervdsm_respawn.pid
>>> /usr/share/vdsm/supervdsmServer --sockfile /var/run/vdsm/svdsm.sock
>>> --pidfile /var/run/vdsm/supervdsmd.pid
>>> root      3205  0.0  0.0 922368 26724 ?        S<l  May07  12:10
>>> /usr/bin/python /usr/share/vdsm/supervdsmServer --sockfile
>>> /var/run/vdsm/svdsm.sock --pidfile /var/run/vdsm/supervdsmd.pid
>>> root     15842  0.0  0.0 103248   900 pts/0    S+   13:35   0:00 grep
>>> vdsmd
>>>
>>>
>>> [root at ovirt-node01 ~]# /etc/init.d/vdsmd start
>>> initctl: Job is already running: libvirtd
>>> vdsm: Running mkdirs
>>> vdsm: Running configure_coredump
>>> vdsm: Running configure_vdsm_logs
>>> vdsm: Running run_init_hooks
>>> vdsm: Running gencerts
>>> vdsm: Running check_is_configured
>>> libvirt is already configured for vdsm
>>> sanlock service is already configured
>>> vdsm: Running validate_configuration
>>> SUCCESS: ssl configured to true. No conflicts
>>> vdsm: Running prepare_transient_repository
>>> vdsm: Running syslog_available
>>> vdsm: Running nwfilter
>>> vdsm: Running dummybr
>>> vdsm: Running load_needed_modules
>>> vdsm: Running tune_system
>>> vdsm: Running test_space
>>> vdsm: Running test_lo
>>> vdsm: Running restore_nets
>>> vdsm: Running unified_network_persistence_upgrade
>>> vdsm: Running upgrade_300_nets
>>> Starting up vdsm daemon:
>>> vdsm start                                                 [  OK  ]
>>> [root at ovirt-node01 ~]#
>>>
>>> [root at ovirt-node01 ~]# grep ERROR /var/log/vdsm/vdsm.log | tail -n 20
>>> Thread-13::ERROR::2015-05-18
>>> 13:35:03,631::sdc::137::Storage.StorageDomainCache::(_findDomain)
>>> looking for unfetched domain abc51e26-7175-4b38-b3a8-95c6928fbc2b
>>> Thread-13::ERROR::2015-05-18
>>> 13:35:03,632::sdc::154::Storage.StorageDomainCache::(_findUnfetchedDomain)
>>>
>>> looking for domain abc51e26-7175-4b38-b3a8-95c6928fbc2b
>>> Thread-36::ERROR::2015-05-18
>>> 13:35:11,607::sdc::137::Storage.StorageDomainCache::(_findDomain)
>>> looking for unfetched domain 036b5575-51fa-4f14-8b05-890d7807894c
>>> Thread-36::ERROR::2015-05-18
>>> 13:35:11,621::sdc::154::Storage.StorageDomainCache::(_findUnfetchedDomain)
>>>
>>> looking for domain 036b5575-51fa-4f14-8b05-890d7807894c
>>> Thread-36::ERROR::2015-05-18
>>> 13:35:11,960::sdc::143::Storage.StorageDomainCache::(_findDomain) domain
>>> 036b5575-51fa-4f14-8b05-890d7807894c not found
>>> Thread-36::ERROR::2015-05-18
>>> 13:35:11,960::domainMonitor::239::Storage.DomainMonitorThread::(_monitorDomain)
>>>
>>> Error while collecting domain 036b5575-51fa-4f14-8b05-890d7807894c
>>> monitoring information
>>> Thread-36::ERROR::2015-05-18
>>> 13:35:21,962::sdc::137::Storage.StorageDomainCache::(_findDomain)
>>> looking for unfetched domain 036b5575-51fa-4f14-8b05-890d7807894c
>>> Thread-36::ERROR::2015-05-18
>>> 13:35:21,965::sdc::154::Storage.StorageDomainCache::(_findUnfetchedDomain)
>>>
>>> looking for domain 036b5575-51fa-4f14-8b05-890d7807894c
>>> Thread-36::ERROR::2015-05-18
>>> 13:35:22,068::sdc::143::Storage.StorageDomainCache::(_findDomain) domain
>>> 036b5575-51fa-4f14-8b05-890d7807894c not found
>>> Thread-36::ERROR::2015-05-18
>>> 13:35:22,072::domainMonitor::239::Storage.DomainMonitorThread::(_monitorDomain)
>>>
>>> Error while collecting domain 036b5575-51fa-4f14-8b05-890d7807894c
>>> monitoring information
>>> Thread-15::ERROR::2015-05-18
>>> 13:35:33,821::task::866::TaskManager.Task::(_setError)
>>> Task=`54bdfc77-f63a-493b-b24e-e5a3bc4977bb`::Unexpected error
>>> Thread-15::ERROR::2015-05-18
>>> 13:35:33,864::dispatcher::65::Storage.Dispatcher.Protect::(run)
>>> {'status': {'message': "Unknown pool id, pool not connected:
>>> ('b384b3da-02a6-44f3-a3f6-56751ce8c26d',)", 'code': 309}}
>>> Thread-13::ERROR::2015-05-18
>>> 13:35:33,930::sdc::137::Storage.StorageDomainCache::(_findDomain)
>>> looking for unfetched domain abc51e26-7175-4b38-b3a8-95c6928fbc2b
>>> Thread-15::ERROR::2015-05-18
>>> 13:35:33,928::task::866::TaskManager.Task::(_setError)
>>> Task=`fe9bb0fa-cf1e-4b21-af00-0698c6d1718f`::Unexpected error
>>> Thread-13::ERROR::2015-05-18
>>> 13:35:33,932::sdc::154::Storage.StorageDomainCache::(_findUnfetchedDomain)
>>>
>>> looking for domain abc51e26-7175-4b38-b3a8-95c6928fbc2b
>>> Thread-15::ERROR::2015-05-18
>>> 13:35:33,978::dispatcher::65::Storage.Dispatcher.Protect::(run)
>>> {'status': {'message': 'Not SPM: ()', 'code': 654}}
>>> Thread-36::ERROR::2015-05-18
>>> 13:35:41,117::sdc::137::Storage.StorageDomainCache::(_findDomain)
>>> looking for unfetched domain 036b5575-51fa-4f14-8b05-890d7807894c
>>> Thread-36::ERROR::2015-05-18
>>> 13:35:41,131::sdc::154::Storage.StorageDomainCache::(_findUnfetchedDomain)
>>>
>>> looking for domain 036b5575-51fa-4f14-8b05-890d7807894c
>>> Thread-36::ERROR::2015-05-18
>>> 13:35:41,452::sdc::143::Storage.StorageDomainCache::(_findDomain) domain
>>> 036b5575-51fa-4f14-8b05-890d7807894c not found
>>> Thread-36::ERROR::2015-05-18
>>> 13:35:41,453::domainMonitor::239::Storage.DomainMonitorThread::(_monitorDomain)
>>>
>>> Error while collecting domain 036b5575-51fa-4f14-8b05-890d7807894c
>>> monitoring information
>>>
>>>
>>> Thanks,
>>> Mario
>>>
>>>
>





