[Users] oVirt storage is down and doesn't come up

Yuval M yuvalme at gmail.com
Wed Apr 17 13:56:55 UTC 2013


1. we do not have the logs from before the problem.
2.
--------
$ tree /rhev/data-center/
/rhev/data-center/
├── hsm-tasks
└── mnt
    ├── bufferoverflow.home:_home_BO__ISO__Domain
    │   ├── 45d24e2a-705e-440f-954c-fda3cab61298
    │   │   ├── dom_md
    │   │   │   ├── ids
    │   │   │   ├── inbox
    │   │   │   ├── leases
    │   │   │   ├── metadata
    │   │   │   └── outbox
    │   │   └── images
    │   │       └── 11111111-1111-1111-1111-111111111111
    │   │           ├── Fedora-18-x86_64-DVD.iso
    │   │           └── Fedora-18-x86_64-Live-Desktop.iso
    │   └── __DIRECT_IO_TEST__
    ├── bufferoverflow.home:_home_BO__Ovirt__Storage
    └── kernelpanic.home:_home_KP__Data__Domain
        ├── a8286508-db45-40d7-8645-e573f6bacdc7
        │   ├── dom_md
        │   │   ├── ids
        │   │   ├── inbox
        │   │   ├── leases
        │   │   ├── metadata
        │   │   └── outbox
        │   └── images
        │       ├── 0df45336-de35-4dc0-9958-95b27d5d4701
        │       │   ├── 0d33efc8-a608-439f-abe2-43884c1ce72d
        │       │   ├── 0d33efc8-a608-439f-abe2-43884c1ce72d.lease
        │       │   ├── 0d33efc8-a608-439f-abe2-43884c1ce72d.meta
        │       │   ├── b245184f-f8e3-479b-8559-8b6af2473b7c
        │       │   ├── b245184f-f8e3-479b-8559-8b6af2473b7c.lease
        │       │   └── b245184f-f8e3-479b-8559-8b6af2473b7c.meta
        │       ├── 0e1ebaf7-3909-44cd-8560-d05a63eb4c4e
        │       │   ├── 0d33efc8-a608-439f-abe2-43884c1ce72d
        │       │   ├── 0d33efc8-a608-439f-abe2-43884c1ce72d.lease
        │       │   ├── 0d33efc8-a608-439f-abe2-43884c1ce72d.meta
        │       │   ├── 562b9043-bde8-4595-bbea-fa8871f0e19e
        │       │   ├── 562b9043-bde8-4595-bbea-fa8871f0e19e.lease
        │       │   └── 562b9043-bde8-4595-bbea-fa8871f0e19e.meta
        │       ├── 32ebb85a-0dde-47fe-90c7-7f4fb2c0f1e5
        │       │   ├── 0d33efc8-a608-439f-abe2-43884c1ce72d
        │       │   ├── 0d33efc8-a608-439f-abe2-43884c1ce72d.lease
        │       │   ├── 0d33efc8-a608-439f-abe2-43884c1ce72d.meta
        │       │   ├── 4774095e-db3d-4561-8284-53eabfd28f66
        │       │   ├── 4774095e-db3d-4561-8284-53eabfd28f66.lease
        │       │   └── 4774095e-db3d-4561-8284-53eabfd28f66.meta
        │       └── a7e13a25-1694-4509-9e6b-e88583a4d970
        │           ├── 0d33efc8-a608-439f-abe2-43884c1ce72d
        │           ├── 0d33efc8-a608-439f-abe2-43884c1ce72d.lease
        │           └── 0d33efc8-a608-439f-abe2-43884c1ce72d.meta
        └── __DIRECT_IO_TEST__

16 directories, 35 files

--------------------
3. We have 3 domains:
BO_Ovirt_Storage (data domain, on the same machine as the engine and vdsm, via NFS)
BO_ISO_Domain (ISO domain, on the same machine, via NFS)
KP_Data_Domain (data domain on an NFS mount on a different machine)
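
As the tree above shows, the BO_Ovirt_Storage mount point under
/rhev/data-center/mnt is empty. For reference, these are the read-only checks
we are running on the vdsm host to verify the master data domain (a minimal
sketch using the paths and UUID already quoted in this thread; the rotated-log
path assumes vdsm's default /var/log/vdsm location):
--------
# Is the master data domain export mounted where vdsm expects it?
mount | grep BO_Ovirt_Storage
ls -la '/rhev/data-center/mnt/bufferoverflow.home:_home_BO__Ovirt__Storage'

# The export itself should still contain the master domain directory:
ls -la /home/BO_Ovirt_Storage/1083422e-a5db-41b6-b667-b9ef1ef244f0

# Rotated vdsm logs that may still cover the time before the domains went down:
ls -lt /var/log/vdsm/vdsm.log*
--------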

Yuval



On Wed, Apr 17, 2013 at 4:28 PM, Yeela Kaplan <ykaplan at redhat.com> wrote:

> Hi Limor,
> 1) Your log starts exactly after the vdsm restart. I need to see the full
> vdsm log from before the domains went down in order to understand the
> problem. Can you attach it?
> 2) Can you send the printout of 'tree /rhev/data-center/'?
> 3) How many domains are attached to your DC, and what type are they (ISO,
> export, data)? And the DC is NFS, right?
>
> Thanks,
> Yeela
>
> ----- Original Message -----
> > From: "Limor Gavish" <lgavish at gmail.com>
> > To: "Tal Nisan" <tnisan at redhat.com>
> > Cc: "Yuval M" <yuvalme at gmail.com>, users at ovirt.org, "Nezer Zaidenberg" <
> nzaidenberg at mac.com>
> > Sent: Monday, April 15, 2013 5:10:16 PM
> > Subject: Re: [Users] oVirt storage is down and doesn't come up
> >
> > Thank you very much for your reply.
> > I ran the commands you asked for (see below), but a directory named after
> > the UUID of the master domain is not mounted. We tried restarting VDSM and
> > then the entire machine, but it didn't help.
> > We did manage to manually mount /home/BO_Ovirt_Storage on a temporary
> > directory.
> >
> > postgres=# \connect engine;
> > You are now connected to database "engine" as user "postgres".
> > engine=# select current_database();
> > current_database
> > ------------------
> > engine
> > (1 row)
> > engine=# select sds.id, ssc.connection from storage_domain_static sds
> >   join storage_server_connections ssc on sds.storage = ssc.id
> >   where sds.id = '1083422e-a5db-41b6-b667-b9ef1ef244f0';
> >                   id                  |                 connection
> > --------------------------------------+--------------------------------------------
> >  1083422e-a5db-41b6-b667-b9ef1ef244f0 | bufferoverflow.home:/home/BO_Ovirt_Storage
> > (1 row)
> >
> > [wil at bufferoverflow ~] $ mount
> > proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
> > sysfs on /sys type sysfs (rw,nosuid,nodev,noexec,relatime)
> > devtmpfs on /dev type devtmpfs
> > (rw,nosuid,size=8131256k,nr_inodes=2032814,mode=755)
> > securityfs on /sys/kernel/security type securityfs
> > (rw,nosuid,nodev,noexec,relatime)
> > tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev)
> > devpts on /dev/pts type devpts
> > (rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000)
> > tmpfs on /run type tmpfs (rw,nosuid,nodev,mode=755)
> > tmpfs on /sys/fs/cgroup type tmpfs (rw,nosuid,nodev,noexec,mode=755)
> > cgroup on /sys/fs/cgroup/systemd type cgroup
> >
> (rw,nosuid,nodev,noexec,relatime,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd)
> > cgroup on /sys/fs/cgroup/cpuset type cgroup
> > (rw,nosuid,nodev,noexec,relatime,cpuset)
> > cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup
> > (rw,nosuid,nodev,noexec,relatime,cpuacct,cpu)
> > cgroup on /sys/fs/cgroup/memory type cgroup
> > (rw,nosuid,nodev,noexec,relatime,memory)
> > cgroup on /sys/fs/cgroup/devices type cgroup
> > (rw,nosuid,nodev,noexec,relatime,devices)
> > cgroup on /sys/fs/cgroup/freezer type cgroup
> > (rw,nosuid,nodev,noexec,relatime,freezer)
> > cgroup on /sys/fs/cgroup/net_cls type cgroup
> > (rw,nosuid,nodev,noexec,relatime,net_cls)
> > cgroup on /sys/fs/cgroup/blkio type cgroup
> > (rw,nosuid,nodev,noexec,relatime,blkio)
> > cgroup on /sys/fs/cgroup/perf_event type cgroup
> > (rw,nosuid,nodev,noexec,relatime,perf_event)
> > /dev/sda3 on / type ext4 (rw,relatime,data=ordered)
> > rpc_pipefs on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw,relatime)
> > debugfs on /sys/kernel/debug type debugfs (rw,relatime)
> > sunrpc on /proc/fs/nfsd type nfsd (rw,relatime)
> > hugetlbfs on /dev/hugepages type hugetlbfs (rw,relatime)
> > systemd-1 on /proc/sys/fs/binfmt_misc type autofs
> > (rw,relatime,fd=34,pgrp=1,timeout=300,minproto=5,maxproto=5,direct)
> > mqueue on /dev/mqueue type mqueue (rw,relatime)
> > tmpfs on /tmp type tmpfs (rw)
> > configfs on /sys/kernel/config type configfs (rw,relatime)
> > binfmt_misc on /proc/sys/fs/binfmt_misc type binfmt_misc (rw,relatime)
> > /dev/sda5 on /home type ext4 (rw,relatime,data=ordered)
> > /dev/sda1 on /boot type ext4 (rw,relatime,data=ordered)
> > kernelpanic.home:/home/KP_Data_Domain on
> > /rhev/data-center/mnt/kernelpanic.home:_home_KP__Data__Domain type nfs
> >
> (rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,soft,nosharecache,proto=tcp,timeo=600,retrans=6,sec=sys,mountaddr=10.100.101.100,mountvers=3,mountport=20048,mountproto=udp,local_lock=none,addr=10.100.101.100)
> > bufferoverflow.home:/home/BO_ISO_Domain on
> > /rhev/data-center/mnt/bufferoverflow.home:_home_BO__ISO__Domain type nfs
> >
> (rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,soft,nosharecache,proto=tcp,timeo=600,retrans=6,sec=sys,mountaddr=10.100.101.108,mountvers=3,mountport=20048,mountproto=udp,local_lock=none,addr=10.100.101.108)
> >
> > [wil at bufferoverflow ~]$ ls -la /home/
> > total 36
> > drwxr-xr-x. 6 root root 4096 Mar 22 11:25 .
> > dr-xr-xr-x. 19 root root 4096 Apr 12 18:53 ..
> > drwxr-xr-x. 3 vdsm kvm 4096 Mar 27 17:33 BO_ISO_Domain
> > drwxr-xr-x. 3 vdsm kvm 4096 Mar 27 17:33 BO_Ovirt_Storage
> > drwx------. 2 root root 16384 Mar 6 09:11 lost+found
> > drwx------. 27 wil wil 4096 Apr 15 01:50 wil
> > [wil at bufferoverflow ~]$ cd /home/BO_Ovirt_Storage/
> > [wil at bufferoverflow BO_Ovirt_Storage]$ ls -la
> > total 12
> > drwxr-xr-x. 3 vdsm kvm 4096 Mar 27 17:33 .
> > drwxr-xr-x. 6 root root 4096 Mar 22 11:25 ..
> > drwxr-xr-x 5 vdsm kvm 4096 Mar 20 23:06
> 1083422e-a5db-41b6-b667-b9ef1ef244f0
> > -rwxr-xr-x 1 vdsm kvm 0 Mar 27 17:33 __DIRECT_IO_TEST__
> >
> > Thanks,
> > Limor
> >
> >
> > On Mon, Apr 15, 2013 at 4:02 PM, Tal Nisan < tnisan at redhat.com > wrote:
> >
> > Hi Limor,
> > First, we should check which mount is the master storage domain that is
> > reported as not found. This should be checked against the oVirt server
> > database, so please run:
> >
> > select sds.id, ssc.connection from storage_domain_static sds join
> > storage_server_connections ssc on sds.storage = ssc.id
> > where sds.id = '1083422e-a5db-41b6-b667-b9ef1ef244f0';
> >
> > You can run this via psql or a Postgres UI if you have one.
> > In the results you will see the storage connection in the format
> > %hostname%:/%mountName%. Then, on the VDSM server, check in the mount list
> > that it is mounted; the mount itself should contain a directory named
> > after the UUID of the master domain. Let me know the result.
> >
> > Tal.
> >
> > On 04/12/2013 07:29 PM, Limor Gavish wrote:
> >
> > Hi,
> >
> > For some reason, without our doing anything, all the storage domains went
> > down, and restarting VDSM or the entire machine does not bring them up.
> > I am not using LVM.
> > The following errors appear several times in vdsm.log (full logs are
> > attached):
> >
> > Thread-22::WARNING::2013-04-12
> > 19:00:08,597::lvm::378::Storage.LVM::(_reloadvgs) lvm vgs failed: 5 [] ['
> > Volume group "1083422e-a5db-41b6-b667-b9ef1ef244f0" not found']
> > Thread-22::DEBUG::2013-04-12
> > 19:00:08,598::lvm::402::OperationMutex::(_reloadvgs) Operation 'lvm
> reload
> > operation' released the operation mutex
> > Thread-22::DEBUG::2013-04-12
> > 19:00:08,681::resourceManager::615::ResourceManager::(releaseResource)
> > Trying to release resource 'Storage.5849b030-626e-47cb-ad90-3ce782d831b3'
> > Thread-22::DEBUG::2013-04-12
> > 19:00:08,681::resourceManager::634::ResourceManager::(releaseResource)
> > Released resource 'Storage.5849b030-626e-47cb-ad90-3ce782d831b3' (0
> active
> > users)
> > Thread-22::DEBUG::2013-04-12
> > 19:00:08,681::resourceManager::640::ResourceManager::(releaseResource)
> > Resource 'Storage.5849b030-626e-47cb-ad90-3ce782d831b3' is free, finding
> out
> > if anyone is waiting for it.
> > Thread-22::DEBUG::2013-04-12
> > 19:00:08,682::resourceManager::648::ResourceManager::(releaseResource) No
> > one is waiting for resource
> 'Storage.5849b030-626e-47cb-ad90-3ce782d831b3',
> > Clearing records.
> > Thread-22::ERROR::2013-04-12
> > 19:00:08,682::task::850::TaskManager.Task::(_setError)
> > Task=`e35a22ac-771a-4916-851f-2fe9d60a0ae6`::Unexpected error
> > Traceback (most recent call last):
> > File "/usr/share/vdsm/storage/task.py", line 857, in _run
> > return fn(*args, **kargs)
> > File "/usr/share/vdsm/logUtils.py", line 45, in wrapper
> > res = f(*args, **kwargs)
> > File "/usr/share/vdsm/storage/hsm.py", line 939, in connectStoragePool
> > masterVersion, options)
> > File "/usr/share/vdsm/storage/hsm.py", line 986, in _connectStoragePool
> > res = pool.connect(hostID, scsiKey, msdUUID, masterVersion)
> > File "/usr/share/vdsm/storage/sp.py", line 695, in connect
> > self.__rebuild(msdUUID=msdUUID, masterVersion=masterVersion)
> > File "/usr/share/vdsm/storage/sp.py", line 1232, in __rebuild
> > masterVersion=masterVersion)
> > File "/usr/share/vdsm/storage/sp.py", line 1576, in getMasterDomain
> > raise se.StoragePoolMasterNotFound(self.spUUID, msdUUID)
> > StoragePoolMasterNotFound: Cannot find master domain:
> > 'spUUID=5849b030-626e-47cb-ad90-3ce782d831b3,
> > msdUUID=1083422e-a5db-41b6-b667-b9ef1ef244f0'
> > Thread-22::DEBUG::2013-04-12
> > 19:00:08,685::task::869::TaskManager.Task::(_run)
> > Task=`e35a22ac-771a-4916-851f-2fe9d60a0ae6`::Task._run:
> > e35a22ac-771a-4916-851f-2fe9d60a0ae6
> > ('5849b030-626e-47cb-ad90-3ce782d831b3', 1,
> > '5849b030-626e-47cb-ad90-3ce782d831b3',
> > '1083422e-a5db-41b6-b667-b9ef1ef244f0', 3942) {} failed - stopping task
> > Thread-22::DEBUG::2013-04-12
> > 19:00:08,685::task::1194::TaskManager.Task::(stop)
> > Task=`e35a22ac-771a-4916-851f-2fe9d60a0ae6`::stopping in state preparing
> > (force False)
> > Thread-22::DEBUG::2013-04-12
> > 19:00:08,685::task::974::TaskManager.Task::(_decref)
> > Task=`e35a22ac-771a-4916-851f-2fe9d60a0ae6`::ref 1 aborting True
> > Thread-22::INFO::2013-04-12
> > 19:00:08,686::task::1151::TaskManager.Task::(prepare)
> > Task=`e35a22ac-771a-4916-851f-2fe9d60a0ae6`::aborting: Task is aborted:
> > 'Cannot find master domain' - code 304
> >
> > [wil at bufferoverflow ~]$ sudo vgs --noheadings --units b --nosuffix
> > --separator \| -o
> >
> uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free
> > No volume groups found
> >
> > [wil at bufferoverflow ~]$ mount
> > proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
> > sysfs on /sys type sysfs (rw,nosuid,nodev,noexec,relatime)
> > devtmpfs on /dev type devtmpfs
> > (rw,nosuid,size=8131256k,nr_inodes=2032814,mode=755)
> > securityfs on /sys/kernel/security type securityfs
> > (rw,nosuid,nodev,noexec,relatime)
> > tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev)
> > devpts on /dev/pts type devpts
> > (rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000)
> > tmpfs on /run type tmpfs (rw,nosuid,nodev,mode=755)
> > tmpfs on /sys/fs/cgroup type tmpfs (rw,nosuid,nodev,noexec,mode=755)
> > cgroup on /sys/fs/cgroup/systemd type cgroup
> >
> (rw,nosuid,nodev,noexec,relatime,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd)
> > cgroup on /sys/fs/cgroup/cpuset type cgroup
> > (rw,nosuid,nodev,noexec,relatime,cpuset)
> > cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup
> > (rw,nosuid,nodev,noexec,relatime,cpuacct,cpu)
> > cgroup on /sys/fs/cgroup/memory type cgroup
> > (rw,nosuid,nodev,noexec,relatime,memory)
> > cgroup on /sys/fs/cgroup/devices type cgroup
> > (rw,nosuid,nodev,noexec,relatime,devices)
> > cgroup on /sys/fs/cgroup/freezer type cgroup
> > (rw,nosuid,nodev,noexec,relatime,freezer)
> > cgroup on /sys/fs/cgroup/net_cls type cgroup
> > (rw,nosuid,nodev,noexec,relatime,net_cls)
> > cgroup on /sys/fs/cgroup/blkio type cgroup
> > (rw,nosuid,nodev,noexec,relatime,blkio)
> > cgroup on /sys/fs/cgroup/perf_event type cgroup
> > (rw,nosuid,nodev,noexec,relatime,perf_event)
> > /dev/sda3 on / type ext4 (rw,relatime,data=ordered)
> > rpc_pipefs on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw,relatime)
> > debugfs on /sys/kernel/debug type debugfs (rw,relatime)
> > sunrpc on /proc/fs/nfsd type nfsd (rw,relatime)
> > hugetlbfs on /dev/hugepages type hugetlbfs (rw,relatime)
> > systemd-1 on /proc/sys/fs/binfmt_misc type autofs
> > (rw,relatime,fd=34,pgrp=1,timeout=300,minproto=5,maxproto=5,direct)
> > mqueue on /dev/mqueue type mqueue (rw,relatime)
> > tmpfs on /tmp type tmpfs (rw)
> > configfs on /sys/kernel/config type configfs (rw,relatime)
> > binfmt_misc on /proc/sys/fs/binfmt_misc type binfmt_misc (rw,relatime)
> > /dev/sda5 on /home type ext4 (rw,relatime,data=ordered)
> > /dev/sda1 on /boot type ext4 (rw,relatime,data=ordered)
> > kernelpanic.home:/home/KP_Data_Domain on
> > /rhev/data-center/mnt/kernelpanic.home:_home_KP__Data__Domain type nfs
> >
> (rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,soft,nosharecache,proto=tcp,timeo=600,retrans=6,sec=sys,mountaddr=10.100.101.100,mountvers=3,mountport=20048,mountproto=udp,local_lock=none,addr=10.100.101.100)
> > bufferoverflow.home:/home/BO_ISO_Domain on
> > /rhev/data-center/mnt/bufferoverflow.home:_home_BO__ISO__Domain type nfs
> >
> (rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,soft,nosharecache,proto=tcp,timeo=600,retrans=6,sec=sys,mountaddr=10.100.101.108,mountvers=3,mountport=20048,mountproto=udp,local_lock=none,addr=10.100.101.108)
> >
> > [wil at bufferoverflow ~]$ sudo find / -name
> > 5849b030-626e-47cb-ad90-3ce782d831b3
> > /run/vdsm/pools/5849b030-626e-47cb-ad90-3ce782d831b3
> >
> > [wil at bufferoverflow ~]$ sudo find / -name
> > 1083422e-a5db-41b6-b667-b9ef1ef244f0
> > /home/BO_Ovirt_Storage/1083422e-a5db-41b6-b667-b9ef1ef244f0
> >
> > I will extremely appreciate any help,
> > Limor Gavish
> > _______________________________________________
> > Users mailing list
> > Users at ovirt.org
> > http://lists.ovirt.org/mailman/listinfo/users
> >
>

