Re: [Users] hosted engine issues

----- Original Message -----
From: "René Koch" <rkoch@linuxland.at>
To: "Yedidyah Bar David" <didi@redhat.com>, "Martin Sivak" <msivak@redhat.com>
Cc: users@ovirt.org
Sent: Monday, March 3, 2014 4:10:51 PM
Subject: Re: [Users] hosted engine issues
On 03/03/2014 02:13 PM, Yedidyah Bar David wrote:
Me neither. Is everything Read-Write? Read-Only FS might report no space left as well in some cases. Other than that, I do not know.
Perhaps some ipc resource? semaphores?
Please check:
ipcs
cat /proc/sys/kernel/sem
I know nothing about libvirt, that's just a wild guess.
# ipcs
------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status
0x00000000 0          root       644        80         2
0x00000000 32769      root       644        16384      2
0x00000000 65538      root       644        280        2

------ Semaphore Arrays --------
key        semid      owner      perms      nsems
0x00000000 0          root       600        1
0x00000000 65537      root       600        1
0x000000a7 163842     root       600        1
This means you have 3 semaphore sets, of one semaphore each.
------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages
Also the rest is moderate usage.
# cat /proc/sys/kernel/sem
250 32000 32 128
So you are far from the maxima (250 per set, 32000 total, 128 sets).
Do you see anything in this output? I have no clue how to interpret this...
See e.g. http://man7.org/linux/man-pages/man5/proc.5.html
Is the above on a node? engine? both nodes are similar? If so, that's not the reason for the "no space left on device".
If this error is reproducible, you can try to find the process that this happens to (perhaps libvirtd, vdsmd, or the hosted-engine ha daemon) and do:
strace -f -o /tmp/trace1 -tt -s 512 -p PID
where PID is the pid of that process, then search /tmp/trace1 for 'no space left on device' and see the exact call that failed.
--
Didi
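For reference, the four numbers in /proc/sys/kernel/sem are, in order, SEMMSL, SEMMNS, SEMOPM and SEMMNI (see proc(5)). The following is only a minimal shell sketch that labels them; describe_sem_limits is a made-up helper name, and the sample value is the one shown in the output above:

```shell
# describe_sem_limits is a hypothetical helper, not an existing tool.
# $1: one line in the format of /proc/sys/kernel/sem, e.g. "250 32000 32 128"
describe_sem_limits() {
    echo "$1" | {
        # Field order as documented in proc(5): SEMMSL SEMMNS SEMOPM SEMMNI
        read -r semmsl semmns semopm semmni
        printf 'SEMMSL=%s (max semaphores per set)\n' "$semmsl"
        printf 'SEMMNS=%s (max semaphores system-wide)\n' "$semmns"
        printf 'SEMOPM=%s (max operations per semop call)\n' "$semopm"
        printf 'SEMMNI=%s (max semaphore sets)\n' "$semmni"
    }
}

# Sample value taken from the output quoted in this thread.
describe_sem_limits "250 32000 32 128"
```

On a live host the argument could instead be "$(cat /proc/sys/kernel/sem)". Comparing the result with the three one-semaphore sets listed by ipcs confirms that the limits are nowhere near exhausted.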

On 03/03/2014 03:25 PM, Yedidyah Bar David wrote:
See e.g. http://man7.org/linux/man-pages/man5/proc.5.html
Is the above on a node? engine? both nodes are similar? If so, that's not the reason for the "no space left on device".
Same on both hosts. These are CentOS 6.5 hosts which are the base for hosted engine.
If this error is reproducible, you can try to find the process that this happens to (perhaps libvirtd, vdsmd, or the hosted-engine ha daemon) and do:
strace -f -o /tmp/trace1 -tt -s 512 -p PID
where PID is the pid of that process, then search /tmp/trace1 for 'no space left on device' and see the exact call that failed.
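The search step could be sketched in shell roughly as follows. find_enospc_lines is a hypothetical helper name. Besides the message text (matched case-insensitively), it also matches ENOSPC, the errno name strace prints when a system call itself fails with this error. In the sample data the first line is a made-up placeholder, and the second is the libvirtd line reported later in this thread:

```shell
# find_enospc_lines is a hypothetical helper, not part of strace.
# $1: path to a trace file written by "strace -o"
find_enospc_lines() {
    # -n prints line numbers, -i ignores case, -E enables alternation
    grep -n -i -E 'ENOSPC|no space left on device' "$1"
}

# Small self-contained demo on a temporary file instead of /tmp/trace1.
trace_sample="$(mktemp)"
cat > "$trace_sample" <<'EOF'
3296 17:10:05.396000 placeholder line without any error (made up for this demo)
3296 17:10:05.396192 write(4, "2014-03-03 16:10:05.396+0000: 3296: error : virLockManagerSanlockAcquire:974 : Failed to acquire lock: No space left on device\n", 127 <unfinished ...>
EOF
find_enospc_lines "$trace_sample"
rm -f "$trace_sample"
```

Against the real trace the call would be find_enospc_lines /tmp/trace1. A match inside a write() to a log file, as in the sample, only shows where the message was logged; the lines just before it in the trace are the place to look for the call that actually failed.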
Thanks a lot for the troubleshooting tips. I figured out the following:
strace of libvirtd:
3296 17:10:05.396192 write(4, "2014-03-03 16:10:05.396+0000: 3296: error : virLockManagerSanlockAcquire:974 : Failed to acquire lock: No space left on device\n", 127 <unfinished ...>
Then I checked sanlock.log, where I found the following error message (which could be the reason for "No space left on device"):
2014-03-03 17:10:05+0100 25094 [3105]: r6 cmd_acquire 2,9,11852 invalid lockspace found -1 failed 0 name 2851af27-8744-445d-9fb1-a0d083c8dc82
So my question now is whether I can remove the lockspace file (it should be hosted-engine.lockspace, located in /rhev/data-center/mnt/ovirt-host01\:_engine/2851af27-8744-445d-9fb1-a0d083c8dc82/ha_agent/, right?) and whether it will be created again. I fear the GlusterFS split-brain situation destroyed it, as this file was affected.
Thanks,
René
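To see which lockspace names sanlock rejected, the log can be filtered along these lines. This is only a sketch: invalid_lockspaces is a hypothetical helper name, and it assumes that every affected log line has the same shape as the one quoted above, with the lockspace name following the word "name":

```shell
# invalid_lockspaces is a hypothetical helper, not a sanlock command.
# $1: path to a sanlock log file
invalid_lockspaces() {
    # For each "invalid lockspace" line, print the field that follows "name".
    awk '/invalid lockspace/ {
        for (i = 1; i < NF; i++) if ($i == "name") print $(i + 1)
    }' "$1" | sort -u
}

# Demo on a temporary file holding the log line quoted in this thread.
log_sample="$(mktemp)"
cat > "$log_sample" <<'EOF'
2014-03-03 17:10:05+0100 25094 [3105]: r6 cmd_acquire 2,9,11852 invalid lockspace found -1 failed 0 name 2851af27-8744-445d-9fb1-a0d083c8dc82
EOF
invalid_lockspaces "$log_sample"
rm -f "$log_sample"
```

On a host, the argument would be the path of the real sanlock.log (commonly /var/log/sanlock.log, which is an assumption here). The sort -u only collapses repeated failures for the same lockspace into one line.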
participants (2)
- René Koch
- Yedidyah Bar David