On 03/03/2014 03:25 PM, Yedidyah Bar David wrote:
----- Original Message -----
> From: "René Koch" <rkoch(a)linuxland.at>
> To: "Yedidyah Bar David" <didi(a)redhat.com>, "Martin Sivak" <msivak(a)redhat.com>
> Cc: users(a)ovirt.org
> Sent: Monday, March 3, 2014 4:10:51 PM
> Subject: Re: [Users] hosted engine issues
> On 03/03/2014 02:13 PM, Yedidyah Bar David wrote:
>>> Me neither. Is everything Read-Write? Read-Only FS might report no space
>>> left as well in some cases. Other than that, I do not know.
>
>> Perhaps some ipc resource? semaphores?
>
>> Please check:
>
>> ipcs
>
>> cat /proc/sys/kernel/sem
>
>> I know nothing about libvirt, that's just a wild guess.
> # ipcs
> ------ Shared Memory Segments --------
> key shmid owner perms bytes nattch status
> 0x00000000 0 root 644 80 2
> 0x00000000 32769 root 644 16384 2
> 0x00000000 65538 root 644 280 2
> ------ Semaphore Arrays --------
> key semid owner perms nsems
> 0x00000000 0 root 600 1
> 0x00000000 65537 root 600 1
> 0x000000a7 163842 root 600 1
This means you have 3 semaphore sets, of one semaphore each.
> ------ Message Queues --------
> key msqid owner perms used-bytes messages
Also the rest is moderate usage.
> # cat /proc/sys/kernel/sem
> 250 32000 32 128
So you are far from the maxima (250 per set, 32000 total, 128 sets).
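For reference, the four fields of /proc/sys/kernel/sem are, in order, SEMMSL (maximum semaphores per set), SEMMNS (maximum semaphores system-wide), SEMOPM (maximum operations per semop call) and SEMMNI (maximum number of semaphore sets). A minimal sketch that labels them follows; the function name describe_sem_limits is made up for illustration and is not an existing tool.

```shell
# Label the four fields of /proc/sys/kernel/sem.
# describe_sem_limits is a hypothetical helper name, not an existing tool.
describe_sem_limits() {
    # Word-split the single argument into four positional fields
    # (the kernel separates them with tabs; default IFS handles that).
    set -- $1
    echo "SEMMSL (max semaphores per set): $1"
    echo "SEMMNS (max semaphores system-wide): $2"
    echo "SEMOPM (max operations per semop call): $3"
    echo "SEMMNI (max semaphore sets): $4"
}

# On a live host: describe_sem_limits "$(cat /proc/sys/kernel/sem)"
describe_sem_limits "250 32000 32 128"
```

Comparing this with the ipcs output above: 3 sets out of 128, and 1 semaphore per set out of 250.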
> Do you see anything in this output?
> I have no clue how to interpret this...
See e.g. http://man7.org/linux/man-pages/man5/proc.5.html
Is the above from a node or from the engine? Are both nodes similar? If so,
that's not the reason for the "no space left on device".
Same on both hosts.
These are CentOS 6.5 hosts which are the base for hosted engine.
If this error is reproducible, you can try to find the process that this
happens to (perhaps libvirtd, vdsmd, or the hosted-engine ha daemon) and do:
strace -f -o /tmp/trace1 -tt -s 512 -p PID
where PID is the pid of that process, then search /tmp/trace1 for 'no space
left on device' and see the exact call that failed.
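The search step can be sketched as follows. strace prints a failed call with the errno name and its message, for example "= -1 ENOSPC (No space left on device)", so a case-insensitive grep catches both the failing call and any log line the process writes about it. The function name find_enospc is made up for illustration; /tmp/trace1 is the output file from the strace command above.

```shell
# Print matching lines (prefixed with their line numbers) from an
# strace output file. find_enospc is a hypothetical helper name.
find_enospc() {
    grep -n -i -e 'ENOSPC' -e 'no space left on device' "$1"
}

# Usage, once strace has written the trace file:
#   find_enospc /tmp/trace1
```

Because the trace was taken with -f, each line starts with the pid of the thread or child that made the call, which helps when reading the lines just before a match.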
Thanks a lot for the troubleshooting tips.
I figured the following out:
strace of libvirtd:
3296 17:10:05.396192 write(4, "2014-03-03 16:10:05.396+0000: 3296:
error : virLockManagerSanlockAcquire:974 : Failed to acquire lock: No
space left on device\n", 127 <unfinished ...>
Then I checked sanlock.log where I found the following error message
(which could be the reason for No space left on device):
2014-03-03 17:10:05+0100 25094 [3105]: r6 cmd_acquire 2,9,11852 invalid
lockspace found -1 failed 0 name 2851af27-8744-445d-9fb1-a0d083c8dc82
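To see which lockspaces sanlock complains about, the names can be pulled out of the log. This is only a sketch: it assumes the message format shown above (in the log file itself the message is a single line ending in "name <lockspace>"), and the function name invalid_lockspaces is made up for illustration.

```shell
# Print the distinct lockspace names from "invalid lockspace" lines.
# invalid_lockspaces is a hypothetical helper name; the pattern assumes
# the log format quoted in this mail.
invalid_lockspaces() {
    grep 'invalid lockspace' "$1" \
        | sed -n 's/.* name \([^ ]*\).*/\1/p' \
        | sort -u
}

# Usage: invalid_lockspaces /var/log/sanlock.log
```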
So my question now is whether I can remove the lockspace file (it should be
hosted-engine.lockspace located in
/rhev/data-center/mnt/ovirt-host01\:_engine/2851af27-8744-445d-9fb1-a0d083c8dc82/ha_agent/,
right?) and whether it will be created again. I fear the GlusterFS
split-brain situation destroyed it, as this file was affected.
Thanks,
René