So I looked at the vdsm logs, and since there were multiple tests done it
was difficult to isolate which error to track down. You mentioned a test
between 14:00-14:30 CET - but the gluster logs that were attached ended
at 11:29 UTC.
Tracking down the errors from when the master domain (gluster volume
1HP12-R3A1P1) went inactive, for the time period where the corresponding
gluster volume log was available - they all seem to correspond to an issue
where gluster volume quorum was not met.
Can you confirm whether this was the test performed - or provide logs
from the correct time period (both vdsm and gluster mount logs are required,
from the hypervisors where the master domain is mounted)?
For master domain:
On 1hp1:
vdsm.log
Thread-35::ERROR::2016-03-31 13:21:27,225::monitor::276::Storage.Monitor::(_monitorDomain) Error monitoring domain 14995860-1127-4dc4-b8c8-b540b89f9313
Traceback (most recent call last):
...
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 454, in statvfs
    resdict = self._sendCommand("statvfs", {"path": path}, self.timeout)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 427, in _sendCommand
    raise OSError(errcode, errstr)
OSError: [Errno 107] Transport endpoint is not connected
Thread-35::INFO::2016-03-31 13:21:27,267::monitor::299::Storage.Monitor::(_notifyStatusChanges) Domain 14995860-1127-4dc4-b8c8-b540b89f9313 became INVALID
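As an aside for anyone reading along: the ENOTCONN above comes from a
statvfs() probe of the domain mountpoint failing once the gluster FUSE mount
lost its bricks. A simplified sketch of what such a health probe boils down
to - not vdsm's actual code, and the mount path in the comment is
hypothetical:

# Simplified sketch of a mount health probe (not vdsm's implementation).
import errno
import os

def domain_is_accessible(mountpoint):
    try:
        os.statvfs(mountpoint)   # raises OSError(ENOTCONN) when the gluster
        return True              # FUSE mount has lost its bricks / quorum
    except OSError as e:
        if e.errno in (errno.ENOTCONN, errno.ENOENT):
            return False         # the monitor would mark the domain INVALID
        raise

# Hypothetical mount path for the master domain:
# domain_is_accessible('/rhev/data-center/mnt/glusterSD/1hp1:_1HP12-R3A1P1')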
-- And I see a corresponding entry in the gluster mount log:
[2016-03-31 11:21:16.027090] W [MSGID: 108001] [afr-common.c:4093:afr_notify] 0-1HP12-R3A1P1-replicate-0: Client-quorum is not met
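Note the 2-hour offset when correlating the two logs: vdsm.log uses local
time (CEST) while the glusterfs mount log uses UTC, so vdsm's 13:21:27 entry
lines up with the 11:21:16 quorum warning. A small helper for the correlation
- the +2h offset is an assumption based on the CET/CEST timezone mentioned:

# Shift a glusterfs UTC timestamp to local time (assumed CEST, UTC+2)
# so it can be matched against vdsm.log entries.
from datetime import datetime, timedelta

def gluster_to_local(ts, offset_hours=2):
    return datetime.strptime(ts, '%Y-%m-%d %H:%M:%S.%f') + timedelta(hours=offset_hours)

print(gluster_to_local('2016-03-31 11:21:16.027090'))  # 2016-03-31 13:21:16.027090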
jsonrpc.Executor/0::DEBUG::2016-03-31 13:23:34,110::__init__::533::jsonrpc.JsonRpcServer::(_serveRequest) Return 'GlusterVolume.status' in bridge with
{'volumeStatus': {
   'bricks': [
      {'status': 'OFFLINE', 'hostuuid': 'f6568a3b-3d65-4f4f-be9f-14a5935e37a4', 'pid': '-1', 'rdma_port': 'N/A', 'brick': '1hp1:/STORAGES/P1/GFS', 'port': 'N/A'},
      {'status': 'OFFLINE', 'hostuuid': '8e87cf18-8958-41b7-8d24-7ee420a1ef9f', 'pid': '-1', 'rdma_port': 'N/A', 'brick': '1hp2:/STORAGES/P1/GFS', 'port': 'N/A'}],
   'nfs': [
      {'status': 'OFFLINE', 'hostuuid': 'f6568a3b-3d65-4f4f-be9f-14a5935e37a4', 'hostname': '172.16.5.151/24', 'pid': '-1', 'rdma_port': 'N/A', 'port': 'N/A'},
      {'status': 'OFFLINE', 'hostuuid': '8e87cf18-8958-41b7-8d24-7ee420a1ef9f', 'hostname': '1hp2', 'pid': '-1', 'rdma_port': 'N/A', 'port': 'N/A'}],
   'shd': [
      {'status': 'ONLINE', 'hostname': '172.16.5.151/24', 'pid': '2148', 'hostuuid': 'f6568a3b-3d65-4f4f-be9f-14a5935e37a4'},
      {'status': 'ONLINE', 'hostname': '1hp2', 'pid': '2146', 'hostuuid': '8e87cf18-8958-41b7-8d24-7ee420a1ef9f'}],
   'name': '1HP12-R3A1P1'}}
-- 2 bricks were offline. I think the arbiter brick is not reported in
the xml output - this looks like a bug.
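If it helps, a quick sketch for pulling the offline bricks out of that
volumeStatus payload - just a throwaway check against the dict logged above,
not a vdsm API:

# Sketch: list the bricks reported non-ONLINE in the payload above.
def offline_bricks(volume_status):
    return [b['brick'] for b in volume_status['bricks']
            if b['status'] != 'ONLINE']

# With the payload above this returns both data bricks:
#   ['1hp1:/STORAGES/P1/GFS', '1hp2:/STORAGES/P1/GFS']
# i.e. 2 of 2 reported bricks offline, and the arbiter brick is missing
# from the report entirely (the suspected bug).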
Similarly on 1hp2:
Thread-35::ERROR::2016-03-31 13:21:14,284::monitor::276::Storage.Monitor::(_monitorDomain) Error monitoring domain 14995860-1127-4dc4-b8c8-b540b89f9313
Traceback (most recent call last):
...
    raise OSError(errcode, errstr)
OSError: [Errno 2] No such file or directory
Thread-35::INFO::2016-03-31 13:21:14,285::monitor::299::Storage.Monitor::(_notifyStatusChanges) Domain 14995860-1127-4dc4-b8c8-b540b89f9313 became INVALID
Corresponding gluster mount log:
[2016-03-31 11:21:16.027640] W [MSGID: 108001] [afr-common.c:4093:afr_notify] 0-1HP12-R3A1P1-replicate-0: Client-quorum is not met
On 04/05/2016 07:02 PM, paf1@email.cz wrote:
Hello Sahina,
please see the attached logs which you requested.
regs.
Pavel
On 5.4.2016 14:07, Sahina Bose wrote:
>
>
> On 03/31/2016 06:41 PM, paf1@email.cz wrote:
>> Hi,
>> rest of logs:
>>
>> http://www.uschovna.cz/en/zasilka/HYGXR57CNHM3TP39-L3W
>>
>> The TEST is the last big event in logs ....
>> TEST TIME : about 14:00-14:30 CET
>
> Thank you Pavel for the interesting test report and sharing the logs.
>
> You are right - the master domain should not go down if 2 of 3 bricks
> are available from volume A (1HP12-R3A1P1).
>
> I notice that host kvmarbiter was not responsive at 2016-03-31
> 13:27:19, but the ConnectStorageServerVDSCommand executed on the
> kvmarbiter node returned success at 2016-03-31 13:27:26.
>
> Could you also share the vdsm logs from the 1hp1, 1hp2 and kvmarbiter
> nodes during this time?
>
> Ravi, Krutika - could you take a look at the gluster logs?
>
>>
>> regs.
>> Pavel
>>
>> On 31.3.2016 14:30, Yaniv Kaul wrote:
>>> Hi Pavel,
>>>
>>> Thanks for the report. Can you begin with a more accurate
>>> description of your environment?
>>> Begin with host, oVirt and Gluster versions. Then continue with the
>>> exact setup (what are 'A', 'B', 'C' - domains? Volumes? What is the
>>> mapping between domains and volumes?).
>>>
>>> Are there any logs you can share with us?
>>>
>>> I'm sure with more information, we'd be happy to look at the issue.
>>> Y.
>>>
>>>
>>> On Thu, Mar 31, 2016 at 3:09 PM, paf1@email.cz <paf1@email.cz> wrote:
>>>
>>> Hello,
>>> we tried the following test - with unwanted results
>>>
>>> input:
>>> 5 node gluster
>>> A = replica 3 with arbiter 1 ( node1+node2+arbiter on node 5 )
>>> B = replica 3 with arbiter 1 ( node3+node4+arbiter on node 5 )
>>> C = distributed replica 3 arbiter 1 ( node1+node2,
>>> node3+node4, each arbiter on node 5)
>>> node 5 has only arbiter replica ( 4x )
>>>
>>> TEST:
>>> 1) directly reboot one node - OK ( it does not matter which -
>>> data node or arbiter node )
>>> 2) directly reboot two nodes - OK ( if the nodes are not from the
>>> same replica )
>>> 3) directly reboot three nodes - yes, this is the main problem
>>> and the source of our questions ....
>>> - rebooted all three nodes from replica "B" ( not very
>>> likely, but who knows ... )
>>> - all VMs with data on this replica were paused ( no data
>>> access ) - OK
>>> - all VMs running on replica "B" nodes were lost ( started
>>> manually later ) ( data is on other replicas ) - acceptable
>>> BUT
>>> - !!! all oVirt domains went down !! - the master domain is on
>>> replica "A", which lost only one member of three !!!
>>> so we were not expecting that all domains would go down,
>>> especially the master with 2 live members.
>>>
>>> Results:
>>> - the whole cluster was unreachable until all domains were up -
>>> dependent on all nodes being up !!!
>>> - all paused VMs started back - OK
>>> - the rest of the VMs rebooted and are running - OK
>>>
>>> Questions:
>>> 1) why did all domains go down if the master domain ( on replica "A" )
>>> has two running members ( 2 of 3 ) ??
>>> 2) how to fix that collapse without waiting for all nodes to come up ?
>>> ( in the worst case, e.g. if a node has a HW error ) ??
>>> 3) which oVirt cluster policy can prevent that situation
>>> ?? ( if any )
>>>
>>> regs.
>>> Pavel
>>>
>>>
>>>