Re: [ovirt-users] Fwd: Re: urgent issue

Hi Chris,
Replies inline..
On 09/22/2015 09:31 AM, Sahina Bose wrote:
-------- Forwarded Message -------- Subject: Re: [ovirt-users] urgent issue Date: Wed, 9 Sep 2015 08:31:07 -0700 From: Chris Liebman <chris.l@taboola.com> To: users <users@ovirt.org>
Ok - I think I'm going to switch to local storage - I've had way too many unexplainable issues with glusterfs :-(. Is there any reason I can't add local storage to the existing shared-storage cluster? I see that the menu item is greyed out....
What version of gluster and ovirt are you using?
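A minimal way to gather that on one of the hosts (a sketch only; package names will vary slightly by install):
  # gluster --version | head -1
  # rpm -qa | grep -E 'glusterfs|vdsm|ovirt'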
On Tue, Sep 8, 2015 at 4:19 PM, Chris Liebman <chris.l@taboola.com <mailto:chris.l@taboola.com>> wrote:
It's possible that this is specific to just one gluster volume... I've moved a few VM disks off of that volume and am able to start them fine. My recollection is that any VM started on the "bad" volume causes it to be disconnected and forces the ovirt node to be marked down until Maint->Activate.
On Tue, Sep 8, 2015 at 3:52 PM, Chris Liebman <chris.l@taboola.com> wrote:
In attempting to put an ovirt cluster in production I'm running into some odd errors with gluster, it looks like. It's 12 hosts, each with one brick in distributed-replicate. (Actually 2 bricks, but they are separate volumes.)
These 12 nodes in dist-rep config, are they in replica 2 or replica 3? The latter is what is recommended for VM use-cases. Could you give the output of `gluster volume info` ?
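For reference, the replica count shows up in the "Number of Bricks" line of the volume info (e.g. "6 x 2 = 12" for replica 2 versus "4 x 3 = 12" for replica 3), and any quorum settings appear under "Options Reconfigured". A sketch, using the volume name taken from the logs further down:
  # gluster volume info LADC-TBX-V02
  # gluster volume status LADC-TBX-V02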
[root@ovirt-node268 glusterfs]# rpm -qa | grep vdsm
vdsm-jsonrpc-4.16.20-0.el6.noarch
vdsm-gluster-4.16.20-0.el6.noarch
vdsm-xmlrpc-4.16.20-0.el6.noarch
vdsm-yajsonrpc-4.16.20-0.el6.noarch
vdsm-4.16.20-0.el6.x86_64
vdsm-python-zombiereaper-4.16.20-0.el6.noarch
vdsm-python-4.16.20-0.el6.noarch
vdsm-cli-4.16.20-0.el6.noarch
Everything was fine last week, however, today various clients in the gluster cluster seem to get "client quorum not met" periodically - when they get this they take one of the bricks offline - this causes VMs to attempt to migrate, sometimes 20 at a time. That takes a long time :-(. I've tried disabling automatic migration and the VMs get paused when this happens - resuming gets nothing at that point, as the volume's mount on the server hosting the VM is not connected:
from rhev-data-center-mnt-glusterSD-ovirt-node268.la.taboolasyndication.com:_LADC-TBX-V02.log:
[2015-09-08 21:18:42.920771] W [MSGID: 108001] [afr-common.c:4043:afr_notify] 2-LADC-TBX-V02-replicate-2: Client-quorum is not met
When client-quorum is not met (due to network disconnects, gluster brick processes going down, etc.), gluster makes the volume read-only. This is expected behavior and prevents split-brains. It's probably a bit late, but do you have the gluster fuse mount logs to confirm this was indeed the issue?
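If the logs are still available, grepping the fuse mount log should show whether the quorum loss preceded the unmount. A sketch, assuming the default /var/log/glusterfs/ location for the mount log named above:
  # grep -E 'Client-quorum|read-only|signum' /var/log/glusterfs/rhev-data-center-mnt-glusterSD-ovirt-node268.la.taboolasyndication.com:_LADC-TBX-V02.log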
[2015-09-08 21:18:42.931751] I [fuse-bridge.c:4900:fuse_thread_proc] 0-fuse: unmounting /rhev/data-center/mnt/glusterSD/ovirt-node268.la.taboolasyndication.com:_LADC-TBX-V02
[2015-09-08 21:18:42.931836] W [glusterfsd.c:1219:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x7a51) [0x7f1bebc84a51] -->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xcd) [0x405e4d] -->/usr/sbin/glusterfs(cleanup_and_exit+0x65) [0x4059b5] ) 0-: received signum (15), shutting down
[2015-09-08 21:18:42.931858] I [fuse-bridge.c:5595:fini] 0-fuse: Unmounting '/rhev/data-center/mnt/glusterSD/ovirt-node268.la.taboolasyndication.com:_LADC-TBX-V02'.
The VM pause you saw could be because of the unmount. I understand that a fix (https://gerrit.ovirt.org/#/c/40240/) went in for oVirt 3.6 (vdsm-4.17) to prevent vdsm from unmounting the gluster volume when vdsm exits/restarts. Is it possible to run a test setup on 3.6 and see if this is still happening?
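A rough way to check for that behavior on a given vdsm version (a sketch only, assuming an EL6 host where vdsm runs as the vdsmd service):
  # mount | grep glusterSD        # note the gluster mounts
  # service vdsmd restart
  # mount | grep glusterSD        # on affected versions the mount is gone or stale after the restart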
And the mount is broken at that point:
[root@ovirt-node267 ~]# df
*df: `/rhev/data-center/mnt/glusterSD/ovirt-node268.la.taboolasyndication.com:_LADC-TBX-V02': Transport endpoint is not connected*
Yes, because it received a SIGTERM above.
Thanks,
Ravi
Filesystem           1K-blocks      Used  Available Use% Mounted on
/dev/sda3             51475068   1968452   46885176   5% /
tmpfs                132210244         0  132210244   0% /dev/shm
/dev/sda2               487652     32409     429643   8% /boot
/dev/sda1               204580       260     204320   1% /boot/efi
/dev/sda5           1849960960 156714056 1599267616   9% /data1
/dev/sdb1           1902274676  18714468 1786923588   2% /data2
ovirt-node268.la.taboolasyndication.com:/LADC-TBX-V01
                    9249804800 727008640 8052899712   9% /rhev/data-center/mnt/glusterSD/ovirt-node268.la.taboolasyndication.com:_LADC-TBX-V01
ovirt-node251.la.taboolasyndication.com:/LADC-TBX-V03
                    1849960960     73728 1755907968   1% /rhev/data-center/mnt/glusterSD/ovirt-node251.la.taboolasyndication.com:_LADC-TBX-V03
The fix for that is to put the server in maintenance mode and then activate it again. But all VMs need to be migrated or stopped for that to work.
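If going through Maintenance -> Activate is not an option, a stale "Transport endpoint is not connected" mount like the one above can usually be cleared by hand. A sketch, assuming vdsm is not actively managing the mount at that moment (the Maintenance -> Activate route remains the supported path):
  # umount -l /rhev/data-center/mnt/glusterSD/ovirt-node268.la.taboolasyndication.com:_LADC-TBX-V02
  # mount -t glusterfs ovirt-node268.la.taboolasyndication.com:/LADC-TBX-V02 /rhev/data-center/mnt/glusterSD/ovirt-node268.la.taboolasyndication.com:_LADC-TBX-V02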
I'm not seeing any obvious network or disk errors...
Are there configuration options I'm missing?

Sorry - it's too late - all hosts have been re-imaged and are set up as local storage.
On Mon, Sep 21, 2015 at 10:38 PM, Ravishankar N <ravishankar@redhat.com> wrote:
participants (2)
- Chris Liebman
- Ravishankar N