On Thu, Feb 21, 2019 at 11:11 PM Jason P. Thomas
<jthomasp(a)gmualumni.org> wrote:
> On 2/20/19 5:33 PM, Darrell Budic wrote:
>
> I was just helping Tristam on #ovirt with a similar problem. We found that his two
upgraded nodes were running multiple glusterfsd processes per brick (but not for all bricks).
His volume and brick files in /var/lib/glusterd looked normal, but starting glusterd
would often spawn extra glusterfsd processes per brick, seemingly at random. Gluster bug?
Maybe related to
https://secure-web.cisco.com/1zutsPlj0TjKvDiGvxmw5PZZPUoEtpkcJqhpWGvx2-fJ...,
but I’m helping debug this one second hand… Possibly related to the brick crashes? We
wound up stopping glusterd, killing off all the glusterfsd processes, restarting glusterd,
and repeating until it only spawned one glusterfsd per brick. We did that on each updated
server, then restarted glusterd on the not-yet-updated server to get it talking to the
right bricks. That seemed to get to a mostly stable gluster environment, but he’s still
seeing 1-2 files listed as needing healing on the upgraded bricks (but not the 3.12 brick),
mainly the DIRECT_IO_TEST and one of the dom/ids files, but he can probably update that.
We did manage to get his engine going again, and are waiting to see if he’s stable now.
>
> Anyway, figured it was worth posting about so people could check for multiple brick
processes (glusterfsd) if they hit this stability issue as well, maybe find common
ground.
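
For anyone who wants to run that check, here is a rough sketch in Python. It assumes each glusterfsd command line carries a `--brick-name <path>` argument, and the function name `duplicate_brick_processes` is just made up for illustration. Treat it as a starting point rather than a supported tool.

```python
import re
from collections import defaultdict

# Assumption: every glusterfsd command line includes "--brick-name <path>".
BRICK_ARG = re.compile(r"--brick-name[ =](\S+)")


def duplicate_brick_processes(ps_lines):
    """Return {brick path: [pids]} for bricks served by more than one process.

    ps_lines: iterable of lines shaped like "<pid> <command line>".
    """
    pids_by_brick = defaultdict(list)
    for line in ps_lines:
        parts = line.split(None, 1)
        # Skip anything that is not "<numeric pid> <args>".
        if len(parts) != 2 or not parts[0].isdigit():
            continue
        match = BRICK_ARG.search(parts[1])
        if match:
            pids_by_brick[match.group(1)].append(int(parts[0]))
    # Keep only bricks that have more than one process attached.
    return {brick: pids for brick, pids in pids_by_brick.items() if len(pids) > 1}
```

Feeding it the lines printed by `ps -C glusterfsd -o pid=,args=` on each node should list any brick that has more than one process. An empty result means one process per brick at most.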
>
> Note: also encountered
https://secure-web.cisco.com/1eMDo5MMs_aJ6TOHBckwoG7rBeXsvPoHI01UBZ8YJ2Gn...
while trying to get his engine back up; restarting libvirtd let us get it going again.
Maybe unnecessary if he’d been able to complete the upgrade of his third node, but he got
stuck before then, so...
>
> -Darrell
>
> Stable is a relative term. The unsynced entries total for each of my 4 volumes
changes drastically (except for the engine volume, which pretty much bounces
between 1 and 4). The cluster has been "healing" for 18 hours or so, and only
the unupgraded HC node has healed bricks. I also had the problem that some
files/directories were owned by root:root; the affected VMs did not boot until I changed
ownership to 36:36. Even after 18 hours, there are anywhere from 20 to 386 entries in vol
heal info for my 3 non-engine bricks. Overnight, one brick on one volume went down on
one HC node. When I bounced glusterd, it brought up a new glusterfsd process for that
brick. I killed the old one, and now vol status reports the right pid on each of the nodes.
This is quite the debacle. If I can provide any info that might help get it moving in
the right direction, let me know.
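
The root:root ownership problem mentioned above can be spotted with a small Python sketch like the one below. It only reports paths and changes nothing. The helper name `wrongly_owned` is hypothetical, and the default of 36:36 is taken from the message above; the directory to scan is whatever your storage domain mount happens to be.

```python
import os

# From the message above: files had to be owned by 36:36 before VMs would boot.
EXPECTED_UID = 36
EXPECTED_GID = 36


def wrongly_owned(root, uid=EXPECTED_UID, gid=EXPECTED_GID):
    """Yield every path under root whose owner or group is not uid:gid."""
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)  # lstat so symlinks are checked, not followed
            if st.st_uid != uid or st.st_gid != gid:
                yield path
```

Printing the results of `wrongly_owned(<mount point>)` gives a list to review before running any chown, which seems safer than changing ownership blindly across a volume that is still healing.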
Can you provide the gluster brick logs and glusterd logs from the
servers (from /var/log/glusterfs/)? Since you mention that heal seems
to be stuck, could you also provide the heal logs from
/var/log/glusterfs/glustershd.log?
If you can log a bug with these logs, that would be great - please use
https://secure-web.cisco.com/1u1JXUFgLqfmecw2r1a0ZiPWiR0AlgZ1-8A3Ax1bpyCy...
to log the bug.
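
One possible way to gather the requested logs for attaching to a bug is sketched below in Python. The function name `bundle_recent_logs` and the 72-hour default window are assumptions chosen for illustration, not anything the bug tracker requires; adjust the window to cover the time frame of the problem.

```python
import os
import tarfile
import time


def bundle_recent_logs(log_dir, out_path, max_age_hours=72):
    """Write a .tar.gz of files under log_dir modified in the last max_age_hours.

    Returns the list of files that were added to the archive.
    """
    cutoff = time.time() - max_age_hours * 3600
    out_abs = os.path.abspath(out_path)
    added = []
    with tarfile.open(out_path, "w:gz") as tar:
        for dirpath, _dirnames, filenames in os.walk(log_dir):
            for name in sorted(filenames):
                path = os.path.join(dirpath, name)
                # Never add the archive to itself if it lives under log_dir.
                if os.path.abspath(path) == out_abs:
                    continue
                if os.path.getmtime(path) >= cutoff:
                    tar.add(path, arcname=os.path.relpath(path, log_dir))
                    added.append(path)
    return added
```

Run on each server with `log_dir` set to `/var/log/glusterfs/`, this picks up the glusterd log, the brick logs and glustershd.log together, provided they were written within the chosen window.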
with the requested
logs during the time frame I experienced issues. Sorry for the delay, I
was out of the office Friday and this morning.
Jason
> Jason aka Tristam
>
>
> On Feb 14, 2019, at 1:12 AM, Sahina Bose <sabose(a)redhat.com> wrote:
>
> On Thu, Feb 14, 2019 at 2:39 AM Ron Jerome <ronjero(a)gmail.com> wrote:
>
>
>
>
> Can you be more specific? What things did you see, and did you report bugs?
>
>
> I've got this one:
https://secure-web.cisco.com/13CTqpVLySHJsr0ExmooX7-Akp0iAcz1X7qikVLAWiuw...
> and this one:
https://secure-web.cisco.com/1zutsPlj0TjKvDiGvxmw5PZZPUoEtpkcJqhpWGvx2-fJ...
> and I've got bricks randomly going offline and getting out of sync with the
others, at which point I've had to manually stop and start the volume to get things
back in sync.
>
>
> Thanks for reporting these. Will follow up on the bugs to ensure
> they're addressed.
> Regarding bricks going offline - are the brick processes crashing? Can
> you provide logs of glusterd and bricks? Or is this to do with
> ovirt-engine and brick status not being in sync?
>
> _______________________________________________
> Users mailing list -- users(a)ovirt.org
> To unsubscribe send an email to users-leave(a)ovirt.org
> Privacy Statement:
https://secure-web.cisco.com/1nEZk1HmmOVHI5jAXqISs80YRB-SbMsrunxvZ9XJpVCV...
> oVirt Code of Conduct:
https://secure-web.cisco.com/12Y4LEt9YRdw8K3OrOj6uPi27aZWSQ67hHP7OnuT4WeY...
> List Archives:
https://secure-web.cisco.com/1ZrjEv46RcxAUmSm9tzKCuYdZL9varvvxSuqJiTCcGl2...
>