ovirt+gluster+NFS : storage hiccups

Hi,

I used the two links below to set up a test DC:
http://community.redhat.com/blog/2014/05/ovirt-3-4-glusterized/
http://community.redhat.com/blog/2014/11/up-and-running-with-ovirt-3-5-part-...

The only thing I did differently is that I did not use a hosted engine; I dedicated a solid server for that. So I have one engine (CentOS 6.6) and 3 hosts (CentOS 7.0).

As in the docs above, my 3 hosts are publishing 300 GB of replicated gluster storage, above which ctdb is managing a floating virtual IP that is used by NFS as the master storage domain. The last point is that the manager is also presenting an NFS storage I'm using as an export domain.

It took me some time to plug all this together, as it is a bit more complicated than my other DC with a real SAN and no gluster, but it is eventually working (I can run VMs, migrate them...).

I have made many severe tests (from a very dumb user point of view: unplug/replug the power cable of a server - does ctdb float the vIP? does gluster self-heal? does the VM restart?...). When looking precisely at each layer one by one, all seems to be correct: ctdb is fast at managing the IP, NFS is OK, gluster seems to reconstruct, fencing eventually worked with the lanplus workaround, and so on...

But from time to time, a severe hiccup appears which I have great difficulty diagnosing. The messages in the web GUI are not very precise, and not consistent:
- Some tell about a host having network issues, but I can ping it from every place it needs to be reached (especially from the SPM and the manager): "On host serv-vm-al01, Error: Network error during communication with the Host"
- Some tell that a volume is degraded, when it's not (gluster commands are showing no issue; even the oVirt tab about the volumes is all green).
- "Host serv-vm-al03 cannot access the Storage Domain(s) <UNKNOWN> attached to the Data Center": just waiting a couple of seconds leads to a self-heal with no action.
- Repeated "Detected change in status of brick serv-vm-al03:/gluster/data/brick of volume data from DOWN to UP.", but absolutely no action is made on this filesystem.

At this time, zero VMs are running in this test datacenter, and no action is being made on the hosts. Still, I see some looping errors coming and going, and I find no way to diagnose them.

Amongst the *actions* that I had the idea to use to solve some issues:
- I've found that trying to force the self-healing and playing with gluster commands had no effect.
- I've found that playing with the gluster-advised actions ("find /gluster -exec stat {} \; ...") seems to have no effect either.
- I've found that forcing ctdb to move the vIP ("ctdb stop, ctdb continue") DID SOLVE most of these issues. I believe that it's not what ctdb is doing that helps, but maybe one of its shell hooks is cleaning up some trouble?

As this setup is complex, I'm not asking anyone for a silver bullet, but maybe you know which layer is the most fragile, and which one I should look at more closely?

--
Nicolas ECARNOT
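Since the symptoms span several layers (ctdb, gluster, NFS), a quick per-layer check run on each host helps show which one degrades first. A minimal sketch, where the volume name "data" and the floating vIP 10.0.0.100 are placeholders for this illustration:

  #!/bin/bash
  # Minimal per-layer health check for the ctdb + gluster + NFS stack.
  # VIP and VOLUME are placeholders; adjust to the real setup.
  VIP=10.0.0.100
  VOLUME=data

  echo "== ctdb =="
  ctdb status                          # all nodes should be OK (none DISCONNECTED/BANNED/STOPPED)
  ctdb ip                              # which node currently holds the floating vIP

  echo "== gluster =="
  gluster volume status "$VOLUME"      # bricks and self-heal daemons online?
  gluster volume heal "$VOLUME" info   # pending self-heal entries per brick

  echo "== NFS through the vIP =="
  showmount -e "$VIP"                  # export list reachable via the floating IP
  rpcinfo -p "$VIP" | grep -w nfs      # NFS service registered behind the vIP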

----- Original Message -----
From: "Nicolas Ecarnot" <nicolas@ecarnot.net> To: "users@ovirt.org" <Users@ovirt.org> Sent: Wednesday, August 5, 2015 5:32:38 PM Subject: [ovirt-users] ovirt+gluster+NFS : storage hicups
Hi,
I used the two links below to setup a test DC :
http://community.redhat.com/blog/2014/05/ovirt-3-4-glusterized/ http://community.redhat.com/blog/2014/11/up-and-running-with-ovirt-3-5-part-...
The only thing I did differently is that I did not use a hosted engine; I dedicated a solid server for that. So I have one engine (CentOS 6.6) and 3 hosts (CentOS 7.0).
As in the docs above, my 3 hosts are publishing 300 GB of replicated gluster storage, above which ctdb is managing a floating virtual IP that is used by NFS as the master storage domain.
The last point is that the manager is also presenting a NFS storage I'm using as an export domain.
It took me some time to plug all this together, as it is a bit more complicated than my other DC with a real SAN and no gluster, but it is eventually working (I can run VMs, migrate them...).
I have made many severe tests (from a very dumb user point of view: unplug/replug the power cable of a server - does ctdb float the vIP? does gluster self-heal? does the VM restart?...). When looking precisely at each layer one by one, all seems to be correct: ctdb is fast at managing the IP, NFS is OK, gluster seems to reconstruct, fencing eventually worked with the lanplus workaround, and so on...
But from time to time, a severe hiccup appears which I have great difficulty diagnosing. The messages in the web GUI are not very precise, and not consistent: - some tell about a host having network issues, but I can ping it from every place it needs to be reached (especially from the SPM and the manager)

Ping doesn't say much, as the ssh protocol is the one being used. Please try this and report. Please attach logs (engine+vdsm). Log snippets would be helpful (but more important are full logs).
In general it smells like an ssh/firewall issue.
"On host serv-vm-al01, Error: Network error during communication with the Host"
- some tell that some volume is degraded, when it's not (gluster commands are showing no issue; even the oVirt tab about the volumes is all green)
- "Host serv-vm-al03 cannot access the Storage Domain(s) <UNKNOWN> attached to the Data Center" Just by waiting a couple of seconds lead to a self heal with no action.
- Repeated "Detected change in status of brick serv-vm-al03:/gluster/data/brick of volume data from DOWN to UP." but absolutely no action is made on this filesystem.
At this time, zero VMs are running in this test datacenter, and no action is being made on the hosts. Still, I see some looping errors coming and going, and I find no way to diagnose them.
Amongst the *actions* that I had the idea to use to solve some issues: - I've found that trying to force the self-healing and playing with gluster commands had no effect - I've found that playing with the gluster-advised actions ("find /gluster -exec stat {} \; ...") seems to have no effect either - I've found that forcing ctdb to move the vIP ("ctdb stop, ctdb continue") DID SOLVE most of these issues. I believe that it's not what ctdb is doing that helps, but maybe one of its shell hooks is cleaning up some trouble?
As this setup is complex, I'm not asking anyone for a silver bullet, but maybe you know which layer is the most fragile, and which one I should look at more closely?
--
Nicolas ECARNOT

Hi Vered,

Thanks for answering.

On 06/08/2015 11:08, Vered Volansky wrote:
But from time to time, a severe hiccup appears which I have great difficulty diagnosing. The messages in the web GUI are not very precise, and not consistent: - some tell about a host having network issues, but I can ping it from every place it needs to be reached (especially from the SPM and the manager)

Ping doesn't say much, as the ssh protocol is the one being used. Please try this and report.
Try what?
Please attach logs (engine+vdsm). Log snippets would be helpful (but more important are full logs).
I guess that what will be most useful is to provide logs at or around the precise moment s**t is hitting the fan. But this is very difficult to forecast: there are times when I'm trying hard to break it (see the dumb user tests previously described) and oVirt is coping well with these situations. And at the opposite, there are times when zero VMs are running and I see the DC appearing as non-operational for some minutes. So I'll send logs the next time I see such a situation.
In general it smells like an ssh/firewall issue.
On this test setup, I disabled the firewall on my hosts. And you're right, it appears I forgot to disable it on one of the three hosts. On the one I forgot, a brief look at the iptables rules showed they conformed to what I'm used to seeing as managed by oVirt, nothing weird. Anyway, it is now completely disabled.
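For the record, a quick way to double-check on each host that no firewall is left interfering (a sketch assuming CentOS 7 hosts, which may run firewalld or plain iptables):

  systemctl status firewalld   # should report inactive/disabled if the firewall is meant to be off
  iptables -S                  # dump whatever rules are still loaded (oVirt/vdsm managed or leftovers)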
"On host serv-vm-al01, Error: Network error during communication with the Host"
This host had no firewall activated... -- Nicolas ECARNOT

On Thu, Aug 6, 2015 at 3:24 PM, Nicolas Ecarnot <nicolas@ecarnot.net> wrote:
Hi Vered,
Thanks for answering.
On 06/08/2015 11:08, Vered Volansky wrote:
But from time to time, a severe hiccup appears which I have great difficulty diagnosing. The messages in the web GUI are not very precise, and not consistent: - some tell about a host having network issues, but I can ping it from every place it needs to be reached (especially from the SPM and the manager)
Ping doesn't say much as the ssh protocol is the one being used. Please try this and report.
Try what?
ssh instead of ping.
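Concretely, that means testing the channel the engine actually uses rather than ICMP. A minimal sketch, run from the engine machine against the host reported as unreachable (serv-vm-al01 here, taken from the error message above):

  ssh root@serv-vm-al01 'echo ssh ok'   # engine-to-host ssh, as used for host deployment and soft fencing
  nc -zv serv-vm-al01 54321             # the vdsm port the engine talks to for monitoring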
Please attach logs (engine+vdsm). Log snippets would be helpful (but more important are full logs).
I guess that what will be most useful is to provide logs at or around the precise moment s**t is hitting the fan. But this is very difficult to forecast: there are times when I'm trying hard to break it (see the dumb user tests previously described) and oVirt is coping well with these situations. And at the opposite, there are times when zero VMs are running and I see the DC appearing as non-operational for some minutes. So I'll send logs the next time I see such a situation.
You can send logs and just point us to the time your problems occurred. They are rotated, so unless you removed them they should be available to you at any time. Just make sure they have the time in question and we'll dig in.
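A minimal sketch of collecting those rotated logs (the paths below are the usual defaults on the engine and on the hosts; adjust if your installation differs):

  # on the engine machine:
  tar czf engine-logs-$(date +%F).tar.gz /var/log/ovirt-engine/engine.log*
  # on each host:
  tar czf vdsm-logs-$(hostname -s)-$(date +%F).tar.gz /var/log/vdsm/vdsm.log*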
In general it smells like an ssh/firewall issue.
On this test setup, I disabled the firewall on my hosts. And you're right, it appears I forgot to disable it on one of the three hosts. On the one I forgot, a brief look at the iptables rules showed they conformed to what I'm used to seeing as managed by oVirt, nothing weird. Anyway, it is now completely disabled.
Good :)
"On host serv-vm-al01, Error: Network error during communication with
the Host"
This host had no firewall activated...
-- Nicolas ECARNOT

On 08/06/2015 02:38 PM, Vered Volansky wrote:
----- Original Message -----
From: "Nicolas Ecarnot" <nicolas@ecarnot.net> To: "users@ovirt.org" <Users@ovirt.org> Sent: Wednesday, August 5, 2015 5:32:38 PM Subject: [ovirt-users] ovirt+gluster+NFS : storage hicups
Hi,
I used the two links below to setup a test DC :
http://community.redhat.com/blog/2014/05/ovirt-3-4-glusterized/ http://community.redhat.com/blog/2014/11/up-and-running-with-ovirt-3-5-part-...
The only thing I did differently is that I did not use a hosted engine; I dedicated a solid server for that. So I have one engine (CentOS 6.6) and 3 hosts (CentOS 7.0).
As in the docs above, my 3 hosts are publishing 300 GB of replicated gluster storage, above which ctdb is managing a floating virtual IP that is used by NFS as the master storage domain.
The last point is that the manager is also presenting a NFS storage I'm using as an export domain.
It took me some time to plug all this together, as it is a bit more complicated than my other DC with a real SAN and no gluster, but it is eventually working (I can run VMs, migrate them...).
I have made many severe tests (from a very dumb user point of view: unplug/replug the power cable of a server - does ctdb float the vIP? does gluster self-heal? does the VM restart?...). When looking precisely at each layer one by one, all seems to be correct: ctdb is fast at managing the IP, NFS is OK, gluster seems to reconstruct, fencing eventually worked with the lanplus workaround, and so on...
But from time to time, a severe hiccup appears which I have great difficulty diagnosing. The messages in the web GUI are not very precise, and not consistent: - some tell about a host having network issues, but I can ping it from every place it needs to be reached (especially from the SPM and the manager)

Ping doesn't say much, as the ssh protocol is the one being used. Please try this and report. Please attach logs (engine+vdsm). Log snippets would be helpful (but more important are full logs).
In general it smells like an ssh/firewall issue.
"On host serv-vm-al01, Error: Network error during communication with the Host"
- some tell that some volume is degraded, when it's not (gluster commands are showing no issue; even the oVirt tab about the volumes is all green)
- "Host serv-vm-al03 cannot access the Storage Domain(s) <UNKNOWN> attached to the Data Center" Just by waiting a couple of seconds lead to a self heal with no action.
- Repeated "Detected change in status of brick serv-vm-al03:/gluster/data/brick of volume data from DOWN to UP." but absolutely no action is made on this filesystem.
This is coming from the earlier issue where the host status was marked Down; the engine sees these bricks as being Down as well, hence the state change messages.
At this time, zero VMs are running in this test datacenter, and no action is being made on the hosts. Still, I see some looping errors coming and going, and I find no way to diagnose them.
Amongst the *actions* that I had the idea to use to solve some issues: - I've found that trying to force the self-healing and playing with gluster commands had no effect - I've found that playing with the gluster-advised actions ("find /gluster -exec stat {} \; ...") seems to have no effect either - I've found that forcing ctdb to move the vIP ("ctdb stop, ctdb continue") DID SOLVE most of these issues. I believe that it's not what ctdb is doing that helps, but maybe one of its shell hooks is cleaning up some trouble?
As this setup is complex, I'm not asking anyone for a silver bullet, but maybe you know which layer is the most fragile, and which one I should look at more closely?
--
Nicolas ECARNOT

On 06/08/2015 14:26, Sahina Bose wrote:
- "Host serv-vm-al03 cannot access the Storage Domain(s) <UNKNOWN> attached to the Data Center" Just by waiting a couple of seconds lead to a self heal with no action.
- Repeated "Detected change in status of brick serv-vm-al03:/gluster/data/brick of volume data from DOWN to UP." but absolutely no action is made on this filesystem.
This is coming from the earlier issue where the host status was marked Down; the engine sees these bricks as being Down as well, hence the state change messages.
OK : When I read "Host ... cannot access the Storage Domain ... attached to the Data Center", and according to the setup described earlier (NFS on ctdb on gluster), is it correct to translate it into "this host can not NFS-mount the storage domain"? In which case it will help me to narrow down my debugging. -- Nicolas ECARNOT

Nicolas,

I have the same setup: a dedicated physical system running the engine on CentOS 6.6, three hosts running CentOS 7.1 with Gluster and KVM, and the firewall is disabled on all hosts. I also followed the same documents to build my environment, so I assume they are very similar. I have on occasion had the same errors and have also found that "ctdb rebalanceip <floating ip>" is the only way to resolve the problem.

I intend to remove ctdb since it is not needed with the configuration we are running. CTDB is only needed for a hosted engine on a floating NFS mount, so you should be able to change the gluster storage domain mount paths to "localhost:<name>". The only thing that has prevented me from making this change is that my environment is live with running VMs. Please let me know if you go this route.

Thank you,
Tim Macy

On Thu, Aug 6, 2015 at 8:41 AM, Nicolas Ecarnot <nicolas@ecarnot.net> wrote:
On 06/08/2015 14:26, Sahina Bose wrote:
- "Host serv-vm-al03 cannot access the Storage Domain(s) <UNKNOWN>
attached to the Data Center" Just by waiting a couple of seconds lead to a self heal with no action.
- Repeated "Detected change in status of brick serv-vm-al03:/gluster/data/brick of volume data from DOWN to UP." but absolutely no action is made on this filesystem.
This is coming from the earlier issue where the host status was marked Down; the engine sees these bricks as being Down as well, hence the state change messages.
OK : When I read "Host ... cannot access the Storage Domain ... attached to the Data Center", and according to the setup described earlier (NFS on ctdb on gluster), is it correct to translate it into "this host can not NFS-mount the storage domain"?
In which case it will help me to narrow down my debugging.
--
Nicolas ECARNOT
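For reference, the ctdb side of that workaround can also be inspected and driven by hand. The commands below are a sketch using the standard ctdb tooling (the vIP and the node number are placeholders), shown alongside the stop/continue sequence and the rebalanceip call mentioned above:

  ctdb status                 # node states (OK / DISCONNECTED / BANNED / STOPPED)
  ctdb ip                     # which node currently hosts each public (floating) IP
  ctdb moveip 10.0.0.100 1    # explicitly move the vIP to node 1, roughly what the workaround achieves
  # or, on the node currently holding the vIP:
  ctdb stop && ctdb continue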

Hi Tim,

Nice to read that someone else is fighting with a similar setup :)

On 06/08/2015 16:36, Tim Macy wrote:
Nicolas, I have the same setup: a dedicated physical system running the engine on CentOS 6.6, three hosts running CentOS 7.1 with Gluster and KVM, and the firewall is disabled on all hosts. I also followed the same documents to build my environment, so I assume they are very similar. I have on occasion had the same errors and have also found that "ctdb rebalanceip <floating ip>" is the only way to resolve the problem.
Indeed, when I'm stopping/continuing my ctdb services, the main action is a move of the vIP. So we agree there is definitely something to dig into there, either directly or as a side effect. I must admit I'd be glad to search further before following the second part of your answer.
I intend to remove ctdb since it is not needed with the configuration we are running. CTDB is only needed for hosted engine on a floating NFS mount,
And in a less obvious manner, it also allows gently removing a host from the vIP managers pool before removing it at the gluster layer. Not a great advantage, but worth mentioning.
so you should be able to change the gluster storage domain mount paths to "localhost:<name>". The only thing that has prevented me from making this change is that my environment is live with running VMs. Please let me know if you go this route.
I'm more than interested in going this way if:
- I find no time to investigate the floating vIP issue
- I can simplify this setup
- This can lead to increased performance

About the master storage domain path, should I use only pure gluster and completely forget about NFS?

--
Nicolas ECARNOT

I have the same setup, and my only issue is at the switch level with CTDB. The IP does fail over; however, until I issue a ping from the interface ctdb is connected to, the storage will not connect. If I go to the host with the CTDB vIP and issue a ping from the interface ctdb is on, everything works as described.

On Thu, Aug 6, 2015 at 5:18 PM, Nicolas Ecarnot <nicolas@ecarnot.net> wrote:
Hi Tim,
Nice to read that someone else is fighting with a similar setup :)
On 06/08/2015 16:36, Tim Macy wrote:
Nicolas, I have the same setup: a dedicated physical system running the engine on CentOS 6.6, three hosts running CentOS 7.1 with Gluster and KVM, and the firewall is disabled on all hosts. I also followed the same documents to build my environment, so I assume they are very similar. I have on occasion had the same errors and have also found that "ctdb rebalanceip <floating ip>" is the only way to resolve the problem.
Indeed, when I'm stopping/continuing my ctdb services, the main action is a move of the vIP. So we agree there is definitely something to dig into there, either directly or as a side effect.
I must admit I'd be glad to search further before following the second part of your answer.
I intend to remove ctdb since it is not needed with the configuration we are running. CTDB is only needed for hosted engine on a floating NFS mount,
And in a less obvious manner, it also allows gently removing a host from the vIP managers pool before removing it at the gluster layer. Not a great advantage, but worth mentioning.
so you should be able to change the gluster storage domain mount paths to "localhost:<name>". The only thing that has prevented me from making this change is that my environment is live with running VMs. Please let me know if you go this route.
I'm more than interested in going this way if: - I find no time to investigate the floating vIP issue - I can simplify this setup - This can lead to increased performance
About the master storage domain path, should I use only pure gluster and completely forget about NFS?
--
Nicolas ECARNOT
-- Donny Davis

On 07/08/2015 02:17, Donny Davis wrote:
I have the same setup, and my only issue is at the switch level with CTDB. The IP does fail over; however, until I issue a ping from the interface ctdb is connected to, the storage will not connect.
If I go to the host with the CTDB vIP and issue a ping from the interface ctdb is on, everything works as described.
I know the problem you're describing, as we faced it in a completely different context. But I'm not sure it's oVirt specific. In our case, what was worse was that our bonding induced similar issues when switching (mode 1), and our ARP cache timeout was too long. (Do YOU also have bonding?) We're still in the process of correcting that, but as I said, it is in a different datacenter, so not related to this thread.

--
Nicolas ECARNOT
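The classic mitigation for that symptom is to have the new vIP holder announce itself so switches and peers refresh their ARP tables right after failover. A sketch only, assuming the iputils arping tool, with the interface name and vIP as placeholders; ctdb's own event scripts normally do this, so this is just for testing whether stale ARP entries really are the cause:

  # on the node that has just taken over the vIP:
  arping -U -I em1 -c 3 10.0.0.100    # gratuitous/unsolicited ARP announcing the vIP's new location
  # on a host that cannot reach the storage, check the cached entry for the vIP:
  ip neigh show 10.0.0.100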

On 06/08/2015 16:36, Tim Macy wrote:
Nicolas, I have the same setup: a dedicated physical system running the engine on CentOS 6.6, three hosts running CentOS 7.1 with Gluster and KVM, and the firewall is disabled on all hosts. I also followed the same documents to build my environment, so I assume they are very similar. I have on occasion had the same errors and have also found that "ctdb rebalanceip <floating ip>" is the only way to resolve the problem. I intend to remove ctdb since it is not needed with the configuration we are running. CTDB is only needed for a hosted engine on a floating NFS mount, so you should be able to change the gluster storage domain mount paths to "localhost:<name>". The only thing that has prevented me from making this change is that my environment is live with running VMs. Please let me know if you go this route.
Thank you, Tim Macy
This week, I eventually took the time to change this, as this DC is not in production.
- Our big NFS storage domain was the master; it contained some VMs.
- I wiped all my VMs.
- I created a very small temporary NFS master domain, because I did not want to bother with any issue related to erasing the last master storage domain.
- I removed the big NFS SD.
- I wiped all that was inside, at the filesystem level.
[ - I disabled ctdb and removed the "meta" gluster volume that ctdb used for its locks ]
- I added a new storage domain, using your advice:
  - gluster type
  - localhost:<name>
- I removed the temp SD, and everything switched correctly to the big glusterFS.

I then spent some time playing with P2V, and storing new VMs on this new-style glusterFS storage domain. I'm watching the CPU and I/O on the hosts, and yes, they are working, but it stays sane.

On this particular change (NFS to glusterFS), everything was very smooth.

Regards,

--
Nicolas ECARNOT
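For anyone following the same route, the volume usually needs the virt-friendly options before being attached as a GlusterFS storage domain. A minimal sketch, assuming the volume is named "data" (these are the commonly documented settings for oVirt, not commands taken from this thread):

  gluster volume set data group virt             # apply the shipped "virt" option group, if available
  gluster volume set data storage.owner-uid 36   # vdsm user
  gluster volume set data storage.owner-gid 36   # kvm group
  gluster volume info data                       # verify the options took effect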
participants (6)
- Donny Davis
- Nicolas Ecarnot
- Sahina Bose
- Tim Macy
- Vered Volansky
- Vered Volansky