Hi All,

With the help of the Gluster community and the ovirt-china community, my issue got resolved.

The main root cause was the following:

1. The glob operation takes quite a long time, longer than the ioprocess default timeout of 60s.
2. python-ioprocess was updated, so changing the configuration file alone no longer takes effect; because of this, the code has to be patched manually.

 Solution (needs to be applied on all hosts):

 1. Set the ioprocess timeout in /etc/vdsm/vdsm.conf:

------------
[irs]
process_pool_timeout = 180
------------

2. Check /usr/share/vdsm/storage/outOfProcess.py around line 71 and see whether it still contains "IOProcess(DEFAULT_TIMEOUT)". If it does, changing the configuration file has no effect, because timeout is now the third parameter of IOProcess.__init__(), not the second.

3. Change IOProcess(DEFAULT_TIMEOUT) to IOProcess(timeout=DEFAULT_TIMEOUT), remove /usr/share/vdsm/storage/outOfProcess.pyc, and restart the vdsm and supervdsm services on all hosts (a short sketch of the change follows below).
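
For reference, here is a minimal sketch of the patched call site, assuming it looks like the one described in step 2; the variable name and the DEFAULT_TIMEOUT value are illustrative only, not the exact vdsm source:

------------
# /usr/share/vdsm/storage/outOfProcess.py (around line 71)
from ioprocess import IOProcess

DEFAULT_TIMEOUT = 60  # illustrative; keep whatever outOfProcess.py already defines

# Old call: the timeout was passed positionally, but after the python-ioprocess
# update timeout is the third parameter of IOProcess.__init__(), so the value
# no longer ends up in the timeout argument.
#     ioproc = IOProcess(DEFAULT_TIMEOUT)

# Fixed call: pass the timeout by keyword so it reaches the timeout parameter
# regardless of its position; all other parameters keep their defaults.
ioproc = IOProcess(timeout=DEFAULT_TIMEOUT)
------------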

Thanks,
Punit Dambiwal


On Mon, Mar 23, 2015 at 9:18 AM, Punit Dambiwal <hypunit@gmail.com> wrote:
Hi All,

I am still facing the same issue... please help me overcome it...

Thanks,
punit

On Fri, Mar 20, 2015 at 12:22 AM, Thomas Holkenbrink <thomas.holkenbrink@fibercloud.com> wrote:

I’ve seen this before. The system thinks the storage system is up and running and then attempts to utilize it.

The way I got around it was to put a delay in the startup of the Gluster node on the interface that the clients use to communicate.

 

I use a bonded link, so I add a LINKDELAY to the interface to get the underlying system up and running before the network comes up. This causes network-dependent features to wait for the network to finish coming up.

It adds about 10 seconds to the startup time. It works well in our environment; you may not need as long a delay.

 

CentOS

[root@gls1 ~]# cat /etc/sysconfig/network-scripts/ifcfg-bond0

DEVICE=bond0
ONBOOT=yes
BOOTPROTO=static
USERCTL=no
NETMASK=255.255.248.0
IPADDR=10.10.1.17
MTU=9000
IPV6INIT=no
IPV6_AUTOCONF=no
NETWORKING_IPV6=no
NM_CONTROLLED=no
LINKDELAY=10
NAME="System Storage Bond0"

 

 

 

 

Hi Michal,

 

The storage domain is up and running and mounted on all the host nodes. As I mentioned before, it was working perfectly, but since the reboot the VMs cannot be powered on...

 


 


 

[root@cpu01 log]# gluster volume info

 

Volume Name: ds01

Type: Distributed-Replicate

Volume ID: 369d3fdc-c8eb-46b7-a33e-0a49f2451ff6

Status: Started

Number of Bricks: 48 x 2 = 96

Transport-type: tcp

Bricks:

Brick1: cpu01:/bricks/1/vol1

Brick2: cpu02:/bricks/1/vol1

Brick3: cpu03:/bricks/1/vol1

Brick4: cpu04:/bricks/1/vol1

Brick5: cpu01:/bricks/2/vol1

Brick6: cpu02:/bricks/2/vol1

Brick7: cpu03:/bricks/2/vol1

Brick8: cpu04:/bricks/2/vol1

Brick9: cpu01:/bricks/3/vol1

Brick10: cpu02:/bricks/3/vol1

Brick11: cpu03:/bricks/3/vol1

Brick12: cpu04:/bricks/3/vol1

Brick13: cpu01:/bricks/4/vol1

Brick14: cpu02:/bricks/4/vol1

Brick15: cpu03:/bricks/4/vol1

Brick16: cpu04:/bricks/4/vol1

Brick17: cpu01:/bricks/5/vol1

Brick18: cpu02:/bricks/5/vol1

Brick19: cpu03:/bricks/5/vol1

Brick20: cpu04:/bricks/5/vol1

Brick21: cpu01:/bricks/6/vol1

Brick22: cpu02:/bricks/6/vol1

Brick23: cpu03:/bricks/6/vol1

Brick24: cpu04:/bricks/6/vol1

Brick25: cpu01:/bricks/7/vol1

Brick26: cpu02:/bricks/7/vol1

Brick27: cpu03:/bricks/7/vol1

Brick28: cpu04:/bricks/7/vol1

Brick29: cpu01:/bricks/8/vol1

Brick30: cpu02:/bricks/8/vol1

Brick31: cpu03:/bricks/8/vol1

Brick32: cpu04:/bricks/8/vol1

Brick33: cpu01:/bricks/9/vol1

Brick34: cpu02:/bricks/9/vol1

Brick35: cpu03:/bricks/9/vol1

Brick36: cpu04:/bricks/9/vol1

Brick37: cpu01:/bricks/10/vol1

Brick38: cpu02:/bricks/10/vol1

Brick39: cpu03:/bricks/10/vol1

Brick40: cpu04:/bricks/10/vol1

Brick41: cpu01:/bricks/11/vol1

Brick42: cpu02:/bricks/11/vol1

Brick43: cpu03:/bricks/11/vol1

Brick44: cpu04:/bricks/11/vol1

Brick45: cpu01:/bricks/12/vol1

Brick46: cpu02:/bricks/12/vol1

Brick47: cpu03:/bricks/12/vol1

Brick48: cpu04:/bricks/12/vol1

Brick49: cpu01:/bricks/13/vol1

Brick50: cpu02:/bricks/13/vol1

Brick51: cpu03:/bricks/13/vol1

Brick52: cpu04:/bricks/13/vol1

Brick53: cpu01:/bricks/14/vol1

Brick54: cpu02:/bricks/14/vol1

Brick55: cpu03:/bricks/14/vol1

Brick56: cpu04:/bricks/14/vol1

Brick57: cpu01:/bricks/15/vol1

Brick58: cpu02:/bricks/15/vol1

Brick59: cpu03:/bricks/15/vol1

Brick60: cpu04:/bricks/15/vol1

Brick61: cpu01:/bricks/16/vol1

Brick62: cpu02:/bricks/16/vol1

Brick63: cpu03:/bricks/16/vol1

Brick64: cpu04:/bricks/16/vol1

Brick65: cpu01:/bricks/17/vol1

Brick66: cpu02:/bricks/17/vol1

Brick67: cpu03:/bricks/17/vol1

Brick68: cpu04:/bricks/17/vol1

Brick69: cpu01:/bricks/18/vol1

Brick70: cpu02:/bricks/18/vol1

Brick71: cpu03:/bricks/18/vol1

Brick72: cpu04:/bricks/18/vol1

Brick73: cpu01:/bricks/19/vol1

Brick74: cpu02:/bricks/19/vol1

Brick75: cpu03:/bricks/19/vol1

Brick76: cpu04:/bricks/19/vol1

Brick77: cpu01:/bricks/20/vol1

Brick78: cpu02:/bricks/20/vol1

Brick79: cpu03:/bricks/20/vol1

Brick80: cpu04:/bricks/20/vol1

Brick81: cpu01:/bricks/21/vol1

Brick82: cpu02:/bricks/21/vol1

Brick83: cpu03:/bricks/21/vol1

Brick84: cpu04:/bricks/21/vol1

Brick85: cpu01:/bricks/22/vol1

Brick86: cpu02:/bricks/22/vol1

Brick87: cpu03:/bricks/22/vol1

Brick88: cpu04:/bricks/22/vol1

Brick89: cpu01:/bricks/23/vol1

Brick90: cpu02:/bricks/23/vol1

Brick91: cpu03:/bricks/23/vol1

Brick92: cpu04:/bricks/23/vol1

Brick93: cpu01:/bricks/24/vol1

Brick94: cpu02:/bricks/24/vol1

Brick95: cpu03:/bricks/24/vol1

Brick96: cpu04:/bricks/24/vol1

Options Reconfigured:

diagnostics.count-fop-hits: on

diagnostics.latency-measurement: on

nfs.disable: on

user.cifs: enable

auth.allow: 10.10.0.*

performance.quick-read: off

performance.read-ahead: off

performance.io-cache: off

performance.stat-prefetch: off

cluster.eager-lock: enable

network.remote-dio: enable

cluster.quorum-type: auto

cluster.server-quorum-type: server

storage.owner-uid: 36

storage.owner-gid: 36

server.allow-insecure: on

network.ping-timeout: 100

[root@cpu01 log]#

 

-----------------------------------------

 

[root@cpu01 log]# gluster volume status

Status of volume: ds01

Gluster process                                         Port    Online  Pid

------------------------------------------------------------------------------

Brick cpu01:/bricks/1/vol1                              49152   Y       33474

Brick cpu02:/bricks/1/vol1                              49152   Y       40717

Brick cpu03:/bricks/1/vol1                              49152   Y       18080

Brick cpu04:/bricks/1/vol1                              49152   Y       40447

Brick cpu01:/bricks/2/vol1                              49153   Y       33481

Brick cpu02:/bricks/2/vol1                              49153   Y       40724

Brick cpu03:/bricks/2/vol1                              49153   Y       18086

Brick cpu04:/bricks/2/vol1                              49153   Y       40453

Brick cpu01:/bricks/3/vol1                              49154   Y       33489

Brick cpu02:/bricks/3/vol1                              49154   Y       40731

Brick cpu03:/bricks/3/vol1                              49154   Y       18097

Brick cpu04:/bricks/3/vol1                              49154   Y       40460

Brick cpu01:/bricks/4/vol1                              49155   Y       33495

Brick cpu02:/bricks/4/vol1                              49155   Y       40738

Brick cpu03:/bricks/4/vol1                              49155   Y       18103

Brick cpu04:/bricks/4/vol1                              49155   Y       40468

Brick cpu01:/bricks/5/vol1                              49156   Y       33502

Brick cpu02:/bricks/5/vol1                              49156   Y       40745

Brick cpu03:/bricks/5/vol1                              49156   Y       18110

Brick cpu04:/bricks/5/vol1                              49156   Y       40474

Brick cpu01:/bricks/6/vol1                              49157   Y       33509

Brick cpu02:/bricks/6/vol1                              49157   Y       40752

Brick cpu03:/bricks/6/vol1                              49157   Y       18116

Brick cpu04:/bricks/6/vol1                              49157   Y       40481

Brick cpu01:/bricks/7/vol1                              49158   Y       33516

Brick cpu02:/bricks/7/vol1                              49158   Y       40759

Brick cpu03:/bricks/7/vol1                              49158   Y       18122

Brick cpu04:/bricks/7/vol1                              49158   Y       40488

Brick cpu01:/bricks/8/vol1                              49159   Y       33525

Brick cpu02:/bricks/8/vol1                              49159   Y       40766

Brick cpu03:/bricks/8/vol1                              49159   Y       18130

Brick cpu04:/bricks/8/vol1                              49159   Y       40495

Brick cpu01:/bricks/9/vol1                              49160   Y       33530

Brick cpu02:/bricks/9/vol1                              49160   Y       40773

Brick cpu03:/bricks/9/vol1                              49160   Y       18137

Brick cpu04:/bricks/9/vol1                              49160   Y       40502

Brick cpu01:/bricks/10/vol1                             49161   Y       33538

Brick cpu02:/bricks/10/vol1                             49161   Y       40780

Brick cpu03:/bricks/10/vol1                             49161   Y       18143

Brick cpu04:/bricks/10/vol1                             49161   Y       40509

Brick cpu01:/bricks/11/vol1                             49162   Y       33544

Brick cpu02:/bricks/11/vol1                             49162   Y       40787

Brick cpu03:/bricks/11/vol1                             49162   Y       18150

Brick cpu04:/bricks/11/vol1                             49162   Y       40516

Brick cpu01:/bricks/12/vol1                             49163   Y       33551

Brick cpu02:/bricks/12/vol1                             49163   Y       40794

Brick cpu03:/bricks/12/vol1                             49163   Y       18157

Brick cpu04:/bricks/12/vol1                             49163   Y       40692

Brick cpu01:/bricks/13/vol1                             49164   Y       33558

Brick cpu02:/bricks/13/vol1                             49164   Y       40801

Brick cpu03:/bricks/13/vol1                             49164   Y       18165

Brick cpu04:/bricks/13/vol1                             49164   Y       40700

Brick cpu01:/bricks/14/vol1                             49165   Y       33566

Brick cpu02:/bricks/14/vol1                             49165   Y       40809

Brick cpu03:/bricks/14/vol1                             49165   Y       18172

Brick cpu04:/bricks/14/vol1                             49165   Y       40706

Brick cpu01:/bricks/15/vol1                             49166   Y       33572

Brick cpu02:/bricks/15/vol1                             49166   Y       40815

Brick cpu03:/bricks/15/vol1                             49166   Y       18179

Brick cpu04:/bricks/15/vol1                             49166   Y       40714

Brick cpu01:/bricks/16/vol1                             49167   Y       33579

Brick cpu02:/bricks/16/vol1                             49167   Y       40822

Brick cpu03:/bricks/16/vol1                             49167   Y       18185

Brick cpu04:/bricks/16/vol1                             49167   Y       40722

Brick cpu01:/bricks/17/vol1                             49168   Y       33586

Brick cpu02:/bricks/17/vol1                             49168   Y       40829

Brick cpu03:/bricks/17/vol1                             49168   Y       18192

Brick cpu04:/bricks/17/vol1                             49168   Y       40727

Brick cpu01:/bricks/18/vol1                             49169   Y       33593

Brick cpu02:/bricks/18/vol1                             49169   Y       40836

Brick cpu03:/bricks/18/vol1                             49169   Y       18201

Brick cpu04:/bricks/18/vol1                             49169   Y       40735

Brick cpu01:/bricks/19/vol1                             49170   Y       33600

Brick cpu02:/bricks/19/vol1                             49170   Y       40843

Brick cpu03:/bricks/19/vol1                             49170   Y       18207

Brick cpu04:/bricks/19/vol1                             49170   Y       40741

Brick cpu01:/bricks/20/vol1                             49171   Y       33608

Brick cpu02:/bricks/20/vol1                             49171   Y       40850

Brick cpu03:/bricks/20/vol1                             49171   Y       18214

Brick cpu04:/bricks/20/vol1                             49171   Y       40748

Brick cpu01:/bricks/21/vol1                             49172   Y       33614

Brick cpu02:/bricks/21/vol1                             49172   Y       40858

Brick cpu03:/bricks/21/vol1                             49172   Y       18222

Brick cpu04:/bricks/21/vol1                             49172   Y       40756

Brick cpu01:/bricks/22/vol1                             49173   Y       33621

Brick cpu02:/bricks/22/vol1                             49173   Y       40864

Brick cpu03:/bricks/22/vol1                             49173   Y       18227

Brick cpu04:/bricks/22/vol1                             49173   Y       40762

Brick cpu01:/bricks/23/vol1                             49174   Y       33626

Brick cpu02:/bricks/23/vol1                             49174   Y       40869

Brick cpu03:/bricks/23/vol1                             49174   Y       18234

Brick cpu04:/bricks/23/vol1                             49174   Y       40769

Brick cpu01:/bricks/24/vol1                             49175   Y       33631

Brick cpu02:/bricks/24/vol1                             49175   Y       40874

Brick cpu03:/bricks/24/vol1                             49175   Y       18239

Brick cpu04:/bricks/24/vol1                             49175   Y       40774

Self-heal Daemon on localhost                           N/A     Y       33361

Self-heal Daemon on cpu05                               N/A     Y       2353

Self-heal Daemon on cpu04                               N/A     Y       40786

Self-heal Daemon on cpu02                               N/A     Y       32442

Self-heal Daemon on cpu03                               N/A     Y       18664

 

Task Status of Volume ds01

------------------------------------------------------------------------------

Task                 : Rebalance

ID                   : 5db24b30-4b9f-4b65-8910-a7a0a6d327a4

Status               : completed

 

[root@cpu01 log]#

 

[root@cpu01 log]# gluster pool list

UUID                                    Hostname        State

626c9360-8c09-480f-9707-116e67cc38e6    cpu02           Connected

dc475d62-b035-4ee6-9006-6f03bf68bf24    cpu05           Connected

41b5b2ff-3671-47b4-b477-227a107e718d    cpu03           Connected

c0afe114-dfa7-407d-bad7-5a3f97a6f3fc    cpu04           Connected

9b61b0a5-be78-4ac2-b6c0-2db588da5c35    localhost       Connected

[root@cpu01 log]#

 


 

Thanks,

Punit

 

On Thu, Mar 19, 2015 at 2:53 PM, Michal Skrivanek <michal.skrivanek@redhat.com> wrote:


On Mar 19, 2015, at 03:18 , Punit Dambiwal <hypunit@gmail.com> wrote:

> Hi All,
>
> Does anyone have any idea about this problem? It seems to be a bug in either oVirt or GlusterFS, which is why no one has an idea about it... please correct me if I am wrong...

Hi,
as I said, storage access times out, so it seems to me like a Gluster setup problem; the storage domain you have your VMs on is not working…

Thanks,
michal


>
> Thanks,
> Punit
>
> On Wed, Mar 18, 2015 at 5:05 PM, Punit Dambiwal <hypunit@gmail.com> wrote:
> Hi Michal,
>
> Would you mind letting me know what might have gotten messed up? I will check and try to resolve it... I am still working with the Gluster community to resolve this issue...
>
> But in oVirt the Gluster setup is quite straightforward... so how can it get messed up by a reboot? If it can be messed up by a reboot, then it does not seem like a good, stable technology for production storage...
>
> Thanks,
> Punit
>
> On Wed, Mar 18, 2015 at 3:51 PM, Michal Skrivanek <michal.skrivanek@redhat.com> wrote:
>
> On Mar 18, 2015, at 03:33 , Punit Dambiwal <hypunit@gmail.com> wrote:
>
> > Hi,
> >
> > Is there anyone from the community who can help me solve this issue?
> >
> > Thanks,
> > Punit
> >
> > On Tue, Mar 17, 2015 at 12:52 PM, Punit Dambiwal <hypunit@gmail.com> wrote:
> > Hi,
> >
> > I am facing one strange issue with oVirt/GlusterFS... I still haven't determined whether this issue is related to GlusterFS or oVirt...
> >
> > Ovirt :- 3.5.1
> > Glusterfs :- 3.6.1
> > Hosts :- 4 hosts (compute + storage), each with 24 bricks
> > Guest VMs :- more than 100
> >
> > Issue :- When I deployed this cluster the first time, it worked well for me (all the guest VMs were created and ran successfully). But one day one of my host nodes rebooted, and now none of the VMs can boot up; they fail with the error "Bad Volume Specification".
> >
> > VMId :- d877313c18d9783ca09b62acf5588048
> >
> > VDSM Logs :- http://ur1.ca/jxabi
>
> you've got timeouts while accessing storage… so I guess something got messed up on reboot; it may also be just a Gluster misconfiguration…
>
> > Engine Logs :- http://ur1.ca/jxabv
> >
> > ------------------------
> > [root@cpu01 ~]# vdsClient -s 0 getVolumeInfo e732a82f-bae9-4368-8b98-dedc1c3814de 00000002-0002-0002-0002-000000000145 6d123509-6867-45cf-83a2-6d679b77d3c5 9030bb43-6bc9-462f-a1b9-f6d5a02fb180
> >         status = OK
> >         domain = e732a82f-bae9-4368-8b98-dedc1c3814de
> >         capacity = 21474836480
> >         voltype = LEAF
> >         description =
> >         parent = 00000000-0000-0000-0000-000000000000
> >         format = RAW
> >         image = 6d123509-6867-45cf-83a2-6d679b77d3c5
> >         uuid = 9030bb43-6bc9-462f-a1b9-f6d5a02fb180
> >         disktype = 2
> >         legality = LEGAL
> >         mtime = 0
> >         apparentsize = 21474836480
> >         truesize = 4562972672
> >         type = SPARSE
> >         children = []
> >         pool =
> >         ctime = 1422676305
> > ---------------------
> >
> > I opened the same thread earlier but didn't get any definitive answers to solve this issue, so I am reopening it...
> >
> > https://www.mail-archive.com/users@ovirt.org/msg25011.html
> >
> > Thanks,
> > Punit
> >
> >
> >
>
>
>