A stale file handle is an indication of a split-brain situation. On a 3-way replica this could only mean a gfid mismatch (the gfid is the unique ID Gluster assigns to each file).
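A quick way to check for that (a sketch only; data03 is used as the example volume and the brick path below is a placeholder for your actual brick directory):

****************************************************
# list any files/gfids the self-heal daemon reports as split-brained
gluster volume heal data03 info split-brain

# on each node, compare the trusted.gfid xattr of a suspect file directly on the brick;
# the value must be identical on all three replicas, otherwise it is a gfid mismatch
getfattr -d -m . -e hex /path/to/brick/data03/<suspect-file>
****************************************************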
2 days ago I found that 2 of the 3 oVirt nodes had been set to 'Non-Operational'. GlusterFS seemed to be ok from the command line, but the oVirt engine WebUI was reporting 2 out of 3 bricks per volume as down and the event log was filling up with the following types of messages.

****************************************************
Failed to connect Host ddmovirtprod03 to the Storage Domains data03.
The error message for connection ddmovirtprod03-strg:/data03 returned by VDSM was: Problem while trying to mount target
Failed to connect Host ddmovirtprod03 to Storage Server
Host ddmovirtprod03 cannot access the Storage Domain(s) data03 attached to the Data Center DDM_Production_DC. Setting Host state to Non-Operational.
Failed to connect Host ddmovirtprod03 to Storage Pool
Host ddmovirtprod01 reports about one of the Active Storage Domains as Problematic.
Host ddmovirtprod01 cannot access the Storage Domain(s) data03 attached to the Data Center DDM_Production_DC. Setting Host state to Non-Operational.
Failed to connect Host ddmovirtprod01 to Storage Pool DDM_Production_DC
****************************************************

The following is from the vdsm.log on host01:

****************************************************
[root@ddmovirtprod01 vdsm]# tail -f /var/log/vdsm/vdsm.log | grep "WARN"
2022-03-15 11:37:14,299+0000 WARN (ioprocess/232748) [IOProcess] (6bf1ef03-77e1-423b-850e-9bb6030b590d) Failed to create a probe file: '/rhev/data-center/mnt/glusterSD/ddmovirtprod03-strg:data03/.prob-6c101766-4e5d-40c6-8fa8-0f7e3b3e931e', error: 'Stale file handle' (init:461)
2022-03-15 11:37:24,313+0000 WARN (ioprocess/232748) [IOProcess] (6bf1ef03-77e1-423b-850e-9bb6030b590d) Failed to create a probe file: '/rhev/data-center/mnt/glusterSD/ddmovirtprod03-strg:_data03/.prob-c3fa017b-94dc-47d1-89a4-8ee046509a32', error: 'Stale file handle' (init:461)
2022-03-15 11:37:34,325+0000 WARN (ioprocess/232748) [IOProcess] (6bf1ef03-77e1-423b-850e-9bb6030b590d) Failed to create a probe file: '/rhev/data-center/mnt/glusterSD/ddmovirtprod03-strg:_data03/.prob-e173ecac-4d4d-4b59-a437-61eb5d0beb83', error: 'Stale file handle' (init:461)
2022-03-15 11:37:44,337+0000 WARN (ioprocess/232748) [IOProcess] (6bf1ef03-77e1-423b-850e-9bb6030b590d) Failed to create a probe file: '/rhev/data-center/mnt/glusterSD/ddmovirtprod03-strg:_data03/.prob-baf13698-0f43-4672-90a4-86cecdf9f8d0', error: 'Stale file handle' (init:461)
2022-03-15 11:37:54,350+0000 WARN (ioprocess/232748) [IOProcess] (6bf1ef03-77e1-423b-850e-9bb6030b590d) Failed to create a probe file: '/rhev/data-center/mnt/glusterSD/ddmovirtprod03-strg:_data03/.prob-1e92fdfd-d8e9-48b4-84a9-a2b84fc0d14c', error: 'Stale file handle' (init:461)
****************************************************

After trying different methods to resolve this without success, I did the following.

1. Moved any VM disks using Storage Domain data03 onto other Storage Domains.
2. Placed the data03 Storage Domain into Maintenance mode.
3. Placed host03 into Maintenance mode, stopped the Gluster services and rebooted it.
4. Ensured all bricks were up, the peers connected and healing had started (the kind of checks I mean are shown after this list).
5. Once the Gluster volumes were healed I activated host03, at which point host01 also activated.
6. Host01 was showing as disconnected on most bricks so I rebooted it, which resolved this.
7. I activated Storage Domain data03 without issue.
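For steps 4 and 5, the brick/peer/heal state can be verified with something like the following (a sketch, using data03 as the example volume, not a transcript of my exact session):

****************************************************
# all bricks and the self-heal daemon should show Online "Y"
gluster volume status data03

# all peers should be in "Peer in Cluster (Connected)" state
gluster peer status

# entries still pending heal; "Number of entries: 0" on every brick means the volume is healed
gluster volume heal data03 info
****************************************************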
The system has been left for 24hrs with no further issues. The issue is now resolved, but it would be helpful to know what happened to cause the issues with the Storage Domain data03, and where I should look to confirm.

Regards
Simon