On Sun, Oct 28, 2018 at 5:17 PM fsoyer <fsoyer(a)systea.fr> wrote:
Well guys,
I can now say that I have a real problem, maybe between oVirt and the Gluster storage, but I
can't be sure. Yesterday I wanted to clone a VM (named "crij2") from a
snapshot, but (this is another problem, I think) the UI never gave me the popup (a blank
window with the cursor, then a 400 error message after a timeout). So I decided to export it,
then import it.
The export/import finally worked, but while it was running, some VMs randomly became
unresponsive, and one restarted on error. At that time, the engine was on the
"ginger" node. Copy of the event log:
27 oct. 2018 20:32:12 VM crij2 started on Host victor.local.systea.fr
27 oct. 2018 20:31:37 VM crij2 was started by admin@internal-authz (Host:
victor.local.systea.fr).
27 oct. 2018 20:26:39 Vm crij2 was imported successfully to Data Center Default, Cluster
Default
27 oct. 2018 20:22:53 VM logcollector is not responding.
27 oct. 2018 20:22:10 VM Sogov3 is not responding.
27 oct. 2018 20:17:53 VM cerbere4 is not responding.
27 oct. 2018 20:17:49 VM cerbere3 is not responding.
27 oct. 2018 20:17:48 VM logcollector is not responding.
27 oct. 2018 20:16:38 VM Sogov3 is not responding.
27 oct. 2018 20:16:38 VM cerbere4 is not responding.
27 oct. 2018 20:16:38 VM op2drugs1 is not responding.
27 oct. 2018 20:16:33 VM cerbere3 is not responding.
27 oct. 2018 20:07:30 VM op2drugs1 is not responding.
27 oct. 2018 20:06:14 VM cerbere3 is not responding.
27 oct. 2018 20:02:27 VM cerbere3 is not responding.
27 oct. 2018 20:01:11 VM logcollector is not responding.
27 oct. 2018 20:00:56 VM zabbix is not responding.
27 oct. 2018 19:57:42 VM zabbix is not responding.
27 oct. 2018 19:57:42 VM cerbere3 is not responding.
27 oct. 2018 19:57:42 VM logcollector is not responding.
27 oct. 2018 19:54:40 VM zabbix is not responding.
27 oct. 2018 19:53:25 VM cerbere3 is not responding.
27 oct. 2018 19:53:25 VM cerbere4 is not responding.
27 oct. 2018 19:48:29 Starting to import Vm crij2 to Data Center Default, Cluster
Default
27 oct. 2018 19:47:41 Refresh image list succeeded for domain(s): ISO (ISO file type)
27 oct. 2018 19:46:46 VM crij2 was renamed from crij2 to crij2_ok by admin.
27 oct. 2018 19:46:46 VM crij2 configuration was updated by admin@internal-authz.
27 oct. 2018 19:46:12 Refresh image list succeeded for domain(s): ISO (ISO file type)
27 oct. 2018 19:42:36 Refresh image list succeeded for domain(s): ISO (ISO file type)
27 oct. 2018 19:37:22 Vm crij2 was exported successfully to EXPORT
27 oct. 2018 19:36:04 VM HostedEngine is not responding.
27 oct. 2018 19:33:03 VM op2drugs1 is not responding.
27 oct. 2018 19:32:48 VM altern8 is not responding.
27 oct. 2018 19:32:48 VM patjoub1 is not responding.
27 oct. 2018 19:31:03 VM op2drugs1 is not responding.
27 oct. 2018 19:30:48 VM altern8 is not responding.
27 oct. 2018 19:30:48 VM patjoub1 is not responding.
27 oct. 2018 19:28:37 VM Sogov3 is not responding.
27 oct. 2018 19:28:07 VM altern8 is not responding.
27 oct. 2018 19:28:07 VM op2drugs1 is not responding.
27 oct. 2018 19:28:07 VM patjoub1 is not responding.
27 oct. 2018 19:25:10 VM Mint19 is not responding.
27 oct. 2018 19:25:10 VM zabbix is not responding.
27 oct. 2018 19:24:55 VM HostedEngine is not responding.
27 oct. 2018 19:23:33 VM op2drugs1 is not responding.
27 oct. 2018 19:23:18 VM altern8 is not responding.
27 oct. 2018 19:23:18 VM patjoub1 is not responding.
27 oct. 2018 19:21:52 VM op2drugs1 is not responding.
27 oct. 2018 19:20:06 VM patjoub1 is not responding.
27 oct. 2018 19:19:51 VM Sogov3 is not responding.
27 oct. 2018 19:18:26 Host ginger.local.systea.fr power management was verified
successfully.
27 oct. 2018 19:18:26 Status of host ginger.local.systea.fr was set to Up.
27 oct. 2018 19:18:25 Manually synced the storage devices from host
ginger.local.systea.fr
27 oct. 2018 19:17:51 Executing power management status on Host ginger.local.systea.fr
using Proxy Host victor.local.systea.fr and Fence Agent ipmilan:10.0.0.225.
27 oct. 2018 19:17:39 Host ginger.local.systea.fr is not responding. It will stay in
Connecting state for a grace period of 82 seconds and after that an attempt to fence the
host will be issued.
27 oct. 2018 19:17:21 VM altern8 is not responding.
27 oct. 2018 19:17:21 Invalid status on Data Center Default. Setting Data Center status
to Non Responsive (On host ginger.local.systea.fr, Error: Network error during
communication with the Host.).
27 oct. 2018 19:17:21 VM patjoub1 is not responding.
27 oct. 2018 19:17:20 VM HostedEngine is not responding.
27 oct. 2018 19:17:20 VM op2drugs1 is not responding.
27 oct. 2018 19:17:19 VDSM ginger.local.systea.fr command SpmStatusVDS failed: Connection
timeout for host 'ginger.local.systea.fr', last response arrived 17279 ms ago.
27 oct. 2018 19:16:16 Failed to update VMs/Templates OVF data for Storage Domain DATA02
in Data Center Default.
27 oct. 2018 19:16:16 Failed to update OVF disks 85d67951-d610-49b3-aaab-a81850621e35, OVF data isn't
updated on those OVF stores (Data Center Default, Storage Domain DATA02).
27 oct. 2018 19:16:16 VDSM command SetVolumeDescriptionVDS failed: Resource timeout: ()
27 oct. 2018 19:16:16 VM patjoub1 is not responding.
27 oct. 2018 19:16:16 VM op2drugs1 is not responding.
27 oct. 2018 19:14:46 VM patjoub1 is not responding.
27 oct. 2018 19:14:46 VM op2drugs1 is not responding.
27 oct. 2018 19:13:18 Host ginger.local.systea.fr power management was verified
successfully.
27 oct. 2018 19:13:18 Status of host ginger.local.systea.fr was set to Up.
27 oct. 2018 19:13:03 Manually synced the storage devices from host
ginger.local.systea.fr
27 oct. 2018 19:12:51 VM altern8 is not responding.
27 oct. 2018 19:12:51 VM HostedEngine is not responding.
27 oct. 2018 19:12:51 VM op2drugs1 is not responding.
27 oct. 2018 19:12:48 Executing power management status on Host ginger.local.systea.fr
using Proxy Host victor.local.systea.fr and Fence Agent ipmilan:10.0.0.225.
27 oct. 2018 19:12:44 Host ginger.local.systea.fr does not enforce SELinux. Current
status: DISABLED
27 oct. 2018 19:12:36 Invalid status on Data Center Default. Setting Data Center status
to Non Responsive (On host ginger.local.systea.fr, Error: Network error during
communication with the Host.).
27 oct. 2018 19:12:28 Host ginger.local.systea.fr is not responding. It will stay in
Connecting state for a grace period of 82 seconds and after that an attempt to fence the
host will be issued.
27 oct. 2018 19:12:28 VDSM ginger.local.systea.fr command SpmStatusVDS failed: Connection
timeout for host 'ginger.local.systea.fr', last response arrived 25225 ms ago.
27 oct. 2018 19:10:06 VM altern8 is not responding.
27 oct. 2018 19:10:06 VM patjoub1 is not responding.
27 oct. 2018 19:10:06 VM op2drugs1 is not responding.
27 oct. 2018 19:08:49 VM op2drugs1 is not responding.
27 oct. 2018 19:08:45 Refresh image list succeeded for domain(s): ISO (ISO file type)
27 oct. 2018 19:08:34 VM altern8 is not responding.
27 oct. 2018 19:08:34 VM patjoub1 is not responding.
27 oct. 2018 19:08:34 VM HostedEngine is not responding.
27 oct. 2018 19:04:01 VM op2drugs1 is not responding.
27 oct. 2018 19:01:08 VM HostedEngine is not responding.
27 oct. 2018 19:00:53 VM zabbix is not responding.
27 oct. 2018 19:00:01 Trying to restart VM npi2 on Host victor.local.systea.fr
27 oct. 2018 18:59:14 Trying to restart VM npi2 on Host victor.local.systea.fr
27 oct. 2018 18:59:13 Highly Available VM np2 failed. It will be restarted
automatically.
27 oct. 2018 18:59:13 VM npi2 is down with error. Exit message: VM has been terminated on
the host.
27 oct. 2018 18:59:05 VM altern8 is not responding.
27 oct. 2018 18:58:44 Storage domain DATA02 experienced a high latency of 6.16279 seconds
from host ginger.local.systea.fr. This may cause performance and functional issues. Please
consult your Storage Administrator.
27 oct. 2018 18:57:19 VM altern8 is not responding.
27 oct. 2018 18:57:19 VM patjoub1 is not responding.
27 oct. 2018 18:57:19 VM HostedEngine is not responding.
27 oct. 2018 18:57:19 VM op2drugs1 is not responding.
27 oct. 2018 18:55:56 VM altern8 is not responding.
27 oct. 2018 18:55:41 VM op2drugs1 is not responding.
27 oct. 2018 18:55:00 VM altern8 is not responding.
27 oct. 2018 18:54:45 VM op2drugs1 is not responding.
27 oct. 2018 18:52:21 VM Sogov3 is not responding.
27 oct. 2018 18:52:21 VM npi2 is not responding.
27 oct. 2018 18:50:50 VM altern8 is not responding.
27 oct. 2018 18:50:47 VM zabbix is not responding.
27 oct. 2018 18:48:16 VM op2drugs1 is not responding.
27 oct. 2018 18:48:03 VM altern8 is not responding.
27 oct. 2018 18:48:03 VM HostedEngine is not responding.
27 oct. 2018 18:45:48 Starting export Vm crij2 to EXPORT
27 oct. 2018 18:42:57 Refresh image list succeeded for domain(s): ISO (ISO file type)
27 oct. 2018 18:40:44 Refresh image list succeeded for domain(s): ISO (ISO file type)
27 oct. 2018 18:40:04 VM crij2 is down. Exit message: User shut down from within the
guest
27 oct. 2018 18:39:25 User admin@internal-authz got disconnected from VM crij2.
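(A quick way to see which guests were hit hardest is to tally the "is not responding" lines per VM. A throwaway sketch; it assumes the event log above was saved to events.txt, and the heredoc below inlines only a few sample lines as a stand-in for the full log:)

```shell
# Stand-in for the full event log pasted above (a few sample lines only)
cat > events.txt <<'EOF'
27 oct. 2018 20:22:53 VM logcollector is not responding.
27 oct. 2018 20:22:10 VM Sogov3 is not responding.
27 oct. 2018 19:17:21 VM altern8 is not responding.
27 oct. 2018 19:17:20 VM HostedEngine is not responding.
27 oct. 2018 19:12:51 VM altern8 is not responding.
EOF

# Extract the VM name from each "is not responding" event and count per VM
grep -oE 'VM [^ ]+ is not responding' events.txt | sort | uniq -c | sort -rn
```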
I checked the network and Gluster while this was running but saw absolutely nothing special. The
storage network was not under heavy load (bwm-ng indicated at most 50 MB/s on bond1). The Gluster
logs showed no errors either (even though the engine reported some timeouts).
This morning I started looking into the cause and wanted to submit some logs to you on this
thread, but I found something that had not caught my attention before, so I am asking about it
first.
As a reminder, the configuration:
3 hosts with Gluster (replica 2 + arbiter). The volumes are on a separate network (bond1
is an aggregation of two Gb NICs, while ovirtmgmt is on bond0, 2 NICs in backup mode).
For now, I have only declared the first 2 nodes as oVirt nodes in the engine GUI, because
the arbiter is a small machine with a weaker CPU (and only 8 GB of memory), which required
downgrading the cluster CPU type from SandyBridge to Nehalem. Maybe that was a mistake. The
storage network on bond1 was also declared in the GUI, but not yet as a gluster network.
The Gluster volumes themselves were declared on the storage network, using the names
defined in /etc/hosts for the bond1 network. Here is the volume status:
# gluster volume status
Status of volume: DATA01
Gluster process                                                   TCP Port  RDMA Port  Online  Pid
--------------------------------------------------------------------------------------------------
Brick victorstorage.local.systea.fr:/home/data01/data01/brick     49152     0          Y       2489
Brick gingerstorage.local.systea.fr:/home/data01/data01/brick     49152     0          Y       2531
Brick eskarinastorage.local.systea.fr:/home/data01/data01/brick   49153     0          Y       28119
Self-heal Daemon on localhost                                     N/A       N/A        Y       24859
Self-heal Daemon on eskarinastorage.local.systea.fr               N/A       N/A        Y       30725
Self-heal Daemon on victorstorage.local.systea.fr                 N/A       N/A        Y       2810

Task Status of Volume DATA01
--------------------------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: DATA02
Gluster process                                                   TCP Port  RDMA Port  Online  Pid
--------------------------------------------------------------------------------------------------
Brick victorstorage.local.systea.fr:/home/data02/data02/brick     49153     0          Y       2553
Brick gingerstorage.local.systea.fr:/home/data02/data02/brick     49153     0          Y       2561
Brick eskarinastorage.local.systea.fr:/home/data01/data02/brick   49154     0          Y       28204
Self-heal Daemon on localhost                                     N/A       N/A        Y       24859
Self-heal Daemon on eskarinastorage.local.systea.fr               N/A       N/A        Y       30725
Self-heal Daemon on victorstorage.local.systea.fr                 N/A       N/A        Y       2810

Task Status of Volume DATA02
--------------------------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: ENGINE
Gluster process                                                   TCP Port  RDMA Port  Online  Pid
--------------------------------------------------------------------------------------------------
Brick victorstorage.local.systea.fr:/home/data02/engine/brick     49154     0          Y       2571
Brick gingerstorage.local.systea.fr:/home/data02/engine/brick     49154     0          Y       2610
Brick eskarinastorage.local.systea.fr:/home/data01/engine/brick   49152     0          Y       28013
Self-heal Daemon on localhost                                     N/A       N/A        Y       24859
Self-heal Daemon on eskarinastorage.local.systea.fr               N/A       N/A        Y       30725
Self-heal Daemon on victorstorage.local.systea.fr                 N/A       N/A        Y       2810

Task Status of Volume ENGINE
--------------------------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: EXPORT
Gluster process                                                   TCP Port  RDMA Port  Online  Pid
--------------------------------------------------------------------------------------------------
Brick victorstorage.local.systea.fr:/home/data01/export/brick     49155     0          Y       2588
Brick gingerstorage.local.systea.fr:/home/data01/export/brick     49155     0          Y       2629
Brick eskarinastorage.local.systea.fr:/home/data01/export/brick   49156     0          Y       28384
Self-heal Daemon on localhost                                     N/A       N/A        Y       24859
Self-heal Daemon on eskarinastorage.local.systea.fr               N/A       N/A        Y       30725
Self-heal Daemon on victorstorage.local.systea.fr                 N/A       N/A        Y       2810

Task Status of Volume EXPORT
--------------------------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: ISO
Gluster process                                                   TCP Port  RDMA Port  Online  Pid
--------------------------------------------------------------------------------------------------
Brick victorstorage.local.systea.fr:/home/data01/iso/brick        49156     0          Y       2595
Brick gingerstorage.local.systea.fr:/home/data01/iso/brick        49156     0          Y       2636
Brick eskarinastorage.local.systea.fr:/home/data01/iso/brick      49155     0          Y       28292
Self-heal Daemon on localhost                                     N/A       N/A        Y       24859
Self-heal Daemon on eskarinastorage.local.systea.fr               N/A       N/A        Y       30725
Self-heal Daemon on victorstorage.local.systea.fr                 N/A       N/A        Y       2810

Task Status of Volume ISO
-------------------------------
But a df on the nodes shows that all volumes except ENGINE were mounted over the ovirtmgmt
network (host names without "storage"):
gingerstorage.local.systea.fr:/ENGINE 5,0T 226G 4,7T 5%
/rhev/data-center/mnt/glusterSD/gingerstorage.local.systea.fr:_ENGINE
victor.local.systea.fr:/DATA01 1,3T 425G 862G 33%
/rhev/data-center/mnt/glusterSD/victor.local.systea.fr:_DATA01
victor.local.systea.fr:/DATA02 5,0T 226G 4,7T 5%
/rhev/data-center/mnt/glusterSD/victor.local.systea.fr:_DATA02
victor.local.systea.fr:/ISO 1,3T 425G 862G 33%
/rhev/data-center/mnt/glusterSD/victor.local.systea.fr:_ISO
victor.local.systea.fr:/EXPORT 1,3T 425G 862G 33%
/rhev/data-center/mnt/glusterSD/victor.local.systea.fr:_EXPORT
I can't remember how it was declared at install time (maybe I just did not notice it), but
if I try to add a Gluster-managed domain now, it effectively proposes the nodes only by their
ovirtmgmt names, not by their storage names.
The names are only known in the /etc/hosts files of all the nodes plus the engine; there is no
DNS for these local addresses.
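For reference, the kind of mapping I mean looks like this (a sketch only; the IP addresses below are hypothetical placeholders, not the real ones):

```
# ovirtmgmt network (bond0) -- addresses are hypothetical examples
10.0.0.221   victor.local.systea.fr
10.0.0.222   ginger.local.systea.fr
# storage network (bond1) -- addresses are hypothetical examples
10.0.1.221   victorstorage.local.systea.fr
10.0.1.222   gingerstorage.local.systea.fr
10.0.1.223   eskarinastorage.local.systea.fr
```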
So, in your opinion, can this configuration be a (or the) source of my problems? And do you
have an idea how I could correct it now, without losing anything?
I don't think this is the cause of your issues.
Are there errors in the vdsm logs? Do you have storage latency issues
(can you check the gluster volume profile output)?
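In case it's useful: profiling is off by default and can be enabled per volume, then sampled. A sketch, using the DATA02 volume name from your status output (run on one of the gluster nodes):

```shell
# Turn on per-brick I/O stats for the volume (adds a small overhead)
gluster volume profile DATA02 start

# After letting it run through a busy period, dump the cumulative
# per-operation latency and throughput statistics
gluster volume profile DATA02 info

# Turn profiling off again when done
gluster volume profile DATA02 stop
```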
>
> Thanks for all suggestions.
>
> --
>
> Regards,
>
> Frank
>
>
> On Thursday, October 18, 2018 at 23:13 CEST, Nir Soffer <nsoffer(a)redhat.com>
wrote:
> On Thu, Oct 18, 2018 at 3:43 PM fsoyer <fsoyer(a)systea.fr> wrote:
> Hi,
> I forgot to look in the /var/log/messages file on the host! What a shame :/
> Here is the messages file at the time of the error:
https://gist.github.com/fsoyer/4d1247d4c3007a8727459efd23d89737
> At the same time, the second host has no particular messages in its log