Hello Albert,
thanks, and sorry for the late response.
On Sept. 26 vdsmd went down on the host again. (didn't monitor properly, so, only now,
I realized it was down:)
This time I could not find any segvs in the logs.
When did it start?
We used until now, 4.3, and then upgraded first to 4.5.1 (through 4.4), as described in
the documentation. Until then, we never encountered crashes of the vdsmd. Only ony after
going to 4.5.2 and switching to compatibility level 4.7, random crashed of the vdsmd
happen.
The Ovirt engine, btw. is on a seperate Virtual machine, outside of the ovirt cluster.
On the host ,/var/log/vdsm/vdsm.log:
I only can see, logging stoppped, and after restarting vdsmd, it started again. Also,
after restarting vdsmd, all Virtual machines on that host are shown with the proper
running state.
---<snipp>--
2022-09-26 11:52:25,161+0200 INFO (jsonrpc/6) [api.host] FINISH getAllVmStats
return={'status': {'code': 0, 'message': 'Done'},
'statsList': (suppressed)} from=::1,58732 (api:54)
2022-09-26 11:52:25,316+0200 INFO (periodic/2) [vdsm.api] START repoStats(domains=())
from=internal, task_id=cfb0ce21-b884-4fd9-baa6-d020107035fe (api:48)
2022-09-26 11:52:25,317+0200 INFO (periodic/2) [vdsm.api] FINISH repoStats
return={'6f018fbd-de93-4c56-880d-8ede2aad2674': {'code': 0,
'lastCheck': '4.6', 'delay': '0.000571996',
'valid': True, 'version': 5, 'a
cquired': True, 'actual': True},
'41012bfb-b802-4092-b699-7f5284d95c8e': {'code': 0, 'lastCheck':
'5.7', 'delay': '0.00073916', 'valid': True,
'version': 5, 'acquired': True, 'actual': True},
'a15496dc-c241-4658
-af9d-0dfe11783916': {'code': 0, 'lastCheck': '4.4',
'delay': '0.00066827', 'valid': True, 'version': 5,
'acquired': True, 'actual': True},
'2c870e06-6c70-45ec-b665-ce29408c8a8e': {'code': 0, 'lastCheck':
'4.4',
'delay': '0.000624771', 'valid': True, 'version': 5,
'acquired': True, 'actual': True},
'8008fa12-17c7-4d9f-b871-153bdd4a283e': {'code': 0, 'lastCheck':
'1.8', 'delay': '0.000390569', 'valid': True,
'version':
5, 'acquired': True, 'actual': True}} from=internal,
task_id=cfb0ce21-b884-4fd9-baa6-d020107035fe (api:54)
2022-09-26 11:52:26,171+0200 INFO (jsonrpc/4) [vdsm.api] START
getSpmStatus(spUUID='5836aaac-0030-0064-024d-0000000002e4')
from=::ffff:10.2.0.4,37000, task_id=d9d88470-3f04-4060-8d19-f9c68fa928ec (api:48)
2022-09-26 11:52:26,191+0200 INFO (jsonrpc/4) [vdsm.api] FINISH getSpmStatus
return={'spm_st': {'spmStatus': 'SPM', 'spmLver': 66,
'spmId': 2}} from=::ffff:10.2.0.4,37000, task_id=d9d88470-3f04-4060-8d19-f9c68f
a928ec (api:54)
2022-09-26 11:52:26,218+0200 INFO (jsonrpc/5) [vdsm.api] START
getStoragePoolInfo(spUUID='5836aaac-0030-0064-024d-0000000002e4')
from=::ffff:10.2.0.4,54708, task_id=96aab09f-60ca-4ae7-be31-aa3070a08030 (api:48)
2022-09-26 11:52:26,240+0200 INFO (jsonrpc/5) [vdsm.api] FINISH getStoragePoolInfo
return={'info': {'domains':
'8008fa12-17c7-4d9f-b871-153bdd4a283e:Active,6f018fbd-de93-4c56-880d-8ede2aad2674:Active,2c870e06-6
c70-45ec-b665-ce29408c8a8e:Active,a15496dc-c241-4658-af9d-0dfe11783916:Active,41012bfb-b802-4092-b699-7f5284d95c8e:Active',
'isoprefix': '', 'lver': 66, 'master_uuid':
'41012bfb-b802-4092-b699-7f5284d95c8e', 'ma
ster_ver': 293, 'name': 'No Description', 'pool_status':
'connected', 'spm_id': 2, 'type': 'ISCSI',
'version': '5'}, 'dominfo':
{'8008fa12-17c7-4d9f-b871-153bdd4a283e': {'status': 'Active',
'alerts': [], 'isopre
fix': '', 'version': 5, 'disktotal': '879066349568',
'diskfree': '863233900544'},
'6f018fbd-de93-4c56-880d-8ede2aad2674': {'status': 'Active',
'alerts': [], 'isoprefix': '', 'version': 5,
'disktotal': '854966927
36', 'diskfree': '67779952640'},
'2c870e06-6c70-45ec-b665-ce29408c8a8e': {'status': 'Active',
'alerts': [], 'isoprefix': '', 'version': 5,
'disktotal': '536468258816', 'diskfree': '445602856960'},
'a15496dc-c241
-4658-af9d-0dfe11783916': {'status': 'Active', 'alerts': [],
'isoprefix': '', 'version': 5, 'disktotal':
'2683951906816', 'diskfree': '1178028998656'},
'41012bfb-b802-4092-b699-7f5284d95c8e': {'status': 'Active'
, 'alerts': [], 'isoprefix': '', 'version': 5,
'disktotal': '536468258816', 'diskfree': '450300477440'}}}
from=::ffff:10.2.0.4,54708, task_id=96aab09f-60ca-4ae7-be31-aa3070a08030 (api:54)
2022-09-26 11:52:29,273+0200 INFO (jsonrpc/1) [api.host] START getAllVmStats()
from=::ffff:10.2.0.4,37000 (api:48)
2022-09-26 11:52:29,302+0200 INFO (jsonrpc/1) [api.host] FINISH getAllVmStats
return={'status': {'code': 0, 'message': 'Done'},
'statsList': (suppressed)} from=::ffff:10.2.0.4,37000 (api:54)
2022-09-26 11:52:33,280+0200 INFO (jsonrpc/2) [api.host] START getStats()
from=::ffff:10.2.0.4,37000 (api:48)
2022-09-26 11:52:33,392+0200 INFO (jsonrpc/2) [vdsm.api] START repoStats(domains=())
from=::ffff:10.2.0.4,37000, task_id=a6e9dfd1-ed16-4178-b2b5-3396216e39cc (api:48)
2022-09-26 11:52:33,393+0200 INFO (jsonrpc/2) [vdsm.api] FINISH repoStats
return={'6f018fbd-de93-4c56-880d-8ede2aad2674': {'code': 0,
'lastCheck': '2.6', 'delay': '0.000660599',
'valid': True, 'version': 5, 'ac
quired': True, 'actual': True},
'41012bfb-b802-4092-b699-7f5284d95c8e': {'code': 0, 'lastCheck':
'3.7', 'delay': '0.00630966', 'valid': True,
'version': 5, 'acquired': True, 'actual': True},
'a15496dc-c241-4658-
af9d-0dfe11783916': {'code': 0, 'lastCheck': '2.5',
'delay': '0.000574606', 'valid': True, 'version': 5,
'acquired': True, 'actual': True},
'2c870e06-6c70-45ec-b665-ce29408c8a8e': {'code': 0, 'lastCheck':
'2.5',
'delay': '0.000632427', 'valid': True, 'version': 5,
'acquired': True, 'actual': True},
'8008fa12-17c7-4d9f-b871-153bdd4a283e': {'code': 0, 'lastCheck':
'5.2', 'delay': '0.000390569', 'valid': True,
'version':
5, 'acquired': True, 'actual': True}} from=::ffff:10.2.0.4,37000,
task_id=a6e9dfd1-ed16-4178-b2b5-3396216e39cc (api:54)
2022-09-26 11:52:33,393+0200 INFO (jsonrpc/2) [vdsm.api] START multipath_health()
from=::ffff:10.2.0.4,37000, task_id=730e795b-ea87-4523-8711-caa27d7491fa (api:48)
2022-09-26 11:52:33,394+0200 INFO (jsonrpc/2) [vdsm.api] FINISH multipath_health
return={} from=::ffff:10.2.0.4,37000, task_id=730e795b-ea87-4523-8711-caa27d7491fa
(api:54)
2022-09-26 11:52:33,413+0200 INFO (jsonrpc/2) [api.host] FINISH getStats
return={'status': {'code': 0, 'message': 'Done'},
'info': (suppressed)} from=::ffff:10.2.0.4,37000 (api:54)
2022-10-01 07:03:26,112+0200 INFO (MainThread) [vds] (PID: 4130768) I am the actual vdsm
4.50.2.2.1 vserv05 (4.18.0-372.26.1.el8_6.x86_64) (vdsmd:152)
2022-10-01 07:03:26,113+0200 INFO (MainThread) [vds] VDSM will run with cpu affinity:
frozenset({1}) (vdsmd:271)
2022-10-01 07:03:26,122+0200 INFO (MainThread) [storage.hsm] START HSM init (hsm:217)
2022-10-01 07:03:26,125+0200 INFO (MainThread) [storage.hsm] Creating data-center mount
directory '/rhev/data-center/mnt' (hsm:222)
2022-10-01 07:03:26,125+0200 INFO (MainThread) [storage.fileutils] Creating directory:
/rhev/data-center/mnt mode: None (fileUtils:231)
2022-10-01 07:03:26,127+0200 INFO (MainThread) [storage.storagepool] Unmounting master
/rhev/data-center/mnt/blockSD/41012bfb-b802-4092-b699-7f5284d95c8e/master (sp:438)
2022-10-01 07:03:26,215+0200 INFO (MainThread) [storage.mount] unmounting
/rhev/data-center/mnt/blockSD/41012bfb-b802-4092-b699-7f5284d95c8e/master (mount:215)
2022-10-01 07:03:26,338+0200 INFO (MainThread) [storage.hsm] Unlinking file
'/rhev/data-center/5836aaac-0030-0064-024d-0000000002e4/2c870e06-6c70-45ec-b665-ce29408c8a8e'
(hsm:344)
2022-10-01 07:03:26,339+0200 INFO (MainThread) [storage.hsm] Unlinking file
'/rhev/data-center/5836aaac-0030-0064-024d-0000000002e4/6f018fbd-de93-4c56-880d-8ede2aad2674'
(hsm:344)
2022-10-01 07:03:26,339+0200 INFO (MainThread) [storage.hsm] Unlinking file
'/rhev/data-center/5836aaac-0030-0064-024d-0000000002e4/a15496dc-c241-4658-af9d-0dfe11783916'
(hsm:344)
---<snapp>---
On Ovirt-engine, I can see, it cannot communicate with vdsmd, wich is not really
surprinsing, when the vdsmd is down.
---<snipp>---
2022-09-26 11:53:51,974+02 INFO
[org.ovirt.engine.core.bll.aaa.TerminateSessionsForTokenCommand] (default task-237)
[5ad44f73] Running command: TerminateSessionsForTokenCommand internal: true.
2022-09-26 11:53:54,705+02 INFO [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient]
(SSL Stomp Reactor) [] Connecting to /10.2.0.5
2022-09-26 11:53:54,705+02 INFO [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient]
(SSL Stomp Reactor) [] Connected to /10.2.0.5:54321
2022-09-26 11:53:54,707+02 ERROR
[org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring]
(EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-15) [] Unable to
RefreshCapabilities: ConnectExce
ption: Connection refused
2022-09-26 11:53:54,715+02 ERROR
[org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesAsyncVDSCommand]
(EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-15) [] Command
'GetCapabilitiesAsy
ncVDSCommand(HostName = vserv05,
VdsIdAndVdsVDSCommandParametersBase:{hostId='75188aec-44ef-43b9-92b5-e5b06ae22ada',
vds='Host[vserv05,75188aec-44ef-43b9-92b5-e5b06ae22ada]'})' execution failed:
java.net.Connect
Exception: Connection refused
2022-09-26 11:53:56,246+02 ERROR [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient]
(SSL Stomp Reactor) [] Connection timeout for host '10.2.0.5', last response
arrived 90001 ms ago.
2022-09-26 11:53:56,269+02 ERROR
[org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
(EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-51) [3ba89774]
EVENT_ID: IRS_BROKER_COMMAND_FAILURE(10,803), VDSM command GetStoragePoolInfoVDS failed:
Connection timeout for host '10.2.0.5', last response arrived 90001 ms ago.
2022-09-26 11:53:56,270+02 ERROR
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand]
(EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-51) [3ba89774] ERROR,
org.ovirt.engine.core.vdsbroker.irsbroker.GetStoragePoolInfoVDSCommand, exception:
VDSGenericException: VDSNetworkException: Connection timeout for host '10.2.0.5',
last response arrived 90001 ms ago., log id: c571eb8
2022-09-26 11:53:56,273+02 ERROR
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand]
(EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-51) [3ba89774]
Exception VDSNetworkException: VDSGenericException: VDSNetworkException: Connection
timeout for host '10.2.0.5', last response arrived 90001 ms ago.
2022-09-26 11:53:57,726+02 INFO [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient]
(SSL Stomp Reactor) [] Connecting to /10.2.0.5
2022-09-26 11:53:57,727+02 INFO [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient]
(SSL Stomp Reactor) [] Connected to /10.2.0.5:54321
2022-09-26 11:53:57,732+02 ERROR
[org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring]
(EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-85) [] Unable to
RefreshCapabilities: ConnectException: Connection refused
2022-09-26 11:53:57,734+02 ERROR
[org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesAsyncVDSCommand]
(EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-85) [] Command
'GetCapabilitiesAsyncVDSCommand(HostName = vserv05,
VdsIdAndVdsVDSCommandParametersBase:{hostId='75188aec-44ef-43b9-92b5-e5b06ae22ada',
vds='Host[vserv05,75188aec-44ef-43b9-92b5-e5b06ae22ada]'})' execution failed:
java.net.ConnectException: Connection refused
2022-09-26 11:54:00,750+02 INFO [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient]
(SSL Stomp Reactor) [] Connecting to /10.2.0.5
2022-09-26 11:54:00,750+02 INFO [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient]
(SSL Stomp Reactor) [] Connected to /10.2.0.5:54321
2022-09-26 11:54:00,754+02 ERROR
[org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring]
(EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-50) [] Unable to
RefreshCapabilities: ConnectException: Connection refused
2022-09-26 11:54:00,758+02 ERROR
[org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesAsyncVDSCommand]
(EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-50) [] Command
'GetCapabilitiesAsyncVDSCommand(HostName = vserv05,
VdsIdAndVdsVDSCommandParametersBase:{hostId='75188aec-44ef-43b9-92b5-e5b06ae22ada',
vds='Host[vserv05,75188aec-44ef-43b9-92b5-e5b06ae22ada]'})' execution failed:
java.net.ConnectException: Connection refused
2022-09-26 11:54:03,772+02 INFO [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient]
(SSL Stomp Reactor) [] Connecting to /10.2.0.5
2022-09-26 11:54:03,773+02 INFO [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient]
(SSL Stomp Reactor) [] Connected to /10.2.0.5:54321
2022-09-26 11:54:03,780+02 ERROR
[org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring]
(EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-100) [] Unable to
RefreshCapabilities: ConnectException: Connection refused
2022-09-26 11:54:03,782+02 ERROR
[org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesAsyncVDSCommand]
(EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-100) [] Command
'GetCapabilitiesAsyncVDSCommand(HostName = vserv05,
VdsIdAndVdsVDSCommandParametersBase:{hostId='75188aec-44ef-43b9-92b5-e5b06ae22ada',
vds='Host[vserv05,75188aec-44ef-43b9-92b5-e5b06ae22ada]'})' execution failed:
java.net.ConnectException: Connection refused
2022-09-26 11:54:06,330+02 INFO [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient]
(SSL Stomp Reactor) [] Connecting to /10.2.0.5
2022-09-26 11:54:06,330+02 INFO [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient]
(SSL Stomp Reactor) [] Connected to /10.2.0.5:54321
2022-09-26 11:54:06,333+02 ERROR
[org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStatusVDSCommand]
(EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-92) [] Command
'SpmStatusVDSCommand(HostName = vserv05,
SpmStatusVDSCommandParameters:{hostId='75188aec-44ef-43b9-92b5-e5b06ae22ada',
storagePoolId='5836aaac-0030-0064-024d-0000000002e4'})' execution failed:
java.net.ConnectException: Connection refused
2022-09-26 11:54:06,517+02 INFO [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient]
(SSL Stomp Reactor) [] Connecting to /10.2.0.5
2022-09-26 11:54:06,517+02 INFO [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient]
(SSL Stomp Reactor) [] Connected to /10.2.0.5:54321
---<snapp>----