Cannot restart oVirt after massive failure.

Hello all,

During the night, one of my (smaller) setups, a single-node self-hosted engine (localhost NFS), crashed due to what looks like a massive disk failure (software RAID6 with 10 drives + spare). After a reboot, I let the RAID resync with a fresh drive and went on to start oVirt. However, no such luck. Two issues:
1. ovirt-ha-broker fails due to a broken hosted engine state (log attached).
2. ovirt-ha-agent fails the network test (tcp) even though both the remote host and the DNS servers are reachable (log attached).
Two questions:
1. Can I somehow force the agent to disable the network liveness test?
2. Can I somehow force the broker to rebuild / fix the hosted engine state?
- Gilboa

FWIW, switching the agent network test to none (via hosted-engine --set-shared-config network_test none --type=he_local) doesn't seem to work. (Unless I'm missing the point and the agent is failing due to broker issues and not due to a failed network liveness check.)
- Gilboa
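(A rough sketch of how a change like this is usually verified and picked up; the key name and --type value follow the command above, and whether he_local or he_shared is the right scope may vary by version, so treat the scope as an assumption:)

$ hosted-engine --set-shared-config network_test none --type=he_local
$ hosted-engine --get-shared-config network_test --type=he_local   # confirm the value was stored
$ systemctl restart ovirt-ha-broker ovirt-ha-agent                 # the daemons read the config on start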

Usually this is not the problem. Start checking:
1. The export FS is mounted.
2. The NFS server is running (after all, this is a single-node NFS setup).
3. vdsmd, supervdsmd and sanlock are running.
4. If needed, enable debug logging for ovirt-ha-{agent,broker}, as the default log level usually won't show the problem (see the sketch below).
Best Regards,
Strahil Nikolov
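(A minimal sketch of point 4, assuming a typical ovirt-hosted-engine-ha install; the log config file names and the exact 'level=' lines are assumptions, so check what is actually in the files before editing:)

$ # raise the root logger to DEBUG in the agent/broker log configs
$ sudo sed -i 's/^level=.*/level=DEBUG/' /etc/ovirt-hosted-engine-ha/agent-log.conf
$ sudo sed -i 's/^level=.*/level=DEBUG/' /etc/ovirt-hosted-engine-ha/broker-log.conf
$ sudo systemctl restart ovirt-ha-broker ovirt-ha-agent
$ # then follow the logs while reproducing the failure
$ sudo tail -f /var/log/ovirt-hosted-engine-ha/agent.log /var/log/ovirt-hosted-engine-ha/broker.log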

Hello, On Sun, Aug 8, 2021 at 9:08 PM Strahil Nikolov <hunter86_bg@yahoo.com> wrote:
Usually this is not the problem.
Start checking: 1. Export FS is mounted 2. NFS server is running (after all this is a single node NFS setup) 3. Check that vdsmd , supervdsmd and sanlock are running 4. If needed, enable debug for the ovirt-ha-{agent,broker} as usually the default log level won't show the problem.
Best Regards, Strahil Nikolov
1. All NFS shares are exported; the hosted storage (used by the hosted engine) is mounted by oVirt.

$ mount | grep rhev
localhost:/exports/hosted on /rhev/data-center/mnt/localhost:_exports_hosted type nfs4 (rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,soft,nosharecache,proto=tcp,timeo=100,retrans=3,sec=sys,clientaddr=127.0.0.1,local_lock=none,addr=127.0.0.1)

2. NFS is working as expected.

$ exportfs | grep exports
/exports/hosted 127.0.0.1/255.255.255.0
/exports/data 127.0.0.1/255.255.255.0
/exports/iso 127.0.0.1/255.255.255.0
/exports/export 127.0.0.1/255.255.255.0

3. All services seem to run just fine (minus broker and agent).

$ ps -AH | /bin/egrep -e 'vdsm|sanlock'
 2282 ?        00:00:00 sanlock
 2284 ?        00:00:00 sanlock-helper
 5065 ?        00:00:02 supervdsmd
12259 ?        00:20:15 vdsmd

4. In both cases I can see the problem in the log.

Broker:
--------------------------------------------------------------------------------------------------------------------------------------------------
MainThread::INFO::2021-08-08 19:46:06,962::status_broker::121::ovirt_hosted_engine_ha.broker.status_broker.StatusBroker::(__init__) Status broker initialized.
Listener::INFO::2021-08-08 19:46:06,962::listener::44::ovirt_hosted_engine_ha.broker.listener.Listener::(__init__) Initializing RPCServer
Listener::INFO::2021-08-08 19:46:06,963::listener::57::ovirt_hosted_engine_ha.broker.listener.Listener::(__init__) RPCServer ready
StatusStorageThread::ERROR::2021-08-08 19:46:06,985::storage_broker::167::ovirt_hosted_engine_ha.broker.storage_broker.StorageBroker::(get_raw_stats) Corrupted metadata from /run/vdsm/storage/9541c195-9f59-4225-91be-53391b4f1bb3/10cb67f7-6be2-47e4-9268-81fca9862057/deadf86f-b937-4172-8359-90c991dc2ecf
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py", line 163, in get_raw_stats
    data = bdata.decode()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb9 in position 4191756: invalid start byte
StatusStorageThread::ERROR::2021-08-08 19:46:06,986::status_broker::98::ovirt_hosted_engine_ha.broker.status_broker.StatusBroker.Update::(run) Failed to read state.
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py", line 163, in get_raw_stats
    data = bdata.decode()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb9 in position 4191756: invalid start byte

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/status_broker.py", line 94, in run
    self._storage_broker.get_raw_stats()
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py", line 169, in get_raw_stats
    .format(str(e)))
ovirt_hosted_engine_ha.lib.exceptions.RequestError: Corrupted read metadata: 'utf-8' codec can't decode byte 0xb9 in position 4191756: invalid start byte
StatusStorageThread::ERROR::2021-08-08 19:46:06,987::status_broker::70::ovirt_hosted_engine_ha.broker.status_broker.StatusBroker.Update::(trigger_restart) Trying to restart the broker
Listener::INFO::2021-08-08 19:46:07,464::broker::77::ovirt_hosted_engine_ha.broker.broker.Broker::(run) Server shutting down
Listener::INFO::2021-08-08 19:46:07,464::monitor::117::ovirt_hosted_engine_ha.broker.monitor.Monitor::(stop_all_submonitors) Stopping all submonitors
MainThread::INFO::2021-08-08 19:46:08,060::broker::47::ovirt_hosted_engine_ha.broker.broker.Broker::(run) ovirt-hosted-engine-ha broker 2.4.7 started

Agent:
--------------------------------------------------------------------------------------------------------------------------------------------------
MainThread::INFO::2021-08-08 19:36:25,467::brokerlink::82::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Starting monitor network, options {'addr': '192.168.1.9', 'network_test': 'tcp', 'tcp_t_address': '192.168.1.2', 'tcp_t_port': '22'}
MainThread::ERROR::2021-08-08 19:36:25,468::hosted_engine::564::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_broker) Failed to start necessary monitors
MainThread::ERROR::2021-08-08 19:36:25,470::agent::143::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 85, in start_monitor
    response = self._proxy.start_monitor(type, options)
  File "/usr/lib64/python3.6/xmlrpc/client.py", line 1112, in __call__
    return self.__send(self.__name, args)
  File "/usr/lib64/python3.6/xmlrpc/client.py", line 1452, in __request
    verbose=self.__verbose
  File "/usr/lib64/python3.6/xmlrpc/client.py", line 1154, in request
    return self.single_request(host, handler, request_body, verbose)
  File "/usr/lib64/python3.6/xmlrpc/client.py", line 1166, in single_request
    http_conn = self.send_request(host, handler, request_body, verbose)
  File "/usr/lib64/python3.6/xmlrpc/client.py", line 1279, in send_request
    self.send_content(connection, request_body)
  File "/usr/lib64/python3.6/xmlrpc/client.py", line 1309, in send_content
    connection.endheaders(request_body)
  File "/usr/lib64/python3.6/http/client.py", line 1264, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/lib64/python3.6/http/client.py", line 1040, in _send_output
    self.send(msg)
  File "/usr/lib64/python3.6/http/client.py", line 978, in send
    self.connect()
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/unixrpc.py", line 74, in connect
    self.sock.connect(base64.b16decode(self.host))
FileNotFoundError: [Errno 2] No such file or directory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 131, in _run_agent
    return action(he)
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 55, in action_proper
    return he.start_monitoring()
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 437, in start_monitoring
    self._initialize_broker()
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 561, in _initialize_broker
    m.get('options', {}))
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 91, in start_monitor
    ).format(t=type, o=options, e=e)
ovirt_hosted_engine_ha.lib.exceptions.RequestError: brokerlink - failed to start monitor via ovirt-ha-broker: [Errno 2] No such file or directory, [monitor: 'network', options: {'addr': '192.168.1.9', 'network_test': 'tcp', 'tcp_t_address': '192.168.1.2', 'tcp_t_port': '22'}]

Corrupted metadata is the problem you see. I think there was a command to fix it, but I can't recall it right now.
Best Regards,
Strahil Nikolov

On Mon, Aug 9, 2021 at 11:43 AM Strahil Nikolov via Users <users@ovirt.org> wrote:
Corrupted metadata is the problem you see.
I think there was a command to fix it, but I can't recall it right now.
I think you refer to 'hosted-engine --clean-metadata'. Gilboa - I suggest searching the net/archives for docs/mentions/discussions of this option - it's rather drastic. Good luck.

That said, I must say that if your metadata is corrupted, I wonder what else is - so I would continue using this setup with great care. Ideally, restore from backups after testing/replacing the hardware.

Best regards,
-- Didi
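(For reference, a sketch of how this option is typically invoked. Assumptions: the host is put into global maintenance first, and the HA services are otherwise healthy - as the thread shows later, the command needs a working ovirt-ha-agent - so check the docs before relying on this.)

$ hosted-engine --set-maintenance --mode=global
$ hosted-engine --clean-metadata
$ hosted-engine --set-maintenance --mode=none
$ hosted-engine --vm-status   # verify the reported state looks sane again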

Hello, On Mon, Aug 9, 2021 at 11:50 AM Yedidyah Bar David <didi@redhat.com> wrote:
I think you refer to 'hosted-engine --clean-metadata'. Gilboa - I suggest to search the net/archives for docs/mentions/discussions of this option - it's rather drastic. Good luck.
That said, I must say that if your metadata is corrupted, I wonder what else is - so would continue using this setup with great care. Ideally restore from backups, after testing/replacing the hardware.
Best regards,
Thanks for the pointer. This is a side setup that's about to be replaced by a real setup (3-host Gluster). That said, beyond the corrupted metadata, everything else seems to be working just fine: the host boots just fine, the RAID sync showed no issues, XFS partitions mounted OK, etc. The only thing that seems damaged is the hosted engine metadata.
I'll test it and report back.
- Gilboa

Stupid question: won't clean-metadata remove the host from the "cluster" and, given the fact that it's a single-host configuration, require a clean redeploy?
- Gilboa

On Mon, Aug 9, 2021 at 1:56 PM Gilboa Davara <gilboad@gmail.com> wrote:
Stupid question: Won't clean meta data remove the host from the "cluster" and given the fact that its a single host configuration, require a clean redploy?
It's not stupid. Generally speaking, the metadata is populated by the HA daemons themselves, not something "external". If a specific host's entry is missing, they should write it.
-- Didi

OK. Thanks again for the prompt answer. - Gilboa

Sadly enough, it seems that --clean-metadata requires an active agent. E.g.:

$ hosted-engine --clean-metadata
The hosted engine configuration has not been retrieved from shared storage. Please ensure that ovirt-ha-agent is running and the storage server is reachable.

Can I manually delete the metadata state files?
- Gilboa

On Tue, Aug 10, 2021 at 7:19 AM Gilboa Davara <gilboad@gmail.com> wrote:
Sadly enough, it seems that --clean-metadata requires an active agent. E.g. $ hosted-engine --clean-metadata The hosted engine configuration has not been retrieved from shared storage. Please ensure that ovirt-ha-agent is running and the storage server is reachable.
Did you try to search the net/list archives?
Can I manually delete the metadata state files?
Yes, see e.g.:
https://lists.ovirt.org/pipermail/users/2016-April/072676.html

As an alternative to the 'find' command there, you can also find the IDs with:

$ grep metadata /etc/ovirt-hosted-engine/hosted-engine.conf

Best regards,
-- Didi

Hello, Many thanks again for taking the time to try and help me recover this machine (even though it would have been far easier to simply redeploy it...)
Sadly enough, it seems that --clean-metadata requires an active agent. E.g. $ hosted-engine --clean-metadata The hosted engine configuration has not been retrieved from shared
storage. Please ensure that ovirt-ha-agent
is running and the storage server is reachable.
Did you try to search the net/list archives?
Yes. All of them seem to repeat the same clean-metadata command (which fails).
Can I manually delete the metadata state files?
Yes, see e.g.:
https://lists.ovirt.org/pipermail/users/2016-April/072676.html
As an alternative to the 'find' command there, you can also find the IDs with:
$ grep metadata /etc/ovirt-hosted-engine/hosted-engine.conf
Best regards, -- Didi
Yippie! Success! (At least it seems that way...)

Following https://lists.ovirt.org/pipermail/users/2016-April/072676.html, I stopped the broker and agent services, archived the existing hosted engine metadata files, and created an empty 1GB metadata file using dd (dd if=/dev/zero of=/run/vdsm/storage/<uuid>/<uuid> bs=1M count=1024), making double sure the permissions (0660 / 0644), owner (vdsm:kvm) and SELinux labels (restorecon, just in case) stayed the same. Let everything settle down. Restarted the services... and everything is up again :)

I plan to let the engine run overnight with zero VMs (making sure all backups are fully up-to-date). Once done, I'll return to normal operation (until I replace this setup with a normal multi-node setup).

Many thanks again!
- Gilboa
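(For anyone following along, the steps above might look roughly like this sketch. The <uuid> path components are placeholders - take them from the metadata entries in /etc/ovirt-hosted-engine/hosted-engine.conf as suggested earlier. This mirrors what was described, not a tested recipe.)

$ systemctl stop ovirt-ha-agent ovirt-ha-broker
$ grep metadata /etc/ovirt-hosted-engine/hosted-engine.conf        # locate the metadata image/volume UUIDs
$ cd /run/vdsm/storage/<storage_domain_uuid>/<metadata_image_uuid>
$ cp <metadata_volume_uuid> /root/he-metadata.bak                  # keep a copy of the corrupted file
$ dd if=/dev/zero of=<metadata_volume_uuid> bs=1M count=1024       # fresh, empty 1 GB metadata file
$ chown vdsm:kvm <metadata_volume_uuid>
$ chmod 0660 <metadata_volume_uuid>
$ restorecon -v <metadata_volume_uuid>
$ systemctl start ovirt-ha-broker ovirt-ha-agent
$ hosted-engine --vm-status                                        # confirm the daemons repopulate the state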

On Tue, Aug 10, 2021 at 9:20 PM Gilboa Davara <gilboad@gmail.com> wrote:
Did you try to search the net/list archives?
Yes. All of them seem to repeat the same clean-metadata command (which fails).
I suppose we need better documentation. Sorry. Perhaps open a bug/issue about that.
Yippie! Success (At least it seems that way...)
Glad to hear that, welcome, thanks for the report!

More tests you might want to do before starting your real VMs:
- Set and later clear global maintenance from each host, and see that this propagates to the others (both 'hosted-engine --vm-status' and agent.log).
- Migrate the engine VM between the hosts and see that this propagates.
- Shut down the engine VM without global maintenance and see that it's started automatically.

But I do not think all of this is mandatory, if 'hosted-engine --vm-status' looks ok on all hosts. I'd still be careful with other things that might have been corrupted, though - obviously I can't tell you what/where...

Best regards,
-- Didi
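(A rough sketch of the first check, using only the hosted-engine CLI calls already shown elsewhere in the thread; the exact wording of the --vm-status output varies by version:)

$ hosted-engine --set-maintenance --mode=global
$ hosted-engine --vm-status        # every host should now report global maintenance
$ hosted-engine --set-maintenance --mode=none
$ hosted-engine --vm-status        # state should return to normal on all hosts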

Shabbat Shalom, On Wed, Aug 11, 2021 at 10:03 AM Yedidyah Bar David <didi@redhat.com> wrote:
I suppose we need better documentation. Sorry. Perhaps open a bug/issue about that.
Done. https://bugzilla.redhat.com/show_bug.cgi?id=1993575
Host is back to normal. The log looks clean (minus some odd SMTP errors).
Either way, I'm already in the process of replacing this setup with a real 3-host + Gluster setup, so I just need this machine to survive the next couple of weeks :)
- Gilboa

On Sat, Aug 14, 2021 at 8:58 AM Gilboa Davara <gilboad@gmail.com> wrote:
Host is back to normal. The log looks clean (minus some odd smtp errors in the log).
That's normal, if you didn't configure a local (by default) mail server.
Either way, I'm already in the process of replacing this setup with a real 3 host + gluster setup, so I just need this machine to survive the next couple of weeks :)
Good luck and best regards, -- Didi

Thanks Gilboa, your comment here about performing a dd to clear the metadata just saved me from having to rebuild a new engine. Much appreciated.
Austin

We should thank Yedidyah Bar David, who gave the original solution.
- Gilboa