Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.
Original topic: TiKV节点无法启动 (TiKV node cannot start)
[TiDB Usage Environment] Production environment
[TiDB Version] v7.1.1
[Reproduction Path] Unable to start after being killed by the system due to high memory usage
[Encountered Problem: Phenomenon and Impact] Startup reports [FATAL] [setup.rs:309] ["invalid configuration: Cannot find raft data set."]
3 KV nodes, 2 cannot start
[Resource Configuration] 3 PD nodes (4c8g); 2 server nodes (4c8g); 3 KV nodes (8c16g, 500 GB SSD)
[Attachments: Screenshots/Logs/Monitoring]
After data repair, it still cannot start.
Two out of three TiKV nodes are down; check out the three-step solution: 专栏 - TiKV缩容下线异常处理的三板斧 | TiDB 社区 (Column - Three go-to moves for handling abnormal TiKV scale-in | TiDB Community)
Hello, expert! When starting the KV node, it shows [2023/12/13 15:28:42.268 +08:00] [FATAL] [setup.rs:309] ["invalid configuration: Cannot find raft data set."] Can it still be started?
How are you handling this now? It looks like there's a configuration issue, and it can't find the raft data.
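For reference, TiKV raises "Cannot find raft data set" when it finds KV data in its data directory but no raft log data at the expected location. A minimal sketch for checking those directories, assuming a data directory of /data/tikv-20160 and the default raft-engine / raft sub-directories (both are assumptions; use the paths from your tikv.toml):

import os

# Assumed TiKV data directory; replace with storage.data-dir from your tikv.toml.
DATA_DIR = "/data/tikv-20160"

# By default the raft log lives in "<data-dir>/raft-engine" (Raft Engine) or
# "<data-dir>/raft" (the older RocksDB-based raft store) when no explicit path is set.
for sub in ("raft-engine", "raft"):
    path = os.path.join(DATA_DIR, sub)
    entries = os.listdir(path) if os.path.isdir(path) else []
    status = f"present, {len(entries)} entries" if entries else "missing or empty"
    print(f"{path}: {status}")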
How did it end up like this? 
The kv process has exited, and restarting the process gives this error.
None of the 3 KV nodes can start now. Can I create a new cluster to recover using the data files?
Currently, all three KV nodes are down. Node 1 reports:
[2023/12/13 22:08:33.016 +08:00] [ERROR] [util.rs:682] ["connect failed"] [error="Grpc(RpcFailure(RpcStatus { code: 4-DEADLINE_EXCEEDED, message: \"Deadline Exceeded\", details: [] }))"] [endpoints=http://172.16.27.211:2379]
[2023/12/13 22:08:33.021 +08:00] [FATAL] [setup.rs:309] ["invalid configuration: Cannot find raft data set."]
Node 2 reports:
[2023/12/13 22:06:43.282 +08:00] [ERROR] [util.rs:682] ["connect failed"] [error="Grpc(RpcFailure(RpcStatus { code: 4-DEADLINE_EXCEEDED, message: \"Deadline Exceeded\", details: [] }))"] [endpoints=http://172.16.27.211:2379]
[2023/12/13 22:06:49.727 +08:00] [FATAL] [server.rs:921] ["failed to start node: Engine(Other(\"[components/raftstore/src/store/fsm/store.rs:1230]: \\\"[components/raftstore/src/store/entry_storage.rs:657]: [region 16] 17 validate state fail: Other(\\\\\\\"[components/raftstore/src/store/entry_storage.rs:472]: log at recorded commit index [12] 19 doesn't exist, may lose data, region 16, raft state hard_state { term: 5 commit: 5 } last_index: 5, apply state applied_index: 19 commit_index: 19 commit_term: 12 truncated_state { index: 17 term: 11 }\\\\\\\")\\\"\"))"]
Node 3 reports:
[2023/12/13 22:07:33.620 +08:00] [ERROR] [util.rs:682] ["connect failed"] [error="Grpc(RpcFailure(RpcStatus { code: 4-DEADLINE_EXCEEDED, message: \"Deadline Exceeded\", details: [] }))"] [endpoints=http://172.16.27.211:2379]
[2023/12/13 22:07:42.971 +08:00] [FATAL] [server.rs:921] ["failed to start node: StoreTombstone(\"store is tombstone\")"]
It seems it's all because they can't connect to PD; that's why they won't start.
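Whatever the root cause, before attempting any unsafe recovery it may help to see which regions are actually damaged on each stopped node. A rough sketch using tikv-ctl's local mode (the TiKV process must be stopped), assuming tikv-ctl is on the PATH and the data directory is /data/tikv-20160; adjust both for your deployment:

import subprocess

# Assumed TiKV data directory on this host; replace with your storage.data-dir.
DATA_DIR = "/data/tikv-20160"

# "bad-regions" lists regions whose local raft/apply state fails validation,
# such as region 16 in the node 2 log above. Local mode requires TiKV to be stopped.
result = subprocess.run(
    ["tikv-ctl", "--data-dir", DATA_DIR, "bad-regions"],
    capture_output=True,
    text=True,
)
print(result.stdout or result.stderr)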
Yesterday I set up a new cluster with the original configuration, and it started normally. Then I copied the old cluster's PD and KV data files into the new cluster and tried to start it. That approach didn't work, and the old cluster was also contaminated by the new one: on startup I saw the new cluster making connections to the old cluster, so I suspect the PD data files record the cluster information. At that point I shut down the new cluster and found that one PD in the old cluster was down; starting it reports this error:
[2023/12/13 20:44:55.410 +08:00] [ERROR] [kv.go:295] ["fail to load safepoint from pd"] [error="context deadline exceeded"]
[2023/12/13 20:44:55.528 +08:00] [ERROR] [pd_service_discovery.go:221] ["[pd] failed to update member"] [urls="[http://172.16.26.212:2379,http://172.16.26.213:2379,http://172.16.27.211:2379]"] [error="[PD:client:ErrClientGetMember]get member failed"]
Can it still be rescued?
The original cluster is 172.16.27.211-216, and the new cluster is 172.16.26.211-216.
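One way to check the suspicion that the PD data files carry the cluster identity is to ask each PD for its cluster ID over the HTTP API. A rough sketch, assuming the two endpoints below (taken from the logs above) are the ones worth comparing and that the PD HTTP API at /pd/api/v1/cluster is reachable in this version:

import json
import urllib.request

# Assumed PD endpoints, one from each cluster; adjust as needed.
ENDPOINTS = ["http://172.16.27.211:2379", "http://172.16.26.212:2379"]

for ep in ENDPOINTS:
    try:
        # GET /pd/api/v1/cluster returns the cluster metadata, including its ID,
        # so two PDs reporting the same ID believe they belong to the same cluster.
        with urllib.request.urlopen(f"{ep}/pd/api/v1/cluster", timeout=3) as resp:
            info = json.load(resp)
        print(ep, "cluster id:", info.get("id"))
    except Exception as exc:
        print(ep, "unreachable:", exc)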
Your cloud environment looks quite unreliable. The errors on nodes 1 and 2 point to lost or corrupted data, which is why they fail to start. Is node 3 in tombstone state because it was scaled in at some point? Starting a new cluster on top of the old data files is an untested approach, especially since your original data files are already damaged. Even if you set up a new cluster, its PD would still need pd-recover to restore the cluster-id and alloc-id.
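For context on that last point, pd-recover writes the old cluster-id into a freshly deployed PD and sets its ID allocator above a safe floor. A rough sketch of the invocation, assuming pd-recover is on the PATH and the new PD listens on http://172.16.26.212:2379 (both assumptions); the cluster-id comes from the old PD/TiKV logs ("init cluster id ...") and alloc-id should be comfortably larger than any ID the old cluster ever allocated:

import subprocess
import sys

# Usage: python pd_recover_sketch.py <cluster-id> <alloc-id>
# cluster-id: the old cluster's ID, found in old PD/TiKV logs ("init cluster id ...")
# alloc-id: a value safely larger than the largest ID the old cluster ever allocated
cluster_id, alloc_id = sys.argv[1], sys.argv[2]

# Assumed endpoint of the newly deployed PD; adjust to your deployment.
PD_ENDPOINT = "http://172.16.26.212:2379"

subprocess.run(
    ["pd-recover", "-endpoints", PD_ENDPOINT,
     "-cluster-id", cluster_id, "-alloc-id", alloc_id],
    check=True,
)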
Can you explain this in detail? Or provide some documentation, thanks!
This environment had been running for a long time without any issues. Yesterday morning, one KV node suddenly went down, and the system logs showed it had been OOM-killed; the node couldn't be brought back up. By noon, another KV node went down, also due to OOM, and it couldn't be brought back up either.
It's unbelievable that the cloud environment can cause I/O corruption; it's so unreliable.