TiKV Node Startup Error

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tikv节点启动报错

| username: 胡杨树旁

Error reported when starting TiKV:


Check the current region status:
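Region status can be checked with pd-ctl (via tiup ctl); the PD address and cluster version below are placeholders rather than values from this cluster:

```shell
# Overall store status (Up / Disconnected / Down / Offline / Tombstone).
tiup ctl:<cluster-version> pd -u http://<pd-host>:2379 store

# Regions that are missing a peer or whose peer is reported down.
tiup ctl:<cluster-version> pd -u http://<pd-host>:2379 region check miss-peer
tiup ctl:<cluster-version> pd -u http://<pd-host>:2379 region check down-peer
```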

| username: Kongdom | Original post link

It looks like it might be a disk issue~

| username: 哈喽沃德 | Original post link

Have you ever performed an upgrade operation?

| username: 胡杨树旁 | Original post link

An SST file is corrupted.
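In recent TiKV versions, the damaged SST files can be located with tikv-ctl while the node is stopped; the data directory and PD address below are placeholders, so this is only a sketch:

```shell
# Run on the affected TiKV host while tikv-server is stopped.
# --data-dir and the PD endpoint are placeholders for this cluster's values.
tikv-ctl --data-dir /path/to/tikv-data bad-ssts --pd http://<pd-host>:2379
```

The output lists the damaged SST files and suggestions for handling them, which helps decide whether repairing the files or rebuilding the store is the safer option.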

| username: 胡杨树旁 | Original post link

If an SST file is corrupted and a repair operation has been performed, should the leaders on the problematic node be migrated away first before taking it offline?
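If the node can still be started, the leaders can be drained first with a pd-ctl scheduler before scaling it in; the store ID, PD address, and version below are placeholders:

```shell
# Evict all leaders from the problematic store before taking it offline.
# <store-id> is a placeholder; look it up with `pd-ctl store`.
tiup ctl:<cluster-version> pd -u http://<pd-host>:2379 scheduler add evict-leader-scheduler <store-id>

# Confirm the scheduler exists and watch the store's leader_count drop.
tiup ctl:<cluster-version> pd -u http://<pd-host>:2379 scheduler show
tiup ctl:<cluster-version> pd -u http://<pd-host>:2379 store <store-id>
```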

| username: TiDBer_jYQINSnf | Original post link

This node has a problem; destroy and rebuild it. If only this machine in the cluster is broken, you can safely execute store delete. After it becomes a tombstone, delete the data directory of this node and restart TiKV.

| username: TiDBer_jYQINSnf | Original post link

If a node is damaged, and only this one node is damaged, directly execute store delete. After the node becomes a tombstone, delete the TiKV data directory and restart TiKV to complete the rebuild.
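Roughly, that flow looks like this in pd-ctl; the store ID, PD address, version, and data path are placeholders, so treat it as a sketch rather than exact commands for this cluster:

```shell
# 1. Mark the broken store for removal; PD will schedule its replicas elsewhere.
tiup ctl:<cluster-version> pd -u http://<pd-host>:2379 store delete <store-id>

# 2. Wait until the store state changes from Offline to Tombstone.
tiup ctl:<cluster-version> pd -u http://<pd-host>:2379 store <store-id>

# 3. Clean up tombstone records, wipe the old data directory, then restart TiKV.
tiup ctl:<cluster-version> pd -u http://<pd-host>:2379 store remove-tombstone
rm -rf /path/to/tikv-data   # placeholder path; double-check before deleting
```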

| username: oceanzhang | Original post link

Will reinitializing cause the cluster to hang?

| username: TiDBer_jYQINSnf | Original post link

If only one is broken, just replace it.

| username: 胡杨树旁 | Original post link

Three are broken now, and two new nodes have been scaled out.

| username: Kongdom | Original post link

How many nodes are there in total?

| username: 胡杨树旁 | Original post link

Including the two newly added nodes, there are a total of 13 nodes. Currently, 3 nodes are down, and 9 nodes are up.

| username: TiDBer_jYQINSnf | Original post link

This is a bit risky:

Check how many regions have 2 of their replicas on those down TiKV nodes.
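One way to count them is the jq filter documented for multi-replica-loss scenarios, which lists regions that have at least half of their replicas on the given stores; the PD address, version, and the store IDs <id1>,<id2>,<id3> are placeholders for the down stores:

```shell
# List regions with >= half of their peers on the down stores.
tiup ctl:<cluster-version> pd -u http://<pd-host>:2379 region --jq='.regions[] | {id: .id, peer_stores: [.peers[].store_id] | select(length as $total | map(if .==(<id1>,<id2>,<id3>) then . else empty end) | length >= $total-length)}'
```

For a 3-replica region this matches exactly the regions that have 2 or more replicas on the down nodes, i.e. the regions that have lost their majority.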

| username: Kongdom | Original post link

If it’s less than half, it’s fine; the leaders will migrate automatically during scale-in.

To be precise, it’s the number of lost replicas of a region that must be less than half, not the number of nodes.

| username: 小龙虾爱大龙虾 | Original post link

Waiting for the experts to show up so I can learn from them. :+1:

| username: cassblanca | Original post link

How many replicas does the cluster have?

| username: TiDBer_jYQINSnf | Original post link

Kongdom’s statement isn’t precise.
It’s not that 3 out of 9 TiKV nodes can fail without any issues.
If the 9 TiKV nodes are divided into 3 groups and the 3 failed nodes all belong to the same group, then there is no problem.
If they are not divided into 3 groups but into 9 groups, then any 2 TiKV nodes might hold 2 replicas of the same region; if those 2 nodes fail, that region becomes unavailable.

| username: Kongdom | Original post link

:sweat_smile: Indeed, what I said above is problematic.

| username: 胡杨树旁 | Original post link

The cluster has 3 replicas, but there are still some leaders on the downed nodes, and now those TiKV nodes cannot be started.

| username: 胡杨树旁 | Original post link

The current situation: the three broken TiKV nodes have labels configured, two of them are on the same rack, and all three down nodes still hold leaders. The question is: when a TiKV node goes down, shouldn’t replica replenishment and leader migration kick in? Why haven’t some of the leaders migrated away?
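For reference, a few things can be checked with pd-ctl in this situation (address and version are placeholders): whether the stores are actually marked Down, whether max-store-down-time (30m by default) has already elapsed, and whether PD is generating replenishment operators:

```shell
# Store states: PD only starts replenishing replicas after a store has been
# unreachable for max-store-down-time.
tiup ctl:<cluster-version> pd -u http://<pd-host>:2379 store
tiup ctl:<cluster-version> pd -u http://<pd-host>:2379 config show | grep -i max-store-down-time

# Regions that still have peers on down stores, and the operators PD is currently running.
tiup ctl:<cluster-version> pd -u http://<pd-host>:2379 region check down-peer
tiup ctl:<cluster-version> pd -u http://<pd-host>:2379 operator show
```

Also note that a leader cannot be transferred off a node that is already down; the remaining replicas can only elect a new leader if a majority of the region's replicas is still alive, so any region with 2 of its 3 replicas on the down nodes will keep reporting its last leader on a dead store.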