TiKV Node Startup Error

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tikv节点启动报错

| username: 胡杨树旁

Error reported when starting TiKV:


Check the current region status:
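Region status can be checked with pd-ctl (via tiup ctl); the PD address and cluster version below are placeholders rather than values from this cluster:

```shell
# Overall store status (Up / Disconnected / Down / Offline / Tombstone).
tiup ctl:<cluster-version> pd -u http://<pd-host>:2379 store

# Regions that are missing a peer or whose peer is reported down.
tiup ctl:<cluster-version> pd -u http://<pd-host>:2379 region check miss-peer
tiup ctl:<cluster-version> pd -u http://<pd-host>:2379 region check down-peer
```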

| username: Kongdom | Original post link

It looks like it might be a disk issue~

| username: 哈喽沃德 | Original post link

Have you ever performed an upgrade operation?

| username: 胡杨树旁 | Original post link

An SST file is corrupted.
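In recent TiKV versions, the damaged SST files can be located with tikv-ctl while the node is stopped; the data directory and PD address below are placeholders, so this is only a sketch:

```shell
# Run on the affected TiKV host while tikv-server is stopped.
# --data-dir and the PD endpoint are placeholders for this cluster's values.
tikv-ctl --data-dir /path/to/tikv-data bad-ssts --pd http://<pd-host>:2379
```

The output lists the damaged SST files and suggestions for handling them, which helps decide whether repairing the files or rebuilding the store is the safer option.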

| username: 胡杨树旁 | Original post link

If an SST file is corrupted and a repair operation has been performed, should the leaders on the problematic node be migrated away first before taking it offline?
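If the node can still be started, the leaders can be drained first with a pd-ctl scheduler before scaling it in; the store ID, PD address, and version below are placeholders:

```shell
# Evict all leaders from the problematic store before taking it offline.
# <store-id> is a placeholder; look it up with `pd-ctl store`.
tiup ctl:<cluster-version> pd -u http://<pd-host>:2379 scheduler add evict-leader-scheduler <store-id>

# Confirm the scheduler exists and watch the store's leader_count drop.
tiup ctl:<cluster-version> pd -u http://<pd-host>:2379 scheduler show
tiup ctl:<cluster-version> pd -u http://<pd-host>:2379 store <store-id>
```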

| username: TiDBer_jYQINSnf | Original post link

This node has a problem; destroy and rebuild it. If only this machine in the cluster is broken, you can safely execute store delete. After it becomes a tombstone, delete the data directory of this node and restart TiKV.

| username: TiDBer_jYQINSnf | Original post link

If a node is damaged, and only this one node is damaged, directly execute store delete. After the node becomes a tombstone, delete the TiKV data directory and restart TiKV to complete the rebuild.
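Roughly, that flow looks like this in pd-ctl; the store ID, PD address, version, and data path are placeholders, so treat it as a sketch rather than exact commands for this cluster:

```shell
# 1. Mark the broken store for removal; PD will schedule its replicas elsewhere.
tiup ctl:<cluster-version> pd -u http://<pd-host>:2379 store delete <store-id>

# 2. Wait until the store state changes from Offline to Tombstone.
tiup ctl:<cluster-version> pd -u http://<pd-host>:2379 store <store-id>

# 3. Clean up tombstone records, wipe the old data directory, then restart TiKV.
tiup ctl:<cluster-version> pd -u http://<pd-host>:2379 store remove-tombstone
rm -rf /path/to/tikv-data   # placeholder path; double-check before deleting
```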

| username: oceanzhang | Original post link

Will reinitializing cause the cluster to hang?

| username: TiDBer_jYQINSnf | Original post link

If only one is broken, just replace it.

| username: 胡杨树旁 | Original post link

Three are broken now, and two new nodes have been scaled out.

| username: Kongdom | Original post link

How many nodes are there in total?

| username: 胡杨树旁 | Original post link

Including the two newly added nodes, there are a total of 13 nodes. Currently, 3 nodes are down, and 9 nodes are up.

| username: TiDBer_jYQINSnf | Original post link

This is a bit risky:

Check how many regions have 2 of their replicas on those down TiKV nodes.
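One way to count them is the jq filter documented for multi-replica-loss scenarios, which lists regions that have at least half of their replicas on the given stores; the PD address, version, and the store IDs <id1>,<id2>,<id3> are placeholders for the down stores:

```shell
# List regions with >= half of their peers on the down stores.
tiup ctl:<cluster-version> pd -u http://<pd-host>:2379 region --jq='.regions[] | {id: .id, peer_stores: [.peers[].store_id] | select(length as $total | map(if .==(<id1>,<id2>,<id3>) then . else empty end) | length >= $total-length)}'
```

For a 3-replica region this matches exactly the regions that have 2 or more replicas on the down nodes, i.e. the regions that have lost their majority.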

| username: Kongdom | Original post link

If it’s less than half, it’s fine; the leaders will migrate automatically during scale-in.

To be precise, it’s the number of lost replicas of a region that must be less than half, not the number of nodes.

| username: 小龙虾爱大龙虾 | Original post link

Waiting for the experts to show up so I can learn from them. :+1:

| username: cassblanca | Original post link

How many replicas does the cluster have?

| username: TiDBer_jYQINSnf | Original post link

Kongdom’s statement isn’t precise.
It’s not that 3 out of 9 TiKV nodes can fail without any issues.
If the 9 TiKV nodes are divided into 3 groups and the 3 failed nodes all belong to the same group, then there is no problem.
If they are not divided into 3 groups but into 9 groups, then any 2 TiKV nodes might hold 2 replicas of the same region; if those 2 nodes fail, that region becomes unavailable.

| username: Kongdom | Original post link

:sweat_smile: Indeed, what I said above is problematic.

| username: 胡杨树旁 | Original post link

The cluster has 3 replicas, but there are still some leaders on the downed nodes, and now those TiKV nodes cannot be started.

| username: 胡杨树旁 | Original post link

The current situation: the three broken TiKV nodes have labels configured, two of them are on the same rack, and all three down nodes still hold leaders. The question is: when a TiKV node goes down, shouldn’t replica replenishment and leader migration kick in? Why haven’t some of the leaders migrated away?
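For reference, a few things can be checked with pd-ctl in this situation (address and version are placeholders): whether the stores are actually marked Down, whether max-store-down-time (30m by default) has already elapsed, and whether PD is generating replenishment operators:

```shell
# Store states: PD only starts replenishing replicas after a store has been
# unreachable for max-store-down-time.
tiup ctl:<cluster-version> pd -u http://<pd-host>:2379 store
tiup ctl:<cluster-version> pd -u http://<pd-host>:2379 config show | grep -i max-store-down-time

# Regions that still have peers on down stores, and the operators PD is currently running.
tiup ctl:<cluster-version> pd -u http://<pd-host>:2379 region check down-peer
tiup ctl:<cluster-version> pd -u http://<pd-host>:2379 operator show
```

Also note that a leader cannot be transferred off a node that is already down; the remaining replicas can only elect a new leader if a majority of the region's replicas is still alive, so any region with 2 of its 3 replicas on the down nodes will keep reporting its last leader on a dead store.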