Several TiKV Nodes Disconnected After Pruning the Cluster

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: prune集群后出现几个tikv节点disconnect

| username: wakaka

[TiDB Usage Environment] Production Environment
[TiDB Version] v5.0.6
[Reproduction Path] After taking TiFlash offline and waiting for the node's replica count to drop to 0 and the store to enter the Tombstone state, tiup cluster prune was executed (see the command sketch below).
[Encountered Problem: Phenomenon and Impact] Several TiKV nodes became disconnected, reporting errors about a non-existent store ID.

[Resource Configuration]
[Attachments: Screenshots/Logs/Monitoring]
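
For reference, the offline sequence described above corresponds roughly to these commands; a sketch, where the cluster name and the TiFlash address are placeholders:

```bash
tiup cluster scale-in <cluster-name> -N 10.0.1.5:9000   # take the TiFlash node offline
tiup cluster display <cluster-name>                     # wait until the node shows Tombstone
tiup cluster prune <cluster-name>                       # then clean up tombstone nodes
```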

| username: tidb菜鸟一只 | Original post link

SELECT * FROM INFORMATION_SCHEMA.`TIKV_STORE_STATUS` a WHERE a.`STORE_ID` = '';
Fill in the store ID from the error and let's see which node it is.
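
If SQL access isn't handy, pd-ctl should show the same information; a minimal sketch, assuming tiup is available and a PD endpoint of 127.0.0.1:2379 (substitute your own):

```bash
# List every store PD knows about, including its state (Up/Offline/Tombstone):
tiup ctl:v5.0.6 pd -u http://127.0.0.1:2379 store

# Or look up just the store ID reported in the TiKV error (4 is a placeholder):
tiup ctl:v5.0.6 pd -u http://127.0.0.1:2379 store 4
```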
| username: wakaka | Original post link

This doesn’t work.


The first pdctl command hung without returning anything and had to be killed; the second one also returned 0.

| username: 考试没答案 | Original post link

Would you be willing to try a restart? Restart the PD leader so leadership switches to another node, and see what happens.
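
For what it's worth, a leader switch doesn't require a blind restart; a sketch of both options, where the cluster name, member name, and endpoints are placeholders:

```bash
# Transfer PD leadership to another member without restarting anything
# (run the "member" subcommand first to find the real member names):
tiup ctl:v5.0.6 pd -u http://127.0.0.1:2379 member leader transfer pd-2

# Or restart only the current PD leader node, forcing a re-election:
tiup cluster restart <cluster-name> -N 10.0.1.1:2379
```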

| username: wakaka | Original post link

Is there any theoretical basis for this? I don't see this store when querying each PD's API individually either.
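
For completeness, this is roughly how each PD member can be asked individually over its HTTP API; a sketch, where the PD hosts and the store ID are placeholders:

```bash
# Ask every PD member (not just the leader) about the suspect store:
for pd in pd-1:2379 pd-2:2379 pd-3:2379; do
  echo "== $pd =="
  curl -s "http://$pd/pd/api/v1/store/4"
done

# Tombstone stores can also be listed directly; state=2 is assumed to be
# the Tombstone value of PD's store-state enum:
curl -s "http://pd-1:2379/pd/api/v1/stores?state=2"
```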

| username: wakaka | Original post link

[The original reply contained only an image.]

| username: jansu-dev | Original post link

This issue is quite complex and should be related to the case we supported last night:
TiKV disconnects and elevated PD monitoring: so far we have determined that the TiKV nodes disconnected because their raftstore was overwhelmed. The raftstore was overwhelmed because silent (hibernated) regions were woken up and kept sending requests to PD, driving raftstore CPU up until the nodes lost their connections (PD monitoring metrics rose at the same time). Why the silent regions were activated has not been confirmed yet; the current suspicion is a bug triggered by taking TiFlash offline. I will post a clear conclusion once we have one.
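
For anyone who wants to check this on their own cluster, a sketch of how to confirm the hibernate-region setting and sample raftstore CPU directly; the hosts and ports are assumptions (20180 is TiKV's default status port):

```bash
# Check whether hibernate regions are enabled on TiKV, via any TiDB server:
mysql -h 127.0.0.1 -P 4000 -u root -e \
  "SHOW CONFIG WHERE type='tikv' AND name='raftstore.hibernate-regions';"

# Sample raftstore thread CPU counters from TiKV's Prometheus metrics endpoint:
curl -s http://127.0.0.1:20180/metrics | grep '^tikv_thread_cpu_seconds_total' | grep raftstore
```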

| username: wakaka | Original post link

Thanks a lot, boss!

| username: jansu-dev | Original post link

There are two issues with the cluster:

  1. Abnormal offline nodes in the cluster: the abnormal behavior comes from bugs that are unpatched in v5.0.6; to avoid them completely, you need to upgrade the database version. See Removed tombstone stores show again if transfer pd leader during scale in · Issue #4941 · tikv/pd · GitHub and PD client keeps reconnecting on error StoreTombstone · Issue #12506 · tikv/tikv · GitHub.
  2. Elevated PD monitoring metrics: caused by the two bugs above, which keep re-reporting the already-tombstoned TiKV store. This will resolve once the store nodes are taken offline cleanly; a cleanup sketch follows below.
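
A sketch of that cleanup, assuming a PD endpoint of 127.0.0.1:2379 (the lasting fix, per the issues above, is upgrading to a patched version):

```bash
# Remove lingering tombstone store records from PD so they stop being re-reported:
tiup ctl:v5.0.6 pd -u http://127.0.0.1:2379 store remove-tombstone
```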
| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.