Seeking advice on the duplicated store address issue caused by TiKV disk failure, scaling down, and then scaling up

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 求教关于tikv磁盘损坏,缩容后扩容导致的 duplicated store address问题

| username: 末0_0想

【TiDB Usage Environment】Test

【TiDB Version】6.5
【Reproduction Path】Due to a damaged disk, the original three-node TiKV cluster was reduced to two nodes. I then forcibly removed the failed TiKV with tiup cluster scale-in zdww-tidb --node 10.18.10.100:20160 --force, and afterwards scaled the node back out. To keep changes to a minimum, the scale-out configuration reused the original port. Finally, I ran the tiup cluster prune zdww-tidb cleanup operation (roughly the sequence sketched below).
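For reference, a minimal sketch of that sequence, assuming a scale-out topology file named scale-out.yaml (the file name is illustrative, not from the original post):

# forcibly remove the failed TiKV instance
tiup cluster scale-in zdww-tidb --node 10.18.10.100:20160 --force

# scale the node back out, reusing the original port in the topology file
tiup cluster scale-out zdww-tidb scale-out.yaml

# clean up instances that have reached the Tombstone state
tiup cluster prune zdww-tidb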

After scaling out, the new TiKV node has remained in the Offline state, and the following error occurs on startup:

Error: failed to start tikv: failed to start: 10.18.10.100 tikv-20160.service, please check the instance's log(/tidb-deploy/tikv-20160/log) for more detail.: timed out waiting for port 20160 to be started after 2m0s.

I checked the TiKV node log and found:

[2023/03/30 19:15:58.377 +08:00] [FATAL] [server.rs:1099] ["failed to start node: Other(\"[components/pd_client/src/util.rs:878]: duplicated store address: id:31001 address:\\\"10.18.10.100:20160\\\" version:\\\"6.5.0\\\" peer_address:\\\"10.18.10.100\\\" status_address:\\\"10.18.10.100:20180\\\" git_hash:\\\"47b81680f75adc4b7200480cea5dbe46ae07c4b5\\\" start_timestamp:1680174958 deploy_path:\\\"/tidb-deploy/tikv-20160/bin\\\" , already registered by id:7 address:\\\"10.18.10.100:20160\\\" state:Offline version:\\\"6.5.0\\\" peer_address:\\\"10.18.10.100:20160\\\" status_address:\\\"10.18.10.100:20180\\\" git_hash:\\\"47b81680f75adc4b7200480cea5dbe46ae07c4b5\\\" start_timestamp:1678552849 deploy_path:\\\"/tidb-deploy/tikv-20160/bin\\\" last_heartbeat:1679019937896881375 node_state:Removing \")"]

Many posts suggest clearing the PD cache.

tiup ctl:v6.5.0 pd -u http://10.18.100.162:2379 store 
tiup ctl:v6.5.0 pd -u 10.18.100.164:2379 -i
store delete 7

I tried this many times without success: executing store delete 7 either had no effect, or it reported that the state was already up to date. So I am seeking advice on how to handle this.
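For reference, a minimal pd-ctl sketch for inspecting the conflicting store before deciding what to do (store ID 7 is taken from the error log above; note that store delete only marks a store Offline, and it turns Tombstone only after all of its regions have migrated away):

# list all stores and their states
tiup ctl:v6.5.0 pd -u http://10.18.100.162:2379 store

# inspect the old store still registered at 10.18.10.100:20160; watch its state and region_count
tiup ctl:v6.5.0 pd -u http://10.18.100.162:2379 store 7

# marks the store Offline; it becomes Tombstone once region_count reaches 0
tiup ctl:v6.5.0 pd -u http://10.18.100.162:2379 store delete 7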

【Encountered Issues: Symptoms and Impact】

The new TiKV node stays in the Offline state.
TiKV reports a duplicated store address error on startup.

Additionally, I have three questions:
1. If there is a cache configuration in PD, how do I clear it?
2. Isn’t TiKV a three-replica system? If a node is re-added, will the data automatically synchronize from other nodes?
3. Does PD treat the TiKV nodes as a single cluster, or does it talk to each of the three TiKV nodes separately? The information online is quite mixed; can you point me to a clearer write-up?

[Screenshot: master node status]

[Screenshot: TiKV error]

[Screenshot: output of tiup ctl:v6.5.0 pd -u http://10.10.100.162:2379 store]
| username: h5n1 | Original post link

You can see store 31001 with pd-ctl store. The original store hasn't been cleaned up yet, and a new one was registered with the same address, which causes the conflict. Deal with the original store first.

The Offline status is the state a store enters after store delete, while it waits for region migration to complete. If you only have three machines, first scale out by adding a TiKV instance on another port of the same machine. After the expansion, watch whether the regions on the original offline store migrate away; it will eventually reach the Tombstone state, and then you can run prune. If problems come up along the way, refer to 专栏 - TiKV缩容下线异常处理的三板斧 | TiDB 社区 (a TiDB community column on handling stuck TiKV scale-in).
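To make this concrete, a minimal sketch of scaling out on a different port of the same machine and then watching the old store drain. The topology file name scale-out.yaml, port 20161, and the directory paths are illustrative assumptions, not values from the original thread:

# scale-out.yaml (hypothetical content): a second TiKV instance on the same host
# tikv_servers:
#   - host: 10.18.10.100
#     port: 20161
#     status_port: 20181
#     deploy_dir: /tidb-deploy/tikv-20161
#     data_dir: /tidb-data/tikv-20161

tiup cluster scale-out zdww-tidb scale-out.yaml

# watch region_count of the old offline store drop toward 0
tiup ctl:v6.5.0 pd -u http://10.18.100.162:2379 store 7

# once the old store shows Tombstone, clean it up
tiup cluster prune zdww-tidb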

Regarding your questions:

  1. PD's region information is reported by the stores through heartbeats; there is no separate cache to clear. Once a TiKV holds no more regions, there is nothing left that needs cleaning up (see the sketch after this list for how to check).
  2. Replicas synchronize automatically during normal operation, but in your current situation the address conflict gets in the way.
  3. PD allocates the txn/region/store IDs and balances regions, scheduling how regions are distributed among the TiKV nodes.
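A minimal sketch for checking what the old store still holds, using the same pd-ctl entry point as earlier in the thread (store ID 7 comes from the error log):

# list the regions that still have a peer on the old store
tiup ctl:v6.5.0 pd -u http://10.18.100.162:2379 region store 7

If this returns no regions, the store should move on to Tombstone on its own.
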
| username: 末0_0想 | Original post link

Hello! I followed your instructions, added a TiKV instance on a new port on the faulty node, and started it. However, the node that was originally in the Offline state still hasn't changed its status. I can see the REGION_COUNT of the new node increasing, but the old node remains unchanged.

I read the post you sent me, but since I don’t have a deep understanding of TiDB, I didn’t grasp the purpose of manually scheduling the migration of regions. Could you please explain it in detail when you have time?

As I understand it, TiKV needs three nodes. If one node is damaged, PD's scheduling waits; only after a new node is added to bring the count back to three does PD resume normal scheduling. But at that point, how do you release the damaged node, and how do you clean up its records in PD? Please advise.


| username: h5n1 | Original post link

Your understanding is correct. You can wait a while and see, or you can manually add scheduling with pd-ctl operator add remove-peer to drain the original store (a sketch follows).
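A minimal sketch of that manual scheduling. The region ID 1234 is a placeholder; take real region IDs from the region store 7 output shown earlier:

# ask PD to remove the peer of one region from the old store (store ID 7)
tiup ctl:v6.5.0 pd -u http://10.18.100.162:2379 operator add remove-peer 1234 7

Repeat for each region that still has a peer on store 7; once the store's region_count reaches 0 it becomes Tombstone and can be pruned.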

| username: 末0_0想 | Original post link

Thank you for your support. It did indeed take more than a day for the TiKV node to change automatically from Offline to Up. It seems that once three TiKV nodes were available again, the data synchronized automatically, and the previously damaged node was also restored and became the fourth node.