Inconsistency Between PD Storage Information and TiKV Storage Information

This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: PD存储信息与tikv存储信息不一致

| username: 普罗米修斯

[TiDB Usage Environment] Production Environment
[TiDB Version] v3.0.3
[Encountered Issue]
Querying region information with pd-ctl returns results.

Querying the same region directly with tikv-ctl returns "not found".

The following also does not match.
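For reference, the comparison described above can be reproduced with commands like the following (the PD address, data path, and region ID are placeholders):

```shell
# Ask PD for its view of a region's metadata (epoch, peers, leader)
pd-ctl -u http://127.0.0.1:2379 -d region 1001

# Ask TiKV for its local view of the same region; tikv-ctl opens the
# store's data directory offline, so stop the TiKV instance first
tikv-ctl --db /path/to/tikv/data/db raft region -r 1001
```

If PD returns the region but tikv-ctl reports "not found" on the store that PD lists as holding a peer, the two sides have diverged.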

| username: xfworld | Original post link

Was any operation performed before this situation occurred?

| username: 普罗米修斯 | Original post link

  1. Previously, a server crashed. That server hosted 4 TiKV nodes. After the server was restarted and the cluster was brought back up, only two of the TiKV nodes came up. Using unsafe-recover to remove these two nodes did not exclude them from the cluster; after another restart they came up again, but while the cluster was in use it reported region unavailable.
  2. Checking the cluster showed down-peers and 3 miss-peers. After raising the scheduling limit, the counts dropped to a certain point and then stopped. Manually clearing the down-peers still left region unavailable errors, and after restarting the cluster the down-peer count rose again.
  3. Following the community's suggestion, these two nodes were taken offline, but some regions and leaders could not be moved off them. Using pd-ctl to evict leaders, schedule leaders, and transfer regions was ineffective; the corresponding operator actions in the PD leader's logs ultimately timed out.
  4. These two TiKV nodes were then set to tombstone, the tombstone records were removed, and some regions were recreated. During this process the region status shown by tikv-ctl and by pd-ctl was found to be inconsistent.
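For context, the unsafe-recover / tombstone workflow described in the steps above is normally carried out with commands like these (store IDs, region IDs, and paths are placeholders; each TiKV instance must be stopped before tikv-ctl opens its data directory):

```shell
# On every surviving TiKV node, drop the Raft peers that lived on the failed stores
tikv-ctl --db /path/to/tikv/data/db unsafe-recover remove-fail-stores -s 4,5 --all-regions

# Once the failed stores are tombstones, clear them from PD's records
pd-ctl -u http://127.0.0.1:2379 -d store remove-tombstone

# Recreate a region whose replicas were all lost (data in that key range is gone)
tikv-ctl --db /path/to/tikv/data/db recreate-region -p 127.0.0.1:2379 -r 1001
```

If any of these steps is skipped or fails partway, PD and TiKV can end up with diverging region metadata, which matches the symptom reported here.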

| username: 普罗米修斯 | Original post link

This is a log from a TiKV node in the up state. For the “peer is not leader for region” errors reported below, I checked with pd-ctl and found that all the regions have leaders.
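If it helps, PD's leader view for the region named in a “peer is not leader for region” error can be cross-checked like this (region ID and PD address are placeholders):

```shell
# The "leader" field in the output is the peer PD currently believes is leader
pd-ctl -u http://127.0.0.1:2379 -d region 1001

# The same information via PD's HTTP API
curl -s http://127.0.0.1:2379/pd/api/v1/region/id/1001
```

A region can appear to have a leader in PD while a stale TiKV peer still rejects requests, since PD's view lags behind the Raft state on the stores.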

| username: redgame | Original post link

Are there any other errors occurring during use now?

| username: 普罗米修斯 | Original post link

It’s still the 9005 region unavailable error.
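As a sketch, the regions behind the 9005 (region unavailable) errors can be enumerated from PD (the address is a placeholder):

```shell
# List regions with a peer reported down, and regions missing replicas
pd-ctl -u http://127.0.0.1:2379 -d region check down-peer
pd-ctl -u http://127.0.0.1:2379 -d region check miss-peer
```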

| username: xfworld | Original post link

unsafe-recover is meant for recovery; if problems occur during that recovery, the only option is to rebuild.

Does this mean the bad nodes still have records in PD and were never completely removed, yet operations continued anyway?

There are a few questions you need to consider carefully:

  • Is there a backup of the data?
  • Is data loss acceptable?
  • If the answers to the above two questions are “no,” then you can only take this path, which is the most difficult:
    • Determine the scope of the lost replicas on the bad node.
    • Identify the range where the rebuilt region and PD’s region are inconsistent.
    • Forcefully remove the inconsistent regions and continue to recreate, ensuring consistency between PD and TiKV.
    • Restore the cluster’s working state and re-enable replica scheduling capability.

After the above operations, data may still be lost (depending on whether the remaining single-replica data is complete).
Actually, the best approach before any of these operations is to scale out new nodes first, make sure the replicas are safe, and only then deal with the bad nodes…
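The “forcefully remove the inconsistent regions” step above usually combines PD operators with offline tikv-ctl surgery; a rough sketch (all IDs and paths are placeholders, and whether each command applies depends on the actual damage):

```shell
# Ask PD to drop a specific peer of a region from a given store
pd-ctl -u http://127.0.0.1:2379 -d operator add remove-peer 1001 4

# If PD scheduling cannot help, set the region's local peer to tombstone
# offline on the TiKV that holds it (stop that instance first)
tikv-ctl --db /path/to/tikv/data/db tombstone -r 1001 -p 127.0.0.1:2379
```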

| username: 普罗米修斯 | Original post link

Is this caused by TiKV reporting inconsistent region versions?

| username: 普罗米修斯 | Original post link

The regions that could not be taken offline earlier cannot be removed through unsafe-recover. Do we now need to rebuild everything from scratch to bring the cluster back to normal?

| username: 普罗米修斯 | Original post link

  • The range of the rebuilt region and the region in PD that are inconsistent
  • After forcibly removing the inconsistent region, continue to recreate it to ensure consistency between PD and TiKV

Are there specific operation commands for these two steps?

| username: xfworld | Original post link

No, I manually checked it… very hard mode… :rofl:

| username: 普罗米修斯 | Original post link

Losing this data is acceptable. Are there any other solutions?

| username: xfworld | Original post link

Rebuild it… :upside_down_face:

It’s best to rebuild on a newer LTS version.

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.