Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.
Original topic: 模拟多节点TiKV故障后,集群count(1) 结果不正确 (After simulating a multi-node TiKV failure, the cluster's count(1) result is incorrect)
Cluster version: 5.0.4
Cluster architecture: two clusters in a primary-backup setup, A → B, with data synchronized via CDC.
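For reference, a changefeed for this kind of A → B replication is usually created with the cdc command-line tool along the following lines; this is only a sketch, and the downstream sink address and credentials are assumptions (only the PD address and changefeed ID appear later in this thread):
tiup ctl:v5.0.4 cdc changefeed create --pd=http://192.168.8.11:2379 --sink-uri="mysql://root:@192.168.8.21:4000/" --changefeed-id=dr-replication-task-5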
Operation process:
- Simulate three-node TiKV failure
- Use sync-diff-inspector for data verification and data supplementation
- After restoring the A → B data synchronization, count(1) executed on cluster A returns an incorrect result:
mysql> select count(*) from sbtest1;
+----------+
| count(*) |
+----------+
|    50320 |
+----------+
1 row in set (0.01 sec)
mysql> select count(*) from sbtest1 where id>0;
+----------+
| count(*) |
+----------+
|    50000 |
+----------+
1 row in set (0.04 sec)
mysql> select min(id) from sbtest1;
+---------+
| min(id) |
+---------+
|       1 |
+---------+
1 row in set (0.00 sec)
After reloading the cluster and collecting statistics, the result is still incorrect.
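For reference, statistics in TiDB are usually refreshed with ANALYZE TABLE; a minimal example (the exact statement used here is not shown in the thread):
analyze table sbtest1;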
Is the data volume consistent after CDC synchronization is completed?
- The cluster uses 3 replicas.
- The simulation followed the steps in 【SOP 系列 18】TiUP 环境恢复 TiKV 副本 ([SOP Series 18] Recovering TiKV replicas in a TiUP environment) on the TiDB Q&A community.
- After the data was supplemented, the actual number of rows is 50,000, but count(1) returns 50,320. The question is why this happens.
Steps:
- Simulate the loss of all 3 replicas of some Regions: first move the deploy (installation) directory, then kill the process, then move the data directory.
- Disable scheduling in PD.
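Scheduling is usually disabled by setting the PD scheduling limits to 0 with pd-ctl, roughly as follows; this is a sketch based on the usual SOP, and the exact parameters used in this thread are not shown:
tiup ctl:v5.0.4 pd -u http://192.168.8.11:2379 config set leader-schedule-limit 0
tiup ctl:v5.0.4 pd -u http://192.168.8.11:2379 config set region-schedule-limit 0
tiup ctl:v5.0.4 pd -u http://192.168.8.11:2379 config set replica-schedule-limit 0
tiup ctl:v5.0.4 pd -u http://192.168.8.11:2379 config set merge-schedule-limit 0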
- Check the store_ids corresponding to the 3 crashed machines (a pd-ctl lookup is sketched after the error output below); at this point queries fail:
MySQL [test]> select count(1) from sbtest1;
ERROR 9010 (HY000): TiKV server reports stale command
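The store_ids of the crashed instances can be looked up with pd-ctl; the output lists every store with its address and state, so the down stores can be matched to their IDs:
tiup ctl:v5.0.4 pd -u http://192.168.8.11:2379 store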
- Pause the CDC replication task.
tiup ctl:v5.0.4 cdc changefeed --pd=http://192.168.8.11:2379 pause --changefeed-id dr-replication-task-5
- Stop TiKV in the faulty cluster.
tiup cluster stop dr-primary -R=tikv
- Execute unsafe-recover remove-fail-stores on the remaining healthy TiKV nodes.
tiup ctl:v5.0.4 tikv --db /data/tidb-data/tikv_data_p_20161/db unsafe-recover remove-fail-stores -s 1,2,7 --all-regions
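For reference, the usual SOP first lists the Regions that had at least half of their replicas on the failed stores (1, 2, and 7 here) with a pd-ctl query along these lines; this is a sketch following the pattern in the TiKV Control documentation:
tiup ctl:v5.0.4 pd -u http://192.168.8.11:2379 region --jq=".regions[] | {id: .id, peer_stores: [.peers[].store_id] | select(length as \$total | map(if .==(1,2,7) then . else empty end) | length>=\$total-length)}"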
- Scale in the faulty TiKV nodes, then restart PD and TiKV.
tiup cluster scale-in dr-primary -N=192.168.8.11:20161,192.168.8.11:20162,192.168.8.12:20160 --force -y
tiup cluster stop dr-primary -R=pd
tiup cluster start dr-primary -R=pd,tikv
- Stop a healthy TiKV node and run recreate-region on it.
tiup ctl:v5.0.4 tikv --db /data/tidb-data/tikv_data_p_20162/db recreate-region -p '192.168.8.11:2379' -r 4019
tiup ctl:v5.0.4 tikv --db /data/tidb-data/tikv_data_p_20162/db recreate-region -p '192.168.8.11:2379' -r 4031
tiup ctl:v5.0.4 tikv --db /data/tidb-data/tikv_data_p_20162/db recreate-region -p '192.168.8.11:2379' -r 4068
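The Region IDs passed to recreate-region (4019, 4031, 4068 above) are the Regions that lost all of their replicas; they are usually found beforehand with a pd-ctl query for Regions that no longer have a leader, for example (again a sketch following the pattern in the TiKV Control documentation):
tiup ctl:v5.0.4 pd -u http://192.168.8.11:2379 region --jq='.regions[]|select(has("leader")|not)|{id: .id, peer_stores: [.peers[].store_id]}'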
- Reload the cluster.
tiup cluster reload dr-primary -y
- Verify the data.
bin/sync_diff_inspector -config=conf/dr-check.toml
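The dr-check.toml itself is not shown in the thread. As a rough, assumption-laden sketch (key names follow the current sync-diff-inspector documentation and may differ from the version actually used; all hosts, credentials, and table names are placeholders), such a configuration looks roughly like this:
# dr-check.toml (hypothetical sketch)
check-thread-count = 4
export-fix-sql = true

[data-sources]
[data-sources.upstream]
host = "192.168.8.11"
port = 4000
user = "root"
password = ""

[data-sources.downstream]
host = "192.168.8.21"
port = 4000
user = "root"
password = ""

[task]
output-dir = "./output"
source-instances = ["upstream"]
target-instance = "downstream"
target-check-tables = ["test.sbtest1"]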
After supplementing the data, the actual number of rows is 50,000, but executing count(1) returns 50,320. Issue: select count(1) from t1; is not accurate.
First, use admin check table to check whether the data and the indexes are inconsistent. A count(1) with no filter may be served entirely from a secondary index, while the filtered count reads the rows over the primary key, so an index/data mismatch left behind by the recovery would explain the differing results.
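A minimal sketch of that check, plus a way to compare an index read against a table read; the index name k_1 is an assumption based on the standard sysbench schema for sbtest1:
admin check table sbtest1;
select count(*) from sbtest1 ignore index(k_1);  -- count by reading the table rows, skipping the secondary index
select count(*) from sbtest1 use index(k_1);     -- count by reading the secondary index k_1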
With a 3-replica cluster, simulating the loss of all 3 replicas and then doing disaster recovery: can it even be recovered at that point? Even if the cluster is recovered, the data, including system metadata, would be lost, so disaster recovery is unnecessary in this scenario.
Is the data between clusters A and B consistent?
Have you verified it before and after stopping the CDC service?
Additionally, after cluster A was restored, what steps were taken to recover the data? How did you determine that the data was fully recovered?
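One hedged way to answer that last question is to run ADMIN CHECKSUM TABLE against the same table in both clusters once the changefeed has caught up, and compare the checksums (run in the database that holds sbtest1 on both cluster A and cluster B):
admin checksum table sbtest1;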
unsafe-recover cannot guarantee that the data is intact; it can only restore the cluster to a working state…
Are the results of admin check table consistent? Is there any inconsistency between the index and the data?