The count(1) result of the cluster is incorrect after simulating multiple node failures in TiKV

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 模拟多节点TiKV故障后,集群count(1) 结果不正确

| username: TiDBer_yyy

Cluster version: 5.0.4
Cluster architecture: two clusters, A → B, in a primary-backup setup, with data synchronized via CDC.
Operation process:

  1. Simulate a three-node TiKV failure
  2. Use sync-diff-inspector for data verification and data supplementation
  3. After restoring the A → B data synchronization architecture, the count(1) result is incorrect when executed on the A cluster:
mysql> select count(*) from sbtest1;
+----------+
| count(*) |
+----------+
|    50320 |
+----------+
1 row in set (0.01 sec)

mysql> select count(*) from sbtest1 where id>0;
+----------+
| count(*) |
+----------+
|    50000 |
+----------+
1 row in set (0.04 sec)

mysql> select min(id) from sbtest1;
+---------+
| min(id) |
+---------+
|       1 |
+---------+
1 row in set (0.00 sec)

After reloading the cluster and collecting statistics, the result is still incorrect.
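The gap between the two counts suggests the two queries read different copies of the data: an unfiltered count(*) can be served from whichever index is cheapest, while the id > 0 predicate forces a read through the primary key. A minimal sketch to narrow down which copy is inconsistent, assuming the default sysbench schema where the secondary index on column k is named k_1:

-- Compare the access paths chosen for the two counts
EXPLAIN SELECT COUNT(*) FROM sbtest1;
EXPLAIN SELECT COUNT(*) FROM sbtest1 WHERE id > 0;

-- Force the count off / onto the secondary index and compare the results
SELECT COUNT(*) FROM sbtest1 IGNORE INDEX (k_1);
SELECT COUNT(*) FROM sbtest1 USE INDEX (k_1);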

| username: xfworld | Original post link

  1. Each cluster has three nodes; how many replicas are configured?
  2. What steps and methods were used during the simulation?
  3. After data verification and data supplementation, how can there still be inconsistencies?
  4. In the above operation process, was there any step that was overlooked?
| username: alfred | Original post link

Is the data volume consistent after CDC synchronization is completed?

| username: TiDBer_yyy | Original post link

  1. The cluster uses 3 replicas.
  2. The simulation followed the steps in 【SOP 系列 18】TiUP 环境恢复 TiKV 副本 ([SOP Series 18] Recovering TiKV replicas in a TiUP environment) on the TiDB Q&A community.
  3. After the data was supplemented, the actual number of rows is 50,000, but count(1) returns 50,320. The question is why this happens.

Steps:

  1. Simulate region loss of 3 replicas. First, move the deploy installation directory, then kill the process, and then move the data directory.
  2. Disable scheduling (see the pd-ctl sketch after this list).
  3. Check the store_id corresponding to the 3 crashed machines; queries against the cluster now fail:
MySQL [test]> select count(1) from sbtest1;
ERROR 9010 (HY000): TiKV server reports stale command
  4. Stop the CDC task:
tiup ctl:v5.0.4 cdc changefeed  --pd=http://192.168.8.11:2379 pause --changefeed-id  dr-replication-task-5
  5. Stop TiKV in the faulty cluster:
tiup cluster stop dr-primary -R=tikv
  6. Execute unsafe-recover on the normal TiKV nodes:
tiup ctl:v5.0.4 tikv  --db /data/tidb-data/tikv_data_p_20161/db unsafe-recover remove-fail-stores -s 1,2,7 --all-regions
  7. Scale in the faulty TiKV nodes, then restart PD and TiKV:
tiup cluster scale-in dr-primary -N=192.168.8.11:20161,192.168.8.11:20162,192.168.8.12:20160 --force -y

tiup cluster stop dr-primary -R=pd
tiup cluster start dr-primary -R=pd,tikv
  8. Stop a normal TiKV node and run recreate-region:
tiup ctl:v5.0.4 tikv --db /data/tidb-data/tikv_data_p_20162/db recreate-region -p '192.168.8.11:2379' -r  4019
tiup ctl:v5.0.4 tikv --db /data/tidb-data/tikv_data_p_20162/db recreate-region -p '192.168.8.11:2379' -r  4031
tiup ctl:v5.0.4 tikv --db /data/tidb-data/tikv_data_p_20162/db recreate-region -p '192.168.8.11:2379' -r  4068
  9. Reload the cluster:
tiup cluster reload dr-primary  -y
  10. Verify the data with sync_diff_inspector (a sketch of the dr-check.toml config follows this list):
bin/sync_diff_inspector -config=conf/dr-check.toml
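Regarding step 2 (disabling scheduling): the exact commands are not shown above, but a common way to do this with pd-ctl, assuming the PD endpoint 192.168.8.11:2379 used elsewhere in this thread and that the original values are recorded so they can be restored afterwards, is roughly:

# Record the current scheduling limits so they can be restored later
tiup ctl:v5.0.4 pd -u http://192.168.8.11:2379 config show

# Stop PD from scheduling regions/leaders/replicas while the failed stores are handled
tiup ctl:v5.0.4 pd -u http://192.168.8.11:2379 config set region-schedule-limit 0
tiup ctl:v5.0.4 pd -u http://192.168.8.11:2379 config set leader-schedule-limit 0
tiup ctl:v5.0.4 pd -u http://192.168.8.11:2379 config set replica-schedule-limit 0
tiup ctl:v5.0.4 pd -u http://192.168.8.11:2379 config set merge-schedule-limit 0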
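Regarding step 10 (data verification): a minimal sketch of what a sync-diff-inspector configuration such as conf/dr-check.toml might look like, assuming the 2.0-style config layout and the sysbench table in the test schema; the hosts, ports, and credentials below are placeholders, and the exact keys depend on the sync-diff-inspector version in use:

# Sketch only; adjust to the sync-diff-inspector version actually deployed.
check-thread-count = 4
export-fix-sql = true                  # write SQL that would repair any differences found

[data-sources]
[data-sources.primary]                 # cluster A (upstream), placeholder address
host = "192.168.8.11"
port = 4000
user = "root"
password = ""

[data-sources.secondary]               # cluster B (downstream), placeholder address
host = "192.168.8.21"
port = 4000
user = "root"
password = ""

[task]
output-dir = "./output"
source-instances = ["primary"]
target-instance = "secondary"
target-check-tables = ["test.sbtest1"]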
| username: TiDBer_yyy | Original post link

After supplementing the data, the actual number of rows is 50,000, but executing count(1) returns 50,320. Issue: select count(1) from t1; is not accurate.

| username: yilong | Original post link

First, use admin check table to check if there is any inconsistency between the data and the index.
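For reference, a minimal sketch of such a check, assuming the sysbench table with its default secondary index k_1; ADMIN CHECK returns an error if the index entries and the row data disagree:

-- Check consistency between the row data and all indexes of the table
ADMIN CHECK TABLE sbtest1;

-- Check a single index (k_1 is sysbench's default secondary index)
ADMIN CHECK INDEX sbtest1 k_1;

If an inconsistency is reported, ADMIN CLEANUP INDEX and ADMIN RECOVER INDEX can be used to drop dangling index entries or rebuild missing ones before re-checking.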

| username: Meditator | Original post link

A 3-replica cluster, simulating the loss of all 3 replicas, and then undergoing disaster recovery: can it even be recovered at this point? Even if it is recovered, data, including system metadata, would be lost. In this scenario, disaster recovery is unnecessary.

| username: xfworld | Original post link

Is the data between clusters A and B consistent?

Have you verified it before and after stopping the CDC service?

Additionally, after cluster A was restored, what steps were taken to recover the data? How did you determine that the data was fully recovered?
unsafe-recover cannot guarantee that the data is intact; it can only restore the cluster's state…
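One way to compare the two sides at a specific point, as a rough sketch and assuming writes are paused and the table lives in the test schema on both clusters, is TiDB's built-in table checksum, run on cluster A and on cluster B and then compared:

-- Run on cluster A and on cluster B, then compare Checksum_crc64_xor,
-- Total_kvs and Total_bytes between the two results
USE test;
ADMIN CHECKSUM TABLE sbtest1;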

| username: TiDBer_yyy | Original post link

  1. Data consistency: verified twice.
    The official verification tool sync_diff_inspector was run twice: the first run was used to supplement the data between the A and B clusters, and the second run reported the data as consistent.

  2. unsafe-recover and recreate-region were used to recover the cluster's TiKV and to create empty regions for the regions that had lost all 3 replicas (in my own tests, recreate-region fails if the region is not empty); a sketch of how such regions can be listed follows below. After the cluster's TiKV returned to normal, sync_diff_inspector was used to run two verifications between the A and B clusters.
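As context for the step above, one way to list the regions that had lost every replica on the failed stores (the candidates for recreate-region), assuming the failed store IDs are 1, 2, and 7 as in the earlier unsafe-recover command and that TiDB can still reach PD:

-- Regions whose peers all sat on the failed stores 1, 2 and 7
SELECT region_id
FROM information_schema.tikv_region_peers
GROUP BY region_id
HAVING SUM(store_id NOT IN (1, 2, 7)) = 0;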

| username: yilong | Original post link

Are the results of admin check table consistent? Is there any inconsistency between the index and the data?

| username: TiDBer_yyy | Original post link

Will try to reproduce this later and follow up.

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.