The count(1) result of the cluster is incorrect after simulating multiple node failures in TiKV

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 模拟多节点TiKV故障后,集群count(1) 结果不正确

| username: TiDBer_yyy

Cluster version: 5.0.4
Cluster architecture: two clusters, A → B, in a primary-backup setup, with data synchronized via CDC.
Operation process:

  1. Simulate a three-node TiKV failure
  2. Use sync-diff-inspector for data verification and data supplementation
  3. After restoring the A → B data synchronization architecture, the count(1) result is incorrect when executed on the A cluster:
mysql> select count(*) from sbtest1;
+----------+
| count(*) |
+----------+
|    50320 |
+----------+
1 row in set (0.01 sec)

mysql> select count(*) from sbtest1 where id>0;
+----------+
| count(*) |
+----------+
|    50000 |
+----------+
1 row in set (0.04 sec)

mysql> select min(id) from sbtest1;
+---------+
| min(id) |
+---------+
|       1 |
+---------+
1 row in set (0.00 sec)

After reloading the cluster and collecting statistics, the result is still incorrect.
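The gap between the two counts suggests the two queries read different copies of the data: an unfiltered count(*) can be served from whichever index is cheapest, while the id > 0 predicate forces a read through the primary key. A minimal sketch to narrow down which copy is inconsistent, assuming the default sysbench schema where the secondary index on column k is named k_1:

-- Compare the access paths chosen for the two counts
EXPLAIN SELECT COUNT(*) FROM sbtest1;
EXPLAIN SELECT COUNT(*) FROM sbtest1 WHERE id > 0;

-- Force the count off / onto the secondary index and compare the results
SELECT COUNT(*) FROM sbtest1 IGNORE INDEX (k_1);
SELECT COUNT(*) FROM sbtest1 USE INDEX (k_1);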

| username: xfworld | Original post link

  1. Each cluster has three nodes; how many replicas are configured?
  2. What steps and methods were used during the simulation?
  3. After data verification and data supplementation, how can there still be inconsistencies?
  4. In the above operation process, was there any step that was overlooked?
| username: alfred | Original post link

Is the data volume consistent after CDC synchronization is completed?

| username: TiDBer_yyy | Original post link

  1. The cluster uses 3 replicas.
  2. The simulation followed the steps in 【SOP 系列 18】TiUP 环境恢复 TiKV 副本 ([SOP Series 18] Recovering TiKV replicas in a TiUP environment) on the TiDB Q&A community.
  3. After the data was supplemented, the actual number of rows is 50,000, but count(1) returns 50,320. The question is why this happens.

Steps:

  1. Simulate region loss of 3 replicas. First, move the deploy installation directory, then kill the process, and then move the data directory.
  2. Disable scheduling (see the pd-ctl sketch after this list).
  3. Check the store_id corresponding to the 3 crashed machines; queries against the cluster now fail:
MySQL [test]> select count(1) from sbtest1;
ERROR 9010 (HY000): TiKV server reports stale command
  4. Stop the CDC task:
tiup ctl:v5.0.4 cdc changefeed  --pd=http://192.168.8.11:2379 pause --changefeed-id  dr-replication-task-5
  5. Stop TiKV in the faulty cluster:
tiup cluster stop dr-primary -R=tikv
  6. Execute unsafe-recover on the normal TiKV nodes:
tiup ctl:v5.0.4 tikv  --db /data/tidb-data/tikv_data_p_20161/db unsafe-recover remove-fail-stores -s 1,2,7 --all-regions
  7. Scale in the faulty TiKV nodes, then restart PD and TiKV:
tiup cluster scale-in dr-primary -N=192.168.8.11:20161,192.168.8.11:20162,192.168.8.12:20160 --force -y

tiup cluster stop dr-primary -R=pd
tiup cluster start dr-primary -R=pd,tikv
  8. Stop a normal TiKV node and run recreate-region:
tiup ctl:v5.0.4 tikv --db /data/tidb-data/tikv_data_p_20162/db recreate-region -p '192.168.8.11:2379' -r  4019
tiup ctl:v5.0.4 tikv --db /data/tidb-data/tikv_data_p_20162/db recreate-region -p '192.168.8.11:2379' -r  4031
tiup ctl:v5.0.4 tikv --db /data/tidb-data/tikv_data_p_20162/db recreate-region -p '192.168.8.11:2379' -r  4068
  9. Reload the cluster:
tiup cluster reload dr-primary  -y
  10. Verify the data with sync_diff_inspector (a sketch of the dr-check.toml config follows this list):
bin/sync_diff_inspector -config=conf/dr-check.toml
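Regarding step 2 (disabling scheduling): the exact commands are not shown above, but a common way to do this with pd-ctl, assuming the PD endpoint 192.168.8.11:2379 used elsewhere in this thread and that the original values are recorded so they can be restored afterwards, is roughly:

# Record the current scheduling limits so they can be restored later
tiup ctl:v5.0.4 pd -u http://192.168.8.11:2379 config show

# Stop PD from scheduling regions/leaders/replicas while the failed stores are handled
tiup ctl:v5.0.4 pd -u http://192.168.8.11:2379 config set region-schedule-limit 0
tiup ctl:v5.0.4 pd -u http://192.168.8.11:2379 config set leader-schedule-limit 0
tiup ctl:v5.0.4 pd -u http://192.168.8.11:2379 config set replica-schedule-limit 0
tiup ctl:v5.0.4 pd -u http://192.168.8.11:2379 config set merge-schedule-limit 0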
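Regarding step 10 (data verification): a minimal sketch of what a sync-diff-inspector configuration such as conf/dr-check.toml might look like, assuming the 2.0-style config layout and the sysbench table in the test schema; the hosts, ports, and credentials below are placeholders, and the exact keys depend on the sync-diff-inspector version in use:

# Sketch only; adjust to the sync-diff-inspector version actually deployed.
check-thread-count = 4
export-fix-sql = true                  # write SQL that would repair any differences found

[data-sources]
[data-sources.primary]                 # cluster A (upstream), placeholder address
host = "192.168.8.11"
port = 4000
user = "root"
password = ""

[data-sources.secondary]               # cluster B (downstream), placeholder address
host = "192.168.8.21"
port = 4000
user = "root"
password = ""

[task]
output-dir = "./output"
source-instances = ["primary"]
target-instance = "secondary"
target-check-tables = ["test.sbtest1"]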
| username: TiDBer_yyy | Original post link

After supplementing the data, the actual number of rows is 50,000, but executing count(1) returns 50,320. Issue: select count(1) from t1; is not accurate.

| username: yilong | Original post link

First, use admin check table to check if there is any inconsistency between the data and the index.
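For reference, a minimal sketch of such a check, assuming the sysbench table with its default secondary index k_1; ADMIN CHECK returns an error if the index entries and the row data disagree:

-- Check consistency between the row data and all indexes of the table
ADMIN CHECK TABLE sbtest1;

-- Check a single index (k_1 is sysbench's default secondary index)
ADMIN CHECK INDEX sbtest1 k_1;

If an inconsistency is reported, ADMIN CLEANUP INDEX and ADMIN RECOVER INDEX can be used to drop dangling index entries or rebuild missing ones before re-checking.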

| username: Meditator | Original post link

A 3-replica cluster, simulating the loss of all 3 replicas, and then undergoing disaster recovery: can it even be recovered at this point? Even if it is recovered, data, including system metadata, would be lost. In this scenario, disaster recovery is unnecessary.

| username: xfworld | Original post link

Is the data between clusters A and B consistent?

Have you verified it before and after stopping the CDC service?

Additionally, after cluster A was restored, what steps were taken to recover the data? How did you determine that the data was fully recovered?
unsafe-recover cannot guarantee that the data is intact; it can only restore the cluster's state…
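One way to compare the two sides at a specific point, as a rough sketch and assuming writes are paused and the table lives in the test schema on both clusters, is TiDB's built-in table checksum, run on cluster A and on cluster B and then compared:

-- Run on cluster A and on cluster B, then compare Checksum_crc64_xor,
-- Total_kvs and Total_bytes between the two results
USE test;
ADMIN CHECKSUM TABLE sbtest1;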

| username: TiDBer_yyy | Original post link

  1. Data consistency: verified twice.
    The official verification tool sync_diff_inspector was run twice: the first run was used to supplement the data between the A and B clusters, and the second run reported the data as consistent.

  2. unsafe-recover and recreate-region were used to recover the cluster's TiKV and to create empty regions for the regions that had lost all 3 replicas (in my own tests, recreate-region fails if the region is not empty); a sketch of how such regions can be listed follows below. After the cluster's TiKV returned to normal, sync_diff_inspector was used to run two verifications between the A and B clusters.
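As context for the step above, one way to list the regions that had lost every replica on the failed stores (the candidates for recreate-region), assuming the failed store IDs are 1, 2, and 7 as in the earlier unsafe-recover command and that TiDB can still reach PD:

-- Regions whose peers all sat on the failed stores 1, 2 and 7
SELECT region_id
FROM information_schema.tikv_region_peers
GROUP BY region_id
HAVING SUM(store_id NOT IN (1, 2, 7)) = 0;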

| username: yilong | Original post link

Are the results of admin check table consistent? Is there any inconsistency between the index and the data?

| username: TiDBer_yyy | Original post link

Will try to reproduce this later and follow up.

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.