CDC Data Synchronization Test Between Two Clusters: Data Inconsistency After Stress Testing

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: CDC进行两个集群间数据同步测试,压测任务完成后数据不一致

| username: TiDBA

[TiDB Usage Environment] POC
[TiDB Version] 7.5.1
[Reproduction Path]
The source is a TiDB cluster with 1 tidb, 1 pd, and 3 tikv nodes created in the local DC, and the destination is a single-node TiDB cluster with 1 tidb, 1 pd, and 1 tikv created on a cloud host.

Steps are as follows:

Disable GC on the source
MySQL [test]> SET GLOBAL tidb_gc_enable=FALSE;
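Whether GC is actually off can be double-checked against the bookkeeping rows in mysql.tidb (a quick sanity-check sketch; the global variable tidb_gc_enable maps to the tikv_gc_enable row there):

```sql
-- Confirm GC is disabled and note the current safe point; the backup
-- snapshot and the later changefeed start-ts must stay above this
-- safe point, otherwise the data they need may already be collected.
SELECT VARIABLE_NAME, VARIABLE_VALUE
FROM mysql.tidb
WHERE VARIABLE_NAME IN ('tikv_gc_enable', 'tikv_gc_safe_point');
```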

Perform a full backup on the source
MySQL [(none)]> BACKUP DATABASE * TO 'local:///newsftp/bakcup20240305_03';

Transfer the file to the destination

Restore on the destination
MySQL [(none)]> RESTORE DATABASE * FROM 'local:///opt/bakcup20240305_03';

Compare the data, the comparison results are as follows
[tidb@localhost ~]$ sync_diff_inspector -C ./config.yaml
{"level":"warn","ts":"2024-03-05T20:26:06.99427+0800","logger":"etcd-client","caller":"v3@v3.5.10/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0010ee1c0/172.16.16.11:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
A total of 11 tables need to be compared

Comparing the table structure of `test`.`t` ... equivalent
Comparing the table structure of `test`.`warehouse` ... equivalent
Comparing the table data of `test`.`t` ... equivalent
Comparing the table data of `test`.`warehouse` ... equivalent
Comparing the table structure of `test`.`item` ... equivalent
Comparing the table structure of `test`.`new_order` ... equivalent
Comparing the table structure of `test`.`district` ... equivalent
Comparing the table data of `test`.`district` ... equivalent
Comparing the table data of `test`.`item` ... equivalent
Comparing the table data of `test`.`new_order` ... equivalent
Comparing the table structure of `test`.`orders` ... equivalent
Comparing the table structure of `test_wdd`.`test_wdd` ... equivalent
Comparing the table structure of `test`.`stock` ... equivalent
Comparing the table structure of `test`.`customer` ... equivalent
Comparing the table structure of `test`.`history` ... equivalent
Comparing the table structure of `test`.`order_line` ... equivalent
Comparing the table data of `test`.`orders` ... equivalent
Comparing the table data of `test_wdd`.`test_wdd` ... equivalent
Comparing the table data of `test`.`stock` ... equivalent
Comparing the table data of `test`.`customer` ... equivalent
Comparing the table data of `test`.`history` ... equivalent
Comparing the table data of `test`.`order_line` ... equivalent


Progress [============================================================>] 100% 0/0
A total of 11 tables have been compared and all are equal.
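For reference, a minimal sync_diff_inspector configuration of the shape used above (a sketch: the hosts, credentials, and table filter are placeholders, not taken from the original post):

```toml
check-thread-count = 4
export-fix-sql = true            # write fix SQL for any differing rows

[data-sources.upstream]
host = "<source-tidb-ip>"        # placeholder
port = 4000
user = "root"
password = ""

[data-sources.downstream]
host = "<cloud-host-ip>"         # placeholder
port = 4000
user = "root"
password = ""

[task]
output-dir = "./output"
source-instances = ["upstream"]
target-instance = "downstream"
target-check-tables = ["test.*", "test_wdd.*"]
```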

Configure CDC task
tiup cdc cli changefeed create --server=http://...:8300 --sink-uri="mysql://root:@...:4000" --changefeed-id="upstream-to-downstream" --start-ts="448171971630858243"

Check task status
[tidb@localhost ~]$ tiup cdc cli changefeed list
Checking updates for component cdc... Timedout (after 2s)
Starting component cdc: /home/tidb/.tiup/components/cdc/v7.5.1/cdc cli changefeed list
[
  {
    "id": "upstream-to-downstream",
    "namespace": "default",
    "summary": {
      "state": "normal",
      "tso": 448172355724509186,
      "checkpoint": "2024-03-05 20:30:59.911",
      "error": null
    }
  }
]

Perform database stress testing
[tidb@localhost ~]$ tiup bench tpcc --warehouses 20 --time 10m run -p "*******"

[Encountered Problem: Problem Phenomenon and Impact]

After the stress test, compare the data on the source and destination, and the data volume is inconsistent:
Source:
MySQL [test]> select 'customer ',count(*) from customer union all select 'district ',count(*) from district union all select 'history ',count(*) from history union all select 'item ',count(*) from item union all select 'new_order ',count(*) from new_order union all select 'order_line',count(*) from order_line union all select 'orders ',count(*) from orders union all select 'stock ',count(*) from stock union all select 'warehouse ',count(*) from warehouse order by 1;
+------------+----------+
| customer   | count(*) |
+------------+----------+
| customer   |   600000 |
| district   |      200 |
| history    |   601935 |
| item       |   100000 |
| new_order  |   179874 |
| orders     |   601934 |
| order_line |  6017699 |
| stock      |  2000000 |
| warehouse  |       20 |
+------------+----------+
9 rows in set, 1 warning (0.28 sec)

Destination:
MySQL [test]> select 'customer ',count(*) from customer union all select 'district ',count(*) from district union all select 'history ',count(*) from history union all select 'item ',count(*) from item union all select 'new_order ',count(*) from new_order union all select 'order_line',count(*) from order_line union all select 'orders ',count(*) from orders union all select 'stock ',count(*) from stock union all select 'warehouse ',count(*) from warehouse order by 1;
+------------+----------+
| customer   | count(*) |
+------------+----------+
| customer   |   600000 |
| district   |      200 |
| history    |   600891 |
| item       |   100000 |
| new_order  |   179874 |
| orders     |   601934 |
| order_line |  6017699 |
| stock      |  2000000 |
| warehouse  |       20 |
+------------+----------+
9 rows in set, 1 warning (0.02 sec)

The row count of the history table differs (601935 on the source vs 600891 on the destination).

The structure of the history table is as follows:
MySQL [test]> show create table history\G
*************************** 1. row ***************************
       Table: history
Create Table: CREATE TABLE `history` (
  `h_c_id` int(11) NOT NULL,
  `h_c_d_id` int(11) NOT NULL,
  `h_c_w_id` int(11) NOT NULL,
  `h_d_id` int(11) NOT NULL,
  `h_w_id` int(11) NOT NULL,
  `h_date` datetime DEFAULT NULL,
  `h_amount` decimal(6,2) DEFAULT NULL,
  `h_data` varchar(24) DEFAULT NULL,
  KEY `idx_h_w_id` (`h_w_id`),
  KEY `idx_h_c_w_id` (`h_c_w_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin
PARTITION BY HASH (`h_w_id`) PARTITIONS 20
1 row in set (0.00 sec)

[Resource Configuration] Enter TiDB Dashboard - Cluster Info - Hosts and take a screenshot of this page

[Attachments: Screenshots/Logs/Monitoring]

Monitoring screenshot

Command to start the task screenshot

Logs from the start of the task to the time of posting
cdc_download.log (1.2 MB)

| username: yiduoyunQ | Original post link

Run the diff again and have it output the differing rows, then check the TiCDC logs for clues.

| username: tidb菜鸟一只 | Original post link

Uh, as I understand it, isn't the data generated by the stress test going to keep changing anyway?

| username: WinterLiu | Original post link

Check if there are any errors in the synchronization.

| username: Soysauce520 | Original post link

Were the tables created with primary keys?

| username: dba远航 | Original post link

I suspect it is because the history table has no primary key.

| username: TiDBA | Original post link

The cluster was deployed with the default configuration; nothing was set up separately. However, when I created the replication task, there was no warning that these tables could not be synchronized. These tables hold the data from the tiup bench stress test.

| username: TiDBA | Original post link

I checked the logs for this stage, and there are no obvious errors.

| username: TiDBA | Original post link

I prepared 20 warehouses of data with tiup bench, ran a 10-minute stress test, then performed the full backup and restore, and then configured CDC.

| username: TiDBA | Original post link

Let me add that during the synchronization phase, there were no obvious errors in the CDC logs.

| username: redgame | Original post link

Reduce the load a bit and see where the difference is, it will be easier to check.

| username: 小龙虾爱大龙虾 | Original post link

TiCDC only replicates tables that have a primary key or a valid index (a unique index whose columns are all NOT NULL). The data inconsistency you are seeing is expected. For reference, see: TiCDC Overview | PingCAP Documentation Center
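Under that rule, at-risk tables can be listed up front with a query along these lines (a sketch: it approximates the rule by looking only for PRIMARY KEY / UNIQUE constraints, and does not check the NOT NULL requirement on unique-index columns):

```sql
-- Tables with neither a primary key nor any unique constraint;
-- these are the ones TiCDC cannot replicate reliably.
SELECT t.TABLE_SCHEMA, t.TABLE_NAME
FROM information_schema.TABLES t
LEFT JOIN information_schema.TABLE_CONSTRAINTS c
  ON  c.TABLE_SCHEMA = t.TABLE_SCHEMA
  AND c.TABLE_NAME   = t.TABLE_NAME
  AND c.CONSTRAINT_TYPE IN ('PRIMARY KEY', 'UNIQUE')
WHERE t.TABLE_TYPE = 'BASE TABLE'
  AND t.TABLE_SCHEMA NOT IN ('mysql', 'INFORMATION_SCHEMA', 'PERFORMANCE_SCHEMA', 'METRICS_SCHEMA')
  AND c.CONSTRAINT_NAME IS NULL;
```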

| username: 像风一样的男子 | Original post link

The person upstairs is right, the table synchronized by TiCDC needs to have at least one valid index.

| username: 不想干活 | Original post link

Were there any error messages during the process?

| username: yiduoyunQ | Original post link

Weren't the original tables and data prepared using tiup bench tpcc?

| username: kelvin | Original post link

CDC does indeed require replicated tables to have a primary key or a valid index. I've learned this the hard way before.

| username: TiDBA | Original post link

I have reviewed the corresponding stage and did not see any error logs. The CDC logs have been uploaded.

| username: TiDBA | Original post link

Yes: I used tiup bench to prepare data for 20 warehouses, stress tested for ten minutes, then performed the backup and restore and configured the replication task. After that was set up, I ran tiup bench for another ten minutes.

| username: TiDBA | Original post link

When I created the task, one table had no index at all and TiCDC reported that it could not be synchronized. These tables, however, produced no such warning, so I assumed they would replicate normally.
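Given the history table definition shown earlier (only non-unique KEYs, no primary key), one possible remediation is to rebuild it with an explicit primary key so TiCDC can replicate it. A sketch, assuming the new h_id column and the history_new name; note that in a partitioned table the partition column h_w_id must be part of every unique key, including the primary key:

```sql
CREATE TABLE history_new (
  h_id     bigint       NOT NULL AUTO_INCREMENT,
  h_c_id   int          NOT NULL,
  h_c_d_id int          NOT NULL,
  h_c_w_id int          NOT NULL,
  h_d_id   int          NOT NULL,
  h_w_id   int          NOT NULL,
  h_date   datetime     DEFAULT NULL,
  h_amount decimal(6,2) DEFAULT NULL,
  h_data   varchar(24)  DEFAULT NULL,
  PRIMARY KEY (h_id, h_w_id),          -- partition column must be included
  KEY idx_h_w_id (h_w_id),
  KEY idx_h_c_w_id (h_c_w_id)
) PARTITION BY HASH (h_w_id) PARTITIONS 20;

-- Copy the existing rows, then swap the tables while writes are paused.
INSERT INTO history_new
  (h_c_id, h_c_d_id, h_c_w_id, h_d_id, h_w_id, h_date, h_amount, h_data)
SELECT h_c_id, h_c_d_id, h_c_w_id, h_d_id, h_w_id, h_date, h_amount, h_data
FROM history;

RENAME TABLE history TO history_old, history_new TO history;
```

The changefeed would then need to be recreated (or the table re-synced) so the downstream picks up the new schema.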