Building Dual-Cloud Disaster Recovery: REPLACE Statements to Backfill Data Fail After Verification with the sync-diff-inspector Tool

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 双云灾备建设,使用sync-diff-inspector 工具校验后,replace补数据执行失败

| username: TiDBer_yyy

Version: 5.0.4
Architecture: Dual-cloud disaster recovery

Problem: After verifying data with the sync-diff-inspector tool, the REPLACE statements generated to backfill the differences fail to execute.
[Overview] sync-diff-inspector is used to check whether the data in the two TiDB clusters is consistent.
Result: The data is inconsistent, and executing the REPLACE statements written for the backfill fails.
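For context, a minimal sketch of the verify-then-backfill workflow this report follows. The config path, fix-SQL file name, and connection details below are assumptions and not taken from the original post:

./sync_diff_inspector --config=./diff-config.toml              # writes the differing rows as REPLACE statements into a fix-SQL file
mysql -h <target-host> -P 4000 -u root -p <schema> < fix.sql   # applying that file is where the "ERROR 1105 ... at line N" messages below come from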

Primary key and unique key (from the table DDL):

  PRIMARY KEY (`I_ID`) /*T![clustered_index] CLUSTERED */,
  UNIQUE KEY `FSI` (`I_FILE_ID`),
ERROR 1105 (HY000) at line 40950: can not be duplicated row, due to old row not found. handle 2396698 not found
ERROR 1105 (HY000) at line 40951: can not be duplicated row, due to old row not found. handle 2396699 not found
ERROR 1105 (HY000) at line 40952: can not be duplicated row, due to old row not found. handle 2396700 not found
ERROR 1105 (HY000) at line 40953: can not be duplicated row, due to old row not found. handle 2396701 not found
ERROR 1105 (HY000) at line 40956: can not be duplicated row, due to old row not found. handle 2396704 not found
ERROR 1105 (HY000) at line 40957: can not be duplicated row, due to old row not found. handle 2396705 not found
ERROR 1105 (HY000) at line 40964: can not be duplicated row, due to old row not found. handle 2396712 not found
  • Successful REPLACE: the row with ID=2 is written successfully.
  • Failed REPLACE: the row with ID=3 fails with the errors above.
  • After recollecting statistics the failure persists, and admin check table also reports an error:

mysql> admin check table f_file_info;
ERROR 8003 (HY000): f_file_info err:[admin:8223]index:<nil> != record:&admin.RecordData{Handle:54323, Values:[]types.Datum{types.Datum{k:0x5, decimal:0x0, length:0x0, i:0, collation:"utf8mb4_bin", b:[]uint8{0x65, 0x37, 0x34, 0x39, 0x35, 0x62, 0x33, 0x30, 0x34, 0x32, 0x66, 0x64, 0x39, 0x65, 0x38, 0x35, 0x65, 0x32, 0x61, 0x65, 0x37, 0x63, 0x36, 0x66, 0x66, 0x65, 0x31, 0x63, 0x31, 0x32, 0x34, 0x62, 0x66, 0x34, 0x65, 0x61, 0x36, 0x34, 0x35, 0x39}, x:interface {}(nil)}}}
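The REPLACE error means the unique index still holds a key that points at a handle whose row record is gone (a dangling index entry), so TiDB cannot find the old row it needs to delete; the admin check error shows the converse case, a row (handle 54323) with no matching index entry. Either way, the index and the row data have diverged after the unsafe recovery. A hedged way to spot-check a single value (the literal below is a placeholder, not taken from the post) is to compare an index lookup with a table scan:

-- If the index lookup sees the value but the table scan does not, the index entry is dangling.
SELECT * FROM f_file_info USE INDEX (FSI)    WHERE I_FILE_ID = '<value from a failing row>';
SELECT * FROM f_file_info IGNORE INDEX (FSI) WHERE I_FILE_ID = '<value from a failing row>';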

Detailed steps are as follows:

  1. Check which TiKV nodes the table's regions are on
mysql> SPLIT TABLE sbtest1 BETWEEN (0) AND (100000) REGIONS 8;
+--------------------+----------------------+
| TOTAL_SPLIT_REGION | SCATTER_FINISH_RATIO |
+--------------------+----------------------+
|                  8 |                    1 |
+--------------------+----------------------+
1 row in set (3.54 sec)
mysql> show table sbtest1 regions;
+-----------+--------------+--------------+-----------+-----------------+------------------+------------+---------------+------------+----------------------+------------------+
| REGION_ID | START_KEY    | END_KEY      | LEADER_ID | LEADER_STORE_ID | PEERS            | SCATTERING | WRITTEN_BYTES | READ_BYTES | APPROXIMATE_SIZE(MB) | APPROXIMATE_KEYS |
+-----------+--------------+--------------+-----------+-----------------+------------------+------------+---------------+------------+----------------------+------------------+
|      1596 | t_85_r       | t_85_r_12500 |      1624 |               3 | 1626, 1624, 1625 |          0 |          1740 |          0 |                    1 |                0 |
|      1600 | t_85_r_12500 | t_85_r_25000 |      1602 |              14 | 1601, 1602, 1603 |          0 |          1503 |          0 |                    1 |                0 |
|      1604 | t_85_r_25000 | t_85_r_37500 |      1627 |              13 | 1628, 1629, 1627 |          0 |          4178 |          0 |                    1 |                0 |
|      1608 | t_85_r_37500 | t_85_r_50000 |      1611 |               2 | 1609, 1610, 1611 |          0 |          1659 |          0 |                    1 |                0 |
|      1612 | t_85_r_50000 | t_85_r_62500 |      1632 |               1 | 1632, 1630, 1631 |          0 |          1725 |          0 |                    1 |                0 |
|      1616 | t_85_r_62500 | t_85_r_75000 |      1618 |              14 | 1617, 1618, 1619 |          0 |            27 |          0 |                    1 |                0 |
|      1620 | t_85_r_75000 | t_85_r_87500 |      1635 |               1 | 1635, 1633, 1634 |          0 |          1725 |          0 |                    1 |                0 |
|         5 | t_85_r_87500 |              |       792 |               4 | 792, 1296, 1326  |          0 |          4258 |    2470034 |                    1 |                0 |
|      1592 | t_85_        | t_85_r       |      1593 |               4 | 1593, 1594, 1595 |          0 |          1491 |          0 |                    3 |            11363 |
+-----------+--------------+--------------+-----------+-----------------+------------------+------------+---------------+------------+----------------------+------------------+
9 rows in set (0.01 sec)
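As an alternative to pd-ctl, the same region-to-store mapping can be read from SQL. A hedged sketch, assuming the table lives in the test schema:

SELECT p.region_id, p.store_id, s.address, p.is_leader
FROM information_schema.tikv_region_status r
JOIN information_schema.tikv_region_peers  p ON r.region_id = p.region_id
JOIN information_schema.tikv_store_status  s ON p.store_id  = s.store_id
WHERE r.db_name = 'test' AND r.table_name = 'sbtest1';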

- Execute on the PD node
tiup ctl:v5.0.4 pd --pd=host1:2379 -i

» store --jq=".stores[].store | { id, address, state_name}"
{"id":8,"address":"host1:20161","state_name":"Up"}
{"id":9,"address":"host2:20161","state_name":"Up"}
{"id":7,"address":"host1:20160","state_name":"Up"}
{"id":2,"address":"host3:20160","state_name":"Up"}
{"id":10,"address":"host2:20160","state_name":"Up"}
{"id":1,"address":"host3:20161","state_name":"Up"}

» region 1512
      "store_id": 10,
      "store_id": 1,
      "store_id": 8,

» region --jq=".regions[] | {id: .id, peer_stores: [.peers[].store_id] | select(length as $total | map(if .==(8,1,10) then . else empty end) | length>=$total-length) }"
{"id":1135,"peer_stores":[9,8,1]}
{"id":1270,"peer_stores":[8,10,1]}
{"id":805,"peer_stores":[10,7,1]}
{"id":1127,"peer_stores":[8,10,2]}
{"id":890,"peer_stores":[10,7,1]}
{"id":175,"peer_stores":[10,7,1]}
{"id":707,"peer_stores":[8,10,1]}
{"id":1536,"peer_stores":[1,8,10]}
{"id":712,"peer_stores":[8,10,1]}
{"id":1023,"peer_stores":[7,1,10]}
{"id":1223,"peer_stores":[9,8,1]}
{"id":1512,"peer_stores":[1,8,10]}
{"id":941,"peer_stores":[10,1,8]}
{"id":874,"peer_stores":[8,10,1]}
{"id":1520,"peer_stores":[1,8,10]}
{"id":1528,"peer_stores":[1,8,10]}
{"id":1008,"peer_stores":[10,8,2]}
{"id":917,"peer_stores":[2,8,10]}
{"id":669,"peer_stores":[2,10,8]}
{"id":133,"peer_stores":[1,10,8]}
{"id":184,"peer_stores":[8,1,10]}
{"id":1131,"peer_stores":[1,10,8]}
{"id":129,"peer_stores":[8,1,9]}
{"id":988,"peer_stores":[2,10,8]}
{"id":1087,"peer_stores":[10,7,1]}
{"id":977,"peer_stores":[2,10,8]}
{"id":971,"peer_stores":[1,9,8]}
{"id":1278,"peer_stores":[7,1,10]}
{"id":796,"peer_stores":[2,8,10]}
{"id":1141,"peer_stores":[9,8,1]}
{"id":1097,"peer_stores":[10,7,1]}
{"id":900,"peer_stores":[10,1,8]}
{"id":1019,"peer_stores":[10,1,8]}
{"id":1169,"peer_stores":[8,10,2]}
{"id":1145,"peer_stores":[10,1,7]}
{"id":859,"peer_stores":[1,8,10]}
{"id":896,"peer_stores":[8,10,2]}
{"id":866,"peer_stores":[10,7,1]}
{"id":153,"peer_stores":[8,1,9]}
  2. Simulate a failure
# Move the TiKV deployment directory aside
mv /data/tidb-deploy/tikv-20160 /data/tidb-deploy/tikv-20160_bak
# Kill the TiKV process (the PID comes from the ps output)
ps -ef | grep 20160
kill 12374

# Move the data directory aside; if disk space is insufficient, rm -rf it instead.
mv /data/tidb-data/tikv_data_20160 /data/tidb-data/tikv_data_20160_bak

tiup ctl:v5.0.4 pd --pd=host1:2379 -i

store --jq=".stores[].store | { id, address, state_name}"

config show | grep schedule-limit

config set leader-schedule-limit 0
config set region-schedule-limit 0
config set replica-schedule-limit 0
config set merge-schedule-limit 0
config set hot-region-schedule-limit 0
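Before moving on, it is worth confirming the limits actually took effect. A hedged check run non-interactively from the host shell:

tiup ctl:v5.0.4 pd --pd=host1:2379 config show | grep schedule-limit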

tiup cluster list
tiup cluster display tidb_expertise_5.0

  • Stop the TiDB nodes
tiup cluster stop tidb_xxx_5.0 -R=tidb -y
  • Stop the TiKV nodes
tiup cluster stop tidb_xxx_5.0 -R=tikv -y

Execute unsafe-recover on all of the stopped, healthy TiKV nodes to recover TiKV; data loss may occur at this step.

# bj3-all-tidb-xxx-01:
tiup ctl:v5.0.4 tikv  --db /data/tidb-data/tikv_data_20160/db unsafe-recover remove-fail-stores -s 8,1,10 --all-regions
# bj3-all-tidb-xxx-02:
tiup ctl:v5.0.4 tikv  --db /data/tidb-data/tikv_data_20160/db unsafe-recover remove-fail-stores -s 8,1,10 --all-regions
# bj3-all-tidb-xxx-03
tiup ctl:v5.0.4 tikv  --db /data/tidb-data/tikv_data_20161/db unsafe-recover remove-fail-stores -s 8,1,10 --all-regions
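After unsafe-recover completes, an optional hedged check (same --db paths as above) is to ask tikv-ctl for any regions that are still in an unrecoverable state before restarting the stores:

tiup ctl:v5.0.4 tikv --db /data/tidb-data/tikv_data_20160/db bad-regions
tiup ctl:v5.0.4 tikv --db /data/tidb-data/tikv_data_20161/db bad-regions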
  • Scale in the faulty TiKV nodes, then restart PD and TiKV
tiup cluster scale-in tidb_expertise_5.0 -N=host1:20160,host2:20161,host3:20160 -y 
tiup cluster stop tidb_expertise_5.0 -R=pd
tiup cluster start tidb_expertise_5.0 -R=pd,tikv


tiup ctl:v5.0.4 pd --pd=host1:2379 -i
config set leader-schedule-limit 4
config set region-schedule-limit 2048
config set replica-schedule-limit 64
config set merge-schedule-limit 8
config set hot-region-schedule-limit 4
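Still inside the same pd-ctl session, it is worth re-checking that only the surviving stores are listed and Up before resuming traffic (same store query as earlier):

store --jq=".stores[].store | { id, address, state_name}"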

[tidb@bj3-all-tidb-expertise-01 ~]$ tiup cluster display tidb_expertise_5.0
Starting component `cluster`: /home/tidb/.tiup/components/cluster/v1.10.2/tiup-cluster display tidb_expertise_5.0
Cluster type:       tidb
Cluster name:       tidb_expertise_5.0
Cluster version:    v5.0.4
Deploy user:        tidb
SSH type:           builtin
Dashboard URL:      http://bj3-all-tidb-xxx-01:2379/dashboard
Grafana URL:        http://bj3-all-tidb-xxx-01:3000
ID                   Role        Host           Ports        OS/Arch       Status        Data Dir                         Deploy Dir
--                   ----        ----           -----        -------       ------        --------                         ----------
bj3-all-tidb-xxx-02:8300   cdc         bj3-all-tidb-xxx-02  8300         linux/x86_64  Up            /data/tidb-data/cdc-8300         /data/tidb-deploy/cdc-8300
bj3-all-tidb-xxx-01:3000    grafana     bj3-all-tidb-xxx-01   3000         linux/x86_64  Up            -                                /data/tidb-deploy/grafana-3000
bj3-all-tidb-xxx-01:2379    pd          bj3-all-tidb-xxx-01   2379/2380    linux/x86_64  Up|L|UI       /data/tidb-data/pd_data          /data/tidb-deploy/pd-2379
bj3-all-tidb-xxx-01:9090    prometheus  bj3-all-tidb-xxx-01   9090         linux/x86_64  Up            /data/tidb-data/prometheus-9090  /data/tidb-deploy/prometheus-9090
bj3-all-tidb-xxx-02:4000   tidb        bj3-all-tidb-xxx-02  4000/10080   linux/x86_64  Up            -                                /data/tidb-deploy/tidb-4000
bj3-all-tidb-xxx-03:4000     tidb        bj3-all-tidb-xxx-03    4000/10080   linux/x86_64  Up            -                                /data/tidb-deploy/tidb-4000
bj3-all-tidb-xxx-02:20161  tikv        bj3-all-tidb-xxx-02  20161/20181  linux/x86_64  Disconnected  /data/tidb-data/tikv_data_20161  /data/tidb-deploy/tikv-20161
bj3-all-tidb-xxx-01:20161   tikv        bj3-all-tidb-xxx-01   20161/20181  linux/x86_64  Disconnected  /data/tidb-data/tikv_data_20161  /data/tidb-deploy/tikv-20161
bj3-all-tidb-xxx-03:20160    tikv        bj3-all-tidb-xxx-03    20160/20180  linux/x86_64  Disconnected  /data/tidb-data/tikv_data_20160  /data/tidb-deploy/tikv-20160
Total nodes: 9

Check the regions that were on the faulty TiKV instances

  • Regions whose majority of replicas were on the faulty TiKV stores:
tiup ctl:v5.0.4 pd --pd=bj3-all-tidb-xxx-01:2379 -i
» region --jq=".regions[] | {id: .id, peer_stores: [.peers[].store_id] | select(length as $total | map(if .==(8,1,10) then . else empty end) | length>=$total-length) }"
{"id":1512,"peer_stores":[1,8,10]}
{"id":1528,"peer_stores":[1,8,10]}
{"id":1131,"peer_stores":[1,10,8]}
{"id":133,"peer_stores":[1,10,8]}
{"id":707,"peer_stores":[8,10,1]}
{"id":1536,"peer_stores":[1,8,10]}
{"id":1270,"peer_stores":[8,10,1]}
{"id":874,"peer_stores":[8,10,1]}
{"id":184,"peer_stores":[8,1,10]}
{"id":712,"peer_stores":[8,10,1]}
{"id
| username: duzq | Original post link

It seems that unsafe recover has left the index and the row data inconsistent. Recovering the index with an admin statement, or rebuilding the index, should resolve the issue.
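For reference, a hedged sketch of what that could look like for the table in this thread; the index name is taken from the DDL quoted above (FSI), so verify it against the actual schema first:

-- Remove dangling index entries whose row records no longer exist (the "old row not found" case):
ADMIN CLEANUP INDEX f_file_info FSI;
-- Or rebuild index entries that are missing for existing rows:
ADMIN RECOVER INDEX f_file_info FSI;
-- Re-verify the table afterwards:
ADMIN CHECK TABLE f_file_info;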

| username: TiDBer_yyy | Original post link

After removing the unique index, the data import can be completed.

mysql> alter table f_file_info drop index I_FILE_ID;
Query OK, 0 rows affected (0.51 sec)
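A hedged follow-up once the backfill has finished and any duplicate I_FILE_ID values have been resolved: the unique index can be recreated and the table re-verified. Index and column names below follow the DDL quoted earlier:

ALTER TABLE f_file_info ADD UNIQUE KEY FSI (I_FILE_ID);
ADMIN CHECK TABLE f_file_info;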

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.