Note:
This topic has been translated from a Chinese forum by GPT and might contain errors. Original topic: Dual-cloud disaster recovery setup; after verifying with the sync-diff-inspector tool, executing the replace statements to backfill data fails
Version: 5.0.4
Architecture: Dual-cloud disaster recovery
Problem: after verifying the data with the sync-diff-inspector tool, executing the generated replace statements fails.
[Overview] sync-diff-inspector was used to check whether the data in the two TiDB clusters is consistent.
Result: the data is inconsistent, and the replace statements written to fix it fail on execution.
Primary key and unique key on the table:
PRIMARY KEY (`I_ID`) /*T![clustered_index] CLUSTERED */,
UNIQUE KEY `FSI` (`I_FILE_ID`),
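The fix SQL generated by sync-diff-inspector was applied through the mysql client, roughly as sketched below (host, user, database and file name are placeholders, not the actual values); the "at line 40950" in the errors that follow is the mysql client reporting which line of that file failed:
# Apply the fix SQL file produced by sync-diff-inspector (all names are placeholders)
mysql -h <target-tidb-host> -P 4000 -u <user> -p <db_name> < <fix-sql-file>.sql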
ERROR 1105 (HY000) at line 40950: can not be duplicated row, due to old row not found. handle 2396698 not found
ERROR 1105 (HY000) at line 40951: can not be duplicated row, due to old row not found. handle 2396699 not found
ERROR 1105 (HY000) at line 40952: can not be duplicated row, due to old row not found. handle 2396700 not found
ERROR 1105 (HY000) at line 40953: can not be duplicated row, due to old row not found. handle 2396701 not found
ERROR 1105 (HY000) at line 40956: can not be duplicated row, due to old row not found. handle 2396704 not found
ERROR 1105 (HY000) at line 40957: can not be duplicated row, due to old row not found. handle 2396705 not found
ERROR 1105 (HY000) at line 40964: can not be duplicated row, due to old row not found. handle 2396712 not found
[Screenshot] Successful replace
[Screenshot] Failed replace
[Screenshot] Successful ID=2, failed ID=3
The failure persists after re-collecting statistics.
Running admin check table on the table also fails:
mysql> admin check table f_file_info;
ERROR 8003 (HY000): f_file_info err:[admin:8223]index:<nil> != record:&admin.RecordData{Handle:54323, Values:[]types.Datum{types.Datum{k:0x5, decimal:0x0, length:0x0, i:0, collation:"utf8mb4_bin", b:[]uint8{0x65, 0x37, 0x34, 0x39, 0x35, 0x62, 0x33, 0x30, 0x34, 0x32, 0x66, 0x64, 0x39, 0x65, 0x38, 0x35, 0x65, 0x32, 0x61, 0x65, 0x37, 0x63, 0x36, 0x66, 0x66, 0x65, 0x31, 0x63, 0x31, 0x32, 0x34, 0x62, 0x66, 0x34, 0x65, 0x61, 0x36, 0x34, 0x35, 0x39}, x:interface {}(nil)}}}
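Both the replace errors and the admin check error point to the same thing: an entry of the unique index FSI and the clustered record no longer agree. A minimal cross-check sketch is below; the connection details are placeholders, and it assumes the Datum bytes in the admin check error decode to the I_FILE_ID value of handle 54323:
# Compare a read that goes through the FSI index with one that bypasses it;
# if the two results differ, the index entry and the row are out of sync for this key.
mysql -h <tidb-host> -P 4000 -u root <db> <<'SQL'
SELECT I_ID, I_FILE_ID FROM f_file_info USE INDEX (FSI)
 WHERE I_FILE_ID = 'e7495b3042fd9e85e2ae7c6ffe1c124bf4ea6459';
SELECT I_ID, I_FILE_ID FROM f_file_info IGNORE INDEX (FSI)
 WHERE I_FILE_ID = 'e7495b3042fd9e85e2ae7c6ffe1c124bf4ea6459';
SQL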
Detailed steps are as follows:
- Split the test table into Regions and check which TiKV nodes they are on
mysql> SPLIT TABLE sbtest1 BETWEEN (0) AND (100000) REGIONS 8;
+--------------------+----------------------+
| TOTAL_SPLIT_REGION | SCATTER_FINISH_RATIO |
+--------------------+----------------------+
| 8 | 1 |
+--------------------+----------------------+
1 row in set (3.54 sec)
mysql> show table sbtest1 regions;
+-----------+--------------+--------------+-----------+-----------------+------------------+------------+---------------+------------+----------------------+------------------+
| REGION_ID | START_KEY | END_KEY | LEADER_ID | LEADER_STORE_ID | PEERS | SCATTERING | WRITTEN_BYTES | READ_BYTES | APPROXIMATE_SIZE(MB) | APPROXIMATE_KEYS |
+-----------+--------------+--------------+-----------+-----------------+------------------+------------+---------------+------------+----------------------+------------------+
| 1596 | t_85_r | t_85_r_12500 | 1624 | 3 | 1626, 1624, 1625 | 0 | 1740 | 0 | 1 | 0 |
| 1600 | t_85_r_12500 | t_85_r_25000 | 1602 | 14 | 1601, 1602, 1603 | 0 | 1503 | 0 | 1 | 0 |
| 1604 | t_85_r_25000 | t_85_r_37500 | 1627 | 13 | 1628, 1629, 1627 | 0 | 4178 | 0 | 1 | 0 |
| 1608 | t_85_r_37500 | t_85_r_50000 | 1611 | 2 | 1609, 1610, 1611 | 0 | 1659 | 0 | 1 | 0 |
| 1612 | t_85_r_50000 | t_85_r_62500 | 1632 | 1 | 1632, 1630, 1631 | 0 | 1725 | 0 | 1 | 0 |
| 1616 | t_85_r_62500 | t_85_r_75000 | 1618 | 14 | 1617, 1618, 1619 | 0 | 27 | 0 | 1 | 0 |
| 1620 | t_85_r_75000 | t_85_r_87500 | 1635 | 1 | 1635, 1633, 1634 | 0 | 1725 | 0 | 1 | 0 |
| 5 | t_85_r_87500 | | 792 | 4 | 792, 1296, 1326 | 0 | 4258 | 2470034 | 1 | 0 |
| 1592 | t_85_ | t_85_r | 1593 | 4 | 1593, 1594, 1595 | 0 | 1491 | 0 | 3 | 11363 |
+-----------+--------------+--------------+-----------+-----------------+------------------+------------+---------------+------------+----------------------+------------------+
9 rows in set (0.01 sec)
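The LEADER_STORE_ID and PEERS columns above are store IDs. They can also be resolved to TiKV addresses from SQL instead of pd-ctl, via information_schema (a sketch; <tidb-host> is a placeholder):
# TIKV_STORE_STATUS maps each store id to its address and state
mysql -h <tidb-host> -P 4000 -u root -e "SELECT STORE_ID, ADDRESS, STORE_STATE_NAME FROM information_schema.TIKV_STORE_STATUS;"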
- Execute on the PD node
tiup ctl:v5.0.4 pd --pd=host1:2379 -i
» store --jq=".stores[].store | { id, address, state_name}"
{"id":8,"address":"host1:20161","state_name":"Up"}
{"id":9,"address":"host2:20161","state_name":"Up"}
{"id":7,"address":"host1:20160","state_name":"Up"}
{"id":2,"address":"host3:20160","state_name":"Up"}
{"id":10,"address":"host2:20160","state_name":"Up"}
{"id":1,"address":"host3:20161","state_name":"Up"}
» region 1512
"store_id": 10,
"store_id": 1,
"store_id": 8,
» region --jq=".regions[] | {id: .id, peer_stores: [.peers[].store_id] | select(length as $total | map(if .==(8,1,10) then . else empty end) | length>=$total-length) }"
{"id":1135,"peer_stores":[9,8,1]}
{"id":1270,"peer_stores":[8,10,1]}
{"id":805,"peer_stores":[10,7,1]}
{"id":1127,"peer_stores":[8,10,2]}
{"id":890,"peer_stores":[10,7,1]}
{"id":175,"peer_stores":[10,7,1]}
{"id":707,"peer_stores":[8,10,1]}
{"id":1536,"peer_stores":[1,8,10]}
{"id":712,"peer_stores":[8,10,1]}
{"id":1023,"peer_stores":[7,1,10]}
{"id":1223,"peer_stores":[9,8,1]}
{"id":1512,"peer_stores":[1,8,10]}
{"id":941,"peer_stores":[10,1,8]}
{"id":874,"peer_stores":[8,10,1]}
{"id":1520,"peer_stores":[1,8,10]}
{"id":1528,"peer_stores":[1,8,10]}
{"id":1008,"peer_stores":[10,8,2]}
{"id":917,"peer_stores":[2,8,10]}
{"id":669,"peer_stores":[2,10,8]}
{"id":133,"peer_stores":[1,10,8]}
{"id":184,"peer_stores":[8,1,10]}
{"id":1131,"peer_stores":[1,10,8]}
{"id":129,"peer_stores":[8,1,9]}
{"id":988,"peer_stores":[2,10,8]}
{"id":1087,"peer_stores":[10,7,1]}
{"id":977,"peer_stores":[2,10,8]}
{"id":971,"peer_stores":[1,9,8]}
{"id":1278,"peer_stores":[7,1,10]}
{"id":796,"peer_stores":[2,8,10]}
{"id":1141,"peer_stores":[9,8,1]}
{"id":1097,"peer_stores":[10,7,1]}
{"id":900,"peer_stores":[10,1,8]}
{"id":1019,"peer_stores":[10,1,8]}
{"id":1169,"peer_stores":[8,10,2]}
{"id":1145,"peer_stores":[10,1,7]}
{"id":859,"peer_stores":[1,8,10]}
{"id":896,"peer_stores":[8,10,2]}
{"id":866,"peer_stores":[10,7,1]}
{"id":153,"peer_stores":[8,1,9]}
- Simulate failure
# Move the TiKV deploy directory aside
mv /data/tidb-deploy/tikv-20160 /data/tidb-deploy/tikv-20160_bak
# Kill the tikv-server process
ps -ef|grep 20160
kill 12374
# Move the data directory aside; if disk space is insufficient, rm -rf it instead.
mv /data/tidb-data/tikv_data_20160 /data/tidb-data/tikv_data_20160_bak
tiup ctl:v5.0.4 pd --pd=host1:2379 -i
store --jq=".stores[].store | { id, address, state_name}"
config show | grep schedule-limit
config set leader-schedule-limit 0
config set region-schedule-limit 0
config set replica-schedule-limit 0
config set merge-schedule-limit 0
config set hot-region-schedule-limit 0
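Before moving on, it is worth reading the limits back to confirm scheduling is really off; a quick sketch of the non-interactive form, reusing the PD endpoint above:
# All five *-schedule-limit values should now report 0
tiup ctl:v5.0.4 pd --pd=host1:2379 config show | grep schedule-limit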
tiup cluster list
tiup cluster display tidb_expertise_5.0
- Stop the TiDB nodes
tiup cluster stop tidb_xxx_5.0 -R=tidb -y
- Stop the TiKV nodes
tiup cluster stop tidb_xxx_5.0 -R=tikv -y
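tikv-ctl unsafe-recover operates directly on the local RocksDB data, so every tikv-server must really be down before the next step; a quick check on each node (a sketch):
# No tikv-server process should be left running on this node
ps -ef | grep tikv-server | grep -v grep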
- Run unsafe-recover on all of the surviving (now stopped) TiKV nodes to recover the Regions; data loss may occur at this point
# bj3-all-tidb-xxx-01:
tiup ctl:v5.0.4 tikv --db /data/tidb-data/tikv_data_20160/db unsafe-recover remove-fail-stores -s 8,1,10 --all-regions
# bj3-all-tidb-xxx-02:
tiup ctl:v5.0.4 tikv --db /data/tidb-data/tikv_data_20160/db unsafe-recover remove-fail-stores -s 8,1,10 --all-regions
# bj3-all-tidb-xxx-03
tiup ctl:v5.0.4 tikv --db /data/tidb-data/tikv_data_20161/db unsafe-recover remove-fail-stores -s 8,1,10 --all-regions
- Scale in the faulty TiKV nodes, restart PD and TiKV, and restore the scheduling limits
tiup cluster scale-in tidb_expertise_5.0 -N=host1:20160,host2:20161,host3:20160 -y
tiup cluster stop tidb_expertise_5.0 -R=pd
tiup cluster start tidb_expertise_5.0 -R=pd,tikv
tiup ctl:v5.0.4 pd --pd=host1:2379 -i
config set leader-schedule-limit 4
config set region-schedule-limit 2048
config set replica-schedule-limit 64
config set merge-schedule-limit 8
config set hot-region-schedule-limit 4
[tidb@bj3-all-tidb-expertise-01 ~]$ tiup cluster display tidb_expertise_5.0
Starting component `cluster`: /home/tidb/.tiup/components/cluster/v1.10.2/tiup-cluster display tidb_expertise_5.0
Cluster type: tidb
Cluster name: tidb_expertise_5.0
Cluster version: v5.0.4
Deploy user: tidb
SSH type: builtin
Dashboard URL: http://bj3-all-tidb-xxx-01:2379/dashboard
Grafana URL: http://bj3-all-tidb-xxx-01:3000
ID Role Host Ports OS/Arch Status Data Dir Deploy Dir
-- ---- ---- ----- ------- ------ -------- ----------
bj3-all-tidb-xxx-02:8300 cdc bj3-all-tidb-xxx-02 8300 linux/x86_64 Up /data/tidb-data/cdc-8300 /data/tidb-deploy/cdc-8300
bj3-all-tidb-xxx-01:3000 grafana bj3-all-tidb-xxx-01 3000 linux/x86_64 Up - /data/tidb-deploy/grafana-3000
bj3-all-tidb-xxx-01:2379 pd bj3-all-tidb-xxx-01 2379/2380 linux/x86_64 Up|L|UI /data/tidb-data/pd_data /data/tidb-deploy/pd-2379
bj3-all-tidb-xxx-01:9090 prometheus bj3-all-tidb-xxx-01 9090 linux/x86_64 Up /data/tidb-data/prometheus-9090 /data/tidb-deploy/prometheus-9090
bj3-all-tidb-xxx-02:4000 tidb bj3-all-tidb-xxx-02 4000/10080 linux/x86_64 Up - /data/tidb-deploy/tidb-4000
bj3-all-tidb-xxx-03:4000 tidb bj3-all-tidb-xxx-03 4000/10080 linux/x86_64 Up - /data/tidb-deploy/tidb-4000
bj3-all-tidb-xxx-02:20161 tikv bj3-all-tidb-xxx-02 20161/20181 linux/x86_64 Disconnected /data/tidb-data/tikv_data_20161 /data/tidb-deploy/tikv-20161
bj3-all-tidb-xxx-01:20161 tikv bj3-all-tidb-xxx-01 20161/20181 linux/x86_64 Disconnected /data/tidb-data/tikv_data_20161 /data/tidb-deploy/tikv-20161
bj3-all-tidb-xxx-03:20160 tikv bj3-all-tidb-xxx-03 20160/20180 linux/x86_64 Disconnected /data/tidb-data/tikv_data_20160 /data/tidb-deploy/tikv-20160
Total nodes: 9
- Check the Regions on the faulty TiKV instances
Regions with a majority of their replicas on the failed TiKV stores:
tiup ctl:v5.0.4 pd --pd=bj3-all-tidb-xxx-01:2379 -i
» region --jq=".regions[] | {id: .id, peer_stores: [.peers[].store_id] | select(length as $total | map(if .==(8,1,10) then . else empty end) | length>=$total-length) }"
{"id":1512,"peer_stores":[1,8,10]}
{"id":1528,"peer_stores":[1,8,10]}
{"id":1131,"peer_stores":[1,10,8]}
{"id":133,"peer_stores":[1,10,8]}
{"id":707,"peer_stores":[8,10,1]}
{"id":1536,"peer_stores":[1,8,10]}
{"id":1270,"peer_stores":[8,10,1]}
{"id":874,"peer_stores":[8,10,1]}
{"id":184,"peer_stores":[8,1,10]}
{"id":712,"peer_stores":[8,10,1]}
{"id