Note:
This topic has been translated from a Chinese forum by GPT and might contain errors. Original topic: TiCDC频繁重启，checkpoint停止 (TiCDC restarts frequently and the checkpoint stops advancing)

[TiDB Usage Environment] Occurred in production environment, reproduced in test environment
[TiDB Version] v5.3.0
[Reproduction Path] Occurred when deleting data after adding a varchar(64) column to 64 sharded tables (xxx_statinfo_0 through xxx_statinfo_63)
[Encountered Problem: Phenomenon and Impact]
The 64 sharded tables are replicated to Kafka through TiCDC using the maxwell format (other formats showed no issues when tested). I executed the following on all 64 tables:
ALTER TABLE xxx_statinfo_0 ADD trace_id varchar(64) DEFAULT '' NOT NULL;
ALTER TABLE xxx_statinfo_1 ADD trace_id varchar(64) DEFAULT '' NOT NULL;
......
ALTER TABLE xxx_statinfo_63 ADD trace_id varchar(64) DEFAULT '' NOT NULL;
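The statements above were applied to all 64 tables; a shell loop roughly like the following covers the whole batch (a sketch only: the host, user, and database name are placeholders, port 4000 is the TiDB server from the topology below):
#!/bin/bash
# Sketch: add the new column to every sharded table xxx_statinfo_0 .. xxx_statinfo_63.
# DB_HOST, DB_USER and DB_NAME are placeholders for the real connection details.
DB_HOST=127.0.0.1
DB_USER=root
DB_NAME=mydb
for i in $(seq 0 63); do
  mysql -h "$DB_HOST" -P 4000 -u "$DB_USER" "$DB_NAME" \
    -e "ALTER TABLE xxx_statinfo_${i} ADD trace_id varchar(64) DEFAULT '' NOT NULL;"
done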
Then, deleting data from these 64 tables (iterating over the 64 tables and executing DELETE FROM xxx_statinfo_x WHERE xxx LIMIT 5000; a minimal sketch of this loop is included after the list below) results in the following situations:
- The checkpoint in cdc cli changefeed list does not change, and occasionally errors occur, as follows:
[root@localhost eric]# cdc cli changefeed list --pd=http://192.168.100.162:2379
[
{
"id": "socol-statinfo",
"summary": {
"state": "normal",
"tso": 440123479891640321,
"checkpoint": "2023-03-16 11:37:15.280",
"error": null
}
}
]
[root@localhost eric]# cdc cli changefeed list --pd=http://192.168.100.162:2379
[2023/03/16 11:37:32.650 +08:00] [WARN] [cli_changefeed_list.go:102] ["query changefeed info failed"] [error="Post \"http://192.168.100.166:8300/capture/owner/changefeed/query\": dial tcp 192.168.100.166:8300: connect: connection refused"]
[
{
"id": "socol-statinfo",
"summary": null
}
]
- Checking cdc_stderr.log shows “panic: interface conversion: interface {} is string, not uint8”, which looks similar to the issue described in the TiDB community forum topic “新增ticdc到kafka同步任务后ticdc组件不断重启” (the ticdc component keeps restarting after adding a ticdc-to-Kafka sync task), reply #4 from LingJin. However, my version is v5.3.0, and according to that post the mismatch should already be fixed in v5.0.4 and later.
- Deleting this task and recreating it with the original tso does not resolve the issue.
- Deleting this task and recreating it with a tso taken after the delete operation completes works normally and the checkpoint advances, but the issue recurs as soon as another delete operation runs.
- Tried unsafe reset, scaling the TiCDC component in and then back out, and recreating the task, but the issue persists.
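For reference, the delete loop mentioned above and the way the changefeed was removed and recreated look roughly like this (a sketch only: the WHERE condition, connection details, Kafka address, topic, and start tso are placeholders; protocol=maxwell matches the format used by the changefeed):
#!/bin/bash
# Sketch: one pass of the delete over the 64 sharded tables,
# at most 5000 rows per statement. <condition> stands for the real filter.
for i in $(seq 0 63); do
  mysql -h 127.0.0.1 -P 4000 -u root mydb \
    -e "DELETE FROM xxx_statinfo_${i} WHERE <condition> LIMIT 5000;"
done

# Sketch: remove the stuck changefeed and recreate it from a given tso.
cdc cli changefeed remove --pd=http://192.168.100.162:2379 --changefeed-id=socol-statinfo
cdc cli changefeed create --pd=http://192.168.100.162:2379 \
  --changefeed-id=socol-statinfo \
  --start-ts=<tso> \
  --sink-uri="kafka://<kafka-host>:9092/<topic>?protocol=maxwell"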
[Resource Configuration]
The configuration of the test environment is as follows:
[root@localhost eric]# tiup cluster display tidb-test
tiup is checking updates for component cluster ...
A new version of cluster is available:
The latest version: v1.11.3
Local installed version: v1.11.1
Update current component: tiup update cluster
Update all components: tiup update --all
Starting component `cluster`: /root/.tiup/components/cluster/v1.11.1/tiup-cluster /root/.tiup/components/cluster/v1.11.1/tiup-cluster display tidb-test
Cluster type: tidb
Cluster name: tidb-test
Cluster version: v5.3.0
Deploy user: tidb
SSH type: builtin
Dashboard URL: http://192.168.100.164:2379/dashboard
Grafana URL: http://192.168.100.161:3000
ID Role Host Ports OS/Arch Status Data Dir Deploy Dir
-- ---- ---- ----- ------- ------ -------- ----------
192.168.100.161:9093 alertmanager 192.168.100.161 9093/9094 linux/x86_64 Up /data/tidb-data/alertmanager-9093 /data/tidb-deploy/alertmanager-9093
192.168.100.161:8300 cdc 192.168.100.161 8300 linux/x86_64 Up /data/tidb-data/cdc-8300 /data/tidb-deploy/cdc-8300
192.168.100.166:8300 cdc 192.168.100.166 8300 linux/x86_64 Up /data/tidb-data/cdc-8300 /data/tidb-deploy/cdc-8300
192.168.100.161:3000 grafana 192.168.100.161 3000 linux/x86_64 Up - /data/tidb-deploy/grafana-3000
192.168.100.162:2379 pd 192.168.100.162 2379/2380 linux/x86_64 Up /data/tidb-data/pd-2379 /data/tidb-deploy/pd-2379
192.168.100.163:2379 pd 192.168.100.163 2379/2380 linux/x86_64 Up /data/tidb-data/pd-2379 /data/tidb-deploy/pd-2379
192.168.100.164:2379 pd 192.168.100.164 2379/2380 linux/x86_64 Up|L|UI /data/tidb-data/pd-2379 /data/tidb-deploy/pd-2379
192.168.100.161:9090 prometheus 192.168.100.161 9090 linux/x86_64 Up /data/tidb-data/prometheus-9090 /data/tidb-deploy/prometheus-9090
192.168.100.161:4000 tidb 192.168.100.161 4000/10080 linux/x86_64 Up - /data/tidb-deploy/tidb-4000
192.168.100.166:9000 tiflash 192.168.100.166 9000/8123/3930/20170/20292/8234 linux/x86_64 Up /data/tidb-data/tiflash-9000 /data/tidb-deploy/tiflash-9000
192.168.100.162:20160 tikv 192.168.100.162 20160/20180 linux/x86_64 Up /data/tidb-data/tikv-20160 /data/tidb-deploy/tikv-20160
192.168.100.163:20160 tikv 192.168.100.163 20160/20180 linux/x86_64 Up /data/tidb-data/tikv-20160 /data/tidb-deploy/tikv-20160
192.168.100.164:20160 tikv 192.168.100.164 20160/20180 linux/x86_64 Up /data/tidb-data/tikv-20160 /data/tidb-deploy/tikv-20160
Total nodes: 13
[Attachments: Screenshots/Logs/Monitoring]
cdc_log.tar.gz (223.5 KB)
cdc_stderr.log (18.0 KB)
[Others]
Table structure is as follows:
CREATE TABLE `xxx_statinfo_0` (
`id` int(10) NOT NULL AUTO_INCREMENT ,
`imei` varchar(30) NOT NULL DEFAULT '' ,
`device_no` varchar(128) NOT NULL DEFAULT '' ,
`action` tinyint(2) NOT NULL DEFAULT '0' ,
`seq` varchar(36) NOT NULL,
`source` tinyint(2) NOT NULL DEFAULT '0' ,
`img_size` int(11) unsigned NOT NULL DEFAULT '0' ,
`img_total` smallint(6) unsigned NOT NULL DEFAULT '0' ,
`vedio_duration` int(11) NOT NULL DEFAULT '0' ,
`vedio_size` int(11) NOT NULL DEFAULT '0' ,
`img_url` mediumtext NOT NULL , # Stores base64 values
`vedio_url` varchar(256) NOT NULL DEFAULT '' ,
`upload_time` datetime NOT NULL DEFAULT '0000-00-00 00:00:00' ,
`create_time` datetime NOT NULL DEFAULT '0000-00-00 00:00:00' ,
`update_time` datetime NOT NULL DEFAULT '0000-00-00 00:00:00' ,
`append_size` int(11) unsigned NOT NULL DEFAULT '0' ,
`total_size` int(11) unsigned NOT NULL DEFAULT '0' ,
`apk_version` varchar(50) NOT NULL DEFAULT '0' ,
`is_compress` tinyint(2) unsigned NOT NULL DEFAULT '0' ,
`error_code` int(5) unsigned NOT NULL DEFAULT '0' ,
`mosaic_type` tinyint(1) NOT NULL DEFAULT '0' ,
`mosaic_size` int(10) NOT NULL DEFAULT '0' ,
`resolution` int(10) NOT NULL DEFAULT '0' ,
`isCut` tinyint(1) NOT NULL DEFAULT '0' , # Previously added field with no issues
`trace_id` varchar(64) NOT NULL DEFAULT '' , # Recently added field causing issues
PRIMARY KEY (`id`) /*T![clustered_index] CLUSTERED */,
KEY `idx_imei_source_seq` (`imei`,`source`,`seq`),
KEY `idx_device_source_seq` (`device_no`,`source`,`seq`),
KEY `idx_create_time` (`create_time`)
);