TiCDC Frequently Restarts, Checkpoint Stops

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TiCDC频繁重启,checkpoint停止

| username: 脚本小王子

[TiDB Usage Environment] Occurred in production environment, reproduced in test environment
[TiDB Version] v5.3.0
[Reproduction Path] Occurred when deleting data after adding a varchar(64) field to 64 partitioned tables
[Encountered Problem: Phenomenon and Impact]
64 partitioned tables are synchronized to Kafka via TiCDC using the maxwell format (testing showed no issues with other formats). The following was executed on these 64 tables:

ALTER TABLE xxx_statinfo_0 ADD trace_id varchar(64) DEFAULT '' NOT NULL;
ALTER TABLE xxx_statinfo_1 ADD trace_id varchar(64) DEFAULT '' NOT NULL;
......
ALTER TABLE xxx_statinfo_63 ADD trace_id varchar(64) DEFAULT '' NOT NULL;

Then, data is deleted from these 64 tables by traversing them and executing delete xxx_statinfo_x where xxx limit 5000 on each one (a sketch of this batch delete is shown after the list below), which results in the following situations:

  1. The checkpoint in cdc cli changefeed list does not change, and occasionally errors occur, as follows:
[root@localhost eric]# cdc cli changefeed list --pd=http://192.168.100.162:2379
[
  {
    "id": "socol-statinfo",
    "summary": {
      "state": "normal",
      "tso": 440123479891640321,
      "checkpoint": "2023-03-16 11:37:15.280",
      "error": null
    }
  }
]
[root@localhost eric]# cdc cli changefeed list --pd=http://192.168.100.162:2379
[2023/03/16 11:37:32.650 +08:00] [WARN] [cli_changefeed_list.go:102] ["query changefeed info failed"] [error="Post \"http://192.168.100.166:8300/capture/owner/changefeed/query\": dial tcp 192.168.100.166:8300: connect: connection refused"]
[
  {
    "id": "socol-statinfo",
    "summary": null
  }
]
  2. Checking cdc_stderr.log reports “panic: interface conversion: interface {} is string, not uint8”, which is similar to the issue described in the forum topic “After adding a TiCDC-to-Kafka sync task, the TiCDC component keeps restarting” (#4, from LingJin, TiDB Q&A community), but my version is v5.3.0, and according to that post this mismatch should already be resolved in versions 5.0.4 and later.

  3. Deleting this task and recreating it with the original TSO does not resolve the issue.

  4. Deleting this task and recreating it with a TSO taken after the delete operation completes works normally and the checkpoint advances, but the issue recurs as soon as another delete operation occurs.

  5. Tested unsafe reset, scaling down and then scaling up the ticdc component, and recreating the task, but the issue persists.
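
A rough sketch of the batch delete mentioned above (the actual WHERE condition is not given in the post, so create_time < '2023-03-01' below is only a hypothetical placeholder):

delete from xxx_statinfo_0 where create_time < '2023-03-01' limit 5000;
delete from xxx_statinfo_1 where create_time < '2023-03-01' limit 5000;
......
delete from xxx_statinfo_63 where create_time < '2023-03-01' limit 5000;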

[Resource Configuration]
The configuration of the test environment is as follows:

[root@localhost eric]# tiup cluster display tidb-test
tiup is checking updates for component cluster ...
A new version of cluster is available:
   The latest version:         v1.11.3
   Local installed version:    v1.11.1
   Update current component:   tiup update cluster
   Update all components:      tiup update --all

Starting component `cluster`: /root/.tiup/components/cluster/v1.11.1/tiup-cluster /root/.tiup/components/cluster/v1.11.1/tiup-cluster display tidb-test
Cluster type:       tidb
Cluster name:       tidb-test
Cluster version:    v5.3.0
Deploy user:        tidb
SSH type:           builtin
Dashboard URL:      http://192.168.100.164:2379/dashboard
Grafana URL:        http://192.168.100.161:3000
ID                     Role          Host             Ports                            OS/Arch       Status   Data Dir                           Deploy Dir
--                     ----          ----             -----                            -------       ------   --------                           ----------
192.168.100.161:9093   alertmanager  192.168.100.161  9093/9094                        linux/x86_64  Up       /data/tidb-data/alertmanager-9093  /data/tidb-deploy/alertmanager-9093
192.168.100.161:8300   cdc           192.168.100.161  8300                             linux/x86_64  Up       /data/tidb-data/cdc-8300           /data/tidb-deploy/cdc-8300
192.168.100.166:8300   cdc           192.168.100.166  8300                             linux/x86_64  Up       /data/tidb-data/cdc-8300           /data/tidb-deploy/cdc-8300
192.168.100.161:3000   grafana       192.168.100.161  3000                             linux/x86_64  Up       -                                  /data/tidb-deploy/grafana-3000
192.168.100.162:2379   pd            192.168.100.162  2379/2380                        linux/x86_64  Up       /data/tidb-data/pd-2379            /data/tidb-deploy/pd-2379
192.168.100.163:2379   pd            192.168.100.163  2379/2380                        linux/x86_64  Up       /data/tidb-data/pd-2379            /data/tidb-deploy/pd-2379
192.168.100.164:2379   pd            192.168.100.164  2379/2380                        linux/x86_64  Up|L|UI  /data/tidb-data/pd-2379            /data/tidb-deploy/pd-2379
192.168.100.161:9090   prometheus    192.168.100.161  9090                             linux/x86_64  Up       /data/tidb-data/prometheus-9090    /data/tidb-deploy/prometheus-9090
192.168.100.161:4000   tidb          192.168.100.161  4000/10080                       linux/x86_64  Up       -                                  /data/tidb-deploy/tidb-4000
192.168.100.166:9000   tiflash       192.168.100.166  9000/8123/3930/20170/20292/8234  linux/x86_64  Up       /data/tidb-data/tiflash-9000       /data/tidb-deploy/tiflash-9000
192.168.100.162:20160  tikv          192.168.100.162  20160/20180                      linux/x86_64  Up       /data/tidb-data/tikv-20160         /data/tidb-deploy/tikv-20160
192.168.100.163:20160  tikv          192.168.100.163  20160/20180                      linux/x86_64  Up       /data/tidb-data/tikv-20160         /data/tidb-deploy/tikv-20160
192.168.100.164:20160  tikv          192.168.100.164  20160/20180                      linux/x86_64  Up       /data/tidb-data/tikv-20160         /data/tidb-deploy/tikv-20160
Total nodes: 13

[Attachments: Screenshots/Logs/Monitoring]
cdc_log.tar.gz (223.5 KB)
cdc_stderr.log (18.0 KB)

[Others]
Table structure is as follows:

CREATE TABLE `xxx_statinfo_0` (
  `id` int(10) NOT NULL AUTO_INCREMENT ,
  `imei` varchar(30) NOT NULL DEFAULT '' ,
  `device_no` varchar(128) NOT NULL DEFAULT '' ,
  `action` tinyint(2) NOT NULL DEFAULT '0' ,
  `seq` varchar(36) NOT NULL,
  `source` tinyint(2) NOT NULL DEFAULT '0' ,
  `img_size` int(11) unsigned NOT NULL DEFAULT '0' ,
  `img_total` smallint(6) unsigned NOT NULL DEFAULT '0' ,
  `vedio_duration` int(11) NOT NULL DEFAULT '0' ,
  `vedio_size` int(11) NOT NULL DEFAULT '0' ,
  `img_url` mediumtext NOT NULL ,     # Stores base64 values
  `vedio_url` varchar(256) NOT NULL DEFAULT '' ,
  `upload_time` datetime NOT NULL DEFAULT '0000-00-00 00:00:00' ,
  `create_time` datetime NOT NULL DEFAULT '0000-00-00 00:00:00' ,
  `update_time` datetime NOT NULL DEFAULT '0000-00-00 00:00:00' ,
  `append_size` int(11) unsigned NOT NULL DEFAULT '0' ,
  `total_size` int(11) unsigned NOT NULL DEFAULT '0' ,
  `apk_version` varchar(50) NOT NULL DEFAULT '0' ,
  `is_compress` tinyint(2) unsigned NOT NULL DEFAULT '0' ,
  `error_code` int(5) unsigned NOT NULL DEFAULT '0' ,
  `mosaic_type` tinyint(1) NOT NULL DEFAULT '0' ,
  `mosaic_size` int(10) NOT NULL DEFAULT '0' ,
  `resolution` int(10) NOT NULL DEFAULT '0' ,
  `isCut` tinyint(1) NOT NULL DEFAULT '0' ,                           # Previously added field with no issues
  `trace_id` varchar(64) NOT NULL DEFAULT '' ,                    # Recently added field causing issues
  PRIMARY KEY (`id`) /*T![clustered_index] CLUSTERED */,
  KEY `idx_imei_source_seq` (`imei`,`source`,`seq`),
  KEY `idx_device_source_seq` (`device_no`,`source`,`seq`),
  KEY `idx_create_time` (`create_time`)
);
| username: Meditator | Original post link

I have seen this issue in versions 4.X, 5.2.X, and 5.4.X before. Yours is 5.3.X, so it is very likely the same problem.

| username: 脚本小王子 | Original post link

Is there a solution without upgrading the version?

| username: Meditator | Original post link

You need to ask the official R&D experts about what scenarios trigger this and how to avoid it.

| username: 脚本小王子 | Original post link

Sharing my solution: after testing, removing the unsigned attribute from the img_size field and then recreating the CDC task resolves the issue. The specific steps are as follows:

# Modify the field to remove the unsigned attribute (repeat for each of the 64 xxx_statinfo_N tables)
alter table xxx_statinfo modify img_size int NOT NULL DEFAULT '0';

# View CDC component roles
cdc cli capture list --pd=http://xxx.xxx.xxx:2379

# View tasks
cdc cli changefeed list --pd=http://xxx.xxx.xxx:2379 

# Pause
cdc cli changefeed pause --changefeed-id xxx-statinfo --pd=http://xxx.xxx.xxx:2379

# Delete
cdc cli changefeed remove --changefeed-id xxx-statinfo --pd=http://xxx.xxx.xxx:2379

# Get the TSO for a time after the field modification; for example, if the operation completed at 2023-03-18 22:00:00
select conv(concat(bin(unix_timestamp('2023-03-18 22:00:00')*1000),'000000000000000001'),2,10);
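
# Optional sanity check: a TSO is the physical timestamp in milliseconds shifted left by
# 18 bits plus a logical counter, so dropping the low 18 bits recovers the wall-clock time
select from_unixtime((440178573312000001 >> 18) / 1000);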

# Recreate the previously deleted task (replace the specified start-ts with the TSO obtained above)
cdc cli changefeed create --pd=http://xxx.xxx.xxx:2379 --sink-uri="kafka://172.16.5.167:9092/ticdc_xxx_statinfo?kafka-version=2.2.1&partition-num=6&max-message-bytes=67108864&replication-factor=1" --changefeed-id="xxx-statinfo" --start-ts=440178573312000001 --sort-engine="unified" --config=xxx_statinfo.toml
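
The xxx_statinfo.toml referenced above is not included in the post. Since the maxwell format at the center of this issue has to be selected somewhere (it does not appear in the sink URI), a minimal changefeed configuration for this setup would look roughly like the following; the database name xxx_db and the table pattern are placeholders, not taken from the post:

# xxx_statinfo.toml (hypothetical sketch; the real file is not shown in the post)
case-sensitive = true

[filter]
# xxx_db is a placeholder database name; the pattern matches xxx_statinfo_0 ... xxx_statinfo_63
rules = ['xxx_db.xxx_statinfo_*']

[sink]
# select the maxwell output format for the Kafka sink
protocol = "maxwell"

After recreating the changefeed, re-running cdc cli changefeed list as above should show the checkpoint advancing again.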
| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.