TiCDC Sync Task TSO Unchanged

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: ticdc同步任务tso不变

| username: l940399478

【TiDB Usage Environment】Production
【TiDB Version】5.1.4
【Encountered Problem】
The TSO of some TiCDC synchronization tasks does not change. There are 6 CDC tasks in total: 3 are synchronizing in real time, while the TSO of the other 3 stays the same and they fall behind.

{
  "id": "6f1ad1ff-c230-4cb1-a44b-c542dcdd1a20",
  "summary": {
    "state": "normal",
    "tso": 434351548455452762,
    "checkpoint": "2022-07-04 15:27:44.213",
    "error": null
  }
}

The status also appears to be normal.
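
For reference, the checkpoint shown above is simply the physical part of the TSO: in TiDB, the upper bits of a TSO (TSO >> 18) are milliseconds since the Unix epoch, and the low 18 bits are a logical counter. A minimal shell sketch, assuming GNU date, decodes the TSO above back to the checkpoint time:

    # Physical part of a TiDB TSO = TSO >> 18, in milliseconds since the Unix epoch.
    tso=434351548455452762
    ms=$(( tso >> 18 ))                   # 1656919664213
    date -d "@$(( ms / 1000 ))" '+%F %T'  # prints 2022-07-04 15:27:44 in UTC+8, matching the checkpoint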

| username: l940399478 | Original post link

The phenomenon is that there are 6 CDC tasks, 3 of which are continuously updating in real-time with TSO constantly changing, while the TSO of the other 3 tasks remains unchanged. The downstream is the same Kafka setup, just with different topics, and data in TiDB is also being continuously written. Will the tasks affect each other? For example, if the TSO of one task does not change, will it affect other synchronization tasks?

| username: l940399478 | Original post link

Moreover, deleting just one of the three tasks whose TSO is unchanged and then creating a new task does not work (regardless of whether --start-ts is specified). Only after deleting all three tasks does creating a new task work.
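
A rough sketch of the cdc cli steps described above, with placeholder PD address, changefeed ID, and Kafka sink URI rather than the actual values from this cluster:

    # Remove one stalled changefeed (placeholder ID and PD address).
    cdc cli changefeed remove --pd=http://10.0.0.1:2379 --changefeed-id=task-a

    # Recreate it, optionally resuming from a given TSO via --start-ts.
    cdc cli changefeed create --pd=http://10.0.0.1:2379 \
      --sink-uri="kafka://10.0.0.2:9092/topic-a" \
      --changefeed-id=task-a --start-ts=434351548455452762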

| username: TammyLi | Original post link

  1. In theory, the TSOs of different tasks do not affect each other.
  2. To troubleshoot the abnormal tasks, first check cdc.log for any ERROR/WARN messages.
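
For step 2, a quick way to scan the log, assuming the path below is replaced with the actual TiCDC log directory:

    # Placeholder path; TiDB-style logs tag levels as [ERROR] and [WARN].
    grep -E '\[(ERROR|WARN)\]' /path/to/cdc/log/cdc.log | tail -n 50
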
| username: l940399478 | Original post link

The information in cdc_stderr.log is as follows:
{"level":"warn","ts":"2022-07-04T20:11:29.415+0800","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-49583f76-bc68-4c7e-a0f6-c2b3c34335f2/10.108.182.133:3379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}

Here is something I don't quite understand: the other three tasks are still working normally, which suggests that CDC itself is fine. Deleting and recreating the three problematic tasks one by one should therefore work. In reality, however, all three of them have to be stopped and deleted before recreating the tasks works normally.

| username: l940399478 | Original post link

Error information in cdc.log:
[2022/07/04 17:35:29.728 +08:00] [ERROR] [client.go:319] ["[pd] getTS error"] [error="[PD:client:ErrClientGetTSO]rpc error: code = Unknown desc = [PD:tso:ErrGenerateTimestamp]generate timestamp failed, generate global tso maximum number of retries exceeded"]

[2022/07/04 17:35:29.729 +08:00] [ERROR] [pd.go:130] ["updateTS error"] [error="rpc error: code = Unknown desc = [PD:tso:ErrGenerateTimestamp]generate timestamp failed, generate global tso maximum number of retries exceeded"] [errorVerbose="rpc error: code = Unknown desc = [PD:tso:ErrGenerateTimestamp]generate timestamp failed, generate global tso maximum number of retries exceeded
github.com/tikv/pd/client.(*client).processTSORequests
\tgithub.com/tikv/pd@v1.1.0-beta.0.20210105112549-e5be7fd38659/client/client.go:355
github.com/tikv/pd/client.(*client).tsLoop
\tgithub.com/tikv/pd@v1.1.0-beta.0.20210105112549-e5be7fd38659/client/client.go:304
runtime.goexit
\truntime/asm_amd64.s:1357
github.com/tikv/pd/client.(*tsoRequest).Wait
\tgithub.com/tikv/pd@v1.1.0-beta.0.20210105112549-e5be7fd38659/client/client.go:466
github.com/tikv/pd/client.(*client).GetTS
\tgithub.com/tikv/pd@v1.1.0-beta.0.20210105112549-e5be7fd38659/client/client.go:486
github.com/pingcap/tidb/util/execdetails.InterceptedPDClient.GetTS
\tgithub.com/pingcap/tidb@v1.1.0-beta.0.20210512055339-e25d1d0b7354/util/execdetails/pd_interceptor.go:60
github.com/pingcap/tidb/store/tikv/oracle/oracles.(*pdOracle).getTimestamp
\tgithub.com/pingcap/tidb@v1.1.0-beta.0.20210512055339-e25d1d0b7354/store/tikv/oracle/oracles/pd.go:103
github.com/pingcap/tidb/store/tikv/oracle/oracles.(*pdOracle).updateTS
\tgithub.com/pingcap/tidb@v1.1.0-beta.0.20210512055339-e25d1d0b7354/store/tikv/oracle/oracles/pd.go:128
runtime.goexit
\truntime/asm_amd64.s:1357"]
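
Both errors point at PD failing to generate a global TSO, so it is worth checking PD health directly. A sketch with pd-ctl, assuming the client URL from the warning above (port 3379) is the right one for this cluster:

    # Check PD cluster health and member/leader status (run via "tiup ctl:v5.1.4 pd" if pd-ctl is not on PATH).
    pd-ctl -u http://10.108.182.133:3379 health
    pd-ctl -u http://10.108.182.133:3379 member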

| username: TammyLi | Original post link

You can check whether the task's TSO stopped advancing because of this known issue: TiCDC take long time (may be a day) to recover from TiKV cluster failover · Issue #4516 · pingcap/tiflow · GitHub
For residual locks on TiKV, check the metric cdc → tikv min resolved ts.
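
If Grafana is not at hand, the resolved-ts information can also be pulled straight from the TiKV status port (20180 by default); the exact metric names vary by version, so the grep below is deliberately loose:

    # Placeholder TiKV address; 20180 is the default TiKV status port.
    curl -s http://10.0.0.3:20180/metrics | grep -i 'min_resolved_ts'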

| username: l940399478 | Original post link

Today, there were 3 tasks running normally, but 3 tasks suddenly had their TSO stop advancing, causing replication lag. After checking with the business side, it turned out that jobs involving large fields were running on TiDB. Once those jobs were stopped, the TSO started advancing again and the lag gradually caught up.

So, I have a question: how does CDC handle synchronization when large fields are involved?

| username: xinyuzhao | Original post link

Could you please send some anonymized test data containing the blob field? We need to reproduce the issue on our side.

| username: l940399478 | Original post link

This large field stores some JSON strings.

| username: xinyuzhao | Original post link

Is there still an issue with synchronizing tables that contain blob fields? Can this issue be consistently reproduced? Based on the error message above, it seems to be a problem with TiCDC fetching TSO from PD.

| username: l940399478 | Original post link

It can be reproduced consistently: as soon as the development side adds this JSON string field, the TSO of the corresponding CDC task stops changing.

| username: 小王同学Plus | Original post link

Could you describe the specific steps to reproduce it consistently, or provide a reproduction method? We'll take a look.

| username: l940399478 | Original post link

Once there are heavy operations on this table, such as a large number of updates or deletes, CDC falls out of sync. When there are no such operations, the field is synchronized normally. What CDC parameters need to be adjusted?
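
For large rows going to a Kafka sink, the parameters usually checked first are the message-size limits: the max-message-bytes parameter in the TiCDC Kafka sink URI has to accommodate the largest row change, and the broker-side message.max.bytes must be at least as large. A sketch with placeholder addresses, topic, ID, and sizes:

    # Recreate the changefeed with a larger Kafka message size limit (placeholders throughout).
    cdc cli changefeed create --pd=http://10.0.0.1:2379 \
      --sink-uri="kafka://10.0.0.2:9092/topic-a?max-message-bytes=10485760" \
      --changefeed-id=task-a

    # The Kafka broker must allow at least the same size, e.g. in server.properties:
    #   message.max.bytes=10485760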