Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.
Original topic: ticdc同步任务tso不变 (the TSO of a TiCDC replication task does not change)
【TiDB Usage Environment】Production
【TiDB Version】5.1.4
【Encountered Problem】
The TSO of a TiCDC replication task does not change. There are 6 CDC tasks in total: 3 are replicating in real time, while the other 3 are lagging behind and their TSO remains unchanged.
{
  "id": "6f1ad1ff-c230-4cb1-a44b-c542dcdd1a20",
  "summary": {
    "state": "normal",
    "tso": 434351548455452762,
    "checkpoint": "2022-07-04 15:27:44.213",
    "error": null
  }
}
The status also appears to be normal.
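For reference, the tso value above is a standard TiDB TSO: the high 46 bits hold a Unix timestamp in milliseconds and the low 18 bits hold a logical counter, so it can be decoded to confirm that it matches the checkpoint string. A minimal Go sketch of that decoding (not part of the original post):

```go
package main

import (
	"fmt"
	"time"
)

// decodeTSO splits a TiDB TSO into its physical and logical parts:
// the high 46 bits are milliseconds since the Unix epoch, the low
// 18 bits are a logical counter.
func decodeTSO(tso uint64) (time.Time, uint64) {
	physicalMs := int64(tso >> 18)
	logical := tso & ((1 << 18) - 1)
	return time.UnixMilli(physicalMs), logical
}

func main() {
	t, logical := decodeTSO(434351548455452762)
	cst := time.FixedZone("CST", 8*3600)
	// Prints "2022-07-04 15:27:44.213 +0800 90", matching the checkpoint above.
	fmt.Println(t.In(cst).Format("2006-01-02 15:04:05.000 -0700"), logical)
}
```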
The phenomenon is this: of the 6 CDC tasks, 3 are continuously replicating in real time with their TSO constantly advancing, while the TSO of the other 3 remains unchanged. The downstream is the same Kafka cluster, just with different topics, and data is continuously being written to TiDB. Can the tasks affect each other? For example, if the TSO of one task stops changing, will it affect the other replication tasks?
Moreover, deleting just one of the three tasks with an unchanged TSO and creating a new one does not work (whether or not --start-ts is specified). At the moment, only deleting all three tasks and then creating a new task works.
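If a --start-ts is needed for the recreated changefeed and you want to derive it from a wall-clock time rather than asking PD for a fresh TSO, the same TSO layout can be used in reverse. A minimal sketch under that assumption; note that the resulting TSO must still be newer than the cluster's GC safepoint:

```go
package main

import (
	"fmt"
	"time"
)

// tsoFromTime builds a TSO whose physical part is the given wall-clock time
// (milliseconds since the Unix epoch, shifted left by 18 bits) and whose
// logical part is zero.
func tsoFromTime(t time.Time) uint64 {
	return uint64(t.UnixMilli()) << 18
}

func main() {
	start := time.Date(2022, 7, 4, 15, 27, 44, 0, time.FixedZone("CST", 8*3600))
	// A value of this form could be passed as --start-ts when recreating a changefeed.
	fmt.Println(tsoFromTime(start))
}
```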
The information in cdc_stderr.log is as follows:
{"level":"warn","ts":"2022-07-04T20:11:29.415+0800","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-49583f76-bc68-4c7e-a0f6-c2b3c34335f2/10.108.182.133:3379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
There is something I don't quite understand. The other three tasks are still running normally, which suggests CDC itself is fine, so deleting and recreating the three problematic tasks one by one should have worked. In reality, however, all three tasks had to be stopped and deleted entirely before recreating them worked normally.
Error information in cdc.log:
[2022/07/04 17:35:29.728 +08:00] [ERROR] [client.go:319] ["[pd] getTS error"] [error="[PD:client:ErrClientGetTSO]rpc error: code = Unknown desc = [PD:tso:ErrGenerateTimestamp]generate timestamp failed, generate global tso maximum number of retries exceeded"]
[2022/07/04 17:35:29.729 +08:00] [ERROR] [pd.go:130] ["updateTS error"] [error="rpc error: code = Unknown desc = [PD:tso:ErrGenerateTimestamp]generate timestamp failed, generate global tso maximum number of retries exceeded"] [errorVerbose="rpc error: code = Unknown desc = [PD:tso:ErrGenerateTimestamp]generate timestamp failed, generate global tso maximum number of retries exceeded
github.com/tikv/pd/client.(*client).processTSORequests
    github.com/tikv/pd@v1.1.0-beta.0.20210105112549-e5be7fd38659/client/client.go:355
github.com/tikv/pd/client.(*client).tsLoop
    github.com/tikv/pd@v1.1.0-beta.0.20210105112549-e5be7fd38659/client/client.go:304
runtime.goexit
    runtime/asm_amd64.s:1357
github.com/tikv/pd/client.(*tsoRequest).Wait
    github.com/tikv/pd@v1.1.0-beta.0.20210105112549-e5be7fd38659/client/client.go:466
github.com/tikv/pd/client.(*client).GetTS
    github.com/tikv/pd@v1.1.0-beta.0.20210105112549-e5be7fd38659/client/client.go:486
github.com/pingcap/tidb/util/execdetails.InterceptedPDClient.GetTS
    github.com/pingcap/tidb@v1.1.0-beta.0.20210512055339-e25d1d0b7354/util/execdetails/pd_interceptor.go:60
github.com/pingcap/tidb/store/tikv/oracle/oracles.(*pdOracle).getTimestamp
    github.com/pingcap/tidb@v1.1.0-beta.0.20210512055339-e25d1d0b7354/store/tikv/oracle/oracles/pd.go:103
github.com/pingcap/tidb/store/tikv/oracle/oracles.(*pdOracle).updateTS
    github.com/pingcap/tidb@v1.1.0-beta.0.20210512055339-e25d1d0b7354/store/tikv/oracle/oracles/pd.go:128
runtime.goexit
    runtime/asm_amd64.s:1357"]
You can check whether the replication task's TSO stall is caused by this issue: TiCDC take long time (may be a day) to recover from TiKV cluster failover · Issue #4516 · pingcap/tiflow · GitHub
For residual locks on TiKV, check the metric cdc → tikv min resolved ts.
Today, 3 tasks were running normally when the TSO of the other 3 suddenly stopped changing, causing replication lag. After checking with the business side, it turned out TiDB was running jobs that write large fields. Once those jobs were stopped, the TSO started advancing again and the lag gradually caught up.
So my question is: how does CDC handle replication when large fields are involved?
Could you please send the desensitized test data with the blob field? We need to reproduce the issue on our side.
This large field stores some JSON strings.
Is the issue still present when replicating tables that contain blob fields? Can it be reproduced consistently? Based on the error message above, it looks like a problem with TiCDC fetching TSO from PD.
It can be reproduced consistently. As soon as the development side adds this JSON string field, the TSO of the corresponding CDC task stops changing.
Could you describe the specific steps to reproduce it consistently, or share the reproduction method? We'll take a look.
Once there are operations on this table, such as a large number of updates or deletes, CDC falls behind. When there are no such operations, this field can still be replicated. Which CDC parameters need to be adjusted?