TiCDC: Creating a New Changefeed Always Reports an etcd Timeout

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TICDC新建changefeed总是报etcd超时

| username: Hacker_9LYnzJhP

[TiDB Usage Environment] Production Environment / Testing / PoC
Production environment. TiDB is deployed on k8s. The TiDB instance holds about 5,000 to 10,000 tables, but only about 150 of them are configured for TiCDC replication. The total data volume in TiDB is small, less than 10 GB.

The whole environment is still being tested, so the tables replicated by TiCDC see little traffic, and each holds only a few hundred thousand rows. DDL is frequent in this environment, and truncate table is run often.

[TiDB Version]
TiDB 5.4
[Reproduction Path] What operations were performed when the issue occurred
Creating a changefeed through TiCDC's OpenAPI fails, always with an etcd timeout.

[Encountered Issue: Problem Phenomenon and Impact]
curl -X POST http://127.0.0.1:8301/api/v1/changefeeds -d '{"changefeed_id":"k1","sink_uri":"kafka://broker-kafka-test-az1-0.jvessel-open-hb.jdcloud.com:9092/tidb_version_test?protocol=canal-json&kafka-version=2.4.0&max-message-bytes=1073741824", "filter_rules":["test.test1"]}'

It returns: [CDC:ErrPDEtcdAPIError]etcd api call error: context deadline exceeded

This happens on almost every attempt, and each failing HTTP request returns after about 12 to 14 seconds.

Creating the changefeed through cdc cli, however, does not hit this issue.

[Resource Configuration]
TiCDC resources: 8 CPU cores, 16 GB memory.

/cdc server --addr=0.0.0.0:8301 --advertise-addr=tidb-test-ticdc-0tidb-test-ticdc-peer.tidb-test.svc:8301 --gc-ttl=86400 --log-file=/tmp/cdc_data/log/cdc.log --log-level=info --pd=http://tidb-test-pd:2379

[Attachments: Screenshots/Logs/Monitoring]

When creation fails, the TiCDC log shows:
[2022/12/28 09:08:12.272 +00:00] [ERROR] [client.go:502] ["[pd] tso request is canceled due to timeout"] [dc-location=global] [error="[PD:client:ErrClientGetTSOTimeout]get TSO timeout"]
[2022/12/28 09:08:12.272 +00:00] [ERROR] [client.go:786] ["[pd] getTS error"] [dc-location=global] [error="[PD:client:ErrClientGetTSO]rpc error: code = Canceled desc = context canceled: rpc error: code = Canceled desc = context canceled"]
[2022/12/28 09:08:12.272 +00:00] [INFO] [client.go:730] ["[pd] tso stream is not ready"] [dc=global]
[2022/12/28 09:08:12.272 +00:00] [INFO] [acquirer.go:71] ["get time from pd failed, retry later"] [error="rpc error: code = Canceled desc = context canceled"] [errorVerbose="rpc error: code = Canceled desc = context canceled\ngithub.com/tikv/pd/client.(
[2022/12/28 09:08:16.273 +00:00] [ERROR] [client.go:502] ["[pd] tso request is canceled due to timeout"] [dc-location=global] [error="[PD:client:ErrClientGetTSOTimeout]get TSO timeout"]
[2022/12/28 09:08:16.274 +00:00] [ERROR] [client.go:786] ["[pd] getTS error"] [dc-location=global] [error="[PD:client:ErrClientGetTSO]rpc error: code = Canceled desc = context canceled: rpc error: code = Canceled desc = context canceled"]
[2022/12/28 09:08:16.274 +00:00] [INFO] [client.go:730] ["[pd] tso stream is not ready"] [dc=global]
[2022/12/28 09:08:16.274 +00:00] [INFO] [acquirer.go:71] ["get time from pd failed, retry later"] [error="rpc error: code = Canceled desc = context canceled"] [errorVerbose="rpc error: code = Canceled desc = context canceled\ngithub.com/tikv/pd/client.(
[2022/12/28 09:08:20.274 +00:00] [ERROR] [client.go:502] ["[pd] tso request is canceled due to timeout"] [dc-location=global] [error="[PD:client:ErrClientGetTSOTimeout]get TSO timeout"]
[2022/12/28 09:08:20.274 +00:00] [ERROR] [client.go:786] ["[pd] getTS error"] [dc-location=global] [error="[PD:client:ErrClientGetTSO]rpc error: code = Canceled desc = context canceled: rpc error: code = Canceled desc = context canceled"]
[2022/12/28 09:08:20.274 +00:00] [INFO] [client.go:730] ["[pd] tso stream is not ready"] [dc=global]
[2022/12/28 09:08:20.274 +00:00] [INFO] [acquirer.go:71] ["get time from pd failed, retry later"] [error="rpc error: code = Canceled desc = context canceled"] [errorVerbose="rpc error: code = Canceled desc = context canceled\ngithub.com/tikv/pd/client.(
[2022/12/28 09:08:24.275 +00:00] [ERROR] [client.go:502] ["[pd] tso request is canceled due to timeout"] [dc-location=global] [error="[PD:client:ErrClientGetTSOTimeout]get TSO timeout"]
[2022/12/28 09:08:24.276 +00:00] [ERROR] [client.go:786] ["[pd] getTS error"] [dc-location=global] [error="[PD:client:ErrClientGetTSO]rpc error: code = Canceled desc = context canceled: rpc error: code = Canceled desc = context canceled"]
[2022/12/28 09:08:24.276 +00:00] [INFO] [client.go:730] ["[pd] tso stream is not ready"] [dc=global]
[2022/12/28 09:08:24.276 +00:00] [INFO] [acquirer.go:71] ["get time from pd failed, retry later"] [error="rpc error: code = Canceled desc = context canceled"] [errorVerbose="rpc error: code = Canceled desc = context canceled\ngithub.com/tikv/pd/client.(
[2022/12/28 09:08:28.277 +00:00] [ERROR] [client.go:502] ["[pd] tso request is canceled due to timeout"] [dc-location=global] [error="[PD:client:ErrClientGetTSOTimeout]get TSO timeout"]
[2022/12/28 09:08:28.277 +00:00] [ERROR] [client.go:786] ["[pd] getTS error"] [dc-location=global] [error="[PD:client:ErrClientGetTSO]rpc error: code = Canceled desc = context canceled: rpc error: code = Canceled desc = context canceled"]
[2022/12/28 09:08:28.277 +00:00] [INFO] [client.go:730] ["[pd] tso stream is not ready"] [dc=global]
[2022/12/28 09:08:28.277 +00:00] [INFO] [acquirer.go:71] ["get time from pd failed, retry later"] [error="rpc error: code = Canceled desc = context canceled"] [errorVerbose="rpc error: code = Canceled desc = context canceled\ngithub.com/tikv/pd/client.(
[2022/12/28 09:08:32.279 +00:00] [ERROR] [client.go:502] ["[pd] tso request is canceled due to timeout"] [dc-location=global] [error="[PD:client:ErrClientGetTSOTimeout]get TSO timeout"]
[2022/12/28 09:08:32.279 +00:00] [ERROR] [client.go:786] ["[pd] getTS error"] [dc-location=global] [error="[PD:client:ErrClientGetTSO]rpc error: code = Canceled desc = context canceled: rpc error: code = Canceled desc = context canceled"]
[2022/12/28 09:08:32.279 +00:00] [INFO] [client.go:730] ["[pd] tso stream is not ready"] [dc=global]
[2022/12/28 09:08:32.279 +00:00] [INFO] [acquirer.go:71] ["get time from pd failed, retry later"] [error="rpc error: code = Canceled desc = context canceled"] [errorVerbose="rpc error: code = Canceled desc = context canceled\ngithub.com/tikv/pd/client.(
[2022/12/28 09:08:36.281 +00:00] [ERROR] [client.go:502] ["[pd] tso request is canceled due to timeout"] [dc-location=global] [error="[PD:client:ErrClientGetTSOTimeout]get TSO timeout"]
[2022/12/28 09:08:36.281 +00:00] [ERROR] [client.go:786] ["[pd] getTS error"] [dc-location=global] [error="[PD:client:ErrClientGetTSO]rpc error: code = Canceled desc = context canceled: rpc error: code = Canceled desc = context canceled"]
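
In the excerpt above, the "get TSO timeout" error recurs roughly every 4 seconds (09:08:12, 09:08:16, 09:08:20, and so on). To see that pattern over a whole log file rather than a pasted excerpt, a small helper along the following lines may be useful. This is only a sketch: the function names tso_timeout_times and tso_timeout_count are made up for illustration, and it assumes each log line starts with a bracketed timestamp in the format shown above.

```shell
#!/usr/bin/env bash

# Print the timestamp of every "get TSO timeout" error in a TiCDC log.
# (Hypothetical helper name; not part of TiCDC.)
tso_timeout_times() {
  local log_file="$1"
  # Match the error code literally, then keep characters 2-24 of each line,
  # i.e. the "YYYY/MM/DD hh:mm:ss.mmm" part inside the leading bracket.
  grep -F 'ErrClientGetTSOTimeout' "$log_file" | cut -c2-24
}

# Count the same errors in the log.
# (Hypothetical helper name; not part of TiCDC.)
tso_timeout_count() {
  grep -cF 'ErrClientGetTSOTimeout' "$1"
}
```

With the --log-file path from the startup command above, usage would look like `tso_timeout_times /tmp/cdc_data/log/cdc.log`. A steady interval between the printed timestamps suggests the PD client is retrying on a fixed timeout rather than failing only during changefeed creation.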

| username: hey-hoho

It looks like the CDC node cannot reach PD, probably due to network issues.

| username: xfworld

k8s networking is relatively complex, so check whether the OpenAPI requester can reach PD normally.
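
One way to run that check is to probe PD's HTTP health endpoint from the same place the TiCDC process runs. The sketch below assumes the PD address from the startup command (http://tidb-test-pd:2379) and that curl is available where it is executed, which may not be true inside the TiCDC container image. The helper names pd_health_url and check_pd are hypothetical. The pod name tidb-test-ticdc-0 and namespace tidb-test in the comment are guesses based on the --advertise-addr value, not confirmed by the post.

```shell
#!/usr/bin/env bash

# Build the PD health URL from a PD address such as http://tidb-test-pd:2379.
# (Hypothetical helper name.)
pd_health_url() {
  local pd_addr="${1%/}"   # strip one trailing slash, if present
  printf '%s/pd/api/v1/health\n' "$pd_addr"
}

# Probe PD with a 5-second timeout; exits non-zero if PD cannot be reached
# or answers with an HTTP error status. (Hypothetical helper name.)
check_pd() {
  curl --silent --show-error --fail --max-time 5 "$(pd_health_url "$1")"
}

# Example, run inside the TiCDC pod:
#   check_pd http://tidb-test-pd:2379
# Or from outside, assuming the pod and namespace names guessed above:
#   kubectl exec -n tidb-test tidb-test-ticdc-0 -- \
#     curl --max-time 5 http://tidb-test-pd:2379/pd/api/v1/health
```

A successful response only shows that PD is reachable over HTTP. The TSO errors in the log come from the gRPC client, so a passing health check narrows the problem down but does not rule out a network issue on that path.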