PD: context deadline exceeded

asamuj · May 19, 2025, 3:42am

Application environment:

production

TiDB version:

v6.5.11

Reproduction method:

Problem:

We’re occasionally getting a “context deadline exceeded” error, and it significantly slows down the cluster response.

Resource allocation:

Attachment:

[2025/05/19 01:42:11.319 +08:00] [WARN] [retry_interceptor.go:62] ["retrying of unary invoker failed"] [target=etcd-endpoints://0xc000208380/10.31.1.60:2379] [attempt=0] [error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"]
[2025/05/19 01:42:38.753 +08:00] [WARN] [retry_interceptor.go:62] ["retrying of unary invoker failed"] [target=etcd-endpoints://0xc000208380/10.31.1.60:2379] [attempt=0] [error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"]
[2025/05/19 01:43:03.048 +08:00] [WARN] [retry_interceptor.go:62] ["retrying of unary invoker failed"] [target=etcd-endpoints://0xc000208380/10.31.1.60:2379] [attempt=0] [error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"]
[2025/05/19 01:44:30.593 +08:00] [WARN] [retry_interceptor.go:62] ["retrying of unary invoker failed"] [target=etcd-endpoints://0xc000208380/10.31.1.60:2379] [attempt=0] [error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"]
[2025/05/19 01:46:05.827 +08:00] [WARN] [retry_interceptor.go:62] ["retrying of unary invoker failed"] [target=etcd-endpoints://0xc000208380/10.31.1.60:2379] [attempt=0] [error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"]
[2025/05/19 01:46:19.712 +08:00] [WARN] [retry_interceptor.go:62] ["retrying of unary invoker failed"] [target=etcd-endpoints://0xc000208380/10.31.1.60:2379] [attempt=0] [error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"]
[2025/05/19 10:31:43.894 +08:00] [WARN] [retry_interceptor.go:62] ["retrying of unary invoker failed"] [target=etcd-endpoints://0xc000208380/10.31.1.60:2379] [attempt=0] [error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"]
[2025/05/19 10:36:47.553 +08:00] [WARN] [retry_interceptor.go:62] ["retrying of unary invoker failed"] [target=etcd-endpoints://0xc000208380/10.31.1.60:2379] [attempt=0] [error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"]
[2025/05/19 10:37:34.834 +08:00] [WARN] [retry_interceptor.go:62] ["retrying of unary invoker failed"] [target=etcd-endpoints://0xc000208380/10.31.1.60:2379] [attempt=0] [error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"]
[2025/05/19 10:39:12.615 +08:00] [WARN] [retry_interceptor.go:62] ["retrying of unary invoker failed"] [target=etcd-endpoints://0xc000208380/10.31.1.60:2379] [attempt=0] [error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"]

TiDB-AI-Assistant · May 19, 2025, 3:43am

The “context deadline exceeded” error means that a request did not complete before its deadline. In the log above, the requests that time out are etcd client calls to the endpoint 10.31.1.60:2379. Common triggers are network issues, high system load, and problems with PD leadership. Possible causes and remedies are listed below.

Causes of the “context deadline exceeded” Error

Network Issues:
- Communication problems between TiDB nodes and PD or TiKV can lead to timeouts. This is often due to network latency or disconnections.
- The PD log might show messages like “lost the TCP streaming connection,” indicating network instability.
PD Leadership Issues:
- PD leader switches or the inability to elect a leader can cause timeouts. This might be due to disk I/O load, network issues, or high system load on the PD nodes.
- The PD log might show “lease is not expired” or “server is likely overloaded,” indicating leadership election problems.
High System Load:
- If the PD server is under high load, it might not respond in time, leading to a timeout error.
- The log might show “server is likely overloaded,” suggesting that the system resources are insufficient.
PD OOM (Out of Memory):
- When PD runs out of memory, it can cause SQL statement timeouts. This might occur during operations that require significant memory, such as handling large numbers of regions.
Replication Task Complexity:
- In DM (Data Migration), the error can be triggered when a replication task includes many tables. This cause applies only if DM is in use.

Proposed Solutions

Network Optimization:
- Ensure stable and low-latency network connections between TiDB, PD, and TiKV nodes. Use monitoring tools like Grafana to check network performance metrics such as ping latency and round trip times.
PD Leadership Stability:
- Investigate and resolve any disk I/O issues on PD nodes. Ensure that PD is not co-located with other high I/O demand components.
- Monitor and optimize system load on PD nodes to prevent overloads. Consider scaling resources if necessary.
Memory Management:
- Monitor PD memory usage and optimize configurations to prevent OOM errors. Consider increasing memory allocation or optimizing memory usage patterns.
Replication Task Management:
- For DM tasks with many tables, consider upgrading to a version that addresses known issues or apply hotfixes as necessary. Ensure that the replication task is optimized for performance.
Error Handling and Monitoring:
- Implement robust error handling mechanisms to catch and address timeouts promptly. Use logs and monitoring dashboards to identify and troubleshoot issues as they arise.
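
As one way to use the logs for troubleshooting, here is a minimal Python sketch (not an official TiDB tool) that pulls the DeadlineExceeded warnings out of lines in the format posted above and groups them into bursts. The function names and the 10-minute gap are illustrative assumptions, and the time zone offset in each line is ignored.

```python
import re
from datetime import datetime, timedelta

# Matches the bracketed log format shown in the attachment.
LINE_RE = re.compile(
    r"^\[(?P<ts>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) [+-]\d{2}:\d{2}\] "
    r"\[(?P<level>\w+)\].*\[target=(?P<target>[^\]]+)\].*DeadlineExceeded"
)

def parse_timeouts(lines):
    """Return (timestamp, target) pairs for every DeadlineExceeded line."""
    events = []
    for line in lines:
        m = LINE_RE.search(line)
        if m:
            ts = datetime.strptime(m.group("ts"), "%Y/%m/%d %H:%M:%S.%f")
            events.append((ts, m.group("target")))
    return events

def group_bursts(events, max_gap=timedelta(minutes=10)):
    """Group timestamps into bursts; a gap larger than max_gap starts a new one."""
    bursts = []
    for ts, _target in sorted(events):
        if bursts and ts - bursts[-1][-1] <= max_gap:
            bursts[-1].append(ts)
        else:
            bursts.append([ts])
    return bursts

# Three lines copied from the attachment in this thread.
SAMPLE = [
    '[2025/05/19 01:42:11.319 +08:00] [WARN] [retry_interceptor.go:62] ["retrying of unary invoker failed"] [target=etcd-endpoints://0xc000208380/10.31.1.60:2379] [attempt=0] [error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"]',
    '[2025/05/19 01:46:19.712 +08:00] [WARN] [retry_interceptor.go:62] ["retrying of unary invoker failed"] [target=etcd-endpoints://0xc000208380/10.31.1.60:2379] [attempt=0] [error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"]',
    '[2025/05/19 10:31:43.894 +08:00] [WARN] [retry_interceptor.go:62] ["retrying of unary invoker failed"] [target=etcd-endpoints://0xc000208380/10.31.1.60:2379] [attempt=0] [error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"]',
]

bursts = group_bursts(parse_timeouts(SAMPLE))
for burst in bursts:
    print(burst[0], "->", burst[-1], f"({len(burst)} timeouts)")
```

Applied to all ten lines of the attachment, the same grouping gives two bursts, 01:42 to 01:46 and 10:31 to 10:39, which can then be compared with load peaks or scheduled jobs in those windows.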

By addressing these potential causes and implementing the proposed solutions, you can mitigate the “context deadline exceeded” error in TiDB PD and improve the overall stability and performance of your TiDB cluster.

Hazel · May 19, 2025, 7:09am

Hi @asamuj, it seems the issue might be caused by network problems or by the target not responding in time. Could you please check the following:

- Whether there are any network issues.
- Whether the target service is under high load.
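
For a quick network check from the affected host, a rough Python sketch like the one below times a plain TCP connection to the endpoint. The function name is hypothetical, and a fast connect only shows that the port is reachable; it does not prove that the service behind it answers requests quickly.

```python
import socket
import time

def tcp_connect_latency(host, port, timeout=2.0):
    """Return the seconds taken to open a TCP connection, or None on failure."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return None

# Example call, using the endpoint from the log in this thread:
#   tcp_connect_latency("10.31.1.60", 2379)
```

Running it several times during a burst of warnings and again during a quiet period gives a simple comparison point.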

asamuj · May 19, 2025, 7:33am

Thank you very much for your reply!

The network is working fine on our end.
Could you please advise on how to determine if the service is experiencing high load?

Hazel · May 19, 2025, 8:01am

Hi @asamuj, you can check the instance monitoring to determine whether the service is under high load. Please refer to this documentation: Key Monitoring Metrics of PD.
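
If Grafana is not at hand, one disk-related load signal can be read from Prometheus text metrics. The sketch below is only an illustration: it assumes the PD instance exposes etcd's `etcd_disk_wal_fsync_duration_seconds` histogram on its metrics endpoint, the helper name is hypothetical, and the sample numbers are made up.

```python
def histogram_mean(metrics_text, name):
    """Return sum/count for a Prometheus histogram, or None if absent or empty.

    Assumes the simple 'metric{labels} value' layout without trailing timestamps.
    """
    total = 0.0
    count = 0.0
    for line in metrics_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        metric, _, value = line.rpartition(" ")
        base = metric.split("{", 1)[0]  # drop any label set
        if base == name + "_sum":
            total += float(value)
        elif base == name + "_count":
            count += float(value)
    return total / count if count else None

# Made-up sample in Prometheus text exposition format.
SAMPLE_METRICS = """\
# TYPE etcd_disk_wal_fsync_duration_seconds histogram
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.001"} 10
etcd_disk_wal_fsync_duration_seconds_bucket{le="+Inf"} 40
etcd_disk_wal_fsync_duration_seconds_sum 0.8
etcd_disk_wal_fsync_duration_seconds_count 40
"""

mean_fsync = histogram_mean(SAMPLE_METRICS, "etcd_disk_wal_fsync_duration_seconds")
print(f"average WAL fsync: {mean_fsync:.3f} s")
```

This gives a lifetime average rather than the percentile view shown on the dashboards, so it is best used to compare a healthy period against a period with timeouts.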

asamuj · May 20, 2025, 3:18am

Thanks for the suggestion! I’ll check out the documentation and give it a try.