Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.
Original topic: pd server timeout
[TiDB Usage Environment] Production Environment
[TiDB Version] v6.5.3
[Reproduction Path]
Error in production environment: pd server timeout
Checked via blackbox_exporter; there is no abnormal ping latency.
Checked the monitoring; apart from pdrpc being around 10 on the TiDB monitoring panel, no other anomalies were found.
[Encountered Problem: Problem Phenomenon and Impact]
Business select statement error, pd server timeout
[Resource Configuration] Go to TiDB Dashboard - Cluster Info - Hosts and take a screenshot of this page
[Attachments: Screenshots/Logs/Monitoring]
At 10:23, the CPU, IO, network, and memory metrics were all normal.
At 10:23, the pdrpc anomaly appeared.
I want to check the PD logs around that time to see if there’s anything noteworthy.
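If it is easier than logging in to the PD hosts, the PD logs for that window can also be pulled via SQL from a TiDB node. A minimal sketch using information_schema.cluster_log, assuming the placeholder date below is replaced with the actual incident day:

```sql
-- Pull PD warn/error log entries around the incident window.
-- The date below is a placeholder; replace it with the actual day.
SELECT time, instance, level, message
FROM information_schema.cluster_log
WHERE type = 'pd'
  AND level IN ('warn', 'error')   -- drop this filter if nothing comes back
  AND time BETWEEN '2023-08-01 10:20:00' AND '2023-08-01 10:25:00'
ORDER BY time;
```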
At 10:22:52 there was a region split (a batch region split), which rarely occurs; other than that, there are a lot of balance-region operations.
[Screenshot attachment; not translated]
Is the PD service on the PD server functioning normally?
Check whether the PD leader has switched; then look into possible network problems. Use Grafana monitoring → blackbox_exporter → ping latency to determine whether the network between TiDB and the PD leader is normal.
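For the "has PD switched" check, one way to confirm from SQL is to search the PD logs for leader-change records in that window; if nothing shows up around 10:23, the PD leader most likely stayed put. This is only a sketch: the date is a placeholder and the LIKE patterns are guesses at typical PD log wording, so adjust both as needed.

```sql
-- Look for PD leader election / leader change records around the incident.
-- Placeholder date and guessed keyword patterns; adjust to your environment.
SELECT time, instance, message
FROM information_schema.cluster_log
WHERE type = 'pd'
  AND time BETWEEN '2023-08-01 10:00:00' AND '2023-08-01 10:30:00'
  AND (message LIKE '%leader%change%' OR message LIKE '%campaign%');
```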
Well, I have checked it, and the ping is normal. The PD has not switched, and the leader is still on that server, so I am quite confused as to why it suddenly reported a PD server timeout.
I suspect that it might be because one of the PD followers is deployed on the same server as HAProxy, and the high request volume to HAProxy is causing the issue. I remember seeing somewhere that in version 6.5, TSO requests can be obtained from PD followers. I looked for related articles yesterday but couldn’t find any.
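The feature that sounds closest to that is the TSO Follower Proxy, exposed since v5.3.0 as the system variable tidb_enable_tso_follower_proxy (off by default); when it is on, TiDB's TSO requests can be sent to PD followers, which forward them to the leader in batches. A quick check, with the SET line commented out as only a sketch of how it would be turned on:

```sql
-- Check whether TSO Follower Proxy is enabled (default: OFF).
SHOW VARIABLES LIKE 'tidb_enable_tso_follower_proxy';

-- Only worth enabling if the PD leader is genuinely overloaded by TSO traffic:
-- SET GLOBAL tidb_enable_tso_follower_proxy = ON;
```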
tidb_tso_client_batch_max_wait_time

Introduced in v5.3.0

- Scope: GLOBAL
- Persisted to cluster: Yes
- Type: Float
- Default value: 0
- Range: [0, 10]
- Unit: Milliseconds
- This variable sets the maximum time TiDB waits when batching TSO requests to PD. The default value of 0 means no extra waiting.
- When requesting TSO from PD, the PD Client used by TiDB collects as many TSO requests received at the same time as possible, merges them into one RPC request, and sends it to PD, thereby reducing the pressure on PD.
- When this variable is set to a non-zero value, TiDB waits up to that duration at the end of each batch in order to collect more TSO requests and improve batching efficiency.
- Scenarios where increasing this value is appropriate:
  - The PD leader reaches a CPU bottleneck due to high-pressure TSO requests, resulting in high TSO RPC latency.
  - The cluster does not have many TiDB instances, but each instance runs at high concurrency.
- In actual use, it is recommended to keep this variable as small as possible.

Note
If the PD leader's TSO RPC latency increases for reasons other than a CPU bottleneck (for example, network problems), increasing tidb_tso_client_batch_max_wait_time may increase statement execution latency in TiDB and affect the cluster's QPS.
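For reference, a quick way to check the current value and, if ever needed, try a small bump (a sketch; keep the Note above in mind, since raising it when PD CPU is not the bottleneck can hurt statement latency):

```sql
-- Check the current batch wait setting (default 0 = no extra waiting).
SHOW VARIABLES LIKE 'tidb_tso_client_batch_max_wait_time';

-- Example of a small, conservative bump in milliseconds (commented out);
-- revert with SET GLOBAL tidb_tso_client_batch_max_wait_time = 0.
-- SET GLOBAL tidb_tso_client_batch_max_wait_time = 1;
```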
It seems so. I just checked, and this is off, so it shouldn’t be the issue.
There was no switch, it has always been on that server.
This value hasn’t been changed, it’s 0.
When transferring the leader, there was one timeout, from 4 to 11 minutes; after a retry, the leader transfer succeeded. Is this a network issue or some kind of blocking?
Based on past experience, transferring a region leader only causes retries, not timeouts. The network shouldn't be the issue either; the blackbox_exporter ping latency is normal.
[Screenshot attachment; not translated]
Something is blocked, but I'm not sure what is causing the blockage. The cluster load looks normal: only a few select statements are reporting errors, everything else is fine, and the overall QPS hasn't dropped. At the time, QPS was around 80k/s.
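Since only a few selects failed while overall QPS stayed around 80k/s, it may be worth checking whether other statements in that window also spent unusually long waiting for TSO even if they did not time out. A sketch against the slow query table, assuming the placeholder date is replaced and that this version's table has the Wait_TS column (the time spent waiting for the start timestamp):

```sql
-- Find statements around the incident that waited a long time for TSO.
-- The date below is a placeholder; replace it with the actual day.
SELECT time, instance, query_time, wait_ts, LEFT(query, 80) AS query_prefix
FROM information_schema.cluster_slow_query
WHERE time BETWEEN '2023-08-01 10:20:00' AND '2023-08-01 10:30:00'
ORDER BY wait_ts DESC
LIMIT 20;
```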