PD's 99th-percentile completed_cmds_duration_seconds for TSO reaches 10 seconds

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: PD 的99% completed_cmds_duration_seconds TSO达到10S

| username: TiDBer_wX9akOFm

[TiDB Usage Environment] Production Environment
[TiDB Version] v7.5.0
[Reproduction Path] None
[Encountered Problem: Phenomenon and Impact] PD's 99th-percentile completed_cmds_duration_seconds metric for TSO repeatedly spikes to 10 seconds, and most of the time it stays above 2 seconds. This metric looks seriously abnormal, and I would like to ask how to optimize it and eliminate the potential risk. The business scenario is a JuiceFS client accessing TiKV directly through the PD address (a latency probe sketch follows below).
[Resource Configuration] Go to TiDB Dashboard - Cluster Info - Hosts and take a screenshot of this page

[Attachments: Screenshots/Logs/Monitoring]
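Since there is no tidb-server in this deployment and JuiceFS reaches TiKV through the TiKV Go client (which embeds the PD client), one way to see whether the 10-second spikes come from PD itself or from the client path is to run a small TSO probe from the JuiceFS host. This is a minimal sketch, assuming the `github.com/tikv/pd/client` package with the v7.x-era constructor signature; the PD endpoints are placeholders, so adjust both to your environment.

```go
// Hypothetical TSO latency probe: request a timestamp from PD once per second
// and print the round-trip time, so spikes can be lined up with the Grafana curve.
package main

import (
	"context"
	"fmt"
	"time"

	pd "github.com/tikv/pd/client"
)

func main() {
	ctx := context.Background()

	// Placeholder PD endpoints; an empty SecurityOption means plaintext gRPC.
	cli, err := pd.NewClientWithContext(ctx,
		[]string{"10.0.0.1:2379", "10.0.0.2:2379", "10.0.0.3:2379"},
		pd.SecurityOption{})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	for {
		start := time.Now()
		_, _, err := cli.GetTS(ctx) // physical and logical parts are ignored here
		if err != nil {
			fmt.Printf("%s GetTS error: %v\n", time.Now().Format(time.RFC3339), err)
		} else {
			fmt.Printf("%s GetTS took %v\n", time.Now().Format(time.RFC3339), time.Since(start))
		}
		time.Sleep(time.Second)
	}
}
```

If GetTS from the application host is consistently fast while the Grafana panel still spikes, the cause is more likely on the PD server side (leader changes, slow disk on the PD data directory, CPU saturation) than in the network between JuiceFS and PD.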


| username: Jasper | Original post link

How is the network latency between cluster components? You can check the ping latency using blackbox-exporter monitoring.
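As a rough, dependency-free complement to the blackbox-exporter ping panels, a sketch like the one below could be run from the application host to measure TCP connect latency to each PD client port; the endpoints are placeholders for the three PD nodes in this thread.

```go
// Measure TCP connect latency from the application host to each PD client port.
package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	// Placeholder addresses for the three PD nodes.
	pdEndpoints := []string{"10.0.0.1:2379", "10.0.0.2:2379", "10.0.0.3:2379"}

	for {
		for _, addr := range pdEndpoints {
			start := time.Now()
			conn, err := net.DialTimeout("tcp", addr, 2*time.Second)
			if err != nil {
				fmt.Printf("%s %s connect error: %v\n", time.Now().Format(time.RFC3339), addr, err)
				continue
			}
			conn.Close()
			fmt.Printf("%s %s connect took %v\n", time.Now().Format(time.RFC3339), addr, time.Since(start))
		}
		time.Sleep(time.Second)
	}
}
```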

| username: TiDBer_wX9akOFm | Original post link

The monitoring for the corresponding time periods on the six nodes is shown below; the maximum value, on node 25, is close to 5 ms.






| username: TiDBer_wX9akOFm | Original post link

The ping latency doesn’t show any obvious correlation with the TSO monitoring curve.

| username: WalterWj | Original post link

So you are not using tidb-server? Is the client seeing packet loss to PD? How is the network ping latency?

| username: yytest | Original post link

Add PD nodes: If resources permit, consider adding PD nodes to improve the cluster’s processing capacity.
Optimize PD configuration: Adjust the PD configuration according to TiDB’s official documentation and best practices.
Upgrade hardware: If the resources of the PD nodes are insufficient, consider upgrading the hardware.
Optimize client access patterns: Collaborate with the developers of the JuiceFS client to optimize access patterns and reduce the pressure on PD.
Monitoring and analysis: Continuously monitor PD and TiKV performance metrics and analyze any anomalies or bottlenecks (a query sketch follows below).
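For the monitoring item above, one option is to pull the 99th-percentile curve straight from Prometheus and line the spike timestamps up with JuiceFS-side logs. This is only a sketch: the Prometheus address is a placeholder, and the `completed_cmds_duration_seconds_bucket` series name and `type="tso"` label are guesses taken from the panel title; check the panel's actual PromQL in Grafana (panel edit view) and substitute it.

```go
// Query Prometheus for the 99th-percentile TSO command duration over the last 6 hours.
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"time"
)

func main() {
	// histogram_quantile over the per-le rate is the standard way to rebuild a
	// "99% ... duration" panel outside Grafana. Series name and label are assumed.
	query := `histogram_quantile(0.99, sum(rate(completed_cmds_duration_seconds_bucket{type="tso"}[1m])) by (le))`

	end := time.Now()
	start := end.Add(-6 * time.Hour)

	params := url.Values{}
	params.Set("query", query)
	params.Set("start", fmt.Sprintf("%d", start.Unix()))
	params.Set("end", fmt.Sprintf("%d", end.Unix()))
	params.Set("step", "60")

	// Placeholder address for the cluster's Prometheus instance.
	resp, err := http.Get("http://10.0.0.4:9090/api/v1/query_range?" + params.Encode())
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
}
```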

| username: TiDBer_wX9akOFm | Original post link

Yes, no TiDB nodes are used. There is no packet loss when pinging, and the latency varies with where each application sits relative to the PD nodes. Below are the latency results of three applications each pinging the three PD nodes:



| username: zhh_912 | Original post link

Judging from the image in the post, this should not be a network issue.

| username: TiDBer_wX9akOFm | Original post link

It doesn’t seem to be a network issue, but the TSO duration still keeps reaching 10 seconds, with no regular pattern in when it happens.