Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.
Original topic: Get Timestamp too slow
[TiDB Usage Environment] Production environment, dedicated physical servers with NVMe storage, newly deployed via K8s, 1 TB of initial data imported, TiKV + TiFlash architecture, no business traffic yet.
[TiDB Version] 6.1.0
[Reproduction Path] Initial data has been imported, not yet providing external services, resources are idle
Cluster status is normal and the cluster can be accessed normally
[Encountered Problem]: TiDB client reports error Get Timestamp too slow
TiDB server logs are as follows:
PD log screenshot
Dashboard screenshot
TiFlash log screenshot
Questions: What is the impact of the clock offset? How can the root cause of Get Timestamp too slow be located specifically through the dashboard? The TiDB Grafana cluster-node page is empty, with no monitoring information.
If additional information is needed, it can be supplemented later in the post.
cluster-tidb-2023-02-03-17_34_28.pdf (6.6 MB)
The issue is probably on the PD side. Check the network and CPU of the PD nodes. Is anything else deployed together with PD?
No mixed deployment; these are dedicated servers in a new environment. About the clock offset: what exactly does “jet-lag” mean?
My personal understanding is that TSOs are obtained from PD in batches, not one at a time. The clock offset here refers to the window within which a batch of TSOs is generated, which is 50 ms by default; that window is the clock offset.
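To make the batching idea concrete, here is a minimal Go sketch (an illustration of the concept only, not the actual pd client code): concurrent requests wait on a queue, get drained into one batch, and are all answered by a single round trip, so one slow round trip delays every waiter in that batch. I believe the 50 ms default mentioned above corresponds to PD's tso-update-physical-interval setting, but please verify against your PD config.

```go
// Simplified illustration only (not the actual pd client code): concurrent
// TSO requests are queued, drained into one batch, and satisfied by a single
// round trip, so one slow round trip delays the whole batch.
package main

import (
	"fmt"
	"sync"
	"time"
)

type tsoRequest struct {
	done chan uint64 // receives the allocated timestamp
}

// batchLoop drains whatever requests are currently pending, performs one
// "RPC" for the whole batch, then hands a timestamp back to each waiter.
func batchLoop(reqCh <-chan *tsoRequest, fetchBatch func(n int) []uint64) {
	for first := range reqCh {
		batch := []*tsoRequest{first}
	drain:
		for {
			select {
			case r := <-reqCh:
				batch = append(batch, r)
			default:
				break drain
			}
		}
		ts := fetchBatch(len(batch)) // one round trip covers the whole batch
		for i, r := range batch {
			r.done <- ts[i]
		}
	}
}

func main() {
	reqCh := make(chan *tsoRequest, 1024)
	var logical uint64
	// Stand-in for the PD RPC: allocate n consecutive logical timestamps.
	fetch := func(n int) []uint64 {
		time.Sleep(2 * time.Millisecond) // pretend network + PD handle time
		out := make([]uint64, n)
		for i := range out {
			logical++
			out[i] = logical
		}
		return out
	}
	go batchLoop(reqCh, fetch)

	var wg sync.WaitGroup
	for i := 0; i < 5; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			r := &tsoRequest{done: make(chan uint64, 1)}
			reqCh <- r
			fmt.Printf("request %d got ts %d\n", id, <-r.done)
		}(i)
	}
	wg.Wait()
}
```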
What does the time in jet-lag=208.152004ms refer to? Normally I don’t see this information logged frequently.
The PD server system resources are idle. Can you look into it more specifically to see where the issue might be?
Post everything that can be posted.
Have the PD nodes and TiDB server performed clock synchronization?
Check whether the CPU usage of the TiDB server is very high. High TiDB CPU usage can also cause TSO to be slow, because after PD returns, the TiDB goroutine may not be scheduled in time to pick up the TSO result.
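A minimal sketch of that scheduling effect, independent of TiDB internals: it measures how long a waiting goroutine takes to resume once its result is ready, first on an idle scheduler and then with every CPU saturated by busy loops; the second measurement will usually come out larger.

```go
// Measure how long a waiting goroutine takes to run again after its result
// is ready, on an idle scheduler versus a CPU-saturated one.
package main

import (
	"fmt"
	"runtime"
	"time"
)

// measureWakeupLag returns the delay between the moment a result was sent
// and the moment the waiting goroutine actually resumed to read it.
func measureWakeupLag() time.Duration {
	ch := make(chan time.Time, 1)
	go func() {
		ch <- time.Now() // the "response" becomes available at this instant
	}()
	sent := <-ch
	return time.Since(sent)
}

func main() {
	fmt.Println("idle wake-up lag:", measureWakeupLag())

	// Saturate the CPUs with busy loops to mimic an overloaded tidb-server.
	for i := 0; i < runtime.NumCPU()*4; i++ {
		go func() {
			for {
			}
		}()
	}
	time.Sleep(100 * time.Millisecond)
	fmt.Println("busy wake-up lag:", measureWakeupLag())
}
```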
If you need any information, please let me know. Let’s take a look at the issue together. The cluster-TiDB dashboard has already been posted, and the key TSO metrics have been attached.
cluster-tidb-2023-02-03-17_34_28.pdf (6.6 MB)
Aside from this log, what symptoms is the cluster showing now? What problem has it actually caused?
This is a new environment, and these issues might lead to other problems later, so it’s better to resolve them in advance. TiFlash has also logged some PD-related errors; the specific details were posted earlier.
As mentioned earlier, new environment, no pressure, no load.
Yes, the environment is initialized, but the service is not provided.
PD Client CMD Duration
I found that the PD Client CMD Duration is quite long, almost equal to the latency time, and the CMD ops are very few, mainly scan_regions. How can I further investigate this issue?
Check whether the network latency between the TiDB server node and the PD node is high.
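One quick way to check that from the tidb-server host is a repeated timed TCP dial to PD's client port, as in the rough Go sketch below; the PD address in it is a hypothetical placeholder to replace with your own pd-server client URL, and a plain ping between the nodes works just as well.

```go
// Rough latency probe, run from the tidb-server host: time several TCP dials
// to PD's client port.
package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	const pdAddr = "10.0.0.1:2379" // hypothetical PD client address, replace with yours

	for i := 0; i < 5; i++ {
		start := time.Now()
		conn, err := net.DialTimeout("tcp", pdAddr, 2*time.Second)
		if err != nil {
			fmt.Println("dial failed:", err)
			return
		}
		fmt.Printf("dial %d took %v\n", i, time.Since(start))
		conn.Close()
		time.Sleep(200 * time.Millisecond)
	}
}
```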
Take a look at these two panels:
PD → TiDB → PD server TSO handle time
PD → TiDB → Handle requests duration
In the following graph, the TSO wait duration on the TiDB side is close to the total PD-plus-network time, indicating that most of the time is consumed on the PD and network side.
A larger CMD duration indicates that the PD Client on the PD side is taking a longer time to process, as shown in the figure below:
CMD duration screenshot
The PD server TSO handle time is normal.
Based on the above analysis, can we rule out network issues? How can we further analyze this?
The PD client is on the TiDB server, not on the PD server.
The PD server TSO handle time is the time taken by the PD server to process the TSO. Check the monitoring for the same time period and align the times. If the PD server TSO handle time is very short, it indicates that the issue is with the network.
PD TSO RPC duration = network time + PD TSO processing time
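For example (hypothetical numbers): if the PD TSO RPC duration seen from TiDB is around 2 ms while the PD server TSO handle time in the same period is only about 0.1 ms, then roughly 1.9 ms is being spent on the network and RPC path rather than inside PD.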