Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.
Original topic: Get Timestamp too slow
[TiDB Usage Environment] Production environment, dedicated physical servers with NVMe storage, newly deployed via K8s, 1 TB of initial data imported, TiKV + TiFlash architecture, no business traffic yet.
[TiDB Version] 6.1.0
[Reproduction Path] Initial data has been imported, not yet providing external services, resources are idle
Cluster status is normal and the cluster can be accessed normally
[Encountered Problem]: TiDB client reports error Get Timestamp too slow
TiDB server logs are as follows:
PD log screenshot
Dashboard screenshot
TiFlash log screenshot
Questions: What is the impact of the clock offset? How can the root cause of Get Timestamp too slow be located specifically through the dashboard? The TiDB Grafana cluster-node page is empty, with no monitoring information.
If additional information is needed, it can be supplemented later in the post.
cluster-tidb-2023-02-03-17_34_28.pdf (6.6 MB)
The issue is probably on the PD side. Check the network and CPU of the PD nodes. Is anything else deployed together with PD?
No mixed deployment; these are dedicated servers in a new environment. About the clock offset: what exactly does “jet-lag” mean?
My personal understanding is that TSOs are obtained from PD in batches, not one at a time. The clock offset here refers to the window within which a batch of TSOs is generated, which is 50 ms by default; that window is the clock offset.
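To make the batching idea concrete, here is a minimal Go sketch (an illustration of the concept only, not the actual pd client code): concurrent requests wait on a queue, get drained into one batch, and are all answered by a single round trip, so one slow round trip delays every waiter in that batch. I believe the 50 ms default mentioned above corresponds to PD's tso-update-physical-interval setting, but please verify against your PD config.

```go
// Simplified illustration only (not the actual pd client code): concurrent
// TSO requests are queued, drained into one batch, and satisfied by a single
// round trip, so one slow round trip delays the whole batch.
package main

import (
	"fmt"
	"sync"
	"time"
)

type tsoRequest struct {
	done chan uint64 // receives the allocated timestamp
}

// batchLoop drains whatever requests are currently pending, performs one
// "RPC" for the whole batch, then hands a timestamp back to each waiter.
func batchLoop(reqCh <-chan *tsoRequest, fetchBatch func(n int) []uint64) {
	for first := range reqCh {
		batch := []*tsoRequest{first}
	drain:
		for {
			select {
			case r := <-reqCh:
				batch = append(batch, r)
			default:
				break drain
			}
		}
		ts := fetchBatch(len(batch)) // one round trip covers the whole batch
		for i, r := range batch {
			r.done <- ts[i]
		}
	}
}

func main() {
	reqCh := make(chan *tsoRequest, 1024)
	var logical uint64
	// Stand-in for the PD RPC: allocate n consecutive logical timestamps.
	fetch := func(n int) []uint64 {
		time.Sleep(2 * time.Millisecond) // pretend network + PD handle time
		out := make([]uint64, n)
		for i := range out {
			logical++
			out[i] = logical
		}
		return out
	}
	go batchLoop(reqCh, fetch)

	var wg sync.WaitGroup
	for i := 0; i < 5; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			r := &tsoRequest{done: make(chan uint64, 1)}
			reqCh <- r
			fmt.Printf("request %d got ts %d\n", id, <-r.done)
		}(i)
	}
	wg.Wait()
}
```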
What does the time in jet-lag=208.152004ms refer to? Normally I don’t see this information logged frequently.
The PD server system resources are idle. Can you look into it more specifically to see where the issue might be?
Post everything that can be posted.
Have the PD nodes and TiDB server performed clock synchronization?
Check whether the CPU usage of the TiDB server is very high. High TiDB CPU usage can also cause TSO to be slow, because after PD returns, the TiDB goroutine may not be scheduled in time to pick up the TSO result.
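A minimal sketch of that scheduling effect, independent of TiDB internals: it measures how long a waiting goroutine takes to resume once its result is ready, first on an idle scheduler and then with every CPU saturated by busy loops; the second measurement will usually come out larger.

```go
// Measure how long a waiting goroutine takes to run again after its result
// is ready, on an idle scheduler versus a CPU-saturated one.
package main

import (
	"fmt"
	"runtime"
	"time"
)

// measureWakeupLag returns the delay between the moment a result was sent
// and the moment the waiting goroutine actually resumed to read it.
func measureWakeupLag() time.Duration {
	ch := make(chan time.Time, 1)
	go func() {
		ch <- time.Now() // the "response" becomes available at this instant
	}()
	sent := <-ch
	return time.Since(sent)
}

func main() {
	fmt.Println("idle wake-up lag:", measureWakeupLag())

	// Saturate the CPUs with busy loops to mimic an overloaded tidb-server.
	for i := 0; i < runtime.NumCPU()*4; i++ {
		go func() {
			for {
			}
		}()
	}
	time.Sleep(100 * time.Millisecond)
	fmt.Println("busy wake-up lag:", measureWakeupLag())
}
```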
If you need any information, please let me know. Let’s take a look at the issue together. The cluster-TiDB dashboard has already been posted, and the key TSO metrics have been attached.
cluster-tidb-2023-02-03-17_34_28.pdf (6.6 MB)
Aside from this log, what symptoms is the cluster showing now? What problem has it actually caused?
This is a new environment, and these issues might lead to other problems later, so it’s better to resolve them in advance. TiFlash has also logged some PD-related errors; the specific details were posted earlier.
As mentioned earlier, new environment, no pressure, no load.
Yes, the environment is initialized, but the service is not provided.
PD Client CMD Duration
I found that the PD Client CMD Duration is quite long, almost equal to the latency time, and the CMD ops are very few, mainly scan_regions. How can I further investigate this issue?
Check whether the network latency between the TiDB server node and the PD node is high.
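One quick way to check that from the tidb-server host is a repeated timed TCP dial to PD's client port, as in the rough Go sketch below; the PD address in it is a hypothetical placeholder to replace with your own pd-server client URL, and a plain ping between the nodes works just as well.

```go
// Rough latency probe, run from the tidb-server host: time several TCP dials
// to PD's client port.
package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	const pdAddr = "10.0.0.1:2379" // hypothetical PD client address, replace with yours

	for i := 0; i < 5; i++ {
		start := time.Now()
		conn, err := net.DialTimeout("tcp", pdAddr, 2*time.Second)
		if err != nil {
			fmt.Println("dial failed:", err)
			return
		}
		fmt.Printf("dial %d took %v\n", i, time.Since(start))
		conn.Close()
		time.Sleep(200 * time.Millisecond)
	}
}
```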
Take a look at these two panels:
PD → TiDB → PD server TSO handle time
PD → TiDB → Handle requests duration
In the following graph, the TSO wait duration on the TiDB side is close to the total PD-plus-network time, indicating that most of the time is consumed on the PD and network side.
A larger CMD duration indicates that the PD Client on the PD side is taking a longer time to process, as shown in the figure below:
CMD duration screenshot
The PD server TSO handle time is normal.
Based on the above analysis, can we rule out network issues? How can we further analyze this?
The PD client is on the TiDB server, not on the PD server.
The PD server TSO handle time is the time taken by the PD server to process the TSO. Check the monitoring for the same time period and align the times. If the PD server TSO handle time is very short, it indicates that the issue is with the network.
PD TSO RPC duration = network time + PD TSO processing time
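For example (hypothetical numbers): if the PD TSO RPC duration seen from TiDB is around 2 ms while the PD server TSO handle time in the same period is only about 0.1 ms, then roughly 1.9 ms is being spent on the network and RPC path rather than inside PD.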