TiDB Node Frequently Hangs

【TiDB Usage Environment】Production Environment
【TiDB Version】v5.4.1
【Encountered Issue: Problem Phenomenon and Impact】
Our environment is a two-city, three-center setup, with two data centers in Beijing and one in Shenyang. Currently, whenever a developer's task lands on the TiDB node in the Shunyi data center in Beijing, that node hangs and only recovers after a restart.

In this image, 10.122 is Shenyang, and 10.194 is Shunyi. The duration for the Shunyi node is very high.

The Shunyi node is also high here.

The duration for accessing a certain TiKV node is extremely high. This TiKV node is in another data center in Beijing.

We have optimized the tasks run by the developers:

  • Split large transactions
  • Optimized SQL
  • Shuffled the keys of the tables

Please help us identify other areas we can work on. If the information is insufficient, what additional information do we need to provide?
Thank you all.

【Resource Configuration】
【Attachments: Screenshots/Logs/Monitoring】

Collect the logs of TiDB nodes, especially before and after the hang. Pay special attention to slow queries, as well as memory and CPU consumption.
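As a rough illustration of the slow-query check, the sketch below scans slow-log lines and lists the slowest statements first. It assumes each slow-log entry carries `# Time:` and `# Query_time:` comment lines; the function name and the threshold argument are hypothetical choices made for this example.

```python
import re

# Assumed slow-log markers: "# Time: <timestamp>" and "# Query_time: <seconds>".
TIME_RE = re.compile(r"^# Time: (\S+)")
QUERY_TIME_RE = re.compile(r"^# Query_time: ([0-9.]+)")


def slow_entries(lines, min_seconds):
    """Return (timestamp, query_time) pairs at or above min_seconds, slowest first."""
    entries = []
    current_time = None
    for line in lines:
        m = TIME_RE.match(line)
        if m:
            # Remember the timestamp of the entry we are currently inside.
            current_time = m.group(1)
            continue
        m = QUERY_TIME_RE.match(line)
        if m and float(m.group(1)) >= min_seconds:
            entries.append((current_time, float(m.group(1))))
    return sorted(entries, key=lambda e: e[1], reverse=True)


# Hypothetical usage (the file name is an assumption):
# with open("tidb_slow_query.log", encoding="utf-8", errors="replace") as f:
#     for ts, seconds in slow_entries(f, min_seconds=1.0)[:20]:
#         print(ts, seconds)
```

Comparing the timestamps of the slowest entries with the time the node hung can narrow down which statements to look at first.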

Additionally, you can take a snapshot of the profile to help identify the issue.
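A minimal sketch of taking such a snapshot, assuming the TiDB status port (10080 by default) is reachable and serves the standard Go pprof endpoints under `/debug/pprof`. The helper names, the output file naming, and the 30-second CPU sampling window are hypothetical choices for this example.

```python
import time
from urllib.parse import urlencode
from urllib.request import urlopen


def pprof_urls(host, status_port=10080, cpu_seconds=30):
    """Build URLs for a CPU profile, a heap profile, and a full goroutine dump."""
    base = f"http://{host}:{status_port}/debug/pprof"
    return {
        # The CPU profile samples for `cpu_seconds` before the response returns.
        "cpu": f"{base}/profile?{urlencode({'seconds': cpu_seconds})}",
        "heap": f"{base}/heap",
        # debug=2 asks for a plain-text dump of every goroutine stack.
        "goroutine": f"{base}/goroutine?{urlencode({'debug': 2})}",
    }


def save_profiles(host, out_prefix, status_port=10080, cpu_seconds=30):
    """Download each profile to a timestamped file and return the file paths."""
    stamp = time.strftime("%Y%m%d-%H%M%S")
    saved = []
    for name, url in pprof_urls(host, status_port, cpu_seconds).items():
        path = f"{out_prefix}-{name}-{stamp}.out"
        # Allow extra time beyond the CPU sampling window.
        with urlopen(url, timeout=cpu_seconds + 30) as resp, open(path, "wb") as f:
            f.write(resp.read())
        saved.append(path)
    return saved


# Hypothetical usage while the node is slow (replace the host with the real address):
# save_profiles("<tidb-host>", "shunyi-tidb")
```

The goroutine dump is plain text and can be read directly; the CPU and heap files can be opened with `go tool pprof`.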

If that is too much trouble, you can also use the official diagnostic tool.

This thread will stay open for the long term. For now, after optimizing some SQL, the issue has been alleviated.

Thank you all!

In the past two days, a certain TiDB node has experienced high duration again, as shown in the figure below:

I have attached the TiDB log for this node.

  • TiDB service status monitoring port
  • Default: “10080”
  • This port is used to display internal TiDB data, including Prometheus metrics and pprof
  • Prometheus metrics can be accessed via http://host:status_port/metrics
  • pprof data can be accessed via http://host:status_port/debug/pprof

How can I capture data for a specific time period for the following?

  • Prometheus metrics can be accessed via http://host:status_port/metrics
  • pprof data can be accessed via http://host:status_port/debug/pprof
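
One possible approach, sketched under assumptions: the `/metrics` endpoint returns only the current values, so covering a time window means polling it and keeping timestamped snapshots. The function names, the `tidb_server_` prefix, and the 15-second interval below are hypothetical choices. If the cluster's Prometheus already scrapes this endpoint, past windows are usually easier to read from Prometheus or Grafana instead.

```python
import time
from urllib.request import urlopen


def filter_metrics(text, prefix):
    """Keep sample lines whose metric name starts with `prefix`; skip comments."""
    return [
        line
        for line in text.splitlines()
        if line and not line.startswith("#") and line.startswith(prefix)
    ]


def capture_window(host, prefix, duration_s, interval_s=15, status_port=10080):
    """Poll /metrics for `duration_s` seconds; return (unix_time, lines) snapshots."""
    url = f"http://{host}:{status_port}/metrics"
    deadline = time.time() + duration_s
    snapshots = []
    while time.time() < deadline:
        with urlopen(url, timeout=10) as resp:
            text = resp.read().decode("utf-8", errors="replace")
        snapshots.append((time.time(), filter_metrics(text, prefix)))
        time.sleep(interval_s)
    return snapshots


# Hypothetical usage: record ten minutes of server metrics during the slow period.
# snapshots = capture_window("<tidb-host>", "tidb_server_", duration_s=600)
```

For pprof, a CPU profile already covers a window: the `seconds` query parameter on `/debug/pprof/profile` sets how long the profiler samples.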
Why do you have three cities? The latency between cities is very serious.

This is a manual tool; it only shows the state at the moment you query it.
You can refer to the documentation:


This is an automatic tool…


It is two cities, three centers.

For a two-city, three-center architecture, you should keep the PD leader and Region leaders primarily in the city that has two data centers, to minimize network impact. I’m not sure if your cluster is configured this way. Refer to the documentation:

The PD leader and region leader are both in Beijing, and relevant parameters have been adjusted as follows:

    binlog.enable: true
    binlog.ignore-error: false
    log.query-log-max-len: 12288
    log.slow-threshold: 300
    mem-quota-query: 524288000
    new_collations_enabled_on_first_bootstrap: true
    oom-action: cancel
    performance.server-memory-quota: 17179869184
    readpool.coprocessor.use-unified-pool: true
    readpool.storage.use-unified-pool: true
    readpool.unified.max-thread-count: 10
    server.concurrent-recv-snap-limit: 64
    server.concurrent-send-snap-limit: 64
    server.grpc-compression-type: gzip
    server.grpc-keepalive-time: 120s
    server.grpc-keepalive-timeout: 120s
    server.grpc-raft-conn-num: 16
    storage.block-cache.capacity: 40GB
        - key: dc
          value: shenyang
    replication.enable-placement-rules: true
    replication.isolation-level: zone
    replication.location-labels:
      - dc
      - zone
      - rack
      - host
    replication.max-replicas: 5
    schedule.tolerant-size-ratio: 20.0

What is the memory usage like when it hangs? Also, check if overcommit_memory is set to 0 or 1. If it is set to 0, it won’t actively kill the process, so it will hang.

It happened again today (04-08). Grafana shows that memory usage is not high. We have not configured the overcommit_memory parameter, as shown below:

[root@tidbolap-tidb06-194-7-64 ~]# grep ^[^#] /etc/sysctl.conf
net.ipv4.conf.all.accept_redirects = 0
net.ipv4.icmp_echo_ignore_broadcasts = 1
[root@tidbolap-tidb06-194-7-64 ~]#
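
The grep above lists only the settings written explicitly in /etc/sysctl.conf. A parameter that is absent there still has an effective value (0 is the kernel default for vm.overcommit_memory). A minimal sketch for reading the effective value on Linux follows; the function names are hypothetical, and the descriptions paraphrase the kernel's three documented modes.

```python
# Short descriptions of the three documented vm.overcommit_memory modes.
OVERCOMMIT_MODES = {
    0: "heuristic overcommit (kernel default)",
    1: "always overcommit",
    2: "strict accounting, no overcommit beyond the commit limit",
}


def describe_overcommit(raw):
    """Parse the raw file content and return (mode, description)."""
    mode = int(raw.strip())
    return mode, OVERCOMMIT_MODES.get(mode, "unknown mode")


def read_overcommit(path="/proc/sys/vm/overcommit_memory"):
    """Read the effective setting from procfs (Linux only)."""
    with open(path, encoding="ascii") as f:
        return describe_overcommit(f.read())


# Hypothetical usage on the TiDB host:
# mode, meaning = read_overcommit()
# print(f"vm.overcommit_memory = {mode}: {meaning}")
```

Running `sysctl vm.overcommit_memory` on the host reports the same effective value.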

Attached is the TiDB log for this node.

