TiDB Node Frequently Hangs

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tidb节点经常性hang住

| username: qhd2004

【TiDB Usage Environment】Production Environment
【TiDB Version】v5.4.1
【Encountered Issue: Problem Phenomenon and Impact】
Our environment is a two-city, three-center setup, with two data centers in Beijing and one in Shenyang. Currently, when a developer runs a task and it lands on the TiDB node in the Shunyi data center in Beijing, that node hangs and can only be recovered by restarting it.


In this image, 10.122 is Shenyang, and 10.194 is Shunyi. The duration for the Shunyi node is very high.


The Shunyi node is also high here.

The duration for accessing a certain TiKV node is extremely high. This TiKV node is in another data center in Beijing.

We have optimized the tasks run by the developers:

  • Split large transactions
  • Optimized SQL
  • Scattered the table keys to break up hotspots

Please help us identify other areas we can work on. If the information is insufficient, what additional information do we need to provide?
Thank you all.

【Resource Configuration】
【Attachments: Screenshots/Logs/Monitoring】

| username: xfworld | Original post link

Collect the logs of TiDB nodes, especially before and after the hang. Pay special attention to slow queries, as well as memory and CPU consumption.

Additionally, you can take a snapshot of the profile to help identify the issue.

If you find it troublesome, you can also use the official diagnostic tool.
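For reference, a minimal sketch of grabbing such a snapshot from the TiDB status port with curl (assuming the default status port 10080; the host name is a placeholder):

# Dump all goroutine stacks from the hanging TiDB node
curl -o goroutine.txt "http://<tidb-host>:10080/debug/pprof/goroutine?debug=2"
# Grab a heap profile snapshot from the same node
curl -o heap.pb.gz "http://<tidb-host>:10080/debug/pprof/heap"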

| username: qhd2004 | Original post link

Keeping this thread open for the long term. For now, after optimizing some SQL, the issue has been alleviated.

Thank you all!

| username: qhd2004 | Original post link

In the past two days, a certain TiDB node has experienced high duration again, as shown in the figure below:

I have attached the TiDB log for this node.
[tidb_log.tar.gz|attachment] (20.3 MB)

| username: qhd2004 | Original post link

--status

  • TiDB service status monitoring port
  • Default: “10080”
  • This port is used to display internal TiDB data, including Prometheus metrics and pprof
  • Prometheus metrics can be accessed via http://host:status_port/metrics
  • pprof data can be accessed via http://host:status_port/debug/pprof

How can I capture data for a specific time period from these two endpoints? (See the sketch below.)

  • Prometheus metrics: http://host:status_port/metrics
  • pprof data: http://host:status_port/debug/pprof
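As a rough sketch of covering a specific time window with those two endpoints (host and port are placeholders; ?seconds= is the standard Go pprof way to sample over an interval):

# Sample a 30-second CPU profile starting now
curl -o cpu.pb.gz "http://<tidb-host>:10080/debug/pprof/profile?seconds=30"
# Scrape the Prometheus metrics once a minute for an hour
for i in $(seq 1 60); do
  curl -s "http://<tidb-host>:10080/metrics" > metrics_$(date +%H%M%S).txt
  sleep 60
done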
| username: tidb狂热爱好者 | Original post link

Why are you spread across three cities? Cross-city latency is a serious problem.

| username: xfworld | Original post link

This is a manual tool that can only capture data at the current moment.
You can refer to the documentation:
pprof

grafana

This is an automatic tool…

Clinic
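If you go the Clinic route, the collection looks roughly like this (a sketch only; the cluster name and time range are placeholders, and the exact flags should be checked against the PingCAP Clinic docs):

# Install the diag component, then collect diagnostic data for a given time window
tiup install diag
tiup diag collect <cluster-name> -f "2022-12-09T04:00:00+08:00" -t "2022-12-09T08:00:00+08:00"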

| username: qhd2004 | Original post link

It is two cities, three centers.

| username: 我是咖啡哥 | Original post link

For a two-city, three-center architecture, you should configure the PD leader and Region leaders to be concentrated in one of the two cities to minimize the cross-city network impact. I'm not sure if your cluster is configured this way. Refer to the documentation.
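As a rough sketch of what that can look like with pd-ctl (member names, store IDs, and the PD address are placeholders; the same effect can also be achieved through the cluster config, as the next reply shows):

# Keep the PD leader in Beijing by giving its members a higher priority
tiup ctl:v5.4.1 pd -u http://<pd-host>:2379 member leader_priority <beijing-pd-name> 5
tiup ctl:v5.4.1 pd -u http://<pd-host>:2379 member leader_priority <shenyang-pd-name> 1
# If needed, evict Region leaders from a Shenyang TiKV store
tiup ctl:v5.4.1 pd -u http://<pd-host>:2379 scheduler add evict-leader-scheduler <shenyang-store-id>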

| username: qhd2004 | Original post link

The PD leader and Region leaders are both in Beijing, and the relevant parameters have been adjusted as follows:

server_configs:
  tidb:
    binlog.enable: true
    binlog.ignore-error: false
    log.query-log-max-len: 12288
    log.slow-threshold: 300
    mem-quota-query: 524288000
    new_collations_enabled_on_first_bootstrap: true
    oom-action: cancel
    performance.server-memory-quota: 17179869184
  tikv:
    readpool.coprocessor.use-unified-pool: true
    readpool.storage.use-unified-pool: true
    readpool.unified.max-thread-count: 10
    server.concurrent-recv-snap-limit: 64
    server.concurrent-send-snap-limit: 64
    server.grpc-compression-type: gzip
    server.grpc-keepalive-time: 120s
    server.grpc-keepalive-timeout: 120s
    server.grpc-raft-conn-num: 16
    storage.block-cache.capacity: 40GB
  pd:
    label-property:
      reject-leader:
        - key: dc
          value: shenyang
    replication.enable-placement-rules: true
    replication.isolation-level: zone
    replication.location-labels:
      - dc
      - zone
      - rack
      - host
    replication.max-replicas: 5
    schedule.tolerant-size-ratio: 20.0

| username: Lucien-卢西恩 | Original post link

What is the memory usage like when it hangs? Also, check whether vm.overcommit_memory is set to 0 or 1. If it is set to 0, the kernel won't actively kill the process, so it just hangs.
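A quick way to check the current value on the TiDB hosts (standard Linux commands; the effective value comes from the kernel, not just /etc/sysctl.conf):

# 0 = heuristic overcommit (default), 1 = always overcommit, 2 = never overcommit
cat /proc/sys/vm/overcommit_memory
sysctl vm.overcommit_memory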

| username: qhd2004 | Original post link

Today it happened again between 04:00 and 08:00. From Grafana, the memory usage is not high. As for the overcommit_memory parameter, we have not configured it, as shown below:

[root@tidbolap-tidb06-194-7-64 ~]# grep ^[^#] /etc/sysctl.conf
net.ipv4.conf.all.accept_redirects = 0
net.ipv4.icmp_echo_ignore_broadcasts = 1
[root@tidbolap-tidb06-194-7-64 ~]#

Attached is the TiDB log for this node.
[tidb_log_20221209.tar.gz|attachment] (7.5 MB)

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.