TiDB Primary Cluster OOM Memory Overflow in Production Environment

This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 生产环境tidb主库集群OOM内存溢出

| username: vcdog

【TiDB Usage Environment】Production Environment
【TiDB Version】v6.5.0
【Reproduction Path】Around 10:00 today, the memory of one tidb-server in the cluster suddenly filled up, and after 3 minutes, an OOM occurred, and the instance automatically restarted.
【Encountered Problem: Problem Phenomenon and Impact】
The configuration of tidb and tikv is as follows:

    log.level: error
    new_collations_enabled_on_first_bootstrap: true
    performance.max-procs: 30
    performance.txn-total-size-limit: 4221225472
    prepared-plan-cache.enabled: true
    tidb_mem_oom_action: CANCEL
    tidb_mem_quota_query: 1073741824
    tmp-storage-path: /acdata/tidb-memory-cache
    tmp-storage-quota: -1
    coprocessor.region-max-size: 384MB
    coprocessor.region-split-size: 256MB
    raftstore.region-split-check-diff: 196MB
    raftstore.store-pool-size: 2
    readpool.coprocessor.use-unified-pool: true
    readpool.storage.use-unified-pool: false
    readpool.unified.max-thread-count: 24
    server.end-point-concurrency: 16
    server.grpc-concurrency: 28
    storage.reserve-space: 0MB
    storage.scheduler-worker-pool-size: 16

【Resource Configuration】Go to TiDB Dashboard - Cluster Info - Hosts and take a screenshot of this page
【Attachments: Screenshots/Logs/Monitoring】

| username: vcdog | Original post link

Error log attachment:
tidb_error_20240325.log (64.3 KB)

| username: vcdog | Original post link

I checked the dashboard and oom folders, but couldn’t find any slow SQL corresponding to the time when the memory was full. Do you have any good ideas on how to track this?

| username: 小龙虾爱大龙虾 | Original post link

In version 6.5, you can use the Top SQL feature to compare the time points of memory increase in the monitoring, locate the SQL, and then optimize the SQL.

| username: 像风一样的男子 | Original post link

In the dashboard-topsql, select the TiDB node where you encountered OOM and choose the time period. You will be able to see the SQL that consumed the most resources during that time period.

| username: vcdog | Original post link

The SQL queries during this time period were indeed very slow.

| username: vcdog | Original post link

The dashboard feature of TiDB is very powerful and also supports log search.

| username: Jolyne | Original post link

You can search for the keyword “expensive”.

| username: vcdog | Original post link


| username: vcdog | Original post link

Sorry, I can’t assist with that.

| username: 像风一样的男子 | Original post link

Optimize the SQL. Database resource monitoring, TopSQL, and slow queries need to be continuously monitored and addressed daily. You can’t wait until an OOM (Out of Memory) occurs to troubleshoot.

| username: 有猫万事足 | Original post link

Let’s take a look at the overall monitoring. Was the PD leader also on this machine at that time? Before reporting OOM, it was already unable to get TSO.

Please provide the deployment topology and overall memory monitoring.
If it is a mixed deployment, the OOM killer selecting TiDB does not necessarily mean that TiDB’s memory is out of control.

| username: tidb狂热爱好者 | Original post link

OOM is not caused by that specific SQL but by overall high memory usage. You need to address the problem holistically by first resolving the issues with the resources that are consuming the most.

| username: TiDBer_CkS2lbTx | Original post link

I directly set a parameter to kill any SQL query that exceeds 60 seconds, perfectly solved.

| username: zhaokede | Original post link

Does this affect the business?

| username: TIDB-Learner | Original post link

:100: Learned

| username: vcdog | Original post link

The PD, TiKV, and TiDB-server components are all deployed independently, with no shared servers.

| username: vcdog | Original post link

However, I don’t know why the tidb-server logs print PD information.

| username: 有猫万事足 | Original post link

The reason you can’t find “expensive” is because your TiDB log level is set too high.
“expensive” is a warning-level alert.
You can see in your configuration:

log.level: error

| username: DBAER | Original post link

You can check the top SQL and traffic analysis charts in the dashboard, as well as the logs for expensive SQL.