How to Identify Which SQL Query Crashed TiDB Memory

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tidb内存被打崩,如何找到是哪条SQL

| username: huaj1101

[TiDB Usage Environment] Production Environment

[TiDB Version] 6.1.1

[Encountered Problem: Phenomenon and Impact]

One of the TiDB nodes in the production environment suddenly crashed and restarted (the whole episode lasted under 30 seconds). Monitoring did not capture the memory spike in time, but the OOM kill can be seen in the operating system logs. The TiKV and TiFlash nodes showed no anomalies during the same period.

The memory limit for a single SQL statement was set to 20 GB, but it seems to have had no effect.
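
For context, in TiDB 6.1 a per-statement cap like this is normally configured through the `tidb_mem_quota_query` system variable, with `tidb_mem_oom_action` controlling what happens when a statement exceeds it. A minimal sketch of what that setup presumably looks like (the values are illustrative, not taken from the original post):

```sql
-- Cap a single SQL statement at 20 GiB (the value is in bytes).
SET GLOBAL tidb_mem_quota_query = 21474836480;

-- Action when a statement exceeds the quota:
-- CANCEL kills the statement, LOG only writes an expensive-query log entry.
SET GLOBAL tidb_mem_oom_action = 'CANCEL';

-- Verify the current settings.
SHOW VARIABLES LIKE 'tidb_mem_quota_query';
SHOW VARIABLES LIKE 'tidb_mem_oom_action';
```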

Since the SQL never finished executing before the node crashed and restarted, there is no slow query record for it in the Dashboard, and that node's logs did not capture it either. In this situation, how can we identify which SQL caused the crash? This is a production environment with high concurrency, so many SQL statements are running at any given moment.

[Resource Configuration]
16 cores / 64 GB RAM / SSD

[Attachments: Screenshots/Logs/Monitoring]




| username: DBRE | Original post link

You can refer to https://mp.weixin.qq.com/s/KK7GffPmk_rMCRpYz_ZZyg for OOM troubleshooting ideas.

| username: TiDBer_jYQINSnf | Original post link

If concurrency is high, no single query may ever exceed the 20G limit. Set a smaller limit, for example so that anything over 1G gets recorded; that way you capture far more queries. Your machine only has 64G in total, so a handful of concurrent queries can exhaust it long before any single one trips a 20G limit. With a 1G limit, 20 concurrent queries already add up to 20G, and you can set the limit even smaller.

| username: 有猫万事足 | Original post link

Under normal circumstances, this table should have records.
With 64GB of memory and a 20GB per-query limit, if the node crashes instantly before anything can be recorded, I think multiple SQL queries were running concurrently: each one may stay under the 20GB limit individually, but together they can exhaust the node's 64GB.
Therefore, you should reduce concurrency or lower the limit, making it easier to observe which SQL query is causing the problem.

| username: huaj1101 | Original post link

This table was not available in version 6.1.1; it was added in a later version, right?

| username: huaj1101 | Original post link

The memory is 64GB, and the average usage is quite low.

| username: huaj1101 | Original post link

Okay, let’s study it first.

| username: dockerfile | Original post link

I am using version V5 and have the same confusion.

I don’t know which SQL is slowly consuming 100GB of memory, causing TiDB to OOM and then repeating this cycle.

The various SQL logging methods mentioned above seem to require the SQL to complete execution first. The problem with this OOM issue is that the SQL doesn’t finish running before TiDB restarts.

| username: Soysauce520 | Original post link

Check the value of the tidb_memory_usage_alarm_ratio system variable, and look for records in the directory specified by the tmp-storage-path configuration item; there should be information in there.
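
A sketch of how to check both of these from SQL; the exact file names written under that directory vary by version, so treat this as a starting point rather than a guarantee:

```sql
-- Ratio of total memory at which tidb-server writes an alarm record
-- (top memory-consuming SQL, goroutine stacks, etc.) to disk.
SHOW VARIABLES LIKE 'tidb_memory_usage_alarm_ratio';

-- Find the temporary directory configured on each tidb-server;
-- the alarm record files are written under this path.
SELECT instance, `key`, value
FROM information_schema.cluster_config
WHERE type = 'tidb' AND `key` LIKE '%tmp-storage%';
```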

| username: zhanggame1 | Original post link

Probably all you can do is adjust the parameters so that every SQL statement that uses a large amount of memory gets captured.

| username: songxuecheng | Original post link

Check the heatmap to see if the traffic was abnormal at that time. Look for expensive SQL in the logs.

| username: huaj1101 | Original post link

Thanks, I'm analyzing it now. The video DBRE shared also mentions this approach.

| username: TiDBer_jYQINSnf | Original post link

Can't you just lower the limit and restart? If you're worried about affecting the business, set the OOM action to log. Is this happening very rarely? Are you planning to work it out from past clues alone?
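
A sketch of that suggestion using the 6.1 variable names: with the action set to LOG, a statement that crosses the quota is recorded in the TiDB log as an expensive query instead of being cancelled, so running workloads are not interrupted.

```sql
-- Record (rather than kill) any statement that uses more than 1 GiB.
SET GLOBAL tidb_mem_quota_query = 1073741824;
SET GLOBAL tidb_mem_oom_action = 'LOG';
```

Remember to restore the original 20 GB quota and CANCEL action once the culprit has been found.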

| username: huaj1101 | Original post link

Well, the frequency is very low. I suspect it’s caused by some rarely executed SQL. I plan to identify which one and modify it.

| username: Soysauce520 | Original post link

Search the TiDB logs in the Dashboard with the keyword "expensive". Reference: System Variables | PingCAP Docs.
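
If you prefer SQL to the Dashboard UI, the same search can be run against the cluster log table; a sketch, with a placeholder time window that should be replaced by the actual incident window:

```sql
-- Pull expensive-query entries from the TiDB logs around the incident.
-- Replace the date/time range with the real incident window.
SELECT time, instance, message
FROM information_schema.cluster_log
WHERE type = 'tidb'
  AND time BETWEEN '2024-01-01 09:30:00' AND '2024-01-01 09:40:00'
  AND message LIKE '%expensive_query%'
ORDER BY time;
```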

| username: huaj1101 | Original post link

Didn't find anything. By default, a query is only considered expensive after 60 seconds. Everything was still normal at 9:33:30 and the OOM happened at 9:33:51, so the node crashed before the 60-second mark and nothing was recorded.
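
One hedged option for the next occurrence: lower the expensive-query time threshold so a runaway statement gets logged well before the 60-second default. Depending on the exact build, `tidb_expensive_query_time_threshold` may be instance-scoped and need to be set on each tidb-server (or via the `log.expensive-threshold` config item); and with the OOM action set to log, a statement that blows past its memory quota is also recorded there regardless of runtime.

```sql
-- Log statements as "expensive" after 10 seconds instead of 60.
-- (May be instance-scoped: apply on every tidb-server if needed.)
SET GLOBAL tidb_expensive_query_time_threshold = 10;
SHOW VARIABLES LIKE 'tidb_expensive_query_time_threshold';
```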

| username: huaj1101 | Original post link

Strangely, I analyzed and reran every SQL recorded in that file one by one, and they were all very fast with very little memory usage.

| username: Soysauce520 | Original post link

That means it wasn’t recorded.

| username: huaj1101 | Original post link

No more clues, giving up. It hasn’t appeared for days, I’ll chase it next time it shows up.

Thanks to all the experts for your help!

| username: 有猫万事足 | Original post link

I think you can enable Continuous Profiling. If an OOM event occurs while it is enabled, there will be a corresponding heap dump file, and troubleshooting becomes much easier.

  • Heap: Memory usage overhead of various internal functions on TiDB and PD instances