Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.
Original topic: TiDB memory was overwhelmed and the node crashed; how do I find which SQL caused it (tidb内存被打崩,如何找到是哪条SQL)
[TiDB Usage Environment] Production Environment
[TiDB Version] 6.1.1
[Encountered Problem: Phenomenon and Impact]
One TiDB node in the production environment suddenly crashed and restarted (within 30 seconds). The memory spike happened too quickly to be recorded, but the OOM record is visible in the operating system logs. The TiKV and TiFlash nodes showed no anomalies at the same time.
The memory limit for a single SQL execution was set to 20 GB, but it appears to have had no effect.
Because the SQL never finished executing before the crash and restart, there is no slow-query record in the Dashboard, and the node's logs did not capture it either. In this situation, how can we identify which SQL caused the crash? This is a production environment with high concurrency, and many SQL statements are running at any given moment.
[Resource Configuration]
16 cores / 64 GB RAM / SSD
[Attachments: Screenshots/Logs/Monitoring]
With high concurrency, a single SQL might never exceed the 20 GB limit at all. You should set a smaller limit, for example record anything that exceeds 1 GB, so that more instances get captured. Your machine only has a total of just over 20 GB of memory to spare, so how could a 20 GB limit ever trigger? With a 1 GB limit, 20 concurrent queries would already max it out. You can set the limit even smaller.
Under normal circumstances, this table should have records.
If you have 64 GB of memory and a 20 GB per-query limit, and it crashes in an instant before anything is recorded, I suspect multiple SQL statements were running concurrently: each one stays under the 20 GB limit on its own, but together they exhaust the 64 GB.
So reduce concurrency or lower the limit; that makes it much easier to observe which SQL statement is causing the problem.
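In case it helps, a minimal sketch of lowering the per-query quota, assuming TiDB v6.1 where tidb_mem_quota_query can be set at GLOBAL scope (older versions only allow SESSION scope); the 1 GiB value is only an example:

```sql
-- Check the current per-statement memory quota (in bytes).
SELECT @@GLOBAL.tidb_mem_quota_query;

-- Lower it so memory-hungry statements are flagged much earlier.
-- 1 GiB is an example value; tune it to your workload.
SET GLOBAL tidb_mem_quota_query = 1073741824;
```

Statements that cross the quota are then handled by the configured OOM action long before the whole tidb-server process is at risk.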
This table was not available in version 6.1.1; it was added in a later version, right?
The memory is 64GB, and the average usage is quite low.
Okay, I'll look into it first.
I am using version V5 and have the same confusion.
I don't know which SQL is slowly eating up 100 GB of memory, causing TiDB to OOM, and then the cycle repeats.
The SQL-logging methods mentioned above all seem to require the SQL to finish executing first, but the whole problem with this kind of OOM is that the SQL never finishes before TiDB restarts.
Check the value of the tidb_memory_usage_alarm_ratio parameter, and look for records in the directory specified by the tmp-storage-path parameter; there should be useful information in there.
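For reference, a sketch of checking both from a SQL client, assuming v6.1, where memory-usage-alarm-ratio is exposed as the tidb_memory_usage_alarm_ratio system variable and SHOW CONFIG is available:

```sql
-- Ratio of total memory at which TiDB starts writing alarm diagnostics.
SELECT @@GLOBAL.tidb_memory_usage_alarm_ratio;

-- Locate the directory the alarm records are written under.
SHOW CONFIG WHERE type = 'tidb' AND name = 'tmp-storage-path';
```

When memory usage crosses that ratio, TiDB records the currently running SQL statements and goroutine/heap profiles under that path, which is exactly the kind of evidence you want after an OOM.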
All we can really do is adjust the parameters so that every SQL statement that uses a large amount of memory gets captured.
Check the heatmap to see if the traffic was abnormal at that time. Look for expensive SQL in the logs.
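As a hedged alternative to grepping log files by hand: if the information_schema.CLUSTER_LOG table is available in your deployment, the TiDB logs of every instance can be searched with SQL. The time window below is only a placeholder around the incident:

```sql
-- Search collected TiDB logs for expensive-query entries near the OOM.
-- CLUSTER_LOG rejects full scans, so the time-range predicate is required.
SELECT time, instance, level, message
FROM information_schema.cluster_log
WHERE type = 'tidb'
  AND time BETWEEN '2023-01-01 09:30:00' AND '2023-01-01 09:40:00'
  AND message LIKE '%expensive_query%';
```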
Thanks, I'm analyzing it now. The method mentioned in the video DBRE shared also covers this approach.
Can't you just lower the limit a bit and then restart? If you're worried about affecting the business, set the OOM action to LOG. Is the frequency of occurrence very low? Are you planning to analyze it from historical clues?
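If you do go the logging route, a sketch assuming v6.1, where the old oom-action config item was replaced by the tidb_mem_oom_action variable:

```sql
-- Instead of cancelling a statement that exceeds tidb_mem_quota_query,
-- only write a log entry and let it keep running.
SET GLOBAL tidb_mem_oom_action = 'LOG';
```

LOG leaves evidence behind but does not stop the statement, so a runaway query can still push the node toward OOM; it is a trade-off between observability and safety.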
Well, the frequency is very low. I suspect it’s caused by some rarely executed SQL. I plan to identify which one and modify it.
Search the TiDB logs in the Dashboard for the keyword “expensive”: System Variables | PingCAP Documentation Center
Didn't find anything. By default a query only counts as expensive after 60 seconds. Everything was normal at 9:33:30 and the OOM happened at 9:33:51, so it crashed before reaching 60 seconds and nothing was recorded.
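Since the crash arrives well before the 60-second mark, one option is to lower that threshold so shorter but heavy statements still produce expensive_query entries in the TiDB log; a sketch assuming the tidb_expensive_query_time_threshold variable is available in your version (10 seconds is only an example):

```sql
-- Default is 60 seconds; lower it so heavy statements are logged sooner.
SET GLOBAL tidb_expensive_query_time_threshold = 10;
```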
Strangely, I analyzed and reran every SQL recorded in that file one by one, and they were all very fast and used very little memory.
That means it wasn’t recorded.
No more clues, so I'm giving up for now. It hasn't appeared for days; I'll chase it down the next time it shows up.
Thanks to all the experts for your help!
- Heap: Memory usage overhead of various internal functions on TiDB and PD instances
I think you can enable Continuous Profiling. If an OOM occurs during the period it is enabled, there will be a corresponding heap profile file, and troubleshooting becomes much easier.
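Continuous Profiling is switched on from the TiDB Dashboard rather than through SQL, but while a memory spike is still in progress you can also take an ad-hoc heap snapshot from a SQL client. This is only a sketch and assumes the performance_schema profiling tables are enabled in your deployment:

```sql
-- One-off heap profile of the TiDB instances, rendered as a call tree.
SELECT * FROM performance_schema.tidb_profile_memory;
```

The catch is that this only captures the moment you run it; for after-the-fact analysis of an OOM, the heap files retained by Continuous Profiling are the better fit.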