How to Locate Which Statement Caused TiKV OOM

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tikv oom如何定位是哪条语句 (How to locate which statement caused a TiKV OOM)

| username: Jolyne

[TiDB Usage Environment] Production Environment / Testing / PoC
[TiDB Version]
[Reproduction Path] What operations were performed that caused the issue
The memory and CPU of TiKV suddenly spiked at midnight, causing one TiKV instance to restart due to OOM. Checking the dashboard panels, there were no significant SQL executions during that time period, and other monitoring indicators were also normal. I'd like to ask everyone how to locate the cause in such a situation (block-cache is set to 48G, and I saw on the official website that this setting was deprecated after version 6.6).
[Encountered Issue: Problem Phenomenon and Impact]
[Resource Configuration] Go to TiDB Dashboard - Cluster Info - Hosts and take a screenshot of this page
[Attachments: Screenshots/Logs/Monitoring]


| username: Kongdom | Original post link

Is there an EXPENSIVE_QUERY in the TiDB logs?
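The logs can also be searched via SQL through the CLUSTER_LOG table (a minimal sketch; this table requires a time-range predicate, and the window below is only illustrative):

```sql
-- Search TiDB instance logs for expensive-query records
-- around the time of the memory spike.
SELECT time, instance, level, message
FROM INFORMATION_SCHEMA.CLUSTER_LOG
WHERE type = 'tidb'
  AND time BETWEEN '2023-01-01 00:00:00' AND '2023-01-01 02:00:00'
  AND message LIKE '%expensive_query%';
```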

| username: caiyfc | Original post link

You can check the slow SQL log, or search for the EXPENSIVE keyword mentioned above. You can also dig into the statement analysis (statement summary) tables yourself, though it's possible you won't find the SQL there.
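For example, the statement summary tables can be ranked by memory (a sketch using the standard CLUSTER_STATEMENTS_SUMMARY_HISTORY columns; note that MAX_MEM tracks TiDB-side memory, so a statement that pressures TiKV may not stand out here, which is one reason you might not find it):

```sql
-- Rank recent statement digests by peak per-statement memory (TiDB side).
SELECT instance, digest_text, exec_count, avg_mem, max_mem
FROM INFORMATION_SCHEMA.CLUSTER_STATEMENTS_SUMMARY_HISTORY
ORDER BY max_mem DESC
LIMIT 10;
```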

| username: Jolyne | Original post link

There are no such entries in the logs.

| username: Kongdom | Original post link

:thinking: There is another possibility: the statement is still running, in which case it shows up in INFORMATION_SCHEMA.CLUSTER_PROCESSLIST.
Slow queries only record completed SQL.
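A quick sketch of the processlist check (the MEM column is the statement's TiDB-side memory usage in bytes):

```sql
-- Show statements currently running anywhere in the cluster,
-- largest memory consumers first.
SELECT instance, id, user, db, time, state, mem, info
FROM INFORMATION_SCHEMA.CLUSTER_PROCESSLIST
ORDER BY mem DESC
LIMIT 10;
```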

| username: Kongdom | Original post link

Looking at this graph, it doesn't seem like an OOM (Out of Memory) issue; it feels more like a plain restart. With an OOM, memory usage should climb as a diagonal line, but the monitoring shows a straight vertical rise and fall.

| username: caiyfc | Original post link

Actually, the slow query log does record unfinished SQL. Previously, when there was an OOM, I would check the execution plan in the slow query log. The slow query page also has an "is success" field, which shows 0 for statements that did not complete.
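Something like this sketch can pull those records from the slow query table (the time window below is illustrative; Succ = 0 marks statements that did not finish, e.g. were killed):

```sql
-- Failed or killed statements around the incident, by peak memory.
SELECT time, instance, query_time, mem_max, succ, query
FROM INFORMATION_SCHEMA.CLUSTER_SLOW_QUERY
WHERE succ = 0
  AND time BETWEEN '2023-01-01 00:00:00' AND '2023-01-01 02:00:00'
ORDER BY mem_max DESC
LIMIT 10;
```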

| username: Jolyne | Original post link

It was OOM and got killed.

| username: Jolyne | Original post link

I see that it's all internal SQL (statistics analysis). Could this also be a contributing factor?

| username: caiyfc | Original post link

It doesn’t seem like it. You can first check according to the official website’s solution:

You can also refer to this article:
Column - Summary of TiKV’s Main Memory Structure and OOM Troubleshooting | TiDB Community

| username: Jolyne | Original post link

I’ve checked the monitoring, and everything seems fine, but I noticed these few metrics are abnormally high.

Another concern I have is whether my block-cache is set too high. I have 64GB of memory and have set 48GB for the cache.

| username: caiyfc | Original post link

With 64GB of memory, setting it to 48GB is a bit high; I usually set it to about 60% of memory. According to the monitoring, TiKV's memory increased suddenly, and it would be best to find out why, but the other monitoring data doesn't show any anomalies. The monitoring you posted around 5:30 doesn't show high usage either.

| username: 裤衩儿飞上天 | Original post link

Investigate whether there are any scheduled tasks running during that time period, such as backups or batch processing;
Check if there were any upgrades or changes to the business the day before.

| username: Jolyne | Original post link

Previously, I set it to 40G, but after a day, I saw that the block-cache-size in the monitoring exceeded 40G and reached around 48G.

| username: tidb菜鸟一只 | Original post link

How much is this parameter (storage.block-cache.capacity) set to?

| username: tidb狂热爱好者 | Original post link

The SQL statements at the top of your chart use 500MB each; executing a dozen of them at once would be the reason, wouldn't it? OOM happens because they all use the cache. Optimize the largest SQL first.

| username: Jolyne | Original post link

Originally, it was set to 48G, but I just changed it to 38G. However, it doesn’t seem to be related to this. I just saw a post somewhere that suggested setting it to around 75% of the memory.

| username: tidb菜鸟一只 | Original post link

If TiKV is not deployed mixed with other components, setting this value to 45% of the machine's memory is sufficient (that is also the default). It only caches data blocks, and if it alone takes up 75% of memory, an OOM is quite normal…
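For reference, the block cache capacity can be adjusted online with SET CONFIG (a sketch; 28GiB is roughly 45% of this host's 64GB and not a universal recommendation):

```sql
-- Lower TiKV's block cache online: ~45% of 64GB RAM, per the advice above.
SET CONFIG tikv `storage.block-cache.capacity` = '28GiB';

-- Confirm the new value on every TiKV instance.
SHOW CONFIG WHERE type = 'tikv' AND name = 'storage.block-cache.capacity';
```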

| username: Jolyne | Original post link

Okay, thank you. I might have misread it earlier.

| username: tidb狂热爱好者 | Original post link

The expert is really amazing.
It’s best not to adjust the parameters of TiDB; the default settings are already very good.