Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.
Original topic: TiFlash经常触发OOM,重启依然无法阻止OOM发生 (TiFlash frequently triggers OOM, and restarting still cannot prevent the OOM from recurring)
[TiDB Usage Environment] Production Environment
[TiDB Version] v6.2.0
[Encountered Problem]
[Reproduction Path] Currently unable to reproduce
[Problem Phenomenon and Impact]
Memory usage often spikes suddenly, sometimes by nearly ten times, triggering OOM events. After a restart, TiFlash keeps entering an OOM loop, and the only way to stop it is to increase the TiFlash memory configuration.
What are some ideas and methods for troubleshooting this issue?
Here is a copy of the error message in /var/log/message
Are there any large transactions in the business workload? How large are the transactions, and how much data has been added to TiFlash?
Additionally, please provide:
- The TiFlash-summary monitoring, exported using this method: https://metricstool.pingcap.com/
- The TiFlash.log and TiFlash-TiKV.log during the restart period.
There is no /var/log/message locally; the cluster is deployed on K8s, but there are TiFlash logs.
I couldn’t find tiflash.log or tiflash-tikv.log; I only found server.log under the TiFlash directory. In the db/page/log/wal directory I located some large files, but I’m not sure how to pinpoint the SQL behind them!
To find large transactions, you can check TiDB Dashboard / Slow Queries, select the time range before the OOM occurred, and see if there are any with a relatively large Max Memory.
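In case it is useful, the same information can also be pulled from the slow query system table. This is only a sketch, assuming the standard INFORMATION_SCHEMA.CLUSTER_SLOW_QUERY schema; the timestamps are placeholders for the window just before the OOM.

```sql
-- Sketch: list the highest-memory statements recorded in the slow query log
-- around the OOM window. The time range below is a placeholder; replace it
-- with the period just before the OOM occurred.
SELECT Time, INSTANCE, Query_time, Mem_max, Query
FROM INFORMATION_SCHEMA.CLUSTER_SLOW_QUERY
WHERE Time BETWEEN '2022-08-16 09:00:00' AND '2022-08-16 09:40:00'
ORDER BY Mem_max DESC
LIMIT 20;
```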
I have been trying to find large transactions, and I did locate the ones with high memory usage. Occasionally there are queries with large memory usage, but they all run on TiKV, not TiFlash. Are there any fields in the logs that could help explain this? We could analyze it from the logs.
Check the dashboard for SQL statements that consume a lot of memory, then look at the execution plan to see if it involves cop[tiflash]. Focus on optimizing these types of SQL statements.
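For example, running EXPLAIN on a suspect statement shows which engine each operator runs on; the table and query below are hypothetical, only to illustrate how to read the plan.

```sql
-- Sketch: inspect the plan of a suspect statement (orders/region/amount are
-- hypothetical names). In the output, the "task" column shows values such as
-- cop[tikv], cop[tiflash], or mpp[tiflash]; the latter two mean the operator
-- is pushed down to TiFlash.
EXPLAIN
SELECT region, SUM(amount)
FROM orders
GROUP BY region;
```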
Do the “large memory queries” involve data update operations? If so, approximately how much data is being updated?
We have been tracking this issue. The large memory usage mostly comes from read queries, and some of the timings don’t match the OOM events; the rest are basically in the KB range, with occasional MBs. After TiFlash OOMs, it keeps restarting. Normally the queries should not execute again after a restart, but it seems TiFlash has some internal tasks that it cannot skip, and the OOM can only be resolved by increasing the memory configuration.
The only known issue that triggers similar repeated TiFlash OOMs, and that can be alleviated by increasing the memory configuration, is TiDB running large update transactions, which is why I want to confirm this.
Can you check if there are any other special operations in the business before the OOM?
Also, please send the logs that include the OOM restart period, including server.log and proxy.log.
The attachment contains three files. The file from 9:30 AM is related to OOM, and the issue was resolved by upgrading the configuration around 10:30 AM. The server.log is the log from the afternoon, where there was a sudden increase in memory usage, but it did not trigger an OOM.
Got it. Also, please export the Grafana monitoring for tiflash-summary using this method: https://metricstool.pingcap.com/
There were some issues with exporting, so I took screenshots instead. These are from the 15th to the 16th.
Okay, let’s analyze it first.
Still under investigation. From the logs, we have found that there are too many data fragments that have not been compacted. Whether this causes the OOM, and why it happens, is still being investigated. However, this will definitely make the system perform suboptimally, so you can try compacting the data manually to see if things improve: ALTER TABLE xxx COMPACT TIFLASH REPLICA;
Okay, I’ll give it a try. I’ve also been keeping an eye on the wal directory. There is basically always a log_xxxx_1 file, and its size continues to grow.
Does this compaction need to be executed individually for each table with a TiFlash replica? Or is there a way to work out which table is affected?
I suggest running through all the tables, one by one, in series.
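If it helps, the list of tables with TiFlash replicas can be generated from the system table so you don’t have to collect them by hand. This is a sketch assuming the standard INFORMATION_SCHEMA.TIFLASH_REPLICA layout; it only generates the statements, which you would then run one at a time.

```sql
-- Sketch: generate one COMPACT statement per table that has a TiFlash replica.
-- Copy the output and execute the statements serially, not in parallel.
SELECT CONCAT('ALTER TABLE `', TABLE_SCHEMA, '`.`', TABLE_NAME,
              '` COMPACT TIFLASH REPLICA;') AS compact_stmt
FROM INFORMATION_SCHEMA.TIFLASH_REPLICA
WHERE REPLICA_COUNT > 0;
```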