TiFlash Frequently Triggers OOM, and Restarting Still Cannot Prevent It

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TiFlash经常触发OOM,重启依然无法阻止OOM发生

| username: TiDBer_qqabzOs3

[TiDB Usage Environment] Production Environment
[TiDB Version] v6.2.0
[Encountered Problem]
[Reproduction Path] Currently unable to reproduce
[Problem Phenomenon and Impact]
Memory usage often spikes suddenly, increasing nearly tenfold and triggering OOM events. After a restart, TiFlash keeps entering an OOM loop unless its memory configuration is increased.
What are some ideas and methods for troubleshooting this issue?

| username: neolithic | Original post link

Please post a copy of the error messages from /var/log/messages here.

| username: flow-PingCAP | Original post link

Are there any large transactions in your workload? How large are the transactions, and how much data has been added to TiFlash?

Additionally, please provide:

  1. The TiFlash-summary monitoring, exported using this method: https://metricstool.pingcap.com/
  2. The TiFlash.log and TiFlash-TiKV.log during the restart period.

| username: TiDBer_qqabzOs3 | Original post link

There is no /var/log/messages locally; the cluster is deployed on K8s. The TiFlash logs are available, though.

| username: TiDBer_qqabzOs3 | Original post link

I couldn't find tiflash.log or tiflash-tikv.log; I only found server.log under TiFlash. In the db/page/log/wal directory I located some large files, but I'm not quite sure how to pinpoint the responsible SQL!

| username: flow-PingCAP | Original post link

To find large transactions, you can check TiDB Dashboard / Slow Queries, select the time range before the OOM occurred, and see if there are any with a relatively large Max Memory.
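
For reference, a hedged sketch of pulling high-memory statements directly from the slow query table for a given window (the time range and limit below are placeholders; Mem_max is reported in bytes):

```sql
-- Placeholder time window around the OOM; adjust to the incident time.
SELECT Time, Query_time, Mem_max, Query
FROM INFORMATION_SCHEMA.SLOW_QUERY
WHERE Time BETWEEN '2022-09-15 09:00:00' AND '2022-09-15 09:30:00'
ORDER BY Mem_max DESC
LIMIT 10;
```

If the cluster has multiple TiDB instances, INFORMATION_SCHEMA.CLUSTER_SLOW_QUERY covers all of them, at the cost of a slower query.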

| username: TiDBer_qqabzOs3 | Original post link

I have been trying to find large transactions, but I have located those with large memory usage. Occasionally, there are queries with large memory usage, but they are all from TiKV, not TiFlash. Are there any fields in the logs that can help explain this? We can analyze it from the logs.

| username: TiDBer_jYQINSnf | Original post link

Check the dashboard for SQL statements that consume a lot of memory, then look at the execution plan to see if it involves cop[tiflash]. Focus on optimizing these types of SQL statements.
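
As an illustration (the table name below is a placeholder, not from this thread), run EXPLAIN on a suspect statement and check the task column of the plan: operators marked cop[tiflash], batchCop[tiflash], or mpp[tiflash] run on TiFlash, while cop[tikv] runs on TiKV.

```sql
-- Hypothetical heavy aggregation; inspect the `task` column of the output.
EXPLAIN
SELECT customer_id, SUM(amount)
FROM orders
GROUP BY customer_id;
```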

| username: flow-PingCAP | Original post link

Do the “large memory” queries involve data update operations? If so, approximately how much data is being updated?

| username: TiDBer_qqabzOs3 | Original post link

We have been tracking this issue, but the large memory usage mostly comes from query SQL, and some of the timings don’t match. The rest are basically in the KB range, with occasional MBs. After TiFlash OOMs, it keeps restarting. Normally the query SQL should not execute again after the restart, yet TiFlash seems to have some tasks it cannot skip. The OOM can only be resolved by increasing the memory configuration.

| username: flow-PingCAP | Original post link

The only known issue that triggers this kind of repeated TiFlash OOM, and that can be alleviated by increasing the memory configuration, is TiDB running large update transactions, which is why I want to confirm this.
Can you check whether the business performed any other special operations before the OOM?
Also, please send the logs covering the OOM restart period, including server.log and proxy.log.

| username: TiDBer_qqabzOs3 | Original post link

The attachment contains three files. The file from 9:30 AM is related to the OOM, and the issue was resolved by increasing the memory configuration around 10:30 AM. The server.log is from the afternoon, when there was a sudden increase in memory usage that did not trigger an OOM.

| username: flow-PingCAP | Original post link

Got it. Also, please export the Grafana monitoring for tiflash-summary using this method: https://metricstool.pingcap.com/

| username: TiDBer_qqabzOs3 | Original post link

There were some issues with exporting, so I took screenshots instead. These are from the 15th to the 16th.

| username: flow-PingCAP | Original post link

Okay, let’s analyze it first.

| username: TiDBer_qqabzOs3 | Original post link

Any progress?

| username: flow-PingCAP | Original post link

Still under investigation. So far, the logs show that there are too many data fragments that have not been compacted. Whether this can cause the OOM, and why it happens, are still being investigated. However, it will definitely make the system perform suboptimally, so you can try compacting the data manually to see if there is any improvement: ALTER TABLE xxx COMPACT TIFLASH REPLICA;

| username: TiDBer_qqabzOs3 | Original post link

Okay, I’ll give it a try. I’ve also been keeping an eye on the wal directory. There is basically always a log_xxxx_1 file, and its size continues to grow.

| username: TiDBer_qqabzOs3 | Original post link

Does this compaction need to be executed individually for each table with a TiFlash replica? Or is there a way to work out which table is the problem?

| username: flow-PingCAP | Original post link

I suggest running through all the tables, one by one, in series.
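
A hedged sketch of doing that (the schema and table names below are placeholders): list every table that has a TiFlash replica from INFORMATION_SCHEMA.TIFLASH_REPLICA, then run the compaction for each one in turn.

```sql
-- Enumerate the tables that have TiFlash replicas.
SELECT TABLE_SCHEMA, TABLE_NAME
FROM INFORMATION_SCHEMA.TIFLASH_REPLICA;

-- For each row returned above, run the compaction serially, e.g.:
ALTER TABLE db1.t1 COMPACT TIFLASH REPLICA;  -- db1.t1 is a placeholder
```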