TiFlash startup consumes a large amount of memory, directly causing OOM

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tiflash启动消耗大量内存,直接导致OOM

| username: wwb519

[TiDB Usage Environment] Production Environment
[TiDB Version] v6.1.5
[Encountered Issue: Issue Phenomenon and Impact]


What data does TiFlash need to load when starting, and why does it require such a large amount of memory? Are there any parameters to limit the memory consumption during startup?


| username: tidb菜鸟一只 | Original post link

Are any other components deployed on this machine, or is it only TiFlash?

| username: wwb519 | Original post link

Only TiFlash, and there is just one TiFlash node.

| username: 江湖故人 | Original post link

It is possible that a TiFlash replica was added to a table with particularly frequent writes or updates.

| username: zhanggame1 | Original post link

Is this a mixed deployment, with TiFlash sharing the machine with other components?
You can limit memory usage with these parameters:
max_memory_usage
max_memory_usage_for_all_queries

Refer to TiFlash Configuration Parameters | PingCAP Documentation Center
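
As a rough sketch of where these two parameters go: assuming they are set in the TiFlash configuration file under [profiles.default], as in the v6.1 configuration reference, the fragment could look like the following. The byte values are illustrative only, not recommendations. Both settings cap the memory used for intermediate data generated by queries, so they may not bound memory used during startup or replica synchronization.

```toml
# Sketch of a tiflash.toml fragment; the values are illustrative assumptions.
[profiles]

[profiles.default]
    # Limit for intermediate data of a single coprocessor query, in bytes.
    # 0 (the default) means no limit. Here: 8 GiB.
    max_memory_usage = 8589934592

    # Limit for intermediate data across all queries, in bytes.
    # 0 (the default) means no limit. Here: 32 GiB.
    max_memory_usage_for_all_queries = 34359738368
```

TiFlash needs a restart (or a reload through the deployment tool) for a change in this file to take effect.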

| username: tidb菜鸟一只 | Original post link

Have you adjusted the synchronization speed of TiFlash tables? In theory, the default synchronization speed should use very few resources and shouldn’t cause OOM. https://docs.pingcap.com/zh/tidb/v6.1/create-tiflash-replicas#加快-tiflash-副本同步速度

| username: wwb519 | Original post link

Adding these two parameters had no effect.

| username: wwb519 | Original post link

I/O is relatively high during startup; I'm not sure whether that is related.

| username: wwb519 | Original post link

Quoting tidb菜鸟一只:
“In theory, the default synchronization speed should use very few resources and shouldn’t cause OOM.”

I haven’t made any adjustments; everything is set to the default configuration.

| username: wangccsy | Original post link

Modify the configuration.

| username: wwb519 | Original post link

What configuration needs to be modified?

| username: dba远航 | Original post link

If there is only one TiFlash, then all queries, including lightweight SQL, will go through it. Can it handle the load easily?

| username: wwb519 | Original post link

The server is fairly high-spec, and the query volume itself is not large.

| username: 有猫万事足 | Original post link

What exactly were the steps? Did you scale out a TiFlash node without adding any TiFlash replicas, and it then kept hitting OOM?

| username: Jellybean | Original post link

Based on the current situation described by the original poster, it is possible that a large amount of data migration or SQL execution occurred when TiFlash started, resulting in high IO and high memory usage.

The information is a bit limited. Could you upload the complete TiFlash log file after redacting any sensitive information, and also tell us the number of Regions in TiKV and the synchronization status of the tables replicated to TiFlash?
Additionally, please confirm whether any data was being synchronized to TiFlash tables at the time of startup.
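
One possible way to collect the numbers requested above is to query TiDB's information_schema tables from any SQL client. This is a sketch and not the only approach; on a cluster with many Regions the second query can be slow.

```sql
-- Synchronization status of every table that has a TiFlash replica.
-- AVAILABLE = 1 means the replica is usable; PROGRESS ranges from 0 to 1.
SELECT TABLE_SCHEMA, TABLE_NAME, REPLICA_COUNT, AVAILABLE, PROGRESS
FROM information_schema.tiflash_replica;

-- Approximate number of Regions in the cluster.
-- DISTINCT is used because a Region that spans several tables
-- appears in more than one row.
SELECT COUNT(DISTINCT REGION_ID) AS region_count
FROM information_schema.tikv_region_status;
```

A table whose PROGRESS is below 1 was still being synchronized when the query ran.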

| username: wwb519 | Original post link

It wasn’t a scaling operation. I manually stopped TiFlash for about 10 minutes, then restarted the operating system, and an OOM (Out of Memory) error occurred.

| username: wwb519 | Original post link

There is a table being synchronized, with a total data volume of around 1TB.


tiflash3.log.tar.gz (55.8 MB)

| username: TiDBer_小阿飞 | Original post link

Is there too much data or too many tables being synchronized to TiFlash?

| username: 有猫万事足 | Original post link

If you have extra resources, you can try scaling up directly. My feeling is that TiFlash doesn’t consume much memory when writing, but it’s hard to say for reading.

It’s best to act as if TiFlash doesn’t exist and see if the execution plan can run without using the TiFlash engine.

However, I checked and found that the parameter for selecting the query engine cannot be modified at the global level.
You may need to modify the corresponding configuration-file parameter online through SET CONFIG.

The parameter name is as follows,

Change the above parameter to ["tikv", "tidb"], so that queries behave as if TiFlash does not exist and never use it.
Then try restarting TiFlash to see whether it comes up. If it does, change the parameter back.
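
The parameter name is missing from the post. A plausible reading, stated here as an assumption, is that it refers to TiDB's engine isolation setting: the session variable tidb_isolation_read_engines, whose instance-level counterpart is isolation-read.engines in the TiDB configuration file. The sketch below covers only the session-level form, which is useful for checking that a query plan no longer touches TiFlash. The table test.orders is hypothetical.

```sql
-- Show the engines the current session may read from
-- (the default is 'tikv,tiflash,tidb').
SELECT @@tidb_isolation_read_engines;

-- Exclude TiFlash for this session only.
SET SESSION tidb_isolation_read_engines = 'tikv,tidb';

-- The plan should no longer contain TiFlash tasks (hypothetical table).
EXPLAIN SELECT COUNT(*) FROM test.orders;

-- Restore the default once TiFlash is back up.
SET SESSION tidb_isolation_read_engines = 'tikv,tiflash,tidb';
```

Because this variable is session-scoped, it does not affect other connections. Keeping all application queries away from TiFlash would need the instance-level configuration instead.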

If it doesn’t come up, it might really be that there is too much data to synchronize. You may need to remove some TiFlash replicas of tables to reduce some synchronization write pressure and try again.
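
Removing and later restoring a TiFlash replica can be sketched as follows. The table name test.big_table is hypothetical, and which tables to drop replicas from depends on the actual workload.

```sql
-- Remove the TiFlash replica of one table to reduce synchronization pressure.
ALTER TABLE test.big_table SET TIFLASH REPLICA 0;

-- After TiFlash is stable again, add the replica back.
-- The table is then re-synchronized from TiKV.
ALTER TABLE test.big_table SET TIFLASH REPLICA 1;
```

Re-adding the replica triggers a full synchronization of that table, so restoring large tables one at a time may keep the load manageable.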

That’s the general idea.

| username: TIDB-Learner | Original post link

Too much data is being synchronized for it to be written to disk in time.