[TiDB Usage Environment] Production Environment
[TiDB Version] 5.2.1
[Reproduction Path] Operations performed that led to the issue
[Encountered Issue: Issue Phenomenon and Impact]
[Resource Configuration]
Question:
Are there any TiFlash parameters that limit maximum memory usage to prevent OOM? TiFlash keeps restarting because of OOM: it consumes memory rapidly right after startup, not from queries but apparently from loading a large amount of data into memory. The following parameters have already been adjusted, but memory usage still climbs until the process is killed.
# Memory cache size limit for data block metadata; usually does not need modification
mark_cache_size = 5368709120
# Memory cache size limit for data block min-max index; usually does not need modification
minmax_index_cache_size = 5368709120
# Memory cache size limit for DeltaIndex; default is 0, meaning no limit
delta_index_cache_size = 0
# Maximum memory usage for processing a single query. Zero means unlimited.
profiles.default.max_memory_usage: 0
# Maximum memory usage for processing all concurrently running queries on the server. Zero means unlimited.
profiles.default.max_memory_usage_for_all_queries: 0
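For anyone reproducing this: the settings above go into the TiFlash configuration. If the cluster is managed with tiup, they would sit roughly as in the sketch below and need a reload to take effect (the section names are assumed from the standard topology layout, not confirmed in this thread):
tiup cluster edit-config <cluster-name>
#   server_configs:
#     tiflash:
#       mark_cache_size: 5368709120
#       minmax_index_cache_size: 5368709120
#       delta_index_cache_size: 0
#       profiles.default.max_memory_usage: 0
#       profiles.default.max_memory_usage_for_all_queries: 0
tiup cluster reload <cluster-name> -R tiflash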
Also, is there a way to limit the speed at which new tables are replicated to TiFlash? Right now replication consumes so many resources that business queries become completely unresponsive.
When a TiFlash replica is added, each TiKV instance performs a full table scan and sends the scanned data as snapshots to TiFlash to build the replica. By default, replica creation is deliberately slow and uses few resources to reduce the impact on TiKV and TiFlash online services, so people usually tune parameters to speed it up. If you want to slow it down instead, do the opposite and lower those same parameters (the tiflash-learner settings listed further down in this thread).
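A complementary way to bound the impact, independent of those settings (not a TiFlash knob, just a way to stage the work), is to add replicas one table at a time and wait for each to finish before starting the next. A minimal sketch, assuming a placeholder table test.big_table and the default TiDB port 4000:
# Add the TiFlash replica for a single table (table name is a placeholder)
mysql -h <tidb-host> -P 4000 -u root -p -e "ALTER TABLE test.big_table SET TIFLASH REPLICA 1"
# Watch replication progress; start the next table only after AVAILABLE becomes 1
mysql -h <tidb-host> -P 4000 -u root -p -e "SELECT TABLE_SCHEMA, TABLE_NAME, PROGRESS, AVAILABLE FROM information_schema.tiflash_replica"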
In TiFlash, there are parameters that can limit the maximum memory usage and prevent the OOM errors behind the repeated restarts; the relevant ones are essentially the cache-size and per-query memory limits you have already listed.
In addition to these parameters, TiFlash also limits how quickly newly added tables are synchronized, so that synchronization does not take up too many resources and hurt query performance. You can use the max_delta_schema_sync_threads parameter to adjust this limit. By default it is set to 10, which should be sufficient for most systems. If you need to speed table synchronization up, you can raise this value, but be aware that doing so may increase resource usage and affect query performance.
I checked, and it seems my version does not have these tuning parameters. It now looks like TiFlash has to read a large amount of data into memory to warm up at startup, and 16 GB is not enough, so it OOMs: the process is there, but the port never comes up to listen before it gets killed by OOM.
I checked, and my version does not seem to have this parameter yet. Basically, as soon as I add a large table to TiFlash for synchronization, the 4 CPU cores are fully utilized and SSD I/O hits 100% at about 100 MB/s. This hurts query performance, since synchronization takes up half of the resources. It would be better if synchronization could proceed slowly in the background.
TiFlash-learner:
raftstore.snap-handle-pool-size: 10 # Default 2, can be adjusted to machine’s total CPU count × 0.6 or higher
raftstore.apply-low-priority-pool-size: 10 # Default 1, can be adjusted to machine’s total CPU count × 0.6 or higher
server.snap-max-write-bytes-per-sec: 300MiB # Default 100MiB
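For reference, these tiflash-learner keys are normally set through tiup and applied with a reload of the TiFlash role; a rough sketch, with the section name assumed from the standard tiup topology. To slow replication down rather than speed it up, keep or lower the defaults instead (for example server.snap-max-write-bytes-per-sec: 50MiB):
# Put the keys under server_configs -> tiflash-learner in the editor that opens
tiup cluster edit-config <cluster-name>
# Reload only the TiFlash role so the rest of the cluster is not restarted
tiup cluster reload <cluster-name> -R tiflash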
Understood. Is there a place or method to check the current value before modifying the configuration file, so that the previous value can be kept? That way it can be reverted if there is a problem, I can judge whether the current value is reasonable before changing it, and I can confirm afterwards whether the change actually took effect.
Use tiup cluster edit-config <cluster-name> to check whether these parameters are already present in the configuration. If they are not, the cluster is running on the default values; if they are, copy them out to a document as a backup before changing them.
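A concrete way to do that, sketched under the assumption of a default tiup layout on the control machine (verify the path on your host):
# Back up the stored topology/config before editing; this is the usual tiup metadata location
cp ~/.tiup/storage/cluster/clusters/<cluster-name>/meta.yaml ~/meta.yaml.bak.$(date +%F)
# Open the stored configuration; any parameter not listed here is running at its default
tiup cluster edit-config <cluster-name>
# After editing, apply to the TiFlash nodes, then re-open edit-config to confirm the values stuck
tiup cluster reload <cluster-name> -R tiflash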