[TiDB Usage Environment] Production Environment
[TiDB Version] 5.2.1
[Reproduction Path] Operations performed that led to the issue
[Encountered Issue: Issue Phenomenon and Impact]
[Resource Configuration]
Question:
Are there any TiFlash parameters that limit maximum memory usage to prevent OOM? TiFlash keeps restarting because of OOM: it consumes memory rapidly right after startup, not from queries but apparently from loading a large amount of data into memory. The following parameters have already been adjusted, but memory usage still climbs until the process is killed.
# Memory cache size limit for data block metadata; usually does not need modification
mark_cache_size = 5368709120
# Memory cache size limit for data block min-max index; usually does not need modification
minmax_index_cache_size = 5368709120
# Memory cache size limit for DeltaIndex; default is 0, meaning no limit
delta_index_cache_size = 0
# Maximum memory usage for processing a single query. Zero means unlimited.
profiles.default.max_memory_usage: 0
# Maximum memory usage for processing all concurrently running queries on the server. Zero means unlimited.
profiles.default.max_memory_usage_for_all_queries: 0
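For anyone reproducing this: the settings above go into the TiFlash configuration. If the cluster is managed with tiup, they would sit roughly as in the sketch below and need a reload to take effect (the section names are assumed from the standard topology layout, not confirmed in this thread):
tiup cluster edit-config <cluster-name>
#   server_configs:
#     tiflash:
#       mark_cache_size: 5368709120
#       minmax_index_cache_size: 5368709120
#       delta_index_cache_size: 0
#       profiles.default.max_memory_usage: 0
#       profiles.default.max_memory_usage_for_all_queries: 0
tiup cluster reload <cluster-name> -R tiflash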
Also, is there a way to limit the speed at which new tables are replicated to TiFlash? Right now replication consumes so many resources that business queries become completely unresponsive.
When a TiFlash replica is added, each TiKV instance performs a full table scan and sends the scanned data as snapshots to TiFlash to build the replica. By default, replica creation is deliberately slow and uses few resources to reduce the impact on TiKV and TiFlash online services, so people usually tune parameters to speed it up. If you want to slow it down instead, do the opposite and lower those same parameters (the tiflash-learner settings listed further down in this thread).
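A complementary way to bound the impact, independent of those settings (not a TiFlash knob, just a way to stage the work), is to add replicas one table at a time and wait for each to finish before starting the next. A minimal sketch, assuming a placeholder table test.big_table and the default TiDB port 4000:
# Add the TiFlash replica for a single table (table name is a placeholder)
mysql -h <tidb-host> -P 4000 -u root -p -e "ALTER TABLE test.big_table SET TIFLASH REPLICA 1"
# Watch replication progress; start the next table only after AVAILABLE becomes 1
mysql -h <tidb-host> -P 4000 -u root -p -e "SELECT TABLE_SCHEMA, TABLE_NAME, PROGRESS, AVAILABLE FROM information_schema.tiflash_replica"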
In TiFlash, there are parameters that can limit the maximum memory usage and prevent the OOM errors behind the repeated restarts; the relevant ones are essentially the cache-size and per-query memory limits you have already listed.
In addition to these parameters, TiFlash also limits how quickly newly added tables are synchronized, so that synchronization does not take up too many resources and hurt query performance. You can use the max_delta_schema_sync_threads parameter to adjust this limit. By default it is set to 10, which should be sufficient for most systems. If you need to speed table synchronization up, you can raise this value, but be aware that doing so may increase resource usage and affect query performance.
I checked, and it seems my version does not have these tuning parameters. It now looks like TiFlash has to read a large amount of data into memory to warm up at startup, and 16 GB is not enough, so it OOMs: the process is there, but the port never comes up to listen before it gets killed by OOM.
I checked, and my version does not seem to have this parameter yet. Basically, as soon as I add a large table to TiFlash for synchronization, the 4 CPU cores are fully utilized and SSD I/O hits 100% at about 100 MB/s. This hurts query performance, since synchronization takes up half of the resources. It would be better if synchronization could proceed slowly in the background.
TiFlash-learner:
raftstore.snap-handle-pool-size: 10 # Default 2, can be adjusted to machine’s total CPU count × 0.6 or higher
raftstore.apply-low-priority-pool-size: 10 # Default 1, can be adjusted to machine’s total CPU count × 0.6 or higher
server.snap-max-write-bytes-per-sec: 300MiB # Default 100MiB
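For reference, these tiflash-learner keys are normally set through tiup and applied with a reload of the TiFlash role; a rough sketch, with the section name assumed from the standard tiup topology. To slow replication down rather than speed it up, keep or lower the defaults instead (for example server.snap-max-write-bytes-per-sec: 50MiB):
# Put the keys under server_configs -> tiflash-learner in the editor that opens
tiup cluster edit-config <cluster-name>
# Reload only the TiFlash role so the rest of the cluster is not restarted
tiup cluster reload <cluster-name> -R tiflash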
Understood. Is there a place or method to check the current value before modifying the configuration file, so that the previous value can be kept? That way it can be reverted if there is a problem, I can judge whether the current value is reasonable before changing it, and I can confirm afterwards whether the change actually took effect.
Use tiup cluster edit-config <cluster-name> to check whether these parameters are already present in the configuration. If they are not, the cluster is running on the default values; if they are, copy them out to a document as a backup before changing them.
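A concrete way to do that, sketched under the assumption of a default tiup layout on the control machine (verify the path on your host):
# Back up the stored topology/config before editing; this is the usual tiup metadata location
cp ~/.tiup/storage/cluster/clusters/<cluster-name>/meta.yaml ~/meta.yaml.bak.$(date +%F)
# Open the stored configuration; any parameter not listed here is running at its default
tiup cluster edit-config <cluster-name>
# After editing, apply to the TiFlash nodes, then re-open edit-config to confirm the values stuck
tiup cluster reload <cluster-name> -R tiflash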