Issue of TiDB Server OOM Restart in Version 6.5

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 6.5 tidb server oom 重启的问题

| username: Qiuchi

[Test Environment for TiDB] Testing
[TiDB Version] v6.5.0
[Reproduction Path] Execute a query with high memory usage
[Encountered Issue: Phenomenon and Impact]
TiDB servers with 8 cores and 32 GB RAM; the cluster has three combined TiDB/PD nodes and six TiKV nodes
tidb_server_memory_limit = 80%
tidb_mem_quota_query = 21474836480 (20GB)
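
For clarity, this is roughly how those two limits would be applied and verified (a minimal sketch; the variable names are the standard v6.5 system variables, and the values are the ones quoted above):

```sql
-- Instance-level memory limit: 80% of 32 GB ≈ 25.6 GB
SET GLOBAL tidb_server_memory_limit = '80%';
-- Per-query memory quota: 21474836480 bytes = 20 GiB
SET GLOBAL tidb_mem_quota_query = 21474836480;

-- Verify the effective values
SHOW GLOBAL VARIABLES LIKE 'tidb_server_memory_limit';
SHOW GLOBAL VARIABLES LIKE 'tidb_mem_quota_query';
```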

During the query, the TiDB server log shows “global memory controller failed to kill the top-consumer in 10s”

In running_sql there are two identical SQL entries, with the same SQL text, statistics, etc., both with a mem_max of 15 GB, which together exceeds the server’s physical memory. However, one is sql 0 and the other is sql 4. The log has since been overwritten and cannot be uploaded, but you can refer to sql 0 and sql 1 in the attached log.
running_sql (16.2 KB)

Can the global memory limit in version 6.5 completely prevent TiDB server restarts due to OOM issues? How can the “failed to kill” issue be resolved?

| username: 人如其名 | Original post link

Version 6.5 cannot completely prevent tidb-server from restarting due to OOM, but it significantly reduces memory usage (through data spilling/paging) and can globally kill the SQL statements that consume the most memory. However, please note:

  1. When memory usage is high, only the single SQL statement that currently consumes the most memory is killed, once per minute. If memory grows rapidly and several high-memory SQL statements run at the same time, OOM can still occur.
  2. If the system issues a kill but the target query does not respond promptly, it keeps holding memory and the instance can still OOM. In your scenario the kill apparently did not take effect in time; in some cases, once execution is inside the hashagg operator, the kill is not acted on promptly. (One way to see the current top memory consumers is sketched below.)
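
As a reference, one way to see which statement the memory controller would pick as the top consumer is to sort the processlist by memory. A minimal sketch, assuming the MEM column of INFORMATION_SCHEMA.CLUSTER_PROCESSLIST is available in your version:

```sql
-- Show the current top memory consumers across all tidb-server instances
-- (MEM is in bytes; LEFT(...) just truncates the statement text for display).
SELECT INSTANCE, ID, USER, DB, TIME, MEM, LEFT(INFO, 80) AS SQL_PREFIX
FROM INFORMATION_SCHEMA.CLUSTER_PROCESSLIST
ORDER BY MEM DESC
LIMIT 5;
```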

The high memory usage here is most likely caused by the hashagg operator. The GROUP BY column dcdb.cr_trans_remain_mapping.ta_cfm_serial_no appears to have low repetition (high cardinality), which leads to a large amount of in-memory grouping and aggregation. To avoid this:

  1. Try not to use hashagg; for example, see whether streamagg can be used instead.
  2. Do not use the default concurrency: set the hashagg operator to single-threaded with set global tidb_hashagg_final_concurrency=1; set global tidb_hashagg_partial_concurrency=1, and enable disk spilling (see the sketch after this list). Note, however, that TiDB’s current disk spilling performance is relatively poor; refer to: “Sort operator disk spilling is too slow, hoping for product-level optimization” on the TiDB community forum (original title: Sort算子落盘性能太慢,希望产品层面进行优化 - TiDB 的问答社区). The team is expected to optimize disk spilling in the future.
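
A minimal sketch of the settings from item 2. The two concurrency variables are the ones named above; tidb_enable_tmp_storage_on_oom is my assumption for the spill switch in v6.3+ (older versions use the oom-use-tmp-storage config item), so verify it against your version’s documentation:

```sql
-- Force HashAgg to run single-threaded so it becomes eligible for disk spill
-- (concurrent HashAgg cannot spill in this version, per the advice above).
SET GLOBAL tidb_hashagg_partial_concurrency = 1;
SET GLOBAL tidb_hashagg_final_concurrency = 1;

-- Allow spilling to temporary disk storage when a query exceeds its memory quota
-- (variable name assumed for v6.3+; check your version before relying on it).
SET GLOBAL tidb_enable_tmp_storage_on_oom = ON;
```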

Also check whether the cr_trans_remain_mapping.ta_cfm_serial_no field has an index. If not, try adding one so that the streamagg operator can be used as much as possible.
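
A sketch of that check and change (the index name is illustrative, not from the original post):

```sql
-- Check whether the GROUP BY column already has an index
SHOW INDEX FROM dcdb.cr_trans_remain_mapping;

-- If not, add one so the optimizer can read rows in grouping order
-- and choose StreamAgg instead of HashAgg
ALTER TABLE dcdb.cr_trans_remain_mapping
  ADD INDEX idx_ta_cfm_serial_no (ta_cfm_serial_no);
```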

| username: Qiuchi | Original post link

Got it, thanks.

| username: 人如其名 | Original post link

A correction to what I said above: it is tidb_server_memory_limit_gc_trigger that executes once per minute. The kill operation happens each time memory is determined to be insufficient.
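
To see the current values of the instance-level memory controller variables mentioned in this thread, something like the following should work (the exact set of variables returned depends on the version):

```sql
-- List the tidb_server_memory_limit family of variables, e.g.
-- tidb_server_memory_limit, tidb_server_memory_limit_gc_trigger,
-- tidb_server_memory_limit_sess_min_size
SHOW GLOBAL VARIABLES LIKE 'tidb_server_memory_limit%';
```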

| username: Qiuchi | Original post link

A follow-up question: when using tidb_executor_concurrency to limit hashagg concurrency (ignoring the impact of the spill-to-disk feature), will the total memory needed to run the same SQL be lower than before? Or does it only mean the memory grows more slowly and execution is slower, while the final memory usage ends up the same as before?

| username: 人如其名 | Original post link

This parameter controls the concurrency of almost all operators. If you only want to limit hashagg, use the two parameters I mentioned above instead.

It’s hard to say whether more concurrency means more memory usage. For example, if your aggregation field is gender, more concurrency means more memory, but the overall usage is tiny either way. If the aggregation field has little duplication, the single hash table in the non-concurrent case may end up holding more data than the multiple hash tables do in the concurrent case, so there is no clear answer. My guess is that, in general, more concurrency uses somewhat more memory, but you can test it yourself (one way to test is sketched below).
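
One way to test this yourself, as suggested above, is to compare the memory column in EXPLAIN ANALYZE output under different concurrency settings. A sketch, reusing the grouping column from earlier in the thread (the full query is a placeholder, not the original SQL):

```sql
-- Run the aggregation with single-threaded HashAgg and note the operator's
-- peak memory in the EXPLAIN ANALYZE output.
SET SESSION tidb_hashagg_partial_concurrency = 1;
SET SESSION tidb_hashagg_final_concurrency = 1;
EXPLAIN ANALYZE
SELECT ta_cfm_serial_no, COUNT(*)
FROM dcdb.cr_trans_remain_mapping
GROUP BY ta_cfm_serial_no;

-- Restore the defaults and run the same statement again to compare.
SET SESSION tidb_hashagg_partial_concurrency = DEFAULT;
SET SESSION tidb_hashagg_final_concurrency = DEFAULT;
```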

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.