TiFlash Frequently OOMs and Restarts During Data Synchronization

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 同步数据时,tiflash频繁OOM重启

| username: gejibin

[TiDB Version] v5.4.0
I restored a TiDB backup into a newly created TiDB instance. After the TiKV restore finished, TiFlash was still syncing data, and the business began performing write operations (with the engine setting configured so that queries do not use TiFlash). TiFlash kept syncing, but once it had synced about 1.4 TB (the data was still not fully synced), two TiFlash nodes began to OOM and restart frequently, roughly every 4 minutes. Setting the following parameters did not help, and the OOMs continue:

cop_pool_size
batch_cop_pool_size
max_memory_usage
max_memory_usage_for_all_queries

tiflash-log.zip (4.0 MB)

| username: jansu-dev | Original post link

The logs currently show that tidb-gemkafk19g-tiflash-0.tidb-gemkafk19g-tiflash-peer.tidb-gemkafk19g.svc can never be reached, so Raft messages back up in the channel and the failure is logged continuously. We need to investigate why the connection fails.

@flow-PingCAP, could you please take a look?

| username: flow-PingCAP | Original post link

Check the Grafana monitoring and post the following two metrics. Alternatively, you can upload a Clinic report: PingCAP Clinic Quick Start Guide | PingCAP Docs

  1. TiFlash-Proxy-Details / Server / Store size increase

  2. raftstore-entry-cache memory usage increase. This one is not shown by default, so you need to manually add a panel to the TiFlash-Proxy-Details dashboard in Grafana:

Metrics:

tiflash_proxy_tikv_server_mem_trace_sum{k8s_cluster="$k8s_cluster", tidb_cluster="$tidb_cluster", instance=~"$instance", name=~"raftstore-.*"}

Legend:

{{instance}}-{{name}}
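
If adding a Grafana panel is inconvenient, the same metric can be pulled directly from Prometheus. The following is a minimal sketch, not an official tool: it fills in the Grafana variables from the query above with concrete values and sends the query to Prometheus' standard `/api/v1/query` HTTP endpoint. The function names, the Prometheus address, and the label values are placeholders assumed for illustration; substitute the ones from your own deployment.

```python
import json
import urllib.parse
import urllib.request


def build_raftstore_mem_query(k8s_cluster, tidb_cluster, instance_regex=".*"):
    """Build the PromQL selector from this thread with concrete label values.

    The Grafana variables ($k8s_cluster, $tidb_cluster, $instance) are
    replaced by the arguments; all three values are deployment-specific.
    """
    return (
        "tiflash_proxy_tikv_server_mem_trace_sum{"
        f'k8s_cluster="{k8s_cluster}", '
        f'tidb_cluster="{tidb_cluster}", '
        f'instance=~"{instance_regex}", '
        'name=~"raftstore-.*"}'
    )


def fetch_instant(prometheus_url, query, timeout=10):
    """Run an instant query against Prometheus and return the result list."""
    url = (
        prometheus_url.rstrip("/")
        + "/api/v1/query?"
        + urllib.parse.urlencode({"query": query})
    )
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)["data"]["result"]


def format_series(result):
    """Label each series as {{instance}}-{{name}}, matching the legend above."""
    lines = []
    for series in result:
        metric = series.get("metric", {})
        label = f"{metric.get('instance', '?')}-{metric.get('name', '?')}"
        # An instant-query sample is a [timestamp, "value"] pair.
        lines.append(f"{label}: {series['value'][1]}")
    return lines


if __name__ == "__main__":
    # Placeholder label values; replace them with your own.
    print(build_raftstore_mem_query("my-k8s-cluster", "my-tidb-cluster"))
    # To query a live Prometheus (the address is an assumption):
    # result = fetch_instant("http://127.0.0.1:9090", query)
    # print("\n".join(format_series(result)))
```

Running the query every few minutes and comparing the values shows whether the raftstore memory trace keeps growing between OOM restarts.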

| username: flow-PingCAP | Original post link

Additionally, can you confirm if TiDB is running relatively large transactions?

| username: flow-PingCAP | Original post link

I checked the relevant records, and the common issue that causes these OOM restarts is fixed in version 5.4.3 and later. You can try upgrading TiFlash on its own first.

Patch versions are generally bugfix releases, so upgrading only TiFlash should not cause major problems. To be safe, however, once you have confirmed that the issue is resolved, it is recommended to upgrade the entire TiDB cluster to version 5.4.3.
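
To check several components against that version at once, a small helper like the one below can be used. It is only an illustrative sketch: the function names are hypothetical, and the 5.4.3 threshold is taken from this reply on the assumption that the fix is present from 5.4.3 onward.

```python
import re

# Assumption from this thread: the OOM fix is available from v5.4.3 onward.
FIXED_IN = (5, 4, 3)


def parse_version(text):
    """Parse strings such as 'v5.4.0' or '5.4.3' into a (major, minor, patch) tuple."""
    match = re.match(r"v?(\d+)\.(\d+)\.(\d+)", text.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    return tuple(int(part) for part in match.groups())


def needs_upgrade(version, fixed_in=FIXED_IN):
    """Return True when the given version is older than the fixed version."""
    return parse_version(version) < fixed_in


if __name__ == "__main__":
    for v in ("v5.4.0", "v5.4.3"):
        print(v, "needs upgrade:", needs_upgrade(v))
```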
