Encountering Performance Bottlenecks When Migrating Incremental Data with DM

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: dm迁移增量数据时遇到性能瓶颈

| username: TiDBer_oFH5DGTt

[TiDB Usage Environment] Production Environment
[TiDB Version] v6.1.3
[Reproduction Path]
[Encountered Problem: Phenomenon and Impact]
The source database is Alibaba Cloud's PolarDB, which generates binlog at roughly 1.5 GB/min.
When DM replicates the incremental data, it parses the binlog at only about 1 GB/min,
so replication keeps falling behind and cannot catch up with the latest data.
PS:

  1. Alibaba Cloud and DM sit in two different intranet environments, so the binlog is transmitted over the public network; however, public-network bandwidth usage is currently only about 20%, so the network is not the bottleneck.
  2. The current DM task configuration has syncers.batch set to 100 and syncers.worker-count set to 128.
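
For reference, a minimal sketch of the relevant part of the task file under those settings (the source ID and config-group name below are placeholders, not from the original post):

```yaml
# Excerpt from the DM task file (only the parts relevant to the settings above).
syncers:
  global:                 # config-group name; "global" is a placeholder
    worker-count: 128     # concurrent syncer worker threads, as reported above
    batch: 100            # DML statements per batch transaction

mysql-instances:
  - source-id: "polardb-source-01"   # placeholder source ID
    syncer-config-name: "global"     # references the syncers entry above
```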

[Resource Configuration]
TiDB and DM are deployed on three machines with 128 cores and 256 GB of RAM, with TiDB running on standard SATA SSDs.
[Attachments: Screenshots/Logs/Monitoring]

| username: guominghao | Original post link

You can upgrade DM to v6.5, which significantly improves the relay synchronization speed. Additionally, you can try enabling the syncer.multiple-rows configuration: DM Advanced Task Configuration File | PingCAP Docs
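
A minimal sketch of what enabling that option might look like in the advanced task file, assuming the syncers layout from the linked doc; the `compact` line is an optional companion setting that should be verified against your DM version:

```yaml
# Excerpt from the DM advanced task file (names are placeholders).
syncers:
  global:
    worker-count: 128
    batch: 100
    multiple-rows: true   # merge DMLs of the same type into multi-row statements
    # compact: true       # optional; compacts multiple changes to the same row -- verify availability in your version
```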

| username: dba-kit | Original post link

First confirm whether the bottleneck is in DM, in TiDB, or in the machine's CPU/IO; each case calls for a different optimization:

  • If the bottleneck is DM parsing the binlog, split the tables with the largest update volumes into independent tasks.
  • If the bottleneck is DM writing to TiDB, it means TiDB has reached its limit, and you could consider scaling out TiDB (though, given your machine configuration, this is unlikely).
  • If the bottleneck is the machines' hardware, consider upgrading to a higher-spec configuration. Given your setup, the SATA SSD is the most likely culprit: generating binlog at around 1 GB/min or more is already a very high rate, so DM's parsing speed may indeed be unable to keep up.

In summary: first check the machines' I/O utilization, then identify which tables have large update volumes and whether they can be split into independent DM tasks, so that lag on a single table does not drag down the whole task. You can also refer to the reply above; the three parameters mentioned there are effective in v6.1 as well.
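
For example, here is a rough sketch of carving a high-update table out into its own DM task with a block-allow list (the database, table, and source names are hypothetical); the original task would then add the same table under `ignore-tables` so the two tasks replicate disjoint sets:

```yaml
# hot-table-task.yaml -- excerpt of a separate task for the busiest table.
name: "hot-table-task"
task-mode: "incremental"

block-allow-list:
  hot-only:
    do-tables:
      - db-name: "app_db"     # hypothetical schema
        tbl-name: "orders"    # hypothetical high-update table

mysql-instances:
  - source-id: "polardb-source-01"   # placeholder source ID
    block-allow-list: "hot-only"
```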