After TiDB Cluster Scaling, High Disk Latency Causes Significant MQ Congestion

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TiDB集群扩容缩容后磁盘延迟大导致MQ拥堵很大

| username: jingyesi3401

[TiDB Usage Environment] Production environment, testing

[TiDB Version] v5.1.0

[Reproduction Path] Operations performed that led to the issue
Scaling out and scaling in the TiDB cluster.
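(The thread does not show the exact commands; as a hedged sketch, a scale-out followed by a scale-in on a tiup-managed cluster typically looks like the following, where the cluster name `tidb-prod`, the topology file, and the node address are placeholders.)

```shell
# Add the three new TiKV nodes described in scale-out.yaml (placeholder file name).
tiup cluster scale-out tidb-prod scale-out.yaml

# Once the new stores have joined and regions are balanced, retire the old
# TiKV nodes one at a time (placeholder address).
tiup cluster scale-in tidb-prod --node 10.0.0.1:20160

# Check that the old store has drained (Pending Offline -> Tombstone)
# before scaling in the next node.
tiup cluster display tidb-prod
```

The region migration that happens while an old store is in `Pending Offline` is exactly the phase that generates the extra disk I/O discussed below.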

[Encountered Issue: Problem Phenomenon and Impact]
Our company's TiDB cluster has three TiKV nodes whose IOPS is relatively low. We planned to add three new TiKV nodes and replace the existing ones; the new servers sit in a hyper-converged environment. During scale-out there were no problems with MQ, but during scale-in MQ started to back up. Now that the scale-in is complete, disk latency remains high. The hyper-convergence engineer checked the storage and said the SSD cache in the hyper-converged environment is fully utilized, so data now has to be read from the ordinary disks in the hyper-converged backend. The hyper-converged environment is also being expanded, but we would like to understand the following:

  1. Can the hyper-converged cache be cleared one by one to allow it to rebuild?
  2. Do the following parameters need to be adjusted:
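(The specific parameters referred to in question 2 are not shown above, so purely as an illustration: the PD scheduling limits below are the ones most often adjusted, via pd-ctl, to trade region-migration speed against disk pressure during scale-in. The PD address and the values are placeholder examples, not recommendations.)

```shell
# Inspect the current PD scheduling configuration (placeholder PD address).
tiup ctl:v5.1.0 pd -u http://10.0.0.10:2379 config show

# Lower values throttle leader/region rebalancing and reduce disk pressure on
# production traffic; higher values make the scale-in drain faster.
tiup ctl:v5.1.0 pd -u http://10.0.0.10:2379 config set leader-schedule-limit 4
tiup ctl:v5.1.0 pd -u http://10.0.0.10:2379 config set region-schedule-limit 1024
tiup ctl:v5.1.0 pd -u http://10.0.0.10:2379 config set replica-schedule-limit 32

# Per-store limit on add-peer/remove-peer operations per minute.
tiup ctl:v5.1.0 pd -u http://10.0.0.10:2379 store limit all 15
```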
| username: tidb菜鸟一只 | Original post link

What is your MQ used for, and is it deployed on the same machines as TiKV?

| username: zhanggame1 | Original post link

You mentioned that the hyper-converged storage is also being expanded. That expansion generates a large amount of I/O, but things should settle down once the expansion and data synchronization are finished. In addition, hyper-converged storage generally does not offer great I/O performance, so it is best to give TiKV dedicated SSDs.
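(As a rough, hedged check of whether a volume is fast enough for TiKV, a mixed random read/write fio run similar to the usual TiDB deployment checks can be compared between the hyper-converged volume and a dedicated SSD; the file path and size below are placeholders.)

```shell
# Run on the disk intended for TiKV data; placeholder path and size.
fio -ioengine=psync -bs=32k -fdatasync=1 -thread -rw=randrw \
    -size=10G -filename=/data/tikv/fio_test.dat \
    -name=tikv-disk-check -iodepth=4 -runtime=60 \
    -numjobs=4 -group_reporting
```

Comparing the reported IOPS and latency between the two volumes makes the gap concrete.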

| username: redgame | Original post link

Okay. No need.

| username: jingyesi3401 | Original post link

The MQ is used to write production data into the database, so any database performance problem causes it to back up.

| username: jingyesi3401 | Original post link

The final inspection confirmed that the cache SSD in the hyper-converged system was exhausted, so production queries were hitting the 7200 RPM disks in the hyper-converged backend directly, which caused a large backlog. The SSD tier has now been expanded, and data is being migrated from the regular disks back onto the SSDs.
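(If it helps to confirm the improvement after the SSD expansion, per-device latency can be watched directly on the TiKV hosts; `r_await`/`w_await` and `%util` on the TiKV data disk should drop once reads stop falling through to the 7200 RPM disks.)

```shell
# Print extended per-device stats every second; check r_await/w_await (ms)
# and %util for the TiKV data disk.
iostat -dx 1
```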

| username: jingyesi3401 | Original post link

Okay, thank you. I really appreciate the technical support.

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.