Consultation on Cluster Data Migration Solutions

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 咨询集群数据迁移方案

| username: TiDBer_Y2d2kiJh

[TiDB Usage Environment] Production Environment / Testing / PoC
[TiDB Version] v5.4.0, with 2 TiDB, 3 PD, 3 TiKV, and 2 HA (load balancer) nodes; data volume under 200 GB.
[Reproduction Path] The current cluster has hit an IO bottleneck, so we plan to migrate the data to a new environment with 2 TiDB, 3 PD, and 5 TiKV nodes, all on SSDs. We are now deciding how to perform the migration: the scale-out and scale-in approach, or backup and restore plus TiCDC. The leader has allowed a business downtime of 5 seconds.
[Encountered Problems: Problem Phenomenon and Impact]
[Resource Configuration] Go to TiDB Dashboard - Cluster Info - Hosts and take a screenshot of this page
[Attachments: Screenshots/Logs/Monitoring]

| username: Fly-bird | Original post link

As long as the two networks can reach each other, you can add the new nodes and remove the old ones one by one; there is no need to migrate the whole system in one go.

| username: TiDBer_Y2d2kiJh | Original post link

Right, the idea is to migrate by scaling out and scaling in.

| username: Fly-bird | Original post link

Scaling out and scaling in does not require downtime.

| username: TiDBer_oHSwKxOH | Original post link

If your boss gives you 5 seconds, and it’s a critical business, I suggest you run away.
For critical business, you definitely need official support. For non-critical, you can handle it yourself.

| username: 像风一样的男子 | Original post link

If your boss gives you 5 seconds, and it’s a critical business, I suggest you run away.

| username: tidb菜鸟一只 | Original post link

The business downtime budget is 5 seconds. I suspect that just stopping and restarting the business service would already take more than 5 seconds, right? In your situation, if the network between the new and old environments allows it, online scaling is recommended. Scale out TiKV first; to minimize the impact on the business, scale out one node at a time and slightly lower the balance and store limit parameters. Then scale out TiDB and PD. After the scale-out, add the new TiDB addresses to the HA load balancer. Then gradually scale in the old TiKV and TiDB nodes, and finally scale in the old PD nodes, with the old PD leader scaled in last. This might cause some business backoff, but it should be the safest approach.
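
For reference, a minimal sketch of what one TiKV scale-out step could look like with tiup; the cluster name `tidb-prod`, IPs, and ports are placeholders, not values from the original post:

```bash
# Describe one new SSD-backed TiKV node in a scale-out topology file
# (cluster name "tidb-prod", IPs, and ports are placeholders).
cat > scale-out-tikv.yaml <<EOF
tikv_servers:
  - host: 10.0.1.11
    port: 20160
    status_port: 20180
EOF

# Check the new node against the existing cluster, then scale out.
tiup cluster check tidb-prod scale-out-tikv.yaml --cluster
tiup cluster scale-out tidb-prod scale-out-tikv.yaml

# Confirm the new store is Up and has started receiving regions.
tiup cluster display tidb-prod
```

The same pattern would then be repeated node by node for the remaining TiKV instances and for the new TiDB and PD instances.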

| username: Kongdom | Original post link

:thinking: With 5 seconds allowed, go with scale-out and scale-in. Make it 0 seconds and you exceed the leader’s expectations.

| username: 像风一样的男子 | Original post link

The old cluster has an IO bottleneck. During the scale-out and scale-in, even a slight hiccup can last more than 5 seconds. Be prepared to take the blame.

| username: Kongdom | Original post link

:joy: If you look at it that way, TiCDC won’t work either; it hits the same IO bottleneck.

| username: TiDBer_Y2d2kiJh | Original post link

We have previously scaled a TiKV node out and back in.

| username: TiDBer_Y2d2kiJh | Original post link

The network is connected, and this is indeed the plan we are going with; at cutover we only need to restart HA. How should we understand your statement that “this might cause some business backoff”?

| username: TiDBer_Y2d2kiJh | Original post link

The IO is not that bad; we are just hoping for better IO on the new hardware.

| username: Kongdom | Original post link

In that case it’s fine. Just follow the plan the expert above suggested: scale out one node at a time, lower the scheduling rate, and take it slowly.

| username: tidb菜鸟一只 | Original post link

Although TiDB’s PD follows the three-node majority principle and any node can serve, in practice requests must go to the leader node. So during a PD leader switch, your cluster will briefly be unable to find the PD leader and will back off. Under normal circumstances another PD node is elected leader, and the backoff redirects the request from the old PD leader to the new one, so the business proceeds normally. However, if your region size is large, the leader switch might not be fast enough, the backoff may fail to find the new leader node, and requests can fail. With your data volume under 200 GB, this should not be a significant issue.

Additionally, the non-leader PD nodes can be scaled in directly and the cluster switches over automatically (TiDB and TiKV can also be scaled in directly, though TiKV should be handled slowly, one node at a time). However, it is suggested to switch the PD leader first before scaling it in, because each PD node caches all PD information; scaling in the leader directly might leave the PD caches inconsistent and affect the business. Switch the leader first, and once the new PD leader is serving normally, scale in the old leader (which has by then become an ordinary member).
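
If it helps, a sketch of that leader switch with pd-ctl before scaling in the old PD leader; the PD address, member name, and cluster name below are assumed placeholders:

```bash
# Check which PD member is currently the leader
# (the PD address and member names are placeholders).
tiup ctl:v5.4.0 pd -u http://10.0.0.1:2379 member leader show

# Transfer the PD leader to one of the new PD nodes first.
tiup ctl:v5.4.0 pd -u http://10.0.0.1:2379 member leader transfer pd-10.0.1.13-2379

# Once the new leader is serving normally, scale in the old (now follower) PD node.
tiup cluster scale-in tidb-prod --node 10.0.0.1:2379
```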

Regarding the IO bottleneck mentioned above: since the business can still tolerate the current IO limits, scale out TiKV first. During the scale-out, IO may be impacted further (so it is recommended to lower the balance and store limit speeds to minimize the impact), but once the new TiKV nodes gradually come online the IO bottleneck will be greatly alleviated, because the new SSD-backed TiKV nodes take on part of the IO pressure and reduce the load on the old nodes.
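
A sketch of lowering the scheduling speed with pd-ctl before the scale-out and restoring it afterwards; the PD address and the values are illustrative, not tuned recommendations:

```bash
# Lower the scheduling rate so region balancing onto the new nodes
# adds as little IO pressure as possible (values are illustrative).
tiup ctl:v5.4.0 pd -u http://10.0.0.1:2379 config set leader-schedule-limit 2
tiup ctl:v5.4.0 pd -u http://10.0.0.1:2379 config set region-schedule-limit 4
tiup ctl:v5.4.0 pd -u http://10.0.0.1:2379 store limit all 5

# After the migration finishes, raise the limits back (the default store limit is 15).
tiup ctl:v5.4.0 pd -u http://10.0.0.1:2379 store limit all 15
```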

| username: Kongdom | Original post link

:+1: :+1: :+1:

| username: TiDBer_Y2d2kiJh | Original post link

Thank you for the guidance. Since the data volume is not large the risk is relatively low, but we will still strictly follow the recommended plan and scale each node out and in one at a time. For example, for TiKV we will first scale out one node and then slowly scale in one node; for PD we will scale out first, then switch the leader, then scale in, with the corresponding balance and store limit parameters configured.
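
For the scale-in half of that plan, a sketch for one old TiKV node, assuming the same placeholder cluster name and addresses as above:

```bash
# Scale in one old TiKV node at a time (the address is a placeholder).
tiup cluster scale-in tidb-prod --node 10.0.0.4:20160

# The store first goes "Pending Offline" while its regions are moved away;
# wait until it shows "Tombstone" before scaling in the next node.
tiup cluster display tidb-prod

# Remove Tombstone stores from the topology once they are fully drained.
tiup cluster prune tidb-prod
```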

| username: TiDBer_Y2d2kiJh | Original post link

We still need to restart HA to apply the new IPs; the restart took 1 second.
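
For what it’s worth, assuming the HA layer is HAProxy, the new tidb-server addresses can usually be applied with a config check plus a reload rather than a full restart, which keeps the interruption even shorter; the backend name, IPs, and ports below are placeholders:

```bash
# In haproxy.cfg, point the TiDB backend at the new tidb-server instances
# (backend name, IPs, and ports are placeholders):
#
#   backend tidb_cluster
#       mode tcp
#       balance roundrobin
#       server tidb-new-1 10.0.1.21:4000 check
#       server tidb-new-2 10.0.1.22:4000 check

# Validate the edited config, then reload without a full restart.
haproxy -c -f /etc/haproxy/haproxy.cfg
systemctl reload haproxy
```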

| username: Kongdom | Original post link

:yum: Rigorous, and that counts as exceeding expectations.

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.