TiFlash Node Decommissioning is Extremely Slow

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TiFlash下线节点超级慢

| username: wakaka

[TiDB Usage Environment] Production Environment
[TiDB Version] v5.0.6
[Reproduction Path] Today, two TiFlash nodes kept restarting for unknown reasons and affected business operations, so I manually stopped them. I then used scale-in to remove one of the TiFlash nodes. When checked with the tiup command, the node stayed in the Pending Offline state, and operator scheduling was only initiated once every few dozen minutes.


[Image: output of config show]

[Encountered Problem: Problem Phenomenon and Impact] I don’t understand why the decommissioning process is so slow.

| username: weixiaobing

Take a look at the scheduling-related parameters. Tuning them can speed up the entire decommissioning process.

| username: tidb菜鸟一只

You can adjust the relevant parameters to speed up the scheduling, but it may have an impact on the production environment.

| username: wakaka

The parameters are fine, but the main issue is that it takes about 40 minutes each time before scheduling starts. Not sure why.

| username: WalterWj

How many TiFlash nodes are there, and how many TiFlash replicas are set in the cluster? If the number of replicas is greater than the number of nodes, it cannot be scaled down.
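
The replica check above can be sketched as a small function. This is only an illustration: the function name and parameters are hypothetical (not a TiUP or PD API), and it assumes the rule is applied to the nodes that remain after the scale-in.

```python
def can_scale_in(tiflash_nodes: int, max_replicas: int, nodes_to_remove: int = 1) -> bool:
    """Return True if the nodes left after scale-in can still hold every replica.

    tiflash_nodes   -- current number of TiFlash nodes
    max_replicas    -- largest TiFlash replica count set on any table
    nodes_to_remove -- how many nodes are being scaled in (hypothetical parameter)
    """
    if tiflash_nodes < 0 or max_replicas < 0 or nodes_to_remove < 0:
        raise ValueError("counts must be non-negative")
    remaining = tiflash_nodes - nodes_to_remove
    # Each replica of a Region needs its own node, so the remaining
    # node count must be at least the replica count.
    return remaining >= max_replicas


# A cluster with 10 TiFlash nodes and 2 replicas, removing one node
print(can_scale_in(10, 2))  # True
```

With 10 nodes and 2 replicas the check passes, so the replica count alone would not block the scale-in.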

| username: wakaka

10 TiFlash nodes, 2 replicas

| username: WalterWj

Then this is unexpected. Normally, taking a node offline only requires its replicas to be replenished on the remaining nodes. My understanding is that TiFlash itself is currently having problems, so it’s best to check the error logs to find out why TiFlash keeps restarting.

If that doesn’t work, you’ll need to recreate the TiFlash replicas.

| username: wakaka

Yes, my concern is that taking TiFlash offline is very slow. Data is scheduled off the node only about once every 40 minutes, and each round then takes a long time.

| username: WalterWj

Is it just slow, or is it not moving at all? You can check the TiFlash monitoring to see whether the number of Regions on the node is decreasing. If you think it’s slow, you can increase the store limit.
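
To tell "slow" from "stuck", a few Region counts read off the monitoring panel at known times are enough. Below is a minimal sketch of that check. The function names and the sample numbers are made up for illustration and do not come from this cluster.

```python
from typing import List, Optional, Tuple

# (minutes since the first sample, Region count on the store being removed)
Sample = Tuple[float, int]


def drain_rate(samples: List[Sample]) -> float:
    """Average number of Regions removed per minute between the first and last sample."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    (t0, r0), (t1, r1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("samples must be in increasing time order")
    return (r0 - r1) / (t1 - t0)


def eta_minutes(samples: List[Sample]) -> Optional[float]:
    """Estimated minutes until the store is empty, or None if the count is not decreasing."""
    rate = drain_rate(samples)
    if rate <= 0:
        return None  # not moving at all (or growing)
    return samples[-1][1] / rate


# Hypothetical readings: 200 Regions leave in each 40-minute round
readings = [(0, 12000), (40, 11800), (80, 11600)]
print(drain_rate(readings))   # 5.0
print(eta_minutes(readings))  # 2320.0
```

A result of None would mean the node is not draining at all, which points to a different problem than a low store limit.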

| username: wakaka

The graph shows scheduling happening roughly once every 40 minutes. The store limit is 200.
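
For a sense of scale, the store limit only caps how fast peers can be moved off a store. A rough sketch of that upper bound follows, assuming the limit is counted in operators per minute (as PD documents it) and using a made-up Region count. The function name is hypothetical.

```python
import math


def min_drain_minutes(region_count: int, store_limit_per_minute: float) -> int:
    """Lower bound on drain time if scheduling ran continuously at the store limit."""
    if region_count < 0:
        raise ValueError("region_count must be non-negative")
    if store_limit_per_minute <= 0:
        raise ValueError("store limit must be positive")
    return math.ceil(region_count / store_limit_per_minute)


# Hypothetical store holding 12000 Regions, store limit of 200
print(min_drain_minutes(12000, 200))  # 60
```

If the observed drain time is far above this bound, the store limit is probably not the bottleneck, and the gaps between scheduling rounds are what need explaining.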

| username: WalterWj

Take a look at the finished operators. Did they succeed?