Scaling Down a TiKV Node Results in Congestion Incident

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 缩容一台tikv节点,结果就出现拥堵事故

| username: zhimadi

[TiDB Usage Environment] Production Environment
[TiDB Version] V5.4.2
[Reproduction Path] Operations performed that led to the issue:

  1. The cluster originally had 6 TiKV nodes, all cloud hosts with 16 cores and 32GB configuration, running normally.
  2. Planned to upgrade the configuration of 4 nodes to 24 cores and 48GB memory one by one, then scale down 2 nodes.
  3. Currently, 4 nodes have been upgraded and 1 node has been scaled down; the 6th node has not been scaled down yet. The cluster now has 5 TiKV nodes: 4 with 24 cores and 1 old 16-core TiKV (tikv-01).

[Encountered Issue: Problem Phenomenon and Impact]
At 10:40 AM, during peak business hours, the 16-core tikv-01 node came under extremely high load, causing system-wide delays and a large number of slow queries over 8 seconds. Checking the execution plans, almost all of the longest processing times were on the tikv-01 node.

At 10:45 AM, a scale-down of tikv-01 was executed, and things collapsed: the business was stuck for an hour and only eased once the number of regions on tikv-01 dropped to 0.

At 11:20 AM, a new machine was urgently scaled out, but its regions have not yet fully synchronized.

Checking tidb.log revealed that a large number of requests were being handled by the tikv-01 machine.

At this point it is too late to run away, so I am urgently asking friends in the community for help in finding the cause and planning follow-up countermeasures.

Which parameters and logs should be checked to pinpoint the cause?

[Resource Configuration]
Disk usage on each node is between 40% and 50%;
Each TiKV node has around 4.5k regions.
[Attachments: Screenshots/Logs/Monitoring]
Attachments 1-4: monitoring screenshots (images not preserved in the translated post).
| username: zhanggame1 | Original post link

Check whether the I/O is saturated.
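
A quick way to check this on the TiKV host is with iostat from the sysstat package; %util close to 100% means the data disk is saturated:

```shell
# Extended per-device stats, refreshed every second for 5 samples.
# Watch the %util and await columns for the TiKV data disk.
iostat -x -d 1 5
```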

| username: zhimadi | Original post link

I/O utilization was between 46% and 99%.

| username: redgame | Original post link

The scale-down left the load unevenly distributed across the other nodes, resulting in a performance decline.

| username: h5n1 | Original post link

  1. During the scale-down, region transfers consume a significant amount of system resources.
  2. Leader switches can leave TiDB's region cache stale; requests sent to the offline TiKV based on that stale cache end up in backoff retries. You can check the backoff-related time in the slow query logs, as sketched below.
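
If you prefer SQL over grepping the slow log file, something like the following works against INFORMATION_SCHEMA.SLOW_QUERY. The connection details and time window are placeholders, and the backoff columns follow the v5.x slow log fields, so verify the names on your version:

```shell
TIDB_HOST=127.0.0.1   # placeholder: any TiDB server in the cluster
# Slow queries in the incident window, ordered by time spent in backoff retries.
mysql -h "$TIDB_HOST" -P 4000 -u root -p -e "
  SELECT time, query_time, backoff_time, backoff_types, LEFT(query, 80) AS q
  FROM information_schema.slow_query
  WHERE time BETWEEN '<incident-start>' AND '<incident-end>'
  ORDER BY backoff_time DESC
  LIMIT 20;"
```
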
| username: tidb菜鸟一只 | Original post link

It is recommended to scale out or in one node at a time, and to wait for the regions to rebalance before moving on to the next node. During business hours you can set the store limit very low and accept that the scheduling takes longer; otherwise region scheduling consumes a lot of I/O, and frequent leader switching causes a large number of backoffs, affecting online business.
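
A sketch of what "set the store limit very low" could look like with pd-ctl via tiup; the PD endpoint is a placeholder and the limit value is only an example (the default in v5.x is 15 operations per minute per store):

```shell
PD=http://127.0.0.1:2379   # placeholder: any PD endpoint
# Show the current per-store scheduling limits.
tiup ctl:v5.4.2 pd -u "$PD" store limit
# Throttle scheduling during business hours, e.g. to 5 ops/min per store.
tiup ctl:v5.4.2 pd -u "$PD" store limit all 5 add-peer
tiup ctl:v5.4.2 pd -u "$PD" store limit all 5 remove-peer
```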

| username: zhanggame1 | Original post link

The suspected cause is that during the scale-down the leader distribution was uneven, with too many leaders on tikv-01. This can be checked on the leader-distribution panels of the PD monitoring in Grafana for that time window.

You might consider enabling the follower read feature to alleviate the issue of leaders being concentrated on a single TiKV.
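
For reference, follower read is controlled by the tidb_replica_read system variable; a minimal sketch of checking and enabling it (connection details are placeholders):

```shell
TIDB_HOST=127.0.0.1   # placeholder: any TiDB server in the cluster
mysql -h "$TIDB_HOST" -P 4000 -u root -p -e "
  SHOW VARIABLES LIKE 'tidb_replica_read';                -- current setting
  SET GLOBAL tidb_replica_read = 'leader-and-follower';   -- allow reads on followers too
"
```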

| username: Jellybean | Original post link

Scaling out a production cluster is fine, but try to scale in during off-peak periods, or lower the region and leader scheduling limits first. Otherwise the cluster's resources get squeezed, exactly as the original poster described.
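
"Lower the number of scheduled regions and leaders" maps to the PD scheduling limits; a hedged sketch with pd-ctl, where the values are examples rather than recommendations (the v5.x defaults are leader-schedule-limit=4 and region-schedule-limit=2048):

```shell
PD=http://127.0.0.1:2379   # placeholder: any PD endpoint
# Check the current values first.
tiup ctl:v5.4.2 pd -u "$PD" config show | grep -E 'leader-schedule-limit|region-schedule-limit'
tiup ctl:v5.4.2 pd -u "$PD" config set leader-schedule-limit 2     # fewer concurrent leader transfers
tiup ctl:v5.4.2 pd -u "$PD" config set region-schedule-limit 512   # fewer concurrent region moves
# Restore the defaults once the scale-in has finished.
```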

| username: wakaka | Original post link

I/O utilization was previously between 50% and 80%. The extra I/O generated by the scale-down kept one node at 80% to 100%, which slowed a large number of SQL queries down significantly and ultimately turned into a severe incident.

| username: zhimadi | Original post link

However, the first TiKV scale-down took more than 24 hours and caused no issues; the performance problems only appeared a day later.

| username: zhimadi | Original post link

Could it be related to the two TiKV nodes that were initially created with the cluster?

| username: zhimadi | Original post link

It was done node by node. The performance issue only appeared more than 24 hours after the first TiKV scale-down.

| username: zhimadi | Original post link

This was with follower read already enabled.

| username: zhimadi | Original post link

Do you mean manually running a scheduler add evict-leader-scheduler adjustment?

| username: zhimadi | Original post link

Is there a solution?

| username: tidb菜鸟一只 | Original post link

You effectively had 6 TiKV nodes. First you upgraded the memory and CPU of 4 of them by restarting them; after the restarts you did not check whether the leaders were balanced, and you then took one of the two non-upgraded TiKV nodes offline. After 24 hours, performance issues began to appear.

So, you need to check the leader distribution at that time and whether there was any hotspot information.
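
Current leader distribution and hotspots can be pulled with pd-ctl (historical values for the incident window are in the Grafana PD dashboards); a sketch, with the PD endpoint as a placeholder and jq used only for readability:

```shell
PD=http://127.0.0.1:2379   # placeholder: any PD endpoint
# Leader and region counts per store; a heavily skewed leader_count points at the problem node.
tiup ctl:v5.4.2 pd -u "$PD" store \
  | jq '.stores[] | {address: .store.address, leaders: .status.leader_count, regions: .status.region_count}'
# Current read/write hotspot statistics per store.
tiup ctl:v5.4.2 pd -u "$PD" hot read
tiup ctl:v5.4.2 pd -u "$PD" hot write
```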

| username: zhimadi | Original post link

Yes, those are the steps. At the time I didn't check the leaders. In that case, should I wait for a while before taking a node offline, or can I intervene manually? What command should I use?

| username: tidb菜鸟一只 | Original post link

After scaling out, first check the leader-distribution monitoring page in Grafana. Generally, if all nodes have the same resources, the number of leaders will be roughly the same; in your case, where some nodes have been scaled up, those should carry more leaders. You should then take offline the node with fewer leaders among the lower-configuration nodes. Before taking it offline, you can use pd-ctl to run scheduler add evict-leader-scheduler store_id to evict the leaders from the TiKV you want to take offline.
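
Putting that together, a sketch of the evict-then-scale-in sequence; the cluster name, PD endpoint, store ID, and node address are all placeholders:

```shell
PD=http://127.0.0.1:2379   # placeholder: any PD endpoint
STORE_ID=1                 # placeholder: store_id of the TiKV to take offline
CLUSTER=mycluster          # placeholder: tiup cluster name
NODE=172.16.0.1:20160      # placeholder: address of the TiKV node

# 1. Find the store_id and current leader_count of the node to be removed.
tiup ctl:v5.4.2 pd -u "$PD" store

# 2. Evict its leaders first, so the scale-in itself does not trigger a wave of leader switches.
tiup ctl:v5.4.2 pd -u "$PD" scheduler add evict-leader-scheduler "$STORE_ID"
tiup ctl:v5.4.2 pd -u "$PD" scheduler show   # confirm the evict scheduler is running

# 3. Once leader_count on that store has dropped to 0, run the scale-in.
tiup cluster scale-in "$CLUSTER" --node "$NODE"

# 4. After the store becomes Tombstone, remove the evict scheduler
#    (check `scheduler show` for the exact scheduler name on your version).
tiup ctl:v5.4.2 pd -u "$PD" scheduler remove evict-leader-scheduler-"$STORE_ID"
```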