TiKV Scaling Up and Down, Resulting in a Large Number of Slow Queries

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tikv扩容缩容,出现大量慢查询

| username: magongyong

[TiDB Usage Environment] Production Environment
[TiDB Version]
[Reproduction Path] What operations were performed when the problem occurred
Scaled out new servers for the zone1 replica, then scaled in the old servers, which resulted in a large number of slow insert queries.
How to resolve this?

Additionally, after running the command tiup ctl:v6.5.5 pd config set label-property reject-leader zone z1 --pd="http://10.100.140.123:2379", what is the corresponding rollback operation?
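
For reference, a likely rollback (this assumes pd-ctl's label-property commands, which are deprecated in recent PD versions) is config delete label-property with the same arguments:

tiup ctl:v6.5.5 pd config delete label-property reject-leader zone z1 --pd="http://10.100.140.123:2379"

You can check the effective configuration with tiup ctl:v6.5.5 pd config show all --pd="http://10.100.140.123:2379" before and after.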

[Encountered Problem: Problem Phenomenon and Impact]
[Resource Configuration] Go to TiDB Dashboard - Cluster Info - Hosts and take a screenshot of this page
[Attachments: Screenshots/Logs/Monitoring]

| username: 像风一样的男子 | Original post link

Are you scaling up and down at the same time?

| username: magongyong | Original post link

Yes, and it's only one replica.

| username: magongyong | Original post link

How do I cancel the scale-in operation? I'm still looking for the command and haven't found it yet.

| username: magongyong | Original post link

The command “tiup ctl:v6.5.5 pd config set label-property reject-leader zone z1” did not take effect; the leaders in zone z1 did not migrate.

| username: zhanggame1 | Original post link

What does 1 replica mean? Did you change the 3 replicas setting to 1?

| username: magongyong | Original post link

There are 3 replicas in total, and only the server instances of the first replica (zone1) have been scaled out and in.

| username: tidb狂热爱好者 | Original post link

Didn’t understand, to be honest.

| username: magongyong | Original post link

The cluster has three replicas, with the zone topology planned as z1, z2, and z3. The TiKV instances in z1 have now been scaled out and in.

| username: 像风一样的男子 | Original post link

That's too aggressive; disk resource consumption will be very high. You can lower the region scheduling speed to reduce resource usage.

Adjust region scheduling speed:

  1. Log in to pd-ctl:
./pd-ctl -i -u http://0.0.0.0:2379
  2. Use the following commands:
>> store limit                         // Display the speed limit for adding and removing peers for all stores
>> store limit add-peer                // Display the speed limit for adding peers for all stores
>> store limit remove-peer             // Display the speed limit for removing peers for all stores
>> store limit all 5                   // Set the speed limit for adding and removing peers for all stores to 5 per minute
>> store limit 1 5                     // Set the speed limit for adding and removing peers for store 1 to 5 per minute

Try setting it lower as per this reference.
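
As a minimal sketch building on the commands above, the add-peer and remove-peer directions can also be throttled separately, for example:

>> store limit all 1 add-peer          // Limit adding peers to 1 per minute for all stores
>> store limit all 1 remove-peer       // Limit removing peers to 1 per minute for all stores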

| username: magongyong | Original post link

I set it to 1, but there are still some slow queries. How can I cancel the operation?

| username: zhang_2023 | Original post link

Manually schedule the regions.

| username: magongyong | Original post link

I haven't done this before. Could you share the commands or instructions, experts?

| username: 小龙虾爱大龙虾 | Original post link

Version 6.5.5 no longer supports this label-property configuration, right?

What’s the background for scaling up or down? Cross-data center deployment? Expanding to a remote data center?
How exactly is the scaling operation performed?
Have you analyzed the reasons for slow queries? Is it because the leader is running in a remote data center?
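
A hedged sketch of an alternative (assuming label-property is indeed ignored in this version, and using a placeholder <store-id>): leaders can be drained from a specific store with the evict-leader scheduler in pd-ctl:

tiup ctl:v6.5.5 pd scheduler add evict-leader-scheduler <store-id> --pd="http://10.100.140.123:2379"

And to roll it back once the scale-in is complete:

tiup ctl:v6.5.5 pd scheduler remove evict-leader-scheduler-<store-id> --pd="http://10.100.140.123:2379"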

| username: magongyong | Original post link

Sure, I will modify the parameters later.
The cause of the problem has been found: it was the scale-in; the scale-out has no impact.
When business SQL inserts data, it cannot find the leader, and the resulting backoff retries cause the slow queries. Directly stopping the node being scaled in resolves the issue.

| username: magongyong | Original post link

To summarize and make a record:

  1. For scaling in TiKV in an online business, do not directly use tiup cluster scale-in <cluster-name> --node 10.100.100.101:20161 --node 10.100.100.102:20161, as this can cause widespread slow DML queries and backoff retries.
  2. Scaling out TiKV has almost no impact or very low performance impact.
  3. For TiKV decommissioning, first manually migrate the leaders away by setting the store's leader weight to 0: tiup ctl:v6.5.5 pd store weight 69804641 0 1 --pd="http://10.100.100.111:2379".
  4. Then, after confirming the leaders have moved off the store (see the verification sketch after this list), proceed with the scale-in operation.
  5. For emergency operations, directly stop the decommissioned TiKV instance: tiup cluster stop <cluster-name> --node 10.100.100.101:20161 --node 10.100.100.102:20161.
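
A minimal verification sketch for step 4, reusing the store ID and PD address from step 3: before running scale-in, check that the store no longer holds any leaders:

tiup ctl:v6.5.5 pd store 69804641 --pd="http://10.100.100.111:2379"

The output includes leader_count and region_count for that store; once leader_count reaches 0, inserts should no longer hit "leader not found" backoffs when the node is stopped.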

| username: 逍遥_猫 | Original post link

As long as the cluster is available, you can directly execute step 5 to stop.

| username: magongyong | Original post link

Yes, that’s right. This is how we handled the emergency today, and it took effect immediately.

| username: dba远航 | Original post link

Scaling in or out will increase system I/O, which can cause SQL execution to slow down.

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.