After Scaling Down TiKV, IO Utilization Stays at 100% and a Large Number of Regions Migrate

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TIKV 缩容后, io占比100%, 及大量的region 迁移

| username: fish

[TiDB Usage Environment] Production Environment
[TiDB Version] v6.5.3
[Reproduction Path]:
The cluster has 5 servers, each deploying 3 TiKV instances: two on mechanical disks and one on an SSD, for a total of 15 TiKV nodes. On December 15th between 6:00 and 7:00, one mechanical-disk TiKV was scaled down because of memory issues. On the morning of the 17th between 8:20 and 8:30, another mechanical-disk TiKV was scaled down. According to the monitoring, since the 17th the IO of the SSD disks on all 5 servers has stayed at 100%.

[Encountered Problem: Symptoms and Impact]: The logs of the TiKV instances on the SSDs contain a large number of [“failed to send extra message”] errors, and many Regions report that no leader can be found, even though looking a Region up by its ID does show a leader. There are also frequent election info messages [“starting a new election”]. What was tried: stopping the business application and restarting the TiDB cluster did not resolve the 100% IO issue.
[Resource Configuration] Go to TiDB Dashboard - Cluster Info - Hosts and take a screenshot of this page
[Attachments: Screenshots/Logs/Monitoring]




Could everyone please take a look when you have time and suggest some troubleshooting ideas or solutions?
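
For reference, the kind of per-Region leader lookup mentioned above can be done through pd-ctl. This is only a sketch: the PD address and Region ID below are placeholders.

```shell
# Look up a single Region by ID; the output lists its peers and the current leader (if any)
tiup ctl:v6.5.3 pd -u http://<pd-ip>:2379 region <region-id>
```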

| username: Billmay表妹 | Original post link

Try scaling up first and see.

| username: Billmay表妹 | Original post link

Check out this article.

| username: Jellybean | Original post link

Based on your description and the observed symptoms, it looks like a large number of Regions needed to be migrated after the scale-down, and something went wrong during Region peer replication and migration.

  1. From the above image, it can be seen that an exception occurred when a Region tried to migrate data to the TiKV node with store id = 546213217, and the reason is shown as “Full.” Please check the disk space of that node (see the command sketch below).

    • You can view the specific machine IP through tiup ctl:{version} pd -u http://{pd-ip}:{pd-port} store {store-id}.
  2. Your cluster mixes SSD and regular HDD nodes. Are you using these machines for hot/cold data tiering? How have you planned and configured Placement Rules?

  3. Confirm the size of the data on the scaled-down nodes and the number of Regions and Leaders involved in the migration.

  4. Since the storage layer involves a large amount of data migration, restarting the cluster will not have much effect. You need to solve the data migration issue.

    • If possible, first scale up to increase the available space in the cluster, allowing the cluster to perform data migration normally. This might be the fastest way to recover.
    • Secondly, check the cluster’s high-space-ratio value through tiup ctl:{version} pd -u http://{pd-ip}:{pd-port} config show, and then judge from the monitoring whether it should be raised appropriately to relax space-related scheduling restrictions (see the command sketch below).

You can follow the above ideas and handling methods to troubleshoot and resolve the issue.
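
A minimal sketch of the commands behind steps 1 and 4 above. The PD address is a placeholder, the version and store id come from this thread, and the high-space-ratio value is only an example (it must stay below low-space-ratio, which defaults to 0.8):

```shell
# Step 1: check the state, capacity and remaining space of the store reported as "Full"
tiup ctl:v6.5.3 pd -u http://<pd-ip>:2379 store 546213217

# Step 4: check the current space-related scheduling thresholds
tiup ctl:v6.5.3 pd -u http://<pd-ip>:2379 config show | grep -i space-ratio

# If the disks are not actually near full, high-space-ratio can be raised carefully,
# e.g. from the default 0.7 to 0.75, to relax space-based scheduling restrictions
tiup ctl:v6.5.3 pd -u http://<pd-ip>:2379 config set high-space-ratio 0.75
```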

| username: dba远航 | Original post link

When a TiKV node is scaled down, all of the Leaders (and Region replicas) on that node have to be transferred to other nodes, which naturally generates a large amount of IO.
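
If the IO really is coming from this rebalancing, the operators PD is running and the per-store scheduling limits can be inspected, and the limits lowered to throttle migration IO. A sketch, with a placeholder PD address and an illustrative limit value:

```shell
# Operators PD is currently executing (replica additions, leader/Region transfers, etc.)
tiup ctl:v6.5.3 pd -u http://<pd-ip>:2379 operator show

# Current per-store scheduling limits
tiup ctl:v6.5.3 pd -u http://<pd-ip>:2379 store limit

# Lower the limit for all stores to slow Region migration down and relieve disk IO
tiup ctl:v6.5.3 pd -u http://<pd-ip>:2379 store limit all 5
```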

| username: zhanggame1 | Original post link

Isn’t it easy for a mechanical disk to reach 100% IO?

| username: xfworld | Original post link

There are a total of 15 TiKV nodes. Between 6:00 and 7:00 on December 15th, one mechanical-disk TiKV was scaled down due to memory issues. Between 8:20 and 8:30 on the morning of the 17th, another mechanical-disk TiKV was scaled down.


The biggest question is whether the Region leader transfer was handled manually before the scale-down. If not, the consequences of this operation could be quite severe.

Restarting won’t solve the problem. The most likely scenarios are:

  1. Some Region leaders cannot be recovered, which could lead to permanent data loss.
  2. Some Region replica groups cannot elect a leader, which puts the cluster’s operation at risk.
  3. Risky measures may have to be considered to recover these Regions, accepting possible data loss… (the checks sketched below can help confirm whether this is the case)
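
Whether any Regions really are in this state can be checked from PD with the replica checks below (a sketch; the PD address is a placeholder). If some Regions have truly lost a majority of their replicas, the risky measure mentioned in point 3 would typically be Online Unsafe Recovery (pd-ctl unsafe remove-failed-stores), which should only be used after confirming the data on the failed stores cannot be brought back.

```shell
# Regions reporting peers as down (for example peers left on the removed stores)
tiup ctl:v6.5.3 pd -u http://<pd-ip>:2379 region check down-peer

# Regions with peers whose Raft logs are not yet caught up
tiup ctl:v6.5.3 pd -u http://<pd-ip>:2379 region check pending-peer
```
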
| username: TIDB-Learner | Original post link

Such operations carry certain risks, so take this as a warning.

| username: 江湖故人 | Original post link

Each machine runs 3 TiKV instances; the labels were probably not set up properly, so the disaster-recovery layout is weak, and then on top of that the scale-down was done casually.
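
Whether labels were set can be seen directly from PD, and location awareness is what keeps multiple replicas of the same Region off TiKV instances that share a machine. A sketch only; the label name, store id and PD address are illustrative (normally the labels are set in the tiup topology via server.labels rather than by hand):

```shell
# The store list shows the labels attached to each TiKV instance (empty if none were set)
tiup ctl:v6.5.3 pd -u http://<pd-ip>:2379 store

# Tell PD which label levels to use when placing replicas, e.g. spread replicas across hosts
tiup ctl:v6.5.3 pd -u http://<pd-ip>:2379 config set location-labels host

# Attach a host label to a store
tiup ctl:v6.5.3 pd -u http://<pd-ip>:2379 store label <store-id> host host1
```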

| username: Jellybean | Original post link

It seems that the original poster still has a lot of room for improvement in understanding the reasons and principles behind these operations. After a series of actions, the problem has become even more complicated.

| username: tidb菜鸟一只 | Original post link

  1. First, confirm whether your TiKV nodes have labels configured. After the two TiKV nodes went offline, is it possible that the nodes or space matching the corresponding labels are no longer sufficient, so the affected Leaders cannot be migrated?
  2. On December 15th between 6:00 and 7:00 you scaled down a TiKV on a mechanical disk. Were you sure at that time that all the Leaders on that node had already been migrated to other nodes, or was the scale-down forced before the migration finished? Then, between 8:20 and 8:30 on the morning of the 17th, you scaled down another node. At that point the Region replicas from the first node might not have been fully replenished, and if the second node happened to hold replicas of those same Regions, some Regions would be left with too few replicas… (see the command sketch below)
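
Both points can be verified from PD around each scale-down. A sketch, with a placeholder PD address:

```shell
# A store being removed stays "Offline" until its leader_count and region_count drop to 0,
# and only then becomes "Tombstone"; check this before touching the next node
tiup ctl:v6.5.3 pd -u http://<pd-ip>:2379 store

# Make sure no Regions are still short of replicas before scaling down another node
tiup ctl:v6.5.3 pd -u http://<pd-ip>:2379 region check miss-peer
```
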
| username: dba远航 | Original post link

No results yet?