Scheduling Timeout After TiDB Upgrade, Increase in Abnormal Regions

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TiDB升级后调度超时,异常region增多

| username: realcp1018

[Cluster Version] Upgrading from v3.0.12 to v4.0.16
[Reproduction Path]
Upgrading a cluster with a large volume of data (approximately 80 TB, around 120 TiKV instances).
The early phase of the upgrade went smoothly, with the evict-leader step taking about 1 minute per instance. However, when about 30 instances were left, leader transfer started to slow down and began hitting the 600s timeout. To speed up the upgrade I manually restarted the TiKV instances on the affected nodes, and the upgrade then completed normally.
[Encountered Issue: Problem Phenomenon and Impact]
After the upgrade, monitoring revealed an increase in abnormal peers on the PD scheduling interface, as shown in the image below.


Upon checking the PD leader logs and conducting manual tests, it was found that regular leader transfer operators and remove peer operations had a high probability of timing out. The image below shows the execution log of a randomly selected operator:
First, the timeout process displayed by the PD leader:

Then, the operations related to this region displayed by the target store:

It stops at “deleting applied snap file,” and it’s unclear whether it actually got stuck there or whether that is simply the last log line for this region.
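
For anyone wanting to repeat this kind of cross-check, a rough sketch (the region id and the TiKV log path are placeholders for your deployment):

» operator show
» region <region_id>
grep "<region_id>" /path/to/tikv-deploy/log/tikv.log

operator show lists the operators PD currently has in flight, region <region_id> shows the region’s peers and pending peers, and the grep follows the same region in the target store’s TiKV log.
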
Due to high disk usage, I performed an expansion, but the balancing speed of the new nodes is extremely slow.
[My Guess and My Question]
My suspicion is that the forced restarts triggered large-scale region peer replacement, and the low replacement rate has led to the accumulation of extra-peer and learner-peer regions.
Is there any way to address the operator timeouts described above, and how can region scheduling be sped up? Simply raising the region scheduling parameters currently only produces more accumulation without actually speeding things up.
[Other Attachments]
Current PD configuration:

» config show
{
  "replication": {
    "enable-placement-rules": "false",
    "location-labels": "host",
    "max-replicas": 3,
    "strictly-match-label": "false"
  },
  "schedule": {
    "enable-cross-table-merge": "false",
    "high-space-ratio": 0.75,
    "hot-region-cache-hits-threshold": 3,
    "hot-region-schedule-limit": 4,
    "leader-schedule-limit": 8,
    "leader-schedule-policy": "count",
    "low-space-ratio": 0.8,
    "max-merge-region-keys": 200000,
    "max-merge-region-size": 20,
    "max-pending-peer-count": 16,
    "max-snapshot-count": 3,
    "max-store-down-time": "30m0s",
    "merge-schedule-limit": 16,
    "patrol-region-interval": "100ms",
    "region-schedule-limit": 256,
    "replica-schedule-limit": 64,
    "split-merge-interval": "1h0m0s",
    "tolerant-size-ratio": 0
  }
}
» scheduler show
[
  "label-scheduler",
  "balance-region-scheduler",
  "balance-hot-region-scheduler",
  "balance-leader-scheduler"
]
| username: h5n1 | Original post link

The scheduler-limit parameters cap the rate at which operators are generated, while pd-ctl’s store limit caps the rate at which they are consumed; you can try increasing the latter. It might also be a good idea to stop the leader and hot-region schedulers first to reduce scheduling volume and conflicts.
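
A rough sketch of these adjustments in pd-ctl (the limit value is only illustrative; the two schedulers can be added back afterwards with scheduler add):

» store limit
» store limit all 64
» scheduler remove balance-leader-scheduler
» scheduler remove balance-hot-region-scheduler

store limit with no arguments shows the current per-store limits, store limit all raises the operator consumption rate for every store, and removing the two schedulers temporarily stops leader and hot-region scheduling.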

| username: realcp1018 | Original post link

Adding a question:
I noticed the parameter “max-snapshot-count”: 3 while going through the PD configuration and can’t make sense of it.
I have searched quite a bit but still haven’t found a clear explanation. The code seems to suggest it is a snapshot of the PD cluster config, which is confusing given that it sits in the schedule section.
What is the relationship between this snapshot and the snapshots in the TiKV logs?

| username: h5n1 | Original post link

When a region is migrated, a snapshot file of the region is first generated and sent to the target TiKV; the target then catches up by applying the Raft log from that snapshot onward. The max-snapshot-count setting limits how many such snapshots a single store may be sending or receiving at the same time.

| username: realcp1018 | Original post link

Currently, because of the extra-peer backlog, I have raised the store limit for all stores to the upper limit of 200 and lowered the region-schedule-limit so that the backlog gets consumed quickly. That part has already been adjusted.

| username: realcp1018 | Original post link

Thank you for the explanation, I’ll adjust it. I see that the default in newer versions is 64, while it is 3 in 4.0; I want to raise it to 64 and see the effect.
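
For reference, this can be changed online from pd-ctl, for example:

» config set max-snapshot-count 64
» config show

Running config show afterwards just confirms that the schedule section picked up the new value.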

| username: h5n1 | Original post link

Are the disk I/O, CPU, and network busy?

| username: realcp1018 | Original post link

Supplementary question:
In the operator execution log I included in this post, the add-peer step timed out and failed, while the corresponding store logged “deleting applied snap file” and then nothing more for this region. Under normal circumstances, should there be a follow-up log such as “snap file deleted”? Could the operator have timed out because deleting the snap file timed out?
Supplement: I checked the snap folder and found nothing there, so the file should have been deleted normally.
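
For anyone repeating the check: TiKV keeps these snapshot files in the snap subdirectory of its data directory, so a listing like the following (the path is a placeholder) shows any leftovers:

ls -lh /path/to/tikv-data/snap/

An empty listing is consistent with the snap file having been cleaned up normally.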

| username: realcp1018 | Original post link

Disk utilization and CPU usage hover between 20% and 30%, which is not busy. Network inbound/outbound traffic is also below 20 MB/s.

| username: h5n1 | Original post link

Did you expand the file system or add TiKV nodes, and was that done after discovering the large number of extra peers? Forcibly restarting TiKV only triggers leader election and switching; it should not by itself cause region migration. Regions are only re-replicated to other nodes after a TiKV stays down longer than max-store-down-time. If a large number of regions started migrating right after the upgrade without any other operations, it might be due to changes in the new version’s scheduling algorithm. Check the leader/region balance monitoring and see whether the scores differ significantly across stores.
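
Besides the Grafana PD panels, a quick way to compare the scores is pd-ctl’s store command (the store id is a placeholder):

» store
» store <store_id>

The output includes leader_count/leader_score and region_count/region_size/region_score for every store, which is what to compare when judging whether the new version’s balancing behaves differently.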

| username: h5n1 | Original post link

The image is not visible.

| username: realcp1018 | Original post link

The expansion with new TiKV instances was done after the extra-peer count had increased significantly. Apart from forcibly restarting the instances that were stuck evicting leaders during the upgrade, no other operations were performed.

It might be as you said, since the leader distribution after the migration is not as balanced as it was before.

The noticeable rise in region score in the monitoring started yesterday, just before I left work, when I increased the region-schedule-limit parameter; I made that change to speed things up after seeing the extra-peer count start to climb.

| username: realcp1018 | Original post link

This cannot be adjusted yet in the current version, 4.0.16.

| username: h5n1 | Original post link

I’m afraid all you can do in this situation is wait it out.

| username: h5n1 | Original post link

The image is not visible.

| username: realcp1018 | Original post link

I found the reason for the large-scale generation of extra replicas. When this cluster was migrated from Ansible to TiUP management, TiKV’s sync-log parameter was not carried over into the recorded configuration, so it went unnoticed during the upgrade. It turns out that 55% of the 120-plus instances in the cluster have sync-log=false, presumably set in the past to boost performance.
This is what led to the current large-scale region scheduling.
As for the earlier store score calculation issue, it was resolved after increasing the low/high space ratios.
The main problem now is that almost every type of operator is timing out badly (leader transfer, remove-extra-replica, balance-region, merge-region, and so on). Tracking an actual remove-extra-replica operation, the TiKV log shows that the peer has already been deleted on that store, yet the PD log reports a 10-minute schedule timeout, as shown in the two images below:



Then I checked the PD heartbeat P99, which is 2 seconds, and it seems that the append log entries are blocked.

For now I have turned the region/replica schedule limits down very low, and I am setting reject-leader and restarting the nodes with sync-log=false one by one. But because leader transfer is slow, this is going to take a long time.
Is there any way to intervene in the current across-the-board heartbeat timeouts?
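
For reference, a rough sketch of this per-node loop, here using the evict-leader-scheduler for the reject-leader part (store id, cluster name, and TiKV address are placeholders; label-property reject-leader is an alternative way to apply it):

» scheduler add evict-leader-scheduler <store_id>
» store <store_id>
tiup cluster restart <cluster-name> -N <tikv-host>:20160
» scheduler remove evict-leader-scheduler-<store_id>

That is: evict the leaders from the store, poll store <store_id> until its leader_count drops to 0, restart just that TiKV node, then remove the evict scheduler and move on to the next node.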

| username: h5n1 | Original post link

:+1: How did you discover that the sync-log parameter was affecting it?

| username: realcp1018 | Original post link

At the moment we can only attribute it to this cause; the problem itself has not been resolved yet.
“Forcing a restart of TiKV, which involves Leader election switching, should not cause region migration.”
Your reply gave us a hint :handshake:. After skimming the Raft paper, we figured that as long as sync-log is performed properly, normal replication should be enough to keep the data consistent. So we pulled the diagnostic report from the dashboard and found many warnings flagging that sync-log was not set to true. On closer inspection, many TiKV instances did indeed have it set to false, even though this did not show up in tiup edit-config; it seems the instances scaled out later with tiup got true, while the earlier ones were left at false.
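
For anyone needing to repeat the check, one way to list the effective value per instance (a sketch, assuming the key appears as raftstore.sync-log in SHOW CONFIG; host and credentials are placeholders) is to query it through any TiDB node:

# list the effective sync-log value reported by every TiKV instance
mysql -h <tidb-host> -P 4000 -u root -p \
  -e "SHOW CONFIG WHERE type='tikv' AND name='raftstore.sync-log'"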

| username: h5n1 | Original post link

I forgot from which version this parameter is set to true by default and can no longer be modified. I haven’t tried TiDB 3.x yet.

| username: realcp1018 | Original post link

The issue has been resolved:

  1. Restarting the PD cluster makes the operator timeouts and the heartbeat problem recover for a short while, but they come back after 5-6 hours.
  2. The final fix was setting the PD parameter trace-region-flow=false. This parameter only takes effect after the PD cluster is reloaded, so it is recommended to change it with tiup cluster edit-config and then reload the PD nodes; to be safe, also pre-set it with pd-ctl config set (a sketch of steps 2 and 3 follows this list).
  3. The TiKV sync-log parameter dates back to the early Ansible-managed days and was never unified after the move to TiUP. You need to delete the imported: true item in meta.yaml and, to be safe, remove the ansible-imported-configs and config-cache directories, then reload the cluster so that it starts, and later upgrades, with the unified new-version default parameters.
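
For reference, a rough sketch of steps 2 and 3, assuming a tiup-managed cluster named <cluster-name> and that trace-region-flow sits under PD’s pd-server config section in this version (verify the key location and the paths against your own environment):

# step 2: persist trace-region-flow=false in the topology, then reload only the PD nodes
tiup cluster edit-config <cluster-name>     # under server_configs -> pd, add: pd-server.trace-region-flow: false
tiup cluster reload <cluster-name> -R pd

and, as mentioned above, pre-set it online from pd-ctl as well:

» config set trace-region-flow false

# step 3: clean up the Ansible-imported config state, then reload
# first edit ~/.tiup/storage/cluster/clusters/<cluster-name>/meta.yaml and remove the "imported: true" line
mv ~/.tiup/storage/cluster/clusters/<cluster-name>/ansible-imported-configs /tmp/
mv ~/.tiup/storage/cluster/clusters/<cluster-name>/config-cache /tmp/
tiup cluster reload <cluster-name>

(The two directories are moved aside rather than deleted outright, which keeps a backup in case anything needs to be restored.)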