TiDB Schedule Operator Timeout

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TiDB schedule operator timeout

| username: ojsl

[TiDB Usage Environment] Production Environment
[TiDB Version] 5.4.1
[Encountered Issue] Most of the operators created by the scheduler (merge-region, replace-rule-offline-peer) time out. Some nodes in the cluster are being taken offline, but the region count on one of the nodes has not decreased.
[Reproduction Path]
[Issue Phenomenon and Impact] Scheduling for the offline node cannot be completed. I want to know which parameter controls the operator timeout, so that I can adjust it and finish scheduling this node off.

[Attachment]
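For context, the timed-out operators and the current scheduling configuration can be inspected with pd-ctl; a minimal sketch, assuming pd-ctl is run through tiup (the PD address is a placeholder):

```
# List the operators PD is currently running, with their status
tiup ctl:v5.4.1 pd -u http://<pd-address>:2379 operator show

# Show the current scheduling configuration (schedule limits, merge settings, etc.)
tiup ctl:v5.4.1 pd -u http://<pd-address>:2379 config show
```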

| username: xfworld | Original post link

Are the TiKV nodes too busy, causing them to be unresponsive?

| username: ojsl | Original post link

[The original reply contained only an image, which is not included in the translation.]

| username: ojsl | Original post link

These metrics are stable most of the time, and TiKV still has some spare CPU.

| username: xfworld | Original post link

Not only the CPU, but also the disk, memory, and network need to be considered.

| username: ojsl | Original post link

TiKV memory usage is relatively stable. The region count of the TiKV instance on this node has not decreased.

| username: xfworld | Original post link

There are many reasons why the offline status remains unchanged, and it is quite complex.

For example:

  1. The new node is relatively busy and unable to respond, causing region migration scheduling issues.
  2. Distribution issues of region leaders between new and old nodes.
  3. PD’s scheduling strategy is not aggressive enough.

You can refer to these parameters for adjustments (a rough pd-ctl sketch is included at the end of this reply):

This requires extra attention:
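For illustration only (example values, not recommendations from this thread), the kind of pd-ctl adjustments being referred to might look like this:

```
# Raise the concurrency of the main scheduling limits (example values)
config set leader-schedule-limit 8        # concurrent leader-transfer operators
config set region-schedule-limit 2048     # concurrent region-scheduling operators
config set replica-schedule-limit 64      # concurrent replica (add/remove peer) operators
config set max-pending-peer-count 64      # pending peers allowed per store

# Raise the per-store scheduling rate
store limit all 30
```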

| username: ojsl | Original post link

We have tried the configuration in that document, but the current issue is that most operators (merge-region, replace-rule-offline-peer) time out no matter how high the limits are raised.

| username: xfworld | Original post link

Then check whether the offline nodes still hold any region leaders. If not, you can try forcing them offline.
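A quick way to check is the leader_count of the offline store in pd-ctl; a sketch, with the store ID and PD address as placeholders:

```
# leader_count and region_count of the store appear in its "status" section
tiup ctl:v5.4.1 pd -u http://<pd-address>:2379 store <store_id>
```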

| username: ojsl | Original post link

Another offline instance still has a fairly large number of leaders, while the instance with more regions no longer has any leaders.

| username: ojsl | Original post link

When this instance first went offline, its leader count barely changed until we restarted it; after that, the leader count started to decrease. In addition, a few other instances have a small number of leaders that are no longer changing.

| username: ojsl | Original post link

Does “forced offline” refer to --force?

| username: xfworld | Original post link

Yes, it’s “force.”

tiup cluster scale-in --force forcibly scales the node in (not recommended except in special circumstances).
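For reference, the full command is along these lines (cluster name and node address are placeholders):

```
# Remove the node from the topology without waiting for region migration to finish;
# only use this when the remaining replicas of its regions are known to be healthy.
tiup cluster scale-in <cluster-name> --node <tikv-ip>:20160 --force
```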

| username: ojsl | Original post link

However, there are still many empty regions in the cluster that cannot be merged, so forcing this node offline will not solve my problem.
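For reference, empty-region merging is governed by PD's merge settings; a hedged sketch of loosening them in pd-ctl (values are examples only):

```
config set max-merge-region-size 20       # regions smaller than this (MiB) are merge candidates
config set max-merge-region-keys 200000   # regions with fewer keys than this are merge candidates
config set merge-schedule-limit 8         # concurrent merge operators
config set enable-cross-table-merge true  # allow merging regions that span different tables
```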

| username: xfworld | Original post link

What’s going on…

| username: ojsl | Original post link

We replaced the TiKV nodes by scaling out first and then scaling in. To speed up scheduling, we increased some of the limit parameters; the values shown are from after the limits were raised.

| username: ojsl | Original post link

offline-peer refers to a region peer located on a store that is being taken offline.

| username: Lucien-卢西恩 | Original post link

Hello~ This might require checking the placement rule and label configuration, to see whether an incorrect label configuration is preventing regions from being scheduled across the different domains at the same label level. For example, with a Rack-level configuration of Rack1, Rack2, and Rack3: if the offline TiKV is labeled Rack2 and there is no other TiKV instance in Rack2 that can hold its regions, then the regions on the offline TiKV cannot be scheduled away.
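A rough sketch of the checks being suggested, with the PD address as a placeholder:

```
# Replica count and location-labels (e.g. rack/host) used for placement
tiup ctl:v5.4.1 pd -u http://<pd-address>:2379 config show replication

# Each store's labels are listed in its output
tiup ctl:v5.4.1 pd -u http://<pd-address>:2379 store

# Placement rules, if they are enabled
tiup ctl:v5.4.1 pd -u http://<pd-address>:2379 config placement-rules show
```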

| username: ojsl | Original post link

This has been checked; the current configuration puts everything under a single rack.

| username: ojsl | Original post link

With the help of @jingyu ma, the store limit has been adjusted, and the remove-orphan-peer operators are now reasonably fast. However, replace-rule-offline-peer is only slightly faster than before and is still very slow; in particular, the region count on some stores is dropping especially slowly.
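For context, the per-store limits mentioned here can be viewed and tuned per operation type in pd-ctl; an illustrative sketch (the rate and store ID are placeholders):

```
store limit                              # show the current add-peer/remove-peer limits per store
store limit all 60 add-peer              # example: raise the add-peer rate on all stores
store limit <store_id> 60 remove-peer    # example: raise the remove-peer rate on one store
```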