Failed to add operator, maybe already have one

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: failed to add operator, maybe already have one

| username: beacoolkid

[TiDB Usage Environment] Production Environment / Testing / PoC
[TiDB Version] v4.0.7
[Reproduction Path] pd-ctl operator add transfer-peer
[Encountered Issue: Issue Phenomenon and Impact]
[Resource Configuration]
[Attachment: Screenshot/Log/Monitoring]

Executing pd-ctl operator add transfer-peer reports an error: Failed! [500] “failed to add operator, maybe already have one”

All other scheduling that adds operators has been stopped.

Migration is very slow: I found that the command can only be executed successfully about once every 4 seconds.

Is there any way to speed up peer migration?

Alternatively, is there a command that forcibly migrates all peers on one store ID to another store, so that regions do not have to be migrated one at a time?

| username: 大鱼海棠 | Original post link

In a production environment, it is recommended to proceed slowly. In a testing environment, you can adjust as needed.

| username: 考试没答案 | Original post link

Are the CPU and memory usage high?

| username: Billmay表妹 | Original post link

You can accelerate peer migration through the following methods:

  1. Adjust the raftstore.apply-pool-size and raftstore.store-pool-size parameters of the TiKV cluster to increase the size of the apply and store thread pools, thereby improving the concurrency of apply and store.
  2. Adjust the raftstore.apply-pool-queue-capacity and raftstore.store-pool-queue-capacity parameters of the TiKV cluster to increase the queue capacity of the apply and store thread pools, thereby reducing the blocking wait time of the apply and store thread pools.
  3. Adjust the raftstore.store-max-batch-size parameter of the TiKV cluster to increase the batch write size of each store, thereby reducing the number of writes and improving write efficiency.
  4. Adjust the raftstore.apply-max-batch-size parameter of the TiKV cluster to increase the batch processing size of each apply, thereby reducing the number of processing times and improving processing efficiency.

Additionally, you can also adjust the raftstore.store-pool-size parameter of the TiKV cluster to increase the size of the store thread pool, thereby improving the concurrency of the store. At the same time, you can also adjust the raftstore.store-pool-queue-capacity parameter of the TiKV cluster to increase the queue capacity of the store thread pool, thereby reducing the blocking wait time of the store thread pool.

If the above methods do not solve the problem, you can try upgrading the TiDB cluster version to obtain better performance and faster migration speed.

| username: beacoolkid | Original post link

The machines have plenty of resources, and their utilization is low.

| username: beacoolkid | Original post link

Everything has been set to 0; only admin-move-peer is left, and it is still very slow.

| username: 大鱼海棠 | Original post link

If this is a testing environment, you should raise the scheduling settings shown above. With them set to 0, it will definitely be very slow. Alternatively, you can just stop scheduling.

| username: beacoolkid | Original post link

Increasing it results in Failed! [500] “failed to add operator, maybe already have one”. A large number of balance and merge operators then take up the capacity for adding operators. I have set it to 4096, but it still reports the error.

| username: 大鱼海棠 | Original post link

Run operator show and take a look.

| username: beacoolkid | Original post link

| username: 大鱼海棠 | Original post link

This is config show, right?

| username: beacoolkid | Original post link

Strangely, if the command is executed only once every 4 seconds, it does not report “Failed! [500] ‘failed to add operator, maybe already have one’”.
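The pacing described here could be scripted roughly as follows. This is a minimal sketch, not a confirmed fix: it assumes the `operator add transfer-peer <region_id> <from_store_id> <to_store_id>` argument order, treats any output containing `Failed` as a rejection, and retries only on the "maybe already have one" message. The helper names (`submit_paced`, `pd_ctl_runner`), the PD address, and the retry count are hypothetical; the 4-second default comes from the observation in this thread.

```python
import subprocess
import time
from typing import Callable, Iterable, List, Tuple

# Error text reported in this thread when PD rejects a new operator.
ALREADY_HAS_OPERATOR = "maybe already have one"


def transfer_peer_command(region_id: int, from_store: int, to_store: int) -> str:
    """Format the pd-ctl command that moves one peer of a region between stores."""
    return f"operator add transfer-peer {region_id} {from_store} {to_store}"


def pd_ctl_runner(pd_url: str, binary: str = "pd-ctl") -> Callable[[str], str]:
    """Build a runner that executes one pd-ctl command (assumes pd-ctl is on PATH)."""
    def run(command: str) -> str:
        proc = subprocess.run(
            [binary, "-u", pd_url, *command.split()],
            capture_output=True,
            text=True,
        )
        return proc.stdout + proc.stderr
    return run


def submit_paced(
    region_ids: Iterable[int],
    from_store: int,
    to_store: int,
    run: Callable[[str], str],
    interval_s: float = 4.0,   # pause observed to avoid the error in this thread
    max_retries: int = 3,      # hypothetical retry budget per region
    sleep: Callable[[float], None] = time.sleep,
) -> List[Tuple[int, bool]]:
    """Submit one transfer-peer operator per region, pausing after every attempt.

    Returns (region_id, accepted) pairs in submission order.
    """
    results = []
    for region_id in region_ids:
        command = transfer_peer_command(region_id, from_store, to_store)
        accepted = False
        for _ in range(max_retries + 1):
            output = run(command)
            sleep(interval_s)
            if "Failed" not in output:
                accepted = True
                break
            if ALREADY_HAS_OPERATOR not in output:
                break  # a different error: retrying is unlikely to help
        results.append((region_id, accepted))
    return results
```

With a real cluster, the call might look like `submit_paced(region_ids, 6, 137863286, pd_ctl_runner("http://127.0.0.1:2379"))`, where the PD address is a placeholder. Passing `run` and `sleep` as parameters keeps the loop testable without a cluster.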

| username: 大鱼海棠 | Original post link

store limit 6 200 remove-peer
store limit 137863286 200 add-peer
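The two commands above follow the pd-ctl form `store limit <store_id> <rate> <type>`. As a rough sketch, they could be generated for any pair of stores as shown below. Reading store 6 as the store being drained and store 137863286 as the destination is an assumption, not something the thread states, and the function names are hypothetical.

```python
from typing import List

# Limit types that appear in the commands quoted in this thread.
VALID_LIMIT_TYPES = ("add-peer", "remove-peer")


def store_limit_command(store_id: int, rate: float, limit_type: str) -> str:
    """Format one pd-ctl `store limit` command after basic validation."""
    if limit_type not in VALID_LIMIT_TYPES:
        raise ValueError(f"unsupported limit type: {limit_type!r}")
    if rate <= 0:
        raise ValueError("rate must be positive")
    return f"store limit {store_id} {rate:g} {limit_type}"


def migration_limit_commands(
    source_store: int, target_store: int, rate: float
) -> List[str]:
    """Raise remove-peer on the source store and add-peer on the target store."""
    return [
        store_limit_command(source_store, rate, "remove-peer"),
        store_limit_command(target_store, rate, "add-peer"),
    ]


if __name__ == "__main__":
    for line in migration_limit_commands(6, 137863286, 200):
        print(line)
```

The generated lines can be pasted into an interactive pd-ctl session. How the rate is interpreted depends on the PD version, so it is worth checking the pd-ctl documentation for the version in use before raising it in production.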

| username: 大鱼海棠 | Original post link

Isn’t this scheduling? Just increase the scheduling speed.