The removal of a replica using the pd-ctl tool did not take effect

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 通过pd-ctl工具移除一个副本没有生效

| username: terry0219

[TiDB Usage Environment] Test environment
[TiDB Version] 7.5
[Reproduction Path] Followed the official documentation (PD Control User Guide | PingCAP Docs) to remove one replica of a Region, but it did not seem to take effect.
Operation method:
operator add remove-peer 1777 2 // Remove the replica of Region 1777 on store 2
Then, when querying this Region with tiup ctl:v7.5.0 pd region, it still shows 3 replicas. What could be the reason? Did PD automatically replenish it?
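
For reference, the full invocations would look roughly like this (a sketch; the PD endpoint http://127.0.0.1:2379 is an assumption, substitute your own):
tiup ctl:v7.5.0 pd -u http://127.0.0.1:2379 operator add remove-peer 1777 2 // queue an operator to remove the Region 1777 peer on store 2
tiup ctl:v7.5.0 pd -u http://127.0.0.1:2379 region 1777 // inspect the Region afterwards and count the entries under "peers"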

| username: 哈喽沃德 | Original post link

This operation only adds the operator to PD's Operator queue, waiting to be executed; PD will not necessarily remove the replica from the Region immediately. Give it a little more time.
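
To see whether the operator is still queued or has already finished, pd-ctl can list and check operators; a sketch, using the same assumed PD endpoint:
tiup ctl:v7.5.0 pd -u http://127.0.0.1:2379 operator show // list all pending operators
tiup ctl:v7.5.0 pd -u http://127.0.0.1:2379 operator check 1777 // check the status of the operator on Region 1777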

| username: terry0219 | Original post link

It’s been a long time, I operated it yesterday afternoon, and today it’s still showing 3 replicas.

| username: 小龙虾爱大龙虾 | Original post link

You can check the PD and TiKV logs. pd-ctl can only add a scheduling operator; the operator is not guaranteed to be executed.
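
One way to do that is to grep the PD leader's log for the Region ID; a sketch, assuming a default tiup deployment log path (adjust to wherever your pd.log actually lives):
grep "region-id=1777" /tidb-deploy/pd-2379/log/pd.log // shows operators being created, executed, or canceled for this Region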

| username: 哈喽沃德 | Original post link

When the Operator is executed, PD will attempt to remove the replica from the corresponding Store (i.e., store 2) and reduce the number of replicas for the Region by 1. During the next Region scheduling or Leader re-election, this Region will adjust its number of replicas to 2, with one replica being removed.

| username: 哈喽沃德 | Original post link

Isn't it also possible that PD automatically replenished the removed replica?

| username: 小龙虾爱大龙虾 | Original post link

It's been half a day, so the replica must have been replenished by now. Just put the Region ID into the log search on the Dashboard and take a look.

| username: 哈喽沃德 | Original post link

It has probably been replenished by now; it's been a whole night.

| username: terry0219 | Original post link

After performing the operation again, it still shows 3 replicas. The logs clearly show:

[2024/01/24 09:38:37.647 +08:00] [INFO] [audit.go:126] ["audit log"] [service-info="{ServiceLabel:CreateOperator, Method:HTTP/1.1/POST:/pd/api/v1/operators, Component:pdctl, IP:10.0.7.64, Port:65188, StartTime:2024-01-24 09:38:37 +0800 CST, URLParam:{}, BodyParam:{\"name\":\"remove-peer\",\"region_id\":1777,\"store_id\":2}}"]
[2024/01/24 09:38:37.647 +08:00] [INFO] [operator_controller.go:464] ["add operator"] [region-id=1777] [operator="\"admin-remove-peer {rm peer: store [2]} (kind:admin,region, region:1777(116, 17), createAt:2024-01-24 09:38:37.64757322 +0800 CST m=+510409.483971651, startAt:0001-01-01 00:00:00 +0000 UTC, currentStep:0, size:124, steps:[0:{remove peer on store 2}], timeout:[1m14s])\""] [additional-info=]
[2024/01/24 09:38:37.647 +08:00] [INFO] [operator_controller.go:708] ["send schedule command"] [region-id=1777] [step="remove peer on store 2"] [source=create]
[2024/01/24 09:38:37.652 +08:00] [INFO] [region.go:749] ["region ConfVer changed"] [region-id=1777] [detail="Remove peer:{id:83165 store_id:2 }"] [old-confver=17] [new-confver=18]
[2024/01/24 09:38:37.652 +08:00] [INFO] [operator_controller.go:611] ["operator finish"] [region-id=1777] [takes=4.801467ms] [operator="\"admin-remove-peer {rm peer: store [2]} (kind:admin,region, region:1777(116, 17), createAt:2024-01-24 09:38:37.64757322 +0800 CST m=+510409.483971651, startAt:2024-01-24 09:38:37.647637821 +0800 CST m=+510409.484036198, currentStep:1, size:124, steps:[0:{remove peer on store 2}], timeout:[1m14s]) finished\""] [additional-info="{\"cancel-reason\":\"\"}"]
[2024/01/24 09:38:38.187 +08:00] [INFO] [operator_controller.go:464] ["add operator"] [region-id=1777] [operator="\"add-rule-peer {add peer: store [2]} (kind:replica,region, region:1777(116, 18), createAt:2024-01-24 09:38:38.18792383 +0800 CST m=+510410.024322199, startAt:0001-01-01 00:00:00 +0000 UTC, currentStep:0, size:124, steps:[0:{add learner peer 83167 on store 2}, 1:{promote learner peer 83167 on store 2 to voter}], timeout:[13m38s])\""] [additional-info=]
[2024/01/24 09:38:38.188 +08:00] [INFO] [operator_controller.go:708] ["send schedule command"] [region-id=1777] [step="add learner peer 83167 on store 2"] [source=create]
[2024/01/24 09:38:38.192 +08:00] [INFO] [region.go:749] ["region ConfVer changed"] [region-id=1777] [detail="Add peer:{id:83167 store_id:2 role:Learner }"] [old-confver=18] [new-confver=19]
[2024/01/24 09:38:38.192 +08:00] [INFO] [operator_controller.go:708] ["send schedule command"] [region-id=1777] [step="add learner peer 83167 on store 2"] [source=heartbeat]
[2024/01/24 09:38:40.334 +08:00] [INFO] [operator_controller.go:708] ["send schedule command"] [region-id=1777] [step="promote learner peer 83167 on store 2 to voter"] [source=heartbeat]
[2024/01/24 09:38:40.339 +08:00] [INFO] [region.go:749] ["region ConfVer changed"] [region-id=1777] [detail="Remove peer:{id:83167 store_id:2 role:Learner },Add peer:{id:83167 store_id:2 }"] [old-confver=19] [new-confver=20]
[2024/01/24 09:38:40.339 +08:00] [INFO] [operator_controller.go:611] ["operator finish"] [region-id=1777] [takes=2.151147283s] [operator="\"add-rule-peer {add peer: store [2]} (kind:replica,region, region:1777(116, 18), createAt:2024-01-24 09:38:38.18792383 +0800 CST m=+510410.024322199, startAt:2024-01-24 09:38:38.188090591 +0800 CST m=+510410.024488967, currentStep:2, size:124, steps:[0:{add learner peer 83167 on store 2}, 1:{promote learner peer 83167 on store 2 to voter}], timeout:[13m38s]) finished\""] [additional-info="{\"cancel-reason\":\"\"}"]

| username: terry0219 | Original post link

Looking at the logs, it seems PD automatically replenished the replica. If I want this Region to have only 2 replicas, how should I proceed?

| username: 随便改个用户名 | Original post link

Study check-in

| username: terry0219 | Original post link

I understand now: it was automatically replenished by PD. PD has a parameter called patrol-region-interval, which defaults to 10ms, so the replica gets replenished almost immediately. Increasing this value solves the issue.
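
To verify the current value before changing it, the scheduling config can be inspected with pd-ctl; a sketch (the PD endpoint is again an assumption):
tiup ctl:v7.5.0 pd -u http://127.0.0.1:2379 config show | grep patrol-region-interval // shows the current patrol interval, 10ms by default
tiup ctl:v7.5.0 pd -u http://127.0.0.1:2379 config set patrol-region-interval 10m // slow down the checker's patrol, for testing only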

| username: h5n1 | Original post link

It's not about the number of replicas; check whether Region 1777 still has a replica on store 2. The removed replica will automatically be added back on another store.

| username: 小龙虾爱大龙虾 | Original post link

To get 2 replicas, just set the cluster's replica count to 2; otherwise the replica will be replenished again later. However, nobody actually runs 2 replicas, because the Raft consensus algorithm requires a majority of replicas to be available to serve normally; if one of the 2 replicas is lost, the Region cannot serve.
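
If a 2-replica cluster really were the goal (for testing only), the replica count is the max-replicas setting in PD; a sketch of what changing it would look like (same assumed endpoint):
tiup ctl:v7.5.0 pd -u http://127.0.0.1:2379 config set max-replicas 2 // cluster-wide setting; PD then schedules every Region toward 2 replicas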

| username: terry0219 | Original post link

Got it, I was just testing to see.

| username: 小龙虾爱大龙虾 | Original post link

Looking at the logs, another replica was added right back. You can also watch the Region's peer IDs to see whether they change.
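
Besides pd-ctl, the peers can also be read from PD's HTTP API; a sketch (assumed PD endpoint):
curl http://127.0.0.1:2379/pd/api/v1/region/id/1777 // the "peers" array lists each peer's id and store_id; a new peer id means a replica was re-added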

| username: terry0219 | Original post link

After running config set patrol-region-interval 10m to increase the patrol interval, I found that the number of replicas for Region 1777 became 2, which shows the operator did take effect.

| username: terry0219 | Original post link

Yes, it was automatically replenished.

| username: wangccsy | Original post link

Beginner learning.

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.