Are there any risks associated with using --force when scaling in a TiUP cluster?

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tiup cluster scale-in 加 --force 有什么风险吗?

| username: hacker_77powerful

[TiDB Usage Environment] Production Environment
[TiDB Version] 7.2
After scaling the PD node in and then back out, PD on that node still cannot start. I read online that the scale-in needs the --force parameter, but that running it carries safety risks. What are those risks?

tiup cluster scale-in tidb --node 192.168.209.5:22379 --force

| username: hacker_77powerful | Original post link

Supplement: This is a screenshot of the current service status.

| username: Jellybean | Original post link

This is a high-risk operation that will directly and forcibly erase the deployment directory and data of the target node. Proceed with caution.

First, confirm what issues occurred during the operation and address them based on the specific problems.
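Before reaching for --force, the "confirm what happened first" advice above could be sketched as a read-only checklist. This is a minimal sketch, not an official procedure: the function name pre_force_checks is hypothetical, the cluster name tidb and the addresses come from the command earlier in the thread, and the pd-ctl version v7.1.0 is assumed from the release-version shown in the PD log posted later. The function only prints the commands so they can be reviewed before anything is run.

```shell
#!/usr/bin/env bash
# Print read-only inspection commands; nothing here modifies the cluster.
# Arguments: cluster name, node ID to inspect, client URL of a healthy PD,
# and optionally the pd-ctl version (assumed default: v7.1.0).
pre_force_checks() {
  local cluster="$1" node="$2" healthy_pd="$3" version="${4:-v7.1.0}"
  # Overall topology and per-node status as TiUP sees it.
  echo "tiup cluster display ${cluster}"
  # Status of just the problem node.
  echo "tiup cluster display ${cluster} --node ${node}"
  # PD membership and health as reported by a healthy PD (pd-ctl).
  echo "tiup ctl:${version} pd -u ${healthy_pd} member"
  echo "tiup ctl:${version} pd -u ${healthy_pd} health"
}
```

For example, pre_force_checks tidb 192.168.209.5:22379 http://192.168.209.6:22379 prints four commands that can be run one at a time against the cluster in this thread.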

| username: hacker_77powerful | Original post link

The key question is whether adding --force will fix the problem of PD still being down.

| username: TiDBer_0p0BD6le | Original post link

Did the forced scale-in execute successfully? Normally, it would remove this node.

| username: hacker_77powerful | Original post link

I didn't dare to run the scale-in with --force. The scale-in and scale-out without --force both succeeded, but the PD on this node just won't start.

| username: zhang_2023 | Original post link

The risk is quite high.

| username: zhanggame1 | Original post link

If it cannot start, check the logs. You can also look at the service's logs through systemctl.
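A quick way to act on this is to filter the PD log for high-severity entries. The sketch below assumes the bracketed log format that appears in the pd.log excerpt later in the thread ("[timestamp] [LEVEL] [file:line] ..."); the function name pd_log_errors is hypothetical. For the systemd side, TiUP deployments normally name the unit after the component and port, so something like journalctl -u pd-22379.service is a plausible place to look, but treat that unit name as an assumption and confirm it on the host.

```shell
#!/usr/bin/env bash
# Show the most recent FATAL/ERROR entries from a PD log file.
# Matches the level field of lines shaped like:
#   [2024/04/17 16:31:54.657 +08:00] [INFO] [versioninfo.go:89] [...]
pd_log_errors() {
  local log_file="$1"
  grep -E '\] \[(FATAL|ERROR)\] \[' "$log_file" | tail -n 20
}
```

Usage on the node in this thread would be pd_log_errors followed by the path to pd.log in the PD deploy directory's log folder.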

| username: Ming | Original post link

A node in this state can indeed only be taken down with the --force option. (The forced scale-in targets a PD, the leader has already switched, and the cluster is currently in a normal state, so I think the impact is not significant.) Alternatively, first find out why the PD is down and whether it can be restored to the Up state. You can start it manually from the directory of the downed PD and see what error it reports.

| username: Ming | Original post link

However, I recommend first scaling out to add a new PD node, and only then doing the forced scale-in.
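The reasoning behind "add a PD first" can be illustrated with majority arithmetic. PD embeds etcd (the log further down shows it starting an etcd server), and a Raft group of n voting members needs floor(n/2) + 1 of them healthy to keep working. The sketch below is only that arithmetic; the function names are hypothetical and it does not talk to a cluster.

```shell
#!/usr/bin/env bash
# Majority needed by a Raft/etcd group of n voting members: floor(n/2) + 1.
quorum() {
  echo $(( $1 / 2 + 1 ))
}

# has_quorum MEMBERS HEALTHY: succeeds when HEALTHY members form a majority.
has_quorum() {
  [ "$2" -ge "$(quorum "$1")" ]
}

# spare_failures MEMBERS HEALTHY: how many more healthy members may fail
# before the majority is lost (negative means it is already lost).
spare_failures() {
  echo $(( $2 - $(quorum "$1") ))
}
```

With the three PD members in this thread and one of them down, spare_failures 3 2 gives 0. Forcing the broken member out directly leaves two members (spare_failures 2 2 is still 0), whereas adding a healthy PD first and then removing the broken one ends at three healthy members (spare_failures 3 3 is 1), so the cluster can again survive one PD failure.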

| username: Jellybean | Original post link

Have you checked the PD logs? Go and confirm.

| username: hacker_77powerful | Original post link

[root@host-192-168-209-5 log]# tail -n 100 pd.log
[2024/04/17 16:31:54.657 +08:00] [INFO] [versioninfo.go:89] [“Welcome to Placement Driver (PD)”]
[2024/04/17 16:31:54.657 +08:00] [INFO] [versioninfo.go:90] [PD] [release-version=v7.1.0]
[2024/04/17 16:31:54.657 +08:00] [INFO] [versioninfo.go:91] [PD] [edition=Community]
[2024/04/17 16:31:54.657 +08:00] [INFO] [versioninfo.go:92] [PD] [git-hash=1ff614d90412396c9ebaad76a30d31e683c34adc]
[2024/04/17 16:31:54.657 +08:00] [INFO] [versioninfo.go:93] [PD] [git-branch=heads/refs/tags/v7.1.0]
[2024/04/17 16:31:54.657 +08:00] [INFO] [versioninfo.go:94] [PD] [utc-build-time=“2023-05-25 02:10:43”]
[2024/04/17 16:31:54.657 +08:00] [INFO] [metricutil.go:86] [“disable Prometheus push client”]
[2024/04/17 16:31:54.657 +08:00] [INFO] [join.go:218] [“failed to open directory, maybe start for the first time”] [error=“open /data/software/tidb-data/tidb/tidb-data/pd-22379/member: no such file or directory”]
[2024/04/17 16:31:54.667 +08:00] [INFO] [server.go:242] [“PD config”] [config=“{"client-urls":"http://0.0.0.0:22379","peer-urls":"http://0.0.0.0:22380","advertise-client-urls":"http://192.168.209.5:22379","advertise-peer-urls":"http://192.168.209.5:22380","name":"pd-1","data-dir":"/data/software/tidb-data/tidb/tidb-data/pd-22379","force-new-cluster":false,"enable-grpc-gateway":true,"initial-cluster":"pd-192.168.209.7-22379=http://192.168.209.7:22380,pd-1=http://192.168.209.5:22380,pd-192.168.209.6-22379=http://192.168.209.6:22380","initial-cluster-state":"existing","initial-cluster-token":"pd-cluster","join":"http://192.168.209.6:22379,http://192.168.209.7:22379","lease":3,"log":{"level":"info","format":"text","disable-timestamp":false,"file":{"filename":"/data/software/tidb-7.1.0/tidb-deploy/pd-22379/log/pd.log","max-size":0,"max-days":0,"max-backups":0},"development":false,"disable-caller":false,"disable-stacktrace":false,"disable-error-verbose":true,"sampling":null,"error-output-path":""},"tso-save-interval":"3s","tso-update-physical-interval":"50ms","enable-local-tso":false,"metric":{"job":"pd-1","address":"","interval":"15s"},"schedule":{"max-snapshot-count":64,"max-pending-peer-count":64,"max-merge-region-size":20,"max-merge-region-keys":0,"split-merge-interval":"1h0m0s","swtich-witness-interval":"1h0m0s","enable-one-way-merge":"false","enable-cross-table-merge":"true","patrol-region-interval":"10ms","max-store-down-time":"30m0s","max-store-preparing-time":"48h0m0s","leader-schedule-limit":4,"leader-schedule-policy":"count","region-schedule-limit":2048,"witness-schedule-limit":4,"replica-schedule-limit":64,"merge-schedule-limit":8,"hot-region-schedule-limit":4,"hot-region-cache-hits-threshold":3,"store-limit":{},"tolerant-size-ratio":0,"low-space-ratio":0.8,"high-space-ratio":0.7,"region-score-formula-version":"v2","scheduler-max-waiting-operator":5,"enable-remove-down-replica":"true","enable-replace-offline-replica":"true","enable-make-up-replica":"true","enable-remove-extra-replica":"true","enable-location-replacement":"true","enable-debug-metrics":"false","enable-joint-consensus":"true","enable-tikv-split-region":"true","schedulers-v2":[{"type":"balance-region","args":null,"disable":false,"args-payload":""},{"type":"balance-leader","args":null,"disable":false,"args-payload":""},{"type":"balance-witness","args":null,"disable":false,"args-payload":""},{"type":"hot-region","args":null,"disable":false,"args-payload":""},{"type":"transfer-witness-leader","args":null,"disable":false,"args-payload":""}],"schedulers-payload":null,"store-limit-mode":"manual","hot-regions-write-interval":"10m0s","hot-regions-reserved-days":7,"enable-diagnostic":"true","enable-witness":"false","slow-store-evicting-affected-store-ratio-threshold":0.3,"store-limit-version":"v1"},"replication":{"max-replicas":3,"location-labels":"","strictly-match-label":"false","enable-placement-rules":"true","enable-placement-rules-cache":"false","isolation-level":""},"pd-server":{"use-region-storage":"true","max-gap-reset-ts":"24h0m0s","key-type":"table","runtime-services":"","metric-storage":"","dashboard-address":"auto","trace-region-flow":"true","flow-round-by-digit":3,"min-resolved-ts-persistence-interval":"1s","server-memory-limit":0,"server-memory-limit-gc-trigger":0.7,"enable-gogc-tuner":"false","gc-tuner-threshold":0.6},"cluster-version":"0.0.0","labels":{},"quota-backend-bytes":"8GiB","auto-compaction-mode":"periodic","auto-compaction-retention-v2":"1h","TickInterval":"500ms","ElectionInterval":"3s","PreVote":true,"max-request-bytes":157286400,"security":{"cacert-path":"","cert-path":"","key-path":"","cert-allowed-cn":null,"SSLCABytes":null,"SSLCertBytes":null,"SSLKEYBytes":null,"redact-info-log":false,"encryption":{"data-encryption-method":"plaintext","data-key-rotation-period":"168h0m0s","master-key":{"type":"plaintext","key-id":"","region":"","endpoint":"","path":""}}},"label-property":null,"WarningMsgs":null,"DisableStrictReconfigCheck":false,"HeartbeatStreamBindInterval":"1m0s","LeaderPriorityCheckInterval":"1m0s","dashboard":{"tidb-cacert-path":"","tidb-cert-path":"","tidb-key-path":"","public-path-prefix":"","internal-proxy":false,"enable-telemetry":false,"enable-experimental":false},"replication-mode":{"replication-mode":"majority","dr-auto-sync":{"label-key":"","primary":"","dr":"","primary-replicas":0,"dr-replicas":0,"wait-store-timeout":"1m0s","pause-region-split":"false"}},"keyspace":{"pre-alloc":null},"controller":{"degraded-mode-wait-duration":"0s","request-unit":{"read-base-cost":0.25,"read-cost-per-byte":0.0000152587890625,"write-base-cost":1,"write-cost-per-byte":0.0009765625,"read-cpu-ms-cost":0.3333333333333333}}}”]
[2024/04/17 16:31:54.673 +08:00] [INFO] [apiutil.go:378] [“register REST path”] [path=/pd/api/v1]
[2024/04/17 16:31:54.673 +08:00] [INFO] [apiutil.go:378] [“register REST path”] [path=/pd/api/v2/]
[2024/04/17 16:31:54.673 +08:00] [INFO] [apiutil.go:378] [“register REST path”] [path=/swagger/]
[2024/04/17 16:31:54.673 +08:00] [INFO] [apiutil.go:378] [“register REST path”] [path=/autoscaling]
[2024/04/17 16:31:54.673 +08:00] [INFO] [distro.go:51] [“using distribution strings”] [strings={}]
[2024/04/17 16:31:54.674 +08:00] [INFO] [apiutil.go:378] [“register REST path”] [path=/dashboard/api/]
[2024/04/17 16:31:54.674 +08:00] [INFO] [apiutil.go:378] [“register REST path”] [path=/dashboard/]
[2024/04/17 16:31:54.674 +08:00] [INFO] [apiutil.go:378] [“register REST path”] [path=/resource-manager/api/v1/]
[2024/04/17 16:31:54.674 +08:00] [INFO] [registry.go:92] [“restful API service registered successfully”] [prefix=pd-1] [service-name=ResourceManager]
[2024/04/17 16:31:54.674 +08:00] [INFO] [registry.go:92] [“restful API service registered successfully”] [prefix=pd-1] [service-name=MetaStorage]
[2024/04/17 16:31:54.675 +08:00] [INFO] [etcd.go:117] [“configuring peer listeners”] [listen-peer-urls=“[http://0.0.0.0:22380]”]
[2024/04/17 16:31:54.675 +08:00] [INFO] [systimemon.go:30] [“start system time monitor”]
[2024/04/17 16:31:54.675 +08:00] [INFO] [etcd.go:127] [“configuring client listeners”] [listen-client-urls=“[http://0.0.0.0:22379]”]
[2024/04/17 16:31:54.675 +08:00] [INFO] [etcd.go:611] [“pprof is enabled”] [path=/debug/pprof]
[2024/04/17 16:31:54.675 +08:00] [INFO] [etcd.go:305] [“starting an etcd server”] [etcd-version=3.4.21] [git-sha=“Not provided (use ./build instead of go build)”] [go-version=go1.20.3] [go-os=linux] [go-arch=amd64] [max-cpu-set=16] [max-cpu-available=16] [member-initialized=false] [name=pd-1] [data-dir=/data/software/tidb-data/tidb/tidb-data/pd-22379] [wal-dir=] [wal-dir-dedicated=] [member-dir=/data/software/tidb-data/tidb/tidb-data/pd-22379/member] [force-new-cluster=false] [heartbeat-interval=500ms] [election-timeout=3s] [initial-election-tick-advance=true] [snapshot-count=100000] [snapshot-catchup-entries=5000] [initial-advertise-peer-urls=“[http://192.168.209.5:22380]”] [listen-peer-urls=“[http://0.0.0.0:22380]”] [advertise-client-urls=“[http://192.168.209.5:22379]”] [listen-client-urls=“[http://0.0.0.0:22379]”] [listen-metrics-urls=“”] [cors=“[]“] [host-whitelist=”[]”] [initial-cluster=“pd-192.168.209.6-22379=http://192.168.209.6:22380,pd-192.168.209.7-22379=http://192.168.209.7:22380,pd-1=http://192.168.209.5:22380”] [initial-cluster-state=existing] [initial-cluster-token=pd-cluster] [quota-backend-bytes=8589934592] [max-request-bytes=157286400] [max-concurrent-streams=4294967295] [pre-vote=true] [initial-corrupt-check=false] [corrupt-check-time-interval=0s] [auto-compaction-mode=periodic] [auto-compaction-retention=1h0m0s] [auto-compaction-interval=1h0m0s] [discovery-url=] [discovery-proxy=]
[2024/04/17 16:31:54.675 +08:00] [WARN] [server.go:297] [“exceeded recommended request limit”] [max-request-bytes=157286400] [max-request-size=“157 MB”] [recommended-request-bytes=10485760] [recommended-request-size=“10 MB”]
[2024/04/17 16:31:54.680 +08:00] [INFO] [backend.go:80] [“opened backend db”] [path=/data/software/tidb-data/tidb/tidb-data/pd-22379/member/snap/db] [took=5.299165ms]
[2024/04/17 16:31:54.691 +08:00] [INFO] [raft.go:536] [“starting local member”] [local-member-id=454254c164d8c6cf] [cluster-id=2c0580342200cbf5]
[2024/04/17 16:31:54.691 +08:00] [INFO] [raft.go:1523] [“454254c164d8c6cf switched to configuration voters=()”]
[2024/04/17 16:31:54.691 +08:00] [INFO] [raft.go:706] [“454254c164d8c6cf became follower at term 0”]
[2024/04/17 16:31:54.691 +08:00] [INFO] [raft.go:389] [“newRaft 454254c164d8c6cf [peers: , term: 0, commit: 0, applied: 0, lastindex: 0, lastterm: 0]”]
[2024/04/17 16:31:54.696 +08:00] [WARN] [store.go:1379] [“simple token is not cryptographically signed”]
[2024/04/17 16:31:54.702 +08:00] [INFO] [quota.go:126] [“enabled backend quota”] [quota-name=v3-applier] [quota-size-bytes=8589934592] [quota-size=“8.6 GB”]
[2024/04/17 16:31:54.705 +08:00] [INFO] [pipeline.go:71] [“started HTTP pipelining with remote peer”] [local-member-id=454254c164d8c6cf] [remote-peer-id=359d5f2f171f90a4]
[2024/04/17 16:31:54.705 +08:00] [INFO] [transport.go:294] [“added new remote peer”] [local-member-id=454254c164d8c6cf] [remote-peer-id=359d5f2f171f90a4] [remote-peer-urls=“[http://192.168.209.7:22380]”]
[2024/04/17 16:31:54.705 +08:00] [INFO] [pipeline.go:71] [“started HTTP pipelining with remote peer”] [local-member-id=454254c164d8c6cf] [remote-peer-id=6abe5923309025a4]
[2024/04/17 16:31:54.705 +08:00] [INFO] [transport.go:294] [“added new remote peer”] [local-member-id=454254c164d8c6cf] [remote-peer-id=6abe5923309025a4] [remote-peer-urls=“[http://192.168.209.6:22380]”]
[2024/04/17 16:31:54.705 +08:00] [INFO] [peer.go:128] [“starting remote peer”] [remote-peer-id=359d5f2f171f90a4]
[2024/04/17 16:31:54.705 +08:00] [INFO] [pipeline.go:71] [“started HTTP pipelining with remote peer”] [local-member-id=454254c164d8c6cf] [remote-peer-id=359d5f2f171f90a4]
[2024/04/17 16:31:54.705 +08:00] [INFO] [stream.go:166] [“started stream writer with remote peer”] [local-member-id=454254c164d8c6cf] [remote-peer-id=359d5f2f171f90a4]
[2024/04/17 16:31:54.706 +08:00] [INFO] [peer.go:134] [“started remote peer”] [remote-peer-id=359d5f2f171f90a4]
[2024/04/17 16:31:54.706 +08:00] [INFO] [transport.go:327] [“added remote peer”] [local-member-id=454254c164d8c6cf] [remote-peer-id=359d5f2f171f90a4] [

| username: Kongdom | Original post link

How about referring to this?

| username: Jellybean | Original post link

The cause of the error is here: a FATAL log entry appeared.

Search for the error keywords, and you can see similar posts that might be helpful:

Are all the nodes in the cluster currently normal? How many PD nodes are there, and are all of them normal except for this one?

| username: hacker_77powerful | Original post link

If I want to cleanly remove this PD, do I have to add --force when scaling in?

| username: Kongdom | Original post link

Scaling in without --force also cleans the node up completely. --force is only meant for cases where a node is unreachable and cannot be scaled in normally.
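The rule described above could be captured in a small helper that builds the scale-in command and appends --force only for an unreachable node. This is a sketch of the rule of thumb from this thread, not TiUP behavior: the function name scale_in_command and its yes/no reachability argument are hypothetical, and the helper only prints the command so it can be reviewed before being run by hand.

```shell
#!/usr/bin/env bash
# Print the scale-in command for a node. --force is appended only when the
# caller says the node's host is NOT reachable (third argument != "yes").
scale_in_command() {
  local cluster="$1" node="$2" host_reachable="$3"
  local cmd="tiup cluster scale-in ${cluster} --node ${node}"
  if [ "$host_reachable" != "yes" ]; then
    # Forced removal skips the graceful path and wipes the node's
    # deploy and data directories, so keep it as the last resort.
    cmd="${cmd} --force"
  fi
  echo "$cmd"
}
```

For the node in this thread, whose host is still reachable, scale_in_command tidb 192.168.209.5:22379 yes prints the plain scale-in command without --force.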

| username: zhaokede | Original post link

Still waiting for the scale-in to execute successfully.

| username: hacker_77powerful | Original post link

The service status of other nodes in the cluster is normal.

| username: zhanggame1 | Original post link

Unless the node's machine is broken, --force is generally not needed.