After scaling TiDB in and out, the PD node remains in a down state

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tidb做了扩缩容之后,依然为down的状态

| username: hacker_77powerful

【TiDB Usage Environment】Production Environment
【TiDB Version】7.1
【Reproduction Path】Operations performed that led to the issue
【Encountered Issue: Symptoms and Impact】One PD node failed to start because its file system was full. I followed the scale-in and scale-out procedure, but the service still would not start.
【Resource Configuration】Navigate to TiDB Dashboard - Cluster Info - Hosts and take a screenshot of this page
【Attachments: Screenshots/Logs/Monitoring】

deploy_dir: "/data/software/tidb-7.1.0/tidb-deploy/pd-22379"
data_dir: "/data/software/tidb-data/tidb/tidb-data/pd-22379"
log_dir: "/data/software/tidb-7.1.0/tidb-deploy/pd-22379/log"
tiup cluster scale-out tidb scale-out.yml -p -i /home/root/.ssh/gcp_rsa
[FATAL] [main.go:232] ["run server failed"] [error="[PD:server:ErrCancelStartEtcd]etcd start canceled"] [stack="main.start\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/pd/cmd/pd-server/main.go:232\nmain.createServerWrapper\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/pd/cmd/pd-server/main.go:147\ngithub.com/spf13/cobra.(*Command).execute\n\t/go/pkg/mod/github.com/spf13/cobra@v1.0.0/command.go:846\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\t/go/pkg/mod/github.com/spf13/cobra@v1.0.0/command.go:950\ngithub.com/spf13/cobra.(*Command).Execute\n\t/go/pkg/mod/github.com/spf13/cobra@v1.0.0/command.go:887\nmain.main\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/pd/cmd/pd-server/main.go:56\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:250"]

[INFO] [etcd.go:305] ["starting an etcd server"] [etcd-version=3.4.21] [git-sha="Not provided (use ./build instead of go build)"] [go-version=go1.20.3] [go-os=linux] [go-arch=amd64] [max-cpu-set=16] [max-cpu-available=16] [member-initialized=true] [name=pd-1] [data-dir=/data/software/tidb-data/tidb/tidb-data/pd-22379] [wal-dir=] [wal-dir-dedicated=] [member-dir=/data/software/tidb-data/tidb/tidb-data/pd-22379/member] [force-new-cluster=false] [heartbeat-interval=500ms] [election-timeout=3s] [initial-election-tick-advance=true] [snapshot-count=100000] [snapshot-catchup-entries=5000] [initial-advertise-peer-urls="[http://192.168.209.5:22380]"] [listen-peer-urls="[http://0.0.0.0:22380]"] [advertise-client-urls="[http://192.168.209.5:22379]"] [listen-client-urls="[http://0.0.0.0:22379]"] [listen-metrics-urls=""] [cors="[]"] [host-whitelist="[]"] [initial-cluster=] [initial-cluster-state=new] [initial-cluster-token=] [quota-backend-bytes=8589934592] [max-request-bytes=157286400] [max-concurrent-streams=4294967295] [pre-vote=true] [initial-corrupt-check=false] [corrupt-check-time-interval=0s] [auto-compaction-mode=periodic] [auto-compaction-retention=1h0m0s] [auto-compaction-interval=1h0m0s] [discovery-url=] [discovery-proxy=]
[2024/04/11 17:39:02.624 +08:00] [WARN] [server.go:297] ["exceeded recommended request limit"] [max-request-bytes=157286400] [max-request-size="157 MB"] [recommended-request-bytes=10485760] [recommended-request-size="10 MB"]
[2024/04/11 17:39:02.624 +08:00] [INFO] [backend.go:80] ["opened backend db"] [path=/data/software/tidb-data/tidb/tidb-data/pd-22379/member/snap/db] [took=173.215µs]
[2024/04/11 17:39:02.624 +08:00] [INFO] [raft.go:586] ["restarting local member"] [cluster-id=2c0580342200cbf5] [local-member-id=b43ecfd4b44129fc] [commit-index=0]
[2024/04/11 17:39:02.624 +08:00] [INFO] [raft.go:1523] ["b43ecfd4b44129fc switched to configuration voters=()"]
[2024/04/11 17:39:02.624 +08:00] [INFO] [raft.go:706] ["b43ecfd4b44129fc became follower at term 2"]
[2024/04/11 17:39:02.624 +08:00] [INFO] [raft.go:389] ["newRaft b43ecfd4b44129fc [peers: , term: 2, commit: 0, applied: 0, lastindex: 0, lastterm: 0]"]
[2024/04/11 17:39:02.627 +08:00] [WARN] [store.go:1379] ["simple token is not cryptographically signed"]
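
Because the original failure came from a full file system, it may help to confirm free space under the PD data and deploy paths before another start attempt. The following is only a minimal sketch and not part of the original post: the function names `fs_usage_percent` and `check_fs` and the 90% default threshold are assumptions chosen for illustration.

```shell
#!/usr/bin/env bash
# Print the usage percentage (0-100) of the file system that holds a path.
fs_usage_percent() {
  # `df -P` prints one POSIX-format line per file system; column 5 is "Capacity", e.g. "42%".
  df -P "$1" 2>/dev/null | awk 'NR==2 { gsub("%", "", $5); print $5 }'
}

# Report whether the file system behind a path is fuller than a threshold.
# The default of 90 is an arbitrary illustrative value, not a TiDB recommendation.
check_fs() {
  local path="$1" limit="${2:-90}" used
  used="$(fs_usage_percent "$path")"
  if [ -z "$used" ]; then
    echo "ERROR: cannot read usage for $path"
    return 2
  fi
  if [ "$used" -ge "$limit" ]; then
    echo "WARN: $path is on a file system that is ${used}% full"
    return 1
  fi
  echo "OK: $path is on a file system that is ${used}% full"
}

# Example calls, using the directories quoted in the post:
# check_fs /data/software/tidb-data/tidb/tidb-data/pd-22379
# check_fs /data/software/tidb-7.1.0/tidb-deploy/pd-22379
```

Running both checks on the PD host before `tiup cluster scale-out` would show whether the disk has actually been freed.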

| username: xfworld | Original post link

How many PDs are configured in your cluster?

| username: Billmay表妹 | Original post link

[Resource Allocation] Go to TiDB Dashboard - Cluster Info - Hosts and take a screenshot of this page
Take a screenshot and have a look~

| username: zhaokede | Original post link

Shouldn’t the problematic PD be scaled in?

| username: QH琉璃 | Original post link

Is the scale-in already completed?

| username: Kamner | Original post link

Take a look at the cluster topology and the operations you have performed.

| username: shigp_TIDBER | Original post link

Is it really that bad? Are there any other operations involved?

| username: hacker_77powerful | Original post link

Followed the steps for scaling out, but the PD service is still down.

| username: hacker_77powerful | Original post link

Error log as follows:

[root@host-192-168-209-5 log]# tail -n 100 pd.log 
[2024/04/17 16:31:54.657 +08:00] [INFO] [versioninfo.go:89] ["Welcome to Placement Driver (PD)"]
[2024/04/17 16:31:54.657 +08:00] [INFO] [versioninfo.go:90] [PD] [release-version=v7.1.0]
[2024/04/17 16:31:54.657 +08:00] [INFO] [versioninfo.go:91] [PD] [edition=Community]
[2024/04/17 16:31:54.657 +08:00] [INFO] [versioninfo.go:92] [PD] [git-hash=1ff614d90412396c9ebaad76a30d31e683c34adc]
[2024/04/17 16:31:54.657 +08:00] [INFO] [versioninfo.go:93] [PD] [git-branch=heads/refs/tags/v7.1.0]
[2024/04/17 16:31:54.657 +08:00] [INFO] [versioninfo.go:94] [PD] [utc-build-time="2023-05-25 02:10:43"]
[2024/04/17 16:31:54.657 +08:00] [INFO] [metricutil.go:86] ["disable Prometheus push client"]
[2024/04/17 16:31:54.657 +08:00] [INFO] [join.go:218] ["failed to open directory, maybe start for the first time"] [error="open /data/software/tidb-data/tidb/tidb-data/pd-22379/member: no such file or directory"]
[2024/04/17 16:31:54.667 +08:00] [INFO] [server.go:242] ["PD config"] [config="{\"client-urls\":\"http://0.0.0.0:22379\",\"peer-urls\":\"http://0.0.0.0:22380\",\"advertise-client-urls\":\"http://192.168.209.5:22379\",\"advertise-peer-urls\":\"http://192.168.209.5:22380\",\"name\":\"pd-1\",\"data-dir\":\"/data/software/tidb-data/tidb/tidb-data/pd-22379\",\"force-new-cluster\":false,\"enable-grpc-gateway\":true,\"initial-cluster\":\"pd-192.168.209.7-22379=http://192.168.209.7:22380,pd-1=http://192.168.209.5:22380,pd-192.168.209.6-22379=http://192.168.209.6:22380\",\"initial-cluster-state\":\"existing\",\"initial-cluster-token\":\"pd-cluster\",\"join\":\"http://192.168.209.6:22379,http://192.168.209.7:22379\",\"lease\":3,\"log\":{\"level\":\"info\",\"format\":\"text\",\"disable-timestamp\":false,\"file\":{\"filename\":\"/data/software/tidb-7.1.0/tidb-deploy/pd-22379/log/pd.log\",\"max-size\":0,\"max-days\":0,\"max-backups\":0},\"development\":false,\"disable-caller\":false,\"disable-stacktrace\":false,\"disable-error-verbose\":true,\"sampling\":null,\"error-output-path\":\"\"},\"tso-save-interval\":\"3s\",\"tso-update-physical-interval\":\"50ms\",\"enable-local-tso\":false,\"metric\":{\"job\":\"pd-1\",\"address\":\"\",\"interval\":\"15s\"},\"schedule\":{\"max-snapshot-count\":64,\"max-pending-peer-count\":64,\"max-merge-region-size\":20,\"max-merge-region-keys\":0,\"split-merge-interval\":\"1h0m0s\",\"swtich-witness-interval\":\"1h0m0s\",\"enable-one-way-merge\":\"false\",\"enable-cross-table-merge\":\"true\",\"patrol-region-interval\":\"10ms\",\"max-store-down-time\":\"30m0s\",\"max-store-preparing-time\":\"48h0m0s\",\"leader-schedule-limit\":4,\"leader-schedule-policy\":\"count\",\"region-schedule-limit\":2048,\"witness-schedule-limit\":4,\"replica-schedule-limit\":64,\"merge-schedule-limit\":8,\"hot-region-schedule-limit\":4,\"hot-region-cache-hits-threshold\":3,\"store-limit\":{},\"tolerant-size-ratio\":0,\"low-space-ratio\":0.8,\"high-space-ratio\":0.7,\"region-score-formula-version\":\"v2\",\"scheduler-max-waiting-operator\":5,\"enable-remove-down-replica\":\"true\",\"enable-replace-offline-replica\":\"true\",\"enable-make-up-replica\":\"true\",\"enable-remove-extra-replica\":\"true\",\"enable-location-replacement\":\"true\",\"enable-debug-metrics\":\"false\",\"enable-joint-consensus\":\"true\",\"enable-tikv-split-region\":\"true\",\"schedulers-v2\":[{\"type\":\"balance-region\",\"args\":null,\"disable\":false,\"args-payload\":\"\"},{\"type\":\"balance-leader\",\"args\":null,\"disable\":false,\"args-payload\":\"\"},{\"type\":\"balance-witness\",\"args\":null,\"disable\":false,\"args-payload\":\"\"},{\"type\":\"hot-region\",\"args\":null,\"disable\":false,\"args-payload\":\"\"},{\"type\":\"transfer-witness-leader\",\"args\":null,\"disable\":false,\"args-payload\":\"\"}],\"schedulers-payload\":null,\"store-limit-mode\":\"manual\",\"hot-regions-write-interval\":\"10m0s\",\"hot-regions-reserved-days\":7,\"enable-diagnostic\":\"true\",\"enable-witness\":\"false\",\"slow-store-evicting-affected-store-ratio-threshold\":0.3,\"store-limit-version\":\"v1\"},\"replication\":{\"max-replicas\":3,\"location-labels\":\"\",\"strictly-match-label\":\"false\",\"enable-placement-rules\":\"true\",\"enable-placement-rules-cache\":\"false\",\"isolation-level\":\"\"},\"pd-server\":{\"use-region-storage\":\"true\",\"max-gap-reset-ts\":\"24h0m0s\",\"key-type\":\"table\",\"runtime-services\":\"\",\"metric-storage\":\"\",\"dashboard-address\":\"auto\",\"trace-region-flow\":\"true\",\"flow-round-by-digit\":3,\"min-resolved-ts-persistence-interval\":\"1s\",\"server-memory-limit\":0,\"server-memory-limit-gc-trigger\":0.7,\"enable-gogc-tuner\":\"false\",\"gc-tuner-threshold\":0.6},\"cluster-version\":\"0.0.0\",\"labels\":{},\"quota-backend-bytes\":\"8GiB\",\"auto-compaction-mode\":\"periodic\",\"auto-compaction-retention-v2\":\"1h\",\"TickInterval\":\"500ms\",\"ElectionInterval\":\"3s\",\"PreVote\":true,\"max-request-bytes\":157286400,\"security\":{\"cacert-path\":\"\",\"cert-path\":\"\",\"key-path\":\"\",\"cert-allowed-cn\":null,\"SSLCABytes\":null,\"SSLCertBytes\":null,\"SSLKEYBytes\":null,\"redact-info-log\":false,\"encryption\":{\"data-encryption-method\":\"plaintext\",\"data-key-rotation-period\":\"168h0m0s\",\"master-key\":{\"type\":\"plaintext\",\"key-id\":\"\",\"region\":\"\",\"endpoint\":\"\",\"path\":\"\"}}},\"label-property\":null,\"WarningMsgs\":null,\"DisableStrictReconfigCheck\":false,\"HeartbeatStreamBindInterval\":\"1m0s\",\"LeaderPriorityCheckInterval\":\"1m0s\",\"dashboard\":{\"tidb-cacert-path\":\"\",\"tidb-cert-path\":\"\",\"tidb-key-path\":\"\",\"public-path-prefix\":\"\",\"internal-proxy\":false,\"enable-telemetry\":false,\"enable-experimental\":false},\"replication-mode\":{\"replication-mode\":\"majority\",\"dr-auto-sync\":{\"label-key\":\"\",\"primary\":\"\",\"dr\":\"\",\"primary-replicas\":0,\"dr-replicas\":0,\"wait-store-timeout\":\"1m0s\",\"pause-region-split\":\"false\"}},\"keyspace\":{\"pre-alloc\":null},\"controller\":{\"degraded-mode-wait-duration\":\"0s\",\"request-unit\":{\"read-base-cost\":0.25,\"read-cost-per-byte\":0.0000152587890625,\"write-base-cost\":1,\"write-cost-per-byte\":0.0009765625,\"read-cpu-ms-cost\":0.3333333333333333}}}"]
[2024/04/17 16:31:54.673 +08:00] [INFO] [apiutil.go:378] ["register REST path"] [path=/pd/api/v1]
[2024/04/17 16:31:54.673 +08:00] [INFO] [apiutil.go:378] ["register REST path"] [path=/pd/api/v2/]
[2024/04/17 16:31:54.673 +08:00] [INFO] [apiutil.go:378] ["register REST path"] [path=/swagger/]
[2024/04/17 16:31:54.673 +08:00] [INFO] [apiutil.go:378] ["register REST path"] [path=/autoscaling]
[2024/04/17 16:31:54.673 +08:00] [INFO] [distro.go:51] ["using distribution strings"] [strings={}]
[2024/04/17 16:31:54.674 +08:00] [INFO] [apiutil.go:378] ["register REST path"] [path=/dashboard/api/]
[2024/04/17 16:31:54.674 +08:00] [INFO] [apiutil.go:378] ["register REST path"] [path=/dashboard/]
[2024/04/17 16:31:54.674 +08:00] [INFO] [apiutil.go:378] ["register REST path"] [path=/resource-manager/api/v1/]
[2024/04/17 16:31:54.674 +08:00] [INFO] [registry.go:92] ["restful API service registered successfully"] [prefix=pd-1] [service-name=ResourceManager]
[2024/04/17 16:31:54.674 +08:00] [INFO] [registry.go:92] ["restful API service registered successfully"] [prefix=pd-1] [service-name=MetaStorage]
[2024/04/17 16:31:54.675 +08:00] [INFO] [etcd.go:117] ["configuring peer listeners"] [listen-peer-urls="[http://0.0.0.0:22380]"]
[2024/04/17 16:31:54.675 +08:00] [INFO] [systimemon.go:30] ["start system time monitor"]
[2024/04/17 16:31:54.675 +08:00] [INFO] [etcd.go:127] ["configuring client listeners"] [listen-client-urls="[http://0.0.0.0:22379]"]
[2024/04/17 16:31:54.675 +08:00] [INFO] [etcd.go:611] ["pprof is enabled"] [path=/debug/pprof]
[2024/04/17 16:31:54.675 +08:00] [INFO] [etcd.go:305] ["starting an etcd server"] [etcd-version=3.4.21] [git-sha="Not provided (use ./build instead of go build)"] [go-version=go1.20.3] [go-os=linux] [go-arch=amd64] [max-cpu-set=16] [max-cpu-available=16] [member-initialized=false] [name=pd-1] [data-dir=/data/software/tidb-data/tidb/tidb-data/pd-22379] [wal-dir=] [wal-dir-dedicated=] [member-dir=/data/software/tidb-data/tidb/tidb-data/pd-22379/member] [force-new-cluster=false] [heartbeat-interval=500ms] [election-timeout=3s] [initial-election-tick-advance=true] [snapshot-count=100000] [snapshot-catchup-entries=5000] [initial-advertise-peer-urls="[http://192.168.209.5:22380]"] [listen-peer-urls="[http://0.0.0.0:22380]"] [advertise-client-urls="[http://192.168.209.5:22379]"] [listen-client-urls="[http://0.0.0.0:22379]"] [listen-metrics-urls="[]"] [cors="[*]"] [host-whitelist="[*]"] [initial-cluster="pd-192.168.209.6-22379=http://192.168.209.6:22380,pd-192.168.209.7-22379=http://192.168.209.7:22380,pd-1=http://192.168.209.5:22380"] [initial-cluster-state=existing] [initial-cluster-token=pd-cluster] [quota-backend-bytes=8589934592] [max-request-bytes=157286400] [max-concurrent-streams=4294967295] [pre-vote=true] [initial-corrupt-check=false] [corrupt-check-time-interval=0s] [auto-compaction-mode=periodic] [auto-compaction-retention=1h0m0s] [auto-compaction-interval=1h0m0s] [discovery-url=] [discovery-proxy=]
[2024/04/17 16:31:54.675 +08:00] [WARN] [server.go:297] ["exceeded recommended request limit"] [max-request-bytes=157286400] [max-request-size="157 MB"] [recommended-request-bytes=10485760] [recommended-request-size="10 MB"]
[2024/04/17 16:31:54.680 +08:00] [INFO] [backend.go:80] ["opened backend db"] [path=/data/software/tidb-data/tidb/tidb-data/pd-22379/member/snap/db] [took=5.299165ms]
[2024/04/17 16:31:54.691 +08:00] [INFO] [raft.go:536] ["starting local member"] [local-member-id=454254c164d8c6cf] [cluster-id=2c0580342200cbf5]
[2024/04/17 16:31:54.691 +08:00] [INFO] [raft.go:1523] ["454254c164d8c6cf switched to configuration voters=()"]
[2024/04/17 16:31:54.691 +08:00] [INFO] [raft.go:706] ["454254c164d8c6cf became follower at term 0"]
[2024/04/17 16:31:54.691 +08:00] [INFO] [raft.go:389] ["newRaft 454254c164d8c6cf [peers: [], term: 0, commit: 0, applied: 0, lastindex: 0, lastterm: 0]"]
[2024/04/17 16:31:54.696 +08:00] [WARN] [store.go:1379] ["simple token is not cryptographically signed"]
[2024/04/17 16:31:54.702 +08:00] [INFO] [quota.go:126] ["enabled backend quota"] [quota-name=v3-applier] [quota-size-bytes=8589934592] [quota-size="8.6 GB"]
[2024/04/17 16:31:54.705 +08:00] [INFO] [pipeline.go:71] ["started HTTP pipelining with remote peer"] [local-member-id=454254c164d8c6cf] [remote-peer-id=359d5f2f171f90a4]
[2024/04/17 16:31:54.705 +08:00] [INFO] [transport.go:294] ["added new remote peer"] [local-member-id=454254c164d8c6cf] [remote-peer-id=359d5f2f171f90a4] [remote-peer-urls="[http://192.168.209.7:22380]"]
[2024/04/17 16:31:54.705 +08:00] [INFO] [pipeline.go:71] ["started HTTP pipelining with remote peer"] [local-member-id=454254c164d8c6cf] [remote-peer-id=6abe5923309025a4]
[2024/04/17 16:31:54.705 +08:00] [INFO] [transport.go:294] ["added new remote peer"] [local-member-id=454254c164d8c6cf] [remote-peer-id=6abe5923309025a4] [remote-peer-urls="[http://192.168.209.6:22380]"]
[2024/04/17 16:31:54.705 +08:00] [INFO] [peer.go:128] ["starting remote peer"] [remote-peer-id=359d5f2f171f90a4]
[2024/04/17 16:31:54.705 +08:00] [INFO] [pipeline.go:71] ["started HTTP pipelining with remote peer"] [local-member-id=454254c164d8c6cf] [remote-peer-id=359d5f2f171f90a4]
[2024/04/17 16:31:54.705 +08:00] [INFO] [stream.go:166] ["started stream writer with remote peer"] [local-member-id=454254c164d8c6cf] [remote-peer-id=359d5f2f171f90a4]
[2024/04/17 16:31:54.706 +08:00] [INFO] [peer.go:134] ["started remote peer"] [remote-peer-id=359d5f2f171f90a4]
[2024/04/17 16:31:54.706 +08:00] [INFO] [transport.go:327] ["added remote peer"] [local-member-id=454254c164d8c6cf] [remote-peer-id=359d5f2f

| username: hacker_77powerful | Original post link

I’ll bump this up myself.

| username: TiDBer_JUi6UvZm | Original post link

Are there many such errors? Have you analyzed the cause of this?
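
One quick way to answer "how many" is to count log lines per level. This is a rough sketch that assumes the bracketed `[LEVEL]` layout visible in the pasted `pd.log` lines; the function name `count_log_levels` is made up for this example.

```shell
#!/usr/bin/env bash
# Count FATAL, ERROR and WARN lines in a PD log file.
# Assumes each entry carries its level as a literal "[LEVEL]" field,
# as in the log excerpts quoted earlier in this thread.
count_log_levels() {
  local file="$1" level
  for level in FATAL ERROR WARN; do
    # -F matches the bracketed level as a fixed string; -c prints the match count.
    printf '%s %s\n' "$level" "$(grep -cF "[$level]" "$file")"
  done
}

# Example call, using the log path from the posted PD config:
# count_log_levels /data/software/tidb-7.1.0/tidb-deploy/pd-22379/log/pd.log
```

To read the actual messages, `grep -F '[FATAL]' pd.log | tail -n 20` would show the most recent fatal entries.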

| username: TiDBer_JUi6UvZm | Original post link

Is the firewall on the server where the new PD node is located turned off? Check for network-related issues.
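
A simple reachability probe can rule the network in or out. The sketch below assumes `bash` and the coreutils `timeout` command are available on the host; the helper names `port_open`, `report_port` and `check_pd_ports` are illustrative only. The hosts and ports are the ones that appear in the posted PD config (22379 for clients, 22380 for peers).

```shell
#!/usr/bin/env bash
# Succeed if a TCP connection to host:port can be opened within 3 seconds.
port_open() {
  # bash's /dev/tcp pseudo-device opens a TCP socket; host and port are
  # passed as positional arguments ($0 and $1) to the inner shell.
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# Print one line describing whether host:port accepted a connection.
report_port() {
  if port_open "$1" "$2"; then
    echo "$1:$2 reachable"
  else
    echo "$1:$2 unreachable"
  fi
}

# Probe the client and peer ports of every PD address seen in this thread.
check_pd_ports() {
  local host port
  for host in 192.168.209.5 192.168.209.6 192.168.209.7; do
    for port in 22379 22380; do
      report_port "$host" "$port"
    done
  done
}

# Run from each PD host in turn:
# check_pd_ports
```

An "unreachable" result only shows that no connection was accepted; it does not distinguish a firewall rule from a PD process that is simply not running.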

| username: WalterWj | Original post link

Was the directory deleted?

| username: shigp_TIDBER | Original post link

There are quite a lot of logs, and it’s hard to go through them all, but I’m still very interested in how this issue is finally resolved.

| username: oceanzhang | Original post link

Did you find the reason in the end? I encountered this issue last time as well.

| username: 舞动梦灵 | Original post link

How many PDs are there, and should you scale in first or scale out first? Personally, I feel that with only 3 PDs, scaling in first might cause other issues. TiDB servers can be scaled in to 1 without any problem, but the official recommendation for PD is at least 3.
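
The concern about scaling in first can be made concrete with majority arithmetic. PD embeds etcd (visible in the logs above), and etcd's Raft needs a majority of members to make progress. The helper names below are hypothetical and exist only for this illustration.

```shell
#!/usr/bin/env bash
# Majority size for a Raft group of n members: floor(n/2) + 1.
pd_quorum() {
  echo $(( $1 / 2 + 1 ))
}

# How many members can be lost while a majority is still available.
pd_fault_tolerance() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

# Print a small table for typical PD counts.
for n in 1 2 3 4 5; do
  printf 'members=%s quorum=%s tolerated_failures=%s\n' \
    "$n" "$(pd_quorum "$n")" "$(pd_fault_tolerance "$n")"
done
```

With 3 PDs one failure is tolerated, but after scaling in to 2 the tolerance drops to zero, which is why adding the replacement before removing the faulty node is the safer order.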

| username: TiDBer_yyy | Original post link

Is the file system on 192.168.209.5:22379 corrupted? How did it happen?
Please describe the operations in detail, for example the scale-out or scale-in steps.

| username: 舞动梦灵 | Original post link

A TiDB server needs only one instance at minimum, but TiKV and PD each need at least three. Unless you are scaling in to save resources, it is recommended to scale out first by adding a new server and then scale in the problematic one. For example, if one PD server has an issue, add a new PD server first; once everything is stable, scale in the problematic node. After each PD scale-out or scale-in operation, run the command that updates the configuration:

tiup cluster reload <cluster-name> --skip-restart
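
Put together, the order described above might look like the following. This sketch only prints the commands so they can be reviewed before anything is run; `pd_replace_plan` is a hypothetical helper, and the cluster name `tidb`, the topology file `scale-out.yml` and the node ID `192.168.209.5:22379` are taken from earlier posts as placeholders.

```shell
#!/usr/bin/env bash
# Print (do not execute) the tiup commands for replacing a faulty PD node.
# Arguments: cluster name, topology file for the new PD, ID of the faulty node.
pd_replace_plan() {
  local cluster="$1" topology="$2" bad_node="$3"
  cat <<EOF
# 1. Add the new PD first so the member count does not drop below three.
tiup cluster scale-out ${cluster} ${topology}
tiup cluster reload ${cluster} --skip-restart
# 2. Confirm the new PD is reported as Up before touching the old one.
tiup cluster display ${cluster}
# 3. Remove the faulty PD, then refresh the configuration again.
tiup cluster scale-in ${cluster} --node ${bad_node}
tiup cluster reload ${cluster} --skip-restart
EOF
}

pd_replace_plan tidb scale-out.yml 192.168.209.5:22379
```

Printing the plan first leaves room to adjust SSH options such as `-p -i <key>` from the original command before running each step by hand.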


| username: zhang_2023 | Original post link

Was the scale-in or scale-out successful?

| username: hacker_77powerful | Original post link

Amazingly, after a period of time, the PD that was down recovered on its own.