Scaling Issues

translator_bot · June 22, 2024, 1:11pm

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 扩容问题

[Test Environment for TiDB] Testing
[TiDB Version] 6.5.1
[Reproduction Path] Adding a PD
[Encountered Problem: Phenomenon and Impact] Running scale-out fails with the following error: "error": "no endpoint available, the last err was: Get \"http://192.168.46.101:2379/pd/api/v1/config/replicate\": dial tcp 192.168.46.101:2379: connect: connection refused"
[Resource Configuration]
[Attachments: Screenshots/Logs/Monitoring]
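
The "connection refused" part of the error means nothing accepted a TCP connection at that PD client address. As a rough way to see which PD endpoints are actually listening, here is a minimal Python sketch using only the standard library. The function names are illustrative and not part of TiUP or PD, and a successful TCP connect only shows that the port is open, not that PD is healthy.

```python
import socket
from typing import Dict, Iterable, Tuple


def endpoint_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


def check_endpoints(endpoints: Iterable[Tuple[str, int]],
                    timeout: float = 3.0) -> Dict[str, bool]:
    """Map each "host:port" string to whether it accepted a connection."""
    return {
        f"{host}:{port}": endpoint_reachable(host, port, timeout)
        for host, port in endpoints
    }
```

For the address in the error above, the call would be check_endpoints([("192.168.46.101", 2379)]); a False result matches the "connection refused" reported by scale-out.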

| username: CuteRay | Original post link

Could you please share the cluster topology and the configuration file for scaling out?

| username: TiDBer_Terry261 | Original post link

The cluster originally had 3 PD servers. Yesterday, two of them failed at the same time, so we planned to add another PD server to get the cluster running again.
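
For context, PD embeds etcd, which needs a majority of its members to be available before it can serve requests or accept membership changes. The following is a minimal Python sketch of that majority arithmetic, assuming the standard majority rule; the function names are illustrative and not part of any TiDB tool.

```python
def quorum_size(members: int) -> int:
    """Smallest number of members that forms a majority."""
    return members // 2 + 1


def has_quorum(members: int, healthy: int) -> bool:
    """True if the healthy members are enough to form a majority."""
    return healthy >= quorum_size(members)


if __name__ == "__main__":
    # 3 PD members with 2 failed leaves 1 healthy member.
    print(quorum_size(3))    # majority needed for 3 members
    print(has_quorum(3, 1))  # whether 1 healthy member is enough
```

With 3 members the majority is 2, so a single surviving PD cannot form a quorum on its own.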

| username: TiDBer_Terry261 | Original post link

topology.yaml (11.1 KB)

| username: Kongdom | Original post link

Does that mean the cluster is currently not up?

| username: CuteRay | Original post link

It looks like your cluster is in a stopped state, so you can’t scale it. You need to start the cluster first, and then scale the PD nodes.

| username: TiDBer_Terry261 | Original post link

Currently, only one PD is left. I successfully removed the two problematic PDs with scale-in. However, when the cluster starts, the TiKV nodes still try to connect to a PD that no longer exists, so none of the TiKV nodes can start.

| username: CuteRay | Original post link

When you originally scaled in, the cluster was running, right?
Also, start the cluster first. There is no way to fix it while it is stopped.

| username: tidb菜鸟一只 | Original post link

Scale out while keeping one PD online.

| username: Kongdom | Original post link

Could you please share the error message for us to take a look?

| username: 考试没答案 | Original post link

Please show the cluster status (for example, the output of tiup cluster display).

| username: 考试没答案 | Original post link

Let’s see if a single PD can start successfully.

| username: h5n1 | Original post link

Most likely the TiKV startup scripts still point to the old PDs. Check the PD addresses specified in the run_tikv.sh file under the TiKV deployment directory and change them to the current PD address. If that still doesn't work, you may need to use pd-recover to recover the cluster.
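
The check described above can be sketched in Python. This is only a sketch: it assumes run_tikv.sh passes the PD endpoints as a comma-separated list through a --pd (or --pd-endpoints) argument, and the function names are hypothetical. Verify the actual contents of the script before relying on it.

```python
import re
from typing import List

# Assumption: the script contains something like
#   --pd "host1:2379,host2:2379" \
PD_FLAG = re.compile(r'--pd(?:-endpoints)?[=\s]+"?([^"\s\\]+)"?')


def pd_endpoints_in_script(script_text: str) -> List[str]:
    """Extract the PD endpoints listed in the script text, if any."""
    match = PD_FLAG.search(script_text)
    if match is None:
        return []
    return [e.strip() for e in match.group(1).split(",") if e.strip()]


def stale_endpoints(script_text: str, live: List[str]) -> List[str]:
    """Endpoints referenced by the script that are not in the live PD list."""
    live_set = set(live)
    return [e for e in pd_endpoints_in_script(script_text) if e not in live_set]
```

Reading the script with open(path).read() and passing the surviving PD's address as the live list shows which entries still refer to the removed PDs.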