Unable to start new TiFlash node after TiDB scaling

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TiDB 扩容新的 tiflash 节点启不起来。

| username: TiDBer_BMMBGarU

[TiDB Usage Environment] Production Environment
[TiDB Version] 7.5.1
[Reproduction Path] The original TiDB cluster had only one TiFlash node, which shared a host with a TiKV node; I am now scaling out by adding a second TiFlash node.
[Encountered Problem: Phenomenon and Impact] The newly added TiFlash node fails to start.
The log contains the following errors; the full log is in the attachment.
TiFlash_ErrorLog.tar.gz (6.2 MB)

[2024/06/15 14:06:43.834 +08:00] [ERROR] [Server.cpp:389] ["/workspace/source/tiflash/contrib/grpc/src/core/ext/transport/chttp2/server/insecure/server_chttp2.cc, line number: 48, log msg : {\"created\":\"@1718431603.834348337\",\"description\":\"No address added out of total 1 resolved\",\"file\":\"/workspace/source/tiflash/contrib/grpc/src/core/ext/transport/chttp2/server/chttp2_server.cc\",\"file_line\":936,\"referenced_errors\":[{\"created\":\"@1718431603.834330084\",\"description\":\"Unable to configure socket\",\"fd\":38,\"file\":\"/workspace/source/tiflash/contrib/grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc\",\"file_line\":218,\"referenced_errors\":[{\"created\":\"@1718431603.834325197\",\"description\":\"Cannot assign requested address\",\"errno\":99,\"file\":\"/workspace/source/tiflash/contrib/grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc\",\"file_line\":191,\"os_error\":\"Cannot assign requested address\",\"syscall\":\"bind\"}]}]}"] [source=grpc] [thread_id=1]

[2024/06/15 14:16:58.229 +08:00] [ERROR] [<unknown>] ["DB::Exception: Exception happens when start grpc server, the flash.service_addr may be invalid, flash.service_addr is 100.112.1.220:23930"] [source=Application] [thread_id=1]
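
The first log entry wraps its root cause in nested "referenced_errors" objects. A minimal Python sketch for unwrapping it is shown here; PAYLOAD is a hand-trimmed copy of the logged JSON (backslash escapes removed, "created", "file" and "file_line" fields omitted), and innermost_error is a hypothetical helper name, not part of TiFlash or gRPC.

```python
import json

# Trimmed copy of the gRPC error payload from the first log entry
# (escapes removed; "created", "file" and "file_line" omitted for brevity).
PAYLOAD = """
{"description": "No address added out of total 1 resolved",
 "referenced_errors": [
   {"description": "Unable to configure socket", "fd": 38,
    "referenced_errors": [
      {"description": "Cannot assign requested address", "errno": 99,
       "os_error": "Cannot assign requested address", "syscall": "bind"}]}]}
"""


def innermost_error(err: dict) -> dict:
    """Follow the first entry of "referenced_errors" down to the deepest error."""
    while err.get("referenced_errors"):
        err = err["referenced_errors"][0]
    return err


root = innermost_error(json.loads(PAYLOAD))
print(root["syscall"], root["errno"], root["os_error"])
# prints: bind 99 Cannot assign requested address
```

Read this way, the failing call is bind() with errno 99. On Linux that is EADDRNOTAVAIL, which usually means the requested IP address is not assigned to any interface on the machine.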

[Resource Configuration]

[Attachment: Screenshot/Log/Monitoring]

| username: Kongdom | Original post link

Check whether port 23930 is already in use and whether the firewall is turned off. It looks like the node is not reachable. Note that the default port is 3930.

| username: ziptoam | Original post link

The possibility of network and port issues is relatively high. You can first test whether the port is accessible.
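
One way to run such a test from another cluster host is a plain TCP connect. This is only a sketch using the Python standard library; port_reachable is a hypothetical helper name, and the 3-second timeout is an arbitrary choice.

```python
import socket


def port_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable networks.
        return False
```

For example, port_reachable("100.112.1.220", 23930) checks the address from the error log. While TiFlash itself is not running, a refused connection is expected, so this is more informative against a port known to be listening on that host (such as SSH) or once the service is up.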

| username: tidb菜鸟一只 | Original post link

Try manually starting TiFlash on the machine with IP 100.112.1.220 to see if it can start.

| username: TiDBer_BMMBGarU | Original post link

The port is not occupied, and the firewall is also turned off.

| username: Kongdom | Original post link

Can you connect to port 23930 on this node (for example, with telnet)?

| username: Jellybean | Original post link

At startup, the service at 100.112.1.220:23930 cannot be reached, so a connection cannot be established. Since the cluster already has a TiFlash node that runs normally, this is unlikely to be a problem with the cluster software or version.

You can try the following troubleshooting directions:

  1. Check if the address and port of flash.service_addr are configured correctly and if there are any conflicts to avoid misconfiguration.
  2. Check for network connectivity issues, such as being blocked by blacklist/whitelist settings, firewalls, or other policies.
  3. Check if the user password for accessing the target machine has expired or if there are other network restrictions preventing access.
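
For the first point, the logged failure is bind() returning errno 99, which on Linux is EADDRNOTAVAIL. A rough way to test this directly is to try binding the same address on the new TiFlash host while TiFlash is stopped. The sketch below uses only the Python standard library; can_bind is a hypothetical helper, not a TiUP or TiFlash tool.

```python
import errno
import socket
from typing import Tuple


def can_bind(host: str, port: int) -> Tuple[bool, str]:
    """Try to bind a TCP socket to host:port on this machine, as a server would."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        return True, "ok"
    except OSError as exc:
        if exc.errno == errno.EADDRNOTAVAIL:
            # Same failure as in the TiFlash log: the IP is not local to this host.
            return False, "address is not assigned to any local interface"
        if exc.errno == errno.EADDRINUSE:
            return False, "port is already in use"
        return False, f"bind failed: {exc}"
    finally:
        sock.close()
```

Calling can_bind("100.112.1.220", 23930) on the new node and getting the "not assigned" result would suggest that the host in the topology is an address other machines use to reach the node (for example, behind NAT) rather than one configured on the node itself.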

| username: 友利奈绪 | Original post link

Review the deployment process step by step. Adding a new node involves many steps, so check whether anything was missed, such as passwordless SSH setup or port conflicts.

| username: jiayou64 | Original post link

Before scaling out, run the following check:

tiup cluster check tidb_cluster scale-out.yaml --cluster --user tidb

| username: lemonade010 | Original post link

Run tiup cluster display <cluster-name> to check whether there are any issues with the configuration.

| username: zhaokede | Original post link

Check whether the configuration is correct and whether the network is reachable.

| username: TiDBer_BMMBGarU | Original post link

I just searched and found that others have encountered the same problem. I don’t know if they have solved it yet.

| username: TiDBer_BMMBGarU | Original post link

The firewall is not enabled, and the ports are not occupied.

| username: Jellybean | Original post link

You can try scaling the node in and then scaling it out again. If the problem persists, please post the tiflash.log output again.

| username: TiDBer_BMMBGarU | Original post link

I have tried scaling in, but the node status is ‘N/A’. After the scale-in, tiup cluster edit-config shows the TiFlash node with “offline:true”. Even after cleaning up with tiup cluster prune, the topology is unchanged: the status is still ‘N/A’, and edit-config still shows the TiFlash node with “offline:true”.

Today, I tried changing the port, but the result is still the same.

| username: TIDB-Learner | Original post link

Check the network, file permissions, ports, etc.

| username: TiDBer_BMMBGarU | Original post link

Permissions, ports, and firewalls have all been checked, and there are no issues. My initial suspicion is a problem with network address translation (NAT) across different network segments, which is now being looked into. I will post updates as results come in.