[TiDB Usage Environment] Production Environment
[TiDB Version] 7.5.1
[Reproduction Path] The original TiDB cluster had only one TiFlash node, co-located with a TiKV node; I am now scaling out by adding one more TiFlash node.
[Encountered Problem: Phenomenon and Impact] The newly added TiFlash node fails to start.
The log contains the following error; the full log is in the attachment TiFlash_ErrorLog.tar.gz (6.2 MB).
[2024/06/15 14:06:43.834 +08:00] [ERROR] [Server.cpp:389] ["/workspace/source/tiflash/contrib/grpc/src/core/ext/transport/chttp2/server/insecure/server_chttp2.cc, line number: 48, log msg : {\"created\":\"@1718431603.834348337\",\"description\":\"No address added out of total 1 resolved\",\"file\":\"/workspace/source/tiflash/contrib/grpc/src/core/ext/transport/chttp2/server/chttp2_server.cc\",\"file_line\":936,\"referenced_errors\":[{\"created\":\"@1718431603.834330084\",\"description\":\"Unable to configure socket\",\"fd\":38,\"file\":\"/workspace/source/tiflash/contrib/grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc\",\"file_line\":218,\"referenced_errors\":[{\"created\":\"@1718431603.834325197\",\"description\":\"Cannot assign requested address\",\"errno\":99,\"file\":\"/workspace/source/tiflash/contrib/grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc\",\"file_line\":191,\"os_error\":\"Cannot assign requested address\",\"syscall\":\"bind\"}]}]}"] [source=grpc] [thread_id=1]
[2024/06/15 14:16:58.229 +08:00] [ERROR] [<unknown>] ["DB::Exception: Exception happens when start grpc server, the flash.service_addr may be invalid, flash.service_addr is 100.112.1.220:23930"] [source=Application] [thread_id=1]
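The gRPC error in the first log line is nested JSON, and the root cause sits at the deepest `referenced_errors` entry. As a minimal sketch for reading it (the helper name `innermost_error` is hypothetical, and the payload below is a trimmed copy of the log with the file paths and timestamps left out), the chain can be walked like this:

```python
import json


def innermost_error(grpc_error: dict) -> dict:
    """Follow `referenced_errors` down to the deepest (root-cause) entry."""
    current = grpc_error
    while current.get("referenced_errors"):
        # Each level in this log wraps exactly one child, so take the first.
        current = current["referenced_errors"][0]
    return current


# Trimmed copy of the JSON payload from the log above.
payload = json.loads("""
{"description": "No address added out of total 1 resolved",
 "referenced_errors": [
   {"description": "Unable to configure socket", "fd": 38,
    "referenced_errors": [
      {"description": "Cannot assign requested address",
       "errno": 99, "os_error": "Cannot assign requested address",
       "syscall": "bind"}]}]}
""")

root = innermost_error(payload)
print(root["syscall"], root["errno"], root["os_error"])
# prints: bind 99 Cannot assign requested address
```

On Linux, errno 99 is EADDRNOTAVAIL, so the failing step is the local `bind` call on the new node rather than an outbound connection.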
On startup, TiFlash cannot bind its gRPC server to 100.112.1.220:23930 (the bind call fails with errno 99, "Cannot assign requested address"), so the service never comes up. Since an existing TiFlash node already starts normally, this is unlikely to be a problem with the cluster's own software or version.
You can try the following troubleshooting directions:
Check that the address and port in flash.service_addr are configured correctly and do not conflict with anything else.
Check for network connectivity issues, such as being blocked by blacklist/whitelist settings, firewalls, or other policies.
Check whether the user password for accessing the target machine has expired, or whether other network restrictions are preventing access.
Review the deployment process itself. Adding a new node involves many steps; check whether any were missed, such as passwordless SSH setup, or whether a required port is already occupied.
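The address and port checks above can be sketched as a small script run on the new TiFlash host. This is only an illustration, not part of TiUP or TiFlash: the function name `check_bind` is hypothetical, and it simply attempts the same kind of TCP bind that fails in the log and classifies the result.

```python
import errno
import socket


def check_bind(host: str, port: int) -> str:
    """Try to bind a TCP socket to host:port and classify the outcome."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError as exc:
            if exc.errno == errno.EADDRNOTAVAIL:
                # Same condition as errno 99 in the TiFlash log on Linux.
                return "address not assigned to any local interface"
            if exc.errno == errno.EADDRINUSE:
                return "port already in use"
            return f"bind failed: {exc}"
        return "ok"


if __name__ == "__main__":
    # Address and port taken from the error message in this thread.
    print(check_bind("100.112.1.220", 23930))
```

If this reports that the address is not assigned to any local interface, changing the port will not help; the IP in flash.service_addr would need to be one that the host actually owns.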
I tried scaling the node in, but its status is ‘N/A’. After the scale-in, tiup cluster edit-config shows the TiFlash node with “offline:true”. Even after cleaning up data with tiup cluster prune, the topology remains the same: the status is still ‘N/A’, and edit-config still shows the TiFlash node as “offline:true”.
Today, I tried changing the port, but the result is still the same.
Permissions, ports, and firewalls have all been checked, and there are no issues. My current suspicion is that the problem is network address translation (NAT) across different network segments; this is being investigated. Updates will be provided as results come in.
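One way to test the NAT suspicion is to compare the configured address with the address the host itself uses when talking to another cluster node. The sketch below is an assumption-laden illustration: `local_ip_towards` is a hypothetical helper, and the peer would be the IP of some other node in the cluster, such as a PD host. It relies on the fact that `connect()` on a UDP socket only selects a route and sends no packet.

```python
import socket


def local_ip_towards(peer_host: str, peer_port: int = 9) -> str:
    """Return the local IPv4 address the kernel would use to reach peer_host."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        # connect() on a UDP socket only picks a route; nothing is transmitted.
        sock.connect((peer_host, peer_port))
        return sock.getsockname()[0]


def matches_configured(peer_host: str, configured_ip: str) -> bool:
    """True if the configured service IP is the host's own source address."""
    return local_ip_towards(peer_host) == configured_ip
```

If `matches_configured` returns False for 100.112.1.220 on the new node, that address is likely a translated (NAT) address that exists only outside the host, which would be consistent with the bind failure.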