TiDB Cluster Backup Exception Causes TiDB Restart Failure - Backup Stream Encountered Error

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tidb集群备份异常后,tidb重启启动不了 backup stream meet error

| username: 末0_0想

[TiDB Usage Environment] Testing
[TiDB Version] V6.5.0
[Reproduction Path] After a backup with Dumpling, the TiDB service crashed. When I tried to restart the TiDB cluster, it would not come back up. Checking the relevant nodes turned up the following error. How should I handle this?

[2023/05/18 10:13:35.310 +08:00] [WARN] [errors.rs:155] ["backup stream meet error"] [verbose_err="Etcd(GRpcStatus(Status { code: Unknown, message: \"Service was not ready: buffered service failed: load balancer discovery error: transport error: transport error\", source: None }))"] [err="Etcd meet error grpc request error: status: Unknown, message: \"Service was not ready: buffered service failed: load balancer discovery error: transport error: transport error\", details: [], metadata: MetadataMap { headers: {} }"] [context="failed to get backup stream task"]
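
For context, the "backup stream" in this error is TiKV's log-backup (PITR) worker, which keeps its task metadata in PD's embedded etcd, so the transport error means TiKV could not reach any PD node. A minimal sketch for checking PD health and any registered log-backup task, assuming the default client port and that 10.18.104.162 (seen later in this thread) is one of the PD endpoints:

# Check PD/etcd cluster health (pd-ctl ships with tiup, pinned to the cluster version)
tiup ctl:v6.5.0 pd -u http://10.18.104.162:2379 health

# If a log-backup (backup stream) task was ever started, BR can show its status
tiup br log status --pd 10.18.104.162:2379
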
| username: db_user | Original post link

One of your PDs is down. First, get all the PDs up, then start TiDB. If it still doesn’t work, send the error logs from the startup time.
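
As a sketch of the first step, tiup can show which PD instance is down, using the cluster name from this thread:

# Show the status of every instance; look for PD nodes that are not in the "Up" state
tiup cluster display zdww-tidb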

| username: 末0_0想 | Original post link

I followed your method to stop the entire cluster, then used tiup cluster start zdww-tidb -R pd to start PD, but as shown in the picture, only 2 nodes were started.

Another node with IP 162 reported the following log:

"] [request-path=/0/members/465979c2bcb316b9/attributes] [publish-timeout=11s] [error="etcdserver: request timed out"]
[2023/05/18 10:28:38.053 +08:00] [WARN] [server.go:2098] ["failed to publish local member to cluster through raft"] [local-member-id=465979c2bcb316b9] [local-member-attributes="{Name:pd-10.18.104.162-2379 ClientURLs:[http://10.18.104.162:2379]}"] [request-path=/0/members/465979c2bcb316b9/attributes] [publish-timeout=11s] [error="etcdserver: request timed out"]
[2023/05/18 10:28:49.054 +08:00] [WARN] [server.go:2098] ["failed to publish local member to cluster through raft"] [local-member-id=465979c2bcb316b9] [local-member-attributes="{Name:pd-10.18.104.162-2379 ClientURLs:[http://10.18.104.162:2379]}"] [request-path=/0/members/465979c2bcb316b9/attributes] [publish-timeout=11s] [error="etcdserver: request timed out"]

What should I do next?

| username: db_user | Original post link

Uh, why did you stop the entire cluster? You just need to start the downed PD node directly. Currently, the PD isn’t up. Check the PD logs to see what errors are causing those two nodes to fail to start. Also, check if the network is functioning properly and if the nodes can communicate. Start by launching PD first. If your PD status is abnormal, none of the other components can start normally.
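
A minimal connectivity check between the PD hosts, assuming the default ports (2379 for clients, 2380 for etcd peer traffic) and that nc is installed; 10.18.104.162 stands in for whichever peer you are probing:

# Run from each PD host against every other PD host; both ports must be reachable
nc -zv 10.18.104.162 2379
nc -zv 10.18.104.162 2380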

| username: 末0_0想 | Original post link

I have started PD now. It turned out the firewall had been re-enabled when the machine rebooted.

When I used tiup cluster start zdww-tidb -R tidb to start TiDB, I encountered the following error:

Error: failed to start tidb: failed to start: 10.18.104.164 tidb-4000.service, please check the instance's log(/tidb-deploy/tidb-4000/log) for more detail.: timed out waiting for port 4000 to be started after 2m0s

I checked the logs on 164 and found the following errors, but I see that the firewall is disabled. Should I start TiKV first?

[2023/05/18 10:56:24.527 +08:00] [INFO] [region_cache.go:2539] ["[health check] check health error"] [store=10.18.104.163:20160] [error="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 10.18.104.163:20160: connect: connection refused\""]
[2023/05/18 10:56:25.271 +08:00] [INFO] [region_cache.go:2539] ["[health check] check health error"] [store=10.18.104.161:20160] [error="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 10.18.104.161:20160: connect: connection refused\""]
[2023/05/18 10:56:25.338 +08:00] [INFO] [region_cache.go:2539] ["[health check] check health error"] [store=10.18.104.154:20160] [error="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 10.18.104.154:20160: connect: connection refused\""]
[2023/05/18 10:56:25.527 +08:00] [INFO] [region_cache.go:2539] ["[health check] check health error"] [store=10.18.104.163:20160] [error="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 10.18.104.163:20160: connect: connection refused\""]
[2023/05/18 10:56:26.271 +08:00] [INFO] [region_cache.go:2539] ["[health check] check health error"] [store=10.18.104.161:20160] [error="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 10.18.104.161:20160: connect: connection refused\""]
[2023/05/18 10:56:26.339 +08:00] [INFO] [region_cache.go:2539] ["[health check] check health error"] [store=10.18.104.154:20160] [error="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 10.18.104.154:20160: connect: connection refused\""]
[2023/05/18 10:56:26.528 +08:00] [INFO] [region_cache.go:2539] ["[health check] check health error"] [store=10.18.104.163:20160] [error="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 10.18.104.163:20160: connect: connection refused\""]
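
Those "connection refused" errors mean nothing is listening on port 20160 on the TiKV hosts yet, so TiKV does need to be up before TiDB can finish starting. A quick sketch to confirm, assuming ss and nc are available:

# On a TiKV host such as 10.18.104.163: is tikv-server listening on its port?
ss -ltnp | grep 20160

# From the TiDB host: is the TiKV port reachable over the network?
nc -zv 10.18.104.163 20160
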
| username: db_user | Original post link

You can just restart the entire cluster directly. First, check whether the port 4000 issue is caused by the firewall. The startup sequence is PD first, then TiKV, and then TiDB. Before starting, check the firewall and network conditions. Since the cluster cannot provide external services at this time anyway, just restart the whole cluster directly.
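
If the hosts run firewalld (an assumption; it is the usual culprit on CentOS-family systems), the following keeps it from coming back after the next reboot:

# Stop the firewall now and prevent it from starting at boot
systemctl stop firewalld
systemctl disable firewalld

# Confirm it is inactive
systemctl status firewalld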

| username: 末0_0想 | Original post link

I used
tiup cluster start zdww-tidb -R pd
tiup cluster start zdww-tidb -R tikv (there was an error)
as shown in the image below:

Then I used tiup cluster restart zdww-tidb to restart everything directly:

There is an error in the log on 161; can you help me take a look?

| username: db_user | Original post link

In your first picture, both KV and PD are functioning normally, right? You can directly start the node separately using -N. The error in the log below indicates that port 20160 on 154 is not accessible. Check if the firewall has been reactivated or if there is another issue. Ensure that the firewall is turned off and that the network is fully accessible.
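
For reference, -N takes the node ID exactly as shown in tiup cluster display (host:service-port), so starting only the TiKV instance on 154 would look like this:

tiup cluster start zdww-tidb -N 10.18.104.154:20160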

| username: 末0_0想 | Original post link

I found that using tiup cluster start zdww-tidb -R tikv can start TiKV.

However, using tiup cluster restart zdww-tidb results in the following error on node 161. It says the file is corrupted, causing the TiKV node to fail to start.

The last log file is corrupted but ignored: Append:58, Corruption: Log item offset is smaller than log batch header length

Previously, I used Dumpling for backup, but it caused the server to crash.

| username: db_user | Original post link

This is not the main issue. Check the resources; it might be due to insufficient resources during startup.
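
A few quick checks for the resource theory, assuming standard Linux tooling (the /tidb-deploy path comes from the earlier error message; adjust if your directories differ):

# Memory and disk headroom on the host
free -h
df -h /tidb-deploy

# Was anything OOM-killed during the failed startups?
dmesg -T | grep -i -e oom -e 'killed process'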

| username: 末0_0想 | Original post link

Can you take a look at this? When using tiup cluster restart zdww-tidb, it gets stuck at the TiKV startup step and always times out. I've checked the resources, and usage is not high. The firewall and related settings are also turned off.


154
154.txt (247.7 KB)
161
ttt (3).txt (611.8 KB)

| username: Billmay表妹 | Original post link

Go to TiDB Dashboard - Cluster Info - Hosts and take a screenshot of this page.
Let’s check your resource configuration.

| username: 末0_0想 | Original post link

Also, I want to ask: when I query large tables, I often get the error "Lost connection to MySQL server during query". After investigating, I found that high IO caused TiDB to restart. How can I handle this situation?

The logs also show the following errors:
[2023/05/18 14:54:08.688 +08:00] [ERROR] [client.go:555] ["[pd] tso request is canceled due to timeout"] [dc-location=global] [error="[PD:client:ErrClientGetTSOTimeout]get TSO timeout"]
[2023/05/18 14:54:11.827 +08:00] [WARN] [pd.go:152] ["get timestamp too slow"] ["cost time"=8.06741986s]
[2023/05/18 14:54:16.753 +08:00] [ERROR] [client.go:555] ["[pd] tso request is canceled due to timeout"] [dc-location=global] [error="[PD:client:ErrClientGetTSOTimeout]get TSO timeout"]

Is there any way to limit IO or the number of processes during queries so the service stays up? Queries can be slow, as long as they don't disrupt the service.

| username: Min_Chen | Original post link

Hello,

It looks like your server hardware, CPU and memory included, is not very large. You can reduce the concurrency of the tools and of the application, which lowers both the parallelism and the resources required by individual SQL queries; otherwise the cluster may keep crashing. You might also want to consider expanding your hardware.
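
As one hedged example of capping per-query resources on v6.5 (these system variables exist in 6.5; the host is taken from this thread, and the values are only starting points to tune):

# Connect through the TiDB service port and lower per-query limits cluster-wide
mysql -h 10.18.104.164 -P 4000 -u root -p -e "
  SET GLOBAL tidb_mem_quota_query = 1073741824;    -- cap a single query at ~1 GiB
  SET GLOBAL tidb_distsql_scan_concurrency = 4;    -- fewer concurrent coprocessor scans
"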

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.