TiDB cluster cannot start

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tidb 集群无法启动 (TiDB cluster cannot start)

| username: blaine

【TiDB Environment】Production Environment
【TiDB Version】v5.2.2
【Encountered Issue】Unable to start the cluster after scaling in TiDB
【Reproduction Path】None
【Issue Phenomenon and Impact】
After scaling in TiDB, it reports that regions cannot be found and many processes hang, and then the cluster cannot be restarted successfully. The following logs were found; TiDB keeps trying to reach the TiKV nodes that are already offline.
[2022/10/02 01:05:42.761 +08:00] [WARN] [client_batch.go:503] ["init create streaming fail"] [target=172.16.120.9:20160] [forwardedHost=] [error="context deadline exceeded"]
[2022/10/02 01:05:42.762 +08:00] [INFO] [region_cache.go:2251] ["[health check] check health error"] [store=172.16.120.9:20160] [error="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 172.16.120.9:20160: connect: connection refused\""]
[2022/10/02 01:05:42.762 +08:00] [INFO] [region_request.go:344] ["mark store's regions need be refill"] [id=15698413] [addr=172.16.120.9:20160] [error="context deadline exceeded"]
[2022/10/02 01:05:43.470 +08:00] [INFO] [coprocessor.go:812] ["[TIME_COP_PROCESS] resp_time:10.925106589s txnStartTS:436376416354041867 region_id:15666413 store_addr:172.16.120.9:20160 backoff_ms:1165 backoff_types:[regionScheduling,tikvRPC,tikvRPC,regionMiss,regionScheduling,tikvRPC,tikvRPC]"]
[2022/10/02 01:05:48.485 +08:00] [WARN] [client_batch.go:503] ["init create streaming fail"] [target=172.16.120.10:20161] [forwardedHost=] [error="context deadline exceeded"]
[2022/10/02 01:05:48.486 +08:00] [INFO] [region_cache.go:2251] ["[health check] check health error"] [store=172.16.120.10:20161] [error="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 172.16.120.10:20161: connect: connection refused\""]
[2022/10/02 01:05:48.486 +08:00] [INFO] [region_request.go:344] ["mark store's regions need be refill"] [id=15698410] [addr=172.16.120.10:20161] [error="context deadline exceeded"]
[2022/10/02 01:05:54.494 +08:00] [WARN] [client_batch.go:503] ["init create streaming fail"] [target=172.16.120.9:20160] [forwardedHost=] [error="context deadline exceeded"]
[2022/10/02 01:05:54.495 +08:00] [INFO] [region_cache.go:2251] ["[health check] check health error"] [store=172.16.120.9:20160] [error="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 172.16.120.9:20160: connect: connection refused\""]
[2022/10/02 01:05:54.495 +08:00] [INFO] [region_request.go:344] ["mark store's regions need be refill"] [id=15698413] [addr=172.16.120.9:20160] [error="context deadline exceeded"]
[2022/10/02 01:05:56.012 +08:00] [INFO] [coprocessor.go:812] ["[TIME_COP_PROCESS] resp_time:12.53678461s txnStartTS:436376416354041867 region_id:15666413 store_addr:172.16.120.9:20160 backoff_ms:3700 backoff_types:[regionScheduling,tikvRPC,tikvRPC,regionMiss,regionScheduling,tikvRPC,tikvRPC,regionMiss,regionScheduling,tikvRPC,tikvRPC]"]
[2022/10/02 01:06:01.038 +08:00] [WARN] [client_batch.go:503] ["init create streaming fail"] [target=172.16.120.10:20161] [forwardedHost=] [error="context deadline exceeded"]
[2022/10/02 01:06:01.039 +08:00] [INFO] [region_cache.go:2251] ["[health check] check health error"] [store=172.16.120.10:20161] [error="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 172.16.120.10:20161: connect: connection refused\""]

The information for the offline stores is still present in the PD store list (screenshot omitted).
How can I delete this information so that the cluster does not look for the already-offline IPs on startup? I went into the pd-ctl interactive mode and manually ran store delete <id>; it reported success, but the store information is still there and was not deleted.

| username: wuxiangdong | Original post link

After TiKV is taken offline, does its status change to Tombstone?

| username: h5n1 | Original post link

store delete <id> only changes the TiKV to the Offline status; while it is Offline, its regions and leaders are transferred away. Once that finishes, the store changes to Tombstone status, and you can then run tiup cluster prune. TiDB reports that it cannot find regions because the TiKV going offline causes leaders and regions to move, which invalidates TiDB's region cache; TiDB retries automatically, and that is the normal mechanism. From your description, it looks like the earlier TiKV offline process was never completed, and now you are also scaling in the TiDB server. Run tiup cluster display <cluster-name> first to check the cluster status.
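For reference, a rough sketch of that normal scale-in flow; the PD address, offline store ID, and cluster name below are placeholders for your own values:

pd-ctl -u http://<pd-ip>:2379 store delete <offline_store_id>   # store changes to Offline
pd-ctl -u http://<pd-ip>:2379 store <offline_store_id>          # poll until state_name becomes Tombstone
tiup cluster prune <cluster-name>                               # remove the Tombstone node from the topology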

| username: blaine | Original post link

TiKV is currently in the Offline state and was not taken down properly. TiDB cannot start, but the offline nodes no longer appear in the cluster topology because my colleague used the --force option.

| username: blaine | Original post link

Is there any way to remove this offline IP and bring the cluster up? It’s okay if some data is lost. The problem now is that the cluster won’t start, and the data can’t be backed up. When using BR for backup, it reports a “leader not found” error.

| username: h5n1 | Original post link

If you only forcibly took one TiKV node offline, it shouldn't end up like this, since there are still 2 replicas. First, manually add remove-peer operators to remove the peers on the offline store.
Refer to the script below:

# Replace {offline_store_id} with the store ID(s) of the offline TiKV
for i in {offline_store_id}
do
    # For every region that still has a peer on store $i, schedule removal of that peer
    for j in `pd-ctl region store $i | jq ".regions[] | {id: .id}" | grep id | awk '{print $2}'`
    do
        pd-ctl operator add remove-peer $j $i
    done
    # Show the store again to check its status and region count
    pd-ctl store $i
done

Then use pd-ctl to check whether the store's status changes and its region count decreases.
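For example, one quick way to watch the region count (the PD address and store ID are placeholders; this assumes jq is installed):

pd-ctl -u http://<pd-ip>:2379 store <offline_store_id> | jq '.status.region_count'

The count should drop toward 0 as the remove-peer operators are applied.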

| username: blaine | Original post link

When I try to remove a peer this way, it reports that no leader can be found.
It is the same error I get when using BR to back up: it also reports that no leader can be found.
However, PD shows as normal in the cluster status.

| username: h5n1 | Original post link

How many TiKVs do you have, and how many are functioning normally?
I guess you need to perform multi-replica failure recovery and remove the problematic stores.
Refer to the following:
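The reference from the original post is not reproduced here. As a rough outline only, the usual multi-replica failure recovery flow on this version looks something like the following; the PD address, data directory, and failed store ID are placeholders, not values taken from this cluster:

# 1. Pause PD scheduling so nothing moves while regions are being repaired
pd-ctl -u http://<pd-ip>:2379 config set region-schedule-limit 0
pd-ctl -u http://<pd-ip>:2379 config set replica-schedule-limit 0
pd-ctl -u http://<pd-ip>:2379 config set leader-schedule-limit 0
pd-ctl -u http://<pd-ip>:2379 config set merge-schedule-limit 0

# 2. Stop the surviving TiKV nodes, then on every one of them remove the
#    failed store from all region metadata
tikv-ctl --data-dir <tikv-data-dir> unsafe-recover remove-fail-stores -s <failed_store_id> --all-regions

# 3. Start the TiKV nodes again and restore the scheduling limits recorded in step 1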

| username: blaine | Original post link

Okay, thanks. I'll give it a try first.

| username: blaine | Original post link

I followed the documentation. Before starting TiKV, the store list in PD was empty; after starting TiKV, the store list in PD has data again. TiDB still cannot start and keeps looking for the IP that has already been taken offline.

| username: blaine | Original post link

The status of the offline IP remains Offline and does not change.

| username: h5n1 | Original post link

Please describe your operation process and provide the complete output of tiup cluster display.

| username: blaine | Original post link

First, I stopped TiKV. Second, I disabled PD scheduling. Third, on a normal TiKV node, I ran the following command:

/opt/tidb-community-server-v5.2.2-linux-amd64/tikv-ctl --data-dir /data1/tidb-data/tikv-20160 unsafe-recover remove-fail-stores -s 15698412 --all-regions

I didn't see any success message, but I could see it was processing. Do I need to execute this on every TiKV? Fourth, I restarted PD, and at that point the store list was empty. Fifth, I started TiKV, and then the store list had data again.

| username: h5n1 | Original post link

Step three needs to be executed on every normal TiKV node.
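If it helps, a loop like this can run the same command on every surviving node (the hostnames are placeholders; the binary path, data directory, and store ID are the ones from the earlier post, and the tikv-server process on each node must be stopped first):

for host in <tikv-host-1> <tikv-host-2> <tikv-host-3>
do
    # tikv-ctl operates on the local data directory, so the TiKV process there must be down
    ssh $host "/opt/tidb-community-server-v5.2.2-linux-amd64/tikv-ctl \
        --data-dir /data1/tidb-data/tikv-20160 \
        unsafe-recover remove-fail-stores -s 15698412 --all-regions"
done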

| username: blaine | Original post link

Okay, I’ll give it a try.

| username: blaine | Original post link

Before starting TiKV (screenshot omitted).

After starting TiKV, there is data again (screenshot omitted); I have also re-enabled scheduling.

| username: blaine | Original post link

What's going on here :innocent:? Compared to last time, though, the data volume is a bit lower.

| username: h5n1 | Original post link

What is the current status? What has been done?

| username: blaine | Original post link

It still hasn't been removed. Each time I execute it, the count only drops by a few, or at most a few hundred, entries.

| username: h5n1 | Original post link

The store you are looking at now is different from the previous one.