TiDB cluster cannot start

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tidb 集群无法启动 (TiDB cluster cannot start)

| username: blaine

【TiDB Environment】Production Environment
【TiDB Version】v5.2.2
【Encountered Issue】Unable to start the cluster after scaling in TiDB
【Reproduction Path】None
【Issue Phenomenon and Impact】
After scaling in TiDB, it reports that regions cannot be found and many processes hang, and then the cluster cannot be restarted successfully. The following logs were found; TiDB keeps trying to reach the TiKV nodes that are already offline.
[2022/10/02 01:05:42.761 +08:00] [WARN] [client_batch.go:503] ["init create streaming fail"] [target=172.16.120.9:20160] [forwardedHost=] [error="context deadline exceeded"]
[2022/10/02 01:05:42.762 +08:00] [INFO] [region_cache.go:2251] ["[health check] check health error"] [store=172.16.120.9:20160] [error="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 172.16.120.9:20160: connect: connection refused\""]
[2022/10/02 01:05:42.762 +08:00] [INFO] [region_request.go:344] ["mark store's regions need be refill"] [id=15698413] [addr=172.16.120.9:20160] [error="context deadline exceeded"]
[2022/10/02 01:05:43.470 +08:00] [INFO] [coprocessor.go:812] ["[TIME_COP_PROCESS] resp_time:10.925106589s txnStartTS:436376416354041867 region_id:15666413 store_addr:172.16.120.9:20160 backoff_ms:1165 backoff_types:[regionScheduling,tikvRPC,tikvRPC,regionMiss,regionScheduling,tikvRPC,tikvRPC]"]
[2022/10/02 01:05:48.485 +08:00] [WARN] [client_batch.go:503] ["init create streaming fail"] [target=172.16.120.10:20161] [forwardedHost=] [error="context deadline exceeded"]
[2022/10/02 01:05:48.486 +08:00] [INFO] [region_cache.go:2251] ["[health check] check health error"] [store=172.16.120.10:20161] [error="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 172.16.120.10:20161: connect: connection refused\""]
[2022/10/02 01:05:48.486 +08:00] [INFO] [region_request.go:344] ["mark store's regions need be refill"] [id=15698410] [addr=172.16.120.10:20161] [error="context deadline exceeded"]
[2022/10/02 01:05:54.494 +08:00] [WARN] [client_batch.go:503] ["init create streaming fail"] [target=172.16.120.9:20160] [forwardedHost=] [error="context deadline exceeded"]
[2022/10/02 01:05:54.495 +08:00] [INFO] [region_cache.go:2251] ["[health check] check health error"] [store=172.16.120.9:20160] [error="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 172.16.120.9:20160: connect: connection refused\""]
[2022/10/02 01:05:54.495 +08:00] [INFO] [region_request.go:344] ["mark store's regions need be refill"] [id=15698413] [addr=172.16.120.9:20160] [error="context deadline exceeded"]
[2022/10/02 01:05:56.012 +08:00] [INFO] [coprocessor.go:812] ["[TIME_COP_PROCESS] resp_time:12.53678461s txnStartTS:436376416354041867 region_id:15666413 store_addr:172.16.120.9:20160 backoff_ms:3700 backoff_types:[regionScheduling,tikvRPC,tikvRPC,regionMiss,regionScheduling,tikvRPC,tikvRPC,regionMiss,regionScheduling,tikvRPC,tikvRPC]"]
[2022/10/02 01:06:01.038 +08:00] [WARN] [client_batch.go:503] ["init create streaming fail"] [target=172.16.120.10:20161] [forwardedHost=] [error="context deadline exceeded"]
[2022/10/02 01:06:01.039 +08:00] [INFO] [region_cache.go:2251] ["[health check] check health error"] [store=172.16.120.10:20161] [error="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 172.16.120.10:20161: connect: connection refused\""]

The information for the offline stores is still present in the PD store list (screenshot omitted).
How can I delete this information so that the cluster does not look for the already-offline IPs on startup? I went into the pd-ctl interactive mode and manually ran store delete <id>; it reported success, but the store information is still there and was not deleted.

| username: wuxiangdong | Original post link

After TiKV is taken offline, does its status change to Tombstone?

| username: h5n1 | Original post link

store delete <id> only changes the TiKV to the Offline status; while it is Offline, its regions and leaders are transferred away. Once that finishes, the store changes to Tombstone status, and you can then run tiup cluster prune. TiDB reports that it cannot find regions because the TiKV going offline causes leaders and regions to move, which invalidates TiDB's region cache; TiDB retries automatically, and that is the normal mechanism. From your description, it looks like the earlier TiKV offline process was never completed, and now you are also scaling in the TiDB server. Run tiup cluster display <cluster-name> first to check the cluster status.
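For reference, a rough sketch of that normal scale-in flow; the PD address, offline store ID, and cluster name below are placeholders for your own values:

pd-ctl -u http://<pd-ip>:2379 store delete <offline_store_id>   # store changes to Offline
pd-ctl -u http://<pd-ip>:2379 store <offline_store_id>          # poll until state_name becomes Tombstone
tiup cluster prune <cluster-name>                               # remove the Tombstone node from the topology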

| username: blaine | Original post link

TiKV is currently in the Offline state and was not taken down properly. TiDB cannot start, but the offline nodes no longer appear in the cluster topology because my colleague used the --force option.

| username: blaine | Original post link

Is there any way to remove this offline IP and bring the cluster up? It’s okay if some data is lost. The problem now is that the cluster won’t start, and the data can’t be backed up. When using BR for backup, it reports a “leader not found” error.

| username: h5n1 | Original post link

If you only forcibly took one TiKV node offline, it shouldn't end up like this, since there are still 2 replicas. First, manually add remove-peer operators to remove the peers on the offline store.
Refer to the script below:

# Replace {offline_store_id} with the store ID(s) of the offline TiKV
for i in {offline_store_id}
do
    # For every region that still has a peer on store $i, schedule removal of that peer
    for j in `pd-ctl region store $i | jq ".regions[] | {id: .id}" | grep id | awk '{print $2}'`
    do
        pd-ctl operator add remove-peer $j $i
    done
    # Show the store again to check its status and region count
    pd-ctl store $i
done

Then use pd-ctl to check whether the store's status changes and its region count decreases.
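For example, one quick way to watch the region count (the PD address and store ID are placeholders; this assumes jq is installed):

pd-ctl -u http://<pd-ip>:2379 store <offline_store_id> | jq '.status.region_count'

The count should drop toward 0 as the remove-peer operators are applied.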

| username: blaine | Original post link

When I try to remove a peer this way, it reports that no leader can be found.
It is the same error I get when using BR to back up: it also reports that no leader can be found.
However, PD shows as normal in the cluster status.

| username: h5n1 | Original post link

How many TiKVs do you have, and how many are functioning normally?
I guess you need to perform multi-replica failure recovery and remove the problematic stores.
Refer to the following:
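The reference from the original post is not reproduced here. As a rough outline only, the usual multi-replica failure recovery flow on this version looks something like the following; the PD address, data directory, and failed store ID are placeholders, not values taken from this cluster:

# 1. Pause PD scheduling so nothing moves while regions are being repaired
pd-ctl -u http://<pd-ip>:2379 config set region-schedule-limit 0
pd-ctl -u http://<pd-ip>:2379 config set replica-schedule-limit 0
pd-ctl -u http://<pd-ip>:2379 config set leader-schedule-limit 0
pd-ctl -u http://<pd-ip>:2379 config set merge-schedule-limit 0

# 2. Stop the surviving TiKV nodes, then on every one of them remove the
#    failed store from all region metadata
tikv-ctl --data-dir <tikv-data-dir> unsafe-recover remove-fail-stores -s <failed_store_id> --all-regions

# 3. Start the TiKV nodes again and restore the scheduling limits recorded in step 1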

| username: blaine | Original post link

Okay, thanks. I'll give it a try first.

| username: blaine | Original post link

I followed the documentation. Before starting TiKV, the store list in PD was empty; after starting TiKV, the store list in PD has data again. TiDB still cannot start and keeps looking for the IP that has already been taken offline.

| username: blaine | Original post link

The status of the offline IP remains Offline and does not change.

| username: h5n1 | Original post link

Please describe your operation process and provide the complete output of tiup cluster display.

| username: blaine | Original post link

First, I stopped TiKV. Second, I disabled PD scheduling. Third, on a normal TiKV node, I ran the following command:

/opt/tidb-community-server-v5.2.2-linux-amd64/tikv-ctl --data-dir /data1/tidb-data/tikv-20160 unsafe-recover remove-fail-stores -s 15698412 --all-regions

I didn't see any success message, but I could see it was processing. Do I need to execute this on every TiKV? Fourth, I restarted PD, and at that point the store list was empty. Fifth, I started TiKV, and then the store list had data again.

| username: h5n1 | Original post link

Step three needs to be executed on every normal TiKV node.
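If it helps, a loop like this can run the same command on every surviving node (the hostnames are placeholders; the binary path, data directory, and store ID are the ones from the earlier post, and the tikv-server process on each node must be stopped first):

for host in <tikv-host-1> <tikv-host-2> <tikv-host-3>
do
    # tikv-ctl operates on the local data directory, so the TiKV process there must be down
    ssh $host "/opt/tidb-community-server-v5.2.2-linux-amd64/tikv-ctl \
        --data-dir /data1/tidb-data/tikv-20160 \
        unsafe-recover remove-fail-stores -s 15698412 --all-regions"
done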

| username: blaine | Original post link

Okay, I’ll give it a try.

| username: blaine | Original post link

Before starting TiKV (screenshot omitted).

After starting TiKV, there is data again (screenshot omitted); I have also re-enabled scheduling.

| username: blaine | Original post link

What's going on here :innocent:? Compared to last time, though, the data volume is a bit lower.

| username: h5n1 | Original post link

What is the current status? What has been done?

| username: blaine | Original post link

It still hasn't been removed. Each time I execute it, the count only drops by a few, or at most a few hundred, entries.

| username: h5n1 | Original post link

The store you are looking at now is different from the previous one.