After Forcing TiFlash Offline, Regions Remain in a Down State and Cannot Be Cleared, Preventing TiFlash Scale-Out Nodes from Synchronizing Data

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: Tiflash强制下线后, 存在regions 处于down的状态无法清理, 导致tiflash scale-out的节点无法同步数据

| username: 济南小老虎

[TiDB Usage Environment] PoC
[TiDB Version] 6.5.3
[Reproduction Path] Executed a scale-in --force operation on TiFlash, then executed a scale-out and found that the TiFlash tables were not synchronizing.

[Encountered Issue: Symptoms and Impact] The scaled-in nodes no longer appear in tiup display, but they still show as Offline in pd-ctl store, and many regions are in a down state.
[Resource Configuration] 4 × 96-core Kunpeng servers, 512 GB memory, NVMe SSD
[Attachments: Screenshots/Logs/Monitoring]
Cluster Status:
Starting component cluster: /root/.tiup/components/cluster/v1.12.4/tiup-cluster display erptidb
Cluster type: tidb
Cluster name: erptidb
Cluster version: v6.5.3
Deploy user: tidb
SSH type: builtin
Dashboard URL: http://192.168.255.119:2379/dashboard
Grafana URL: http://192.168.255.121:3000
ID Role Host Ports OS/Arch Status Data Dir Deploy Dir


192.168.255.121:9093 alertmanager 192.168.255.121 9093/9094 linux/aarch64 Up /nvme00/tidb/alertmanager-9093 /deploy/tidb/alertmanager-9093
192.168.255.121:3000 grafana 192.168.255.121 3000 linux/aarch64 Up - /deploy/tidb/grafana-3000
192.168.255.119:2379 pd 192.168.255.119 2379/2380 linux/aarch64 Up|L|UI /nvme00/tidb/pd-2379 /deploy/tidb/pd-2379
192.168.255.121:2379 pd 192.168.255.121 2379/2380 linux/aarch64 Up /nvme00/tidb/pd-2379 /deploy/tidb/pd-2379
192.168.255.121:9090 prometheus 192.168.255.121 9090/12020 linux/aarch64 Up /nvme00/tidb/prometheus-9090 /deploy/tidb/prometheus-9090
192.168.255.119:4000 tidb 192.168.255.119 4000/10080 linux/aarch64 Up - /deploy/tidb/tidb-4000
192.168.255.120:4000 tidb 192.168.255.120 4000/10080 linux/aarch64 Up - /deploy/tidb-4000
192.168.255.120:4001 tidb 192.168.255.120 4001/10081 linux/aarch64 Up - /deploy/tidb-4001
192.168.255.121:4000 tidb 192.168.255.121 4000/10080 linux/aarch64 Up - /deploy/tidb/tidb-4000
192.168.255.121:9003 tiflash 192.168.255.121 9003/8125/3932/20172/20294/8236 linux/aarch64 Up /nvme02/tiflash/data/tiflash-9003 /deploy/tidb/tiflash-9003
192.168.255.119:20160 tikv 192.168.255.119 20160/20180 linux/aarch64 Up /nvme00/tidb/tikv/data/tikv-20160 /deploy/tidb/tikv-20160
192.168.255.119:20161 tikv 192.168.255.119 20161/20181 linux/aarch64 Up /nvme01/tidb/data/tikv-20161 /deploy/tidb/tikv-20161
192.168.255.119:50160 tikv 192.168.255.119 50160/50180 linux/aarch64 Up /nvme03/tidb/data/tikv-50160 /deploy/tidb/tikv-50160
192.168.255.120:20160 tikv 192.168.255.120 20160/20180 linux/aarch64 Up /nvme00/tidb/data/tikv-20160 /deploy/tidb/tikv-20160
192.168.255.120:20161 tikv 192.168.255.120 20161/20181 linux/aarch64 Up /nvme01/tidb/data/tikv-20161 /deploy/tidb/tikv-20161
192.168.255.120:40160 tikv 192.168.255.120 40160/40180 linux/aarch64 Up /nvme00/tidb/data/tikv-40161 /deploy/tidb/tikv-40160
192.168.255.120:40161 tikv 192.168.255.120 40161/40181 linux/aarch64 Up /nvme01/tidb/data/tikv-40162 /deploy/tidb/tikv-40161
192.168.255.121:20160 tikv 192.168.255.121 20160/20180 linux/aarch64 Up /nvme00/tidb/tikv/data/tikv-20160 /deploy/tidb/tikv-20160
192.168.255.121:20161 tikv 192.168.255.121 20161/20181 linux/aarch64 Up /nvme01/tidb/data/tikv-20161 /deploy/tidb/tikv-20161
192.168.255.122:30160 tikv 192.168.255.122 30160/30180 linux/aarch64 Up /nvme00/tidb/data/tikv-30161 /deploy/tidb/tikv-30160
192.168.255.122:30161 tikv 192.168.255.122 30161/30181 linux/aarch64 Up /nvme01/tidb/data/tikv-30162 /deploy/tidb/tikv-30161
192.168.255.122:30162 tikv 192.168.255.122 30162/30182 linux/aarch64 Up /nvme02/tidb/data/tikv-30162 /deploy/tidb/tikv-30162
192.168.255.122:40160 tikv 192.168.255.122 40160/40180 linux/aarch64 Up /nvme00/tidb/data/tikv-40161 /deploy/tidb/tikv-40160
192.168.255.122:40161 tikv 192.168.255.122 40161/40181 linux/aarch64 Up /nvme01/tidb/data/tikv-40162 /deploy/tidb/tikv-40161
192.168.255.122:40162 tikv 192.168.255.122 40162/40182 linux/aarch64 Up /nvme02/tidb/data/tikv-40162 /deploy/tidb/tikv-40162

  1. pd-ctl store output (the paste is truncated; only the two TiFlash stores are shown):

    {
      "count": 19,
      "stores": [
        {
          "store": {
            "id": 91,
            "address": "192.168.255.119:3931",
            "labels": [
              {
                "key": "engine",
                "value": "tiflash"
              }
            ],
            "version": "v6.5.3",
            "peer_address": "192.168.255.119:20171",
            "status_address": "192.168.255.119:20293",
            "git_hash": "e63e24991079fff1e5afe03e859f743cbb6cf4a7",
            "start_timestamp": 1694990902,
            "deploy_path": "/deploy/tidb/tiflash-9001/bin/tiflash",
            "last_heartbeat": 1695016186000543038,
            "state_name": "Offline"
          },
          "status": {
            "capacity": "1.718TiB",
            "available": "1.07TiB",
            "used_size": "55.47GiB",
            "leader_count": 0,
            "leader_weight": 1,
            "leader_score": 0,
            "leader_size": 0,
            "region_count": 5088,
            "region_weight": 1,
            "region_score": 1402721.0927894693,
            "region_size": 1152681,
            "learner_count": 5088,
            "slow_score": 1,
            "start_ts": "2023-09-18T06:48:22+08:00",
            "last_heartbeat_ts": "2023-09-18T13:49:46.000543038+08:00",
            "uptime": "7h1m24.000543038s"
          }
        },
        {
          "store": {
            "id": 92,
            "address": "192.168.255.121:3931",
            "labels": [
              {
                "key": "engine",
                "value": "tiflash"
              }
            ],
            "version": "v6.5.3",
            "peer_address": "192.168.255.121:20171",
            "status_address": "192.168.255.121:20293",
            "git_hash": "e63e24991079fff1e5afe03e859f743cbb6cf4a7",
            "start_timestamp": 1694990955,
            "deploy_path": "/deploy/tidb/tiflash-9001/bin/tiflash",
            "last_heartbeat": 1695016328659292069,
            "state_name": "Offline"
          },
          "status": {
            "capacity": "1.718TiB",
            "available": "821.3GiB",
            "used_size": "50.54GiB",
            "leader_count": 0,
            "leader_weight": 1,
            "leader_score": 0,
            "leader_size": 0,
            "region_count": 4361,
            "region_weight": 1,
            "region_score": 1402379.9771483315,
            "region_size": 1111245,
            "learner_count": 4361,
            "slow_score": 1,
            "start_ts": "2023-09-18T06:49:15+08:00",
            "last_heartbeat_ts": "2023-09-18T13:52:08.659292069+08:00",
            "uptime": "7h2m53.659292069s"
          }
        }

  2. TIKV_REGION_PEERS status
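
For reference, the down peers behind this check can be listed with a query along these lines (a sketch; the STATUS column of INFORMATION_SCHEMA.TIKV_REGION_PEERS reports peer health):

    -- Count region peers currently reported as DOWN, per store.
    SELECT STORE_ID, COUNT(*) AS down_peers
    FROM INFORMATION_SCHEMA.TIKV_REGION_PEERS
    WHERE STATUS = 'DOWN'
    GROUP BY STORE_ID;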


    tiflash_error.log (1.9 KB)
    tiflash_stderr.log (574 bytes)

| username: ti-tiger | Original post link

The scale-in --force operation of TiFlash may cause some regions to be in a down state because it forces the TiFlash node offline without waiting for data migration to complete. As a result, some region replicas may be lost, leading to data inconsistency and synchronization failures.

To resolve this issue, you can try the following steps:

  • Use the pd-ctl tool to check which stores are in the Offline state and record their IDs (a command sketch for these steps follows the list).
  • Move those stores to the Tombstone state so they no longer participate in scheduling and replication. For stores that will never come back, this can be forced with unsafe remove-failed-stores <store_id>; once a store is Tombstone, store remove-tombstone clears its record from PD.
  • Use the pd-ctl tool to check which regions still have peers in a down state and record their IDs.
  • Remove the down peers so that healthy replicas are scheduled elsewhere, for example with operator add remove-peer <region_id> <store_id>.
  • Use TiDB Dashboard or the TiFlash-Summary monitoring panel to confirm that TiFlash node and table synchronization returns to normal.
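
A minimal pd-ctl sketch of these checks, assuming PD is reachable at 192.168.255.119:2379 (per the cluster display above) and treating the region and store IDs as placeholders:

    # List all stores with their states; note the IDs of Offline stores.
    tiup ctl:v6.5.3 pd -u http://192.168.255.119:2379 store

    # List regions that still have a peer in a down state.
    tiup ctl:v6.5.3 pd -u http://192.168.255.119:2379 region check down-peer

    # Remove one region's peer from a dead store (IDs are placeholders).
    tiup ctl:v6.5.3 pd -u http://192.168.255.119:2379 operator add remove-peer <region_id> <store_id>

    # Once stores reach Tombstone, clear their records from PD.
    tiup ctl:v6.5.3 pd -u http://192.168.255.119:2379 store remove-tombstone
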
| username: h5n1 | Original post link

Your previous post already provided the solution.

| username: 济南小老虎 | Original post link

It doesn’t work. Trying ti-tiger’s solution.

| username: h5n1 | Original post link

What actions did you take?

| username: tidb菜鸟一只 | Original post link

Directly use the pd-ctl tool to execute unsafe remove-failed-stores; the store status will then change to Tombstone. After that, execute store remove-tombstone (see the sketch below). Anyway, your TiFlash data is already completely unusable.
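
As a sketch of that sequence (store IDs 91 and 92 are the Offline TiFlash stores from the pd-ctl output above; adjust to your cluster):

    # Force-remove the dead stores via Online Unsafe Recovery.
    tiup ctl:v6.5.3 pd -u http://192.168.255.119:2379 unsafe remove-failed-stores 91,92

    # Poll the recovery progress until it reports finished.
    tiup ctl:v6.5.3 pd -u http://192.168.255.119:2379 unsafe remove-failed-stores show

    # Then clear the Tombstone store records from PD.
    tiup ctl:v6.5.3 pd -u http://192.168.255.119:2379 store remove-tombstone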

| username: 济南小老虎 | Original post link

Trying. Is this process very long?

| username: 济南小老虎 | Original post link

  1. Some tables in TiFlash report a 9012 timeout error.
  2. Changing the replica count of these tables from 1 to 0 and then back to 1 has no effect (the statements are sketched after this list).
  3. Scaled in all three TiFlash nodes.
  4. Scaled out three new TiFlash nodes with new ports and directories.
  5. The previously scaled-in nodes are stuck in the Offline state.
  6. The new TiFlash nodes cannot synchronize data; all TiFlash tables sit at progress 0, and TiFlash CPU and disk usage are almost 0.
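
The replica toggle in step 2 corresponds to statements like the following (db.tbl is a placeholder table name):

    -- Drop the TiFlash replica, then request one again.
    ALTER TABLE db.tbl SET TIFLASH REPLICA 0;
    ALTER TABLE db.tbl SET TIFLASH REPLICA 1;
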
| username: 济南小老虎 | Original post link

Some nodes are still in the leaving (Offline) state. One is undergoing an unsafe remove.

| username: h5n1 | Original post link

What I mean is, what actions did you take according to the documentation provided in the previous post? Didn’t you say it didn’t work?

| username: 济南小老虎 | Original post link

Each post has a different question.

| username: 济南小老虎 | Original post link

The down nodes have all been handled, but TiFlash is still not synchronizing data. How should this be addressed?
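
For reference, per-table synchronization status can be checked with a query along these lines (PROGRESS stays at 0 in the situation described here):

    -- AVAILABLE and PROGRESS show TiFlash replica readiness per table.
    SELECT TABLE_SCHEMA, TABLE_NAME, REPLICA_COUNT, AVAILABLE, PROGRESS
    FROM INFORMATION_SCHEMA.TIFLASH_REPLICA;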

| username: 济南小老虎 | Original post link

TiFlash keeps generating logs like this, indicating that table schema information is not being synchronized into TiFlash. How should this be handled?

| username: 有猫万事足 | Original post link

It seems to be caused by this error.

The code location is shown above. The 600140 in the error is a table_id.

SELECT * FROM INFORMATION_SCHEMA.TIFLASH_REPLICA WHERE TABLE_ID = 600140;

Run the SQL above and check whether it returns any rows.

| username: 济南小老虎 | Original post link

There is no such table.

| username: 有猫万事足 | Original post link

I’m not sure why TiFlash needs to find the schema of this table. Let’s see what table it is. If there’s nothing again, it will be frustrating. :sweat_smile:
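
One way to see which table a table_id belongs to (a sketch; it only returns a row if the table still exists in the schema):

    -- Map a table_id back to its schema and table name.
    SELECT TABLE_SCHEMA, TABLE_NAME
    FROM INFORMATION_SCHEMA.TABLES
    WHERE TIDB_TABLE_ID = 600140;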

Is there anything in the tiflash_tikv.log file?

| username: 济南小老虎 | Original post link

There is no such table.

| username: 有猫万事足 | Original post link

Alright, at least I think the errors in the two files you uploaded can be ruled out. I looked at them, and I don’t think they are related to the missing synchronization.

| username: 济南小老虎 | Original post link

My head hurts. Do you have any other solutions? I’m completely out of ideas.

| username: 有猫万事足 | Original post link

The fallback plan is to reinstall. :joy: