TiFlash 7.1.1 Fails to Restart After the Apply Snapshot Range Check Fails

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 7.1.1 tiflash 应用快照,检查范围时不通过后无法重启

| username: Hacker_ojLJ8Ndr

Region 4903467509 is 54 MB. After setting profiles.default.dt_enable_ingest_check=false, besides skipping the check, will any data inconsistency that occurs be fixed automatically? Or is there another solution?
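
For reference, a minimal sketch of applying this setting through TiUP; the cluster name is a placeholder and the exact key layout under server_configs is an assumption, not taken from the thread:

# Open the cluster topology for editing (cluster name is a placeholder)
tiup cluster edit-config <cluster-name>

# Under server_configs, add the TiFlash profile setting (assumed key layout):
#   server_configs:
#     tiflash:
#       profiles.default.dt_enable_ingest_check: false

# Roll the new configuration out to the TiFlash nodes
tiup cluster reload <cluster-name> -R tiflash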

| username: dba-kit | Original post link

Since TiFlash synchronizes data from the upstream TiKV as a Learner, the simplest way to rebuild the data of a specific region in TiFlash is to trigger a split operation upstream.

tiup ctl:v7.1.1 pd operator add split-region 4742192293
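
As a rough sketch of the same approach with the PD address included (the PD address and region ID below are placeholders, not values from the thread):

# Inspect the region first to confirm its size and key range
tiup ctl:v7.1.1 pd -u http://<pd-host>:2379 region <region-id>

# Ask PD to split the region; the approximate policy avoids a full key scan
tiup ctl:v7.1.1 pd -u http://<pd-host>:2379 operator add split-region <region-id> --policy=approximate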

| username: dba-kit | Original post link

Additionally, this might be an issue that has been lingering for a while. Was the problematic region written using the physical import method of the tidb-lightning tool?

| username: tidb菜鸟一只 | Original post link

This should be a bug. Setting this parameter only ensures that the TiFlash process does not crash, but there may still be data inconsistency issues that cannot be resolved at the moment. Try upgrading to version 7.1.2…
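
If upgrading is the chosen route, a minimal sketch with TiUP, assuming v7.1.2 has been released and the cluster name is a placeholder:

# Upgrade the whole cluster to the patch release
tiup cluster upgrade <cluster-name> v7.1.2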

| username: Hacker_ojLJ8Ndr | Original post link

I don’t want to rebuild. This issue is causing the node to fail to start. We didn’t use tidb-lightning; we wrote the data with INSERT INTO statements.

| username: dba-kit | Original post link

In that case, after starting TiFlash, you can actively trigger a split of the erroneous region on the upstream TiKV. The old region should be directly discarded.

| username: Hacker_ojLJ8Ndr | Original post link

We are currently trying to figure out whether the inconsistency introduced by this parameter will be corrected by subsequent synchronization; otherwise, we will have to scale the node in and out again whenever we run into it.

| username: Hacker_ojLJ8Ndr | Original post link

The node is already down, and this region is only 54M.

| username: Hacker_ojLJ8Ndr | Original post link

Moreover, this approach cannot avoid the occurrence of this problem, right?

| username: tidb菜鸟一只 | Original post link

Right now, you either leave this parameter unset and accept the risk of data inconsistency, or you find the table that the region belongs to and remove that table’s TiFlash replica first.
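
A minimal sketch of that second option, using the region ID from the original question; the database and table names come back from the first query:

-- Find which database and table region 4903467509 belongs to
SELECT db_name, table_name
FROM information_schema.tikv_region_status
WHERE region_id = 4903467509;

-- Then remove that table's TiFlash replica (substitute the names returned above)
ALTER TABLE <db_name>.<table_name> SET TIFLASH REPLICA 0;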

| username: Hacker_ojLJ8Ndr | Original post link

Deleting the replica might not be any better than scaling the node in and out, because I have two replicas: as long as only one node has the problem, at least the other replica can still be used normally.
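
One way to confirm that the remaining replica is still serving is to check the replica status view; a hedged sketch:

-- Check replica count, availability, and sync progress per table
SELECT table_schema, table_name, replica_count, available, progress
FROM information_schema.tiflash_replica;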

| username: AnotherCalvinNeo | Original post link

It should have been fixed in this PR: Disable ingest range check by breezewish · Pull Request #6519 · pingcap/tiflash · GitHub

| username: Hacker_ojLJ8Ndr | Original post link

After reading that, I gave up on adjusting dt_enable_ingest_check :face_exhaling:

| username: JaySon-Huang | Original post link

Under certain conditions, TiKV splits regions at boundaries that can cause TiFlash to lose some rows after synchronizing data via apply snapshot. An explicit check was therefore added to avoid silent data inconsistency.

If you hit this issue on v7.1.1, you need to rebuild the TiFlash node, or temporarily disable the check via profiles.default.dt_enable_ingest_check so that the TiFlash node can start and you can then clean up the affected TiFlash replica. For now, however, there is no way to prevent the issue from happening again.

We have since made TiFlash handle such region boundaries compatibly, and the change has been merged into the release-7.1 branch. Once the next patch release, v7.1.2, is out, this issue will no longer occur.

| username: Hacker_ojLJ8Ndr | Original post link

Okay, thank you, teacher.

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.