Database Execution SQL Error: MVCC

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 数据库执行sql报错 mvcc

| username: fly4310862

[TiDB Usage Environment] Production Environment / Testing / PoC
[TiDB Version]
[Reproduction Path] Operations performed that led to the issue
[Encountered Issue: Issue Phenomenon and Impact]
[Resource Configuration]
[Attachments: Screenshots / Logs / Monitoring]

Currently, the executed SQL reports an error. The TiKV error logs and TiKV monitoring in the background are as follows:

TiKV node error log:
[ERROR] [errors.rs:316] ["txn aborts"] [err_code=KV:Unknown] [err="Txn(Mvcc(DefaultNotFound { key

[mod.rs:307] ["default value not found"] [hint=load_data_from_default_cf] [key=7

Monitoring: [screenshots not included]

| username: Jellybean | Original post link

This type of TiKV error is uncommon. Could you also post the error that occurred during SQL execution for us to take a look?

| username: wzf0072 | Original post link

Is the cluster node status, especially the TiKV node status, normal: Are there any abnormal stores or regions?
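
A quick way to run these checks is with pd-ctl via tiup. This is a sketch only: the PD address, version tag, and cluster details below are placeholders, not values from this thread.

```shell
# Hypothetical PD endpoint and version tag; substitute your own.
PD="http://<pd_ip>:2379"

# List all stores and their states (Up / Offline / Down / Tombstone)
tiup ctl:v6.5.0 pd -u "$PD" store

# Look for unhealthy regions: peers on down stores, or regions
# that are missing replicas
tiup ctl:v6.5.0 pd -u "$PD" region check down-peer
tiup ctl:v6.5.0 pd -u "$PD" region check miss-peer
```

Any store stuck in Offline or Down, or a non-zero count of down-peer/miss-peer regions, would point to the replica problems suspected here.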

| username: qiuxb | Original post link

Error Screenshot

| username: qiuxb | Original post link

Currently, one of the two TiFlash nodes is in a pending offline state after being scaled down, and the other is in a down state. One TiKV node has been scaled down and is in a pending offline state.

| username: wzf0072 | Original post link

The number of Regions whose Raft leader reports unresponsive peers is around 30K.

| username: wzf0072 | Original post link

How many TiKV nodes did you have before scaling down?

| username: BraveChen | Original post link

This may be caused by the ongoing scale-in.

| username: qiuxb | Original post link

Originally there were three TiKV nodes. The error appeared before the scale-in: one TiKV node reported “default value not found,” which caused inserts into some tables to fail. After scaling out by adding one TiKV node, the problematic node was scaled in.

| username: wzf0072 | Original post link

Observe the region health panel and check if down_peer_region_count is continuously decreasing.
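
Besides the Grafana region health panel, the same count can be polled from pd-ctl. A minimal sketch, assuming jq is installed and using placeholder PD address and version tag:

```shell
PD="http://<pd_ip>:2379"

# `region check down-peer` returns JSON; .count is the number of
# regions that currently have a down peer. Poll it once a minute.
while true; do
  tiup ctl:v6.5.0 pd -u "$PD" region check down-peer | jq '.count'
  sleep 60
done
```

During a healthy scale-in this number should trend toward zero as replicas are rebuilt on the remaining stores.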

| username: qiuxb | Original post link

It has not decreased; it keeps increasing because the TiFlash nodes are down, and some queries are now executed on TiKV, which raises the overall load.

| username: wzf0072 | Original post link

You can refer to the expert’s article

Three Key Techniques for Handling Abnormal TiKV Scale-Down and Offline Processes

| username: qiuxb | Original post link

Currently, two TiFlash nodes are not functioning properly. How can I forcefully take the TiFlash nodes offline and redo them?

| username: wzf0072 | Original post link

Remove the information of the offline TiFlash node from the TiUP topology:

tiup cluster scale-in <cluster-name> --node <tiflash_ip>:<tiflash_port> --force

| username: qiuxb | Original post link

With the store command (in pd-ctl), the region information for the node is still visible. How can these regions be deleted?
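
For context, a sketch of the usual cleanup path with pd-ctl (PD address, version tag, and store ID below are placeholders):

```shell
PD="http://<pd_ip>:2379"

# Check how many region peers PD still counts on the offline store
tiup ctl:v6.5.0 pd -u "$PD" store <store_id>

# Once the store has reached Tombstone state, clear its metadata from PD
tiup ctl:v6.5.0 pd -u "$PD" store remove-tombstone

# If the store is permanently lost and its regions can never be migrated,
# newer versions (v5.3+) provide online unsafe recovery. Data held only on
# the failed store is abandoned, so treat this strictly as a last resort:
# tiup ctl:v6.5.0 pd -u "$PD" unsafe remove-failed-stores <store_id>
```

In the normal case you do not delete regions by hand: PD migrates the peers off the offline store, and the region records disappear from it on their own.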

| username: wzf0072 | Original post link

Execute tiup cluster display <cluster-name> to check the status of the offline node and wait for its status to become Tombstone.
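
Once the node shows as Tombstone, tiup can clean it out of the topology. A sketch with a placeholder cluster name:

```shell
# Confirm the node's status first
tiup cluster display <cluster-name>

# Remove Tombstone nodes from the cluster topology
tiup cluster prune <cluster-name>
```

After pruning, the tombstoned node no longer appears in the display output.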