Cluster admin show DDL exception

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 集群admin show ddl异常

| username: dbaspace

[TiDB Usage Environment] Test
[TiDB Version] 4.0.8
[Reproduction Path] Operations performed that led to the issue
Initially, when executing admin show ddl jobs and insert statements, the cluster reported:
Error 1105: tikv aborts txn: Txn(Mvcc(DefaultNotFound { key: [109, 68, 66, 58, 50, 57, 50, 52, 51, 255, 0, 0, 0, 0, 0, 0, 0, 0, 247, 0, 0, 0, 0, 0, 0, 0, 104, 84, 97, 98, 108, 101, 58, 50, 57, 255, 50, 54, 56, 0, 0, 0, 0, 0, 250]
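As an aside, the byte array in this DefaultNotFound error is mostly printable ASCII, so it can be decoded to see which object the corrupt key points at; here it resolves to the meta key for DB:29243 / Table:29268. A minimal POSIX-shell sketch (the non-printable memcomparable padding bytes such as 255, 247, 250 and the zeros have been filtered out by hand):

```shell
# Printable bytes copied from the Error 1105 message above,
# with the non-ASCII encoding/padding bytes removed by hand.
bytes="109 68 66 58 50 57 50 52 51 104 84 97 98 108 101 58 50 57 50 54 56"
decoded=""
for b in $bytes; do
  # Convert each decimal byte value to its character via octal printf
  decoded="$decoded$(printf "\\$(printf '%03o' "$b")")"
done
echo "$decoded"   # prints: mDB:29243hTable:29268
```

This at least identifies the database ID and table ID involved, which helps narrow down which table the corruption belongs to.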
Also, executing create/drop hangs outright and never completes. Restarting tidb-server or adding a new node fails to start, and the startup log shows:

After performing an MVCC online repair on the cluster, covering 3 TiKV nodes, executing admin commands reports:
The current tidb-server prompts:


Writing to a table reports that the database does not exist; select/update/delete still work, but insert does not.
The log of the tidb-server instance that has not been restarted:

[Encountered Issues: Symptoms and Impact]
1. Restarted TiKV
2. Performed an MVCC online repair
3. Scaled out the tidb-server
[Resource Configuration]
[Attachments: Screenshots/Logs/Monitoring]

| username: db_user | Original post link

The symptoms match this issue, but no workaround seems to be given there.
pitr restore fail but report successfully · Issue #39920 · pingcap/tidb (github.com)
tidb meets error: ERROR 1008 (HY000): Can't drop database ''; database doesn't exist when dropping database · Issue #39606 · pingcap/tidb (github.com)
It feels like the statistics or indexes might be corrupted, or a data file might be damaged. You can try reloading the statistics of the affected table to see if that helps.

| username: dbaspace | Original post link

Currently, DDL cannot be executed at all; create/drop statements hang immediately.

| username: dbaspace | Original post link

The cluster's initial problem was that one of the node machines went down, and taking it offline took half a month. The offline process eventually succeeded after manual intervention, but the cluster has been in this state ever since.

| username: db_user | Original post link

Are there three TiKV nodes, with one broken and forced offline? Did you then scale out with new TiKV nodes? It looks like the data may be corrupted. Try these commands in pd-ctl to check:

# Check regions that have lost more than half of their replicas on the failed stores (1253, 4, 5)
region --jq='.regions[] | {id: .id, peer_stores: [.peers[].store_id] | select(length as $total | map(if .==(1253,4,5) then . else empty end) | length>=$total-length)}'

# Check regions that have lost all of their replicas
region --jq='.regions[] | {id: .id, peer_stores: [.peers[].store_id] | select(length as $total | map(if .==(1253,4,5) then . else empty end) | length==$total)}'

For the usage of jq, you can refer to this link.
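For reference, the check those jq filters perform can be sketched in plain shell. The store IDs 1253, 4, 5 are the failed stores from this thread, while the example peer placement below is made up:

```shell
# Plain-shell sketch of the region-replica check done by the jq filters.
failed="1253 4 5"    # store IDs of the failed TiKV nodes (from this thread)
peers="1253 4 2"     # hypothetical stores holding one region's three replicas

total=0; lost=0
for p in $peers; do
  total=$((total + 1))
  for f in $failed; do
    # Count replicas that live on a failed store
    [ "$p" = "$f" ] && lost=$((lost + 1))
  done
done

# A region has lost its majority when lost >= total - lost
if [ "$lost" -ge $((total - lost)) ]; then
  echo "majority lost: $lost of $total replicas are on failed stores"
fi
# All replicas lost when every peer is on a failed store
if [ "$lost" -eq "$total" ]; then
  echo "all replicas lost"
fi
```

With this example placement, 2 of 3 replicas sit on failed stores, so the region has lost its Raft majority but not all copies.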

| username: db_user | Original post link

Okay, check the cluster status and the region situation. There are many abnormal regions here.

| username: dbaspace | Original post link

Judging by the status, the number of replicas was reduced when I ran the operator add remove-peer commands this morning.

| username: dbaspace | Original post link

Looking at this, it seems it can’t be fixed.
PD monitoring is also OK:

| username: tidb菜鸟一只 | Original post link

Is the data in the system table normal?

| username: db_user | Original post link

  1. Set the automatic analyze start time away from the current time, so auto analyze won't trigger right now.
set global tidb_auto_analyze_start_time='00:00 +0000';
  2. Drop the table's statistics.
DROP STATS db_name.table_name;
  3. Manually analyze the table.
analyze table db_name.table_name;

(At this point, the problem is solved, and the health status is restored to 100.)

  4. Restore automatic analyze by setting the variable back to its previous value.
set global tidb_auto_analyze_start_time='...';

I don’t know if this method will work well. It’s quite a peculiar problem, somewhat similar to the loss of the frm file in MySQL.

| username: dbaspace | Original post link

This is normal.

| username: dbaspace | Original post link

Manually done. Same problem.

| username: dbaspace | Original post link

Have you encountered this problem before?
I checked and still can’t use BR to restore.

| username: db_user | Original post link

Indeed, BR cannot restore.


Someone performed a similar operation, but no one followed up on that issue: Don’t let users drop the key system tables. · Issue #20767 · pingcap/tidb (github.com)
You can try to see whether you can recreate the table manually.

| username: dbaspace | Original post link

I have verified that every database except the mysql database can be backed up successfully. I will try some other methods.

| username: dbaspace | Original post link

The test environment cluster is completely ruined, and it’s unclear how this table got lost.

| username: db_user | Original post link

Fortunately, it’s a test environment. Higher versions might be more stable. I’ve encountered something similar, though it wasn’t a system table. At that time, it seemed to be resolved by rebuilding the table. It feels like the SST file might be corrupted, and your region status doesn’t look quite normal either. If there are no other issues, you might want to use a previous backup to rebuild the setup. I’ll ask @Link to take a look at this problem. I’m not sure if it’s a bug or something else causing it.

| username: dbaspace | Original post link

Another cluster broke down today, with the same issue as this one: 直接存入tikv中的数据,与tidb的关系 (the relationship between data written directly into TiKV and TiDB) - TiDB Q&A community.

| username: db_user | Original post link

Did you write it through the API interface?

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.