TiKV Unable to Provide Normal Service for an Extended Period

This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TIKV 长时间无法正常服务

| username: TiDBer_jYQINSnf

[TiDB Usage Environment] Production Environment / Testing / PoC
[TiDB Version] v5.4.0
[Reproduction Path] What operations were performed that caused the issue
The cluster is approximately 7TB. Last Friday, a lot of data was deleted, roughly tens of billions of rows.
Then one of the TiKV nodes failed (I didn’t see it firsthand; someone else handled it at the time).
Today, I checked and found that tikv1 has been continuously logging the following messages:

Using pd-ctl to check, I see that tikv1 has is_busy: true
Currently, there is no leader on tikv1.

The question is: why is tikv1 always busy, and how can we get it to catch up quickly? All access to the cluster has been stopped, so we are free to experiment. Can we make tikv1 use 100% of its CPU to chew through this never-ending backlog? What should we adjust?
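For instance, what I have in mind is something like this (a sketch only; the address is a placeholder, and I'm not sure the raftstore pool sizes are online-adjustable in v5.4 — if not, it would mean editing the TiKV config file and restarting):

```shell
# Sketch: give the raftstore/apply thread pools more threads so the lagging
# store can use more CPU. Placeholder address; these items may not be
# online-adjustable in v5.4, in which case set them in tikv.toml and restart.
mysql -h 127.0.0.1 -P 4000 -u root -e '
  SET CONFIG tikv `raftstore.store-pool-size` = 4;
  SET CONFIG tikv `raftstore.apply-pool-size` = 4;
'
```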

Rebuilding is the last resort. Solving the problem by rebuilding is like a network administrator in an internet cafe asking you to restart; it lacks technical depth. :grinning:

| username: tidb菜鸟一只 | Original post link

Is the status of tikv1 normal now?

| username: h5n1 | Original post link

Please post the output of `pd-ctl store 4`, `pd-ctl region 81637137`, and `tiup cluster display`, plus a description of the earlier TiKV failure and how it was handled.

| username: 朵拉大虾 | Original post link

Take a look at the memory.

| username: 考试没答案 | Original post link

Check your GC time. What is that parameter set to?

| username: 考试没答案 | Original post link

You can try adjusting the GC parameters to see if there is too much garbage after deletion, causing the GC to be unable to keep up.
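For example, the current GC parameters and progress can be read from the `mysql.tidb` table (connection details are placeholders):

```shell
# Show the GC life time, run interval, last run time, and current safe point.
# Placeholder connection details.
mysql -h 127.0.0.1 -P 4000 -u root -e \
  "SELECT VARIABLE_NAME, VARIABLE_VALUE FROM mysql.tidb
   WHERE VARIABLE_NAME LIKE 'tikv_gc%';"
```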

| username: TiDBer_jYQINSnf | Original post link

tikv1: switching back and forth between Disconnected and Down. The posted screenshot shows tikv1 constantly flapping between the two states.

| username: TiDBer_jYQINSnf | Original post link

Now tikv1 (store4) shows Offline, because it was eventually rebuilt :smiley:
Previously, it kept switching between disconnect and down.

That region is now normal, and the replica on store4 is gone.

`tiup cluster display` can't be run because the cluster is managed by Kubernetes. The other TiKV instances are basically normal, but tikv0, tikv1 (store4), and tikv10 keep getting disconnected.

The previous issue was:
Under heavy read and write traffic, a large amount of data was deleted, which drove read and write latency up, so the business side stopped its workload. No one operated on the cluster after that. Today I saw that tikv1 (store4) had OOMed once; since then it has kept printing the logs above.

tikv0 kept logging the following logs.

The region was also down because store4 was down.
I looked at the code path for those logs; it says the message was ignored because a message addressed to a local peer arrived over the network.

So the question arises again, why would local messages be sent to other nodes? Under what circumstances would they be sent?

| username: TiDBer_jYQINSnf | Original post link

Regarding memory, one node experienced an OOM (Out of Memory) issue, while the other nodes are functioning normally. Currently, the memory usage of store4 is the highest among all nodes.

| username: TiDBer_jYQINSnf | Original post link

GC life time is 10 minutes

| username: TiDBer_jYQINSnf | Original post link

The data was deleted several days ago, so in theory GC should have been triggered long since. Now it feels like tikv1 (store4) has been out of service too long and has fallen behind; it's busy working through the deletions and so on.

| username: h5n1 | Original post link

Try removing the erroring region's peer from store 4: `pd-ctl operator add remove-peer 77813542 4`. Aren't there many regions reporting the same error?
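If many regions on store 4 are affected, the one-off command above can be looped over every region still holding a peer there — a sketch assuming `jq` is available and the PD address is a placeholder:

```shell
# Queue a remove-peer operator for every region that still has a peer on
# store 4. Placeholder PD address; jq parses pd-ctl's JSON output.
PD=http://127.0.0.1:2379
for id in $(pd-ctl -u "$PD" region store 4 | jq -r '.regions[].id'); do
  pd-ctl -u "$PD" operator add remove-peer "$id" 4
done
```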

| username: TiDBer_jYQINSnf | Original post link

That batch of logs appears intermittently, with different region IDs. After deleting them, the tikv0 logs show:

| username: TiDBer_jYQINSnf | Original post link

That cleared one, but now yet another region is reporting the same error.

| username: TiDBer_jYQINSnf | Original post link

Analyzed the monitoring data:
During the deletion there were also many read and write requests, and the delete itself had to scan the table, so disk I/O was nearly saturated.
Then compaction kicked in and saturated the disk I/O completely.
As a result, write latency was high.
The question is: why did things break down? If disk I/O is saturated, shouldn't the cluster simply work through the backlog as I/O frees up? It didn't return to normal even after a whole weekend. What went wrong?

Some monitoring data as shown in the images:


| username: h5n1 | Original post link

  1. Check if the number of regions in store 4 is decreasing in the monitoring. If it is decreasing, it means the migration is normal. If there is no change, you can use the above add remove-peer to handle all the regions on store 4; otherwise, store 4 will remain offline.
  2. After deleting a large amount of data, heavy I/O from GC and compaction is normal. What type of disk do you have? The performance seems somewhat lacking.
  3. In your scenario, I feel like you might have encountered a bug.
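For point 1, the region count on store 4 can be polled like this (PD address is a placeholder; assumes `jq` is installed):

```shell
# Watch whether store 4's region count keeps dropping, i.e. whether the
# offline migration is making progress. Placeholder PD address.
watch -n 10 "pd-ctl -u http://127.0.0.1:2379 store 4 | jq '.status.region_count'"
```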

| username: TiDBer_jYQINSnf | Original post link

It is decreasing. It shows Offline because store delete was executed to take store 4 offline and rebuild it. Before the store delete, it was sometimes Disconnected, and after a while it went Down.

The disk itself is fine; it's NVMe. However, the NVMe drives were carved into LVM volumes shared among the TiKV instances, so no single TiKV has a drive to itself. The monitoring makes it look like one TiKV's I/O has hit its limit, but in reality the other TiKVs are writing to the same physical disk.
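To confirm where the shared NVMe tops out, per-device utilization on the host can be checked (assumes the sysstat package is installed):

```shell
# Sample extended I/O statistics every second, five times; the %util and
# await columns show whether the shared NVMe device is saturated.
iostat -x 1 5
```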

This is what I want to understand: what kind of bug is it? TiDB seems prone to trouble when deleting large amounts of data. In my understanding, once compaction finishes it should be able to serve normally, but in reality the cluster just keeps idling along, as if saying: "I can't handle it, I'm giving up." :grin:
So, are there any parameters to urge it to get to work? :laughing:
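One lever that might qualify, sketched under assumptions (the address is a placeholder, and I'm going by tikv-ctl's documented compact options), is forcing a manual compaction on the lagging store so the delete marks are cleared sooner:

```shell
# Manually compact the write and default column families of the kv RocksDB
# on the lagging TiKV; --bottommost force also rewrites the bottom level so
# deletion marks are actually dropped. Placeholder address; compaction is
# itself I/O heavy, so run it while the cluster is idle.
tikv-ctl --host 127.0.0.1:20160 compact -d kv -c write --bottommost force
tikv-ctl --host 127.0.0.1:20160 compact -d kv -c default --bottommost force
```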

| username: h5n1 | Original post link

Is the network stable? I always feel that the network is a big issue on k8s.

| username: TiDBer_jYQINSnf | Original post link

The network is fine, we have a dedicated team managing the whole k8s setup. It’s indeed very difficult to handle everything by ourselves.

| username: 考试没答案 | Original post link

Is it restored now???