Version 5.0.4: Disk Full on a TiKV Node Whose Oldest Snapshot Duration Reached N Days

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 5.0.4 版本 某个节点tikv oldest snapshot duration 保留N天导致磁盘打满

| username: TiDBer_yyy

[TiDB Usage Environment] Production Environment
[TiDB Version] 5.0.4
[Encountered Problem: Phenomenon and Impact]
Disk usage on tikv-19 has reached 90%, while usage on the other TiKV nodes is below 75%.
[Attachments: Screenshots/Logs/Monitoring]

  • tikv-19

  • tikv-12

  • Problem: tikv-19 has far fewer Regions and Leaders than tikv-12, yet its disk usage is much higher.

    Additional Notes:

  1. Both tikv-19 and tikv-12 are cloud-vendor servers, each with 1.5 TB of disk space.
  2. tikv-19 = store-24, tikv-12 = store-7


| username: TiDB_C罗 | Original post link

  1. Confirm the machine configuration
  2. Confirm whether the usage is all by TiKV
  3. Check the Grafana TiKV-Details dashboard
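
For a quick cross-check of what PD believes each store holds, pd-ctl's store output is handy. A minimal sketch, assuming pd-ctl is launched through tiup and using a placeholder PD address (store IDs 24 and 7 are from this thread):

```shell
# Show capacity, available space, region/leader counts and scores per store.
tiup ctl:v5.0.4 pd -u http://<pd-host>:2379 store 24   # tikv-19
tiup ctl:v5.0.4 pd -u http://<pd-host>:2379 store 7    # tikv-12
```
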
| username: caiyfc | Original post link

First, check the disk; it may be that tikv-19's disk is not used only by TiKV. Also check how much space the logs on tikv-19 are taking up; the high disk usage might simply be too many log files.
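
A minimal sketch for splitting the usage between logs and data on the node; the directory layout below is an assumption (a typical tiup deployment), so substitute the actual paths:

```shell
df -h /data                      # overall usage on the TiKV volume
du -sh /data/tikv-20160/log      # TiKV / RocksDB log files
du -sh /data/tikv-20160/db       # KV RocksDB data (SST files)
du -sh /data/tikv-20160/raft     # raft log data
```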

| username: TiDBer_yyy | Original post link

  1. Machines 1 and 2 have identical configurations; each deploys a single TiKV instance, and only TiKV uses the disk.
  2. Which chart are you referring to?

| username: TiDBer_yyy | Original post link

The log files have already been cleaned up: rock.log-202*, raft-log-202*.

cd db  # there are still SST files from 2022 in here; not sure whether this is normal.
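
For reference, old SST files are not automatically a problem: bottom-level SSTs that compaction never rewrites keep their original timestamps. A hedged way to measure how much space the year-old files account for (run inside the db directory; -printf is GNU find):

```shell
# Sum the sizes of SST files last modified more than a year ago.
find . -name '*.sst' -mtime +365 -printf '%s\n' \
  | awk '{s += $1} END {printf "%.1f GiB\n", s / 2^30}'
```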

| username: TiDB_C罗 | Original post link

The Grafana TiKV-Details dashboard.

| username: 裤衩儿飞上天 | Original post link

Different region weights

| username: TiDBer_yyy | Original post link

Yes. Because disk usage kept increasing, I changed the store's weights: store weight 24 0.8 0.9.
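
For completeness, the full pd-ctl form of that change (the tiup invocation and PD address are assumptions):

```shell
# store weight <store_id> <leader_weight> <region_weight>
# Weights below 1 tell the balance schedulers to place proportionally
# fewer leaders/regions on store 24 (tikv-19).
tiup ctl:v5.0.4 pd -u http://<pd-host>:2379 store weight 24 0.8 0.9
```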

| username: 裤衩儿飞上天 | Original post link

The functions of region_weight and leader_weight: the PD balance schedulers aim for each store's Leader count and Region size to be proportional to its leader_weight and region_weight, so lowering a store's weights moves Leaders and Regions away from it.

| username: tidb菜鸟一只 | Original post link

The region_size you’re looking at doesn’t seem right: store7’s region_size is already 5 TB while store24’s is under 2 TB… yet store24’s disk usage is higher and store7’s is lower?

| username: TiDBer_yyy | Original post link

Yes, that’s exactly the situation. I suspected that TiKV was not compacting, so I manually triggered a compaction; as a result, disk usage spiked and the disk filled up completely.
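
For reference, a per-store manual compaction is usually triggered with tikv-ctl; a minimal sketch with a placeholder address. Note that compaction first writes new SSTs and only then deletes the old ones, so it temporarily needs extra headroom, which matches the disk filling up here:

```shell
# Compact the write CF of the KV RocksDB on a single TiKV instance.
# Expect disk usage to rise while the rewrite is in progress.
tikv-ctl --host <tikv-19-ip>:20160 compact -d kv -c write
```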

| username: zhanggame1 | Original post link

Check how far GC has advanced (the GC safe point), and see whether it has failed to run for a long time.
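
The GC safe point and last run time are recorded in the mysql.tidb table, so they can be checked from any TiDB server; a minimal sketch with placeholder connection details:

```shell
mysql -h <tidb-host> -P 4000 -u root -p -e "SELECT VARIABLE_NAME, VARIABLE_VALUE FROM mysql.tidb WHERE VARIABLE_NAME IN ('tikv_gc_safe_point', 'tikv_gc_last_run_time', 'tikv_gc_life_time')"
```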

| username: TiDBer_yyy | Original post link

GC runs every 10 minutes, and GC on this machine completes successfully every time.

| username: redgame | Original post link

Pay attention to region weight

| username: TiDBer_yyy | Original post link

The region weight was modified after the issue occurred.

| username: jansu-dev | Original post link

Could you export a full set of PD metrics? I'd like to look at things like the Region Score.

| username: TiDBer_yyy | Original post link

Sure, but the TiKV node in question has already been scaled in and its data purged.

| username: cy6301567 | Original post link

Regularly delete logs and check for hotspots.

| username: jansu-dev | Original post link

Yes. Metrics are kept for 30 days by default, so the data should still be there; the history is persisted in Prometheus.
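
One hedged way to export the history is Prometheus's snapshot API, assuming the Prometheus bundled with the cluster was started with --web.enable-admin-api (host is a placeholder):

```shell
# The response names a directory under <prometheus-data-dir>/snapshots/,
# which can then be archived and shared.
curl -XPOST http://<prometheus-host>:9090/api/v1/admin/tsdb/snapshot
```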

| username: TiDBer_yyy | Original post link

Sure, do you have an export command? I would like to learn it.