Help needed: TiKV disk is full on the TiCDC secondary cluster

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 求助,ticdc备库tikv硬盘满了

| username: zhanggame1

[TiDB Usage Environment] Production Environment / Testing / Poc
[TiDB Version] 7.5
[Encountered Issue: Issue Phenomenon and Impact]
On January 12, 2024, monitoring detected an issue with ticdc synchronization.


Primary database service check: Normal

Checked the changefeed status

It reported insufficient TiKV space. I checked the secondary database and found the disk was indeed full.

Primary database disk status

Logged into one TiKV node and found that the disk was indeed filled with SST files.

| username: 像风一样的男子 | Original post link

Are the GC settings the same for the primary and secondary clusters?
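
One way to answer this is to run the same statements on both clusters and compare the output. This is only a minimal sketch, assuming TiDB 5.0 or later, where the GC settings are exposed as `tidb_gc_*` system variables and GC progress is recorded in the `mysql.tidb` table:

```sql
-- Run on both the primary and the secondary cluster, then compare.

-- GC configuration, e.g. tidb_gc_life_time and tidb_gc_run_interval.
SHOW GLOBAL VARIABLES LIKE 'tidb_gc%';

-- GC progress as recorded by TiDB, e.g. tikv_gc_safe_point
-- and tikv_gc_last_run_time.
SELECT VARIABLE_NAME, VARIABLE_VALUE
FROM mysql.tidb
WHERE VARIABLE_NAME LIKE 'tikv_gc%';
```

A safe point that keeps advancing suggests GC is being triggered, but on its own it does not prove that disk space is being reclaimed.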

| username: 像风一样的男子 | Original post link

Use this SQL to query the size of each table (in MB), then compare the results between the primary and the secondary.

SELECT TABLE_SCHEMA, TABLE_NAME, ROUND(DATA_LENGTH/1024/1024, 2) AS data_mb, TABLE_ROWS FROM INFORMATION_SCHEMA.`TABLES` ORDER BY DATA_LENGTH DESC;

| username: zhanggame1 | Original post link

They are different: the primary is set to 24 hours and the secondary to 10 minutes.
The GC of the secondary database is progressing normally.

| username: zhanggame1 | Original post link

I deleted some data from the primary database around 9 AM today, so the table sizes don’t match. The replica ran into synchronization issues at 2 PM. Judging by the data volume, the replica is not large.

| username: tidb狂热爱好者 | Original post link

It feels like the GC is stuck.

| username: zhanggame1 | Original post link

GC seems to be progressing normally.

| username: zhanggame1 | Original post link

From the monitoring:


| username: tidb狂热爱好者 | Original post link

To be honest, 300GB for TiKV is really a bit small. It should have already failed.

| username: tidb狂热爱好者 | Original post link

Expand to 1TB, since it’s Alibaba Cloud anyway.

| username: zhanggame1 | Original post link

This is a local development and testing environment, so there isn’t much data. Even our production database only has a little over 100GB of data.

| username: CuteRay | Original post link

Can you check the monitoring for the number of TiKV leaders and regions in the primary and standby clusters?
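
If the monitoring panels are not at hand, roughly the same numbers can be read from SQL. This is a sketch, assuming the `INFORMATION_SCHEMA.TIKV_STORE_STATUS` view is available in this TiDB version; run it on both clusters and compare:

```sql
-- Per-store leader and region counts, plus disk capacity and free space
-- as reported to PD.
SELECT STORE_ID, ADDRESS, STORE_STATE_NAME,
       LEADER_COUNT, REGION_COUNT,
       CAPACITY, AVAILABLE
FROM INFORMATION_SCHEMA.TIKV_STORE_STATUS
ORDER BY STORE_ID;
```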

| username: 像风一样的男子 | Original post link

After you deleted the data, was the disk space released? If not, could the GC be stuck?

| username: zhanggame1 | Original post link

I watched for a while and the space was not released. I’ll try a manual compaction next.

| username: 随缘天空 | Original post link

Expand it, this disk space is too small.

| username: zhanggame1 | Original post link

The data is also very small.

| username: 小龙虾爱大龙虾 | Original post link

Check from what time the data started to increase, and whether there were any anomalies in the monitoring when it started to increase.

| username: zhanggame1 | Original post link

Standby Database


Primary Database

| username: zhanggame1 | Original post link

I queried the total region size on both clusters with this statement:

SELECT SUM(t.APPROXIMATE_SIZE) FROM INFORMATION_SCHEMA.TIKV_REGION_STATUS t;

Primary database: 924533
Replica database: 1030171

The difference is not very significant: the replica is only about 11% larger.
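
To see whether particular tables account for the gap, the same view can be grouped per table. This is a sketch, assuming `APPROXIMATE_SIZE` is reported in MB as the TiDB documentation describes. The figure is an estimate and need not match the space the SST files occupy on disk:

```sql
-- Largest tables by approximate region size; run on both clusters.
-- Note: a Region holding data for several tables may be listed once
-- per table, so these sums can overcount.
SELECT DB_NAME, TABLE_NAME,
       COUNT(DISTINCT REGION_ID) AS regions,
       SUM(APPROXIMATE_SIZE) AS approx_size_mb
FROM INFORMATION_SCHEMA.TIKV_REGION_STATUS
GROUP BY DB_NAME, TABLE_NAME
ORDER BY approx_size_mb DESC
LIMIT 20;
```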

| username: tidb狂热爱好者 | Original post link

Such a difficult question.