Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.
Original topic: PD's mvcc_db_total_size is too large
[TiDB Usage Environment] Production Environment
[TiDB Version] v4.0.11
[Reproduction Path] None
[Encountered Problem: Phenomenon and Impact]
There are two clusters, a primary and a secondary (A and B). Data is kept consistent between them through application-side dual writes plus TiCDC. We recently noticed that the mvcc_db_total_size of cluster B has grown to around 5 GB, while cluster A's is under 100 MB.
I would like to understand the possible reasons and how to reduce it.
[Resource Configuration]
[Attachments: Screenshots/Logs/Monitoring]
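(For reference, the mvcc_db_total_size panel corresponds to the etcd_mvcc_db_total_size_in_bytes gauge from PD's embedded etcd. A quick way to read it outside Grafana, assuming the default client port 2379 and a placeholder PD host, is roughly:)
# Read the etcd backend size gauges straight from PD's metrics endpoint
curl -s http://<pd-host>:2379/metrics | grep etcd_mvcc_db_total_size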
Check if the GC configurations of the two clusters are consistent…
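A minimal way to compare the GC settings, assuming you can reach each cluster with a MySQL client (host and port are placeholders), is to run the same query on both clusters:
# Show GC-related variables (gc_life_time, gc_run_interval, last_run_time, safe_point, ...)
mysql -h <tidb-host> -P 4000 -u root -e "SELECT variable_name, variable_value FROM mysql.tidb WHERE variable_name LIKE 'tikv_gc%';"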
I checked the PD configuration of both clusters.
Cluster with the high etcd size
pd:
auto-compaction-mode: revision
auto-compaction-retention: "5"
log.file.max-days: 3
schedule.leader-schedule-limit: 4
schedule.region-schedule-limit: 2048
schedule.replica-schedule-limit: 64
Normal cluster
pd:
auto-compaction-mode: revision
auto-compaction-retention: "5"
log.file.max-days: 7
quota-backend-bytes: 17179869184
schedule.leader-schedule-limit: 4
schedule.region-schedule-limit: 2048
schedule.replica-schedule-limit: 64
So the only difference is quota-backend-bytes, which is the storage quota for PD's embedded etcd backend. The default is 8 GiB, yet it is the normal cluster that has the larger quota (16 GiB), so this doesn't explain the size difference.
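Besides the deployment topology, the effective PD configuration can also be dumped on each cluster and diffed; a sketch with pd-ctl, using a placeholder PD address:
# Dump the running PD configuration of each cluster for comparison
pd-ctl -u http://<pd-host>:2379 config show all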
You checked the wrong place; check this instead:
The tikv_gc_safe_point looks odd. Normally the gap between the current time and the safe point is about 2 weeks, but on the cluster with the high etcd size the gap is 8 weeks.
Then you need to check why the GC didn’t execute. Look at the logs to see where it got stuck or what happened.
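One way to see what the GC worker was doing, assuming default log locations on the TiDB nodes (the path below is a placeholder), is to grep the TiDB logs for the GC worker messages; note that only one tidb-server is the GC leader, so check each node:
# GC progress and errors are logged by the "gc worker" on the TiDB GC leader
grep -i "gc worker" /path/to/deploy/log/tidb.log | tail -n 50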
It seems to be an issue with CDC, since the safe point was not advancing. After restarting several nodes, the safe point now points to the current time, but the etcd size does not appear to have changed. Does it need to wait for the next GC run?
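If the pd-ctl version at hand supports the service-gc-safepoint subcommand, the service GC safepoints registered in PD (including any held by TiCDC) can be inspected directly; a sketch with a placeholder PD address:
# List the GC safepoint and any service safepoints (e.g. one registered by ticdc)
pd-ctl -u http://<pd-host>:2379 service-gc-safepoint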
The GC safe point time is today, and it has changed compared to the previous one.
If GC is functioning normally, the MVCC version information will be cleaned up properly; it just needs more time.
If GC is not functioning normally, you can troubleshoot the GC issues.
This might be a bug in an older version of TiCDC. We recently hit a similar problem where excessive etcd metadata took up space in the PD cluster.
It has been 24 hours since GC recovered, but the etcd size is still very high, so I assume something else is causing it. Could it be related to the large number of empty regions and the overall region count? If etcd stores region meta info, that seems quite possible.
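To see how many empty regions there actually are, pd-ctl can list them (PD address is a placeholder); the count gives a rough sense of how much region metadata is involved:
# List regions whose approximate size is 0 (empty regions)
pd-ctl -u http://<pd-host>:2379 region check empty-region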
I confirmed that the stall of the GC safepoint is related to CDC: the time it stopped advancing coincides exactly with the TSO of a stopped CDC task. After cleaning up that CDC task, GC returned to normal, but the etcd size still has not changed.
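For reference, stale changefeeds and their checkpoint TSOs can be listed and removed with the cdc client; a sketch, assuming the cdc binary is available and using placeholder addresses and IDs:
# List all changefeeds and their checkpoint TSOs
cdc cli changefeed list --pd=http://<pd-host>:2379
# Remove a changefeed that is no longer needed (id is a placeholder)
cdc cli changefeed remove --pd=http://<pd-host>:2379 --changefeed-id=<changefeed-id>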
Empty regions can be merged… This can reduce resource usage and speed up the process~
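If merging is slow, the merge-related scheduling parameters can be raised via pd-ctl (values below are only illustrative, and enable-cross-table-merge may not exist on every version):
# Raise the number of concurrent merge operations (default is 8)
pd-ctl -u http://<pd-host>:2379 config set merge-schedule-limit 16
# If this PD version supports it, let empty regions from different tables merge
pd-ctl -u http://<pd-host>:2379 config set enable-cross-table-merge true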
What is the upstream CDC version? Older versions of TiCDC have bugs. Refer to this: TiCDC Common Issues and Troubleshooting | PingCAP Documentation Center
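For reference, the cdc binary prints its release and build information with the version subcommand (assuming it is on the path of a TiCDC node):
# Print the TiCDC release version and build info
cdc version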
Is it the CDC task that is affecting it?
It doesn’t seem to be the version mentioned in the documentation
Release Version: v4.0.11
Git Commit Hash: 52a6d9ea6da595b869a43e13ae2d3680354f89b8
Git Branch: heads/refs/tags/v4.0.11
UTC Build Time: 2021-02-25 16:40:37
Go Version: go version go1.13 linux/amd64
It probably was before; after cleaning up the CDC tasks, the GC safepoint has returned to normal. However, the etcd size still sits at around 5 GB with no change.
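(One general point about etcd itself, which PD embeds: compaction only frees old revisions inside the backend database file, so etcd_mvcc_db_total_size does not shrink until the member is defragmented. A heavily hedged sketch with etcdctl v3 against the PD client endpoint, to be run carefully and on one PD member at a time in production, might look like:)
# Show the current backend database size of each PD/etcd member
ETCDCTL_API=3 etcdctl --endpoints=http://<pd-host>:2379 endpoint status -w table
# Reclaim the space freed by compaction (briefly blocks this member)
ETCDCTL_API=3 etcdctl --endpoints=http://<pd-host>:2379 defrag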