PD generates a large number of logs when TiDB executes DDL

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: PD generates a large number of logs when TiDB executes DDL

| username: zhanggame1

[Test Environment for TiDB] Testing
[TiDB Version] 7.1
[Reproduction Path] Operations that led to the issue
PD generates a large volume of logs when TiDB executes DDL, reaching several gigabytes per hour. Because the PD node had limited available space, the disk filled up.

This issue occurred once when dropping many partitions from a partitioned table, and it happened again today when a DDL got stuck (eventually resolved by restarting; see the issue "A DDL statement that neither finishes nor can be stopped" - #14, from TiDBer_oHSwKxOH - TiDB Q&A community).

Below is one of the logs:
One Log.txt (647.9 KB)

A partial screenshot of one log:

| username: wenyi | Original post link

Could you provide the specific DDL statements?

| username: zhanggame1 | Original post link

Executed “alter table … drop partition xxx” many times.
Another time it was a “truncate table”.

| username: TiDB_C罗 | Original post link

If it’s not an issue with the log level, then it’s a bug.

| username: zhanggame1 | Original post link

Information

| username: 有猫万事足 | Original post link

Judging from where this line sits in the code, it is just routine output after a successful execution, which is why the log level is info. Let’s see who calls this SetAllGroupBundles.

Fortunately, it has only one caller, and that caller is bound to an HTTP endpoint. The next question is who calls this HTTP endpoint. It is only triggered by a POST request, so let’s search for POST plus part of the URL. Also, you executed drop partition, which is something TiDB has to carry out, so try looking inside the TiDB codebase.

https://github.com/pingcap/tidb/blob/master/domain/infosync/placement_manager.go#L78C1-L78C1

There is still only one matching place, the one above. You can see that it passes all the placement rules as parameters, one per partition; that is the long JSON string in your log. So who uses this PutRuleBundles?

https://github.com/pingcap/tidb/blob/master/domain/infosync/info.go#L566
This time there are many candidate call sites, but the most suspicious one is here. In a distributed system you obviously cannot expect every call to succeed, so the call is wrapped in a retry mechanism.

https://github.com/pingcap/tidb/blob/master/ddl/partition.go#L1896C18-L1896C38

Finally, our main character appears: this retry-wrapped method is called during onDropTablePartition. You can see that a single drop calls PutRuleBundlesWithDefaultRetry twice, once at the beginning and once when the job is in the public state. I am not knowledgeable enough to explain why there are two calls, and I hope some experts can. Also, if you look at the other call sites in this file, you will find that adding and exchanging partitions call it as well.

Let me summarize, somewhat irresponsibly: this log outputs the placement rules, one per partition. You have many partitions, so each request is very large. The call happens when partitions are added or dropped, and twice per drop. If a call fails, it is retried. I am not sure whether the log also contains other error messages from rule_manager.go; if it does, retries may actually have happened.

My skills are limited, so I can’t guarantee this is exactly what happened, but it is the result of a careful read of the code. :joy: Feel free to correct me or add your insights.

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.