【TiDB Usage Environment】POC
【TiDB Version】V6.1
【Encountered Problem】
【Reproduction Path】What operations were performed to encounter the problem
【Problem Phenomenon and Impact】
[tikv:9006]GC life time is shorter than transaction duration, transaction starts at 2022-06-29 20:05:32.887 +0800 CST, GC safe point is 2022-06-29 20:15:15.686 +0800 CST
Isn’t this design somewhat unreasonable?
From what I have read, version 6.1 should have addressed the GC issue during auto analyze. Earlier versions did not take the runtime of auto analyze into account during GC. In v6.1, to prevent auto analyze from blocking GC for too long, the tidb_max_auto_analyze_time variable was introduced to cap the maximum analyze duration. You might also need to adjust tidb_gc_life_time (which determines how far the GC safe point trails behind) and see if that helps.
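For reference, a minimal sketch of capping the auto analyze duration (the value is illustrative, not a recommendation; the variable takes seconds and defaults to 43200, i.e. 12 hours):

-- Cap each auto analyze run at 1 hour (illustrative value; unit is seconds)
SET GLOBAL tidb_max_auto_analyze_time = 3600;
-- Verify the new setting
SHOW VARIABLES LIKE 'tidb_max_auto_analyze_time';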
I will first adjust the tidb_gc_life_time parameter and observe.
MySQL [(none)]> show variables like '%tidb_max_auto_analyze_time%' ;
+----------------------------+-------+
| Variable_name              | Value |
+----------------------------+-------+
| tidb_max_auto_analyze_time | 43200 |
+----------------------------+-------+
1 row in set (0.00 sec)
MySQL [(none)]> show variables like '%tidb_gc_life_time%' ;
+-------------------+-------+
| Variable_name     | Value |
+-------------------+-------+
| tidb_gc_life_time | 10m0s |
+-------------------+-------+
1 row in set (0.00 sec)
MySQL [(none)]> set global tidb_gc_life_time='20m';
Query OK, 0 rows affected (0.04 sec)
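To confirm the new value has actually been picked up by the GC worker, the GC bookkeeping in the mysql.tidb table can be queried (a sketch; this is the standard GC metadata table):

-- Inspect the GC worker's recorded life time, last run time, and safe point
SELECT VARIABLE_NAME, VARIABLE_VALUE
FROM mysql.tidb
WHERE VARIABLE_NAME IN ('tikv_gc_life_time', 'tikv_gc_last_run_time', 'tikv_gc_safe_point');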
After several adjustments, I finally saw a few successful attempts.
MySQL [(none)]> set global tidb_gc_life_time='20m';
Query OK, 0 rows affected (0.04 sec)
MySQL [(none)]> set global tidb_gc_life_time='30m';
Query OK, 0 rows affected (0.01 sec)
MySQL [(none)]> set global tidb_gc_life_time='60m';
Query OK, 0 rows affected (0.02 sec)
But there’s a problem. When the default tidb_gc_life_time='10m', it takes 20 minutes to report an error.
When set to 20m, it takes 30 minutes to report an error.
When set to 30m, it takes 40 minutes to report an error.
How should this be understood?
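One likely explanation (worth verifying against the docs): the GC safe point only advances when the GC worker runs, and the worker runs every tidb_gc_run_interval, which defaults to 10m0s. That would make the error appear roughly tidb_gc_life_time plus one GC interval after the transaction starts. The interval can be checked like this:

-- The GC worker runs on this interval; the safe point only moves forward on each run
SHOW VARIABLES LIKE 'tidb_gc_run_interval';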
Setting it to a few hours in production shouldn’t be an issue, right? For routine maintenance, what is the generally recommended setting? We currently have it set to the default of 10 minutes.
Having too many historical versions can affect read performance, but it ultimately depends on the actual situation. The default of 10 minutes is a bit short.
Check tidb.log for messages other than the GC life time error. Auto analyze is triggered by one of the TiDB servers, and that server's log contains the phrase “auto analyze trigger”.
I collected statistics manually once, and it completed in less than 10 minutes. The large table currently has no data changes, so automatic collection has not been triggered. TiDB experts recommend collecting statistics manually for large tables.
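For illustration (the db1.big_table names below are hypothetical), you can check how much a table has changed since statistics were last collected and, if needed, collect manually:

-- Rows changed since the last statistics collection (hypothetical database/table names)
SHOW STATS_META WHERE Db_name = 'db1' AND Table_name = 'big_table';
-- Collect statistics manually for the large table
ANALYZE TABLE db1.big_table;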
The concurrency of automatic collection is 1. For manual collection, the concurrency can be tuned with parameters such as tidb_build_stats_concurrency and tidb_distsql_scan_concurrency.
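A minimal sketch of raising those knobs for a one-off manual run (session scope so other sessions are unaffected; the values and table name are illustrative):

-- Concurrency used when building statistics during ANALYZE (default 4)
SET SESSION tidb_build_stats_concurrency = 8;
-- Concurrency of scan requests sent to TiKV (default 15)
SET SESSION tidb_distsql_scan_concurrency = 30;
-- Hypothetical table name
ANALYZE TABLE db1.big_table;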
The reason for this issue is that the default GC time is intentionally not very long (on the order of 10–20 minutes), otherwise too many data versions might affect performance. However, some long-running background tasks, such as the analyze in this case, can take longer than that once the cluster grows a bit. Previously, the analyze task worked on a snapshot at a fixed version, and GC deleted snapshots older than the GC life time, which caused this error.
This issue is expected to be fixed in version 6.2.
In 6.2, analyze will read at the max int64 TSO instead of a fixed snapshot TSO. This slightly reduces the accuracy of the collected statistics, but that is acceptable for the analyze scenario.
In older versions, a workaround is to extend the GC time or avoid using auto analyze and instead perform analyze manually.
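A sketch of that workaround on older versions (the values are illustrative; tidb_auto_analyze_start_time/end_time only restrict when auto analyze may run, they do not disable it entirely):

-- Give long-running analyze jobs more headroom before GC removes their snapshot
SET GLOBAL tidb_gc_life_time = '2h';
-- Restrict auto analyze to an off-peak window and rely on manual ANALYZE otherwise
SET GLOBAL tidb_auto_analyze_start_time = '01:00 +0800';
SET GLOBAL tidb_auto_analyze_end_time = '03:00 +0800';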
A question for the experts: I noticed that some issues and design docs seem to say that since version 6.0, internal transactions such as auto analyze have been taken into account in GC safe point management, which is why the tidb_gc_max_wait_time variable was introduced later. Is that correct?
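(For what it's worth, on v6.1 that variable can be inspected like any other system variable; whether it already covers internal transactions such as auto analyze is exactly the question above.)

-- Maximum time, in seconds, that an active transaction can hold back the GC safe point
SHOW VARIABLES LIKE 'tidb_gc_max_wait_time';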