[Analysis] Why Bernoulli Sampling Has Advantages in Collection Efficiency and Resource Usage

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: [Analyze] 伯努利采样为什么在收集效率和资源使用上更有优势

| username: TiDBer_D7483dYr

The document mentions that resources can be understood because the number of samples for sampling is limited. But where is the efficiency improvement? I see that the collect_column_stats function in the TiKV code still performs a full table scan. It seems that the fast analyze feature has also been deprecated. Is there any other alternative solution?

| username: winoros | Original post link

If the final sample set size is n when using reservoir sampling, each region needs to sample n rows, and TiDB will perform reservoir sampling again. In this case, if we want to sample around 100,000 rows by default, each region needs to return 100,000 rows. Given that a TiDB region typically contains around 1 million rows, this means the actual sampling rate reaches 10%, which results in a lot of redundant data being sampled for very large tables.

Using Bernoulli sampling avoids this issue. In fact, when switching from reservoir to Bernoulli sampling, the sample set size increased from the default 10,000 rows to around 100,000 rows. At the same time, the speed of Analyze actually becomes faster.

On the other hand, because the original sample set size was only 10,000, constructing histograms for index statistics based solely on sample data resulted in larger errors. After expanding the sample set to 100,000, histograms for indexes can also be constructed through sampling, which means only reading table kv is needed, saving the cost of reading Index kv. (However, due to unclean code refactoring, indexes like index idx(a(10)) that include column prefixes or virtual columns still need to read Index kv. This is an area for optimization to improve collection efficiency and resource usage.)

Therefore, overall collection efficiency has improved.

It seems that fast analyze has been deprecated. Is there any alternative solution?

Alternative solutions will be tested later. However, the expectation is to avoid introducing solutions that reduce accuracy like the original fast analyze. Instead, the priority will be to improve the speed of the default Analyze (i.e., prioritize improving speed without compromising accuracy, rather than improving speed by reducing collection accuracy).

| username: TiDBer_D7483dYr | Original post link

Thank you for the response~
Previously, index histograms were not calculated by sampling, right? I remember it was scanning each index region and constructing a histogram for each index region, then merging these histograms at the SQL layer. Because scanning the index is ordered.

I understand that the main efficiency optimizations are:

  1. The number of samples merged at the SQL layer has decreased.
  2. Indexes can also be sampled, and in most scenarios, you only need to scan the record row regions once to construct various statistical information for the index.

So currently, it’s still a full table scan + some acceleration methods (mentioned above, 2, and not analyzing certain types of columns).

Have you considered collecting statistical information without scanning the entire table? This would definitely be faster, but the accuracy is indeed hard to guarantee, and even the row count information might be inaccurate.

| username: winoros | Original post link

Have you considered collecting statistics without a full table scan?

Currently, we are not inclined towards this approach. The main reason is that the NDV (number of distinct values) obtained through hash sketch-based algorithms (fmsektch, hyperloglog) based on full data and those obtained through sampling have a significant difference in accuracy. So, although many other databases are entirely based on sampling, TiDB will not consider this approach in the short term.

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.