If the final sample set size is n when using reservoir sampling, each region needs to sample n rows, and TiDB will perform reservoir sampling again. In this case, if we want to sample around 100,000 rows by default, each region needs to return 100,000 rows. Given that a TiDB region typically contains around 1 million rows, this means the actual sampling rate reaches 10%, which results in a lot of redundant data being sampled for very large tables.
Using Bernoulli sampling avoids this issue. In fact, when switching from reservoir to Bernoulli sampling, the sample set size increased from the default 10,000 rows to around 100,000 rows. At the same time, the speed of Analyze actually becomes faster.
On the other hand, because the original sample set size was only 10,000, constructing histograms for index statistics based solely on sample data resulted in larger errors. After expanding the sample set to 100,000, histograms for indexes can also be constructed through sampling, which means only reading table kv is needed, saving the cost of reading Index kv. (However, due to unclean code refactoring, indexes like index idx(a(10))
that include column prefixes or virtual columns still need to read Index kv. This is an area for optimization to improve collection efficiency and resource usage.)
Therefore, overall collection efficiency has improved.
It seems that fast analyze has been deprecated. Is there any alternative solution?
Alternative solutions will be tested later. However, the expectation is to avoid introducing solutions that reduce accuracy like the original fast analyze. Instead, the priority will be to improve the speed of the default Analyze (i.e., prioritize improving speed without compromising accuracy, rather than improving speed by reducing collection accuracy).