Can the sampling ratio for collecting TiDB statistics be adjusted?

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TIDB 统计信息收集的采样比例可以调整么

| username: residentevil

[TiDB Usage Environment] Production Environment
[TiDB Version] v6.1.7
[Encountered Problem: Problem Phenomenon and Impact] When manually triggering analyze table TAB, the returned result shows that auto adjust is 0.012% of the rows. Can this ratio be adjusted?

| username: xfworld | Original post link

It can be done manually or automatically, and the ratio can be adjusted according to your own needs.

However, the level of support for this feature varies across different versions. Please refer to the TiDB version documentation for adjustments.

Reference documentation:

| username: residentevil | Original post link

I thought it could be achieved through configuration, but this makes it more complicated.

| username: 大飞哥online | Original post link

It’s too much trouble to change that.

| username: Jellybean | Original post link

Of course, you can adjust them, but it is recommended to first understand why you are adjusting certain parameters and set the most appropriate values based on your own business situation.

| username: Fly-bird | Original post link

The error message indicates that the tidb-server process was killed by the OOM (Out of Memory) killer. This usually happens when the system runs out of memory and the kernel decides to kill a process to free up memory. You can check the system logs (e.g., /var/log/messages or dmesg) to confirm this. If this is the case, you may need to increase the memory available to the system or optimize the memory usage of your TiDB cluster.

| username: 像风一样的男子 | Original post link

No need to make any changes.

| username: TiDBer_小阿飞 | Original post link

It can be modified, but it’s relatively troublesome. WITH + statistics configuration item.

| username: TiDBer_vfJBUcxl | Original post link

It’s too troublesome to change.

| username: residentevil | Original post link

Couldn’t find the parameters that need adjustment.

| username: Jellybean | Original post link

The official website mentioned by the previous expert has it.

| username: Jellybean | Original post link

ANALYZE TABLE TableNameList [WITH NUM BUCKETS|TOPN|CMSKETCH DEPTH|CMSKETCH WIDTH]|[WITH NUM SAMPLES|WITH FLOATNUM SAMPLERATE];

  • WITH NUM BUCKETS is used to specify the upper limit of the number of buckets generated for the histogram.
  • WITH NUM TOPN is used to specify the upper limit of the number of TOPN generated.
  • WITH NUM CMSKETCH DEPTH is used to specify the depth of the CM Sketch.
  • WITH NUM CMSKETCH WIDTH is used to specify the width of the CM Sketch.
  • WITH NUM SAMPLES is used to specify the number of samples.
  • WITH FLOAT_NUM SAMPLERATE is used to specify the sampling rate.

WITH NUM SAMPLES and WITH FLOAT_NUM SAMPLERATE correspond to two different algorithms for collecting samples.

  • WITH NUM SAMPLES specifies the size of the sample set, which is implemented in TiDB using reservoir sampling. It is not recommended to use this method to collect statistics when the table is large because the intermediate result set in reservoir sampling can produce certain redundant results, causing additional pressure on memory and other resources.
  • WITH FLOAT_NUM SAMPLERATE is a sampling method introduced in v5.3.0, specifying the size of the sampling rate, which is a parameter in the range (0, 1]. It is implemented in TiDB using Bernoulli sampling, which is more suitable for sampling large tables and has advantages in collection efficiency and resource usage.
| username: 喵父666 | Original post link

It can be adjusted.

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.