What is the data volume ratio for TiDB queries to perform a full table scan instead of using an index?

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tidb查询数据量比例是多少走全表扫描,不走索引

| username: wenyi

May I ask what the threshold ratio is at which TiDB chooses a full table scan instead of an index? Beyond some ratio, the cost of a full table scan is lower than that of an index scan, since the index scan also requires table lookups.
For example, if I have a table with 20 million rows and the query condition matches 5 million rows, would it default to using an index or a full table scan?

| username: wenyi | Original post link

I remember Oracle has a 30% ratio.

| username: TiDBer_小阿飞 | Original post link

Don’t you have to look at the lookup (read-back-from-table) volume for the specific query conditions to decide between a full table scan and an index?

| username: 这里介绍不了我 | Original post link

This is related to the optimizer.

| username: Kongdom | Original post link

There shouldn’t be a fixed ratio for this, waiting for the source code expert to explain~

| username: tidb菜鸟一只 | Original post link

Where does 30% come from? A full table scan would already be more reasonable above 10%…

| username: oceanzhang | Original post link

This also depends on whether it’s a large table or a small table, and whether a table lookup is required.

| username: TiDBer_jYQINSnf | Original post link

Waiting for the source code expert to explain~ +1

| username: 哈喽沃德 | Original post link

Usually, it does not exceed 40%.

| username: dba远航 | Original post link

This depends on the optimizer’s algorithm; it seems there is no fixed value.

| username: TIDB-Learner | Original post link

The proportions and data volumes people quote are probably just rules of thumb from experience, right? The optimizer wouldn’t hard-code a constant value like that, would it?

| username: zhaokede | Original post link

It’s still more accurate to look at the execution plan.

| username: Soysauce520 | Original post link

Waiting for the source code expert to explain~ +1

| username: kelvin | Original post link

Waiting for the source code expert to explain~ +1

| username: TiDBer_iLonNMYE | Original post link

Oracle is around 2%, not 10% or 30%; it is controlled by a specific set of cost parameters. If TiDB likewise has cost parameters governing the cost of a full table scan versus an index scan, they will affect this ratio.

| username: 江湖故人 | Original post link

This should also be related to hardware. The better the disk performance and the worse the CPU performance, the more the optimizer tends to choose full table scans.

| username: 哈喽沃德 | Original post link

Test it yourself with a large table.

| username: 江湖故人 | Original post link

Wide tables are more likely to use indexes, while narrow tables are more likely to perform full table scans. In Oracle, this can range from 3% to 30%.

| username: zhang_2023 | Original post link

Above roughly 70% it performs a full table scan, but this is not fixed.

| username: 有猫万事足 | Original post link

Originally I couldn’t find the relevant code, but thanks to @h5n1’s reminder:

explain format='cost_trace' shows the cost calculation formula in the execution plan

this question has made some progress, and I hope it can serve as a starting point for further discussion.
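
For anyone who wants to reproduce this, the usage looks roughly like the following (the table and column names are made up for illustration; run it against your own table and predicate):

EXPLAIN FORMAT='cost_trace'
SELECT * FROM big_table WHERE create_time >= '2024-01-01' AND create_time < '2024-01-02';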

I found a large table with roughly 35.7 million records (3.5718153e+07 in the trace below). After analyzing it, the cost traces for a query filtered on a time slice are as follows.

index scan:

(screenshot of the raw cost_trace output)

table scan:

(screenshot of the raw cost_trace output)

After formatting, the concrete formula results are, for the index scan:

(screenshot of the formatted formula)

and for the table scan:

(screenshot of the formatted formula)
This still looks a bit messy. Let’s highlight the key points.
In the index scan formula, the decisive factor is:

(doubleRead(tasks(36.685874119007586) * tidb_request_factor(6e+06))) / tidb_executor_concurrency(5.00)

This is because tidb_request_factor is set very high: a cost of 6,000,000 per request.

So when an index scan occurs, its cost can basically be simplified to the above formula, with other influencing weights being very small.
The formula for the number of tasks is:

batchSize := float64(p.SCtx().GetSessionVars().IndexLookupSize) // i.e. tidb_index_lookup_size
taskPerBatch := 32.0 // TODO: remove this magic number
doubleReadTasks := doubleReadRows / batchSize * taskPerBatch

IndexLookupSize can be viewed through show variables like 'tidb_index_lookup_size', and I have set it to 20000.
So tasks = estimated rows read (22928) / tidb_index_lookup_size (20000) * taskPerBatch (32) = approximately 36.68.
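
As a quick sanity check, that arithmetic can be reproduced with a plain SELECT (nothing TiDB-specific here):

SELECT 22928 / 20000 * 32 AS tasks; -- returns approximately 36.68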

Therefore, if we want to reduce the cost of an index scan, there are two effective methods: increase the values of tidb_index_lookup_size and tidb_executor_concurrency.
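
For example (the values below are illustrative, not recommendations; tune them against your own workload):

SET SESSION tidb_index_lookup_size = 40000; -- default is 20000
SET SESSION tidb_executor_concurrency = 10; -- default is 5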

Similarly, let’s look at the cost calculation method for a table scan. The main part is:

(
(cpu(3.5718153e+07 (total records) * filters(1) * tikv_cpu_factor(49.9))) +
(scan(3.5718153e+07 (total records) * logrowsize(312) * tikv_scan_factor(40.7)))
) / tidb_distsql_scan_concurrency(15.00)

The network component’s weight in the cost is very small and can be ignored. The main factors are the table’s total record count (times the logical row size) and the value of tidb_distsql_scan_concurrency.
The total record size cannot be changed, so the only way to reduce the cost of a table scan is to increase tidb_distsql_scan_concurrency.
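
Plugging the traced numbers back in confirms the magnitude (plain arithmetic; the session value below is illustrative):

SELECT (3.5718153e7 * 1 * 49.9 + 3.5718153e7 * 312 * 40.7) / 15 AS table_scan_cost; -- approximately 3.0e10
SET SESSION tidb_distsql_scan_concurrency = 30; -- default is 15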

Back to the question in the title. This ratio is not easy to pin down, but as a rough estimate, when

(estimated index scan rows / tidb_index_lookup_size * 32 * tidb_request_factor) / tidb_executor_concurrency

is less than

(total table records * logrowsize * tikv_scan_factor) / tidb_distsql_scan_concurrency,

the optimizer will generally choose an index scan over a table scan (a numeric check with this thread’s figures follows the list below). Additionally, to make the optimizer more likely to choose an index scan, you can adjust two parameters:

  1. Increase the value of tidb_index_lookup_size
  2. Increase the value of tidb_executor_concurrency
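
For reference, here are both sides of the inequality with this thread’s numbers plugged in (plain arithmetic; the table-scan side uses only the scan term, as in the simplified formula above):

SELECT
(22928 / 20000 * 32 * 6e6) / 5 AS index_scan_cost, -- approximately 4.4e7
(3.5718153e7 * 312 * 40.7) / 15 AS table_scan_cost; -- approximately 3.0e10

The index side is smaller by almost three orders of magnitude, which matches the plan the optimizer actually chose for this query.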