Querying with LIMIT m,n in large tables becomes very slow when m increases to tens of millions. How can this be optimized?

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 大表中的limit m,n查询,当m增加到千万级的时候,非常慢,请问如何优化?

| username: TiDBer_dTajNAyh

[TiDB Usage Environment] Production Environment
[TiDB Version] 6.5.0
[Reproduction Path]
I just migrated from MySQL, so many things are still unclear to me.
The environment is a single TiDB server (though with three TiKV nodes).
I created a large table with nearly 40 million rows. Paginating with select * from mytable limit m,n becomes very slow once m exceeds 10 million. The table uses the default auto-increment id primary key (I know TiDB does not recommend auto-increment primary keys).
[Encountered Problem: Phenomenon and Impact]
A single query can take 30 seconds or more. How should I optimize this? I tried the usual MySQL techniques (such as an inner join on a primary-key subquery; count(*) is also very slow), but they didn't help.

Additionally, I noticed that count(*) still performs a full table scan, yet it's much faster than in MySQL and doesn't seem to use an index. What is the principle behind this?
Thank you!
[Resource Configuration]
[Attachments: Screenshots/Logs/Monitoring]

| username: liusf1993 | Original post link

You can consider using LIMIT to fetch only the primary keys first, and then use id IN to fetch the full rows, for example:
select * from mytable where id in (
    select id from mytable limit m, n
);
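
The same idea can also be written as a join over a derived table. This is my own sketch, not from the thread; it assumes the auto-increment id primary key from the question, and the offset/page-size values are illustrative. The join form is also portable to MySQL, which rejects LIMIT inside an IN subquery:

```sql
-- Fetch only primary keys in the derived table (a cheap index scan),
-- then join back to retrieve the full rows for that page.
-- Illustrative page: offset 10,000,000, 20 rows per page.
SELECT t.*
FROM mytable t
JOIN (
    SELECT id FROM mytable ORDER BY id LIMIT 10000000, 20
) p ON t.id = p.id;
```

Adding ORDER BY id to the subquery also makes the page boundaries deterministic, which a bare LIMIT m, n does not guarantee.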

| username: liusf1993 | Original post link

Supplement: count(*) uses IndexFullScan while select * uses TableFullScan, so count(*) is faster.

| username: 啦啦啦啦啦 | Original post link

There are several optimization approaches for deep-pagination issues, such as creating a separate pagination table or a pagination column. You can refer to the deep-pagination case studies in course 304 or 307, which explain this in great detail.

| username: TiDBer_dTajNAyh | Original post link

How can you tell it’s an index scan?

| username: TiDBer_dTajNAyh | Original post link

That doesn't optimize the query, because it still has to scan all the way to the offset before fetching the data. The result is even slower.

| username: TiDBer_dTajNAyh | Original post link

@啦啦啦啦啦 Could you please tell me where to find the 304 or 307 lessons? Thank you!

| username: 啦啦啦啦啦 | Original post link

https://learn.pingcap.com/learner/course

| username: liusf1993 | Original post link

count(*) is optimized to count(1), which only needs to scan the primary key (the principle is similar to MySQL's, but TiDB can leverage its cluster advantages and may be faster).

I used version 5.0.3 for EXPLAIN, and you can see IndexFullScan. Your version 6.5 should not have significant changes in the execution process, but the displayed information might have some adjustments.
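
To check this on your own 6.5 cluster, you can run EXPLAIN on both statements and compare the scan operators in the plan. This is a sketch; mytable is the table from the question, and the exact operator names and ids in the output will differ by version:

```sql
-- See which scan the optimizer chose for the count
-- (expect an index/primary-key scan rather than full rows):
EXPLAIN SELECT COUNT(*) FROM mytable;

-- Compare with the deep-pagination query
-- (expect a TableFullScan feeding a Limit):
EXPLAIN SELECT * FROM mytable LIMIT 10000000, 20;
```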

| username: liusf1993 | Original post link

You can verify by comparing execution times.
With LIMIT m, n, when m is large the query is slow because the database has to scan the full rows of the first m+n records (TableFullScan); the larger m is, the more data it scans and the slower it gets.
Selecting only the id instead uses the primary key index, avoiding the cost of scanning full rows (IndexFullScan). The retrieved primary keys are then used to fetch the complete data for the n rows via point lookups (PointGet).
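
Another common fix for deep pagination, when pages are browsed sequentially, is keyset pagination: remember the last id of the previous page and seek past it instead of counting an offset. A sketch, assuming the auto-increment id primary key from the question and an illustrative last-seen id:

```sql
-- First page:
SELECT * FROM mytable ORDER BY id LIMIT 20;

-- Subsequent pages: pass the largest id seen on the previous page
-- (10000000 here is illustrative). The primary-key range scan starts
-- directly at id > 10000000 instead of scanning and discarding the
-- first m rows, so the cost no longer grows with the page depth.
SELECT * FROM mytable WHERE id > 10000000 ORDER BY id LIMIT 20;
```

The trade-off is that this supports only next/previous-page navigation, not jumping to an arbitrary page number.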

| username: 胡杨树旁 | Original post link

Check the table structure. Does the table have any indexes?

| username: TiDBer_dTajNAyh | Original post link

You are right. I imported the table from a CSV file, so it has no indexes. But it was still very fast!

| username: Raymond | Original post link

> I noticed that count(*) still performs a full table scan, but it's much faster than MySQL and doesn't seem to use an index. What is the principle behind this?

This is the advantage of distributed data. In a single-node MySQL database, count(*) can only be aggregated on one storage node. TiDB, on the other hand, has multiple storage nodes (TiKV): each TiKV node aggregates its portion of the data, and the TiDB node then merges the partial results. This fully exploits the benefits of operator pushdown.