Querying 100 Million Records in a Single Table

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 单表一亿数据查询

| username: TiDB小萌新

Querying a single table with 100 million records using “order by” takes 32 minutes. How can this be optimized? TiFlash has already been added.
【TiDB Environment】Production environment
【TiDB Version】5.4.0
【Issue and Impact】Single table query is very slow
【Reproduction Path】

【Resource Configuration】8 servers with 8 cores and 16GB RAM, 2 TiFlash, 3 TiKV

| username: TiDBer_jYQINSnf | Original post link

Your query is using select *, so TiFlash doesn’t have much effect. TiFlash is suitable for selecting one or a few columns. If you want to optimize TiKV, you need to improve the scan speed. Here are a few aspects to consider:

  1. Increase TiDB’s coprocessor cache. The first scan will cache the data, and if the data doesn’t change much, the second scan will hit the cache.
    TiDB 配置文件描述 | PingCAP 文档中心
  2. Increase the number of gRPC connections between TiDB and TiKV, which is equivalent to more concurrent scans.
    TiDB 配置文件描述 | PingCAP 文档中心
  3. Increase the number of threads in TiKV’s unified read pool and gRPC pool.
    TiKV 配置文件描述 | PingCAP 文档中心
  4. Increase the size of the block cache, which is equivalent to increasing RocksDB’s cache, allowing more data to be cached.
    TiKV 配置文件描述 | PingCAP 文档中心

These are the basic adjustments. If your cluster is read-heavy, increasing these parameters is appropriate. If your cluster only occasionally reads, be cautious with memory-related parameters, as you need to reserve memory for other queries.

Of course, the most effective way is to optimize the SQL to minimize or avoid scans.

Additionally, your query didn’t use TiFlash. I’m not sure if it’s because of select * or because you haven’t added TiFlash replicas. If you have confirmed that TiFlash replicas are added and it still doesn’t use TiFlash, then it means that using TiFlash is indeed more costly than using TiKV.

| username: TiDB小萌新 | Original post link

Executing this SQL without the ORDER BY clause takes 8 seconds, but with the ORDER BY clause, it doesn’t retrieve any results. Could any experts advise if there’s a database configuration that needs to be adjusted?

| username: Soysauce520 | Original post link

Are there indexes on the two conditions after the WHERE clause, and how selective are they? When looking at the execution plan, it first uses the time index for the ORDER BY.

| username: Soysauce520 | Original post link

After removing it, check the execution plan. There should be an index. Check the index name. When writing the order by, adding a hint to force the use of the index should work.

| username: TiDB小萌新 | Original post link

The two conditions after “where” do not have indexes. After enabling TiFlash, I shouldn’t need to add indexes, right?


I see that the execution plan should have indexes.

| username: TiDB小萌新 | Original post link

The original SQL is like this. This table will have a large amount of data being written and queried. What should be done in this situation?

| username: Soysauce520 | Original post link

You need to revert the table. As the friend above mentioned, you are using select *, check the execution plan without adding order by.

| username: TiDB小萌新 | Original post link

Like this, I also checked yesterday, if you don’t add order by, it can use TiFlash.

| username: h5n1 | Original post link

Recollect the statistics of the table. Is the index on create_time the only column? In what situations would this index be used, and consider whether this index can be removed.

| username: TiDB小萌新 | Original post link

This index was added at that time because of sorting. If this index is causing the issue, I can try removing it first to see if it helps.

| username: tidb菜鸟一只 | Original post link

Your SQL query does not utilize TiFlash, and indeed it doesn’t use TiFlash. The query also doesn’t correctly use the create_time column index. My suggestion is to ignore the time index and perform a full table scan, or directly create a composite index on these four fields.

| username: h5n1 | Original post link

Check if your version allows setting indexes to invisible, so you don’t have to delete them directly for testing.

| username: TiDB小萌新 | Original post link

I removed the index, and now it can be queried through TiFlash. This takes 9 seconds. Is there still room for optimization?

| username: h5n1 | Original post link

You can try expanding the CPU for TiFlash.

| username: TiDB小萌新 | Original post link

I see that the CPU utilization of TiFlash is not high right now. Do I need to add more nodes?

| username: Soysauce520 | Original post link

Manually specify using TiFlash: 使用 TiFlash | PingCAP 文档中心. If you want it to be even faster, you might need to look at the selectivity after the WHERE condition, use TiKV, add a composite index, force the use of the composite index, and reduce table lookups.

| username: TiDBer_jYQINSnf | Original post link

I noticed that TiFlash also has parameters related to improving concurrency:

   ## Introduced in v5.0, it indicates the maximum number of cop requests that TiFlash Coprocessor can execute simultaneously. If the number of requests exceeds the value specified by this configuration, the excess requests will queue up and wait. If set to 0 or not set, the default value is used, which is twice the number of physical cores.
    cop_pool_size = 0

You can take a look.

| username: TiDB小萌新 | Original post link

After I deleted the index, the TiFlash CPU suddenly spiked.

| username: TiDB小萌新 | Original post link

Removing a WHERE condition basically results in a full table scan, causing TiFlash’s CPU to be fully utilized. In this case, shouldn’t TiFlash be avoided?