How to Improve Query Performance for a TiDB Cluster with Hundreds of Millions of Rows in a Single Table

This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tidb集群在单表亿级别数据下如何提高查询性能

| username: TiDBer_W8QSBmpH

How to improve query performance for a TiDB cluster with hundreds of millions of rows in a single table? Currently, my first solution is to partition the single table by time, and include time or partition information in the queries. The second solution is to increase the number of TiKV nodes. The current cluster configuration is the official standard configuration. I hope someone can provide some insights.

| username: yiduoyunQ | Original post link

It mainly depends on the specific performance requirements. If the business can accept partition pruning for optimization, that would be best.

| username: hacker_77powerful | Original post link

Try using TiFlash column storage, it’s so fast you’ll love it.

| username: Jack-li | Original post link

Integrate the TiFlash component as secondary storage to provide columnar storage and more efficient analytical query capabilities.

| username: DBAER | Original post link

You can try TiFlash for analytical purposes.

| username: zhaokede | Original post link

Use TiFlash for query statistics;
Reasonable table design and index design can improve performance.

| username: 小于同学 | Original post link

TiFlash columnar storage.

| username: TiDBer_W8QSBmpH | Original post link

This billion-row table involves join queries with other million-row tables, using indexes but not covering indexes, and the average time taken is around 8 seconds. I saw a reply mentioning the use of TiFlash. Can this optimize the join query to the millisecond level or within 2 seconds?

| username: hacker_77powerful | Original post link

Using TiFlash column storage, millisecond-level performance is not an issue.

| username: tidb菜鸟一只 | Original post link

Is the amount of data you need to query large? Is it a summary type of requirement, or do you need to display a lot of data?

| username: zhanggame1 | Original post link

No need to differentiate, try TiFlash first.

| username: 友利奈绪 | Original post link

Try TiFlash first.

| username: TiDBer_QYr0vohO | Original post link

Enable TiFlash

| username: 随缘天空 | Original post link

If it is an analytical table, you can add the TiFlash component, which will be more efficient than adding time partitions and TiKV nodes.

| username: TiDBer_W8QSBmpH | Original post link

Not many, either pagination or range, at most a few dozen.

| username: zhh_912 | Original post link

It is recommended to split the database and tables from a normative perspective.

| username: yytest | Original post link

I recommend using TiFlash, which is columnar storage and suitable for analysis.

| username: tidb菜鸟一只 | Original post link

For pagination, if you need to extract a portion of data from a large dataset, for multi-table Join pagination queries, if the filter condition is on a single table, the inner query must use a covering index, paginate first, and then Join. For range queries, you might need to add an index on the corresponding field. If there is already an index, you can consider partitioning based on the corresponding field and try partition pruning.

| username: TiDBer_H5NdJb5Q | Original post link

Mark, how to optimize queries for OLTP tasks?

| username: Kongdom | Original post link

The main consideration is not the number of rows, but the number of columns and whether it is an aggregate query. TiFlash, with its columnar storage, shows particularly significant effects on aggregate queries.