How to Optimize Query Performance for a Single Table with Over 2 Billion Rows in TiDB?

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TIDB单表数据超20亿,如果优化查询性能?

| username: TiDBer_RQobNXGv

In our production environment, a single TiDB table holds more than 2 billion rows. Even with indexes, filter queries on this table are very slow. How can query speed be improved? Does anyone have a good solution?

SQL statement:

SELECT
  count(0)
FROM
  collect_pole_history_message a
WHERE
  1 = 1
  AND SITE_CODE IN (
    'XN11000121',
    'HN11000211',
    'HN11000413',
    'HN11000411',
    'HN11000313',
    'HN11000111',
    'HN11000113',
    'HN11000112',
    'HN11000213',
    'HN11000311'
  )
  AND command_code IN (
    '0x3A',
    '0x7A',
    '0x68',
    '0x10',
    '0x70',
    '0x05',
    '0x6C',
    '0x69',
    '0x11',
    '0x15',
    '0x17',
    '0x3F',
    '0x6F',
    '0x7F'
  )
  AND org_id IN (
    '61386612e4b0a48664adbc60',
    '61386cece4b0a48664adbc8b',
    '61386e00e4b0a48664adbc9b',
    '61386ea7e4b0a48664adbc9d',
    '63876c92e4b00dbb07d4dc41',
    '638d4b8fe4b00dbb07d4dcd3',
    '61415640e4b0d3a914d45efb',
    '61396d53e4b064a86f4a16ff',
    '61396d7de4b064a86f4a1702',
    '6323d716e4b00e22a4b93cf6',
    '613acabae4b0f8f969dab71f',
    '613acb35e4b0f8f969dab723'
  )
  AND occur_time >= '2023-04-22 08:48:51'
  AND occur_time <= '2023-04-23 08:48:51';

Table structure:

Execution plan:

Cluster topology:

| username: 裤衩儿飞上天 | Original post link

Post the topology information, SQL, and execution plan for us to take a look.

| username: Jellybean | Original post link

Without specific SQL execution details, discussing optimization can easily become vague.

Please provide specific SQL and table structure information, especially index information. If there are privacy concerns, you can anonymize the table fields appropriately first.

| username: TiDBer_pkQ5q1l0 | Original post link

Please post the execution plan.

| username: tidb菜鸟一只 | Original post link

Deploy 200 nodes and see if it’s still slow.

| username: TiDBer_pkQ5q1l0 | Original post link

For a table with more than 2 billion rows like the one above, would a partitioned table or splitting it into multiple tables be more suitable?

| username: Jellybean | Original post link

There should be no need to split the table or use partition tables. We have a table with over 4 billion rows and more than 60 fields. With proper index design and usage, our daily queries are quite fast.

We still need to look at the specific SQL and execution plan, and then optimize accordingly.

| username: TiDBer_RQobNXGv | Original post link

I’ve already added it above, please take a look. Thanks.

| username: TiDBer_RQobNXGv | Original post link

I have already added it above, please take a look, thanks.

| username: TiDBer_RQobNXGv | Original post link

I have already added the information above, please check it out. Thanks.

| username: Jiawei | Original post link

You can consider partition tables, partitioning by time fields to reduce data volume.
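As a sketch of the suggestion above: a RANGE-partitioned table keyed on the time column. The table name and the four filter columns come from the query in the original post; all other columns and the partition boundaries are illustrative assumptions.

```sql
-- Hedged sketch: partition by occur_time so time-range queries only
-- touch the matching partitions. Column list beyond the query's filter
-- columns is hypothetical. Note that TiDB requires every unique key
-- (including the primary key) to contain all partitioning columns,
-- hence occur_time in the primary key.
CREATE TABLE collect_pole_history_message_p (
  id BIGINT NOT NULL AUTO_INCREMENT,
  site_code VARCHAR(32),
  command_code VARCHAR(8),
  org_id VARCHAR(32),
  occur_time DATETIME NOT NULL,
  PRIMARY KEY (id, occur_time)
)
PARTITION BY RANGE COLUMNS (occur_time) (
  PARTITION p202303 VALUES LESS THAN ('2023-04-01 00:00:00'),
  PARTITION p202304 VALUES LESS THAN ('2023-05-01 00:00:00'),
  PARTITION pmax    VALUES LESS THAN (MAXVALUE)
);
```

With this layout, the one-day `occur_time` range in the original query would be pruned to a single partition instead of scanning the whole table.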

| username: 魔礼养羊 | Original post link

TiDB does not require partitioning; whether to partition tables depends on the situation. MySQL uses partitioning and table splitting as a way to optimize a single-instance database, reducing the computational and storage demands on a single database server. TiDB, on the other hand, is inherently a distributed database and can automatically achieve load balancing.

| username: caiyfc | Original post link

You have several duplicate indexes here. You can delete the indexes index_no and index_code. For optimization, if the selectivity is high, you can create a composite index on the four filter columns (SITE_CODE, command_code, org_id, occur_time). Alternatively, you can add TiFlash.
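A sketch of the above suggestion, assuming the index names mentioned in the reply (index_no, index_code) and the four filter columns from the posted query; verify selectivity against the actual data before applying:

```sql
-- Index names come from the reply above; confirm they are truly redundant
-- (e.g. via SHOW INDEX) before dropping.
ALTER TABLE collect_pole_history_message DROP INDEX index_no;
ALTER TABLE collect_pole_history_message DROP INDEX index_code;

-- Composite index covering all four filter predicates. Putting the
-- equality/IN columns first and the range column (occur_time) last lets
-- the optimizer use the index for both filtering and the time-range scan.
CREATE INDEX idx_site_cmd_org_time
  ON collect_pole_history_message (site_code, command_code, org_id, occur_time);
```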

| username: TiDBer_pkQ5q1l0 | Original post link

For this kind of large table aggregation, I feel that using TiFlash might be more suitable.
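For reference, adding a TiFlash replica is a per-table DDL in TiDB; the replica count of 2 below is an assumption, and progress can be checked in `information_schema`:

```sql
-- Create TiFlash replicas of the table (replica count 2 is an assumption;
-- 1 also works if TiFlash capacity is limited).
ALTER TABLE collect_pole_history_message SET TIFLASH REPLICA 2;

-- Check replication progress; AVAILABLE = 1 means the replica is ready
-- and the optimizer may route this COUNT aggregation to TiFlash.
SELECT TABLE_NAME, REPLICA_COUNT, AVAILABLE, PROGRESS
FROM information_schema.tiflash_replica
WHERE TABLE_NAME = 'collect_pole_history_message';
```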

| username: 魔礼养羊 | Original post link

I suggest you refer to this document. With this much data, optimization can only proceed iteratively. Based on your description, I would recommend either partitioning or splitting the table: with that many rows and columns, there is little room for optimization without splitting, short of moving to high-end dedicated hardware.

https://www.toutiao.com/article/7221447689358279224/?app=news_article&timestamp=1681469611&use_new_style=1&req_id=202304141853307E0224195AF583148E7B&group_id=7221447689358279224&wxshare_count=1&tt_from=weixin&utm_source=weixin&utm_medium=toutiao_android&utm_campaign=client_share&share_token=8e161d08-ff04-4186-952d-5d6ce024076d&source=m_redirect&wid=1682216109154

| username: TiDBer_RQobNXGv | Original post link

How many nodes does TiFlash need? How much can query performance be improved?

| username: TiDBer_RQobNXGv | Original post link

I have tried partitioned tables, but the performance improvement is minimal. Moreover, if the query condition is not a partition field, the performance is even worse than without partitioning.

| username: Jellybean | Original post link

For partitioned table queries, the best practice is to include the partition column in the WHERE clause for filtering. Otherwise, it will trigger a full partition scan. Specifying the partition in the query generally results in good performance.
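One way to verify the point above, assuming the table is partitioned on `occur_time`: run `EXPLAIN` and inspect the partition information in the plan. The partition names that would appear are hypothetical here.

```sql
-- If the plan shows a specific partition (e.g. "partition:p202304")
-- rather than "partition:all", partition pruning kicked in; "all" means
-- every partition is scanned, which is the slow case described above.
EXPLAIN
SELECT count(*)
FROM collect_pole_history_message
WHERE occur_time >= '2023-04-22 08:48:51'
  AND occur_time <= '2023-04-23 08:48:51';
```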

| username: Running | Original post link

That’s impressive. I’ve tried a single table with 500 million rows on one TiFlash node, and query performance was still okay. But 2 billion rows is on another level.

| username: 考试没答案 | Original post link

I have several large tables with around 2 billion rows each. Optimization depends on how they are used; address issues where they arise. In the worst case, the business logic can be changed.