How to Optimize Join Operations for Tables with Hundreds of Millions of Rows?

translator_bot · June 22, 2024, 2:59pm

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 上亿大表如何优化join 操作？

| username: 海石花47

[TiDB Usage Environment] Production Environment
[TiDB Version] v6.1
[Encountered Problem: Phenomenon and Impact]
As shown below, table a has a data volume of hundreds of thousands, and after adding Update_time, it only has a data volume of a few hundred rows.
However, table b has a data volume of over a hundred million.
Both tables have indexes on the relevant fields.
Running the query directly will cause it to crash. How should I optimize the SQL? Thanks for the help from all the experts!

SELECT
    a.xxx
FROM
    a
    JOIN b ON a.order_no = b.order_no 
    AND b.df = 2 
WHERE
    a.update_time >= '2023-03-09 13:36:07'

[Resource Configuration]
[Attachment: Screenshot/Log/Monitoring]

translator_bot · June 22, 2024, 2:59pm

| username: tidb菜鸟一只 | Original post link

Is there an index on update_time? Please share the execution plan for us to take a look.

translator_bot · June 22, 2024, 2:59pm

| username: xingzhenxiang | Original post link

Step-by-step execution:

Execute SELECT
a.order_no
FROM a
WHERE a.update_time >= ‘2023-03-09 13:36:07’
Create an index on uptime and order
SELECT
b.order_no
FROM b
WHERE b.df = 2
AND b.order IN (results from step 1)
Create an index on df and order
SELECT
a.xxx
FROM a
WHERE a.order IN (results from step 2)
Create an index on order

translator_bot · June 22, 2024, 2:59pm

| username: 海石花47 | Original post link

Although I can no longer reproduce that situation, your idea looks good. Thank you.

translator_bot · June 22, 2024, 2:59pm

| username: 海石花47 | Original post link

There is an index, and the plan at that time was as follows:

translator_bot · June 22, 2024, 2:59pm

| username: 人如其名 | Original post link

There are no indexes on b.order_no and b.bf, right? Otherwise, it should use the index instead of hash join, assuming there aren’t many order_no and bf values after the join.

translator_bot · June 22, 2024, 2:59pm

| username: Raymond | Original post link

It is also possible that a.order_no = b.order_no both do not have indexes. If one of the fields has an index, it is also possible to use index join.

translator_bot · June 22, 2024, 2:59pm

| username: Jellybean | Original post link

The execution plan shows that table b is being fully scanned, and the presence of the pseudo keyword in the execution information indicates that the table statistics might be inaccurate, leading to execution issues. Therefore, you need to check the index situation and update the table statistics with the analyze table command. Once this is handled, the problem should be resolved.

translator_bot · June 22, 2024, 2:59pm

| username: waeng | Original post link

If it can be split, split it as much as possible to reduce data scanning.

translator_bot · June 22, 2024, 2:59pm

| username: Running | Original post link

Add indexes to order_no and update_time, or directly use TiFlash.

translator_bot · June 22, 2024, 2:59pm

| username: 海石花47 | Original post link

They are all single indexes, not composite indexes. But won’t there be index merging?

translator_bot · June 22, 2024, 2:59pm

| username: 海石花47 | Original post link

Both of these have indexes.

translator_bot · June 22, 2024, 2:59pm

| username: 海石花47 | Original post link

At that time, the situation was that table b had an issue with the program, resulting in a field being entirely null, and the data volume was abnormally high, reaching over a hundred million and growing rapidly. Therefore, it is estimated that the table’s health was very poor at that time.

translator_bot · June 22, 2024, 2:59pm

| username: 海石花47 | Original post link

TiFlash requires adding 3 more servers… the cost is expensive.

translator_bot · June 22, 2024, 2:59pm

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.