Is there room for optimization for grouping millions of data? Can it be improved to around 10 seconds? The data before the group condition filter is approximately 3 to 4 million

translator_bot · June 21, 2024, 7:48am

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 几百万数据group有没有优化空间，能提升到10几秒吗？group条件过滤之前大概数据300万到400万

| username: Jjjjayson_zeng

【TiDB Usage Environment】Production Environment
【TiDB Version】
【Reproduction Path】What operations were performed when the issue occurred
【Encountered Issue: Issue Phenomenon and Impact】
【Resource Configuration】Go to TiDB Dashboard - Cluster Info - Hosts and take a screenshot of this page
【Attachments: Screenshots/Logs/Monitoring】

translator_bot · June 21, 2024, 7:48am

| username: 像风一样的男子 | Original post link

It is recommended to use TiFlash for such large data volume queries.

translator_bot · June 21, 2024, 7:48am

| username: FutureDB | Original post link

For analytical SQL, you can create TiFlash replicas to utilize TiFlash MPP. For online SQL, you can check if the filter conditions and join conditions are using indexes, and whether the join order can be adjusted, such as having a small table drive a large table. However, to determine the specific optimization, you should first examine the execution plan.

translator_bot · June 21, 2024, 7:48am

| username: zhanggame1 | Original post link

Could you share the execution plan?

translator_bot · June 21, 2024, 7:48am

| username: tidb狂热爱好者 | Original post link

Go on the machine with 2TB of memory.

translator_bot · June 21, 2024, 7:48am

| username: wangccsy | Original post link

It should be possible.

translator_bot · June 21, 2024, 7:48am

| username: 小龙虾爱大龙虾 | Original post link

You can try it, first send the execution plan text and table structure for us to take a look.

translator_bot · June 21, 2024, 7:48am

| username: 有猫万事足 | Original post link

TiFlash + MPP. This is very suitable for your SQL.

translator_bot · June 21, 2024, 7:48am

| username: 路在何chu | Original post link

This is purely a query with a large amount of data.

translator_bot · June 21, 2024, 7:48am

| username: Jjjjayson_zeng | Original post link

I just looked at TiFlash, there are some issues, let them take a look first.

translator_bot · June 21, 2024, 7:48am

| username: 有猫万事足 | Original post link

You must use TiFlash + MPP to achieve speed.

If you only use TiFlash for scanning, there won’t be much improvement. It might even be better to optimize the index and use TiKV instead.
Don’t assume that just because the execution plan includes TiFlash, it will be sufficient.

translator_bot · June 21, 2024, 7:48am

| username: Jjjjayson_zeng | Original post link

Doesn’t TiDB’s parser itself determine whether the index or TiFlash is faster?

translator_bot · June 21, 2024, 7:48am

| username: Jjjjayson_zeng | Original post link

What does MPP refer to?

translator_bot · June 21, 2024, 7:48am

| username: 托马斯滑板鞋 | Original post link

I suggest directly providing the execution plan and SQL trace text.

translator_bot · June 21, 2024, 7:48am

| username: 有猫万事足 | Original post link

To be honest, it’s best if we discuss specific issues. If you have an execution plan, I can help you see how to optimize it.

Otherwise, if we just expand on your current problem, it will end up with us just complaining about the optimizer together.
Are you sure that’s the result you want from posting this thread?

translator_bot · June 21, 2024, 7:48am

| username: 有猫万事足 | Original post link

The first link is an introduction, and the second link is the key point. If you want TiFlash to perform hash join pushdown, you must use MPP mode. With all these associations, if there is no MPP and TiFlash replica, once the optimizer chooses to do aggregation on TiDB and scan on TiFlash, it will be inefficient and the speed will not be fast.

translator_bot · June 21, 2024, 7:48am

| username: 江湖故人 | Original post link

Won’t the execution plan be captured?

translator_bot · June 21, 2024, 7:48am

| username: 江湖故人 | Original post link

a.timecarddate can consider interval partitioning, a.payhours and associated keys can consider adding indexes, and can consider using TiFlash.

translator_bot · June 21, 2024, 7:48am

| username: Jjjjayson_zeng | Original post link

There is an issue with TiFlash, it hasn’t been fixed yet. Once it’s fixed, I’ll provide an execution plan.

translator_bot · June 21, 2024, 7:48am

| username: Kongdom | Original post link

Your SQL is a bit strange,

I don’t see any aggregate functions, but I see a GROUP BY
It’s recommended to change the LEFT JOIN to an INNER JOIN if possible
Please share the execution plan and table structure. We can’t just guess~