Is there an issue with the SQL query if querying two tables with hundreds of millions of rows (across databases) using KV is slower than using TiFlash, and are there any optimization methods?

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 关联两个亿级数据量的表(跨库),查询走KV的时间反而要比走Tiflash快,是不是SQL有问题,有没有优化方式

| username: xiaoxiaozuofang

[TiDB Usage Environment] Production Environment
[TiDB Version] tidb_v6.1.0
[Reproduction Path]
[Encountered Problem: When associating two tables with hundreds of millions of data (cross-database), the query using KV takes longer than using Tiflash. Is there an issue with the SQL? Are there any optimization methods?]
[Resource Configuration]
[Attachment: Screenshot/Log/Monitoring]


| username: hey-hoho | Original post link

Run the SQL
Run the execution plan
Blind guessing won’t get you anywhere

| username: Kongdom | Original post link

Looking at the execution plan, TiFlash is performing a full table scan, so it’s normal for it to be slow. TiFlash is not necessarily faster than TiKV. In OLAP scenarios, because TiFlash uses columnar storage, operations like count and sum will be faster. However, for queries that only retrieve fields, it doesn’t necessarily have an advantage.

| username: Running | Original post link

Is it cached? In version 5.5, TiFlash did not have caching, and each query had to rescan the table. TiKV has a caching mechanism, so it might have used the cache. Also, check if the number of nodes is the same. For example, if there are more TiKV nodes than TiFlash nodes, this issue could occur.

| username: Running | Original post link

I feel that the CBO optimizer will default to using TiFlash as long as TiFlash is enabled, unless the statistics are inaccurate. It is best to execute ANALYZE table to re-collect the statistics.

| username: buddyyuan | Original post link

You are querying many fields. Try using operations like sum, avg, and group by, and then compare them.

| username: TiDBer_jYQINSnf | Original post link

I’m going to answer the question off the top of my head:
TiKV is fast because it uses indexes.
The purpose of TiFlash, to some extent, is to save memory. If TiKV always executes joins, the memory will be full.
TiFlash is columnar storage, fetching all values of a column at once doesn’t use much memory. With TiKV, if you do a full table scan, you really have to fetch all the table data, and the memory will explode.

Looking at the conclusion, I forcefully deduced a seemingly reasonable cause :crazy_face:

| username: tidb菜鸟一只 | Original post link

Remove the limit, then max out all the fields you’re querying now, and see which one is faster…