Requirement for TiDB to Stream Result Sets from TiKV to TiFlash for Computation

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tidb拿到tikv结果集流入tiflash进行计算的需求

| username: 人如其名

For a query that joins two tables, select count(*) from a, b where a.col1=b.col2 and a.col2='xxx', where both tables a and b have TiFlash engines and a.col2 has an index, and assuming table a has tens of millions of rows but only 10 rows after filtering with a.col2=‘xxx’, while table b has hundreds of millions of rows, the following scenarios arise:

  1. If both a and b use TiFlash, table a will need to perform a full table scan, which is costly.
  2. If table a uses TiKV to filter out 10 rows via the index, then all data from table b needs to be sent to the TiDB node for the Join operation. Joining 10 rows with hundreds of millions of rows is costly. Additionally, if the join is performed on the TiDB node, the parent operator of this join operator can only be executed on the TiDB node, meaning subsequent operations cannot leverage TiFlash capabilities.

Is it possible to add another mechanism where the small result set filtered by the TiKV index is sent to TiDB, and then TiDB sends it to TiFlash as part of the operator input to perform a hashJoin with the large table b? This would be more efficient.

| username: tidb菜鸟一只 | Original post link

When generating the execution plan, you need to decide whether to use TiKV or TiFlash. At that time, you don’t know exactly how much data your filter conditions can filter out; it all depends on the statistics. Moreover, if the condition a.col1=b.col2 has an index on col2 of table b, there’s no need to use TiFlash; you can directly use the TiKV index.

| username: 人如其名 | Original post link

This requirement is a bit demanding, but the main idea is to utilize TiKV’s index filtering capability along with TiFlash’s operator computation capability to make complex queries faster.

| username: kkpeter | Original post link

This sounds good, but I’m not sure about the difficulty of implementation.

| username: h5n1 | Original post link

TiKV should also implement MPP.

| username: Jasper | Original post link

Wouldn’t it be better if you just split this requirement into two SQL statements? :joy:

| username: ajin0514 | Original post link

It can be achieved directly with SQL, right?