Is there any risk in adding a table with 2.2 billion rows (partitioned table) in TiKV to TiFlash?

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tikv内22亿的表(分区表)加入到tiflash中会有风险么?

| username: qiuxb

[TiDB Usage Environment] Production Environment / Test / Poc
Production Environment
[TiDB Version]
v4.0.8
[Reproduction Path] What operations were performed to cause the issue

[Encountered Issue: Problem Phenomenon and Impact]
Currently, there is a single table with 2.2 billion rows that requires an aggregate query. Executing it in TiKV takes 10 seconds, which does not meet performance requirements. Considering adding the table to TiFlash, are there any risks?

SELECT
sum(xx) AS xxx
FROM
t_xxx t 
WHERE
app_xxx IN (
'aaa',
'bbb',
'ccc',
'dddd', omitted hundreds more) 
AND report_time BETWEEN '2023-09-04 00:00:00' 
AND '2023-09-04 23:59:59';

report_time has a single-column index.

Execution plan is as follows:

+------------------------------+------------+-----------+------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| StreamAgg_30                 | 1.00       | root      |                              | funcs:sum(Column#134)->Column#125                                                                                                                                                                                                                                                                                                                                                                          |
| └─TableReader_31             | 1.00       | root      |                              | data:StreamAgg_10                                                                                                                                                                                                                                                                                                                                                                                          |
|   └─StreamAgg_10             | 1.00       | cop[tikv] |                              | funcs:sum(dbname.t_xxx.xxx)->Column#134                                                                                                                                                                                                                                                                                                                                                                     |
|     └─Selection_29           | 1228580.34 | cop[tikv] |                              | ge(dbname.t_xxx.report_time, 2023-09-04 00:00:00.000000), in(dbname.t_xxx.app_package, "aaa", "bbb", "ccc", "ggg", "ddd", "eee", "fff"), le(dbname.t_xxx.report_time, 2023-09-04 23:59:59.000000) |
|       └─TableFullScan_28     | 6198910.00 | cop[tikv] | table:t, partition:p20230904 | keep order:false                                                                                                                                                                                                                                                                                                                                                                                           |
+------------------------------+------------+-----------+------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

[Resource Configuration] Go to TiDB Dashboard - Cluster Info - Hosts and take a screenshot of this page
[Attachments: Screenshots/Logs/Monitoring]

| username: 有猫万事足 | Original post link

TiFlash seems suitable for your SQL query.

However, it looks like your SQL still has room for optimization.
The execution plan shows a direct TableFullScan. Is the table too large, and is it difficult to add an index?

| username: qiuxb | Original post link

A table with 2.2 billion rows (partitioned table) currently has an index on report_time and is partitioned based on report_time. According to the execution plan, partition pruning is being done, but it is indeed scanning all partitions. Each partition has about 6 million rows. Even if a composite index on report_time and app_xxx is added, it doesn’t seem like the performance will improve significantly. The “in” query will cause index jumping scans.

| username: tidb菜鸟一只 | Original post link

Will the time range be limited to just one day? If the range is relatively small, try creating an index on the report_time field first. If the time range could be as long as a month, using TiFlash would indeed be better.

| username: qiuxb | Original post link

The time range is basically locked within one day. The table is partitioned by day, with over 6 million records per day. About 1.3 million records meet the filtering criteria. It seems that creating a composite index with three fields would be effective.

| username: h5n1 | Original post link

The version 4.0.8 is a bit low, not sure what issues might arise when adding TiFlash. Did you use TiFlash for other tables before?

| username: qiuxb | Original post link

Previously, I added a table to TiFlash when the data volume was small, only 5 million. It gradually synchronized to the current largest table, which now has 500 million. The main concern is that adding TiFlash to a table with 2.2 billion rows will significantly impact TiKV. Additionally, it hasn’t been mentioned how the synchronization is handled during this process.

| username: h5n1 | Original post link

There are related articles on the blog. TiFlash is a learner role replica of TiKV. The concern is that with such a large amount of data, TiFlash might be under significant pressure.

| username: TiDBer_GhBjYQdR | Original post link

Looking at the execution plan, it seems that the entire partitioned table is being scanned. If partitioned by day, this should be optimal. Is it necessary to place it in TiFlash? Adding a composite index would make the entire data table larger. If possible, would partitioning the table by day be better?

| username: 有猫万事足 | Original post link

I have a monthly partition table with 700 million records, with approximately 10 million records per day. I tried using a composite index for this query, and it can return results within 0.3 seconds. You can also try this to see if it solves the problem.

| username: tidb菜鸟一只 | Original post link

Your table is partitioned by day, so the index is not very meaningful. Configuring TiFlash replicas should be quite effective. There is no risk for TiKV, but TiFlash might experience a bit more pressure, which is not a concern.

| username: Kongdom | Original post link

There is no risk in adding TiFlash replicas, but it is important to note that it is best not to deploy TiFlash and TiKV nodes together.

| username: zhanggame1 | Original post link

Try using a composite index on the WHERE condition.

| username: cassblanca | Original post link

If you can take advantage of MPP parallel computing, it should be quite helpful. However, TiDB only introduced the MPP architecture starting from version 5.0. You can try the leftmost matching principle for composite indexes. The best practices for indexes also suggest that when using the IN expression in query conditions, the number of matching conditions should not exceed 300, otherwise the execution efficiency will be poor. Index Best Practices | PingCAP Documentation Center