How to Optimize TiFlash for Better Performance?

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TiFlash 如何优化才能更快点?

| username: LKLK

An online business SQL query takes 55 seconds on TiKV, but only 6 seconds when forced to use TiFlash. However, the development team is concerned that with only 10 days of test data, the query might become very slow once a year’s worth of data is added. How can we optimize TiFlash to improve query speed?

  1. TiFlash has 2 nodes, with 2 replicas.
  2. The SQL query is as follows:
SELECT
  *
FROM
  (SELECT
    channel.reg_date,
    iap.iap_1w_amount,
    iap.iap_1_amount,
    cost.cost,
    channel.retention_cnt,
    channel.iap_amount,
    iap.iap_2w_user,
    retention.retention_2w_cnt,
    iap.iap_3_user,
    iap.iap_1w_user,
    channel.reg_cnt,
    channel.iap_cnt,
    retention.retention_3_cnt,
    channel.install_cnt,
    retention.retention_1_cnt,
    iap.iap_1_user,
    retention.retention_1w_cnt,
    cost.show_cnt,
    channel.char_cnt,
    cost.click_cnt,
    retention.retention_2_cnt,
    iap.iap_2_user,
    iap.iap_3_amount,
    iap.iap_2w_amount,
    iap.iap_2_amount
  FROM
    (SELECT
      acs.reg_date,
      SUM(retention_cnt) AS retention_cnt,
      SUM(iap_amount) AS iap_amount,
      SUM(reg_cnt) AS reg_cnt,
      SUM(iap_cnt) AS iap_cnt,
      SUM(install_cnt) AS install_cnt,
      SUM(char_cnt) AS char_cnt
    FROM
      ad_channel_stat_1051 acs
      LEFT JOIN ad_campaign_config c
        ON acs.campaign_id = c.campaign_id
    WHERE acs.user_type = 1
      AND acs.reg_date >= '2022-11-01'
      AND acs.reg_date <= '2022-11-30'
      AND acs.gw_id = 105102
    GROUP BY acs.reg_date) channel
    LEFT JOIN
      (SELECT
        ars.reg_date,
        SUM(retention_2w_cnt) AS retention_2w_cnt,
        SUM(retention_3_cnt) AS retention_3_cnt,
        SUM(retention_1_cnt) AS retention_1_cnt,
        SUM(retention_1w_cnt) AS retention_1w_cnt,
        SUM(retention_2_cnt) AS retention_2_cnt
      FROM
        ad_retention_stat_1051 ars
        LEFT JOIN ad_campaign_config c
          ON ars.campaign_id = c.campaign_id
      WHERE ars.user_type = 1
        AND ars.reg_date >= '2022-11-01'
        AND ars.reg_date <= '2022-11-30'
        AND ars.gw_id = 105102
      GROUP BY ars.reg_date) retention
      ON channel.reg_date = retention.reg_date
    LEFT JOIN
      (SELECT
        ais.reg_date,
        SUM(iap_1w_amount) AS iap_1w_amount,
        SUM(iap_1_amount) AS iap_1_amount,
        SUM(iap_2w_user) AS iap_2w_user,
        SUM(iap_3_user) AS iap_3_user,
        SUM(iap_1w_user) AS iap_1w_user,
        SUM(iap_1_user) AS iap_1_user,
        SUM(iap_2_user) AS iap_2_user,
        SUM(iap_3_amount) AS iap_3_amount,
        SUM(iap_2w_amount) AS iap_2w_amount,
        SUM(iap_2_amount) AS iap_2_amount
      FROM
        ad_iap_stat_1051 ais
        LEFT JOIN ad_campaign_config c
          ON ais.campaign_id = c.campaign_id
      WHERE ais.user_type = 1
        AND ais.reg_date >= '2022-11-01'
        AND ais.reg_date <= '2022-11-30'
        AND ais.gw_id = 105102
      GROUP BY ais.reg_date) iap
      ON channel.reg_date = iap.reg_date
    LEFT JOIN
      (SELECT
        acs.report_date,
        SUM(cost) AS cost,
        SUM(show_cnt) AS show_cnt,
        SUM(click_cnt) AS click_cnt
      FROM
        ad_cost_stat_1051 acs
        LEFT JOIN ad_campaign_config c
          ON acs.campaign_id = c.campaign_id
      WHERE acs.report_date >= '2022-11-01'
        AND acs.report_date <= '2022-11-30'
        AND acs.gw_id = 105102
      GROUP BY acs.report_date) cost
      ON channel.reg_date = cost.report_date) tmp
LIMIT 1000 OFFSET 0;
  1. The execution plan is as follows:
...
  1. Does adding more TiFlash nodes improve query speed?

Adding more TiFlash nodes can potentially improve query speed by distributing the workload across more nodes, thus increasing parallelism and reducing the time taken for each node to process its share of the data. However, the actual performance gain depends on various factors such as network latency, data distribution, and the specific query patterns. It is also important to ensure that the TiFlash nodes are properly configured and that the data is evenly distributed across the nodes to fully leverage the additional resources.

| username: Kongdom | Original post link

If you encounter stats:pseudo, try executing analyze on the table to collect statistical information and then try again.

| username: 裤衩儿飞上天 | Original post link

This question is so long, I can’t even read it anymore :rofl:
TL;DR

| username: Kongdom | Original post link

:rofl: I have OCD, so I only started looking at it after formatting it locally.

| username: 裤衩儿飞上天 | Original post link

How do you read the execution plan? :rofl:

| username: 特雷西-迈克-格雷迪 | Original post link

Make all tables use TiFlash (wide tables with few columns involved in select).

| username: 数据小黑 | Original post link

The provided information is indeed extensive. Based on my current understanding, here is my response:

  1. Theoretically, the more machines TiFlash is deployed on, the more CPU and memory can be utilized. However, if TiFlash is only used for table scans, it can easily hit a ceiling, leaving a lot of CPU and memory unused.
  2. More replicas do not necessarily mean faster computation. The number of replicas is not directly related to the performance of a single query.
  3. I couldn’t find any MPP-related operators in the execution plan after searching for a while. As far as I remember, MPP is available in version 5.4.3. Could you try using MPP?
    Reference: 用 EXPLAIN 查看 MPP 模式查询的执行计划 | PingCAP 文档中心
| username: wakaka | Original post link

There are also many execution plans for TiKV. I don’t know which TiFlash tables you are referring to. You need to provide the data volume and data structure of the tables.