[BUG] TiSpark 3.x reads TiKV with lower concurrency and slower speed than version 2.5

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: [BUG] When TiSpark 3.x reads TiKV, the read concurrency is lower and the speed is slower than with version 2.5

| username: wfxxh

【TiDB Usage Environment】Production Environment
【TiDB Version】v5.4.2
【Encountered Problem】Spark version: 3.x (any Spark 3 release), TiSpark version: 3.x (any TiSpark 3 release)
【Problem Phenomenon and Impact】

(Screenshots in the original post showed the table information, the Spark configuration, and the TiSpark 2.5 vs. 3.x run results; they are not reproduced here.)

The code is exactly the same, only the tispark version was changed.

| username: Lucien-卢西恩 | Original post link

Please confirm the following information:

  1. Is it the same query, just with lower read concurrency and slower speed?
  2. Are there any differences in the table statistics or in the SQL execution plan generated by TiSpark?
  3. Has the data volume of the table changed? The cleanest comparison is to start both TiSpark versions at the same time, run the same SQL in each, and compare the execution plans and latency (see the sketch below). Has this been tested?
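
For example, a minimal sketch along these lines (assuming a spark-shell session `spark` that is already configured with the TiDB catalog and the table name from your post) lets you capture the plan and the wall-clock time under each TiSpark version:

```scala
// Run the same job under each TiSpark version with identical Spark configuration.
val df = spark.sql("select * from tidb_catalog.qk_chi.perio_art_project")
  .groupBy("project_name")
  .count()

df.explain()                     // print the physical plan chosen by this TiSpark version

val start = System.nanoTime()
val groups = df.count()          // force the job to actually run
println(s"distinct project_name groups: $groups, " +
  s"elapsed: ${(System.nanoTime() - start) / 1e9} s")
```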
| username: wfxxh | Original post link

  1. Yes, exactly the same, except for different TiSpark versions.
  2. Just a simple select *.
  3. The table hasn’t changed at all; that’s exactly how I tested it.
| username: Lucien-卢西恩 | Original post link

Could you provide the execution plan?

| username: wfxxh | Original post link

CREATE TABLE `perio_art_project` (
  `record_id` int(11) DEFAULT NULL,
  `article_id` varchar(255) DEFAULT NULL,
  `project_seq` int(11) DEFAULT NULL,
  `project_id` varchar(255) DEFAULT NULL,
  `project_name` longtext DEFAULT NULL,
  `batch_id` int(11) DEFAULT NULL,
  `primary_partition` int(4) GENERATED ALWAYS AS ((crc32(`article_id`)) % 9999) STORED NOT NULL,
  `last_modify_time` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  `spark_update_time` datetime DEFAULT NULL,
  KEY `article_id` (`article_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin;
sparkSession
  .sql(
    """select * from tidb_catalog.qk_chi.perio_art_project
      |""".stripMargin)
  .groupBy("project_name")
  .count()
  .explain()
== 2.5 Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[project_name#7], functions=[count(1)])
   +- Exchange hashpartitioning(project_name#7, 8192), true, [id=#14]
      +- HashAggregate(keys=[project_name#7], functions=[partial_count(1)])
         +- TiKV CoprocessorRDD{[table: perio_art_project] TableScan, Columns: project_name@VARCHAR(4294967295), KeyRange: [([t\200\000\000\000\000\000\003\017_r\000\000\000\000\000\000\000\000], [t\200\000\000\000\000\000\003\017_s\000\000\000\000\000\000\000\000])], startTs: 437048830927044635} EstimatedCount:18072492
== 3.x Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[project_name#7], functions=[specialsum(count(1)#34L, LongType, 0)])
   +- Exchange hashpartitioning(project_name#7, 8192), true, [id=#13]
      +- HashAggregate(keys=[project_name#7], functions=[partial_specialsum(count(1)#34L, LongType, 0)])
         +- TiSpark RegionTaskExec{downgradeThreshold=1000000000,downgradeFilter=[]
            +- TiKV FetchHandleRDD{[table: perio_art_project] IndexLookUp, Columns: project_name@VARCHAR(4294967295): { {IndexRangeScan(Index:article_id(article_id)): { RangeFilter: [], Range: [([t\200\000\000\000\000\000\003\017_i\200\000\000\000\000\000\000\001\000], [t\200\000\000\000\000\000\003\017_i\200\000\000\000\000\000\000\001\372])] }}; {TableRowIDScan, Aggregates: Count(1), First(project_name@VARCHAR(4294967295)), Group By: [project_name@VARCHAR(4294967295) ASC]} }, startTs: 437048801724203020}
| username: wfxxh | Original post link

There is something wrong with this execution plan: TiDB clearly does not use the index, so why is TiSpark 3.x using it? This is the TiDB execution plan:

| id | estRows | task | access object | operator info |
|---|---|---|---|---|
|Projection_4|3462592.98|root||qk_chi.perio_art_project.project_name, Column#11|
|└─HashAgg_9|3462592.98|root||group by:qk_chi.perio_art_project.project_name, funcs:count(Column#12)->Column#11, funcs:firstrow(qk_chi.perio_art_project.project_name)->qk_chi.perio_art_project.project_name|
|  └─TableReader_10|3462592.98|root||data:HashAgg_5|
|    └─HashAgg_5|3462592.98|cop[tikv]||group by:qk_chi.perio_art_project.project_name, funcs:count(1)->Column#12|
|      └─TableFullScan_8|18072492.00|cop[tikv]|table:perio_art_project|keep order:false|
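
The plan above comes from running EXPLAIN on the equivalent aggregation directly in TiDB. A rough sketch of fetching the same plan over JDBC is below; the host, port, user, and password are placeholders, and the MySQL JDBC driver must be on the classpath.

```scala
// Rough sketch only: print TiDB's plan for the equivalent query over JDBC.
import java.sql.DriverManager

val conn = DriverManager.getConnection(
  "jdbc:mysql://tidb-host:4000/qk_chi", "user", "password")
try {
  val rs = conn.createStatement().executeQuery(
    "EXPLAIN SELECT project_name, COUNT(1) FROM perio_art_project GROUP BY project_name")
  val cols = rs.getMetaData.getColumnCount
  while (rs.next()) {
    println((1 to cols).map(i => rs.getString(i)).mkString(" | "))
  }
} finally {
  conn.close()
}
```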
| username: wfxxh | Original post link

I found the reason. In the new version, StatisticsManager.loadStatisticsInfo is never called to initialize statisticsMap, so StatisticsManager.getTableStatistics returns null when TiStrategy.filterToDAGRequest asks for the table's statistics. As a result, TiKVScanAnalyzer.buildIndexScan cannot compute a correct calculateCostAndEstimateCount, and the IndexLookUp plan is kept instead of the cheaper full table scan.
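
To illustrate the effect, here is a simplified, hypothetical sketch (pasteable into a Scala REPL). The names TableStats, PlannerSketch, and chooseScan are invented for this illustration; the real pieces in TiSpark are StatisticsManager, TiStrategy, and TiKVScanAnalyzer mentioned above.

```scala
// Hypothetical, simplified sketch of the failure mode described above.
final case class TableStats(rowCount: Long, indexSelectivity: Double)

object PlannerSketch {
  // Only populated if loadStatisticsInfo is called during initialization.
  private val statisticsMap = scala.collection.mutable.Map.empty[String, TableStats]

  def loadStatisticsInfo(table: String, stats: TableStats): Unit =
    statisticsMap(table) = stats

  def getTableStatistics(table: String): Option[TableStats] =
    statisticsMap.get(table)

  // With statistics, a poorly selective index loses to the full table scan;
  // without statistics, the index-scan cost cannot be estimated and the
  // IndexLookUp plan survives by default.
  def chooseScan(table: String): String = getTableStatistics(table) match {
    case Some(s) if s.indexSelectivity > 0.5 => "TableFullScan"
    case Some(_)                             => "IndexLookUp"
    case None                                => "IndexLookUp" // statistics never loaded
  }
}

// Statistics never loaded (the reported bug): IndexLookUp is picked.
println(PlannerSketch.chooseScan("perio_art_project"))

// After loading statistics, the full scan is preferred for this table.
PlannerSketch.loadStatisticsInfo("perio_art_project", TableStats(18072492L, 0.9))
println(PlannerSketch.chooseScan("perio_art_project"))
```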

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.