TiSpark returns a different row count than the same query run directly in TiDB

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tispark查询数据量和直接tidb查询数据量不一致

| username: wfxxh

TiSpark version: 3.3_2.12-3.2.2
Spark version: 3.3.3
Query statement: select count(1) from zl.patent_element where class_code like 'H%' or class_code like '%\%H%';

This statement returns a significantly smaller count when run through TiSpark than when run directly in TiDB.

TiSpark execution plan:

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[], functions=[specialsum(count(1)#876L, LongType, 0)])
   +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=447]
      +- HashAggregate(keys=[], functions=[partial_specialsum(count(1)#876L, LongType, 0)])
         +- TiKV CoprocessorRDD{[table: patent_element] TableReader, Columns: class_code@VARCHAR(4294967295): { TableRangeScan: { RangeFilter: [], Range: [([t\200\000\000\000\000\000\r\311_r\000\000\000\000\000\000\000\000], [t\200\000\000\000\000\000\r\311_s\000\000\000\000\000\000\000\000])] }, Selection: [[[class_code@VARCHAR(4294967295) STARTS_WITH "H" reg: "H%"] OR [class_code@VARCHAR(4294967295) LIKE "%\%H%" reg: "%\%H%"]]], Aggregates: Count(1) }, startTs: 448371119139061761} EstimatedCount:131451015

TiDB execution plan:

| id | estRows | task | access object | operator info |
|---|---|---|---|---|
|HashAgg_13|1.00|root||funcs:count(Column#61)->Column#60|
|└─TableReader_14|1.00|root||data:HashAgg_6|
|  └─HashAgg_6|1.00|cop[tikv]||funcs:count(1)->Column#61|
|    └─Selection_12|126192974.40|cop[tikv]||or(like(zl.patent_element.class_code, H%, 92), like(zl.patent_element.class_code, %\%H%, 92))|
|      └─TableFullScan_11|131451015.00|cop[tikv]|table:patent_element|keep order:false|

| username: WalterWj | Original post link

Make sure the escaping rules you are using are correct.
For MySQL:

SELECT COUNT(1) FROM zl.patent_element WHERE class_code LIKE 'H%' OR class_code LIKE '%\\H%'

For Spark SQL:

SELECT COUNT(1) FROM zl.patent_element WHERE class_code LIKE 'H%' OR class_code LIKE '%\\\\H%'
| username: wfxxh | Original post link

I have tried \ and \\, but the data from TiSpark still does not match the direct query results from TiDB.

| username: zhaokede | Original post link

Does the row count match if you query it this way?

| username: wfxxh | Original post link

Doesn’t match.

| class_code |
|---|
| G01D13/22%H02K21/14 |
| H02K%B60K%G01D13 |
| G01H/4%B70K |

My data has multiple class codes separated by %, and I want all rows that contain a code starting with H after splitting on % — that is, the first and second entries above. In TiDB I can simply run where class_code like 'H%' or class_code like '%\\%H%', but with TiSpark I only get the second entry back. I could split and filter with Spark functions, but our other colleagues are not familiar with them, so that approach is not very general.
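
For reference, the difference can be reproduced on a tiny test table built from the three sample rows above (a sketch; the table name and column size are made up for illustration):

CREATE TABLE zl.class_code_test (class_code VARCHAR(255));
INSERT INTO zl.class_code_test VALUES
  ('G01D13/22%H02K21/14'),   -- contains a literal % followed by H -> should count
  ('H02K%B60K%G01D13'),      -- starts with H                      -> should count
  ('G01H/4%B70K');           -- neither                            -> should not count

-- Run directly in TiDB this returns 2; through TiSpark only the row starting
-- with H was counted (the behaviour described above), giving 1.
SELECT count(1)
FROM zl.class_code_test
WHERE class_code LIKE 'H%' OR class_code LIKE '%\%H%';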

| username: WalterWj | Original post link

Using Backslash for Escaping

First, ensure that your Spark SQL configuration allows the use of backslash as an escape character. Then you can use it in your query like this:

SELECT * FROM tab WHERE c LIKE '%\\%%' ESCAPE '\\'

In the example above, the escape character is a backslash (written \\ because the backslash itself has to be escaped in the string literal), and \\% in the pattern stands for a literal %, so the query matches values that contain a % character. :thinking:
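
Another way to sidestep counting backslashes is to declare a non-backslash escape character, which LIKE accepts since Spark 3.0 (a sketch; the choice of '!' is arbitrary):

SELECT count(1)
FROM zl.patent_element
WHERE class_code LIKE 'H%'
   OR class_code LIKE '%!%H%' ESCAPE '!';

Here !% stands for a literal %, while the outer % characters remain ordinary wildcards.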

| username: WalterWj | Original post link

However, after checking… it seems that escaping with just one \ is enough according to the Spark 3.0 official documentation. Sigh.
https://spark.apache.org/docs/3.0.0/sql-ref-syntax-qry-select-like.html


I recommend using

SELECT DISTINCT original_class_code
FROM (
  SELECT explode(split(class_code, '%')) AS split_code, class_code AS original_class_code
  FROM zl.patent_element
) AS exploded_table
WHERE split_code LIKE 'H%'

or try using regexp

SELECT class_code
FROM zl.patent_element
WHERE class_code REGEXP '(^H|%H)';
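
If the REGEXP keyword is not recognized by a given Spark version, RLIKE is the long-standing equivalent operator, and since % carries no special meaning inside a regular expression the pattern needs no escaping at all (a sketch):

SELECT count(1)
FROM zl.patent_element
WHERE class_code RLIKE '(^H|%H)';
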
| username: wfxxh | Original post link

I have also tried ESCAPE; it doesn't work.

| username: wfxxh | Original post link

The explode method won't work for us — our other colleagues are not familiar with Spark functions, so a built-in-function approach is not general enough. I'll try REGEXP.

| username: wfxxh | Original post link

It shouldn't be a Spark problem: the result is correct when querying through Spark JDBC, so this looks like a TiSpark issue. By the way, is TiSpark no longer maintained?
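
For reference, the JDBC route can stay entirely in Spark SQL, so colleagues don't need any Spark functions (a sketch; host, port, and credentials are placeholders, and the predicate is the original one unchanged):

CREATE TEMPORARY VIEW patent_element_jdbc
USING org.apache.spark.sql.jdbc
OPTIONS (
  url 'jdbc:mysql://<tidb-host>:4000/zl',
  dbtable 'patent_element',
  user '<user>',
  password '<password>'
);

SELECT count(1)
FROM patent_element_jdbc
WHERE class_code LIKE 'H%' OR class_code LIKE '%\%H%';

(The MySQL-compatible JDBC driver has to be on the Spark classpath.)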

| username: WalterWj | Original post link

TiSpark is in maintenance mode; no new features will be added. I can report this issue, or you can submit an issue on GitHub yourself.

| username: WalterWj | Original post link

I have changed it to a bug report.

| username: shiyuhang0 | Original post link

Fixed in this PR: Fix escape for StringRegExpression by shiyuhang0 · Pull Request #2775 · pingcap/tispark · GitHub (https://github.com/pingcap/tispark/pull/2775).

| username: wfxxh | Original post link

When will support for Spark 3.4 and 3.5 be available? :crazy_face:

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.