Data Inconsistency Between TiSpark Reads of TiDB and the TiDB Data Source

translator_bot · June 20, 2024, 8:22pm

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: Tispark读取TiDB与TiDB数据源数据不一致问题

username: 罗啰萝丶

[TiDB Usage Environment] Test
[TiDB Version] 7.0.1
[TiSpark Version] 3.1.5
[Reproduction Path] The database already contains 7000+ tables. Create a new table in TiDB, then run the following.
In spark-shell:

Run spark.sql("use tidb_catalog.xx") to select the database whose tables will be counted.
Some warnings are reported at this point:
WARN HiveConf: HiveConf of name hive.stats.jdbc.timeout does not exist
WARN HiveConf: HiveConf of name hive.stats.retries.wait does not exist
WARN Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 2.3.0
WARN setMetaStoreSchemaVersion called but recording version is disabled
WARN Failed to get database global_temp, returning NoSuchObjectException
Then run spark.sql("show tables").count() to count the tables.

In the TiDB client:

show tables;

[Encountered Problem: Problem Phenomenon and Impact] The number of tables returned when querying TiDB directly differs from the number returned through spark-shell.
Once the table count passes a certain threshold, the count returned by Spark SQL decreases as more tables are added in TiDB.

-------------------Divider--------------------
After switching to a JDBC connection to TiDB, the table counts are consistent.
So I suspect this is a TiSpark issue. Could it be caused by some TiSpark configuration item?
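
To narrow this down, it may help to check which table names are missing instead of comparing counts alone. The sketch below is a hypothetical helper (diff_table_names is not part of TiSpark or TiDB). It assumes the two name lists have already been collected, for example from show tables in each client, and it compares names exactly with no case folding.

```python
def diff_table_names(tidb_tables, tispark_tables):
    """Return names present in one listing but not the other.

    Hypothetical helper: both arguments are plain lists of table names
    collected beforehand (for example from `show tables` in each client).
    Names are compared exactly, with no case folding.
    """
    tidb = set(tidb_tables)
    tispark = set(tispark_tables)
    return {
        # Tables TiDB reports that the TiSpark listing lacks.
        "missing_in_tispark": sorted(tidb - tispark),
        # Tables the TiSpark listing has that TiDB does not report.
        "extra_in_tispark": sorted(tispark - tidb),
    }


# Illustrative data only, not taken from the environment in this post.
result = diff_table_names(["t1", "t2", "t3"], ["t1", "t3"])
print(result)
```

Knowing whether the missing names are the newest tables or an arbitrary subset could make the report easier to diagnose.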
[Resource Configuration]
[Attachment: Screenshot/Log/Monitoring]