After writing data using TiSpark, unable to query by primary key

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 使用TiSpark写入数据后,无法通过主键查询

| username: Zealot

TiDB Version: 7.1.2
TiSpark Version: [tispark-assembly-3.3_2.12-3.2.2.jar]
Spark Version: 3.3.1

Phenomenon:

  1. Running “insert into select” through TiSpark to get the table dw_company_extend, at this point, the table already contains data.
  2. Executing “select id from dw_company_extend limit 1” in TiDB returns the result: 95100000101705142Q
  3. Executing “select id from dw_company_extend where id = ‘95100000101705142Q’” in TiDB returns no result.

Table DDL:
CREATE TABLE dw_company_extend (
id varchar(20) NOT NULL DEFAULT ‘’ COMMENT ‘Company ID’,
staff_size varchar(255) DEFAULT ‘’ COMMENT ‘Staff Size’,
PRIMARY KEY (id) /*T![clustered_index] NONCLUSTERED */
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin COMMENT=‘Extended information dw table’;

| username: ShawnYan | Original post link

Logically, it shouldn’t be like this. Try querying directly from TiDB to see the result.

| username: 随缘天空 | Original post link

It might be due to the differences in how TiDB and TiSpark handle strings. TiDB strings use utf8 encoding by default, while TiSpark uses latin1 encoding by default. Try adjusting the encoding.

| username: Jellybean | Original post link

This shouldn’t normally happen. Please post the execution plans for these statements.

| username: Kongdom | Original post link

How about trying to query after running EXPLAIN ANALYZE on the table? Or try querying using the MySQL client?

| username: dba远航 | Original post link

Check if there are any spaces before or after.

| username: Zealot | Original post link

The result is queried from TiDB.

| username: Zealot | Original post link

Execution plan for explain analyze:

id estRows actRows task access object execution info operator info memory disk
Point_Get_1 1.00 0 root dw_company_extend, index:PRIMARY(id) time:4.33ms, loops:1, RU:0.270654, Get:{num_rpc:1, total_time:4.29ms}, total_process_time: 62µs, total_wait_time: 3.05ms, tikv_wall_time: 3.14ms, scan_detail: {total_keys: 1, get_snapshot_time: 3.02ms, rocksdb: {block: {cache_hit_count: 8}}} N/A N/A
| username: Zealot | Original post link

This definitely doesn’t exist.

| username: Zealot | Original post link

8200 - Unsupported modify charset from utf8mb4 to latin1

| username: Zealot | Original post link

I feel that the primary key index seems not to have been created. When I insert the same primary key ID, it does not report an error.

| username: Zealot | Original post link

Does TiSpark really support this encoding?

| username: TiDBer_小阿飞 | Original post link

Is this query method able to retrieve data? Shouldn’t it be followed by “and id=id”?

| username: Zealot | Original post link

This method can retrieve data, but once you add “and id= ?”, it can’t retrieve anything.

| username: dba远航 | Original post link

Check the collation of the database: collation_database

| username: Zealot | Original post link

The image you provided is not visible. Please provide the text you need translated.

| username: dba远航 | Original post link

I tested it locally and did not find any anomalies.

mysql> CREATE TABLE dw_company_extend (
    -> id varchar(20) NOT NULL DEFAULT '' COMMENT 'Company ID',
    -> staff_size varchar(255) DEFAULT '' COMMENT 'Staff Size',
    -> PRIMARY KEY (id) /*T![clustered_index] NONCLUSTERED */
    -> ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin COMMENT='Extended information dw table';
Query OK, 0 rows affected (0.19 sec)

mysql> insert into dw_company_extend values ('95100000101705142Q','Very Large');
Query OK, 1 row affected (0.01 sec)

mysql> select * from dw_company_extend where id = '95100000101705142Q';
+--------------------+------------+
| id                 | staff_size |
+--------------------+------------+
| 95100000101705142Q | Very Large |
+--------------------+------------+
1 row in set (0.00 sec)

mysql> select id from dw_company_extend limit 1;
+--------------------+
| id                 |
+--------------------+
| 95100000101705142Q |
+--------------------+
1 row in set (0.01 sec)
| username: Zealot | Original post link

The way you reproduced it is incorrect. When I normally use the MySQL client’s insert select, there is no problem either. The issue arises when using TiSpark to insert a large amount of data. When executing insert into select with a data volume of 50 million to 100 million, after successful execution, anomalies appear during queries. This problem is hard to reproduce because when the data volume is small (10 records), it works fine.

| username: Zealot | Original post link

Admin checked the table dw_company_extend. The index seems fine. Suddenly realized that the encoding of the two tables in the insert into select statement is inconsistent; one is utf8 and the other is utf8mb4. I will adjust them to be consistent and try running the data again.

| username: Zealot | Original post link

After rerunning it, everything is normal. It seems that the issue was indeed caused by the inconsistent character sets of the two tables. I wonder if this counts as a bug.