TiSpark does not support bulk writing to tables with auto-random primary keys

translator_bot · June 23, 2024, 6:47am

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TiSpark不支持auto random做主键的表批量写入

| username: fenghaojiang

I want to use TiSpark for ETL, batch-writing the results after a large amount of computation, but I found that TiSpark does not support batch writes to tables whose primary key is an AUTO_RANDOM id. Is support for this planned, or will it remain unsupported?

translator_bot · June 23, 2024, 6:47am

| username: xfworld | Original post link

Which version of TiDB are you running? Does the table use a clustered index?
Which versions of TiSpark and Spark are you using?

And what do the data operations in your job look like?

translator_bot · June 23, 2024, 6:47am

| username: fenghaojiang | Original post link

TiDB: 5.2.1
Table structure:

create table xxx
(
    id bigint PRIMARY KEY AUTO_RANDOM(8),
    uniqueKey varchar(256) null,
    xxxx....
    constraint idx_unique_key
        unique (uniqueKey)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_general_ci;

TiSpark: 3.1-2.5.1
Spark: 3.1.3

translator_bot · June 23, 2024, 6:47am

| username: 数据小黑 | Original post link

When TiSpark writes, it repartitions the RDD according to the estimated number of Regions and then writes directly to TiKV concurrently. In this scenario, writing to an AUTO_RANDOM column is likely to cause problems. The figure (not reproduced here) shows a check that the TiSpark code runs before writing to a table: if the target table has an AUTO_RANDOM column, the error mentioned above is raised. This is not a very deep explanation; I will ask other experts for a more fundamental one.
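The pre-write check described above could look roughly like the following sketch. This is not TiSpark's actual code (the figure that showed it is missing); `ColumnInfo`, `TableInfo`, `UnsupportedWriteError`, and `check_writable` are hypothetical names used only to illustrate rejecting a write up front when the target table has an AUTO_RANDOM column.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ColumnInfo:
    # Hypothetical, simplified column metadata.
    name: str
    is_auto_random: bool = False


@dataclass
class TableInfo:
    # Hypothetical, simplified table metadata.
    name: str
    columns: List[ColumnInfo] = field(default_factory=list)


class UnsupportedWriteError(Exception):
    """Raised when a batch write targets a table the writer cannot handle."""


def check_writable(table: TableInfo) -> None:
    """Reject the write before any data is sent if a column is AUTO_RANDOM."""
    for column in table.columns:
        if column.is_auto_random:
            raise UnsupportedWriteError(
                f"table {table.name}: batch write to AUTO_RANDOM column "
                f"{column.name} is not supported"
            )
```

Running the check before the repartition step means the job fails fast instead of after part of the data has been written.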

translator_bot · June 23, 2024, 6:47am

| username: 数据小黑 | Original post link

After looking into it, TiSpark 3.0.1 appears to support auto_random already. You can test it.
Reference: tispark/CHANGELOG.md at 4e0860ad2d7dd46c5af6e2486197bceff863b183 · pingcap/tispark · GitHub

translator_bot · June 23, 2024, 6:47am

| username: OnTheRoad | Original post link

Isn’t this a new feature of TiSpark 2.3.11?

translator_bot · June 23, 2024, 6:47am

| username: fenghaojiang | Original post link

I looked at this change. It appears to throw an error stating that writing to an auto_random column is not supported.

translator_bot · June 23, 2024, 6:47am

| username: fenghaojiang | Original post link

Okay, thank you. I’ll check it.

translator_bot · June 23, 2024, 6:47am

| username: fenghaojiang | Original post link

It supports auto_random, but it does not support writing.

translator_bot · June 23, 2024, 6:47am

| username: yilong | Original post link

What is your insert statement? Did you specify the id column? An AUTO_RANDOM column does not need to be specified; TiDB assigns its value by default.
If you do want to specify it, you need to set a parameter in TiDB, but that might not work with TiSpark. You can give it a try.
Reference: Insert Data | PingCAP Documentation Center
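To make "the default assigned value" more concrete, here is a rough sketch of how an implicitly assigned value for `AUTO_RANDOM(8)` on a signed bigint is laid out: one sign bit, then 8 shard bits, then an incrementing counter in the remaining 55 bits. It is a simplification and not TiDB's implementation: as far as the TiDB documentation describes, the shard bits are derived from the transaction rather than from a plain random call as here, and `compose_auto_random_id` and `next_id` are hypothetical helpers.

```python
import random

SHARD_BITS = 8                        # from AUTO_RANDOM(8) in the table above
INCREMENT_BITS = 64 - 1 - SHARD_BITS  # 1 sign bit; the rest is a counter


def compose_auto_random_id(shard: int, counter: int) -> int:
    """Place the shard bits directly below the sign bit, above the counter."""
    if not 0 <= shard < (1 << SHARD_BITS):
        raise ValueError("shard does not fit in the shard bits")
    if not 0 < counter < (1 << INCREMENT_BITS):
        raise ValueError("counter does not fit in the increment bits")
    return (shard << INCREMENT_BITS) | counter


def next_id(counter: int) -> int:
    """Simplified stand-in: random shard bits, caller-supplied counter."""
    return compose_auto_random_id(random.getrandbits(SHARD_BITS), counter)
```

Because the high bits vary from row to row, consecutive inserts land in scattered key ranges, which is why the column is normally left for TiDB to fill in.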

translator_bot · June 23, 2024, 6:47am

| username: fenghaojiang | Original post link

The write happens in a submitted Spark job, not through SQL. The AUTO_RANDOM column is not specified, and the code is similar to:

dataframe.write()
...

The DataFrame does not contain the AUTO_RANDOM column. So does this mean that batch-writing data to a table with an AUTO_RANDOM column through Spark is currently not supported? The link you provided covers TiDB and does not go through TiSpark.
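For context, a TiSpark batch write of the kind described above usually has roughly the following shape. The sketch is written in PySpark as an assumption (the thread does not say which language the job uses), every option value is a placeholder, and `tidb_write_options` and `write_to_tidb` are hypothetical helpers. The `tidb` format name and the option keys follow the TiSpark data source guide and should be checked against the TiSpark version in use. Per this thread, with the versions discussed here such a write is rejected when the target table has an AUTO_RANDOM primary key.

```python
def tidb_write_options(database: str, table: str) -> dict:
    """Build the options for a TiSpark write; all values are placeholders."""
    return {
        "tidb.addr": "127.0.0.1",
        "tidb.port": "4000",
        "tidb.user": "root",
        "tidb.password": "",
        "database": database,
        "table": table,
    }


def write_to_tidb(dataframe, database: str, table: str) -> None:
    """Append a Spark DataFrame to a TiDB table through TiSpark.

    `dataframe` is expected to be a pyspark.sql.DataFrame; pyspark is not
    imported here so the sketch stays free of a Spark dependency.
    """
    writer = dataframe.write.format("tidb")
    for key, value in tidb_write_options(database, table).items():
        writer = writer.option(key, value)
    # "append" adds rows to the existing table instead of replacing it.
    writer.mode("append").save()
```

Note that the DataFrame passed in would not contain the `id` column, matching the situation described in the post.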

translator_bot · June 23, 2024, 6:47am

| username: yilong | Original post link

Confirmed: TiSpark does not support writing to such tables; it only supports reading them.
