How to use .mode("append") in kyuubi-tispark SQL

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: kyuubi-tispark 的sql如何使用.mode("append")

| username: Chen_Hank

To improve efficiency, please provide the following information. A clear problem description will help resolve the issue faster:

[Overview] How to use INSERT INTO and INSERT OVERWRITE in a kyuubi + tidb setup

[Application Framework and Business Logic Adaptation]

[Background]
Using kyuubi + tispark + tidb for data processing
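
For context, the .mode("append") in the title refers to TiSpark's DataFrame write path. A minimal sketch, assuming TiSpark's "tidb" datasource; the addresses, credentials, and table names are illustrative placeholders, not details from this post:

    // Minimal TiSpark append-write sketch. Connection options and table
    // names below are placeholders.
    val df = spark.sql("SELECT * FROM source_db.source_table")

    df.write
      .format("tidb")                    // TiSpark's TiDB datasource
      .option("tidb.addr", "127.0.0.1")  // placeholder TiDB address
      .option("tidb.port", "4000")
      .option("tidb.user", "root")
      .option("tidb.password", "")
      .option("database", "target_db")   // placeholder target table
      .option("table", "target_table")
      .mode("append")                    // the append mode in question
      .save()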

[Phenomenon]
Both cases (INSERT INTO and INSERT OVERWRITE) run into primary key errors.

Inserting into a table without a primary key works fine.

From the information I found, with batch mode enabled the write can be done this way in spark-submit mode.

How can I use batch mode when submitting SQL through Kyuubi?
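
For illustration, a hedged sketch of what a batch write through a Kyuubi-managed Spark SQL engine can look like. The configuration keys below are my reading of the TiSpark docs and the names are placeholders, not from the original post, so verify them against your TiSpark version:

    // Engine-side settings (e.g. spark-defaults.conf for the Kyuubi engine);
    // verify the exact keys for your TiSpark version:
    //   spark.sql.extensions                 org.apache.tispark.TiExtensions
    //   spark.tispark.pd.addresses           pd0:2379
    //   spark.tispark.write.allow_spark_sql  true
    // With that in place, the statement submitted through Kyuubi is plain SQL:
    spark.sql(
      "INSERT INTO target_db.target_table SELECT * FROM source_db.source_table")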

[Problem] Current issues encountered
How to perform batch-mode inserts when submitting SQL through Kyuubi.

[Business Impact]

[TiDB Version] 6.5.6

[Attachments] Relevant logs and monitoring

    SQL Error: org.apache.kyuubi.KyuubiSQLException: org.apache.kyuubi.KyuubiSQLException: Error operating ExecuteStatement: org.tikv.common.exception.TiBatchWriteException: currently user provided auto id value is only supported in update mode!
        at com.pingcap.tispark.write.TiBatchWriteTable.preCalculate(TiBatchWriteTable.scala:179)
        at com.pingcap.tispark.write.TiBatchWrite.$anonfun$doWrite$7(TiBatchWrite.scala:199)
        at scala.collection.immutable.List.map(List.scala:293)
        at com.pingcap.tispark.write.TiBatchWrite.doWrite(TiBatchWrite.scala:199)
        at com.pingcap.tispark.write.TiBatchWrite.com$pingcap$tispark$write$TiBatchWrite$$write(TiBatchWrite.scala:94)
        at com.pingcap.tispark.write.TiBatchWrite$.write(TiBatchWrite.scala:50)
        at com.pingcap.tispark.write.TiDBWriter$.write(TiDBWriter.scala:41)
        at com.pingcap.tispark.v2.TiDBTableProvider.createRelation(TiDBTableProvider.scala:94)
        at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:47)
        at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
        at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
        at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
        at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:98)
        at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:109)
        at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated

| username: 有猫万事足 | Original post link

It’s a bit tricky.
I feel that using batch mode may not necessarily solve the problem.
The issue might actually be with the target table having an auto-increment key.

    // From TiSpark's TiBatchWriteTable.preCalculate: if the written rows carry
    // user-provided auto_increment/auto_random values and not every such row
    // already exists in the target table, the write is rejected.
    if ((autoIncProvidedID || autoRandomProvidedID) && conflictRows.count() != count) {
      throw new TiBatchWriteException(
        "currently user provided auto id value is only supported in update mode!")
    }

The condition is like this. I feel it doesn’t have much to do with batch mode.

| username: Chen_Hank | Original post link

Hello.
We want to try using TiSpark for data processing.
But we found that the INSERT INTO capability of Kyuubi + TiSpark seems limited: it only works when the target table has no auto-increment primary key. For large-scale writes, though, we have to use AUTO_RANDOM BIGINT primary keys to avoid hotspots, and that requires the primary key value to be present during the write.
Do you have any good practices for this?
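
For illustration, the kind of target table meant here would be created directly in TiDB, for example over JDBC. A sketch; connection details and identifiers are placeholders:

    // Sketch: create an AUTO_RANDOM target table in TiDB over JDBC.
    // Host, credentials, and identifiers are placeholders.
    import java.sql.DriverManager

    val conn = DriverManager.getConnection(
      "jdbc:mysql://127.0.0.1:4000/target_db", "root", "")
    val stmt = conn.createStatement()
    stmt.execute(
      """CREATE TABLE target_table (
        |  id BIGINT AUTO_RANDOM,
        |  payload VARCHAR(255),
        |  PRIMARY KEY (id) CLUSTERED -- AUTO_RANDOM needs a clustered BIGINT PK
        |)""".stripMargin)
    stmt.close()
    conn.close()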

| username: 有猫万事足 | Original post link

I haven’t really used TiSpark much, just traced it based on your error stack.
Doesn't your code throw an exception when it hits an auto-random table? From the code above, it looks like both auto-random and auto-increment fail.

Judging from the code, to avoid triggering the exception the write must not supply auto-random or auto-increment values; otherwise every written row would have to already exist in the target table, which is the "update mode" the message refers to.

As for eliminating hotspots, either use a clustered table with AUTO_RANDOM or a non-clustered table with SHARD_ROW_ID_BITS.
If you must keep an auto-increment key, the best option is a non-clustered table with SHARD_ROW_ID_BITS.
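
A sketch of that non-clustered + SHARD_ROW_ID_BITS variant (again over JDBC; identifiers and the bit counts are placeholders):

    // Sketch: non-clustered PK plus SHARD_ROW_ID_BITS, so row data is keyed by
    // the hidden _tidb_rowid and scattered across regions instead of piling up
    // on one hotspot. All concrete values are placeholders.
    import java.sql.DriverManager

    val conn = DriverManager.getConnection(
      "jdbc:mysql://127.0.0.1:4000/target_db", "root", "")
    val stmt = conn.createStatement()
    stmt.execute(
      """CREATE TABLE target_table_shard (
        |  id BIGINT AUTO_INCREMENT,
        |  payload VARCHAR(255),
        |  PRIMARY KEY (id) NONCLUSTERED
        |) SHARD_ROW_ID_BITS = 4 PRE_SPLIT_REGIONS = 4""".stripMargin)
    stmt.close()
    conn.close()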

| username: Chen_Hank | Original post link

I analyzed the error again, and it aligns perfectly with your analysis.

Batch mode is already the default.

| username: Chen_Hank | Original post link

I tried inserting into a clustered table with a primary key.

| username: 有猫万事足 | Original post link

Clustered tables with auto-increment will definitely have hotspots.

When I was replicating MySQL traffic into TiDB with DM, I couldn't replace the auto-increment key either.
In the end, the way to eliminate the hotspot was a non-clustered table with SHARD_ROW_ID_BITS.

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.