When the data service executes a transaction, if one of the statements reports an error, it will not be rolled back automatically and will be stuck until the 50s timeout

Application environment:

The data service, http endpoint from tidb cloud

Reproduction method:

BEGIN;
insert into users (id) values (1);
insert into non_exists_table (id) values (1); // stuck here
COMMIT:

Problem:

data service can not rollback automaticly.

Resource allocation:

Attachment:

1. Root Cause Analysis

The transaction remains stuck due to TiDB’s combination of:

  • Optimistic concurrency control: Errors are detected at commit phase rather than statement execution (source)
  • HTTP endpoint behavior: Maintains open transaction context until explicit COMMIT/ROLLBACK
  • Missing table error classification: Considered a non-fatal statement error rather than transaction-level failure
  • Application error handling gap: Failure to implement proper transaction cleanup after SQL errors

2. TiDB Transaction Handling Mechanism

When using TiDB Cloud’s HTTP endpoint:

  • Statement-level rollback: Failed INSERT rolls back only that statement’s effects
  • Transaction persistence: Open transaction holds resources until timeout
  • Distributed locking: Maintains locks across TiKV nodes during transaction lifespan
  • HTTP session binding: Transaction context persists across multiple requests in some configurations (source)

3. 50-Second Timeout Explanation

This duration is determined by:

  • tidb_idle_transaction_timeout (default 60s in TiDB Cloud)
  • HTTP endpoint’s max_execution_time setting
  • TiKV’s lock TTL (Time-To-Live) mechanism
  • Distributed deadlock detection interval

4. Step-by-Step Solutions

Immediate Fix:

# Sample error handling for HTTP endpoint transactions
from fastapi import HTTPException

try:
    with conn.transaction():
        conn.execute("INSERT INTO non_existent_table...")
except DatabaseError as e:
    conn.rollback()
    raise HTTPException(status_code=400, detail=str(e))

Configuration Adjustments:

  1. Set lower idle timeout:
    SET GLOBAL tidb_idle_transaction_timeout = 15000;  # 15 seconds
    
  2. Enable pessimistic locking mode:
    BEGIN PESSIMISTIC;
    

Preventative Measures:

  1. Schema validation pre-check:
    SELECT COUNT(*) FROM information_schema.tables 
    WHERE table_name = 'target_table';
    
  2. Implement exponential backoff retry:
    from tenacity import retry, stop_after_attempt, wait_exponential
    
    @retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=4, max=10))
    def safe_transaction():
        # transaction logic
    

5. Best Practices

  1. Explicit Error Handling

    • Always wrap transactions in try/catch blocks
    • Roll back immediately on any error (source)
  2. Timeout Configuration

    SET GLOBAL max_execution_time = 30000;  # 30s query timeout
    SET GLOBAL tidb_txn_assertion_level = STRICT;  # Enable strict assertion
    
  3. Monitoring

    SELECT * FROM information_schema.tidb_transactions 
    WHERE time_elapsed > 5000;  # Find long-running transactions
    
  4. Schema Validation

    • Use SHOW CREATE TABLE before critical operations
    • Implement schema migration checks (source)
  5. Connection Management

    # Ensure connection cleanup
    try:
        conn = pool.get_connection()
        # operations
    finally:
        conn.close()  # Forces transaction cleanup
    

For implementation details, refer to TiDB’s official documentation on transaction error handling and HTTP endpoint configuration.

Hi @Kimberly_Switzer thanks for your feedback! We’ve identified the issue and it’s now scheduled for a fix. I’ll keep you updated once there’s any progress.