TiKV encounters a large number of prewrite error logs, causing many statements to slow down

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TiKV 遇到大量 prewrite 报错日志,导致大量语句变慢

| username: hzc989

[TiDB Usage Environment] Production Environment
[TiDB Version] 6.1.7
[Reproduction Path] N/A
[Encountered Problem: Problem Phenomenon and Impact]

Phenomenon: Occasionally, SQL requests suddenly become slow in production. After investigation, we found that the times when this happens all coincide with a burst of prewrite errors on one particular TiKV store, as shown in the attached screenshots below.

[Resource Configuration]

[Attachments: Screenshots/Logs/Monitoring]

Specific logs are as follows:

[2023/11/07 08:42:44.631 +08:00] [WARN] [prewrite.rs:630] ["commit_ts is too large, fallback to normal 2PC"] [lock="Lock { lock_type: Put, primary_key: 748000000000011F3B5F698000000000000001017765766670613676FF3531373231373231FF3638393539313730FF383132373233325FFF64616576666B746FFF6E6C5F315F686663FF33365F733376322EFF646174612E313530FF2E31000000000000F9, start_ts: TimeStamp(445465955077456458), ttl: 3000, short_value: C000000B673C7FB3, for_update_ts: TimeStamp(0), txn_size: 1, min_commit_ts: TimeStamp(445465955077456459), use_async_commit: true, secondaries: [748000000000011F3B5F724000000B673C7FB3], rollback_ts: [] }"] [max_commit_ts=445465955601744458] [min_commit_ts=445465956086710572] [start_ts=445465955077456458] [key=748000000000011FFF3B5F698000000000FF0000010177657666FF70613676FF353137FF3231373231FF3638FF393539313730FF38FF3132373233325FFFFF64616576666B746FFFFF6E6C5F315F6866FF63FF33365F733376FF322EFF646174612EFF313530FF2E310000FF00000000F9000000FC]

Additionally, we noticed in the TiKV documentation that version 6.5 introduces a new parameter, enable-async-apply-prewrite (documentation link: TiKV Configuration File | PingCAP Docs). From its description it seems closely related to this issue, but we could not find more detailed information about it. We would like to confirm: can slow prewrite cause these occasional request latency spikes, and would upgrading to 6.5 and enabling this new parameter improve the situation?

| username: xfworld | Original post link

It is not recommended to upgrade. First, figure out what is causing the data contention that results in a large number of locks…

Then try to resolve this contention issue through different approaches.

For upgrading in a production environment, you need to first conduct a proof of concept (POC)…

| username: zxgaa | Original post link

It’s best to first identify the cause; it could also be a business logic issue.

| username: Jellybean | Original post link

Let’s not rush to upgrade.

The first phase of 2PC, prewrite, mainly involves checking the MVCC version of the data and detecting lock conflicts for write operations.

ZC, it looks like your issue is caused by a large number of insert statements being executed simultaneously in a short period, leading to a lot of lock contention. See if it’s possible to adjust the SQL from the business side or reduce the concurrency for now.
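
To narrow it down further, something like the following could be run for the incident window to see which statements actually spent time in prewrite. This is only a sketch against information_schema.cluster_statements_summary_history; the time range here is just an example and the column names should be verified against your version (as far as I know the duration columns are in nanoseconds).

```sql
-- Sketch: statements with the highest prewrite time around the incident.
SELECT
    SUMMARY_BEGIN_TIME,
    DIGEST_TEXT,
    EXEC_COUNT,
    AVG_PREWRITE_TIME,
    MAX_PREWRITE_TIME,
    AVG_RESOLVE_LOCK_TIME,
    AVG_COMMIT_TIME
FROM information_schema.cluster_statements_summary_history
WHERE SUMMARY_BEGIN_TIME BETWEEN '2023-11-07 08:30:00' AND '2023-11-07 09:00:00'
ORDER BY MAX_PREWRITE_TIME DESC
LIMIT 20;
```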

| username: 像风一样的男子 | Original post link

Take a look at the flame graph; it’s probably a read/write hotspot issue.
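
Besides the flame graph, a rough way to check for write hotspots at the moment the slowdown is happening is the TIDB_HOT_REGIONS system table; just a sketch, and it only reflects the current instant:

```sql
-- Sketch: current hottest regions by write flow.
SELECT DB_NAME, TABLE_NAME, INDEX_NAME, TYPE, MAX_HOT_DEGREE, FLOW_BYTES
FROM information_schema.TIDB_HOT_REGIONS
WHERE TYPE = 'write'
ORDER BY FLOW_BYTES DESC
LIMIT 10;
```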

| username: yulei7633 | Original post link

Take a look at the coprocessor panels. Could the resource usage be coming from statistics collection?

| username: yulei7633 | Original post link

(Screenshot: TiKV-Details → Coprocessor panel)

| username: tidb菜鸟一只 | Original post link

Which table’s primary key is this? Is it generated by your own application?
Concurrent inserts may be waiting on locks…

| username: TiDBer_小阿飞 | Original post link

Check which transaction on the table with write conflicts is causing the lock wait.
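
For example, a sketch like this could show who is waiting on whom while the problem is occurring, using the DATA_LOCK_WAITS and CLUSTER_TIDB_TRX tables (available since v5.1; double-check the column names on your version):

```sql
-- Sketch: map each waiting transaction to the transaction holding the lock.
SELECT
    w.`KEY`                   AS blocked_key,
    w.TRX_ID                  AS waiting_trx,
    w.SQL_DIGEST_TEXT         AS waiting_sql,
    w.CURRENT_HOLDING_TRX_ID  AS holding_trx,
    t.CURRENT_SQL_DIGEST_TEXT AS holding_sql,
    t.STATE,
    t.START_TIME
FROM information_schema.DATA_LOCK_WAITS w
LEFT JOIN information_schema.CLUSTER_TIDB_TRX t
    ON w.CURRENT_HOLDING_TRX_ID = t.ID
ORDER BY t.START_TIME;
```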

| username: TiDB_C罗 | Original post link

What type of insert is it?

| username: hzc989 | Original post link

We have already ruled out read/write hotspots; there were no signs of a sudden hotspot read or write at that time.

| username: hzc989 | Original post link

Thank you, Wenjie, for your support. The main issue is that the business hasn’t changed significantly; we’ve always had concurrent inserts like this on a daily basis. It’s just that sometimes this problem occurs.

| username: Jellybean | Original post link

Referencing this post, a long prewrite phase can lead to this error. In general, with 1PC or async commit, a min_commit_ts is calculated during prewrite for each written key, and the largest of these is chosen as the commit_ts. To ensure schema consistency between start_ts and commit_ts, TiDB also sets a max_commit_ts, calculated as the prewrite start time + 2 seconds. When min_commit_ts exceeds max_commit_ts, the "commit_ts is too large" warning is logged and the transaction falls back to the normal 2PC commit mode.

Therefore, check the overall system load, disk I/O pressure, raft thread CPU utilization, and network latency between components at the time of the issue, and analyze them together.
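
For reference, the TSOs in the warning log at the top of the thread can be decoded directly in TiDB to confirm this 2-second window (TIDB_PARSE_TSO converts a TSO to wall-clock time, and shifting right by 18 bits extracts its physical part in milliseconds); a rough sketch:

```sql
-- Decode the timestamps from the warning above.
SELECT
    TIDB_PARSE_TSO(445465955077456458) AS start_ts,       -- prewrite start
    TIDB_PARSE_TSO(445465955601744458) AS max_commit_ts,  -- start_ts + 2 s window
    TIDB_PARSE_TSO(445465956086710572) AS min_commit_ts;  -- commit ts actually required

-- Physical gaps in milliseconds: min_commit_ts landed roughly 3.85 s after
-- start_ts, well past the ~2 s max_commit_ts, hence the fallback to normal 2PC.
SELECT
    (445465955601744458 >> 18) - (445465955077456458 >> 18) AS max_gap_ms,  -- = 2000
    (445465956086710572 >> 18) - (445465955077456458 >> 18) AS min_gap_ms;  -- ≈ 3850
```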

| username: hzc989 | Original post link

At present, there is no obvious bottleneck in CPU, memory, or disk IO resources. This issue is quite strange. As for lock conflict waiting, it has been confirmed with the business that there were no concurrent duplicate inserts at the time of the incident.

| username: hzc989 | Original post link

What type of insert are you referring to? This table has only two indexes, one is the primary key (clustered index), and the other is a unique index used for business purposes.

| username: heiwandou | Original post link

Check if there are any locks.

| username: TiDB_C罗 | Original post link

I encountered a similar situation where the time was consumed by prewrite, but not as long as yours. My insert is in the form of “insert into … on duplicate key update”. I still haven’t found the reason why prewrite takes so long.

| username: TiDB_C罗 | Original post link

  1. If it is an optimistic transaction, reduce concurrent writes.
  2. Switch to pessimistic mode (see the sketch below).
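
For item 2, a minimal sketch of how to check and switch the transaction mode (tidb_txn_mode and BEGIN PESSIMISTIC are the standard switches; note that the global setting only affects new sessions):

```sql
-- Check the current default transaction mode.
SHOW VARIABLES LIKE 'tidb_txn_mode';

-- Switch to pessimistic mode cluster-wide (new sessions only).
SET GLOBAL tidb_txn_mode = 'pessimistic';

-- Or force pessimistic locking for a single transaction.
BEGIN PESSIMISTIC;
-- ... the insert statements in question ...
COMMIT;
```
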
| username: oceanzhang | Original post link

Learned something new from this thread.

| username: hzc989 | Original post link

The pessimistic mode is here. Everyone mentioned reducing concurrency, but I don’t understand how it affects this issue. Our business usually involves high-concurrency writes, and now it happens occasionally. We can’t find any other signs of problems, nor have we pinpointed any performance bottlenecks.