TiDB v7.1.2 Experiences Writes Dropping to 0 Every 2 Hours During the Full Data Ingestion Stage

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TIDB v7.1.2 在全量灌入数据阶段每2个小时出现一次写陡降到0

| username: residentevil

[TiDB Usage Environment] Production Environment
[TiDB Version] v7.1.2
[Encountered Problem: Phenomenon and Impact] During the full data import phase on TiDB v7.1.2, writes drop to 0 every 2 hours. From the TiDB slow query log, even a simple write can take several seconds, and there are no read SQL statements during this period. How can this issue be located, and can you provide some troubleshooting ideas? [No need to consider hardware issues or migration tasks]
[Attachments: Screenshots/Logs/Monitoring]

TiDB Affected Rows Monitoring:

TiKV Monitoring Information:

| username: Jellybean | Original post link

Looking at these monitoring charts, the issue appears at fairly regular intervals. We should focus on TiKV, especially GC. First, check whether the GC trigger interval is every 2 hours.
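
For example, a quick sanity check from any SQL client; this is only a sketch, assuming the standard v7.1 GC system variables and the `mysql.tidb` bookkeeping table:

```sql
-- How often GC runs and how long old MVCC versions are kept.
SHOW VARIABLES LIKE 'tidb_gc%';

-- GC bookkeeping: last run time, current safe point, etc.
SELECT VARIABLE_NAME, VARIABLE_VALUE
FROM mysql.tidb
WHERE VARIABLE_NAME LIKE 'tikv_gc%';
```

If `tidb_gc_run_interval` is still at its 10m default, a strict 2-hour cycle probably comes from somewhere else, but this is cheap to rule out first.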

| username: xfworld | Original post link

Check the dashboard to see if there are any hotspots.
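
Besides the Key Visualizer in the Dashboard, hot Regions can also be listed from SQL. A rough sketch (the `TIDB_HOT_REGIONS` table is standard; the LIMIT is arbitrary):

```sql
-- Top write-hot Regions and the tables/indexes they belong to.
SELECT DB_NAME, TABLE_NAME, INDEX_NAME, REGION_ID, FLOW_BYTES
FROM INFORMATION_SCHEMA.TIDB_HOT_REGIONS
WHERE TYPE = 'write'
ORDER BY FLOW_BYTES DESC
LIMIT 10;
```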

| username: 小龙虾爱大龙虾 | Original post link

What about the SQL response time chart? Are there any slow SQL queries or active session charts?

| username: zhanggame1 | Original post link

There should be some logs indicating such a significant drop. Check the logs for TiDB, TiKV, and PD around that time.
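
If the cluster is still up, the component logs can be pulled in one place through SQL as well. A sketch only; the time window below is a placeholder and should be replaced with the minutes around an actual dip:

```sql
-- WARN/ERROR logs from TiDB, TiKV and PD around one drop.
-- Placeholder time range: point it at a real dip.
SELECT TIME, TYPE, INSTANCE, LEVEL, MESSAGE
FROM INFORMATION_SCHEMA.CLUSTER_LOG
WHERE TIME BETWEEN '2023-11-20 10:00:00' AND '2023-11-20 10:10:00'
  AND LEVEL IN ('warn', 'error')
  AND TYPE IN ('tidb', 'tikv', 'pd');
```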

| username: residentevil | Original post link

The GC layer has been analyzed; it's not the issue.

| username: residentevil | Original post link

No read requests

| username: Jellybean | Original post link

Analyze this slow SQL in detail; you can see the specific slow execution stages from the Dashboard.
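
The same breakdown is also queryable from the slow query tables. A sketch, assuming the v7.1 slow log schema; the time window is again a placeholder:

```sql
-- Where the time goes for the slowest writes in one dip:
-- parse/compile vs. prewrite vs. commit.
SELECT Time, DB, Query_time, Parse_time, Compile_time,
       Prewrite_time, Commit_time, Write_keys, Write_size, Query
FROM INFORMATION_SCHEMA.CLUSTER_SLOW_QUERY
WHERE Time BETWEEN '2023-11-20 10:00:00' AND '2023-11-20 10:10:00'
  AND Is_internal = 0
ORDER BY Query_time DESC
LIMIT 10;
```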

| username: h5n1 | Original post link

https://metricstool.pingcap.net/
Export the overview/tidb/tikv-detail/node-exporter/PD snapshots by following these steps:

  1. Open the monitoring dashboard.
  2. Select the monitoring time and make sure to expand all panels.
  3. Wait for the data to load completely (press ‘d’ then ‘E’ to open all Rows’ Panels, and wait for a while until the page is fully loaded).
  4. Refer to https://metricstool.pingcap.net to export Grafana data as a snapshot.

| username: residentevil | Original post link

I’ll refer to it, thank you.

| username: TiDBer_gxUpi9Ct | Original post link

Check if there are any hotspots.

| username: 有猫万事足 | Original post link

l0-files-threshold

  • When the number of L0 files in KvDB reaches this threshold, the flow control mechanism starts to work. When the value of enable is true, it will override the configuration of rocksdb.(defaultcf|writecf|lockcf).level0-slowdown-writes-trigger.
  • Default value: 20

soft-pending-compaction-bytes-limit

  • When the pending compaction bytes of KvDB reach this threshold, the flow control mechanism starts to reject some write requests and reports ServerIsBusy. When the value of enable is true, it will override the configuration of rocksdb.(defaultcf|writecf|lockcf).soft-pending-compaction-bytes-limit.
  • Default value: “192GB”

Too many L0 files (exceeding the default threshold of 20) and pending compaction bytes exceeding the default limit of roughly 192 GB (the figure shows it reaching 500 GB) trigger TiKV flow control.
You can probably see ServerIsBusy in the TiDB logs.

It is recommended to check the write stall section in the documentation and adjust the parameters accordingly.
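
These thresholds live under `storage.flow-control` in the TiKV configuration. A sketch of how to inspect and, carefully, relax them during the bulk load; the values 60 and '1TB' are only illustrative, and if `SET CONFIG` reports an item is not online-changeable it has to go into tikv.toml instead:

```sql
-- Current flow-control settings on every TiKV instance.
SHOW CONFIG WHERE TYPE = 'tikv' AND NAME LIKE 'storage.flow-control%';

-- Illustrative values only: raise the thresholds while the full import runs.
SET CONFIG tikv `storage.flow-control.l0-files-threshold` = 60;
SET CONFIG tikv `storage.flow-control.soft-pending-compaction-bytes-limit` = '1TB';
```

Raising the limits only postpones the stall if compaction cannot keep up, so it is worth watching the pending compaction bytes curve after any change.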

| username: residentevil | Original post link

Brother, I thought the same thing. It indeed triggered a WRITE STALL, haha. Thank you for your hard work.

| username: 有猫万事足 | Original post link

Actually, just by looking at how precise your monitoring graph is, I feel like you already have a good idea of what’s going on. :joy:

| username: residentevil | Original post link

I do have relatively little experience in this area, so I still need to ask for advice, haha.

| username: TiDBer_小阿飞 | Original post link

If there is no read, then it is a write issue. Write issues are nothing more than IO bottlenecks, slow analysis at the TiDB server layer, slow writes to RocksDB in TiKV, accumulation in the Raftstore thread pool, or problems with PD scheduling. You should check each of these one by one.

The Raftstore thread pool is the most complex thread pool in TiKV, with a default size (raftstore.store-pool-size) of 2. All write requests are first written to RocksDB in fsync mode by the store thread (unless raftstore.sync-log is manually set to false; setting raftstore.sync-log to false can improve write performance to some extent but also increases the risk of data loss). Due to the presence of I/O, the store thread theoretically cannot reach 100% CPU usage. To minimize the number of disk writes, multiple write requests are batched together and written to RocksDB. It is best to keep its CPU usage between 40% and 60%. Do not blindly increase the size of the store thread pool in an attempt to improve write performance, as this may backfire: it increases the disk burden and can make performance worse.
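
As a side note, raftstore.sync-log has been deprecated in newer releases (Raft data is always synced), so on v7.1.2 the tunable part is mainly the pool sizes. A quick way to confirm the current values, assuming nothing has been overridden:

```sql
-- Current raftstore thread pool sizes on each TiKV instance.
SHOW CONFIG WHERE TYPE = 'tikv' AND NAME LIKE 'raftstore.%pool-size';
```

Cross-check with the Thread CPU panels in TiKV-Details to see whether the store/apply threads are saturated exactly when the dips happen.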

| username: residentevil | Original post link

Your analysis is very thorough, impressive.

| username: h5n1 | Original post link

Could you please share a screenshot of TiKV-Details → RocksDB-kv → Write Stall Reason?

| username: 路在何chu | Original post link

Please send a screenshot to take a look.

| username: xingzhenxiang | Original post link

I am also importing data on v7.1.2 to see if there are any issues. It has been running for a week.