Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.
Original topic: tidb 写入慢延迟高
[Overview] Scenario + Problem Overview
Under concurrent writes to TiDB, with around 100 concurrent business inserts, almost every insert statement takes 8-10 seconds to complete.
[Problem] Current Issue
Insert writes are very time-consuming.
The SQL statement is as follows, with 300 rows of data concatenated in each batch:
INSERT INTO `xxxx_ods_latest` (
    `marketplace_id`,
    `asin`,
    `item_md5`,
    `res_json`,
    `request_timestamp`
)
VALUES
    (...),
    (...),
    (...),
    (...)
ON DUPLICATE KEY UPDATE
    `item_md5` = IF(
        `xxxx_ods_latest`.`item_md5` != VALUES(`item_md5`),
        VALUES(`item_md5`),
        `xxxx_ods_latest`.`item_md5`
    ),
    `res_json` = VALUES(`res_json`),
    `request_timestamp` = VALUES(`request_timestamp`);
[Business Impact]
The Kafka consumers cannot keep up with the incoming messages, resulting in data delays.
[TiDB Version]
V5.4.0
[Attachments] Relevant logs and monitoring (https://metricstool.pingcap.com/)
It looks like network latency. Check the node_exporter and blackbox_exporter monitoring, and also check the disk IO situation.
Follow this link https://metricstool.pingcap.com/#backup-with-dev-tools to export the Overview, TiDB, TiKV-Details, and PD dashboards for the problem time range. Make sure all panels are fully expanded and loaded before exporting.
Here are the logs
tidb_log_json.zip (2.6 MB)
[Two monitoring screenshots from the original post could not be translated and are not reproduced here.]
Following the approach in the two monitoring screenshots above, check which part has become slow.
Is there a specific optimization plan?
First, identify which part is slow, then determine how to optimize it.
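For example, the slow query log can show which phase of the write is taking the time. A sketch through SQL, where the table-name filter and the one-hour window are only illustrative:

-- Break down where slow INSERTs spend their time (prewrite vs. commit vs. getting the commit TSO)
SELECT Time, Query_time, Prewrite_time, Commit_time, Get_commit_ts_time,
       Write_keys, Write_size, Query
FROM information_schema.cluster_slow_query
WHERE Query LIKE 'INSERT INTO %xxxx_ods_latest%'
  AND Time > NOW() - INTERVAL 1 HOUR
ORDER BY Query_time DESC
LIMIT 20;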
Some panels in the export have no data. Some TiKV nodes have taken on far more Regions than the others, which is driving their CPU utilization up. What are the values of the parameters split.qps-threshold and split.byte-threshold? Is the disk capacity of each TiKV node the same, and have region weight and leader weight been set?
split.qps-threshold = 3000
split.byte-threshold = 31457280
No region_weight or leader weight set.
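For reference, on v5.4 the Load Base Split thresholds can be checked and adjusted online through SQL. The statements below are a sketch (the lowered threshold is only an example value); the leader/region weights themselves are set with pd-ctl's store weight command rather than SQL:

-- Check the current Load Base Split thresholds on every TiKV instance
SHOW CONFIG WHERE type = 'tikv' AND name LIKE 'split.%';

-- Example: lower the QPS threshold so that hot Regions split sooner
SET CONFIG tikv `split.qps-threshold` = 2000;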
Is the write performance normal now? What is the current status of the two previous monitoring screenshots?
The latency metrics have come down. Yesterday we increased scheduler-worker-pool-size and storage.block-cache.capacity.
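To confirm that the change took effect on every TiKV instance, the values can be read back through SQL; a sketch:

-- Read back the adjusted settings from each TiKV instance
SHOW CONFIG WHERE type = 'tikv' AND name = 'storage.scheduler-worker-pool-size';
SHOW CONFIG WHERE type = 'tikv' AND name = 'storage.block-cache.capacity';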
Awesome, I will learn from you.
Check the settings for raftstore.store-pool-size and raftstore.apply-pool-size. These should be kept as consistent as possible: the former represents the producer's capacity, while the latter represents the consumer's capacity. If the former value is much larger than the latter, it can affect the write speed.
The values of these two parameters are both 2.
If your machines have spare resources, you can increase these two parameters appropriately to raise both the producing and consuming capacity.
The high tail latency of writes is mainly due to the raftstore thread propose/apply wait. You can follow the suggestion above and first increase the size of these two thread pools based on the current CPU and IO resource situation.
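For completeness, a sketch of checking and raising these two pools through SQL. The value 4 is only an example; if this version does not allow changing them online, SET CONFIG will return an error and the change would instead go through tiup cluster edit-config followed by a rolling restart:

-- Check the raftstore thread pool sizes on every TiKV instance
SHOW CONFIG WHERE type = 'tikv' AND name = 'raftstore.store-pool-size';
SHOW CONFIG WHERE type = 'tikv' AND name = 'raftstore.apply-pool-size';

-- Example: raise both pools where the hosts have spare CPU
SET CONFIG tikv `raftstore.store-pool-size` = 4;
SET CONFIG tikv `raftstore.apply-pool-size` = 4;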