When the cluster data volume reaches 1T+, subsequent data writes become very slow

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 集群数据量达到1T+时,后续数据写入很慢

| username: Lystorm

[TiDB Usage Environment] Testing
[TiDB Version] V6.1
Environment configuration:
Number of instances: TiDB: 1, PD: 1, TiKV: 6
Hardware: three nodes in total, each node configured with 12C / 64 GB
Network card: 10 GbE
Deployment method: K8s
Underlying storage: Ceph
[Encountered Issues: Problem Phenomenon and Impact]

  1. The environment is newly deployed.
  2. The old environment had two large tables, one of 100 GB and one of 900 GB, plus roughly 500 GB of other scattered business data, about 1.5 TB in total. The old data was backed up with the BR tool and restored into the new environment; the 500 GB of data took about 2 hours to import, and the overall import efficiency was fairly high, so the exact write speed was not recorded.
  3. After the import, algorithm testing was run (using TiSpark to read a 50 GB table, add a column, and write the result back to a new table), along with writes of new business data.
  4. During step 3, the write throughput was noticeably poor.
  5. After deploying Grafana to monitor TiKV reads and writes, TiKV's read/write efficiency turned out to be very low.

[Resource Configuration]
[Attachments: Screenshots/Logs/Monitoring]

| username: h5n1 | Original post link

What type of storage disk was used in the old environment? Looking at the append/commit/apply duration, it seems like the disk performance is very poor.

| username: Lystorm | Original post link

The old environment used HDD disks, and K8S used local path storage.

| username: h5n1 | Original post link

You can check the disk_performance page, or the disk performance panels in node_exporter. The official recommendation is to use SSDs.

| username: Lystorm | Original post link

The Ceph storage in the new environment uses an SSD+HDD combination, and the measured I/O throughput reaches 400-500 MB/s.

| username: Billmay表妹 | Original post link

I suggest you refer to NetEase's solution; hot/cold data separation at a scale of 330 TB is completely feasible.

TiDB Hot and Cold Storage Separation Solution_NetEase Games_Li Wenjie.pdf

Your usage scenarios are very similar, and you can also consider optimizing by adjusting the data architecture.

| username: h5n1 | Original post link

Besides throughput, latency also matters. 400-500 MB/s is not particularly high. Let's look at the monitoring data first.

| username: Lystorm | Original post link

Which monitoring metrics can we look at?

| username: xfworld | Original post link

Refer to this metric.

| username: Lystorm | Original post link

The underlying I/O has been tested with tools such as fio and sysbench, and no obvious anomalies were found.

| username: xfworld | Original post link

If it is below the standard value, the disk write speed will be slower.

| username: h5n1 | Original post link

Check the disk_performance page, or the disk performance panels in node_exporter; the official recommendation is to use SSDs.
Grafana page:

| username: h5n1 | Original post link

The absence of obvious anomalies does not mean that the storage performance is up to standard.

| username: TiDBer_J9NiNCXD | Original post link

Having 14,000 regions on a single TiKV instance, along with full table read and write operations, can significantly impact the cluster’s performance.

| username: xfworld | Original post link

Plenty of clusters run with that many Regions; the Region count alone won't have an impact.

If there are hotspot issues, then there will be an impact… (that is, write skew and read skew)

| username: TiDBer_J9NiNCXD | Original post link

Is there a recommended number of regions for a single TiKV instance with 12 cores and 64GB of memory?

| username: xfworld | Original post link

A reasonable range is 20,000 to 30,000 Regions per TiKV instance. If the number exceeds this range, refer to the best practices and articles on massive Region scheduling for tuning.
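For reference, a minimal sketch of the kind of knobs those massive-Region best practices usually touch: Raftstore hibernation on the TiKV side and Region merge scheduling on the PD side. The values below only illustrate where the settings live and are not tuning recommendations for this cluster.

```toml
# tikv.toml (illustrative)
[raftstore]
# Let idle Regions stop sending Raft ticks/heartbeats, which reduces the
# per-Region overhead when a store carries tens of thousands of Regions.
hibernate-regions = true
```

```toml
# pd.toml (illustrative; these are the v6.1 defaults)
[schedule]
# Merge adjacent small or empty Regions to keep the total Region count down.
max-merge-region-size = 20        # MB
max-merge-region-keys = 200000
merge-schedule-limit = 8
```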

| username: TiDBer_J9NiNCXD | Original post link

Okay then

| username: TiDBer_jYQINSnf | Original post link

It looks like your apply log and commit log durations are quite long, which basically means disk IO is slow. If the hardware cannot be improved and you want to squeeze the most throughput out of TiDB, you can adjust the configuration along the following lines:

  1. Tolerate slightly higher latency and batch more. For example, if one disk IO used to write 1 KB of data, batch more so that each IO writes 4 KB.
  2. Increase the compression level of the underlying RocksDB. With higher compression there is less data on disk, and correspondingly less IO bandwidth is used.
  3. Increase data caching so that more data stays in memory, reducing disk IO.

The corresponding batch configuration parameters can be adjusted here: TiDB Configuration File | PingCAP Docs
The corresponding RocksDB compression settings can be adjusted here: TiKV Configuration File | PingCAP Docs
For caching, there are configurations at both the TiDB and TiKV levels:
TiKV level: TiKV Configuration File | PingCAP Docs
TiKV Configuration File | PingCAP Docs
TiDB level: TiDB Configuration File | PingCAP Docs
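To make idea 2 concrete, here is a minimal sketch of the compression setting in the TiKV configuration file (with a TiDB Operator deployment on K8s this would typically go in the TiKV config section of the TidbCluster CR); the value is only an illustration of trading CPU for less disk IO:

```toml
# tikv.toml (illustrative)
[rocksdb.defaultcf]
# The v6.1 default is ["no", "no", "lz4", "lz4", "lz4", "zstd", "zstd"].
# Using zstd on more levels keeps less data on disk and uses less IO
# bandwidth, at the cost of extra CPU for compression and decompression.
compression-per-level = ["no", "no", "zstd", "zstd", "zstd", "zstd", "zstd"]
# A similar setting exists under [rocksdb.writecf].
```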

The main caches in TiKV are the memtables (their size and number) and the block cache. The memtable size and count mainly affect writes and reads that immediately follow writes, while the block cache mainly affects reads. Both have monitoring metrics; you can check the cache hit rate on the RocksDB-kv panels.
The main caches in TiDB are the coprocessor result cache and the small-table cache; you can find the corresponding configurations in the docs above.
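A minimal sketch of where those cache knobs live, with purely illustrative sizes (assuming roughly 64 GB of memory per node; they have to be balanced against everything else running on the machine):

```toml
# tikv.toml (illustrative sizes only)
[storage.block-cache]
# Shared RocksDB block cache; mainly benefits reads.
capacity = "24GB"

[rocksdb.defaultcf]
# Memtable size and count; mainly benefit writes and read-after-write.
write-buffer-size = "256MB"
max-write-buffer-number = 8
```

```toml
# tidb.toml (illustrative)
[tikv-client.copr-cache]
# Cache for coprocessor (pushed-down) read results.
capacity-mb = 2000
```

The small-table cache, by contrast, is enabled per table with ALTER TABLE ... CACHE rather than sized in the configuration file.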

There is no free lunch. To improve performance with HDD, you have to sacrifice in other areas, such as using more memory and CPU.

Take your time to adjust, it’s quite fun. The parameters listed above are just a few examples, and there are many other methods, such as increasing the region size to reduce the bandwidth occupied by network Raft interactions. However, increasing the region size can easily lead to hotspots, so you need to balance various aspects. TiDB is like a Swiss Army knife with many fancy configurations.
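For the Region-size idea at the end, the relevant knobs are in the TiKV coprocessor section; a minimal sketch with illustrative values (larger Regions mean fewer Raft groups and heartbeats, but, as noted above, hotspots become more likely):

```toml
# tikv.toml (illustrative; the v6.1 defaults are 96MB / 144MB)
[coprocessor]
region-split-size = "256MB"
region-max-size = "384MB"
# region-split-keys / region-max-keys should be scaled up proportionally.
```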

| username: Lystorm | Original post link

Regarding idea 1, I would like to ask:
Is the setting max-batch-wait-time?
How do I control the amount of data sent in each batch through this value? For example, if I want each IO to write 4 MB, how should I set it?
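For reference, a minimal sketch of the batching knobs in the TiDB configuration file (values illustrative only). Note that max-batch-wait-time is specified as a wait time in nanoseconds rather than a byte size: it bounds how long requests are held back to form a larger batch, not how many bytes each IO writes.

```toml
# tidb.toml (illustrative values)
[tikv-client]
# Maximum number of RPC packets coalesced into one batch to TiKV.
max-batch-size = 128
# How long (in nanoseconds) to wait for more requests before flushing a batch;
# 0 disables the wait. A non-zero value trades a little latency for larger batches.
max-batch-wait-time = 1000000   # 1 ms, example only
# Maximum number of packets sent in one batch while waiting.
batch-wait-size = 8
```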