TiDB Performance Benchmarking

This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tidb 性能压测

| username: rebelsre

[TiDB Usage Environment] Production Environment / Testing / PoC
[TiDB Version] v7.1.1
[Reproduction Path] 100 concurrent connections running `select * from xxx where xxx in (xxx);`, with approximately 50 values in the IN clause
[Encountered Issue: Problem Phenomenon and Impact]
Currently, the average latency under the stress test is about 180 ms; the target is under 50 ms. Not sure if this can be achieved. Some parameters have been adjusted as follows:

```
storage.block-cache.capacity = 64GB
storage.scheduler-worker-pool-size = 8
raftstore.store-pool-size = 4
readpool.unified.max-thread-count = 32
coprocessor.region-max-size = 256MB
coprocessor.region-split-size = 256MB
coprocessor.enable-region-bucket = true
```
[Resource Configuration]

[Attachments: Screenshots/Logs/Monitoring]

| username: tidb狂热爱好者 | Original post link

This is on the cloud, right? Here’s a suggestion: use local SSD. There’s a significant difference in disk performance. Don’t use cloud disks.

| username: 有猫万事足 | Original post link

At least from the architectural perspective, there is room for optimization. There are 4 TiKV instances squeezed onto data4. In high concurrency scenarios, there will definitely be contention for I/O.

| username: rebelsre | Original post link

They are all physical machines, with local SSDs.

| username: rebelsre | Original post link

Five machines, each with two TiKV instances, each occupying two disks [data3/4 in the diagram].

| username: zhanggame1 | Original post link

How large is the XXX data volume, and approximately how many records are returned?

| username: rebelsre | Original post link

The table has 797,226,774 rows (`count(*)`), and the query returns 87 records.

| username: 大飞哥online | Original post link

Monitor and check which resources are increasing rapidly during the stress test, and analyze step by step.

| username: Kongdom | Original post link

It is recommended to deploy according to the official configuration and then conduct stress testing. This mixed deployment does not represent actual performance.

| username: RenlySir | Original post link

For `select * from xxx where xxx in (xxx);`, check the index usage and see whether there is a hotspot.
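A sketch of how both checks might be done, assuming placeholder table/column names (`t`, `k`) and database name (`test`):

```sql
-- Placeholder names; substitute your own table and column.
-- 1. Verify the IN query uses an index rather than a full table scan:
EXPLAIN SELECT * FROM t WHERE k IN (1, 2, 3);

-- 2. Look for read hotspots among the hottest Regions:
SELECT TABLE_NAME, INDEX_NAME, MAX_HOT_DEGREE, FLOW_BYTES
FROM INFORMATION_SCHEMA.TIDB_HOT_REGIONS
WHERE TYPE = 'read' AND DB_NAME = 'test'
ORDER BY FLOW_BYTES DESC
LIMIT 10;
```

If the EXPLAIN output shows a `TableFullScan` instead of an `IndexRangeScan`/point get on the IN column, an index on that column is the first thing to add.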

| username: 有猫万事足 | Original post link

Then you need to see which phases those 180 ms are spent in.

Check the Database Time by SQL Phase graph in the monitoring for the corresponding time period of the stress test.

If the parse and compile phases have a high proportion, you can consider using the plan cache.
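A minimal sketch of using the prepared-statement plan cache, assuming TiDB >= v6.1 (where `tidb_enable_prepared_plan_cache` exists and defaults to ON) and a placeholder table `t`:

```sql
-- Assumption: TiDB >= v6.1; table/column names are placeholders.
SET GLOBAL tidb_enable_prepared_plan_cache = ON;

-- Plans are cached per connection for prepared statements,
-- skipping the parse/compile phases on repeated executions:
PREPARE stmt FROM 'SELECT * FROM t WHERE k IN (?, ?, ?)';
SET @a = 1, @b = 2, @c = 3;
EXECUTE stmt USING @a, @b, @c;

-- 1 if the last plan came from the cache, 0 otherwise:
SELECT @@last_plan_from_cache;
```

Note that the cache is per connection, so the benefit shows up only when the stress-test clients reuse connections and prepared statements.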

| username: 裤衩儿飞上天 | Original post link

Monitor the unified read pool CPU and take a look.

| username: rebelsre | Original post link

The default value of tidb_enable_clustered_index is INT_ONLY, which means that the clustered index is enabled by default only for tables with integer primary keys. If you want to enable the clustered index for all tables, you can set it to ON.
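A hedged example of the settings described above; the table definition and schema name are placeholders:

```sql
-- Only affects tables created after the change:
SET GLOBAL tidb_enable_clustered_index = ON;

-- Or declare it explicitly per table, regardless of the variable:
CREATE TABLE t_clustered (
    id VARCHAR(64),
    v  BIGINT,
    PRIMARY KEY (id) CLUSTERED
);

-- Check whether an existing table is clustered:
SELECT TIDB_PK_TYPE
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA = 'test' AND TABLE_NAME = 't_clustered';
```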

| username: RenlySir | Original post link

Another point, has NUMA binding been done for hybrid deployment? This is also quite important.
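For TiUP-managed clusters, NUMA binding can be declared per instance in the topology file. A hypothetical fragment (hosts, ports, and paths are placeholders; `numactl` must be installed on the hosts):

```yaml
# Hypothetical TiUP topology fragment: two TiKV instances on one host,
# each pinned to its own NUMA node and its own data disk.
tikv_servers:
  - host: 10.0.0.1
    port: 20160
    status_port: 20180
    data_dir: /data3/tikv
    numa_node: "0"
  - host: 10.0.0.1
    port: 20161
    status_port: 20181
    data_dir: /data4/tikv
    numa_node: "1"
```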

| username: tidb菜鸟一只 | Original post link

Check the dashboard to see which step the SQL is slow at.

| username: 托马斯滑板鞋 | Original post link

64C 512G? How many NUMA nodes? Is each SSD mounted with its own file system? Ideally, bind each TiKV to one NUMA node with exclusive use of one SSD.

A long time ago, we also tested on Taishan 920. On a single physical machine with 4 nodes, a total of 128C, the first three nodes each bound to one TiKV, exclusively occupying one SSD, and the last one bound to network interrupts. With a single client machine and 1000 concurrent connections, it could reach 1.6 million TPMC, with a 99th percentile latency of approximately 10ms.

| username: h5n1 | Original post link

Post the execution plan and execution time.
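For a latency breakdown, `EXPLAIN ANALYZE` is usually more informative than plain `EXPLAIN`, since it actually executes the statement and reports per-operator execution time and row counts. A sketch with placeholder names:

```sql
-- Placeholder table/column names; EXPLAIN ANALYZE executes the query,
-- so run it against a test environment or accept the extra load.
EXPLAIN ANALYZE SELECT * FROM t WHERE k IN (1, 2, 3);
```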

| username: rebelsre | Original post link

| username: rebelsre | Original post link

128C 512G, mixed deployment, NUMA binding not enabled.

| username: rebelsre | Original post link

Currently, it appears that the clustered index is not being used. We will test this to see if there is any improvement.