After deploying a batch of services in the production environment, suspected performance issues have caused SQL execution to be generally slow, failing to meet business requirements

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 生产环境上了一批业务后,疑似性能问题导致 sql 执行普遍偏慢,达不到业务要求

| username: rebelsre

[TiDB Usage Environment] Production Environment
[TiDB Version] v7.1.1
[Reproduction Path] Migrated a batch of services to TiDB
[Encountered Problem: Symptoms and Impact] SQL execution is generally slow, often taking 1/2 seconds for queries, with a target of within 50ms
[Resource Configuration]



[Attachments: Screenshots/Logs/Monitoring]





image

| username: xfworld | Original post link

You can only check each slow query one by one; there’s no direct way to solve it…

Didn’t you do any testing before going live? Just went live directly?

| username: rebelsre | Original post link

It has been tested before, and a post was made to optimize it at that time. It could reach within 50ms back then, but unexpectedly, after deploying a wave of business directly…
tidb performance stress test - :ringer_planet: TiDB technical issues / performance tuning - TiDB Q&A community (asktug.com)

| username: 像风一样的男子 | Original post link

With such high concurrency, resources are getting tight, and both KV and TiDB need to be scaled up.

| username: xfworld | Original post link

No problem, just keep optimizing.

For individual slow query cases that you find difficult to optimize, you can continue to post and provide more details.

| username: h5n1 | Original post link

Are these the only 3 machines? The CPU usage of the TiDB server is too high, it needs to be expanded.

| username: xfworld | Original post link

Mixed deployment, resource contention, and isolation issues are likely to be quite troublesome.

| username: h5n1 | Original post link

Check the monitoring for tidb → server → get token duration.

| username: rebelsre | Original post link

Sorry, I can’t translate images. Please provide the text you need translated.

| username: h5n1 | Original post link

There is a limit of 3000 tokens, and you can adjust the parameters, but increasing it with such a high QPS without adding TiDB servers is probably not going to improve things.

| username: 芮芮是产品 | Original post link

Your machine malfunction is a great teaching material.

| username: rebelsre | Original post link

Where does this come from… Are you referring to the mismatch between machine resource allocation and business requirements?

| username: 路在何chu | Original post link

Expand the capacity. We had the same issue before. SQL was unstable, sometimes taking tens of milliseconds, and sometimes several seconds. When there are too many regions, this situation is likely to occur.

| username: 路在何chu | Original post link

You can observe whether the execution time of slow SQL Coprocessor is very high.

| username: xfworld | Original post link

Mixed deployment will be relatively difficult :yum:

| username: 芮芮是产品 | Original post link

Your machine has a sixteen-core CPU, and you configured it to use 19 cores. It’s surprising it didn’t crash, haha.

| username: rebelsre | Original post link

No, the machine has 96 cores, with a single NUMA node having 48 cores. Two TiKV instances are bound to NUMA0, two are bound to NUMA1, TiDB is bound to NUMA0, and PD is bound to NUMA1. So there should be contention.

| username: Jellybean | Original post link

Indeed, it is basically a classic and relatively complex mixed deployment issue. The multi-tenant feature can help solve it.

The quickest solution at this stage is to scale out. Later on, you can also partition the tidb-server, with different services using different computing nodes.

| username: TiDBer_小阿飞 | Original post link

Use EXPLAIN or EXPLAIN ANALYZE to see where the SQL execution time is mainly spent, and adjust the modification trigger rate for automatic table analysis. Check if fast analyze is enabled, as analysis in TiDB is very time-consuming!
SET GLOBAL tidb_enable_fast_analyze=1

| username: rebelsre | Original post link

After adding two more TiDB servers, the business feedback indicates that the average response time is within expectations. It seems that in the future, we will still need to separate the deployment of components. Mixed deployment is a big pitfall.