Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.
Original topic: tidb写入慢,发现了几个异常指标求优化办法 (TiDB writes are slow; found several abnormal metrics, asking for optimization advice)
[TiDB Usage Environment] Production Environment
[TiDB Version] v4.0.14
[Encountered Problem: Phenomenon and Impact] Today we found that writes from one task are backing up. Checking the monitoring, several metrics look abnormal. Could you please advise on how to optimize this?
gRPC poll CPU usage has reached 200%
Rather than putting this much effort into optimization, you're better off just upgrading the version. We were on 4.0.9 before and many inserts were slow. After upgrading, performance improved significantly, and you can hardly find any inserts in the slow query log now.
Which version did you upgrade to? The iterations are so fast now that I’m actually afraid to upgrade.
Last year, we upgraded from 4.0.9 to 5.4.3. The iterations are fast, but now there’s no motivation to upgrade.
I looked at the documentation, and the upgrade seems simple. Did you run into any pitfalls during the upgrade?
Do you upgrade directly or create a new cluster and then synchronize the data over?
Both methods can achieve the upgrade.
It depends on the importance of your business. The most reliable method is to switch between the two clusters as primary and secondary.
Upgrade by synchronizing from a lower version to a higher version through CDC.
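For reference, a minimal sketch of the CDC-based approach: replicate from the old lower-version cluster into a freshly deployed higher-version cluster, then switch traffic once it has caught up and been validated. The PD address, sink URI, and changefeed ID below are placeholders, not values from this thread.

```shell
# Create a changefeed on the source (old) cluster that replicates into the new cluster.
# --pd points at the source cluster's PD; --sink-uri points at the target TiDB.
tiup cdc cli changefeed create \
  --pd="http://<source-pd-host>:2379" \
  --sink-uri="mysql://<user>:<password>@<target-tidb-host>:4000/" \
  --changefeed-id="upgrade-sync"

# Check the changefeed state and checkpoint lag before planning the cutover.
tiup cdc cli changefeed query \
  --pd="http://<source-pd-host>:2379" \
  --changefeed-id="upgrade-sync"
```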
I suggest upgrading to a newer version.
Yes, we have done many upgrades from 4.0 to 5.4, using CDC master-slave synchronization and then switching for the upgrade.
We have also done many in-place upgrades; both methods are feasible.
I upgraded in the middle of the night using --force. Without --force, it takes a long time to evict the leaders from each TiKV node. However, you should test everything thoroughly to make sure there are no issues before upgrading.
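For context, the in-place upgrade described above is driven by TiUP. A minimal sketch (the cluster name and target version below are placeholders, not taken from this thread):

```shell
# Sketch of an in-place rolling upgrade with TiUP; <cluster-name> is whatever
# `tiup cluster list` shows, and v5.4.3 is just an example target version.
tiup cluster upgrade <cluster-name> v5.4.3

# --force skips the graceful region-leader eviction on each TiKV node, so the
# upgrade finishes much faster, but clients may see errors/jitter while nodes
# restart. Only use it in a maintenance window, after testing the upgrade elsewhere.
tiup cluster upgrade <cluster-name> v5.4.3 --force
```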
Just upgrade directly. We tested upgrading from 4.0 to 5.4 without any issues before proceeding.
My current TiDB version is 5.2 and DM is 2.0.6. I previously tried an upgrade in testing, but ran into data synchronization issues with DM, and once they happened I couldn't downgrade either. Now I'm afraid to upgrade, because if something goes wrong I'll have my boss sitting next to me at work.
If it’s running stably and there are no special requirements, don’t upgrade.
Indeed, after we upgraded a cluster, the DDL execution became particularly fast.
The new version will fix many of the original bugs.
After 13:30, Command Per Second increased, QPS increased, and Duration decreased. It looks like a large number of small SQL statements suddenly started executing, driving up IO usage. It isn't necessarily slow SQL: slow SQL is generally the larger statements, and a high volume of small SQL can have a significant impact too. Check which small SQL statements saw a sharp increase in executions during this period.
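One way to do that is with the statement summary tables (available since v4.0). A rough sketch, assuming the MySQL client is used to connect; the host, port, user, and time filter are placeholders:

```shell
# Rank statement digests by execution count in the recent summary windows
# to spot the small SQL that suddenly started running in large volumes.
mysql -h <tidb-host> -P 4000 -u root -p -e "
  SELECT summary_begin_time,
         stmt_type,
         exec_count,
         ROUND(avg_latency / 1000000, 2) AS avg_latency_ms,
         LEFT(digest_text, 80)           AS sql_digest
  FROM   information_schema.cluster_statements_summary_history
  WHERE  summary_begin_time >= CONCAT(CURDATE(), ' 13:30:00')
  ORDER  BY exec_count DESC
  LIMIT  20;"
```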
- Duration: execution time
  - The time from when the client sends a request over the network to TiDB until TiDB finishes execution and returns the result to the client. Client requests are generally sent as SQL statements, but this can also include the execution time of commands such as `COM_PING`, `COM_SLEEP`, `COM_STMT_FETCH`, and `COM_SEND_LONG_DATA`.
  - Since TiDB supports Multi-Query, the client can send multiple SQL statements at once, such as `select 1; select 1; select 1;`. In that case, the recorded execution time is the total time after all of the statements have finished executing.
- Command Per Second: the number of commands processed by TiDB per second, counted by whether the execution result succeeded or failed.
- QPS: the number of SQL statements executed per second on all TiDB instances, categorized by statement type such as `SELECT`, `INSERT`, and `UPDATE`.
It feels like the anomaly is caused by slow network transmission.
Keep observing to determine whether writes to this table are slow in general or only during specific time periods, and narrow the problem down from there.
I suggest the experts put together a document organized by upgrade path, such as 3.x to 6.x, 4.x to 6.x, and so on.