TiDB Write Slow, Found Several Abnormal Metrics, Seeking Optimization Methods

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tidb写入慢,发现了几个异常指标求优化办法

| username: TiDBer_freGslX6

[TiDB Usage Environment] Production Environment
[TiDB Version] v4.0.14
[Encountered Problem: Phenomenon and Impact] Today, we found that a task’s write operation is piling up. Upon checking the monitoring, several metrics are abnormal. Could you please advise on how to optimize this?

gRPC poll CPU usage has reached 200%

| username: 像风一样的男子 | Original post link

Optimizing so much is actually not as good as upgrading the version. I used 4.0.9 before and many inserts were slow. After upgrading, the performance improved significantly, and you can hardly see inserts in slow queries.

| username: TiDBer_freGslX6 | Original post link

Which version did you upgrade to? The iterations are so fast now that I’m actually afraid to upgrade.

| username: 像风一样的男子 | Original post link

Last year, we upgraded from 4.0.9 to 5.4.3. The iterations are fast, but now there’s no motivation to upgrade.

| username: TiDBer_freGslX6 | Original post link

I looked at the documentation, the upgrade seems simple. Did you encounter any pitfalls during the upgrade?

| username: Hacker007 | Original post link

Do you upgrade directly or create a new cluster and then synchronize the data over?

| username: Jellybean | Original post link

Both methods can achieve the upgrade.

It depends on the importance of your business. The most reliable method is to switch between the two clusters as primary and secondary.

| username: 路在何chu | Original post link

Upgrade by synchronizing from a lower version to a higher version through CDC.

| username: andone | Original post link

I suggest upgrading to a newer version.

| username: Jellybean | Original post link

Yes, we have done many upgrades from 4.0 to 5.4, using CDC master-slave synchronization and then switching for the upgrade.

We have also done many in-place upgrades; both methods are feasible.

| username: 像风一样的男子 | Original post link

I upgraded it in the middle of the night using --force. Without --force, it takes a long time to evict the leader from each KV. However, you should test all aspects to ensure there are no issues before upgrading.

| username: 像风一样的男子 | Original post link

Just upgrade directly. We tested upgrading from 4.0 to 5.4 without any issues before proceeding.

| username: Hacker007 | Original post link

My current TiDB version is 5.2 and DM is 2.06. I previously tried upgrading during testing, but encountered data synchronization issues with DM. After the issues, I couldn’t downgrade either. Now I’m afraid to upgrade because if something goes wrong, my boss will have to sit next to me at work.

| username: 像风一样的男子 | Original post link

If it’s running stably and there are no special requirements, don’t upgrade.

| username: Kongdom | Original post link

Indeed, after we upgraded a cluster, the DDL execution became particularly fast.

| username: dba远航 | Original post link

The new version will fix many of the original bugs.

| username: 春风十里 | Original post link

After 13:30, Command Per Second increased, QPS increased, and Duration decreased. It feels like a large number of small SQL executions suddenly increased, leading to higher IO usage. It may not necessarily be slow SQL, as slow SQL is generally larger. A large number of small SQLs can also have significant impact. You can check what small SQL executions have increased significantly during this period.

  • Duration: Execution time
    • The time from when the client network request is sent to TiDB to when TiDB finishes execution and returns to the client. Generally, client requests are sent in the form of SQL statements, but they can also include execution times for commands like COM_PING, COM_SLEEP, COM_STMT_FETCH, COM_SEND_LONG_DATA, etc.
    • Since TiDB supports Multi-Query, it can accept multiple SQL statements sent by the client at once, such as select 1; select 1; select 1;. In this case, the execution time statistics are the total time after all SQL statements have been executed.
  • Command Per Second: The number of commands processed by TiDB per second, counted based on whether the execution result is successful or failed.
  • QPS: The number of SQL statements executed per second on all TiDB instances, categorized by SELECT, INSERT, UPDATE types.
| username: dba远航 | Original post link

It feels like the anomaly is caused by slow network transmission.

| username: zhaokede | Original post link

Observe more closely to determine whether the slow writes are affecting this table in general or only during specific time periods, and further pinpoint the issue.

| username: TiDBer_小阿飞 | Original post link

I suggest that experts organize a document according to the upgrade versions, such as 3-6, 4-6, etc.