TiDB Dashboard Panel -> Overview -> Sudden Surge in QPS While CPU, Memory, and IO Are Normal

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TiDB Dashboard面板-》概况-》QPS突然飙升,CPU和内存、IO均正常

| username: zhimadi

[TiDB Usage Environment] Production Environment
[TiDB Version] v5.4.2
[Reproduction Path] Operations performed that led to the issue
None
[Encountered Issue: Problem Phenomenon and Impact]
TiDB Dashboard panel → Overview → QPS suddenly spiked, but CPU, memory, and IO are all normal. After checking, there was no abnormality in the business access volume.
May I ask, for such sudden spikes, which metrics can be used to pinpoint the cause? It’s quite alarming.
[Resource Configuration]
[Attachments: Screenshots/Logs/Monitoring]

| username: TiDBer_jYQINSnf | Original post link

Check Grafana to see which statement type is increasing, then trace it back on the business side.
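A rough sketch of how you could pull the per-type numbers out of Prometheus instead of eyeballing every Grafana panel. It assumes the counter `tidb_executor_statement_total` with a `type` label is what feeds the statement OPS panel in your deployment, and the Prometheus address is just a placeholder; both function names are made up for illustration.

```python
from statistics import median
from urllib.parse import urlencode

# Assumption: this counter (labelled by statement type) exists in your
# Prometheus; verify the metric name before relying on it.
STATEMENT_OPS_PROMQL = "sum(rate(tidb_executor_statement_total[1m])) by (type)"


def build_query_range_url(prometheus_base, start_ts, end_ts, step_seconds=15):
    """Build a Prometheus /api/v1/query_range URL for statement OPS by type."""
    params = {
        "query": STATEMENT_OPS_PROMQL,
        "start": start_ts,
        "end": end_ts,
        "step": step_seconds,
    }
    return prometheus_base.rstrip("/") + "/api/v1/query_range?" + urlencode(params)


def spike_ratio_by_type(series_by_type):
    """Map each statement type to peak / median of its samples."""
    ratios = {}
    for stmt_type, samples in series_by_type.items():
        if not samples:
            continue  # nothing recorded for this type in the window
        baseline = median(samples)
        peak = max(samples)
        if baseline > 0:
            ratios[stmt_type] = peak / baseline
        elif peak > 0:
            ratios[stmt_type] = float("inf")  # traffic appeared from nothing
        else:
            ratios[stmt_type] = 1.0  # flat zero, no spike
    return ratios
```

Fetch the URL for a window that covers the spike, feed the returned samples per type into `spike_ratio_by_type`, and the type with the largest ratio is the one to chase on the business side.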

| username: tidb菜鸟一只 | Original post link

Generate a comparison diagnostic report against a normal period and take a look.

| username: xfworld | Original post link

Upgrade~~~ Upgrade first, then talk.

| username: wzf0072 | Original post link

Learned something new. It reminds me of Oracle's AWR report.

| username: zhimadi | Original post link

There are too many metrics on the Grafana dashboard, which types do you usually look at?

| username: zhimadi | Original post link

Okay, I’ll give it a try. How do I view the report? What are the various metrics, and which ones are the main ones to look at?

| username: zhimadi | Original post link

If a program can run, I don’t dare to touch it lightly. :rofl:

| username: xfworld | Original post link

It has already been announced that this version has a major bug and is not recommended for use, yet you are still sticking with it…

| username: tidb菜鸟一只 | Original post link

First, let’s see what the biggest differences are.

| username: zhimadi | Original post link

tikv, approximate_region_size Approximate Region size MAX_DIFF:-534186557.44
tikv, snapshot_info tikv_snapshot_size MAX_DIFF:53056819.00
tikv, gc_info tikv_gc_keys_total_num write,next MAX_DIFF:14660.00
tikv, cache_hit tikv_memtable_hit MAX_DIFF:-124.00
That’s about it, the differences in other items are relatively small.

| username: h5n1 | Original post link

Which specific statement type in the QPS has increased?

| username: zhimadi | Original post link

Basically, it’s select.

| username: h5n1 | Original post link

The statements_summary and statements_summary_history tables can be used to check the SQL statements executed during that period. For the whole cluster, look at cluster_statements_summary_history.

Alternatively, you can select the time range in the SQL analysis section of the dashboard.
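For example, something along these lines could compare a normal window with the spike window. This is only a sketch: the column names are the ones I know from the statement summary tables, the `'Select'` spelling of `stmt_type` and the `LIMIT` are assumptions to check against your cluster, and `rank_exec_count_growth` is a made-up helper, not anything shipped with TiDB.

```python
# Run this once for a normal window and once for the spike window, with any
# MySQL-compatible client. The %s placeholders are the window start and end.
# Assumption: stmt_type is stored as 'Select'; adjust if your cluster differs.
WINDOW_SQL = """
SELECT digest, digest_text, SUM(exec_count) AS exec_count
FROM information_schema.cluster_statements_summary_history
WHERE stmt_type = 'Select'
  AND summary_begin_time >= %s AND summary_begin_time < %s
GROUP BY digest, digest_text
ORDER BY exec_count DESC
LIMIT 50
"""


def rank_exec_count_growth(baseline, spike):
    """Rank SQL digests by how much exec_count grew between two windows.

    baseline and spike map digest -> exec_count; a digest missing from one
    window counts as 0 there. Returns [(digest, delta)], largest first.
    """
    digests = set(baseline) | set(spike)
    deltas = {d: spike.get(d, 0) - baseline.get(d, 0) for d in digests}
    # Sort by delta descending, then digest for a stable order on ties.
    return sorted(deltas.items(), key=lambda kv: (-kv[1], kv[0]))
```

The digests at the top of the ranking are the statements whose execution count grew the most, which gives you a `digest_text` to take to the application owners.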

| username: zhimadi | Original post link

These don’t seem to have any anomalies.

| username: TiDBer_jYQINSnf | Original post link

If the sudden increase is due to select queries, check the traffic visualization (Key Visualizer) in the dashboard to see which area is significantly highlighted during that period, and then zoom in. On the left side, you can see which specific table it is. Then you can go to the business side to investigate accordingly.
Ultimately, you will have to investigate on the business side; the various monitoring panels only provide clues. The open-source version does not have SQL auditing.
Additionally, you can also check the slow queries to see if there are any unfamiliar SQL statements.

| username: zhimadi | Original post link

In the dashboard's traffic visualization for that time period there is a highlight, but it is just a single line. The corresponding table is one we use routinely, and its data volume is not large. Additionally, there are no abnormal SQL queries in the slow query log.

| username: h5n1 | Original post link

Did you see a spike in network-related metrics in the TiDB server monitoring during that period?

| username: TiDBer_jYQINSnf | Original post link

Well, according to what h5n1 said, check the network. If there’s a sudden increase in network activity, then it’s indeed an increase on the business side. If there’s no increase, I don’t know what the situation is either :man_shrugging:. If all else fails, look through the TiDB logs and see if any SQL statements were logged during that time. As I said before, everything seen on the TiDB side is just the result; the cause needs to be investigated on the business side.

| username: zhimadi | Original post link

It seems there isn’t any. Take a look at the picture: