Question: A TiDB Cluster Is Very Laggy and the Business Is Down. What Should the First Step Be?

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 提一个问题 一个tidb集群的问题,集群很卡,业务已经挂掉第一步该干嘛?

| username: tidb狂热爱好者

【TiDB Usage Environment】Production Environment
【TiDB Version】6.1
【Encountered Issue】The application suddenly reported that the database could not be accessed.
【Reproduction Path】What operations were performed that caused the issue
【Problem Phenomenon and Impact】
I tried two things at the time, but only one of them actually had an effect.
I limited SQL execution time, so that any statement running for more than a minute was killed.
The database then returned to normal.
In hindsight, this was still very risky.
So I would like to ask everyone what the right first step is in this situation.
The root cause was analyzed afterwards: it was a bug in the application code.
A for loop kept issuing a very slow SQL query, which gradually clogged TiDB.
【Attachments】


| username: tidb狂热爱好者

TiDB currently offers only a few ways to restrain inefficient SQL. One is to limit how long a SQL statement may run, and the other is to limit how much memory a SQL statement may use. I hope everyone can share other methods.
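
For reference, the two limits above correspond, as far as I know, to the system variables `max_execution_time` (in milliseconds) and `tidb_mem_quota_query` (in bytes). Below is a minimal Python sketch that only builds the `SET` statements as text and does not connect to any cluster. The helper name `build_limit_statements` and the 1 GiB quota are my own assumptions for illustration, not values from the incident.

```python
# Sketch only: produces SQL text for the two limits discussed above.
# The function name and the example values are hypothetical.

def build_limit_statements(max_seconds, mem_quota_bytes, scope="GLOBAL"):
    """Return SET statements for an execution-time limit and a memory quota.

    max_execution_time is expressed in milliseconds,
    tidb_mem_quota_query in bytes.
    """
    if scope not in ("GLOBAL", "SESSION"):
        raise ValueError("scope must be GLOBAL or SESSION")
    if max_seconds <= 0 or mem_quota_bytes <= 0:
        raise ValueError("limits must be positive")
    timeout_ms = int(max_seconds * 1000)  # seconds -> milliseconds
    return [
        f"SET {scope} max_execution_time = {timeout_ms};",
        f"SET {scope} tidb_mem_quota_query = {int(mem_quota_bytes)};",
    ]


if __name__ == "__main__":
    # One minute, as in the incident above; 1 GiB is an assumed quota.
    for stmt in build_limit_statements(60, 1 << 30):
        print(stmt)
```

A `GLOBAL` value normally applies to new connections, so sessions that are already open may keep their old settings. It is worth trying the values on a test cluster first, because a limit that is too low will also kill legitimate long-running statements.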

| username: Christophe

Check Grafana to see where the load is high.

| username: xiaohetao

Check Grafana for high CPU metrics.
Then look at the dashboard to see whether there are slow SQL queries. If there are, continue by checking the execution details of each slow statement.
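
If the slow statements are still running, one possible follow-up is to list them and kill the worst ones. The sketch below is only an illustration under some assumptions: it assumes the rows come from `INFORMATION_SCHEMA.CLUSTER_PROCESSLIST` (columns `ID`, `COMMAND`, `TIME`, `INFO`), and the helper names and the 60-second threshold are hypothetical. It only filters rows and builds `KILL TIDB` statements as text, so someone can review them before running anything.

```python
# Sketch only: picks long-running queries from processlist-style rows and
# builds KILL statements as text. Helper names and the threshold are
# hypothetical.

# Query whose result would be fed into the helpers below.
PROCESSLIST_SQL = (
    "SELECT INSTANCE, ID, USER, DB, COMMAND, TIME, INFO "
    "FROM INFORMATION_SCHEMA.CLUSTER_PROCESSLIST ORDER BY TIME DESC"
)


def pick_long_running(rows, threshold_seconds=60):
    """Return rows that are executing a statement for at least the threshold.

    Each row is a dict with at least the keys ID, COMMAND, TIME and INFO.
    Idle connections (COMMAND other than 'Query', or empty INFO) are skipped.
    """
    victims = [
        r for r in rows
        if r["COMMAND"] == "Query" and r["INFO"] and r["TIME"] >= threshold_seconds
    ]
    # Longest-running statements first.
    return sorted(victims, key=lambda r: r["TIME"], reverse=True)


def kill_statements(rows, threshold_seconds=60):
    """Build one KILL TIDB statement per long-running query."""
    return [f"KILL TIDB {r['ID']};" for r in pick_long_running(rows, threshold_seconds)]
```

Depending on the version and configuration, `KILL TIDB` may need to be run on the same TiDB instance that owns the connection, which is why the query above also selects the `INSTANCE` column.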