SHOW STATS_HEALTHY Returns No Data, TiDB Node CPU Spikes, and SQL Execution Is Slow

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: SHOW stats_healthy 无数据,导致tidb节点 cpu 飙升,执行sql慢

| username: TiDBer_an

[TiDB Usage Environment] Production Environment / Testing / PoC
[TiDB Version] 6.5.2
[Encountered Issue: Problem Phenomenon and Impact]
[Attachments: Screenshots/Logs/Monitoring]

SQL execution is extremely slow, and executing SHOW STATS_HEALTHY returns no data.
The cluster recovers only after the PD and TiDB server nodes are restarted.

Under what circumstances can this situation occur?

Here is the TiDB server unavailable warning log

There are 3 PD nodes: 2 are deployed on their own machines, and 1 shares a machine with TiFlash. That shared machine runs at about 30% CPU and 50% memory.
The TiDB server nodes are deployed separately.

| username: Soysauce520 | Original post link

Is the number of tables and partitions very large? Check the memory growth again.

| username: 有猫万事足 | Original post link

Do you have your deployment topology?
My guess is that this is a mixed deployment and PD may not be getting enough CPU.

| username: TiDBer_an | Original post link

There is a PD mixed with TiFlash, but the CPU usage on that machine is around 30%, and the memory usage is around 50%.

| username: TiDBer_an | Original post link

The number of tables is quite large, and partitions are not being used. The memory usage is also increasing.

| username: DBAER | Original post link

In practice, PD is usually co-deployed with TiDB. Co-deploying it with TiKV or TiFlash tends to cause various resource contention issues.

| username: tidb菜鸟一只 | Original post link

It is best not to deploy TiFlash together with any other nodes. It consumes a large amount of CPU when running, which may directly exhaust the CPU and cause unexpected issues for other nodes deployed together. Your CPU utilization has already reached 100%…

| username: 友利奈绪 | Original post link

It seems to be caused by the TiFlash mixed deployment. Some compute-heavy statements can consume all available resources. I previously had a slow SQL statement drive TiFlash usage up until no SQL could run at all.

| username: 小于同学 | Original post link

Are there a lot of tables and partitions? Check the memory growth again.

| username: TiDBer_an | Original post link

The screenshot is from the TiDB server machine. Of the 3 PD nodes, 2 are deployed separately and 1 is on a mixed-deployment machine, whose CPU usage is around 30%.

| username: tidb菜鸟一只 | Original post link

Are there several nodes for tidb-server, or just one?

| username: Soysauce520 | Original post link

  • Mitigated the issue of TiDB nodes running out of memory (OOM) when there are too many tables or table partitions to process #50077 @zimulala

If there are too many tables, consider upgrading to a later patch release, or use the table_id to look up the underlying table information in the mysql database.
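
As a minimal sketch of one possible reading of the table_id suggestion above: the snippet below only builds two query strings, one that maps a table_id to its schema and table name, and one that counts tables per schema. The names `table_lookup_sql` and `TABLE_COUNT_SQL` are hypothetical, and it assumes the intended lookup goes through the TiDB-specific `TIDB_TABLE_ID` column of `INFORMATION_SCHEMA.TABLES` rather than through the `mysql` schema. Running the queries requires a MySQL-compatible client, which is not shown here.

```python
# Assumption: TIDB_TABLE_ID in INFORMATION_SCHEMA.TABLES is the lookup path
# meant by the suggestion above. Helper names are hypothetical.

# Counts tables per schema, to judge whether "too many tables" applies.
TABLE_COUNT_SQL = (
    "SELECT TABLE_SCHEMA, COUNT(*) AS table_count "
    "FROM INFORMATION_SCHEMA.TABLES "
    "GROUP BY TABLE_SCHEMA "
    "ORDER BY table_count DESC"
)


def table_lookup_sql(table_id: int) -> str:
    """Build a query that maps a TiDB table_id to its schema and table name."""
    # Reject anything that is not a plain non-negative integer so the value
    # can be formatted into the statement safely.
    if isinstance(table_id, bool) or not isinstance(table_id, int) or table_id < 0:
        raise ValueError("table_id must be a non-negative integer")
    return (
        "SELECT TABLE_SCHEMA, TABLE_NAME, TIDB_TABLE_ID "
        "FROM INFORMATION_SCHEMA.TABLES "
        "WHERE TIDB_TABLE_ID = %d" % table_id
    )
```

For example, `table_lookup_sql(123)` yields a statement ending in `WHERE TIDB_TABLE_ID = 123`, which can be pasted into any SQL client connected to a tidb-server.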

| username: dba远航 | Original post link

There is an issue with the metadata in PD.

| username: TiDBer_an | Original post link

There are two.

| username: tidb菜鸟一只 | Original post link

Do both tidb-servers fail to return results for SHOW STATS_HEALTHY?
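
To make that comparison concrete, here is a rough sketch of how the output from each tidb-server could be triaged once it has been collected. It assumes the rows were already fetched with any MySQL-compatible client and follow the `SHOW STATS_HEALTHY` column order (Db_name, Table_name, Partition_name, Healthy). The function name `triage_stats_healthy`, the server labels, and the threshold of 80 are hypothetical choices for illustration.

```python
from typing import Dict, List, Tuple

# Assumed row shape: (Db_name, Table_name, Partition_name, Healthy)
Row = Tuple[str, str, str, int]


def triage_stats_healthy(
    rows_by_server: Dict[str, List[Row]], threshold: int = 80
) -> Dict[str, dict]:
    """Flag servers that returned no rows and list tables below the threshold."""
    report = {}
    for server, rows in rows_by_server.items():
        unhealthy = []
        for db, table, partition, healthy in rows:
            if healthy < threshold:
                # Include the partition name only when one is present.
                name = f"{db}.{table}" + (f"#{partition}" if partition else "")
                unhealthy.append((name, healthy))
        report[server] = {
            "empty": len(rows) == 0,  # the symptom described in this thread
            "unhealthy": sorted(unhealthy),
        }
    return report
```

A server whose entry has `"empty": True` reproduces the symptom from the original post, while a non-empty `"unhealthy"` list points at tables whose statistics may need attention.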