After Rebuilding the PD Cluster, Queries Frequently Report "PD server timeout"

translator_bot · June 22, 2024, 11:43pm

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: pd集群重搭后经常 pd server out

| username: kuuhaku

[TiDB Usage Environment] Production Environment
[TiDB Version] v5.3.3
[Reproduction Path] PD cluster has been rebuilt
[Problem Encountered: Symptoms and Impact]
Some tables return "PD server timeout" when queried.
Sometimes `select * from xxx where id=1` works,
but `select * from xxx limit 1` does not.

Experts, please take a look
[Resource Configuration]

[Attachments: Screenshots/Logs/Monitoring]

translator_bot · June 22, 2024, 11:43pm

| username: kuuhaku | Original post link

Supplement

translator_bot · June 22, 2024, 11:43pm

| username: h5n1 | Original post link

First, check the blackbox exporter / node exporter monitoring to see whether the PD leader node has any network latency, high CPU usage, or high disk latency.
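Alongside the monitoring dashboards, PD's own view of its members can be checked quickly. The following is only a minimal sketch, assuming PD's HTTP API is reachable and that `/pd/api/v1/health` returns a JSON list of members, each with `name` and `health` fields. The address `http://127.0.0.1:2379` and the helper names are placeholders, not something taken from this thread.

```python
import json
from urllib.request import urlopen


def unhealthy_members(health):
    """Return the names of PD members whose 'health' flag is missing or false."""
    return [m.get("name", "<unknown>") for m in health if not m.get("health", False)]


def fetch_pd_health(pd_addr, timeout=3.0):
    """Fetch the member health list from PD's HTTP API.

    pd_addr is a placeholder such as 'http://127.0.0.1:2379'; replace it with
    the client URL of one of your own PD nodes.
    """
    with urlopen(f"{pd_addr}/pd/api/v1/health", timeout=timeout) as resp:
        return json.load(resp)


# Example usage (requires a reachable PD node):
#   print(unhealthy_members(fetch_pd_health("http://127.0.0.1:2379")))
```

An empty list suggests that PD considers every member healthy. In that case the timeout is more likely caused by something outside PD membership, such as load or latency on the leader node.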

translator_bot · June 22, 2024, 11:43pm

| username: tidb狂热爱好者 | Original post link

Check whether CPU usage on TiKV and TiDB is high. I previously ran into a situation involving “execute limit”, where slow SQL queries caused the application to hang and the TiDB Dashboard was all red. After the slow SQL was fixed, PD could be reached again.

translator_bot · June 22, 2024, 11:43pm

| username: kuuhaku | Original post link

Is that something I should look at in Grafana?

translator_bot · June 22, 2024, 11:43pm

| username: h5n1 | Original post link

I think the main reason is that the tidb-server process is not running. You can check the status of the tidb-server process on the corresponding machine.
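Besides checking the process on the machine itself, one way to see whether a tidb-server is responding is to probe its status port. This is a rough sketch, assuming the default status port 10080 and that `/status` returns JSON with `version` and `connections` fields. The host name and helper names below are hypothetical.

```python
import json
from urllib.request import urlopen


def summarize_status(body):
    """Extract the version and connection count from a /status JSON body."""
    data = json.loads(body)
    return data.get("version"), data.get("connections")


def probe_tidb(status_url, timeout=3.0):
    """Return (version, connections) if tidb-server answers, else None.

    status_url is a placeholder such as 'http://tidb-host:10080/status'.
    """
    try:
        with urlopen(status_url, timeout=timeout) as resp:
            return summarize_status(resp.read().decode("utf-8"))
    except (OSError, ValueError):
        # Connection refused, timeout, or an invalid response body.
        return None


# Example usage (requires a reachable tidb-server):
#   print(probe_tidb("http://tidb-host:10080/status"))
```

A `None` result only means that the status endpoint did not answer. The process state and the tidb-server logs on that machine are still worth checking directly.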