Abnormal PD Server Goroutine Count Metric

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: PD server Goroutine Count metric abnormality

| username: 不输土豆

【TiDB Usage Environment】Production Environment
【TiDB Version】V6.5.0
【Reproduction Path】
No operations were performed. The user reported the following issue:

Then, upon checking the monitoring, the following abnormality was found:

In the logs, the following ERROR log exists:
[2023-08-09 14:41:51] [heartbeat_streams.go:119] ["send heartbeat message fail"] [region-id=7902052] [error="[PD:grpc:ErrGRPCSend]send request error: EOF"]
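To gauge whether this error affects only a few regions or many, the distinct region IDs in the PD log can be counted. A minimal sketch, assuming the default TiUP log file name `pd.log` (the path may differ in your deployment):

```shell
# Count how many distinct regions hit the heartbeat send failure in the PD log.
# (pd.log is the default log file name in a TiUP deployment; adjust the path as needed.)
grep "send heartbeat message fail" pd.log \
  | grep -oE 'region-id=[0-9]+' \
  | sort -u \
  | wc -l
```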

Additional Information:
This cluster is a downstream cluster: TiCDC tasks replicate data into it from other upstream clusters.

【Encountered Issue: Problem Phenomenon and Impact】The PD server Goroutine Count metric shows an abnormal increase.
【Resource Configuration】
【Attachments: Screenshots/Logs/Monitoring】

| username: WalterWj | Original post link

Capture a PD flame graph and upload it so a developer can take a look. You can also try restarting this PD instance.
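As a rough sketch of how to do that: PD exposes the standard Go pprof endpoints on its client port, so a CPU profile can be pulled with curl and rendered as a flame graph with `go tool pprof`. The address, port, and 30-second sampling window below are placeholders:

```shell
# Sample PD's CPU for 30 seconds via the standard Go pprof endpoint.
curl "http://<pd_address>:<pd_port>/debug/pprof/profile?seconds=30" -o pd-cpu.pprof

# Open an interactive view (including a flame graph) in the browser;
# this requires a local Go toolchain.
go tool pprof -http=:8080 pd-cpu.pprof
```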

| username: redgame | Original post link

When an exception occurs, also share the logs and context.

| username: Hacker_ZcrkjsVg | Original post link

We also encountered the same issue, and it was only resolved after restarting the entire cluster.

| username: Curry瀚 | Original post link

View the flame graph

| username: cassblanca | Original post link

When the cluster load is too high, the PD server automatically adjusts its scheduling strategy and increases the number of schedulers, which leads to a surge in the number of goroutines. As the first image shows, CPU load was also elevated, so it is very likely that some other high load was present at the time of the incident.
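One way to check this explanation against the actual data is to look at which stacks hold the extra goroutines: a goroutine dump grouped by identical stacks shows whether they belong to schedulers, heartbeat streams, gRPC senders, or something else. A minimal sketch, assuming PD's standard pprof endpoint and a placeholder address/port:

```shell
# Dump all goroutines grouped by identical stack traces.
curl "http://<pd_address>:<pd_port>/debug/pprof/goroutine?debug=1" -o pd-goroutines.txt

# Each group starts with a header like "123 @ 0x...", where the leading number
# is how many goroutines share that stack; list the largest groups first.
grep -E '^[0-9]+ @' pd-goroutines.txt | sort -rn | head
```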

| username: Jasper | Original post link

Use the command `curl http://<pd_address>:<pd_port>/debug/pprof/heap -o heap.log` to get the heap profile and check the flame graph.
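Note that the file returned by that endpoint is a binary pprof profile rather than plain text, so it is usually inspected with `go tool pprof`. A minimal example, assuming a local Go toolchain:

```shell
# Quick text summary of the biggest heap consumers.
go tool pprof -top heap.log

# Or render the downloaded heap profile as an interactive flame graph in the browser.
go tool pprof -http=:8082 heap.log
```

For a goroutine count issue specifically, the goroutine endpoint shown earlier in the thread may be more directly relevant than the heap profile.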