PD memory usage keeps rising and a large number of goroutines are created, eventually making the entire cluster unavailable

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: PD内存持续升高,大量goroutine产生最终导致整个集群不可用

| username: TiDBer_a3wWH0G2

[TiDB Usage Environment] Production Environment
[TiDB Version] 6.5.0
[Reproduction Path] Unknown
[Encountered Problem: Phenomenon and Impact]
PD memory usage increased significantly, eventually triggering swap and making the entire TiDB cluster unavailable. Is there a way to find out what caused the sudden increase in memory?

Client reports

The PD leader logs show that a PD election occurred, but judging from the final result, the leader did not actually switch.

There are a lot of fsync operations in etcd

A large number of scheduling operator operations are being generated

PD logs show a lot of heartbeat timeout errors

[Resource Configuration] Enter TiDB Dashboard - Cluster Info - Hosts and take a screenshot of this page

| username: TiDBer_vfJBUcxl | Original post link

(The original reply was only a link to a Zhihu article; its content was not carried over in the translation.)

| username: WalterWj | Original post link

Turn off swap; it's better to let PD trigger an OOM.
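
For reference, disabling swap on a Linux host generally looks like the sketch below. The exact steps depend on your distribution, and the `sed` edit of `/etc/fstab` is just one common way to keep swap off across reboots:

```bash
# Turn off all active swap devices immediately (lasts until reboot).
swapoff -a

# Discourage the kernel from swapping in the future.
sysctl -w vm.swappiness=0
echo "vm.swappiness = 0" >> /etc/sysctl.conf

# Comment out swap entries in /etc/fstab so swap stays off after a reboot
# (keeps a .bak backup of the original file).
sed -i.bak '/\sswap\s/ s/^/#/' /etc/fstab
```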

| username: Jasper | Original post link

  1. Check whether there are any hotspots on the TiKV side.
  2. Use `curl http://<pd_address>:<pd_port>/debug/pprof/heap -o heap.prof` to collect a heap profile and inspect it (for example, as a flame graph); see the sketch below.
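
A minimal sketch of collecting and reading the profile with the standard Go pprof tooling, assuming PD's client URL is `http://127.0.0.1:2379` (substitute your own address and port):

```bash
# Grab a heap profile from PD's built-in pprof endpoint.
curl http://127.0.0.1:2379/debug/pprof/heap -o heap.prof

# Since the symptom here is a goroutine explosion, a full goroutine
# dump is also worth collecting.
curl "http://127.0.0.1:2379/debug/pprof/goroutine?debug=2" -o goroutines.txt

# Interactive analysis: `top` lists the biggest allocators,
# `web` renders a call graph (requires graphviz).
go tool pprof heap.prof
```
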
| username: redgame | Original post link

Why does the monitoring show a significant increase in query load? That alone could cause the TiDB cluster to need more memory to handle the query requests.

| username: TiDBer_a3wWH0G2 | Original post link

After this incident, swap has been disabled. It is indeed better to let the process OOM directly, since that triggers an automatic leader switch.

| username: TiDBer_a3wWH0G2 | Original post link

Most likely, after the failure the application kept retrying, which generated even more requests.

| username: TiDBer_a3wWH0G2 | Original post link

I’ll collect a profile and take a look.

| username: TiDBer_a3wWH0G2 | Original post link

The issue has been resolved. The cause was that the leader priority of the faulty PD node was set higher than that of the other nodes, which prevented the leader from switching away from it.
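
For anyone who runs into the same situation: PD member leader priority can be checked and adjusted with pd-ctl. A sketch, assuming the PD endpoint is `http://127.0.0.1:2379` and the member names `pd-0`/`pd-1` are placeholders:

```bash
# List members; the JSON output includes each member's leader_priority.
pd-ctl -u http://127.0.0.1:2379 member

# Show the current leader.
pd-ctl -u http://127.0.0.1:2379 member leader show

# Lower the faulty node's priority (or raise another node's) so the
# leader can move away from it.
pd-ctl -u http://127.0.0.1:2379 member leader_priority pd-0 50
pd-ctl -u http://127.0.0.1:2379 member leader_priority pd-1 100
```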

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.