PD reports it cannot connect to etcd while CPU and memory usage spike; what is the cause?

translator_bot · June 20, 2024, 12:59pm

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: pd报etcd连接不上，CPU和内存飙升，请大佬指点原因

| username: TiDBer_mra9oJ0k

[TiDB Usage Environment] Production Environment
[TiDB Version] v8.1.0
[Reproduction Path] None
[Encountered Problem: Phenomenon and Impact]

Phenomenon
The newly built cluster (3 machines, 4 cores / 32 GB) ran stably for about four days. On June 17, 2024, around 9 AM, users reported slow application queries. When we checked CPU and memory, memory usage on two nodes was close to 100%. The problem went away after we restarted the machines and adjusted the TiKV memory settings.
Problem
According to the PD logs, there were numerous etcd access errors on all three machines at 9:09 AM (similar errors also appeared on June 12, 2024, with the same symptoms).
[2024/06/17 09:09:03.365 +08:00] [WARN] [retry_interceptor.go:62] ["retrying of unary invoker failed"] [target=etcd-endpoints://0xc001590000/10.88.0.202:2379] [attempt=0] [error="rpc error: code = Unavailable desc = etcdserver: leader changed"]
[2024/06/17 09:09:03.365 +08:00] [WARN] [retry_interceptor.go:62] ["retrying of unary invoker failed"] [target=etcd-endpoints://0xc0015901e0/10.88.0.202:2379] [attempt=0] [error="rpc error: code = Unavailable desc = etcdserver: leader changed"]
[2024/06/17 09:09:05.082 +08:00] [WARN] [retry_interceptor.go:62] ["retrying of unary invoker failed"] [target=etcd-endpoints://0xc001590780/10.88.0.201:2379] [attempt=0] [error="rpc error: code = Unavailable desc = keepalive ping failed to receive ACK within timeout"]
[2024/06/17 09:09:05.082 +08:00] [WARN] [retry_interceptor.go:62] ["retrying of unary invoker failed"] [target=etcd-endpoints://0xc0016761e0/10.88.0.201:2379] [attempt=0] [error="rpc error: code = Unavailable desc = keepalive ping failed to receive ACK within timeout"]
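To see which etcd endpoints fail and for what reason across a large PD log, the warnings can be grouped by endpoint and error text. The following is only a rough sketch: the regular expression is written against the four lines quoted above, and the names `LINE_RE`, `summarize`, and `SAMPLE_LINES` are made up for illustration. Other PD log lines may have a different layout and will simply be skipped.

```python
import re
from collections import Counter

# Pattern derived from the log lines quoted in this post (an assumption about
# the layout, not an official PD log grammar). Straight and curly quotes are
# both accepted because forum rendering sometimes converts them.
LINE_RE = re.compile(
    r'\[(?P<ts>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d{3} [+-]\d{2}:\d{2})\] '
    r'\[(?P<level>\w+)\] .*?'
    r'\[target=etcd-endpoints://[^/]+/(?P<endpoint>[^\]]+)\] .*?'
    r'\[error=["“](?P<error>.*?)["”]\]\s*$'
)


def summarize(lines):
    """Count warnings per (endpoint, reason); non-matching lines are ignored."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue
        # Keep only the text after "desc = ", e.g. "etcdserver: leader changed".
        reason = match.group("error").split("desc = ", 1)[-1]
        counts[(match.group("endpoint"), reason)] += 1
    return counts


# Two of the lines from this post, used as sample input.
SAMPLE_LINES = [
    '[2024/06/17 09:09:03.365 +08:00] [WARN] [retry_interceptor.go:62] '
    '["retrying of unary invoker failed"] '
    '[target=etcd-endpoints://0xc001590000/10.88.0.202:2379] [attempt=0] '
    '[error="rpc error: code = Unavailable desc = etcdserver: leader changed"]',
    '[2024/06/17 09:09:05.082 +08:00] [WARN] [retry_interceptor.go:62] '
    '["retrying of unary invoker failed"] '
    '[target=etcd-endpoints://0xc001590780/10.88.0.201:2379] [attempt=0] '
    '[error="rpc error: code = Unavailable desc = keepalive ping failed to '
    'receive ACK within timeout"]',
]

if __name__ == "__main__":
    for (endpoint, reason), n in summarize(SAMPLE_LINES).most_common():
        print(f"{n:5d}  {endpoint}  {reason}")
```

Running the same function over the full PD log from each of the three machines would show whether the "leader changed" and keepalive timeout errors cluster around the time memory was exhausted.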
PD Monitoring

[Image: PD monitoring screenshot]
etcd Monitoring

[Image: etcd monitoring screenshot]
Host Resources

[Image: host resources screenshot]

[Resource Configuration] Go to TiDB Dashboard - Cluster Info - Hosts and take a screenshot of this page
[Attachments: Screenshots/Logs/Monitoring]

translator_bot · June 20, 2024, 12:59pm

| username: Billmay表妹 | Original post link

[Resource Allocation] Go to TiDB Dashboard - Cluster Info - Hosts and take a screenshot of this page

Let’s take a look at your resource allocation.

translator_bot · June 20, 2024, 12:59pm

| username: TIDB-Learner | Original post link

So the whole cluster runs on just 3 machines, with 3 TiDB + 3 PD + 3 TiKV co-located on them (mixed deployment)?
In that case the components compete for the same CPU and memory, and the hosts will run short of resources.

translator_bot · June 20, 2024, 12:59pm

| username: tidb菜鸟一只 | Original post link

For mixed deployment, it’s crucial to set memory limits for tidb-server and tikv properly; otherwise they squeeze PD out of memory, and this is what happens.
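As a rough illustration of what setting the limits means on a 32 GB host, here is a small sketch that splits memory between tidb-server and tikv-server after reserving some for PD, the OS, and monitoring. The setting names `tidb_server_memory_limit` (TiDB system variable), `memory-usage-limit`, and `storage.block-cache.capacity` (TiKV config) are the ones I know from the docs, but the reserve size, the shares, and the 60% block-cache ratio are assumptions for illustration only; please check the v8.1.0 documentation for defaults and recommended values. The function name `split_memory` is hypothetical.

```python
def split_memory(total_gib, reserve_gib, tidb_share, tikv_share):
    """Split host memory between co-located tidb-server and tikv-server.

    reserve_gib is held back for PD, the OS, and monitoring (assumed value).
    The shares apply to what remains and must not add up to more than 1.
    """
    if reserve_gib >= total_gib:
        raise ValueError("reserve must be smaller than total memory")
    if tidb_share + tikv_share > 1.0:
        raise ValueError("shares add up to more than 100% of usable memory")

    usable = total_gib - reserve_gib
    tikv_limit = usable * tikv_share
    return {
        # TiDB system variable, e.g. SET GLOBAL tidb_server_memory_limit = '...'
        "tidb_server_memory_limit": round(usable * tidb_share, 1),
        # TiKV config item memory-usage-limit
        "memory-usage-limit": round(tikv_limit, 1),
        # Assumption: block cache set to 60% of the TiKV limit
        "storage.block-cache.capacity": round(tikv_limit * 0.6, 1),
    }


if __name__ == "__main__":
    # 32 GiB host as in this thread; 6 GiB reserve and 35/65 split are guesses.
    plan = split_memory(total_gib=32, reserve_gib=6, tidb_share=0.35, tikv_share=0.65)
    for name, gib in plan.items():
        print(f"{name}: {gib} GiB")
```

The point of the sketch is only that the limits of the co-located components, plus a reserve for PD, should add up to no more than the physical memory. If each component is left to size itself against the whole host, their combined limits can exceed 32 GB.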

translator_bot · June 20, 2024, 12:59pm

| username: wfxxh | Original post link

Please post your deployment topology.

translator_bot · June 20, 2024, 12:59pm

| username: tony5413 | Original post link

So it was caused by resource issues, right?

translator_bot · August 26, 2024, 6:03am

| username: Hacker_zuGnSsfP | Original post link

Has this been resolved? I ran into the same issue.