The overall response of the cluster has slowed down

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 集群整体响应变慢

| username: Kongdom

[TiDB Usage Environment] Production Environment
[Encountered Issue] The cluster response has significantly slowed down, and TiKV nodes have gone offline. The server was restarted at 15:30.

Checking the TiDB logs, there are numerous warnings:

[WARN] [pd.go:131] ["get timestamp too slow"] ["cost time"=33.477587ms]

Checking the PD leader logs, there are no obvious errors.

Checking the TiKV logs, there are numerous errors:
[ERROR] [peer.rs:3488] ["failed to send extra message"] [err_code=KV:Raftstore:Transport] [err=Transport(Full)] [target="id: 5487695 store_id: 4"] [peer_id=5487696] [region_id=5487693] [type=MsgHibernateResponse]

At the same time, the network bandwidth remains consistently high.

| username: dbaspace | Original post link

You can try switching the PD leader.
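For reference, a minimal sketch of how a PD leader switch can be done with pd-ctl; the endpoint `http://127.0.0.1:2379` and the member name `pd-2` are assumptions, substitute your own:

```shell
# List PD members and see which one is the current leader
# (the -u endpoint here is an assumption; use your PD address).
pd-ctl -u http://127.0.0.1:2379 member

# Ask the current leader to step down; a new election follows.
pd-ctl -u http://127.0.0.1:2379 member leader resign

# Or transfer leadership to a specific member by name
# ("pd-2" is a hypothetical member name).
pd-ctl -u http://127.0.0.1:2379 member leader transfer pd-2
```

These commands run against a live cluster, so run `member` again afterwards to confirm the leader actually moved.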

| username: Kongdom | Original post link

Why switch if the PD leader didn’t report an error?

| username: xfworld | Original post link

Which version is it? It looks like a bug…
KV:Raftstore:Transport

This error means Raft message delivery is failing, so replica synchronization is broken…

| username: Kongdom | Original post link

Version 5.1.0

| username: xfworld | Original post link

Check the number of regions first.

See whether all the replicas have caught up after a night.
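The region and replica check above can be sketched with pd-ctl as follows; the endpoint is an assumption, substitute your own:

```shell
# Per-store region counts and leader counts
# (the -u endpoint is an assumption; use your PD address).
pd-ctl -u http://127.0.0.1:2379 store

# List regions that are missing replicas, i.e. replication
# has not yet caught back up after the restart.
pd-ctl -u http://127.0.0.1:2379 region check miss-peer
```

If `region check miss-peer` returns an empty list, all regions have their full replica count again.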

| username: Kongdom | Original post link

The region counts match across stores, but the leader counts differ slightly.

| username: xfworld | Original post link

Is it still giving an error? It shouldn’t be anymore. :stuck_out_tongue_winking_eye:

| username: Kongdom | Original post link

Still the same error.

| username: Kongdom | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.