TiKV Error Log Report

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TiKV err log errors (Tikv err日志报错)

| username: TiDBer_uWAor6XR

[TiDB Usage Environment] Production Environment
[TiDB Version] 4.0
[Reproduction Path] Error reported at 1:34 AM
[Encountered Problem: Phenomenon and Impact] All nodes restarted; the TiDB service was essentially unable to serve reads or writes
[Resource Configuration] Go to TiDB Dashboard - Cluster Info - Hosts and take a screenshot of this page
[Attachment: Screenshot/Log/Monitoring]
Available memory rose sharply at the time of the failure, from 50 GB to 200 GB.
tidb_issue_troubleshooting.docx (56.2 MB)

| username: zhanggame1 | Original post link

Has the restart been completed?

| username: TiDBer_uWAor6XR | Original post link

After the restart, the cluster returned to normal.

| username: zhanggame1 | Original post link

It looks like a network issue based on the logs.

| username: TiDBer_uWAor6XR | Original post link

I also suspected a network issue and checked network latency in the monitoring; it showed no fluctuations and remained quite stable. Judging by the final jemalloc error in the issue linked above, it might be related to a TiDB bug.

| username: zhanggame1 | Original post link

The version is too old, consider upgrading.

| username: TiDBer_uWAor6XR | Original post link

If it were a network issue, TiDB shouldn’t restart, and certainly not all nodes at once. It might be a cluster bug, but I’m not sure what would trigger it.

| username: Fly-bird | Original post link

Check the resource utilization of the cluster.
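For a quick host-level look, a minimal sketch using standard Linux tools (run on each TiDB/TiKV/PD node; assumes shell access to the hosts):

```shell
# Snapshot of current host resource usage on one node.
free -g                              # memory usage in GB
uptime                               # CPU load averages (1/5/15 min)
df -h /                              # root filesystem usage
iostat -x 1 3 2>/dev/null || true    # per-device I/O utilization, if sysstat is installed
```

For an incident at 1:34 AM, the Grafana dashboards deployed with the cluster (Overview, TiKV-Details) are more useful, since they keep the history.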

| username: Kongdom | Original post link

  1. Recommend upgrading.
  2. Available memory increased when the failure occurred, likely because the node restarts released it.

| username: TiDBer_uWAor6XR | Original post link

Each node has 64 cores, and the network is 10 Gigabit.

| username: xfworld | Original post link

What is the specific version number?

| username: oceanzhang | Original post link

Looking at the system logs, it seems to be a network issue, which can be cross-verified at the software level.

| username: TiDBer_uWAor6XR | Original post link

4.0.13

| username: TiDBer_uWAor6XR | Original post link

If it’s a network issue, I’m not sure how to explain it: all nodes restarted within about two minutes. Could a large number of regions failing to elect a leader cause the cluster to restart? Or could too many optimistic lock conflicts trigger it? (The TiKV logs also show many optimistic lock conflicts.)

| username: xfworld | Original post link

I suggest you upgrade to a later patch release. Check the bug-fix lists for 4.0.14, 4.0.15, and 4.0.16.
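If the cluster is managed with TiUP, the in-place patch upgrade is a single command. A sketch, where `mycluster` is a hypothetical cluster name (substitute your own from `tiup cluster list`):

```shell
# Upgrade the whole cluster in place to the last 4.0 patch release.
# "mycluster" is a placeholder; find your actual name with: tiup cluster list
tiup cluster upgrade mycluster v4.0.16
```

If the cluster was originally deployed with tidb-ansible rather than TiUP, import it first with `tiup cluster import` before upgrading.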


Region leader elections are related to the nodes where the replicas are located, but that won’t cause a restart, and too many optimistic lock conflicts won’t cause a restart either…

Only a panic could cause a node to hang or restart.
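To check whether the restarts were panics, grep the TiKV logs around the incident window for crash evidence. A minimal sketch; the sample log lines below are hypothetical, and in practice you would point the grep at the real `tikv.log` under the TiKV deploy directory:

```shell
LOG=/tmp/tikv_sample.log   # placeholder; replace with the real path, e.g. <deploy-dir>/log/tikv.log

# Hypothetical two-line sample so the command below has something to match.
printf '%s\n' \
  '01:34:02 [FATAL] memory allocation of 64 bytes failed' \
  '01:33:58 [WARN] KeyIsLocked' > "$LOG"

# Count lines suggesting a crash (panic, jemalloc failure, allocation failure, OOM).
grep -icE 'panic|jemalloc|allocation|oom' "$LOG"
```

Since available memory jumped from 50 GB to 200 GB at failure time, it is also worth ruling out the OS OOM killer with `dmesg -T | grep -iE 'oom|killed process'` on each node.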

| username: swino | Original post link

Have you found the cause of the problem?

| username: oceanzhang | Original post link

Was it resolved in the end? Can you share it?

| username: andone | Original post link

Restarting fixes everything.