Why do large volumes of scheduler handle commands cause some nodes to become unavailable?

This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 大批量的scheduler handle command,导致部分节点不可用是为什么

| username: TiDBer_an

[TiDB Usage Environment] Production Environment
[TiDB Version] 6.5.4
[Reproduction Path] What operations were performed to cause the issue
[Encountered Issue: Issue Phenomenon and Impact]
[Resource Configuration] Enter TiDB Dashboard - Cluster Info - Hosts and take a screenshot of this page
[Attachments: Screenshots/Logs/Monitoring]

At 12:30, we noticed on the dashboard that the 4 TiKV nodes (labeled) on 2 machines were showing Disconnected, with the status occasionally flipping back to Up.

During this period, some SQL queries reported `region is unavailable` errors, and we then discovered that the cluster had been unwritable since 12:06.

Checking the logs from 12:06, we found a batch of warning logs but nothing else unusual.

After that, the cluster started hitting region unavailable errors.

The same log entries then kept repeating continuously.

Restarting via command in the meantime had no effect.

Finally, at 14:00, we rebooted the server itself. At 14:15 the cluster recovered and writes succeeded again; I'm not sure whether it recovered on its own or because of the reboot.

I would like to ask what that batch of scheduler handle commands does, and whether there are other ways to troubleshoot this.

| username: DBAER | Original post link

Check the resource usage of TiKV.

| username: TiDBer_an | Original post link

This is the resource usage of 2 machines.

The CPU usage is a bit strange: from 12:06 to 12:10 it suddenly dropped to almost nothing, while memory usage increased.

| username: Hacker_PtIIxHC1 | Original post link

Is there a PD log?

| username: Hacker_PtIIxHC1 | Original post link

You can also send the TiDB server logs for us to review together.

| username: Kamner | Original post link

By the way, check the dashboard to see if there are any large transactions.
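If the dashboard history is gone, the TiDB server logs can also hint at large transactions: statements that exceed the execution-time or memory thresholds are logged with the `expensive_query` keyword. A minimal sketch — the sample log file and its contents below are made up for illustration; point the `grep` at your real tidb.log:

```shell
# Hypothetical sample of TiDB log lines, standing in for a real tidb.log:
cat > /tmp/tidb_sample.log <<'EOF'
[2024/04/07 12:07:01.000 +08:00] [WARN] [expensivequery.go:190] [expensive_query] [cost_time=62s] [sql="INSERT INTO t SELECT ..."]
[2024/04/07 12:07:05.000 +08:00] [INFO] [conn.go:812] [query] [sql="SELECT 1"]
EOF

# Count expensive-query entries around the incident window (here: 1):
grep -c 'expensive_query' /tmp/tidb_sample.log
```

On a live cluster you could also query `information_schema.cluster_tidb_trx` to list transactions currently in flight.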

| username: DBAER | Original post link

This is a bit strange. The CPU usage has decreased, so there shouldn’t be a resource shortage. I noticed in the logs that some regions don’t have a leader. Let’s wait for the experts to analyze it.

| username: TiDBer_an | Original post link

Most of the PD logs are like this:

Filtering errors results in these:

Most of the TiDB server logs are like this:

| username: TiDBer_an | Original post link

Monitoring data from the last 3 hours is already gone.

| username: Ming | Original post link

Has TiKV been restarted multiple times? Can you find anything by searching for “Welcome” in the corresponding TiKV logs?
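For context on this suggestion: each TiKV process start prints a "Welcome to TiKV" banner, so counting that string in the log gives the number of (re)starts. A quick sketch, using a made-up sample log in place of the real tikv.log:

```shell
# Hypothetical sample log; on a real node, grep the actual tikv.log instead.
cat > /tmp/tikv_sample.log <<'EOF'
[2024/04/07 13:14:40.820 +08:00] [INFO] [lib.rs:85] ["Welcome to TiKV"]
[2024/04/07 14:01:56.828 +08:00] [INFO] [lib.rs:85] ["Welcome to TiKV"]
EOF

# One banner per process start, so this prints the restart count (2 here):
grep -c 'Welcome to TiKV' /tmp/tikv_sample.log
```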

| username: TiDBer_an | Original post link

On one of the TiKV nodes you can find:
[2024/04/07 13:14:40.820 +08:00] [INFO] [lib.rs:85] ["Welcome to TiKV"]
[2024/04/07 14:01:56.828 +08:00] [INFO] [lib.rs:85] ["Welcome to TiKV"]

But the first time should be a command restart, and the second time is a server restart.

| username: Ming | Original post link

It seems that at 12:50 memory usage rose and then dropped rapidly, which looks like a node going down or some other service crashing. What I don't understand is why the monitoring couldn't collect memory usage from one of the servers after 13:00.

| username: TiDBer_an | Original post link

Another thing I'm not sure about is why there is no memory data. It may be because the server on machine 05 was started manually, so some ECS monitoring agents didn't start automatically.

Here are the corresponding logs for machine 06:
[2024/04/07 12:51:50.149 +08:00] [INFO] [lib.rs:85] ["Welcome to TiKV"]
[2024/04/07 12:56:24.206 +08:00] [INFO] [lib.rs:85] ["Welcome to TiKV"]
[2024/04/07 13:53:06.783 +08:00] [INFO] [lib.rs:85] ["Welcome to TiKV"]

| username: TiDBer_21wZg5fm | Original post link

It is very likely caused by a TiKV anomaly.

| username: TiDBer_an | Original post link

Couldn’t find the reason for the exception…

| username: TIDB-Learner | Original post link

Today my test environment was extremely slow. To get straight to the point: TiDB and PD are deployed mixed on 3 nodes, all cloud hosts, and a colleague downgraded one node's configuration from 16 GB to 8 GB. After TiDB was scaled in on that node, leaving only PD there, the lag and slowness disappeared. It doesn't seem likely, but that's what actually happened. Could differing machine configurations for the same services be causing this?

| username: DBAER | Original post link

This message indicates that TiKV is running out of disk space. Check the disk usage: TiKV reserves some disk space and will not use 100% of it.
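For reference, the reservation mentioned here is controlled by `storage.reserve-space` in the TiKV configuration. The snippet below is only a sketch of where the setting lives (with what I believe is the default value), not a tuning recommendation:

```toml
# tikv.toml
[storage]
# TiKV pre-allocates this much space as an emergency margin (default "5GB").
# When free disk space falls below the threshold, TiKV starts rejecting
# writes, which clients can see as "region is unavailable" errors.
reserve-space = "5GB"
```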

| username: zhang_2023 | Original post link

There is an issue with TiKV.

| username: TiDBer_JUi6UvZm | Original post link

It’s very suspicious that TiKV reports insufficient space.