CPU Usage of a KV Node Suddenly Increases

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 某个kv节点cpu突然升高 (“A TiKV node’s CPU suddenly rises”)

| username: 像风一样的男子

【TiDB Usage Environment】Production Environment

【Encountered Problem: Phenomenon and Impact】
Starting from 2 PM yesterday, the CPU usage of one TiKV node suddenly increased. By 5 PM, database latency began to rise, and some queries that normally take only tens of milliseconds started taking several seconds. Could you suggest some troubleshooting directions?

| username: 像风一样的男子 | Original post link

Here are the TiKV logs from that time period:
tikv.log (4.1 MB)

| username: TiDBer_vfJBUcxl | Original post link

Refer to this

| username: tidb菜鸟一只 | Original post link

Are there any hotspot issues shown on the dashboard? Also, please share the topology. Is the TiKV node with high CPU usage co-deployed with other components on the same host?
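
Besides the dashboard, one quick way to look for hotspots from SQL is the INFORMATION_SCHEMA.TIDB_HOT_REGIONS table. A minimal sketch (the ordering and LIMIT are just illustrative):

-- Current read hotspot regions, hottest first
SELECT DB_NAME, TABLE_NAME, INDEX_NAME, REGION_ID, MAX_HOT_DEGREE, FLOW_BYTES
FROM INFORMATION_SCHEMA.TIDB_HOT_REGIONS
WHERE TYPE = 'read'
ORDER BY FLOW_BYTES DESC
LIMIT 10;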

| username: 像风一样的男子 | Original post link

The gRPC duration for these two metrics is relatively high.

| username: 像风一样的男子 | Original post link

There is no mixed deployment, and there are no hotspot issues.


| username: 像风一样的男子 | Original post link

Only this TiKV node shows increased traffic and latency; the other TiKV nodes are normal. The TiKV logs also don’t show what it’s doing, which is very strange.

| username: 像风一样的男子 | Original post link

TiKV node 147 had served almost no reads for the past few days, but yesterday its read volume suddenly increased.
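
To see which statements were hitting that node, one option is to filter the cluster slow query table by coprocessor address. A sketch, assuming the slow log is enabled; the address pattern and time window below are placeholders to adjust:

-- Slow queries whose heaviest coprocessor work ran on node 147
SELECT Time, DB, Query_time, Cop_time, Process_keys, Cop_proc_addr, Query
FROM INFORMATION_SCHEMA.CLUSTER_SLOW_QUERY
WHERE Cop_proc_addr LIKE '%147%'
  AND Time BETWEEN '2023-01-01 14:00:00' AND '2023-01-01 18:00:00'
ORDER BY Query_time DESC
LIMIT 20;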


| username: tidb菜鸟一只 | Original post link

Has there been any recent change in the distribution of leaders?
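
Leader distribution per store can also be checked from SQL; a minimal sketch:

-- Leader and region counts per TiKV store
SELECT STORE_ID, ADDRESS, LEADER_COUNT, REGION_COUNT
FROM INFORMATION_SCHEMA.TIKV_STORE_STATUS
ORDER BY LEADER_COUNT DESC;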

| username: 像风一样的男子 | Original post link

No changes

| username: 像风一样的男子 | Original post link

I found that this TiKV node (147) runs a large number of tasks during this time period. Where can I find out what these tasks are doing?

| username: Jellybean | Original post link

Check the dashboard for read hotspots during this period; I suspect the problem is related to that.

| username: tidb菜鸟一只 | Original post link

That still points to a read issue: that metric is the number of concurrently running tasks in the Unified Read Pool.
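
To judge whether that task count is saturating the pool, you can compare it with the pool’s configured thread limits. A sketch using SHOW CONFIG (available in v4.0 and later):

-- Unified read pool settings on each TiKV instance
SHOW CONFIG WHERE type = 'tikv' AND name LIKE 'readpool.unified%';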

| username: 像风一样的男子 | Original post link

It doesn’t look like much of a hotspot.

| username: tidb菜鸟一只 | Original post link

Look up which tables these are: SELECT * FROM INFORMATION_SCHEMA.TABLES a WHERE a.TIDB_TABLE_ID IN ('720','1260'); then check the SQL statements that hit those tables during that time period.
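
If the statement summary is enabled, its history tables can surface the statements that touched those tables in the window; a sketch (the table name pattern is a placeholder):

-- Statement digests touching a given table, busiest first
SELECT SUMMARY_BEGIN_TIME, DIGEST_TEXT, EXEC_COUNT, AVG_LATENCY
FROM INFORMATION_SCHEMA.CLUSTER_STATEMENTS_SUMMARY_HISTORY
WHERE TABLE_NAMES LIKE '%your_db.your_table%'
ORDER BY EXEC_COUNT DESC
LIMIT 20;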

| username: tidb菜鸟一只 | Original post link

Check whether the two tables appear among the hot regions.
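
A direct way to check is to filter TIDB_HOT_REGIONS by those table IDs (a sketch):

-- Hot regions belonging to tables 720 and 1260, if any
SELECT TABLE_ID, DB_NAME, TABLE_NAME, REGION_ID, TYPE, FLOW_BYTES
FROM INFORMATION_SCHEMA.TIDB_HOT_REGIONS
WHERE TABLE_ID IN (720, 1260);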

| username: 像风一样的男子 | Original post link

No, it doesn’t seem to match.


| username: tidb菜鸟一只 | Original post link

Then check whether the hotspots are concentrated on node 147:

SELECT
	ss.ADDRESS,
	h.type,
	p.store_id,
	SUM(FLOW_BYTES),
	COUNT(1),
	COUNT(DISTINCT p.store_id),
	GROUP_CONCAT(p.region_id)  
FROM
	information_schema.TIDB_HOT_REGIONS h
JOIN information_schema.tikv_region_peers p ON h.region_id = p.region_id
AND p.IS_LEADER = 1
JOIN information_schema.TIKV_STORE_STATUS ss ON p.store_id = ss.store_id
GROUP BY
	h.type,
	p.store_id,
	ss.ADDRESS
ORDER BY
	SUM(FLOW_BYTES) DESC;

| username: 像风一样的男子 | Original post link

(The reply contained only a screenshot, which is not available here.)

| username: redgame | Original post link

Monitor it at 2 PM today.