Issues with TiDB Business Query Latency and Timeout

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TiDB 业务查询延时和超时问题

| username: liujun6315

【TiDB Usage Environment】Production Environment
【TiDB Version】
【Reproduction Path】What operations were performed when the issue occurred
【Encountered Issue: Issue Phenomenon and Impact】
The development team reported timeouts and high latency on business queries. Checking the traffic visualization (Key Visualizer) in the TiDB Dashboard showed some slow SQL queries but no obvious read hotspots. Zabbix then showed frequent memory fluctuations, so we went on to check the system logs and Grafana metrics of the TiKV nodes.
【Resource Configuration】Go to TiDB Dashboard - Cluster Info - Hosts and take a screenshot of this page
【Attachments: Screenshots/Logs/Monitoring】

The system logs of the three TiKV nodes all show automatic restarts.

Each of the three TiKV nodes has 128 GB of memory, and storage.block-cache.capacity is set to 54 GB.

| username: 啦啦啦啦啦 | Original post link

grep "Out of memory" /var/log/messages to check if the logs indicate an OOM (Out of Memory) issue. Are there any other components mixed on the TiKV node?

| username: liujun6315 | Original post link

No "Out of memory" entries were found in the system logs, and no other components are deployed on the TiKV nodes.

| username: 啦啦啦啦啦 | Original post link

Looking at the monitoring, TiKV memory usage stays close to 100%. You might consider setting the block cache (storage.block-cache.capacity) to a smaller value.

| username: liujun6315 | Original post link

I can try that in the evening, since doing it during the day might affect other business operations. Does this parameter require a restart of the TiKV nodes to take effect?

| username: liujun6315 | Original post link

I checked the related SQL, and it seems that most of the time is spent on TiKV.

| username: 小龙虾爱大龙虾 | Original post link

Here's a suggestion: when masking IP addresses, cover the first few octets instead, so different instances can still be told apart. Revealing only the leading 192 doesn't convey much. :rofl:

| username: 小龙虾爱大龙虾 | Original post link

This SQL execution plan seems to have an issue.

| username: 啦啦啦啦啦 | Original post link

Yes, it requires a restart. Modify it with tiup edit-config during off-peak business hours.
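
A rough sketch of that tiup workflow, assuming a TiUP-managed cluster whose name is the placeholder `<cluster-name>`:

```bash
# Edit the cluster configuration; under server_configs -> tikv, set for example:
#   storage.block-cache.capacity: "40GB"
tiup cluster edit-config <cluster-name>

# Apply the change to TiKV only; reload restarts the TiKV instances one by one.
tiup cluster reload <cluster-name> -R tikv
```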

| username: 啦啦啦啦啦 | Original post link

Are there a large number of big queries? It is possible that the gRPC sending speed cannot keep up with the speed at which the Coprocessor outputs data, leading to OOM.
Refer to this link:

| username: liujun6315 | Original post link

The query volume is about 300,000 per day, and it is not very high at night. However, the problem has persisted from yesterday afternoon until noon today.

| username: liujun6315 | Original post link

Looking at the monitoring, it should have nothing to do with this; everything is within the normal range.

| username: 数据库真NB | Original post link

Temporarily increase the memory configuration and check what business workload runs from the afternoon into the evening.

| username: tidb菜鸟一只 | Original post link

My suggestion is to set storage.block-cache.capacity directly to 40 GB to bring memory usage down first, and then observe the peak memory usage over the next few days…
SET CONFIG tikv `storage.block-cache.capacity`='40960MiB';
This can be modified online and takes effect immediately.
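
A hedged sketch of applying and verifying this online change from a shell, assuming a reachable TiDB endpoint at 127.0.0.1:4000 and a mysql client (host, port, and user are placeholders):

```bash
# Shrink the block cache online on all TiKV instances.
mysql -h 127.0.0.1 -P 4000 -u root -e \
  "SET CONFIG tikv \`storage.block-cache.capacity\`='40960MiB';"

# Verify the value now reported by every TiKV instance.
mysql -h 127.0.0.1 -P 4000 -u root -e \
  "SHOW CONFIG WHERE type='tikv' AND name='storage.block-cache.capacity';"
```

Note that SET CONFIG is not written back to the configuration file, so to keep the value across restarts it still needs to be persisted, for example with tiup edit-config as mentioned above.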

| username: oceanzhang | Original post link

Out of Memory (OOM)

| username: zhanggame1 | Original post link

Log in to the machine, run top, and press Shift+M (capital M) to sort by memory usage, then check which process is consuming the memory.
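
For a non-interactive check, something like the following works as well (a sketch; output columns vary slightly across procps versions):

```bash
# List the top memory consumers, sorted by memory usage.
ps aux --sort=-%mem | head -n 10

# Or inspect the tikv-server process directly (RSS is reported in kilobytes).
ps -C tikv-server -o pid,rss,%mem,cmd
```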

| username: zhang_2023 | Original post link

In Grafana, under TiKV-Details, select the corresponding instance and check the RocksDB block cache size panel to confirm whether this is the issue.

| username: 哈喽沃德 | Original post link

Optimize the SQL.

| username: changpeng75 | Original post link

Please post the complete SQL statement and its execution plan for review.
Because the pushed-down operators execute in parallel, their execution time by itself should not be that long.