【TiDB Usage Environment】Production Environment
【TiDB Version】
【Reproduction Path】What operations were performed when the issue occurred
【Encountered Issue: Issue Phenomenon and Impact】
The development team reported timeouts and delays in business queries. The traffic visualization on the Dashboard showed some delayed SQL statements, but no obvious read hotspots. Later, Zabbix showed frequent memory fluctuations, so we went on to check the system logs and Grafana for the TiKV nodes.
【Resource Configuration】Go to TiDB Dashboard - Cluster Info - Hosts and take a screenshot of this page
【Attachments: Screenshots/Logs/Monitoring】
grep "Out of memory" /var/log/messages to check if the logs indicate an OOM (Out of Memory) issue. Are there any other components mixed on the TiKV node?
You can try it in the evening, as doing it during the day might affect other business operations. Does this parameter require a restart of the TiKV node to take effect?
A suggestion on IP masking: cover the first few octets instead, so the remaining digits still make the different instances easy to tell apart. Revealing only the leading 192 doesn't convey much.
Are there a large number of big queries? It is possible that the gRPC sending speed cannot keep up with the rate at which the Coprocessor produces data, so results pile up in memory and cause OOM.
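One way to check, sketched below against information_schema.cluster_slow_query (the one-day window and the limit are only illustrative, not from this thread), is to pull the statements with the largest memory usage during the affected period:

-- Top memory-consuming statements in the last day; large mem_max or
-- cop_proc_max values point at queries that could back up the Coprocessor.
SELECT time, query_time, mem_max, cop_proc_max, LEFT(query, 100) AS query_prefix
FROM information_schema.cluster_slow_query
WHERE time > NOW() - INTERVAL 1 DAY
ORDER BY mem_max DESC
LIMIT 10;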
Refer to this link:
Query volume is about 300,000 per day, and it is not very high at night. However, the problem has persisted from yesterday afternoon until noon today.
My suggestion is to set storage.block-cache.capacity directly to 40G, bring the memory usage down first, and then observe over the next few days how much memory it peaks at…
SET CONFIG tikv `storage.block-cache.capacity` = '40960MiB';
This is an online change and takes effect immediately; no TiKV restart is needed.
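To confirm the new value actually landed on every TiKV instance, a quick check (a sketch using SHOW CONFIG) is:

-- Each TiKV instance should now report the roughly 40 GiB capacity.
SHOW CONFIG WHERE type = 'tikv' AND name = 'storage.block-cache.capacity';

Also worth noting: to keep the value from being overwritten by later maintenance operations such as tiup reload or upgrade, update the same item via tiup edit-config as well.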
In Grafana, go to TiKV-Details, select the corresponding instance, and check the RocksDB block cache size panel to confirm whether that is where the memory is going.