TiKV memory usage is very high, TiKV cannot start, SQL query fails

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TIKV内存占比很高,TIKV无法启动,SQL查询失败

| username: Steve阿辉

[TiDB Usage Environment] Production Environment
[TiDB Version] 6.1.2
[Reproduction Path] Operations that led to the issue: The cluster was previously on version 5.1.3. We reduced the TiKV configuration from 16 cores and 32GB to 8 cores and 32GB, then upgraded TiKV to version 6.1.2.
[Encountered Issue: Symptoms and Impact] On version 5.1.3, memory usage increased after we reduced the configuration, but TiKV never hit OOM, went offline, or restarted. We later saw that the newer official release included memory optimizations, so we upgraded to 6.1.2. Since then, memory usage has stayed at around 95%, and any large query causes TiKV to hit OOM, go offline, and restart.

Separately, a SQL query that previously ran fine now fails to execute since TiKV became unstable and the version was upgraded. We are not sure whether this is caused by a rule change in the new version or by something else.

Today, during normal business hours with high QPS and high memory usage, we had to write data to the cluster, which caused TiKV to hit OOM and go offline. This time, however, it did not restart automatically.

[Resource Configuration]
[Attachments: Screenshots/Logs/Monitoring]

| username: 我是咖啡哥

What about the logs?

| username: 特雷西-迈克-格雷迪

How are the TiKV memory parameters configured?

| username: Billmay表妹

Have all components been upgraded to 6.1.2, or was only TiKV upgraded while the others were left on the old version?

| username: Billmay表妹

Please follow the official requirements for deployment in the production environment~

| username: Steve阿辉

After several days of troubleshooting and analysis, and with special thanks to the friends in the TiDB community for their help, I have documented the resolution process.

First, the high memory usage: we resolved it by setting the TiKV memory parameters, which kept memory usage from climbing too high. You can refer to this post: https://my.oschina.net/boreboluomiduo/blog/5535983

Second, TiKV failed to start because some of its files were corrupted. There are two ways to handle this. One is to remove the node and then add it back, which is very time-consuming. This is the method we used; the link to the relevant post is in the community documentation.

The other is to locate and repair the corrupted files with the TiKV Control tool. The tool can print information about the corrupted SST files, which can then be repaired or skipped at startup. We did not try this method, so for the specific steps please refer to the community documentation.

Reference: TiKV Control User Guide | PingCAP Docs

Third, the SQL query failed because a subquery written with WITH was not using the index. After investigating, we added an index hint after the table name in the subquery: FROM table_name FORCE INDEX (index_name), and the query speed returned to normal. Later, we refreshed the table statistics with ANALYZE TABLE table_name and found that the query also ran fine without forcing the index.
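As a rough illustration of the two steps above, here is a sketch of what the statements can look like. The table `orders`, its columns, and the index `idx_created_at` are hypothetical names made up for this example, not the ones from our cluster.

```sql
-- Hypothetical schema: table `orders` with an index `idx_created_at` on `created_at`.

-- Step 1: force the index inside the WITH subquery so the optimizer uses it.
WITH recent_orders AS (
    SELECT user_id, amount
    FROM orders FORCE INDEX (idx_created_at)
    WHERE created_at >= '2022-11-01'
)
SELECT user_id, SUM(amount) AS total_amount
FROM recent_orders
GROUP BY user_id;

-- Step 2: refresh the table statistics.
ANALYZE TABLE orders;

-- Step 3: check the plan without the hint to see whether the index is now chosen.
EXPLAIN
WITH recent_orders AS (
    SELECT user_id, amount
    FROM orders
    WHERE created_at >= '2022-11-01'
)
SELECT user_id, SUM(amount) AS total_amount
FROM recent_orders
GROUP BY user_id;
```

If the plan in step 3 already shows an index range scan on the hypothetical index, the hint can be dropped, which matches what we saw after running ANALYZE TABLE.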

Finally, query speed also depends on the state of TiKV itself, for example whether it is reporting errors and whether data is evenly distributed. Fixing these can also improve performance.
