SQL Execution Error

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: sql执行报错

| username: Jolyne

[TiDB Usage Environment] Production Environment / Testing / PoC
Production Environment
[TiDB Version]
5.2.1
[Reproduction Path] What operations were performed when the issue occurred
In the production environment, there is a job that joins multiple tables and writes the result, roughly 20 million rows. Executing it currently fails with:
2013 - Lost connection to server at 'handshake: reading initial communication packet', system error: 0. Checking the monitoring showed that the load was too high, which caused some nodes to restart automatically and drop connections. Aside from optimizing the SQL and splitting the task, are there any good ways to tune or debug TiDB in this scenario? (The cluster currently runs quite a few background scheduled jobs, and IO stays consistently high.)
[Encountered Issues: Phenomenon and Impact]
[Resource Configuration]
[Attachments: Screenshots/Logs/Monitoring]


| username: xfworld | Original post link

Please provide the cluster configuration information requested above.
The [Resource Configuration] section is empty, so there is no basis for judgment.

[Reproduction Path] How are the 20 million rows joined, queried, and written? Can you share the table structures, the statements, and the execution plan?

| username: Jolyne | Original post link

Cluster Information

| username: xfworld | Original post link

What type of disk is it? And why are the TiKV nodes configured with different sizes? :see_no_evil:

I saw your big SQL. Do you want to run it all in one transaction?

| username: tidb菜鸟一只 | Original post link

Could you export your cluster configuration file for us to take a look? Does this SQL require sorting, and is the memory allocated to TiDB sufficient?
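
For reference, one way to check the memory-related settings the TiDB server is currently running with (a sketch only; the exact item names depend on the version, and mem-quota-query is the per-query limit in bytes):

-- List memory-related TiDB configuration items (e.g. mem-quota-query).
SHOW CONFIG WHERE type = 'tidb' AND name LIKE '%mem%';

-- The per-session equivalent can also be checked directly:
SHOW VARIABLES LIKE 'tidb_mem_quota_query';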

| username: 我是咖啡哥 | Original post link

Take a look at the startup times of each component; it's very likely an OOM issue.

SELECT * FROM information_schema.cluster_info;
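
For example, to focus on the restart times (a sketch; start_time and uptime are columns of information_schema.cluster_info):

-- Components restarted most recently show up first; a recent TiDB/TiKV
-- start time right after the failed statement usually points to an OOM kill.
SELECT type, instance, start_time, uptime
FROM information_schema.cluster_info
ORDER BY start_time DESC;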

| username: 特雷西-迈克-格雷迪 | Original post link

Your query must have exhausted the memory! Didn't the earlier monitoring show an instance reaching over 60 GB? You can insert in batches.
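
For example, a minimal sketch of range-based batching, using hypothetical table and column names (the real statement should be split on an indexed key of the driving table):

-- Hypothetical names: split the 20-million-row INSERT ... SELECT into key ranges
-- so each statement keeps its intermediate result and write set small.
INSERT INTO target_table (id, col_a, col_b)
SELECT s.id, s.col_a, d.col_b
FROM source_table s
JOIN dim_table d ON d.id = s.dim_id
WHERE s.id > 0 AND s.id <= 1000000;

-- Next batch: WHERE s.id > 1000000 AND s.id <= 2000000, and so on,
-- driven by a script or scheduler rather than one huge transaction.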

| username: Jolyne | Original post link

:joy: This is due to insufficient resources. The 6 TB node below was originally TiFlash; I repurposed it as a TiKV node. The disk is just a regular Huawei Cloud disk with average performance. The large SQL has now been split into multiple executions.

| username: Jolyne | Original post link

This is the initial configuration set up by the previous DBA, and it has hardly been changed since. Please help check whether there is anything that can be adjusted.

| username: Jolyne | Original post link

Yes, I saw from the dashboard that the component was also restarted recently.

| username: xfworld | Original post link

The disk IO is not sufficient; we can only reduce the read frequency, otherwise the IO will be overwhelmed.
172.21.10.137
172.21.10.138

What are these two nodes? Not only is the IO high, but the CPU and memory are also very high.

| username: Jolyne | Original post link

The TiKV node’s hardware is not keeping up. :sweat_smile:

| username: 我是咖啡哥 | Original post link

I think we should start by optimizing slow SQL queries, as OOM is generally caused by large result sets. If the optimization doesn’t meet expectations, then consider scaling up. If the data volume is large and there is high concurrency, regular hard drives definitely won’t suffice. SSDs are basically the standard.
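
For example, one way to find the statements that use the most memory, assuming the default slow query log is enabled (query_time, mem_max, and query are columns of information_schema.slow_query; mem_max is in bytes):

-- Top memory consumers recorded in the slow query log.
SELECT query_time, mem_max, query
FROM information_schema.slow_query
ORDER BY mem_max DESC
LIMIT 10;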

| username: xfworld | Original post link

Adjust the hardware configuration of the disk; other optimizations are unlikely to have good results. Alternatively, lower your expectations… :see_no_evil:

| username: Jolyne | Original post link

Yes, this is the current approach.

| username: Jolyne | Original post link

Okay, thank you.

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.