TiFlash Thread Resource Leak Issue

translator_bot · June 23, 2024, 3:49am

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tiflash 线程资源泄漏问题

| username: 华健-梦诚科技

TiDB & TiFlash V6.1.1

I have an SQL query that, when executed, causes the TiFlash thread count to explode.
The data volume is not large, so I suspect it might have triggered a bug.

I noticed a resolved issue in version 6.1.1 that is similar to my problem:

github.com/pingcap/tiflash

`estimated_thread_usage`, `waiting_task_count` and `active_task_count` are not 0 after all the queries in TiFlash node finished.

opened 07:14AM - 08 Aug 22 UTC

closed 08:32AM - 08 Aug 22 UTC

windtalker

type/bug severity/critical affects-6.0 component/compute affects-6.1 affects-6.2

## Bug Report

Please answer these questions before submitting your issue. Tha…nks!

### 1. Minimal reproduce step (Required)

<img width="924" alt="Screen Shot 2022-08-08 at 3 12 31 PM" src="https://user-images.githubusercontent.com/1916264/183360568-7e466784-be21-47ce-9abd-9e9b63ac7adc.png">

As shows above, after all the queries finished in TiFlash, the `estimated_thread_usage` are not back to 0

### 2. What did you expect to see? (Required)

### 3. What did you see instead (Required)

### 4. What is your TiFlash version? (Required)

However, after upgrading to 6.1.1, the problem still exists, as shown in the image below:

The SQL is quite complex and cannot be simplified (simplifying it does not reproduce the issue).
Moreover, it is related to the data; the same SQL does not cause this issue in a test environment with different data.

err_sql.txt (6.9 KB)

I am posting in the hope that an expert can offer some guidance on resolving this issue.

| username: flow-PingCAP

Hi, could you also send the `EXPLAIN` output for this SQL?

| username: 华健-梦诚科技

Sorry for the late response.
The `EXPLAIN` result is in the attachment. `EXPLAIN ANALYZE` could not produce a result; it failed with a thread resource exhaustion error.

explain.txt (195.1 KB)

| username: littlefall-PingCAP

Hello, thank you for the feedback. Could you please send the coprocessor panel from the TiFlash-Summary monitoring dashboard at the time the thread-exhaustion error occurs? You can use the PingCAP MetricsTool to export all monitoring data from TiFlash-Summary and TiFlash-Proxy-Details around the time the issue occurred.

| username: 华健-梦诚科技

Okay, the cluster is currently running a stress test. I’ll send it to you in a day or two.

Thanks for your attention.

| username: jansu-dev

Hello, is there any update on exporting the metrics?

| username: 华健-梦诚科技

Sorry for the delay. The cluster has only just finished the stress test and become available again, and I have some additional information.

Found more interesting information:

After the stress test, with the cluster otherwise untouched, running the previously problematic SQL no longer produced errors. The `EXPLAIN ANALYZE` result is attached:
explain_analyze.txt (291.0 KB)
I restarted TiFlash and ran the SQL again: still no errors.
I then restarted the entire cluster and ran the SQL again: the error appeared, with 2 nodes using over 5k threads and not releasing them.
Screenshots and metrics are as follows:

[Screenshot: image, 1215×470, 57.7 KB]

[Screenshot: image, 1380×262, 31.6 KB]

mc-TiFlash-Summary_2022-09-21T09_47_41.154Z.json (478.6 KB)
mc-TiFlash-Proxy-Details_2022-09-21T09_48_24.701Z.json (700.9 KB)
When I continued to execute the SQL multiple times after that, there were no further errors: it returned results normally and the thread count did not change.
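
For anyone who wants to check the "threads are not released" symptom directly on a TiFlash host rather than through the dashboards, here is a minimal sketch. It assumes a Linux host, where `/proc/<pid>/status` exposes a `Threads:` field, and that you look up the TiFlash process ID yourself. The function names and the tolerance of 50 threads are hypothetical choices for illustration; they do not come from TiFlash or from this thread.

```python
import time
from pathlib import Path
from typing import List, Sequence


def parse_thread_count(status_text: str) -> int:
    """Extract the `Threads:` field from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("Threads:"):
            return int(line.split(":", 1)[1].strip())
    raise ValueError("no 'Threads:' field found")


def read_thread_count(pid: int) -> int:
    """Return the current number of OS threads of a process (Linux only)."""
    return parse_thread_count(Path(f"/proc/{pid}/status").read_text())


def sample_thread_counts(pid: int, rounds: int = 12, interval_s: float = 5.0) -> List[int]:
    """Sample the thread count `rounds` times, sleeping `interval_s` between samples."""
    samples = []
    for _ in range(rounds):
        samples.append(read_thread_count(pid))
        time.sleep(interval_s)
    return samples


def threads_not_released(baseline: int, samples: Sequence[int], tolerance: int = 50) -> bool:
    """True if every post-query sample stays above baseline + tolerance.

    `tolerance` is an arbitrary illustrative margin, not a TiFlash setting.
    """
    return bool(samples) and all(s > baseline + tolerance for s in samples)


# Example flow (the PID below is a placeholder you must supply yourself):
#   baseline = read_thread_count(tiflash_pid)       # before running the SQL
#   ... run the problematic SQL and wait for it to finish or fail ...
#   after = sample_thread_counts(tiflash_pid)       # about one minute of samples
#   print(threads_not_released(baseline, after))
```

If the function keeps returning `True` long after the query has ended, that matches the behavior described above, where the thread count stays above 5k on the affected nodes.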

| username: 华健-梦诚科技

Can any experts provide some guidance?