TiFlash Thread Resource Leak Issue

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tiflash 线程资源泄漏问题

| username: 华健-梦诚科技

TiDB & TiFlash V6.1.1

I have an SQL query that, when executed, causes the TiFlash thread count to explode.
The data volume is not large, so I suspect it might have triggered a bug.

I noticed a resolved issue in version 6.1.1 that is similar to my problem:

However, after upgrading to 6.1.1, the problem still exists, as shown in the image below:

The SQL is quite complex and cannot be simplified (simplifying it does not reproduce the issue).
It also depends on the data: the same SQL does not cause this issue in a test environment with different data.

err_sql.txt (6.9 KB)

I am hoping an expert here can offer some guidance on resolving this issue.

| username: flow-PingCAP | Original post link

Hi, could you also send the EXPLAIN output for this SQL?

| username: 华健-梦诚科技 | Original post link

Sorry for the late response.
The EXPLAIN result is in the attachment. EXPLAIN ANALYZE could not produce a result; it failed with a thread resource exhaustion error.

explain.txt (195.1 KB)

| username: littlefall-PingCAP | Original post link

Hello, thank you for the feedback. Could you please send the Coprocessor panel from the TiFlash-Summary monitoring dashboard for the time when the thread exhaustion error occurred? You can use the PingCAP MetricsTool to export all the monitoring data from TiFlash-Summary and TiFlash-Proxy-Details around the time the issue occurred.

| username: 华健-梦诚科技 | Original post link

Okay, the cluster is currently running a stress test. I’ll send it to you in a day or two.

Thanks for your attention.

| username: jansu-dev | Original post link

Hello, is there any update on collecting the metrics?

| username: 华健-梦诚科技 | Original post link

Sorry for the delay; the cluster only just finished the stress test and became available again.

I have some additional, and rather interesting, information:

  1. After the stress test, without touching the cluster, running the previously problematic SQL no longer produces an error. The EXPLAIN ANALYZE result is as follows:
    explain_analyze.txt (291.0 KB)

  2. After restarting TiFlash and running the SQL again, there were still no errors.

  3. After restarting the entire cluster and running the SQL again, the error appeared, with 2 nodes holding over 5k threads and not releasing them.
    Screenshots and metrics are as follows:



    mc-TiFlash-Summary_2022-09-21T09_47_41.154Z.json (478.6 KB) mc-TiFlash-Proxy-Details_2022-09-21T09_48_24.701Z.json (700.9 KB)

  4. Executing the SQL several more times after that caused no further issues: it returned results normally, and the thread count did not change.
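
To check independently of the dashboards whether a process releases its threads, as in the steps above, the thread count can be sampled directly on a node. The following is a minimal sketch, assuming a Linux host where `/proc/<pid>/status` exposes a `Threads:` field. The function names are hypothetical helpers for illustration, not part of TiFlash or any PingCAP tool, and the PID of the TiFlash process has to be looked up separately.

```python
import os
import sys
import time


def parse_thread_count(status_text):
    """Return the integer in the 'Threads:' field of /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("Threads:"):
            return int(line.split(":", 1)[1].strip())
    raise ValueError("no 'Threads:' field found")


def read_thread_count(pid):
    """Read the current thread count of a process (Linux only, assumption)."""
    with open(f"/proc/{pid}/status") as f:
        return parse_thread_count(f.read())


def sample_thread_counts(pid, samples=5, interval_s=1.0):
    """Collect several readings so a count that never drops becomes visible."""
    counts = []
    for _ in range(samples):
        counts.append(read_thread_count(pid))
        time.sleep(interval_s)
    return counts


if __name__ == "__main__":
    # Pass the PID of the process to watch; defaults to this script's own PID.
    pid = int(sys.argv[1]) if len(sys.argv) > 1 else os.getpid()
    print(sample_thread_counts(pid, samples=3, interval_s=1.0))
```

Running it before, during, and after the problematic SQL would show whether the count returns to its baseline or stays elevated.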

| username: 华健-梦诚科技 | Original post link

Can any experts provide some guidance?