After executing cross-partition SQL in version 7.1, tidb-service CPU spikes

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 7.1版本执行了跨分区的sql后tidb-service cpu飙升

| username: TiDB_老表

[TiDB Usage Environment] Production Environment
[TiDB Version] 7.1.0/7.1.1
[Reproduction Path] After executing a cross-partition SQL (run only once or twice, taking 7.2s on a table of 5 million rows), the CPU of the tidb-service node stays pinned at 99.9%, and the only way to recover is to restart the affected tidb-service node. The issue reproduces almost every time.
[Problem Encountered: Symptoms and Impact] Once the CPU reaches 99.9%, the node can barely serve any database traffic.
[Resource Configuration]
[Attachments: Screenshots/Logs/Monitoring]




| username: h5n1 | Original post link

Post the SQL execution plan (available in the dashboard for slow SQL).

| username: TiDB_老表 | Original post link

The business no longer runs any cross-partition SQL, but there are rare scenarios where data has to be refreshed manually, and those can cross partitions. This issue did not exist on the previous 6.5 version; since upgrading to 7.1 it has already occurred several times.

| username: h5n1 | Original post link

Full table scan. Can manually running this SQL in a single session really push the TiDB server to 8000% CPU? Is nothing else running concurrently?

| username: h5n1 | Original post link

Could you please provide the full execution plan? Copy the plan with actual execution times from the dashboard.
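
For reference, a plan with actual runtime statistics can also be captured straight from SQL with EXPLAIN ANALYZE (a sketch only; the table and filter below are placeholders for the problematic query):

-- EXPLAIN ANALYZE executes the statement and reports per-operator execution time
explain analyze select count(*) from my_partitioned_table where created_at < '2023-01-01';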

| username: TiDB_老表 | Original post link

Confirmed that there is no other concurrency, and QPS is also very low. Executing one or two cross-partition SQLs against other partitioned tables triggers the same issue; as long as a query crosses partitions, the CPU stays high.

| username: TiDB_老表 | Original post link

| username: h5n1 | Original post link

Looking at your execution plan, it is still running in static partition pruning mode. You can set @@session.tidb_partition_prune_mode = 'dynamic' to enable dynamic pruning at the session level and test whether the issue persists.
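
A minimal sketch of that change (the table name is a placeholder; the ANALYZE step matters because dynamic mode relies on global-level statistics):

-- check the current pruning mode
show variables like 'tidb_partition_prune_mode';
-- switch this session to dynamic pruning, then re-test the SQL
set @@session.tidb_partition_prune_mode = 'dynamic';
-- if the test looks good, it can be enabled cluster-wide, after which the
-- partitioned tables should be re-analyzed so GlobalStats are built
set @@global.tidb_partition_prune_mode = 'dynamic';
analyze table my_partitioned_table;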

| username: TiDB_老表 | Original post link

Okay, thank you. I can try it in the test environment, but the data volume there is small, a few gigabytes at most. Production has 5 terabytes of data, so the issue may not be reproducible in the test environment.

| username: h5n1 | Original post link

If this issue occurs again, you can capture CPU and goroutine profiles for the development team to analyze:

# quote the URLs so the shell does not treat & as a background operator
curl -G "http://{TIDB_IP}:10080/debug/pprof/profile" > cpu.profile
curl "http://{TIDB_IP}:10080/debug/pprof/goroutine?seconds=10&debug=0" > goroutine.profile

| username: TiDB_老表 | Original post link

Okay, thank you.

| username: knull | Original post link

You should be able to capture pprof directly in the dashboard. It is recommended to capture memory, CPU, and goroutine profiles.
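
If the dashboard is not reachable, the memory profile can also be pulled from the same status port as the earlier commands (a sketch using the same placeholder address):

curl -G "http://{TIDB_IP}:10080/debug/pprof/heap" > heap.profile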

| username: oceanzhang | Original post link

I feel it’s not an issue with your single SQL, but rather what was running concurrently at that moment.

| username: oceanzhang | Original post link

The QPS is also very high.