SQL Execution Issues

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: sql执行问题

| username: 普罗米修斯

[TiDB Usage Environment] Production Environment
[TiDB Version] v5.2.4
[Encountered Problem]
We received a server memory usage alert in the morning. While memory usage was high, most business SQL queries showed up as slow queries: the time spent sending SQL execution results to the client was very high, yet the network looked normal on inspection. What could cause this?

| username: 大飞哥online | Original post link

How large is the query result set?

| username: h5n1 | Original post link

Is it deployed on a physical machine? What aspects of the network were checked? How high was the TiDB CPU utilization during that time period? Did any of the slow SQL queries have changes in their execution plans?

| username: 路在何chu | Original post link

The execution plan shows the SQL result set size; check how large the result set is.

| username: 普罗米修斯 | Original post link

| username: tidb菜鸟一只 | Original post link

More than 50,000 rows taking four minutes…

| username: 大飞哥online | Original post link

The result set looks to be about 60,000 rows according to this plan. If the table structure includes varchar(255), text, and similar types, the transmission will be larger, consuming more bandwidth and taking time. Could you share the table structure and the columns listed after SELECT? A network engineer can check the bandwidth between the application side and the database side.

| username: TiDBer_小阿飞 | Original post link

The execution itself barely took any time! The time-consuming part is sending the result set. Maybe ask the business team, or check the status of the middleware?

| username: 普罗米修斯 | Original post link

Physical machine deployment. Network metrics such as bps, pps, and tcp_count were all normal during the period of increased memory usage; CPU utilization was normal, and the execution plans have not changed.

| username: 普罗米修斯 | Original post link

Here are the SQL statements and table structure; the network is 10 Gigabit end to end.


| username: 普罗米修斯 | Original post link

After restarting the middleware server, memory usage has come down. If a query does not appear in the slow query log, is there any way to check the time spent sending its results? I couldn't find it in the dashboard.
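One rough client-side way to separate server execution time from result-transfer time is to time `executeQuery()` and the fetch loop independently. This is only a sketch: the JDBC URL, credentials, and table name below are placeholders, and it assumes a MySQL-protocol driver such as Connector/J. Note that with the driver's default configuration the whole result set is read inside `executeQuery()`, so the statement is put into streaming mode here so that transfer actually happens while iterating:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class SendTimeProbe {
    // Convert nanoseconds to whole milliseconds.
    static long toMillis(long nanos) {
        return nanos / 1_000_000L;
    }

    public static void main(String[] args) throws SQLException {
        // Placeholder URL and credentials -- adjust for your environment.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://127.0.0.1:4000/test", "root", "")) {
            // Streaming mode: rows are pulled off the wire during rs.next(),
            // so the fetch loop below measures transfer plus client processing.
            Statement st = conn.createStatement(
                    ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
            st.setFetchSize(Integer.MIN_VALUE);

            long t0 = System.nanoTime();
            ResultSet rs = st.executeQuery("SELECT * FROM t"); // placeholder query
            long t1 = System.nanoTime();

            long rows = 0;
            while (rs.next()) {
                rows++; // a real application would process the row here
            }
            long t2 = System.nanoTime();

            System.out.printf("execute: %d ms, fetch %d rows: %d ms%n",
                    toMillis(t1 - t0), rows, toMillis(t2 - t1));
        }
    }
}
```

If the fetch loop dominates, the time is going into network transfer or client-side processing rather than execution on the server.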

| username: 大飞哥online | Original post link

Is the network stable? Is there any packet loss?
Check the memory, CPU, and other metrics of the client receiving the data: are they high?
If the receiving end processes data slowly, the elapsed time will also stretch.

| username: h5n1 | Original post link

Export the monitoring data from the overview / tidb / tikv-detail / blackbox_exporter / node_exporter dashboards for the half hour before the memory increase and for the period of the increase itself.

Steps to export monitoring data:

  1. Open the Grafana monitoring dashboard and select the time range to export.
  2. Press d and then E to expand all rows' panels, and wait for the data to load completely.
  3. Use the tool at https://metricstool.pingcap.com/ to export the Grafana data as a snapshot.

| username: 普罗米修斯 | Original post link

The client memory and CPU usage are both normal, and the network also appears normal. The memory drop occurred after restarting the middleware service.


| username: 普罗米修斯 | Original post link

The export tool link https://metricstool.pingcap.com/ cannot be opened.

| username: h5n1 | Original post link

https://metricstool.pingcap.net

| username: TiDBer_小阿飞 | Original post link

SELECT
	a.thread_id,
	sql_text,
	c.event_name,
	(c.timer_end - c.timer_start) / 1000000000 AS 'duration (ms)'
FROM
	`performance_schema`.events_statements_history_long a
JOIN `performance_schema`.threads b ON a.thread_id = b.thread_id
JOIN `performance_schema`.events_stages_history_long c ON c.thread_id = b.thread_id
AND c.event_id BETWEEN a.event_id
AND a.end_event_id
WHERE
	b.processlist_id = connection_id()
AND a.event_name = 'statement/sql/select'
ORDER BY
	a.thread_id,
	c.event_id;

| username: Fly-bird | Original post link

Slow SQL

| username: 人如其名 | Original post link

The statement executes quickly at the database level, so the high latency in sending results mainly comes from two causes:

  1. Network latency is too high. This mainly happens when the data volume is large and there are many round trips to the database, typically with cursor fetch and a small fetchsize: each round trip fetches only fetchsize records, so a high network delay multiplies into high overall latency. With your data volume, though, this is almost impossible.
  2. The application processes data slowly. With cursor fetch (cursor fetch=yes, fetchsize > 0; advantage: several result sets can be handled at once; disadvantage: many network round trips, poor performance), the application processes the current batch of fetchsize records before asking the database for the next one. While the application works through those records, the database side is blocked, which shows up as high result-sending latency. With streaming (fetchsize <= 0; advantage: few network round trips, no repeated client requests, low JDBC memory usage; disadvantage: SQL cannot run concurrently on the connection, since the previous result set must be fully consumed before the next statement) or the default configuration (cursor fetch=no; advantage: very few network round trips; disadvantage: high JDBC memory usage because the entire result set is cached), this blocking generally does not occur.

Therefore, the troubleshooting order is:

  1. If the JDBC driver is set to cursor fetch=yes with fetchsize > 0, it is most likely an application issue (the prime suspect here).
  2. If the JDBC driver is set to streaming (fetchsize <= 0), a network issue is somewhat likely, and slow application-side processing should also be checked.
  3. If the JDBC driver uses the default configuration (cursor fetch=no), it is most likely a network issue.
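For reference, the three modes above map onto Connector/J settings roughly as sketched below. `useCursorFetch` and `Statement.setFetchSize` are the standard Connector/J knobs; the host, port, and database names are placeholders, not values from this thread:

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class FetchModeDemo {
    // 1. Cursor fetch: connection property useCursorFetch=true plus a
    //    positive fetch size. Every batch of fetchSize rows costs one
    //    network round trip, and the server waits while the client
    //    processes each batch.
    static Statement cursorFetch(Connection conn) throws SQLException {
        Statement st = conn.createStatement();
        st.setFetchSize(100); // batch size per round trip
        return st;
    }

    // 2. Streaming: forward-only, read-only statement with
    //    fetchSize = Integer.MIN_VALUE. Rows are pulled off the wire as
    //    the client iterates the ResultSet, so slow client code keeps
    //    the server's send path blocked.
    static Statement streaming(Connection conn) throws SQLException {
        Statement st = conn.createStatement(
                ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
        st.setFetchSize(Integer.MIN_VALUE);
        return st;
    }

    // 3. Default: no cursor fetch, no streaming; the driver buffers the
    //    entire result set in client memory during executeQuery().
    static Statement buffered(Connection conn) throws SQLException {
        return conn.createStatement();
    }

    // Helper to build a cursor-fetch URL (host/port/db are placeholders).
    static String cursorFetchUrl(String host, int port, String db) {
        return "jdbc:mysql://" + host + ":" + port + "/" + db
                + "?useCursorFetch=true";
    }
}
```

Checking which of these three shapes appears in the application's connection setup tells you which branch of the troubleshooting order applies.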

Additionally, high database memory usage mainly comes from slow consumption by the front-end application (or slow network transmission): the distSQL data a query requests from TiKV has to be cached in the tikvclient on the tidb-server side until the client consumes it.

| username: 随缘天空 | Original post link

In the dashboard monitoring panel, check the CPU and memory usage of each node, and whether caching is enabled and queries are hitting the cache. If CPU and memory usage are not high, check whether there is a data hotspot issue.