TiFlash Memory Overflow Causes Automatic Restart

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: Tiflash内存爆了自动重启

| username: jboracle1981

【TiDB Usage Environment】Production Environment
【TiDB Version】
【Reproduction Path】No special operations
【Encountered Problem: Problem Phenomenon and Impact】
【Resource Configuration】Go to TiDB Dashboard - Cluster Info - Hosts and take a screenshot of this page
【Attachments: Screenshots/Logs/Monitoring】

| username: jboracle1981


tiflash_error.log.2024-05-29-16_44_46.634 (41.9 MB)
tiflash_error.log.2024-05-29-16_46_31.695 (20.3 MB)

| username: jboracle1981

TiFlash graph for the past month

| username: Kongdom

Have you checked Top SQL or the slow queries on the Dashboard?

| username: zhaokede

It is probably a SQL statement that consumes a lot of memory.

| username: jboracle1981

I checked, but didn’t find any SQL with high memory usage.

| username: jboracle1981

No SQL with particularly high memory usage was found during this time period.

| username: hacker_77powerful

How much memory does the operating system have?

| username: jboracle1981

Both machines have 250 GB of memory, and both are fully used.

| username: FutureDB

Check whether there is a Cartesian join between large tables. On our v6.1.0 cluster, a Cartesian join between two large tables quickly brought down the TiFlash node.
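One rough way to run that check, as a minimal sketch: scan the rows of an `EXPLAIN` result for a join operator whose info mentions a Cartesian join. The helper name `find_cartesian_joins` and the sample plan rows are hypothetical, and the sketch assumes the plan text labels such joins with the word "CARTESIAN", so verify that against the output of your own version.

```python
def find_cartesian_joins(explain_rows):
    """Return the plan rows whose operator info mentions a Cartesian join.

    explain_rows: list of (operator_id, operator_info) tuples, assumed to be
    copied from the `id` and `operator info` columns of EXPLAIN output.
    """
    return [
        (op_id, info)
        for op_id, info in explain_rows
        if "cartesian" in info.lower()  # assumption: plan text uses this word
    ]


# Hypothetical plan rows, for illustration only.
plan = [
    ("HashJoin_8", "CARTESIAN inner join"),
    ("TableReader_10", "data:TableFullScan_9"),
]
print(find_cartesian_joins(plan))
```

An empty result only means that no operator in that particular plan was labeled as a Cartesian join; each suspect statement has to be checked separately.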

| username: FutureDB

Have you checked whether any SQL statements failed during that time period? Look in the logs for large SQL statements that did not finish successfully. Our TiFlash previously crashed because of a SQL statement involving a Cartesian join on a large table: it consumed so much memory that the node crashed before the execution could complete.

| username: jboracle1981

There are no Cartesian products, but there are quite a few failed SQL statements.

| username: dgtgsou

For slow SQL, look at the statements that ran before the node restarted. The failed SQL statements are likely a result of the TiFlash node crashing, which would cause those queries to fail.
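To narrow the search to statements that ran shortly before the restart, something like the following could work as a sketch. It assumes the slow query records have already been exported into dicts; the field names `time`, `mem_max`, and `query`, the helper `suspects_before_restart`, and all sample values are hypothetical. The recorded memory figure may also not reflect what the statement used inside TiFlash, so treat the ranking as a list of candidates only.

```python
from datetime import datetime, timedelta


def suspects_before_restart(records, restart_time, window_minutes=10, top_n=5):
    """Return up to top_n records from the window before restart_time,
    ordered by recorded memory usage, highest first.

    Failed statements are kept on purpose: a query that was still running
    when the node went down never completed successfully.
    """
    start = restart_time - timedelta(minutes=window_minutes)
    in_window = [r for r in records if start <= r["time"] <= restart_time]
    return sorted(in_window, key=lambda r: r["mem_max"], reverse=True)[:top_n]


# Hypothetical restart time and records, for illustration only.
restart = datetime(2024, 1, 1, 12, 0, 0)
records = [
    {"time": datetime(2024, 1, 1, 11, 55), "mem_max": 2_000, "query": "q1"},
    {"time": datetime(2024, 1, 1, 11, 58), "mem_max": 9_000, "query": "q2"},
    {"time": datetime(2024, 1, 1, 11, 30), "mem_max": 50_000, "query": "q3"},
    {"time": datetime(2024, 1, 1, 12, 5), "mem_max": 70_000, "query": "q4"},
]
for r in suspects_before_restart(records, restart):
    print(r["query"], r["mem_max"])
```

Statements that started after the restart (`q4` here) are left out, since they are more likely victims of the crash than its cause.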

| username: 随缘天空

Check the slow queries and resource usage on the dashboard.

| username: FutureDB

Could you analyze in detail why these SQL statements failed?

| username: jboracle1981

Okay, everyone. I’ll check it on Monday.

| username: YuchongXU

It is probably a large SQL statement.

| username: TiDBer_ZxWlj6A1

Columnar storage is designed to address analytical processing (AP) workloads and large SQL queries, right?

| username: Jack-li

How large is your SQL?

| username: jboracle1981

It’s possible.