TiFlash Restart Due to Increased Handle Count

translator_bot · June 23, 2024, 10:16am

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tiflash由于句柄数升高导致重启

| username: Hacker_ojLJ8Ndr

【TiDB Usage Environment】Production Environment
【TiDB Version】6.1.0
【Encountered Problem】tiflash restart
【Problem Phenomenon and Impact】
tiflash_error.log:

tiflash-summary:

tidb-cluster-node_exporter:

Operating System Configuration:

translator_bot · June 23, 2024, 10:16am

| username: 数据小黑 | Original post link

Is there anything relevant in the system log? Any hint such as oom-killer?

translator_bot · June 23, 2024, 10:16am

| username: Hacker_ojLJ8Ndr | Original post link

It wasn’t caused by OOM. The message shows: main process exited, code=killed

translator_bot · June 23, 2024, 10:16am

| username: jansu-dev | Original post link

Root cause investigation:

The open file count is indeed relatively high, but is there an error message along the lines of “open too many file descriptor …”?
6770 may not be the true peak: the value shown may have been averaged, so the actual maximum could be higher.

As a workaround, if this is the root cause, could you try setting the limit to unlimited and see whether that avoids the issue?
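
To compare what the process actually holds against the limit it actually runs under, both numbers can be read from /proc on Linux. The following is a minimal sketch, not something from this thread: the function names are made up for illustration, and it assumes you have permission to read /proc/<pid>/fd for the TiFlash process.

```python
import os
from pathlib import Path


def parse_max_open_files(limits_text):
    """Return (soft, hard) from the text of /proc/<pid>/limits.

    Each value is an int, or None if the row reports "unlimited".
    """
    prefix = "Max open files"
    for line in limits_text.splitlines():
        if line.startswith(prefix):
            soft, hard = line[len(prefix):].split()[:2]
            return tuple(None if v == "unlimited" else int(v) for v in (soft, hard))
    raise ValueError("no 'Max open files' row found")


def fd_usage(pid):
    """Return (open_fd_count, soft_limit, hard_limit) for a running process."""
    proc = Path("/proc") / str(pid)
    open_fds = len(os.listdir(proc / "fd"))  # one entry per open descriptor
    soft, hard = parse_max_open_files((proc / "limits").read_text())
    return open_fds, soft, hard
```

Calling fd_usage with the TiFlash PID shortly before a restart would show whether the descriptor count is anywhere near the effective soft limit, independent of any averaging in the dashboard.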

translator_bot · June 23, 2024, 10:16am

| username: Hacker_ojLJ8Ndr | Original post link

There are no file-descriptor errors in /var/log/messages, only this: main process exited, code=killed, status=6/ABRT.
The operating system resource limits are already configured, as shown in the operating system configuration screenshot I provided, and the configuration is in effect. However, the maximum Opened File Count in tiflash-summary is far from the value configured in the operating system, so adjusting resource limits cannot avoid this issue.
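
For reference, the status=6/ABRT field in that message is the signal that ended the process. Signal 6 is SIGABRT, which a process normally raises on itself through abort(), whereas the kernel OOM killer uses SIGKILL (signal 9). A small sketch for decoding such lines (the helper name is hypothetical, and the regular expression only assumes the "status=N/NAME" shape seen in the message above):

```python
import re
import signal


def decode_systemd_exit(message):
    """Extract the terminating signal from a 'main process exited' line.

    Returns (signal_number, signal_name), or None when the line has
    no 'status=N/...' field after code=killed or code=dumped.
    """
    match = re.search(r"code=(?:killed|dumped), status=(\d+)/", message)
    if match is None:
        return None
    number = int(match.group(1))
    return number, signal.Signals(number).name


print(decode_systemd_exit("main process exited, code=killed, status=6/ABRT"))
```

An abort points toward the process terminating itself (for example on an unhandled error) rather than being killed from outside, which is consistent with OOM having been ruled out earlier in the thread.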

translator_bot · June 23, 2024, 10:16am

| username: jansu-dev | Original post link

Query Prometheus on port 9090 to check the instantaneous value recorded for the metric (this is also averaged).
Do you still have the logs from when TiFlash restarted (tiflash.log, tiflash_error.log)? Please post them.

There is not enough information yet to analyze this further…
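
One way to do the Prometheus check without the web UI is through its HTTP API. This is only a sketch under stated assumptions: the base URL and the metric name passed in are placeholders (the thread does not say which metric backs the Opened File Count panel), and the helper names are invented here. The /api/v1/query endpoint and the max_over_time function are standard Prometheus.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url, promql):
    """Build an instant-query URL for the Prometheus HTTP API."""
    params = urllib.parse.urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{params}"


def parse_instant_vector(payload):
    """Map each series' instance label to its sample value."""
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload}")
    return {
        item["metric"].get("instance", ""): float(item["value"][1])
        for item in payload["data"]["result"]
    }


def query_peak(base_url, metric, window="1h"):
    """Fetch the per-series maximum of `metric` over `window`.

    `metric` is supplied by the caller; which metric TiFlash exposes
    for open files is an assumption to verify in the Prometheus UI.
    """
    promql = f"max_over_time({metric}[{window}])"
    url = build_query_url(base_url, promql)
    with urllib.request.urlopen(url, timeout=10) as resp:
        return parse_instant_vector(json.load(resp))
```

For example, query_peak("http://127.0.0.1:9090", "process_open_fds") would return the highest scraped value per instance over the last hour, assuming that metric exists on the target. Note that even this maximum is bounded by the scrape interval, so short spikes between scrapes can still be missed.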

translator_bot · June 23, 2024, 10:16am

| username: Hacker_ojLJ8Ndr | Original post link

Prometheus record:

Screenshot of the first error in tiflash_error.log:

This is the log of the problematic tiflash node exported from the dashboard:
logs-tiflash_192.168.14.23_3930.zip (1.0 MB)

translator_bot · June 23, 2024, 10:16am

| username: jansu-dev | Original post link

Could you please confirm if this is a TiFlash log? It looks more like a TiKV log.

translator_bot · June 23, 2024, 10:16am

| username: Hacker_ojLJ8Ndr | Original post link

translator_bot · June 23, 2024, 10:16am

| username: jansu-dev | Original post link

The ng_monitor export probably contains only tiflash_tikv.log. Please go to the log directory and retrieve all the logs covering these 4 time points.

translator_bot · June 23, 2024, 10:16am

| username: Hacker_ojLJ8Ndr | Original post link

Yes, the exported file is tiflash_tikv.log. The tiflash.log from that time is no longer available.

translator_bot · June 23, 2024, 10:16am

| username: jansu-dev | Original post link

Does this still occur frequently after the restart?
Do you still have the complete tiflash_error.log?

translator_bot · June 23, 2024, 10:16am

| username: Hacker_ojLJ8Ndr | Original post link

It happens occasionally. This is the tiflash_error.log from the period of the issue:
tiflash_error.log (115.3 KB)

translator_bot · June 23, 2024, 10:16am

| username: jansu-dev | Original post link

OK. Does it return to normal every time after a restart?

translator_bot · June 23, 2024, 10:16am

| username: Hacker_ojLJ8Ndr | Original post link

After the restart, the number of handles decreased, and the service is normal.

translator_bot · June 23, 2024, 10:16am

| username: jansu-dev | Original post link

OK. Two things to confirm:

The operating system version.
Whether continuous profiling is enabled on TiFlash (you can check on the dashboard), or whether any manual profiling was done.

This may be the cause (still confirming): “query raise the error of Unknown compression method: 200 when profiling in rhel 8”, Issue #5292 in pingcap/tiflash on GitHub.

translator_bot · June 23, 2024, 10:16am

| username: Hacker_ojLJ8Ndr | Original post link

Operating System Version: CentOS Linux release 7.9.2009 (Core)
Continuous profiling is enabled; no manual profiling was done.

translator_bot · June 23, 2024, 10:16am

| username: jansu-dev | Original post link

Based on this, the problem is most likely related to the following issue: “query raise the error of Unknown compression method: 200 when profiling in rhel 8”, Issue #5292 in pingcap/tiflash on GitHub.
Suggestions:

Turn off continuous profiling and observe whether the problem still occurs.
Follow that issue; we will continue to investigate and fix the problem internally.

If anything new comes up, please post it here as well. Thank you for the feedback.

translator_bot · June 23, 2024, 10:16am

| username: Hacker_ojLJ8Ndr | Original post link

Sure, thanks for the help~
