TiFlash Restart Due to Increased Handle Count

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tiflash由于句柄数升高导致重启

| username: Hacker_ojLJ8Ndr

【TiDB Usage Environment】Production Environment
【TiDB Version】6.1.0
【Encountered Problem】tiflash restart
【Problem Phenomenon and Impact】
tiflash_error.log:


tiflash-summary:

tidb-cluster-node_exporter:

Operating System Configuration: [screenshot]

| username: 数据小黑 | Original post link

Is there anything relevant in the system log? For example, any sign of the oom-killer?

| username: Hacker_ojLJ8Ndr | Original post link

It wasn’t caused by OOM. The message shows: main process exited, code=killed

| username: jansu-dev | Original post link

Root cause investigation:

  1. The open file count is indeed relatively high, but is there any message like “open too many file descriptor …”?
  2. 6770 may not be the actual peak: the value shown may have been averaged, so the true maximum could be higher.

As a workaround, if this is the root cause, can you try setting the open file limit to unlimited to see whether that avoids the issue?
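To compare the live handle count with the limit that actually applies to the process, something like the following minimal sketch may help. It assumes a Linux host with `/proc` mounted and enough permission to read the target process; the function names are made up for illustration, and the PID of the TiFlash process has to be supplied by hand.

```python
import os
from pathlib import Path
from typing import Tuple


def open_fd_count(pid: int) -> int:
    """Count the entries in /proc/<pid>/fd, one per open descriptor."""
    return len(os.listdir(f"/proc/{pid}/fd"))


def parse_nofile_limit(limits_text: str) -> Tuple[str, str]:
    """Return (soft, hard) from the 'Max open files' row of /proc/<pid>/limits.

    Values are kept as strings because the kernel may report 'unlimited'.
    """
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            fields = line.split()
            # Row layout: Max open files <soft> <hard> files
            return fields[3], fields[4]
    raise ValueError("no 'Max open files' row found")


def nofile_limit(pid: int) -> Tuple[str, str]:
    """Read the effective open-file limits of a running process."""
    return parse_nofile_limit(Path(f"/proc/{pid}/limits").read_text())


if __name__ == "__main__" and os.path.isdir("/proc/self/fd"):
    pid = os.getpid()  # placeholder: replace with the TiFlash process PID
    soft, hard = nofile_limit(pid)
    print(f"pid={pid} open_fds={open_fd_count(pid)} soft={soft} hard={hard}")
```

Reading `/proc/<pid>/limits` rather than running `ulimit -n` in a shell matters here, because a service started by systemd does not necessarily inherit the limits of an interactive login session.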

| username: Hacker_ojLJ8Ndr | Original post link

  1. There are no errors related to file descriptors in /var/log/messages, only this: main process exited, code=killed, status=6/ABRT.
  2. The operating system resource limits have already been configured, as shown in the screenshot of the operating system configuration I provided. The configuration is in effect, but the maximum Opened File Count in tiflash-summary is far from the value configured in the operating system, so the issue cannot be avoided by raising resource limits.
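For reference, the `status=6/ABRT` part of that message is a signal number and name: signal 6 on Linux is SIGABRT, which a process normally raises on itself via `abort()`, whereas the oom-killer terminates processes with SIGKILL (signal 9). A small sketch for decoding such a message, where the helper name is made up for illustration:

```python
import re
import signal


def decode_exit_signal(message: str) -> str:
    """Map the numeric part of a 'status=N/NAME' field to a signal name."""
    match = re.search(r"status=(\d+)/", message)
    if match is None:
        raise ValueError("no status=N/NAME field found")
    return signal.Signals(int(match.group(1))).name


line = "main process exited, code=killed, status=6/ABRT"
print(decode_exit_signal(line))
```

This is consistent with the earlier observation that the restart was not caused by OOM.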

| username: jansu-dev | Original post link

  1. Query Prometheus directly on port 9090 to check the instantaneous value recorded for the metric (note that this may also be averaged);
  2. Do you still have the logs from the time TiFlash restarted (tiflash.log, tiflash_error.log)? Please post them.

The information so far is not enough for further analysis.
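As a rough sketch of the Prometheus check, the range-query endpoint of the Prometheus HTTP API (`/api/v1/query_range`) can be called directly and the peak taken from the raw samples. The metric name `process_open_fds`, the address, and the helper names below are assumptions for illustration only; the expression behind the Opened File Count panel in Grafana should be used instead.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url: str, promql: str, start: float, end: float,
                    step: str = "15s") -> str:
    """Build a Prometheus /api/v1/query_range URL for the given expression."""
    params = urllib.parse.urlencode(
        {"query": promql, "start": start, "end": end, "step": step}
    )
    return f"{base_url.rstrip('/')}/api/v1/query_range?{params}"


def peak_value(response: dict) -> float:
    """Largest sample across all series in a query_range response."""
    return max(
        float(value)
        for series in response["data"]["result"]
        for _timestamp, value in series["values"]
    )


def fetch(url: str) -> dict:
    """Run the query and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


# Example (not executed here; address and metric name are placeholders):
#   url = build_query_url("http://127.0.0.1:9090", "process_open_fds",
#                         start=1656000000, end=1656003600)
#   print(peak_value(fetch(url)))
```

A small `step` keeps the samples close to the scrape interval, so short spikes are less likely to be smoothed away than on a dashboard panel covering a long time range.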

| username: Hacker_ojLJ8Ndr | Original post link

  1. Prometheus record:
  2. Screenshot of the first error in tiflash_error.log:
  3. This is the log of the problematic tiflash node exported from the dashboard:
    logs-tiflash_192.168.14.23_3930.zip (1.0 MB)

| username: jansu-dev | Original post link

Could you please confirm if this is a TiFlash log? It looks more like a TiKV log.

| username: Hacker_ojLJ8Ndr | Original post link

| username: jansu-dev | Original post link

The ng_monitor export probably contains only tiflash_tikv.log. Please go to that directory and collect all the logs covering these 4 time points.
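If the log files are large, a small filter can pull out only the lines around each restart. The sketch below assumes each log line starts with a timestamp such as `[2022/07/01 10:15:30.123 +08:00]`; that format, the sample lines, and the function name are illustrative assumptions and should be checked against the real files.

```python
import re
from datetime import datetime, timedelta

# Assumed line prefix: [YYYY/MM/DD HH:MM:SS.mmm +ZZ:ZZ]
TIMESTAMP = re.compile(r"^\[(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2})")


def lines_near(lines, moment, window=timedelta(minutes=5)):
    """Keep lines whose timestamp lies within `window` of `moment`.

    Lines without a leading timestamp (for example stack-trace
    continuation lines) are skipped in this simple version.
    """
    selected = []
    for line in lines:
        match = TIMESTAMP.match(line)
        if match is None:
            continue
        stamp = datetime.strptime(match.group(1), "%Y/%m/%d %H:%M:%S")
        if abs(stamp - moment) <= window:
            selected.append(line)
    return selected


# Made-up sample lines, for illustration only.
sample = [
    "[2022/07/01 10:00:00.000 +08:00] [INFO] startup",
    "[2022/07/01 10:14:59.500 +08:00] [ERROR] something failed",
    "[2022/07/01 11:30:00.000 +08:00] [INFO] much later",
]
for line in lines_near(sample, datetime(2022, 7, 1, 10, 15)):
    print(line)
```

Running it once per restart time gives four small excerpts that are easier to attach than the full log directory.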

| username: Hacker_ojLJ8Ndr | Original post link

Yes, the exported file is tiflash_tikv.log. The tiflash.log from that time is no longer available.

| username: jansu-dev | Original post link

  1. Does this still occur frequently after the restart?
  2. Do you still have the complete tiflash_error.log?

| username: Hacker_ojLJ8Ndr | Original post link

It still happens occasionally. This is the tiflash_error.log from the period when the issue occurred:
tiflash_error.log (115.3 KB)

| username: jansu-dev | Original post link

OK. Does it return to normal every time after a restart?

| username: Hacker_ojLJ8Ndr | Original post link

After the restart, the number of handles decreased, and the service is normal.

| username: jansu-dev | Original post link

OK.

  1. Please confirm the operating system version;
  2. Please confirm whether the continuous profiling feature is enabled for TiFlash (you can see this on the dashboard), or whether any manual profiling has been done.

This might be the cause (still confirming): "query raise the error of Unknown compression method: 200 when profiling in rhel 8", Issue #5292 in pingcap/tiflash on GitHub

| username: Hacker_ojLJ8Ndr | Original post link

  1. Operating System Version: CentOS Linux release 7.9.2009 (Core)
  2. Continuous profiling is enabled; no manual profiling was done.

| username: jansu-dev | Original post link

Based on this, the problem is most likely related to this issue: "query raise the error of Unknown compression method: 200 when profiling in rhel 8", Issue #5292 in pingcap/tiflash on GitHub
Suggestions:

  1. Turn off continuous profiling and observe whether the problem still occurs;
  2. You can follow this issue; we will continue to investigate and fix the problem internally.

If there are any new developments or phenomena, you can also post them here. Thank you for the feedback.

| username: Hacker_ojLJ8Ndr | Original post link

Sure, thanks for the help~
