TiFlash Service Abnormal, Server Restart

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tiflash 服务异常,服务器重启

| username: Hacker_ojLJ8Ndr

tiflash_stderr.log:

Logging debug to /data01/deploy/log/tiflash.log
Logging errors to /data01/deploy/log/tiflash_error.log
deprecated configuration, log-file has been moved to log.file.filename
override log.file.filename with log-file, "/data01/deploy/log/tiflash_tikv.log"
[low_level_alloc.cc : 570] RAW: mmap error: 12
...
Logging debug to /data01/deploy/log/tiflash.log
Logging errors to /data01/deploy/log/tiflash_error.log
deprecated configuration, log-file has been moved to log.file.filename
override log.file.filename with log-file, "/data01/deploy/log/tiflash_tikv.log"

tiflash_error.log:

[2023/08/15 05:20:17.345 +08:00] [ERROR] [Server.cpp:314] ["/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tics/contrib/grpc/src/cpp/thread_manager/thread_manager.cc, line number: 39, log msg : Could not create grpc_sync_server worker-thread"] [source=grpc] [thread_id=1514686]

TiDB log:

[err="[tikv:13][FLASH:Coprocessor:BadRequest] Income key ranges is illegal for region: 765697663: (while doing learner read for table, logical table_id: 30927)"]

TiFlash monitoring:

(monitoring screenshots not included in this export)

MPP-related parameters:

(screenshot not included in this export)

(The server where TiFlash is deployed has no cpufreq tuning configured.)
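
Since the screenshot of the MPP-related parameters does not come through in this export, a minimal way to list them is sketched below (these are the standard TiDB system variables; verify the names and values against your cluster version):

-- list every MPP-related system variable and its current global value
SHOW GLOBAL VARIABLES LIKE '%mpp%';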

| username: tidb菜鸟一只 | Original post link

What does the cluster topology look like? It seems that the memory of the TiFlash node machine is insufficient.

| username: Hacker_ojLJ8Ndr | Original post link

Memory is not fully utilized; there is one TiFlash node per server.

| username: tidb菜鸟一只 | Original post link

Did you do NUMA binding? Your machine has 256GB of memory, and TiFlash occupies 137GB?

| username: Hacker_ojLJ8Ndr | Original post link

NUMA binding is configured. This newly deployed TiFlash cluster has been running for more than two weeks, and the cop query volume was not very high. After the cluster ran into issues, the number of cop requests processed increased.


(monitoring screenshot not included in this export)

| username: 有猫万事足 | Original post link

The error Could not create grpc_sync_server worker-thread has been attributed to OOM in two GitHub issues.

From the perspective of memory growth this time, it indeed seems to be an OOM issue.
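
If you want to see which running statements are holding the most memory while this is happening, one option is a sketch like the following (it assumes the CLUSTER_PROCESSLIST table available in recent TiDB versions, where MEM is reported in bytes):

-- currently running statements across the cluster, sorted by memory held
SELECT INSTANCE, ID, DB, TIME, MEM, LEFT(INFO, 100) AS stmt
FROM information_schema.cluster_processlist
ORDER BY MEM DESC
LIMIT 10;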

| username: zhanggame1 | Original post link

It seems a bit like OOM (Out of Memory).

| username: Hacker_ojLJ8Ndr | Original post link

The logs from before the restart also look like OOM, but according to the monitoring, memory usage on the most heavily used node never exceeded 80%, and the cluster ran into problems when it was not even at 50%.

[2023/08/15 05:22:20.022 +08:00] [ERROR] [BaseDaemon.cpp:378] ["(from thread 1508970) Received signal Aborted(6)."] [source=BaseDaemon] [thread_id=1514696]

| username: 有猫万事足 | Original post link

In my environment, TiFlash also runs into similar situations when it scans very large result sets.

It’s best to check whether such SQL is reasonable. If it really is necessary and you have multiple TiFlash nodes, you can consider using MPP; that avoids a single TiFlash node running out of memory when it scans a large result set without MPP.
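
To confirm whether a query actually runs as an MPP plan rather than as ordinary cop reads, the quickest check is the execution plan. A sketch, using a hypothetical table t and column col1:

-- a task column of "mpp[tiflash]" plus ExchangeSender/ExchangeReceiver operators mean MPP is in use;
-- "cop[tiflash]" or "batchCop[tiflash]" means coprocessor reads without MPP data exchange.
EXPLAIN SELECT /*+ read_from_storage(tiflash[t]) */ col1, COUNT(*) FROM t GROUP BY col1;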

However, it must be emphasized that MPP essentially relies on adding machines to split the task up and compute in parallel. If the result set is so large that the existing TiFlash instances cannot handle it even with MPP, it may bring down all of the TiFlash instances. The experience is somewhat all-or-nothing.

It’s like chaining your ships together: one fire attack takes them all out. :joy:

| username: Hacker_ojLJ8Ndr | Original post link

MPP is enabled, but not enforced. This is most likely related to the new business workload that went live yesterday.
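
For reference, "enabled but not enforced" usually corresponds to the two variables below (standard TiDB names, shown only as a sketch at session scope):

-- tidb_allow_mpp = ON lets the optimizer choose MPP when it estimates MPP is cheaper;
-- with tidb_enforce_mpp = ON as well, the cost estimate is ignored and MPP is used whenever the query supports it.
SELECT @@tidb_allow_mpp, @@tidb_enforce_mpp;
SET SESSION tidb_enforce_mpp = ON;  -- only in the session where you really want to force MPP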

| username: ShawnYan | Original post link

A memory surge caused by improper SQL doesn’t necessarily mean an OOM occurred. Are there any logs from when TiFlash restarted?

We still need to capture the problematic SQL and check the execution plan.
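
One way to pull the candidates is the slow query log around the incident (a sketch: it assumes the slow log is enabled and uses the standard cluster_slow_query table; the time window below is taken from the error timestamps above, and Mem_max is in bytes):

-- top memory consumers in the incident window
SELECT Time, Query_time, Mem_max, LEFT(Query, 100) AS q
FROM information_schema.cluster_slow_query
WHERE Time BETWEEN '2023-08-15 05:00:00' AND '2023-08-15 05:30:00'
ORDER BY Mem_max DESC
LIMIT 10;
-- then run EXPLAIN ANALYZE on the suspect statement and check the TiFlash operators and task types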

| username: redgame | Original post link

OOM is the more likely cause.