TiFlash cannot run normally after restarting

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tiflash 重启后无法正常运行

| username: TiDBer_jYQINSnf

[TiDB Usage Environment] Production Environment / Testing / PoC
[TiDB Version] 5.4.0
[Reproduction Path] What operations were performed when the issue occurred
A roughly 2 TB TiFlash deployment. After a table of about 100 GiB+ was dropped, one of the TiFlash instances stopped working properly, and the monitoring panel showed unusually high latency.
After the instance was restarted, TiFlash reported an error and could not start.
The issue was temporarily worked around by adding a new node, but I would like to understand the cause of the failure and whether there is a quicker way to recover.
Logs are as follows:
[Attachment: Screenshot/Logs/Monitoring]
errorlog(1).log (4.1 MB)
serverlog(1).log (22.0 MB)
tiflash(1).log (62.8 MB)
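For reference only (not part of the original post), a minimal sketch of how the TiFlash replica sync state could be checked from TiDB 5.4 via the standard `information_schema.tiflash_replica` table; the filter is just an illustration:

```sql
-- Tables whose TiFlash replicas are not fully available or still syncing.
-- AVAILABLE = 0 or PROGRESS < 1 points at replicas that are catching up (or stuck).
SELECT TABLE_SCHEMA, TABLE_NAME, REPLICA_COUNT, AVAILABLE, PROGRESS
FROM information_schema.tiflash_replica
WHERE AVAILABLE = 0 OR PROGRESS < 1;
```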

| username: Lucien-卢西恩 | Original post link

After the instance was restarted, TiFlash reported an error and could not start.
The issue was temporarily worked around by adding a new node, but I would like to understand the cause of the failure and whether there is a quicker way to recover.

The 100+ GiB table that was dropped had two TiFlash replicas, right? Was it removed with the DROP TABLE command? What is the current configuration of the TiFlash node? Normally, GC for a 100 GiB table finishes quickly and should not leave TiFlash unusable. Could you share when the DROP TABLE was run and when TiFlash was restarted?
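As an illustrative sketch (not from the original reply), the GC window and the TiFlash-side configuration asked about above can be inspected like this on TiDB 5.4; the exact values will of course differ per cluster:

```sql
-- GC safepoint and life time recorded by TiDB: data removed with DROP TABLE is
-- only physically reclaimed once GC advances past the drop timestamp.
SELECT VARIABLE_NAME, VARIABLE_VALUE
FROM mysql.tidb
WHERE VARIABLE_NAME LIKE 'tikv_gc_%';

-- Effective configuration of the TiFlash nodes as seen by TiDB.
SHOW CONFIG WHERE type = 'tiflash';
```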

| username: TiDBer_jYQINSnf | Original post link

I reviewed the records from that time. The dropped table had 10 billion rows and was removed with DROP TABLE. The TiFlash node is a machine with 64 cores and 256 GB of memory. The table was dropped around noon, and the restart happened around 7 PM. This happened before the Chinese New Year, and since there was no quick way to recover at the time, a new node was created and gradually took over. Only these three logs are still retained.

If the cause of the incident can be analyzed from these three logs, or if there is a way to skip the fatal region, that would be best, since it would serve as a reference for similar issues in the future. If the information is too limited to pinpoint the problem, then let's leave it at that for now.
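On locating which regions sit on the failing instance: a minimal sketch, assuming the standard information_schema tables in 5.4; the store ID 123 below is a placeholder, not a value from this thread:

```sql
-- Find the store ID of the failing TiFlash instance (TiFlash stores carry an
-- "engine: tiflash" entry in the LABEL column).
SELECT STORE_ID, ADDRESS, STORE_STATE_NAME, LABEL
FROM information_schema.tikv_store_status;

-- Regions that still keep a (learner) peer on that store; replace 123 with the
-- STORE_ID found above.
SELECT REGION_ID, PEER_ID, IS_LEARNER, STATUS
FROM information_schema.tikv_region_peers
WHERE STORE_ID = 123;
```

This only identifies which regions have peers on the bad store; whether a specific region can be safely skipped would still depend on what the TiFlash error log reports.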

Thank you, thank you!

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.