Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.
Original topic: tikv get snapshot failed
The Prewrite phase is taking too long.
A large number of get snapshot failed errors are degrading performance. How can I troubleshoot this?
logs (2).zip (56.9 KB)
Could you provide additional information about the cluster configuration? Network, hardware (CPU, memory, disk), node information, etc.
[image attachment]
Is this IO consistently very high…?
It has always been like this; the service writes data continuously. We are in the smart-device business, and real-time data from the devices is written roughly every 10 seconds.
You can pay attention to whether the disk throughput performance fully meets the business needs…
This metric suggests that business demand is exceeding what the hardware can deliver.
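If you want a rough sense of whether the disk can keep up, you can time a synced sequential write on the TiKV data disk and compare the result with your write volume. A minimal Python sketch (the path and sizes are placeholders, not from this thread):

```python
import os
import time

# Write TOTAL bytes in CHUNK-sized pieces, fsync once at the end,
# and report sequential write throughput in MB/s.
TEST_PATH = "/tmp/throughput_test.bin"   # placeholder: point at the TiKV data disk
CHUNK = 4 * 1024 * 1024                  # 4 MiB per write call
TOTAL = 256 * 1024 * 1024                # 256 MiB in total
buf = os.urandom(CHUNK)                  # random data avoids compression effects

fd = os.open(TEST_PATH, os.O_CREAT | os.O_WRONLY | os.O_TRUNC, 0o644)
try:
    start = time.perf_counter()
    written = 0
    while written < TOTAL:
        written += os.write(fd, buf)
    os.fsync(fd)                         # include the flush in the timing
    elapsed = time.perf_counter() - start
    print(f"{TOTAL / elapsed / 1024 / 1024:.0f} MB/s sequential write")
finally:
    os.close(fd)
    os.remove(TEST_PATH)
```

A dedicated tool such as fio will give more reliable numbers; this is only a quick sanity check.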
The workload is mainly writes. Increasing the batch size of the writes has improved performance, but the error above still occurs. Is it a bug?
Insufficient IO limits communication between many of the services… and it also makes it impossible to keep the replicas in sync.
I suggest you check the region statistics. If you find a significant imbalance, try to optimize it.
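For reference, one quick way to compare region and leader counts across stores is PD's HTTP API. A minimal Python sketch, assuming PD listens on the default port (replace PD_ADDR with your own; field names follow the /pd/api/v1/stores response of recent PD versions):

```python
import json
import urllib.request

# List per-store region and leader counts from PD, to spot imbalance.
PD_ADDR = "http://127.0.0.1:2379"  # placeholder: use your PD address

with urllib.request.urlopen(f"{PD_ADDR}/pd/api/v1/stores") as resp:
    stores = json.load(resp)["stores"]

for s in stores:
    addr = s["store"]["address"]
    status = s["status"]
    print(f"{addr}: regions={status.get('region_count')}, "
          f"leaders={status.get('leader_count')}")
```

If one store holds far more regions or leaders than the others, that store's disk will be hit correspondingly harder.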
Is the IO insufficient because the SSD's read/write performance does not meet the business requirements?
Check the SSD's performance specifications. There may also be a RAID setup in the path, which could prevent the SSD from delivering its full performance.
You can refer to this
Additionally, here are some reference figures I have collected myself (a quick measurement sketch follows the list):
fdatasync performance reference:
Reference value 1: non-NVMe SSDs sustain roughly 5~8K fdatasync/s
Reference value 2: early NVMe SSDs sustain roughly 20~50K fdatasync/s
Reference value 3: current mature PCIe 3 NVMe SSDs sustain roughly 200~500K fdatasync/s
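To see where your own disk falls against these reference values, you can time fdatasync calls directly. A minimal Python sketch (the path and iteration count are arbitrary placeholders; run it on the same mount point TiKV uses):

```python
import os
import time

# Count how many fdatasync calls per second the disk can sustain.
TEST_PATH = "/tmp/fdatasync_test.bin"  # placeholder: use the TiKV data disk
ITERATIONS = 2000
BLOCK = b"\0" * 4096                   # small 4 KiB write before each sync

fd = os.open(TEST_PATH, os.O_CREAT | os.O_WRONLY, 0o644)
try:
    start = time.perf_counter()
    for _ in range(ITERATIONS):
        os.write(fd, BLOCK)
        os.fdatasync(fd)               # force the data to stable storage
    elapsed = time.perf_counter() - start
    print(f"{ITERATIONS / elapsed:.0f} fdatasync/s "
          f"({elapsed / ITERATIONS * 1e6:.0f} us per call)")
finally:
    os.close(fd)
    os.remove(TEST_PATH)
```

fio can measure the same thing more rigorously; this sketch is only for a quick comparison against the numbers above.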
It looks like the IO capacity is insufficient (IO util is at 100%). In addition, you can also check some network transmission metrics.
This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.