Same Version, Different Linux Kernels: Abnormally High ioutil

This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 相同版本不同linux内核ioutil异常过高

| username: zhengjunbo

[TiDB Usage Environment] Production Environment

After upgrading from Linux kernel 3.10.0-1160.6.1.el7.x86_64 to 3.10.0-1160.95.1.el7.x86_64, the performance of NVMe disks dropped drastically. Regular SSDs were not affected. We narrowed the problem down to the kernel by upgrading and downgrading the kernel version.
Can you help identify the root cause? Also, can you help verify which kernel versions cause this issue and provide a warning?

| username: WalterWj | Original post link

I feel that we need to consult experts in systems and hardware.

| username: 这里介绍不了我 | Original post link

Is it just the ioutil that’s exceptionally high? Has there been any feedback or issues from the business side?

| username: zhengjunbo | Original post link

The backup database hasn’t taken heavy business traffic yet. We noticed that it behaved differently from the primary database and have been troubleshooting for a while.

| username: zhengjunbo | Original post link

Performance is very poor during stress testing: only about one-fifth of what a machine with the same configuration delivers.

| username: 这里介绍不了我 | Original post link

It’s a bit strange. Previously, we encountered inconsistencies with ioutil, but the QPS and other metrics during stress testing were still as expected.
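
For anyone comparing ioutil between hosts, here is a minimal Python sketch of how a %util-style figure can be derived from two snapshots of /proc/diskstats. It assumes the standard diskstats layout, where the 13th column is the time spent doing I/O in milliseconds; the device name nvme0n1 and the helper names are only illustrative. Note that on devices that serve requests in parallel, such as NVMe, a high value does not by itself mean the disk is saturated, which may be why ioutil and QPS can disagree.

```python
import time


def io_ticks_ms(diskstats_text, device):
    """Return the 'time spent doing I/Os' counter (ms) for one device."""
    for line in diskstats_text.splitlines():
        fields = line.split()
        # Layout: major, minor, name, then the per-device counters.
        # Column 13 (index 12) is the milliseconds spent doing I/O.
        if len(fields) >= 13 and fields[2] == device:
            return int(fields[12])
    raise ValueError(f"device {device!r} not found in diskstats")


def util_percent(before_text, after_text, device, elapsed_s):
    """Percentage of the interval during which the device had I/O in flight."""
    busy_ms = io_ticks_ms(after_text, device) - io_ticks_ms(before_text, device)
    return 100.0 * busy_ms / (elapsed_s * 1000.0)


def sample_util(device, interval_s=1.0, path="/proc/diskstats"):
    """Take two snapshots interval_s apart and compute the utilization."""
    with open(path) as f:
        before = f.read()
    start = time.monotonic()
    time.sleep(interval_s)
    with open(path) as f:
        after = f.read()
    return util_percent(before, after, device, time.monotonic() - start)
```

Calling sample_util("nvme0n1") on both kernels under the same load gives a figure that can be compared directly with what the monitoring shows.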

| username: Jellybean | Original post link

In this situation, it’s necessary to move away from the database components and focus on verifying the performance issues of the operating system.

You can ask your SRE colleagues for help, or escalate to management so that they can coordinate the support. A DBA may not be able to resolve this alone.

| username: zhanggame1 | Original post link

Did you run fio?
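
If it helps, below is a rough Python sketch for running the same fio job on both kernels and comparing the numbers. It only builds the command line and summarizes fio's JSON output. The job name, the 1G size, the default block size and queue depth, and the helper names are assumptions chosen for illustration, not values from this thread.

```python
import json


def build_fio_cmd(filename, rw="randread", bs="4k", iodepth=32, runtime_s=60):
    """Build a fio command for a direct-I/O test against a test file."""
    return [
        "fio",
        "--name=nvme-check",          # hypothetical job name
        f"--filename={filename}",     # use a scratch file, not live data
        f"--rw={rw}",
        f"--bs={bs}",
        "--ioengine=libaio",
        "--direct=1",                 # bypass the page cache
        f"--iodepth={iodepth}",
        "--size=1G",
        f"--runtime={runtime_s}",
        "--time_based",
        "--group_reporting",
        "--output-format=json",
    ]


def summarize(fio_json_text):
    """Extract IOPS and bandwidth (KiB/s) for reads and writes of the first job."""
    job = json.loads(fio_json_text)["jobs"][0]
    return {
        direction: {"iops": job[direction]["iops"], "bw_kib_s": job[direction]["bw"]}
        for direction in ("read", "write")
    }

# Example (needs fio installed; the path is a placeholder):
#   import subprocess
#   out = subprocess.run(build_fio_cmd("/data/fio.test"), capture_output=True,
#                        text=True, check=True).stdout
#   print(summarize(out))
```

Running it with identical parameters on each kernel keeps the comparison fair; if the fio figures match but the database benchmark does not, the difference is probably outside the raw disk path.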

| username: TiDBer_QYr0vohO | Original post link

Could it be caused by the incompatibility between the upgraded kernel version and the NVMe driver?
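
One way to look for such a difference is to record the block-queue settings of the NVMe device under each kernel and diff them. The Python sketch below is only a starting point: it reads a few standard attributes under /sys/block/<device>/queue, and the choice of attributes, the device name, and the helper names are assumptions for illustration. It cannot prove or rule out a driver incompatibility by itself.

```python
import os
import platform

# A few block-queue attributes exposed by sysfs; extend the list as needed.
QUEUE_ATTRS = ("scheduler", "nr_requests", "read_ahead_kb", "rotational")


def queue_settings(device, sys_block="/sys/block"):
    """Read selected queue attributes for a block device (None if unreadable)."""
    settings = {}
    for attr in QUEUE_ATTRS:
        path = os.path.join(sys_block, device, "queue", attr)
        try:
            with open(path) as f:
                settings[attr] = f.read().strip()
        except OSError:
            settings[attr] = None
    return settings


def snapshot(device, sys_block="/sys/block"):
    """Combine the running kernel release with the device's queue settings."""
    return {"kernel": platform.release(), "device": device,
            **queue_settings(device, sys_block)}


def diff_snapshots(a, b):
    """Return only the keys whose values differ, as (a_value, b_value) pairs."""
    return {key: (a.get(key), b.get(key))
            for key in sorted(set(a) | set(b))
            if a.get(key) != b.get(key)}
```

Saving snapshot("nvme0n1") on each kernel and passing both results to diff_snapshots shows at a glance whether, for example, the scheduler or queue depth changed with the upgrade.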

| username: zhaokede | Original post link

Before going live in the production environment, did you not perform stress testing in the testing environment?

| username: zhengjunbo | Original post link

Yes, I ran it. The fio results show no significant difference.

| username: zhengjunbo | Original post link

Originally, the task was only to reinstall the Linux system, but the operations colleague ended up upgrading the kernel as well.

| username: zhengjunbo | Original post link

So far, I haven’t found any specific information on that.

| username: 托马斯滑板鞋 | Original post link

Can the database be reinstalled? Remove it and re-import the data.

| username: zhengjunbo | Original post link

Found it, and it was a big pitfall. It turned out to be a problem with the stress testing tool: sysbench itself did not perform well enough.

| username: WalterWj | Original post link

Regarding this issue, the database does not need to be reinstalled or handled in any way.
It seems that upgrading the kernel resolved it, which is generally a compatibility/bug issue between the hardware and the system kernel.