After expanding TiKV nodes from 3 to 5, the Store Region score keeps fluctuating

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tikv节点从3个扩容到5个以后,Store Region score一直在波动

| username: 重启试试

After expanding the TiKV nodes from 3 to 5, the Store Region score has been fluctuating constantly, causing the apply log and append log durations to increase severalfold.



After several attempts, reducing the TiKV nodes to 3 resolves the issue, but expanding to 5 nodes causes it to reappear.
How should this be resolved, or what troubleshooting steps should be taken?

【TiDB Environment】 Production
【TiDB Version】 4.0.10
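
For reference, the fluctuating score can be read directly from pd-ctl; a minimal sketch, assuming pd-ctl is on the PATH and using a placeholder PD address (with a TiUP deployment, `tiup ctl:v4.0.10 pd` invokes the same tool):

```shell
# List all stores; the region_score / leader_score fields under "status" are
# what the Store Region score panel graphs. Run it a few times to watch the drift.
pd-ctl -u http://127.0.0.1:2379 store
```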

| username: wuxiangdong | Original post link

Could it be that scheduling is not yet complete, and that is what is affecting the IO?
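
Whether balance scheduling is still in flight can be checked with pd-ctl; a minimal sketch, again with a placeholder PD address:

```shell
# Operators still pending means region scheduling is still ongoing.
pd-ctl -u http://127.0.0.1:2379 operator show

# Schedulers currently enabled on the cluster.
pd-ctl -u http://127.0.0.1:2379 scheduler show
```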

| username: alfred | Original post link

Has concurrent writing increased during peak periods?

| username: 重启试试 | Original post link

The scheduling was completed a few days ago, and the number of regions was not high to begin with. The expansion was completed in about an hour.

| username: 重启试试 | Original post link

During peak business times, the query volume increases significantly, but the write volume is not large.

| username: alfred | Original post link

“Reducing TiKV to 3 nodes solves the problem, but expanding to 5 nodes causes it again.” Are the configurations of the TiKV nodes the same? Especially the disk IO capability.
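
A quick way to compare raw disk capability across the nodes is a short fio random-write test on each TiKV host; a minimal sketch, with a placeholder file path on the TiKV data disk (delete the test file afterwards):

```shell
# 60-second 4K random-write latency/IOPS test with direct IO.
fio --name=tikv-disk-check --filename=/data/tikv/fio-test-file --size=1G \
    --rw=randwrite --bs=4k --iodepth=32 --ioengine=libaio --direct=1 \
    --runtime=60 --time_based --group_reporting
```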

| username: 重启试试 | Original post link

Among the 5 TiKV nodes, there are two different capacity specifications, but the IO capability is the same.

| username: h5n1 | Original post link

Follow this link to export the PD monitoring pages before and after scaling out: https://metricstool.pingcap.com/#backup-with-dev-tools

| username: 重启试试 | Original post link

jkylcluster-PD_2022-08-30T08_33_19.385Z_last_7_day.json (6.6 MB)

| username: 重启试试 | Original post link

Through monitoring, we found empty regions. We modified `split-region-on-table` and `enable-cross-table-merge`, and unified the disk capacity of each TiKV node, but the effect is still not very noticeable.
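
For readers following along, these are the knobs involved; a minimal sketch, assuming a TiUP deployment, a placeholder PD address, and `<cluster-name>` as a stand-in:

```shell
# PD side: allow empty regions from different tables to be merged,
# and optionally let the merge checker work through them faster.
pd-ctl -u http://127.0.0.1:2379 config set enable-cross-table-merge true
pd-ctl -u http://127.0.0.1:2379 config set merge-schedule-limit 8

# TiKV side: stop splitting regions on table boundaries so empty regions
# can actually be merged away. Edit the cluster config, then reload TiKV:
tiup cluster edit-config <cluster-name>
#   server_configs:
#     tikv:
#       coprocessor.split-region-on-table: false
tiup cluster reload <cluster-name> -R tikv
```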

| username: 重启试试 | Original post link

After removing hot-region scheduling with `scheduler remove balance-hot-region-scheduler`, region scheduling on each TiKV node has decreased significantly and stabilized. The apply log time has also dropped from 256-512 ms to 64-128 ms, although that is still considerably higher than the 16-32 ms before the expansion.
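
For anyone hitting the same issue, this is a pd-ctl scheduler operation; a minimal sketch with a placeholder PD address (note this disables hot-region balancing entirely until the scheduler is added back):

```shell
# Remove the hot-region balancing scheduler.
pd-ctl -u http://127.0.0.1:2379 scheduler remove balance-hot-region-scheduler

# Confirm it is gone.
pd-ctl -u http://127.0.0.1:2379 scheduler show

# To restore the default behavior later:
pd-ctl -u http://127.0.0.1:2379 scheduler add balance-hot-region-scheduler
```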

| username: 重启试试 | Original post link

After turning off hotspot scheduling, the region scheduling monitoring is no longer so chaotic!

| username: h5n1 | Original post link

Let’s see the output of `tiup mirror show`.
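
For context, `tiup mirror show` prints the mirror address TiUP is configured to pull components from; on a default install the output is typically the official mirror:

```shell
tiup mirror show
# https://tiup-mirrors.pingcap.com
```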

| username: 重启试试 | Original post link

I didn’t install a mirror. Yesterday afternoon around 3 PM I turned off hotspot scheduling, and by around 8 PM the apply log time and query response time had returned to pre-expansion levels. It’s quite strange that expanding TiKV from 3 to 5 nodes produced so much additional hotspot scheduling.

| username: h5n1 | Original post link

It feels like adding these two nodes has affected the scheduling algorithm.

| username: 重启试试 | Original post link

Yes, hotspot scheduling is now turned off entirely.

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.