This log means that TiDB sent a request for a region to TiKV, but the peer it reached is not that region's current leader, usually because TiDB's region cache has gone stale. TiDB retries automatically, and as long as this doesn't happen frequently it should not affect the application.
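If you want to confirm which store currently holds the leader of a specific region, you can query PD directly with pd-ctl; this is just a quick check, the region ID 1234 below is a placeholder and it assumes pd-ctl can reach your PD endpoint:
pd-ctl -i
region 1234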
If it wasn't caused by scaling in or out, then something else is triggering a large number of leader migrations. Did it resolve itself after a while, or is the error still being reported continuously?
In the PD monitoring there is an Operator panel that tracks how operators are created, checked, processed, and cancelled. Check what most of the operators are being generated for.
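If you prefer the command line over the monitoring page, pd-ctl can also list the operators PD is currently running, which is a quick way to sample what kind of scheduling is going on (assuming pd-ctl is pointed at your PD endpoint):
pd-ctl -i
operator show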
These are mostly write-hotspot and region-balance operators. If the scheduling is affecting TiDB, you can throttle it with the following commands:
pd-ctl -i
store limit all 5
The default store limit is 15, which means each store can process 15 region operations per minute; lowering it to 5 reduces the scheduling load. If nothing else looks abnormal, no action is needed: the migrations are simply PD rebalancing regions with write hotspots.
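To see the limits currently in effect, or to restore the default once the hotspot scheduling has settled down, you can run the following in the same pd-ctl session (15 being the default mentioned above):
store limit
store limit all 15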
The response time is long.
TiDB itself caches the location of each region's leader. When the leader moves, the TiDB node has to fetch the new location from PD again, which adds some latency to those requests.
Other than that, there’s no significant impact.