TiFlash Load Imbalance Across Different Nodes

This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tiflash 不同节点负载不均衡

| username: 华健-梦诚科技

tidb version: v6.1.1
Cluster configuration and topology are as follows:

The cluster contains about 4,000 tables, 1.1 TB of data, and 23,000 regions:

The store size, leader count, and region count of each TiKV node are as follows:

All tables have their TiFlash replica count set to 4, for a total of 18,000 regions, and the 4 nodes are fully symmetrical.

I performed a stress test on this cluster, replaying about 4.5 million SQL statements at 20 concurrency.
The SQL mix is roughly 95% AP and 5% TP, captured from the production environment and touching almost half of the tables.

The 4 tidb nodes have evenly distributed connections and relatively balanced CPU load:

The load on the 4 tikv nodes is also relatively balanced:

However, the load on TiFlash is very unbalanced, and the number of open files is also uneven, as shown:

I noticed there is an issue related to this problem on GitHub, which is still open:

According to the description in the issue, if a TiFlash node goes down and then recovers, TiDB's region cache can leave it with no load.
For this reason, I restarted all tidb services (tiup cluster restart xxx -R tidb -y).
After rerunning the stress test, no tiflash service anomalies occurred.
The load on tiflash is still very unbalanced, but the load ratio of each node has changed significantly:

Next, I restarted the entire cluster (tiup cluster restart xxx -y).
After rerunning the stress test, no tiflash service anomalies occurred.
The tiflash load is similar to the previous figure:

My questions are:
What is the principle behind tidb assigning MPP tasks to tiflash nodes? Why does this imbalance occur? How can it be resolved?

| username: flow-PingCAP | Original post link

Please use this tool https://metricstool.pingcap.com/ to export the tiflash-summary monitoring.
Also, how many tables are in your cluster?

| username: 华健-梦诚科技 | Original post link

Okay, I’ll give it a try. I haven’t used this tool before.

The cluster has more than 4000 tables, as mentioned in the post above.

| username: 华健-梦诚科技 | Original post link

mc-TiFlash-Summary_2022-09-21T09_14_08.949Z.json (915.8 KB)

| username: flow-PingCAP | Original post link

There is indeed a situation where requests are unevenly distributed among different TiFlash nodes.

| username: windtalker | Original post link

Currently, when TiDB generates TiFlash MPP tasks, it balances all regions of the table across the available TiFlash nodes. For example, if a table has 100 regions and there are 4 TiFlash nodes, TiDB tries to make each TiFlash node read 25 regions. In your scenario, with 4 nodes and the replica count set to 4, each TiFlash node should read 25 of those regions.

However, TiDB only performs this balancing within a single query. If a table has relatively few regions — in the extreme case, just one region — and there are 100 concurrent queries against that table, TiDB MPP currently does not balance across queries. In such cases the load on different TiFlash nodes can become uneven. Considering your cluster has over 4,000 tables but only 23,000 regions, many tables must have very few regions, so this situation can easily occur.

You can enable TiDB’s debug log; for MPP queries it will print lines similar to the following:
Before region balance xxxx
After region balance xxxx

| username: 华健-梦诚科技 | Original post link

Let me continue to ask:
Assuming there are 4 nodes named A, B, C, and D, and the SQL is executed on the TiDB service of node A.
Assuming this SQL only involves one region of a table, and this region has 4 TiFlash replicas available on nodes A, B, C, and D, what is the logic for choosing which replica to use?

| username: windtalker | Original post link

If TiFlash is accessed via MPP or BatchCop, one replica of a region is chosen and used continuously until an error occurs on that replica. If TiFlash is accessed via cop, each query switches to the next replica.

| username: 华健-梦诚科技 | Original post link

This indeed explains my observation. Was this balancing logic added recently? When I first stress-tested on version 5.4, the load seemed quite balanced.

In my real business scenario, the data distribution and SQL distribution cause TiFlash resources to be wasted. What can I do to balance the load? Is there a parameter I can set?

| username: 华健-梦诚科技 | Original post link

I studied the code and changed this parameter from false to true.

Because I found that inside the GetTiFlashRPCContext function, when loadBalance = false it always returns the first store found, which creates a hotspot. When set to true, it switches stores on each call.

After compiling and replacing the binary, the load test is completely balanced:

However, I don’t know the original reason for passing in false, or what side effects changing it to true might have. Is it only what the comment mentions:
// loadBalance is an option. For MPP and batch cop, it is pointless and might cause trying the failed store repeatedly.
If it’s just this, then it is acceptable in my business scenario, since the probability of a store failure is very small.

Please advise, experts.

| username: windtalker | Original post link

This logic has existed since version 5.0; it’s probably just chance that you didn’t encounter the imbalance before.

| username: windtalker | Original post link

This is due to historical reasons. For MPP it can safely be changed to true. There might still be issues with batchCop, but since batchCop can essentially be replaced entirely by MPP, changing it to true is not a problem.

| username: windtalker | Original post link

If you are not confident about batchCop, you can check whether mppStoreLastFailTime is nil: if it is not nil, the request is MPP; if it is nil, it is batchCop. That way you can enable region load balancing only for MPP.

| username: 华健-梦诚科技 | Original post link

I ran all the stress tests overnight without any issues,
so I simply changed this parameter.
I suggest that TiDB add a parameter in a future version to control this balancing.
Thanks for the guidance!

| username: windtalker | Original post link

Yes. After we added load balancing within a query, we overlooked load balancing across queries. We plan to fix this in an upcoming version of TiDB.

| username: windtalker | Original post link

I raised a related issue: MPP query may not be balanced between TiFlash nodes · Issue #38113 · pingcap/tidb · GitHub

| username: 华健-梦诚科技 | Original post link

Okay, awesome.

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.