TiFlash Load Imbalance Across Different Nodes

This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tiflash 不同节点负载不均衡

| username: 华健-梦诚科技

tidb version: v6.1.1
Cluster configuration and topology are as follows:

The cluster contains about 4,000 tables, 1.1 TB of data, and 23,000 regions:

The store size, leader count, and region count of each TiKV node are as follows:

All tables have their TiFlash replica count set to 4, for a total of 18,000 regions, and the 4 nodes are fully symmetrical.

I performed a stress test on this cluster, replaying about 4.5 million SQL statements at 20 concurrency.
The SQL mix is roughly 95% AP and 5% TP, captured from the production environment and touching almost half of the tables.

The 4 tidb nodes have evenly distributed connections and relatively balanced CPU load:

The load on the 4 tikv nodes is also relatively balanced:

However, the load on TiFlash is very unbalanced, and the number of open files is also uneven, as shown:

I noticed there is an issue related to this problem on GitHub, which is still open:

According to the description in the issue, if a TiFlash node goes down and then recovers, TiDB's region cache can leave it with no load.
For this reason, I restarted all tidb services (tiup cluster restart xxx -R tidb -y).
After rerunning the stress test, no tiflash service anomalies occurred.
The load on tiflash is still very unbalanced, but the load ratio of each node has changed significantly:

Next, I restarted the entire cluster (tiup cluster restart xxx -y).
After rerunning the stress test, no tiflash service anomalies occurred.
The tiflash load is similar to the previous figure:

My questions are:
What is the principle behind tidb assigning MPP tasks to tiflash nodes? Why does this imbalance occur? How can it be resolved?

| username: flow-PingCAP | Original post link

Please use this tool https://metricstool.pingcap.com/ to export the tiflash-summary monitoring.
Also, how many tables are in your cluster?

| username: 华健-梦诚科技 | Original post link

Okay, I’ll give it a try. I haven’t used this tool before.

The cluster has more than 4000 tables, as mentioned in the post above.

| username: 华健-梦诚科技 | Original post link

mc-TiFlash-Summary_2022-09-21T09_14_08.949Z.json (915.8 KB)

| username: flow-PingCAP | Original post link

There is indeed a situation where requests are unevenly distributed among different TiFlash nodes.

| username: windtalker | Original post link

Currently, when TiDB generates TiFlash MPP tasks, it balances all regions of the table across the available TiFlash nodes. For example, if a table has 100 regions and there are 4 TiFlash nodes, TiDB tries to make each TiFlash node read 25 regions. In your scenario, with 4 nodes and the replica count set to 4, each TiFlash node should read 25 of those regions.

However, TiDB only performs this balancing within a single query. If a table has relatively few regions — in the extreme case, just one region — and there are 100 concurrent queries against that table, TiDB MPP currently does not balance across queries. In such cases the load on different TiFlash nodes can become uneven. Considering your cluster has over 4,000 tables but only 23,000 regions, many tables must have very few regions, so this situation can easily occur.

You can enable TiDB’s debug log; for MPP queries it will print lines similar to the following:
Before region balance xxxx
After region balance xxxx

| username: 华健-梦诚科技 | Original post link

Let me continue to ask:
Assuming there are 4 nodes named A, B, C, and D, and the SQL is executed on the TiDB service of node A.
Assuming this SQL only involves one region of a table, and this region has 4 TiFlash replicas available on nodes A, B, C, and D, what is the logic for choosing which replica to use?

| username: windtalker | Original post link

If TiFlash is accessed via MPP or BatchCop, one replica of a region is chosen and used continuously until an error occurs on that replica. If TiFlash is accessed via cop, each query switches to the next replica.

| username: 华健-梦诚科技 | Original post link

This indeed explains my observation. Was this balancing logic added recently? When I first stress-tested on version 5.4, the load seemed quite balanced.

In my real business scenario, the data distribution and SQL distribution cause TiFlash resources to be wasted. What can I do to balance the load? Is there a parameter I can set?

| username: 华健-梦诚科技 | Original post link

I studied the code and changed this parameter from false to true.

Because I found that inside the GetTiFlashRPCContext function, when loadBalance = false it always returns the first store found, which creates a hotspot. When set to true, it switches stores on each call.

After compiling and replacing the binary, the load test is completely balanced:

However, I don’t know the original reason for passing in false, or what side effects changing it to true might have. Is it only what the comment mentions:
// loadBalance is an option. For MPP and batch cop, it is pointless and might cause trying the failed store repeatedly.
If it’s just this, then it is acceptable in my business scenario, since the probability of a store failure is very small.

Please advise, experts.

| username: windtalker | Original post link

This logic has existed since version 5.0; it’s probably just chance that you didn’t encounter the imbalance before.

| username: windtalker | Original post link

This is due to historical reasons. For MPP it can safely be changed to true. There might still be issues with batchCop, but since batchCop can essentially be replaced entirely by MPP, changing it to true is not a problem.

| username: windtalker | Original post link

If you are not confident about batchCop, you can check whether mppStoreLastFailTime is nil: if it is not nil, the request is MPP; if it is nil, it is batchCop. That way you can enable region load balancing only for MPP.

| username: 华健-梦诚科技 | Original post link

I ran all the stress tests overnight without any issues,
so I simply changed this parameter.
I suggest that TiDB add a parameter in a future version to control this balancing.
Thanks for the guidance!

| username: windtalker | Original post link

Yes. After we added load balancing within a query, we overlooked load balancing across queries. We plan to fix this in an upcoming version of TiDB.

| username: windtalker | Original post link

I raised a related issue: MPP query may not be balanced between TiFlash nodes · Issue #38113 · pingcap/tidb · GitHub

| username: 华健-梦诚科技 | Original post link

Okay, awesome.

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.