TiFlash 7.5.1 CPU Load Imbalance

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tiflash 7.5.1 cpu负载不均衡

| username: foxchan

[TiDB Usage Environment] Production Environment
TiFlash version 7.5.1

The 10 TiFlash nodes have equal weights, yet the node shown in green in the monitoring graph (a 64-CPU machine) is always abnormally high. Are there any parameters or configurations that can make the TiFlash load more balanced?
[Attachment: monitoring screenshot]

| username: TiDBer_QYr0vohO | Original post link

Are all the machine configurations consistent?

| username: foxchan | Original post link

Consistent configuration

| username: yiduoyunQ | Original post link

Are all the replica counts 2?

SELECT * FROM information_schema.tiflash_replica;
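
A narrower variant of that check (a sketch; it relies on the documented REPLICA_COUNT, AVAILABLE, and PROGRESS columns of information_schema.tiflash_replica) lists only the tables whose replica count differs from 2 or whose replicas are not yet fully available:

-- Tables whose TiFlash replica count is not 2, or whose replicas are not yet
-- available (PROGRESS < 1 means replication is still in progress).
SELECT TABLE_SCHEMA, TABLE_NAME, REPLICA_COUNT, AVAILABLE, PROGRESS
FROM information_schema.tiflash_replica
WHERE REPLICA_COUNT <> 2 OR AVAILABLE = 0;
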
| username: foxchan | Original post link

All of them

| username: yiduoyunQ | Original post link

Check the TiFlash – Task Scheduler – Active and Waiting Queries Count panel in Grafana to see whether tasks are balanced across the nodes.

| username: 洪七表哥 | Original post link

Do you have any special business workloads?

| username: 有猫万事足 | Original post link

To analyze why a single machine stands out like this, it is recommended to use the Top SQL feature to see which SQL statements are being executed on that machine.

Additionally, TiDB’s DDL is executed by only one owner, which could be the reason.

| username: foxchan | Original post link

Top SQL only analyzes TiDB and TiKV, while the current CPU imbalance is on TiFlash. What does it have to do with TiDB DDL?

| username: 有猫万事足 | Original post link

Okay, I didn’t pay attention to the question.

After thinking about it carefully, I really can’t think of a particularly good solution for TiFlash. Let’s see what other experts have to say.

| username: Lloyd-Pottiger | Original post link

  1. Identify the table db_x.t_x that appears most frequently in slow queries.
  2. Run the following SQL to check if the data on each TiFlash node is evenly distributed.
select TABLE_ID, p.STORE_ID, ADDRESS, count(p.REGION_ID) 
from information_schema.tikv_region_status r, information_schema.tikv_region_peers p, information_schema.tikv_store_status s
where r.db_name = 'db_x' and r.table_name = 't_x'
and r.region_id = p.region_id and p.store_id = s.store_id and json_extract(s.label, "$[0].value") = "tiflash" 
group by TABLE_ID, p.STORE_ID, ADDRESS;

If it is a single machine with multiple instances, the location-label should be set to “host”.
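
To confirm how the stores are currently labeled before changing anything (a sketch; it simply reads the LABEL JSON column of information_schema.tikv_store_status, the same column the query above filters on):

-- Show each store's address together with its label set (engine, host, zone, ...).
SELECT STORE_ID, ADDRESS, LABEL
FROM information_schema.tikv_store_status;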

If it is a single machine with a single instance, you can manually schedule the region distribution of TiFlash replicas with a tool like Lloyd-Pottiger/tiflash-replica-table-data-balancer (https://github.com/Lloyd-Pottiger/tiflash-replica-table-data-balancer), which balances the table data of TiFlash replicas across multiple TiFlash instances.

| username: foxchan | Original post link

I picked a table that shows up in slow queries, and the distribution is indeed uneven. Aren't regions in TiFlash supposed to be evenly distributed?
You're awesome; let me compile the tool and give it a try.

| username: Lloyd-Pottiger | Original post link

TiFlash regions are balanced only at the store level, meaning the total number of regions per store is kept roughly equal. However, the regions of any given table may not be evenly distributed across the stores.
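
To see the store-level balance described above, a per-store total can be compared against the per-table counts (a sketch; like the earlier query, it assumes the engine label is the first entry in the store's LABEL JSON):

-- Total number of region peers on each TiFlash store, regardless of table.
SELECT p.STORE_ID, s.ADDRESS, count(p.REGION_ID) AS total_regions
FROM information_schema.tikv_region_peers p
JOIN information_schema.tikv_store_status s ON p.STORE_ID = s.STORE_ID
WHERE json_extract(s.LABEL, "$[0].value") = "tiflash"
GROUP BY p.STORE_ID, s.ADDRESS;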

| username: zhaokede | Original post link

How should this situation be handled?

| username: WalterWj | Original post link

  1. The scheduling of TiFlash regions and TiKV regions is basically the same on the PD side. Uneven distribution of table data is technically expected.
  2. There is hotspot scheduling: if a node is determined to be hot for a long time, region scheduling will be initiated, but this fluctuation probably did not trigger it (a query to check what PD currently considers hot is sketched below).
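
As a rough way to see what PD currently tracks as hot (a sketch; it reads information_schema.tidb_hot_regions, which reports PD's hot-region statistics and may not single out TiFlash peers specifically):

-- Regions PD currently tracks as read/write hotspots, busiest first.
SELECT DB_NAME, TABLE_NAME, REGION_ID, TYPE, MAX_HOT_DEGREE, FLOW_BYTES
FROM information_schema.tidb_hot_regions
ORDER BY FLOW_BYTES DESC;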

My understanding is that this does not affect usage.

You can try using this interface to scatter table regions, but I’m not sure if it works for TiFlash: https://github.com/pingcap/tidb/blob/master/docs/tidb_http_api.md

| username: foxchan | Original post link

After compiling and running, the region distribution is indeed more balanced. Maybe the distribution algorithm still needs some adjustments? Kudos to the expert.

Before scheduling: [screenshot]

After scheduling: [screenshot]

For now, we can use this method to solve the hotspot region issue in TiFlash. Looking forward to TiFlash having a UI-level hotspot map and official parameters for adjustments in the future.

| username: yytest | Original post link

It would help to share the business scenario and check the underlying logs.