Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.
Original topic: Lightning导入数据后,tiflash同步的问题 (TiFlash synchronization issue after importing data with Lightning)
Version: 6.1.0
Using Lightning to import CSV data, over 800 GB, with 32 threads; the import took 1.5 hours and reported success.
After the import succeeded, TiKV was fine, but TiFlash was not ready and took another 3 hours to become available.
During these 3 hours, the replica progress in the tiflash_replica table showed 100%, but queries that actually hit TiFlash storage would hang without responding.
Resource usage is shown in the images below. You can see that after the import ended, the cluster stayed in a low-load state for another 3 hours.
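For reference, the replica progress mentioned above can be read from the standard information_schema.tiflash_replica view; a minimal sketch (the schema filter is a placeholder):

```sql
-- PROGRESS is a fraction from 0 to 1; AVAILABLE is 0/1
SELECT TABLE_SCHEMA, TABLE_NAME, REPLICA_COUNT, AVAILABLE, PROGRESS
FROM information_schema.tiflash_replica
WHERE TABLE_SCHEMA = 'your_db';  -- placeholder: the imported schema
```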
My questions are:
- Is this a normal phenomenon?
- Is there a switch to speed up this 3-hour process?
Is there an expert who can provide some guidance?
- The phenomenon is abnormal; it requires monitoring and log analysis.
- Slow synchronization may have various causes; you can troubleshoot by following these steps:
- Adjust the scheduling parameter values.
- Adjust the load on the TiFlash side. Excessive TiFlash load can cause slow synchronization. You can check the load via these metrics in Grafana:
  - Applying snapshots Count: TiFlash-Summary > raft > Applying snapshots Count
  - Snapshot Predecode Duration: TiFlash-Summary > raft > Snapshot Predecode Duration
  - Snapshot Flush Duration: TiFlash-Summary > raft > Snapshot Flush Duration
  - Write Stall Duration: TiFlash-Summary > Storage Write Stall > Write Stall Duration
  - generate snapshot CPU: TiFlash-Proxy-Details > Thread CPU > Region task worker pre-handle/generate snapshot CPU
Adjust the load according to business priorities.
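If checking outside Grafana is more convenient, the underlying counters can also be scraped straight from TiFlash's Prometheus endpoint; a sketch assuming the default metrics port 8234 and a placeholder host:

```sh
# Pull raw Prometheus metrics from TiFlash and filter snapshot-related series
curl -s http://<tiflash-host>:8234/metrics | grep -i snapshot
```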
Thank you for the help!
It is a verification cluster, and there is no other load on the TiFlash side during synchronization. All indicators are very low according to the monitoring panel.
The current parameters for store limit are as follows. What would be an appropriate adjustment?
There is no fixed value; it mainly depends on the machine load. You can try multiple times to find a suitable value.
If the machine’s performance is reasonable, the delay shouldn’t be this long.
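For reference, store limit can be inspected and raised online with pd-ctl (here invoked through tiup); the PD address is a placeholder and the value is illustrative, to be tuned against machine load:

```sh
# Show the current per-store limits
tiup ctl:v6.1.0 pd -u http://<pd-host>:2379 store limit
# Raise the add-peer limit (Region operations per minute) for all stores
tiup ctl:v6.1.0 pd -u http://<pd-host>:2379 store limit all 200 add-peer
```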
Okay, I increased it and tried again. In terms of performance, I indeed didn’t see any load; CPU/memory/IO/network are all very low.
I have adjusted the store limit to the maximum of 200, but it still takes three to four hours, and resource usage remains very low. Is there anything else I can do?
Please send the complete TiFlash monitoring screenshots as described above, and let's analyze them.
Here is a capture of one of the hours:
tiflash-learner.toml:

```toml
[server]
## The address for external access to the TiFlash coprocessor service
engine-addr = ...

[raftstore]
## Number of threads in the thread pool that persists Raft data
apply-pool-size = 4
## Number of threads that process Raft, i.e., the size of the Raftstore thread pool
store-pool-size = 4
## Number of threads handling snapshots; the default is 2. Setting it to 0 disables the multi-thread optimization
snap-handle-pool-size = 2
## Minimum interval at which the Raft store persists WAL. Increase it appropriately to reduce IOPS usage; the default is "4ms", and "0ms" disables this optimization
store-batch-retry-recv-timeout = "4ms"

[security]
## Introduced in v5.0, controls whether log redaction is enabled
## If enabled, user data in the logs is shown as ?
## The default value is false
redact-info-log = false
```
apply-pool-size and store-pool-size can be adjusted according to the cluster's configuration.
snap-handle-pool-size is 0 by default; you can adjust it to enable the optimization.
Sorry, I didn't understand what to do, as I only started recently.
Should I use tiup cluster edit-config xxx to change the configuration, or should I find this toml file and edit it manually?
Where is this file, and what needs to be done after changing it?
I just looked at your monitoring, and snapshot processing is relatively slow, so adjust the parameters a bit; you can verify this in a test environment first.
You can use tiup cluster edit-config xx (see the sketch below).
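A minimal sketch of what such an edit-config change could look like, assuming the tiflash-learner section of server_configs; the values are placeholders, not a recommendation:

```yaml
# tiup cluster edit-config <cluster-name>
server_configs:
  tiflash-learner:
    raftstore.snap-handle-pool-size: 4   # placeholder value
    raftstore.apply-pool-size: 4         # placeholder value
    raftstore.store-pool-size: 4         # placeholder value
```

After saving, the change is applied with tiup cluster reload <cluster-name> -R tiflash.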
Additionally, adjust replica-schedule-limit appropriately, but do not exceed region-schedule-limit.
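For reference, these PD scheduling limits can be changed online with pd-ctl; a sketch with a placeholder PD address and an illustrative value:

```sh
# Inspect the current scheduling limits
tiup ctl:v6.1.0 pd -u http://<pd-host>:2379 config show | grep schedule-limit
# Raise replica scheduling, keeping it at or below region-schedule-limit
tiup ctl:v6.1.0 pd -u http://<pd-host>:2379 config set replica-schedule-limit 64
```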
I have adjusted region-schedule-limit and replica-schedule-limit. Is this size sufficient:
As for the tiflash-learner configuration you mentioned, I'll try increasing it.
Hello, I would like to confirm: which backend mode is Lightning using for the import?
After changing the configuration, the resource usage is still very low, with no difference from before.
It hasn’t finished running yet, but it looks like it will still take three to four hours.
tiup cluster edit-config xx:
Resource usage:
Hello, using the local mode, importing CSV, the configuration is as follows:
If it's local mode, I guess that might have some impact. In local mode, Lightning writes the key-value pairs to each TiKV node as SST files, and TiKV then ingests those SST files into the cluster; TiFlash synchronizes TiKV's data as a Region learner.
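For readers unfamiliar with the local backend, a minimal sketch of a tidb-lightning.toml that uses it; all paths and addresses are placeholders, not the poster's actual configuration:

```toml
[tikv-importer]
# "local" backend: Lightning sorts KV pairs locally and ingests SST files into TiKV
backend = "local"
# Scratch directory for locally sorted key-value data (placeholder path)
sorted-kv-dir = "/data/sorted-kv"

[mydumper]
# Directory containing the CSV source files (placeholder path)
data-source-dir = "/data/csv"

[tidb]
# Placeholder addresses for the target cluster
host = "tidb-host"
port = 4000
status-port = 10080
pd-addr = "pd-host:2379"
```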
Well, my concern now is that the synchronization speed is too slow.
It takes an hour and a half to import, but nearly 4 hours to synchronize.
I just want to know whether there is any way to speed it up.