Experts, I have a question. If enough CPU is allocated for physical import, can the speed reach 5G or 10G?

translator_bot · June 20, 2024, 11:45pm

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 大佬我问一个问题如果物理导入 cpu给够了速度能到5g 10g吗

| username: tidb狂热爱好者

If physical import is given enough CPU, can the import speed reach 5 GB/s or 10 GB/s?

| username: 连连看db

I/O is also a limiting factor.

| username: 像风一样的男子

The bottleneck should be network and disk I/O. Additionally, Lightning can enable parallel import, allowing multiple tasks to be run in parallel on multiple machines.

When using TiDB Lightning in parallel import mode, the following restrictions are recommended to achieve optimal performance:

Each TiDB Lightning instance should be deployed on a separate machine. TiDB Lightning will consume all CPU resources by default, so deploying multiple instances on a single machine will not improve performance.
The total size of the source files imported by each TiDB Lightning instance should not exceed 5 TiB.
The total number of TiDB Lightning instances should not exceed 10.

| username: Demo二棉裤

The official documentation states 500 GB/h, but in my experience it’s around 300 GB/h here, which is probably related to the data characteristics. Still, that’s not slow, and it can be improved by running Lightning imports in parallel.

| username: 江湖故人

For a 10-gigabit network, the bottleneck is likely to be in disk I/O.
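
As a rough check of how a 10-gigabit link compares with the 5 GB/s and 10 GB/s targets in the question, the unit conversion can be sketched as below. This is only back-of-the-envelope arithmetic: it assumes decimal units (1 GB = 8 Gbit, 1 TB = 1000 GB) and ignores protocol overhead, and the function names are made up for illustration.

```python
# Best-case throughput a network link allows, ignoring protocol overhead.
# Assumption: decimal units, so 1 GB = 8 Gbit and 1 TB = 1000 GB.

def link_gbit_to_gb_per_s(gbit_per_s: float) -> float:
    """Convert a link speed in Gbit/s to GB/s."""
    return gbit_per_s / 8


def gb_per_s_to_tb_per_h(gb_per_s: float) -> float:
    """Convert GB/s to TB/h."""
    return gb_per_s * 3600 / 1000


def required_link_gbit(target_gb_per_s: float) -> float:
    """Link speed in Gbit/s needed to carry a target rate in GB/s."""
    return target_gb_per_s * 8


if __name__ == "__main__":
    ceiling = link_gbit_to_gb_per_s(10)
    print(f"10 Gbit/s link: {ceiling:.2f} GB/s, "
          f"{gb_per_s_to_tb_per_h(ceiling):.1f} TB/h")
    for target in (5, 10):
        print(f"{target} GB/s needs at least "
              f"{required_link_gbit(target):.0f} Gbit/s")
```

Under these assumptions a single 10 Gbit/s link tops out at 1.25 GB/s (4.5 TB/h), so 5 GB/s would need at least 40 Gbit/s of network bandwidth even before disk I/O is considered.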

| username: 江湖故人

300 GB/h ≈ 85 MB/s. Assuming 1 MB holds 10,000 rows, that is about 850,000 rows per second, which is quite fast.
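
The conversion above can be reproduced with a short sketch. The helper names are hypothetical, and whether 1 GB is taken as 1024 MB or 1000 MB is an assumption that changes the result slightly (about 85 MB/s versus about 83 MB/s).

```python
# Reproduce the GB/h -> MB/s -> rows/s estimate from the post above.
# Assumption: 1 GB = 1024 MB by default; pass mb_per_gb=1000 for decimal units.

def gb_per_hour_to_mb_per_s(gb_per_hour: float, mb_per_gb: int = 1024) -> float:
    """Convert an hourly import rate in GB/h to MB/s."""
    return gb_per_hour * mb_per_gb / 3600


def rows_per_second(mb_per_s: float, rows_per_mb: float) -> float:
    """Estimate rows/s from a throughput and an assumed row density."""
    return mb_per_s * rows_per_mb


if __name__ == "__main__":
    rate = gb_per_hour_to_mb_per_s(300)
    print(f"300 GB/h is about {rate:.0f} MB/s")
    print(f"at 10,000 rows per MB: about {rows_per_second(rate, 10_000):,.0f} rows/s")
```

The 10,000 rows per MB figure is the poster's assumption, not a measured value, so the rows-per-second result is only as good as that guess.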

| username: zhanggame1

Physical import also depends on hard drive performance.

| username: 随缘天空

It’s still very difficult. Physical import can handle at most around 500GB per hour, and that’s likely with high-end resources. Normally, it would be around 200-300GB per hour.

| username: okenJiang

Bro, your requirements are a bit high. Try testing with parallel import. You can only scale horizontally, and the hardware needs to be maxed out. Also, be careful not to overwhelm the downstream TiKV.
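
To get a feel for how far horizontal scaling would have to go, here is a rough estimate of the number of parallel Lightning instances a target rate would need. It takes the 300 to 500 GB/h figures quoted earlier in the thread as per-instance rates and assumes throughput scales linearly with instance count. Both are assumptions, and real clusters will fall short of linear scaling once TiKV, network, or disk saturates.

```python
import math

# Assumptions (not measured): each instance sustains a fixed GB/h rate,
# and total throughput scales linearly with the number of instances.

def instances_needed(target_gb_per_s: float, per_instance_gb_per_h: float) -> int:
    """Instances required to reach a target rate under linear scaling."""
    target_gb_per_h = target_gb_per_s * 3600
    return math.ceil(target_gb_per_h / per_instance_gb_per_h)


def aggregate_gb_per_s(instances: int, per_instance_gb_per_h: float) -> float:
    """Combined rate in GB/s for a given number of instances."""
    return instances * per_instance_gb_per_h / 3600


if __name__ == "__main__":
    for per_instance in (300, 500):
        print(f"{per_instance} GB/h per instance: "
              f"{instances_needed(5, per_instance)} instances for 5 GB/s, "
              f"{aggregate_gb_per_s(10, per_instance):.2f} GB/s with 10 instances")
```

Under these assumptions, 5 GB/s would take 36 instances at 500 GB/h each or 60 at 300 GB/h each, well beyond the limit of 10 instances quoted above. Ten instances would give roughly 0.8 to 1.4 GB/s in the best case.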

| username: YuchongXU

The network card and hard drive can’t reach that speed either.

| username: forever

What kind of data is that, where 10,000 rows take up only 1 MB?

| username: 像风一样的男子

Increasing the number of clients for parallel import can theoretically improve resource utilization.

| username: 小龙虾爱大龙虾

The prerequisite for 10 GB/s is that your bandwidth must be sufficient, right?

| username: zhaokede

Bandwidth and SSD could both become bottlenecks.

| username: redgame

The bottleneck is IO.

| username: 江湖故人

If 10,000 rows take 1 MB, then each row is approximately 100 bytes. One character occupies 1 byte and an integer occupies 4 bytes, so 100 bytes can hold quite a lot.

| username: forever

It’s rare to have single rows that small. Our large tables have rows over 4 KB.
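
The effect of row size on the rows-per-second figure can be sketched as follows. It assumes the 85 MB/s rate estimated earlier in the thread, decimal megabytes, and reads "4 KB" as 4096 bytes. The function name is made up for illustration.

```python
# Rows per second at a fixed byte throughput for different row sizes.
# Assumptions: 1 MB = 1,000,000 bytes; "4 KB" rows are taken as 4096 bytes.

def rows_per_second(throughput_mb_per_s: float, row_bytes: int,
                    bytes_per_mb: int = 1_000_000) -> float:
    """Estimate rows/s from byte throughput and average row size."""
    return throughput_mb_per_s * bytes_per_mb / row_bytes


if __name__ == "__main__":
    for row_bytes in (100, 4096):
        print(f"{row_bytes}-byte rows at 85 MB/s: "
              f"about {rows_per_second(85, row_bytes):,.0f} rows/s")
```

At the same byte rate, 100-byte rows work out to about 850,000 rows per second while 4 KB rows work out to roughly 20,750, which is why a rows-per-second number says little without the row size.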

| username: DBAER

This mainly depends on disk I/O.

| username: 源de爸

I/O and network are both potential bottlenecks.

| username: zhang_2023

Consider the IO bottleneck.