Does Flink still run on HDFS when integrated with TiDB?

translator_bot · June 23, 2024, 3:11am

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tidb与flink结合时，flink还基于hdfs运行吗？

| username: TiDBer_8rWAgqMU

I use TiCDC to send data to Kafka, where Flink consumes it in real time. I previously stored data in HDFS+Hive+HBase but have now switched to TiDB directly, while keeping Flink for real-time processing. The problem is that my Flink jobs were configured to use HDFS, which TiDB has now replaced. How should I configure Flink in this case?

Here was my previous Flink configuration:
[Screenshot of the previous Flink configuration; image not preserved]

Now that TiDB has replaced HDFS+Hive+HBase, which state.backend should I use for Flink?

| username: xfworld

The S3 protocol or any other distributed file system protocol will work.

| username: 特雷西-迈克-格雷迪

Flink does not need to run on HDFS; it only requires some local disk storage.

| username: TiDBer_8rWAgqMU

Local disks are not distributed, so there will be data loss issues when a machine goes down.

| username: TiDBer_8rWAgqMU

Could you recommend a distributed file system? HDFS is quite heavy. Does TiDB include a distributed file system of its own? I see that many of your company’s customer solutions use Flink. May I ask what they use to store checkpoints?

| username: xfworld

As mentioned, any S3-compatible service can be used, such as Alibaba Cloud OSS, Amazon S3, or Tencent Cloud COS.
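
To make this concrete, here is a rough sketch of what the relevant flink-conf.yaml entries might look like. It assumes RocksDB as the state backend (working state stays on local disk) and an S3-compatible bucket for durable checkpoints. The Python helper `render_flink_conf` is only an illustration for generating the entries, and the bucket name, endpoint, and credentials are placeholders, not values from this thread. It also assumes one of Flink's S3 filesystem plugins is installed, and newer Flink releases may prefer `state.backend.type` over `state.backend`.

```python
def render_flink_conf(bucket, endpoint, access_key, secret_key,
                      prefix="flink", path_style=False):
    """Render flink-conf.yaml entries (hypothetical helper, not a Flink API).

    Working state lives in RocksDB on each TaskManager's local disk;
    checkpoints and savepoints go to the S3-compatible store so they
    survive the loss of a machine.
    """
    settings = {
        "state.backend": "rocksdb",
        "state.checkpoints.dir": f"s3://{bucket}/{prefix}/checkpoints",
        "state.savepoints.dir": f"s3://{bucket}/{prefix}/savepoints",
        "s3.endpoint": endpoint,
        "s3.access-key": access_key,
        "s3.secret-key": secret_key,
    }
    if path_style:
        # Some S3-compatible services need path-style addressing;
        # whether yours does is an assumption to verify.
        settings["s3.path.style.access"] = "true"
    return "\n".join(f"{key}: {value}" for key, value in settings.items())


if __name__ == "__main__":
    # Placeholder values for illustration only.
    print(render_flink_conf(
        bucket="my-flink-bucket",
        endpoint="https://s3.example.com",
        access_key="<ACCESS_KEY>",
        secret_key="<SECRET_KEY>",
    ))
```

The point of the split is that `state.backend` only decides where running state is held, while `state.checkpoints.dir` decides where snapshots are persisted. Local disk alone covers the first but not the second, which is why a distributed store is still needed after HDFS is removed.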

| username: TiDBer_8rWAgqMU

Sure, I’ll look it up online.
Thank you very much.