Apache SeaTunnel (Incubating) Makes Managing Apache Flink and Spark SQL Jobs Easier!

Apache SeaTunnel
4 min read · Feb 23, 2023


We regularly introduce useful content and tools to our Apache Flink and Spark Data Engineer courses to help learners develop and operate big data analytics applications more efficiently. Read on to find out what SeaTunnel is and how this high-performance distributed data integration platform makes it easy to synchronize streaming data with Apache Flink and Spark SQL jobs.

Streaming Data Synchronization with SQL for Flink and Spark

Apache Flink, a popular distributed framework for building stateful streaming applications, lets you create data processing jobs not only in Scala/Java but also through SQL queries, which lowers the barrier to entry into the technology. Professional Scala/Java developers are more likely to use the DataSet or DataStream API to create Flink jobs. This method requires a lot of code when accessing a data source, mainly to extend the connector. The SQL approach, by contrast, is declarative: connectors are discovered through the Java Service Provider Interface (SPI).

You may recall that Flink supports CREATE TABLE SQL statements to register tables: you can specify the table name, its schema, and the parameters for connecting to an external system. The connection properties are converted to string-based key-value pairs. Based on factory identifiers, factories create configured table sources, table sinks, and the corresponding formats from these key-value pairs. All factories discoverable through the Java SPI are taken into account, and exactly one matching factory is expected per component. If no factory can be found, or if more than one factory matches the given properties, an exception is thrown with additional information about the factories in question and the supported properties.
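For illustration, here is a minimal sketch of such a table registration, following the option names documented for Flink's Kafka connector (the table name, topic, and broker address are made up):

```sql
-- Register a table backed by a Kafka topic. The 'connector' key selects
-- the factory that Flink looks up via the Java SPI; the remaining keys
-- are passed to that factory as string key-value pairs.
CREATE TABLE user_events (
    user_id BIGINT,
    event_type STRING,
    event_time TIMESTAMP(3)
) WITH (
    'connector' = 'kafka',
    'topic' = 'user_events',
    'properties.bootstrap.servers' = 'localhost:9092',
    'format' = 'json'
);
```

If no factory on the classpath matches 'connector' = 'kafka', this statement fails with the exception described above, listing the factories that were found.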

Apache Flink uses the Java SPI to load connector factories and table formats by their identifiers. Because the SPI resource file, named org.apache.flink.table.factories.Factory, resides in the same META-INF/services directory for every connector and table format, these resource files overwrite each other when you build an uber-jar that bundles more than one of them.
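A common workaround when building such an uber-jar with Maven is the shade plugin's ServicesResourceTransformer, which concatenates the META-INF/services files from all bundled dependencies instead of letting one overwrite another. A minimal sketch of the relevant plugin configuration:

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals><goal>shade</goal></goals>
      <configuration>
        <transformers>
          <!-- Merges META-INF/services files so that the SPI entries of
               every bundled connector and format survive in the uber-jar -->
          <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
        </transformers>
      </configuration>
    </execution>
  </executions>
</plugin>
```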

Thus, for a data engineer, the DataSet or DataStream API is more flexible than SQL but scales worse, because every new connector requires writing code. With SQL, the connector is created through the SPI mechanism, so you only need to add the connector JAR to the cluster. SQL therefore makes it easier to synchronize data from a source for real-time computation.

To increase the efficiency of streaming data synchronization, you can use a specialized solution such as Apache SeaTunnel, a high-performance distributed data integration platform that can reliably synchronize tens of billions of events daily in real time. It can be used not only with the Flink engine but also with Spark. For details, refer to this article: How to simplify data synchronization using Flink SQL in Apache SeaTunnel (Incubating).

How Apache SeaTunnel Works

The main use cases for Apache SeaTunnel are bulk synchronization, data aggregation and integration, running ETL processes over large volumes of data, and processing data from multiple sources. All of these scenarios can also involve Apache Flink and Spark. For example, SeaTunnel runs Spark locally, creates a client, and sets the appropriate options in the job configuration. After the job is submitted, a spark-submit command is generated, which starts the job on the cluster. With SeaTunnel, the job logic goes through the main SeaTunnel Spark class, which makes additions according to a template file: a configuration file with four parts, namely Spark configuration, data source definition, data sink definition, and data transformation. Spark runs the job based on this configuration and produces the corresponding result.
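The four-part configuration file described above might look like the following sketch, modeled on the SeaTunnel 2.x documentation (the Fake source, SQL transform, and Console sink are the built-in plugins commonly used in its examples; the app name and resource values are made up):

```hocon
env {
  # Spark configuration
  spark.app.name = "seatunnel-example"
  spark.executor.instances = 2
  spark.executor.cores = 1
  spark.executor.memory = "1g"
}

source {
  # Data source definition: a built-in generator of fake rows
  Fake {
    result_table_name = "fake"
  }
}

transform {
  # Data transformation, expressed directly in SQL
  sql {
    sql = "SELECT name, age FROM fake"
  }
}

sink {
  # Data sink definition: print the result to stdout
  Console {}
}
```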

Similarly, you can apply SeaTunnel to Flink SQL jobs: first, the command is read through the shell, and the parameters are concatenated and sent to the Flink cluster. You can then obtain the Flink environment configuration and the connector type via SQL parsing, load it onto the CLASSPATH, set the options, finish parsing, and submit the job to the cluster. Adding connectors is easy: you just add a sub-module under the SeaTunnel Flink SQL connector, include Flink's own dependencies, and output it to the desired location when packaging. The downside of the current implementation of the Flink SQL module is poor support for application mode; currently, it can only be deployed on YARN and Kubernetes.
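The shell step described above boils down to invoking a launcher script that assembles the parameters and submits the job. A sketch, using the script name from the SeaTunnel 2.x distribution (the config path is made up):

```shell
# Submit a SeaTunnel job to a Flink cluster; the script concatenates
# the parameters and hands the job to Flink as described above.
./bin/start-seatunnel-flink.sh \
  --config ./config/flink.streaming.conf
```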

Summarizing the benefits of using Apache SeaTunnel with the Flink and Spark engines: it is a way to dynamically configure big data processing jobs in real time. SeaTunnel helps solve the problems that arise when synchronizing large amounts of data: loss and duplication, task accumulation and delay, low throughput, long job cycles in production, and a lack of monitoring of application state. The platform lets a data engineer build a data processing pipeline directly in SQL, reducing the amount of complex Java/Scala code. The SeaTunnel project is still in the Apache Incubator but is growing rapidly.

Reference

1. https://seatunnel.medium.com/how-to-simplify-data-synchronization-using-flink-sql-in-apache-seatunnel-incubating-f972c1685fdf

2. https://seatunnel.apache.org/docs/2.1.3/intro/about/

3. https://github.com/apache/incubator-seatunnel

Reproduced and translated from: https://medium.com/@bigdataschool/применение-seatunnel-для-управления-sql-заданиями-apache-flink-и-spark-2ad72e7443ec

📌📌Welcome to fill out this survey to give your feedback on your user experience or just your ideas about Apache SeaTunnel:)
