How to simplify data synchronization using Flink SQL in Apache SeaTunnel (Incubating)?

Apache SeaTunnel
Jul 4, 2022

As an ultra-high-performance distributed data integration platform, SeaTunnel is known for stably and efficiently synchronizing tens of billions of records in real time every day.

Flink SQL lets you write real-time tasks in SQL, which greatly improves real-time performance and ease of use compared with traditional data synchronization solutions.

Using Flink SQL in SeaTunnel makes data synchronization simpler and faster.

At the SeaTunnel online Meetup on June 25, Tao Kelu, a big data engineer from ByteDance, introduced how to use Flink SQL to simplify data synchronization in Apache SeaTunnel (Incubating).

About the Author

Tao Kelu is a ByteDance engineer currently working on cloud-native big data computing. He was previously a big data technical expert at Alibaba Cloud and has many years of big data experience.

Summary of the speech:

  1. Overview of the three ways to use SeaTunnel
  2. Comparison of two ways to use the Flink engine
  3. List of supported functions of Flink SQL module
  4. Implementation Analysis of Flink SQL Module
  5. Plans for the follow-up development of the Flink SQL module
  6. How to get involved in community building

01 Overview of the three ways to use SeaTunnel

There are three main ways to use SeaTunnel. The first is SeaTunnel + Spark: SeaTunnel generates a Spark main class from the configuration, runs it in the Spark cluster, and wires up the Source and Sink, so that the data synchronization flow is executed as a Spark job.

The second is SeaTunnel + Flink for data synchronization. The principle is similar to Spark: the corresponding Flink DataSet/DataStream API is used to generate Flink's main class through SeaTunnel and the configuration files, and the job finally runs in the Flink cluster.

The last one is SeaTunnel + Flink SQL, which will be introduced later.

02 Comparison of two ways to use the Flink engine

Let me demonstrate this part (as shown in the video demo). All three ways of using SeaTunnel are fairly simple. For example, to run a Spark job I just enter a command: SeaTunnel starts a Spark client locally and configures the corresponding parameters from the config file. When the job is submitted, a spark-submit command is generated. We can simply understand it this way: SeaTunnel generates a Spark job according to the config and submits it as an application, and the core logic goes through the SeaTunnel Spark main class, which fills in the details according to the template file.

This is the configuration file, which contains four parts: the Spark configuration, the Source (defining the data source), the Sink (defining the data sink), and the data transformation (transform). Spark runs a job based on this config and produces the corresponding result.

A Flink job likewise needs a Flink cluster to be started first; SeaTunnel builds the job through the DataSet/DataStream API, generates the submission command, and submits it to the Flink cluster we just started.

The use of Flink SQL is also very simple. The core is to generate a Flink SQL job and submit it to the cluster.

Comparing the two ways of using the Flink engine: the first builds a Flink job with Flink's DataSet/DataStream API. This method requires a large amount of coding, mainly to extend the connector and wire it into the main class. The SQL method, by contrast, is declarative: Flink discovers connectors through the SPI mechanism, so none of that connector-related development work is needed.

Regarding writing style, the API is relatively imperative while SQL is declarative, and the latter has a lower barrier to entry for developers.

In terms of flexibility, the API is more flexible than SQL.

In terms of scalability, the API is worse, because every new connector requires a lot of code to be written; with SQL, connectors are created through the SPI mechanism, so you only need to drop the connector's Jar package into the cluster.

In terms of maintainability, when we upgrade the Flink version, code written against the API has to be updated accordingly, whereas with SQL there is no need to worry about engine upgrades. So I think SQL is a relatively simpler and more scalable way.
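
For example, a minimal declarative synchronization job in Flink SQL might look like the following sketch; the table names, fields, and connection options are illustrative rather than taken from the talk:

    -- Hypothetical example: sync orders from Kafka to MySQL with three statements.
    CREATE TABLE orders_source (
      order_id   BIGINT,
      user_id    BIGINT,
      amount     DECIMAL(10, 2),
      order_time TIMESTAMP(3)
    ) WITH (
      'connector' = 'kafka',
      'topic' = 'orders',
      'properties.bootstrap.servers' = 'localhost:9092',
      'scan.startup.mode' = 'latest-offset',
      'format' = 'json'
    );

    CREATE TABLE orders_sink (
      order_id   BIGINT,
      user_id    BIGINT,
      amount     DECIMAL(10, 2),
      order_time TIMESTAMP(3)
    ) WITH (
      'connector' = 'jdbc',
      'url' = 'jdbc:mysql://localhost:3306/demo',
      'table-name' = 'orders',
      'username' = 'root',
      'password' = 'root'
    );

    -- The whole pipeline is declared as a single INSERT; no connector code is written.
    INSERT INTO orders_sink
    SELECT order_id, user_id, amount, order_time
    FROM orders_source;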

03 List of supported functions of Flink SQL module

The following describes how to use Flink SQL in SeaTunnel.

Connector

  • Dynamic loading
  • Supported types: jdbc, kafka, elasticsearch-6/7, …

For Flink SQL to run, connector support is the first requirement. Currently, the connector types we support include JDBC, Kafka, Elasticsearch 6/7, etc. This approach is simpler than loading connectors through the API. In addition, we have implemented dynamic loading: when the SQL is detected to contain Kafka or other connector types, the corresponding connector is automatically loaded into the cluster instead of having to be placed there in advance.
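
For example (a hypothetical table, with illustrative option values), the connector type is declared in the WITH clause of the DDL, which is exactly what the dynamic-loading logic can key on:

    -- The 'connector' option identifies the connector type, so the matching
    -- Jar can be loaded on demand when this statement is parsed.
    CREATE TABLE user_profile_sink (
      user_id BIGINT,
      profile STRING
    ) WITH (
      'connector' = 'elasticsearch-7',
      'hosts' = 'http://localhost:9200',
      'index' = 'user_profile'
    );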

UDF: none

At present, we have not done much extension work for UDFs, because the built-in functions provided by Flink itself can already solve most problems.
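
For instance, typical cleanup during synchronization can be expressed with built-in functions alone; the tables and columns below are hypothetical and assumed to be defined elsewhere:

    -- Clean and convert fields using only Flink built-in functions, no custom UDFs.
    INSERT INTO cleaned_events
    SELECT
      event_id,
      UPPER(TRIM(event_type))               AS event_type,
      TO_TIMESTAMP(FROM_UNIXTIME(event_ts)) AS event_time
    FROM raw_events;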

Catalog

  • InMemoryCatalog (supported)
  • In progress: HiveCatalog, JdbcCatalog

The catalog is indispensable in production: it stores metadata, so a table defined earlier in the catalog can be used directly. Currently, we support InMemoryCatalog, and the catalogs under development include HiveCatalog, which is widely used in the big data field, and JdbcCatalog.
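
For reference, registering such catalogs uses standard Flink DDL; the sketch below shows roughly what this would look like once HiveCatalog and JdbcCatalog support lands, with illustrative connection values:

    -- Hive catalog: table metadata comes from the Hive Metastore.
    CREATE CATALOG my_hive WITH (
      'type' = 'hive',
      'hive-conf-dir' = '/opt/hive/conf'
    );

    -- JDBC catalog: the tables of an existing PostgreSQL database become visible.
    CREATE CATALOG my_pg WITH (
      'type' = 'jdbc',
      'base-url' = 'jdbc:postgresql://localhost:5432',
      'default-database' = 'demo',
      'username' = 'postgres',
      'password' = 'postgres'
    );

    -- Tables already defined in a catalog can then be referenced directly.
    INSERT INTO my_pg.demo.orders SELECT * FROM my_hive.ods.orders;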

The current functions of the Flink SQL module are shown in the demo. Watch the video from 10:38–16:14.

04 Implementation Analysis of Flink SQL Module

Next, let’s analyze the implementation logic of the Flink SQL module. Watch the video from 16:24–21:27.

  • Command-line parsing
  • Configuration parsing and settings
  • SQL Parse
  • Connector dynamic loading
  • Advantages and disadvantages of the current implementation
  • Adding connectors is easy
  • Does not support application mode well

The logic of the Flink SQL module is very simple. First, the shell script reads the command and splices the parameters into a submission command for the Flink cluster. We then obtain the Flink environment configuration via reflection, determine the connector types by parsing the SQL, load the corresponding Jars onto the CLASSPATH, set them in the Flink parameters, continue parsing, and finally submit the job to the Flink cluster.
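
As a hedged illustration of these steps (whether the module accepts exactly this syntax is my assumption, and the tables reuse the earlier sketch), a submitted SQL job file could carry job-level settings alongside the statements, with the connector names found during parsing driving the dynamic Jar loading:

    -- Job-level settings, parsed and applied to the Flink environment before submission.
    SET 'execution.checkpointing.interval' = '30s';
    SET 'parallelism.default' = '2';

    -- Parsing the statements reveals the connector types ('kafka', 'jdbc', ...),
    -- so the corresponding connector Jars can be added to the CLASSPATH on the fly.
    INSERT INTO orders_sink
    SELECT order_id, user_id, amount, order_time
    FROM orders_source;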

This implementation has both pros and cons. The advantage is that adding connectors is very easy: you only need to add a sub-module for Flink SQL under the SeaTunnel connector, include the dependencies on Flink itself, and output it to the path shown in the figure below when packaging.

But the shortcoming of the current Flink SQL module implementation is that it does not support application mode well.

05 Plans for the follow-up development of the Flink SQL module

For the follow-up development of the Flink SQL module, I have some thoughts.

Function improvement

The improvement to the existing functions mainly covers:

  • Connector enrichment: support more connector types;
  • UDF enhancement, which is not urgent; if Flink's built-in functions can cover most user scenarios, this part of the work is not needed;
  • Catalog enhancement: as mentioned above, the catalog is indispensable in production environments, so we will prioritize support for the Hive and JDBC catalogs.

Ease of use enhancement

The main ease-of-use enhancement is Application Mode support. Application mode can only be deployed on YARN and K8s. YARN support is relatively simple and only requires specific parameters, but K8s involves the issue of dependent Jar packages, which is more complicated, so the approach for this enhancement needs further discussion.

06 How to get involved in community building

I also have some personal experience to share on how to participate in community contributions.

First, you need to be familiar with the basic principles of git, including:

  • git clone & pull & push: no need to go into details;
  • git rebase & git merge: rebase matters a great deal for developers; merge generates an extra merge commit, so most mature communities now use rebase instead of merge;
  • git cherry-pick: we often need to land the same commit on different branches of the open-source project, so this skill is worth mastering.

Second, be familiar with GitHub’s project collaboration mode. The collaboration methods on GitHub mainly include:

  • Fork
  • Pull request
  • Add multi-remote upstream
  • Fetch&rebase

Third, make a good first contribution. Once you know the functionality well, you need a starting point for contributing. I suggest starting with tasks tagged "good first issue", which are generally newcomer-friendly and easy to solve.

Fourth, join the mailing list discussions. When you are familiar with the open-source community, you can take part in the discussions on the community mailing list and invest effort in the direction you are interested in.

Finally, if you find problems during use, you can submit a proposal to fix them and gradually become familiar with the process.

That’s it for my sharing, thank you!

About SeaTunnel

SeaTunnel (formerly Waterdrop) is an easy-to-use, ultra-high-performance distributed data integration platform that supports real-time synchronization of massive amounts of data and can stably and efficiently synchronize hundreds of billions of records per day.

Why do we need SeaTunnel?

SeaTunnel does everything it can to solve the problems you may encounter in synchronizing massive amounts of data.

  • Data loss and duplication
  • Task buildup and latency
  • Low throughput
  • Long application-to-production cycle time
  • Lack of application status monitoring

SeaTunnel Usage Scenarios

  • Massive data synchronization
  • Massive data integration
  • ETL of large volumes of data
  • Massive data aggregation
  • Multi-source data processing

Features of SeaTunnel

  • Rich components
  • High scalability
  • Easy to use
  • Mature and stable

How to get started with SeaTunnel quickly?

Want to experience SeaTunnel quickly? SeaTunnel 2.1.0 takes 10 seconds to get you up and running.

https://seatunnel.apache.org/docs/2.1.0/developement/setup

How can I contribute?

We invite all partners who are interested in making local open-source global to join the SeaTunnel contributors family and foster open-source together!

Submit an issue:

https://github.com/apache/incubator-seatunnel/issues

Contribute code to:

https://github.com/apache/incubator-seatunnel/pulls

Subscribe to the community development mailing list:

dev-subscribe@seatunnel.apache.org

Development Mailing List:

dev@seatunnel.apache.org

Join Slack:

https://join.slack.com/t/apacheseatunnel/shared_invite/zt-10u1eujlc-g4E~ppbinD0oKpGeoo_dAw

Follow Twitter:

https://twitter.com/ASFSeaTunnel

Come and join us!
