Apache SeaTunnel (Incubating) makes data integration easy
At the global webinar held by the Apache SeaTunnel community and Shopee on September 24th, Apache SeaTunnel PPMC member Fan Jia shared his views on how the right data integration tool can help you get more done with less effort. The following is a summary of the speech for your reference.
Speaker Introduction
As you can see, the title of my talk is Apache SeaTunnel: make data integration easy, so let’s take a look at how Apache SeaTunnel does it.
First, I'll explain why we need data integration and what a data integration tool needs to provide. Then I'll introduce Apache SeaTunnel and our community, and end with some use cases.
Why We Need Data Integration
As an enterprise grows, it generates many data sources and data assets scattered across different locations and storage systems. These business systems are disconnected from each other, creating so-called data silos. If we want to process and analyze this data in a unified way, what should we do?
We need a tool to synchronize this data into a unified data store, so we can process and analyze it further. Next, let's look at another situation.
Suppose two databases store customer information, but their data differs, so executing the same query returns different results. This is certainly something we don't want to happen: we need to ensure that the information stored in the two databases is consistent.
The third situation is that, as cloud storage continues to develop, we need to migrate our data to the cloud, and we need a tool to do that as well.
What We Need
Given the scenarios above, you can already see why we need a data integration tool and what characteristics such a tool should have.
What capabilities does the solution need to have?
The tools should at least have three capabilities:
- Support for many data sources, so that data from multiple sources can be unified into the same data store.
- A data consistency guarantee, ensuring that the data in the source and target databases stays consistent.
- Efficient data synchronization: massive data volumes must be synchronized quickly to keep the data timely.
The three points above are only some of Apache SeaTunnel's characteristics as a synchronization tool; it offers many more data integration capabilities beyond these.
Apache SeaTunnel
Next, let’s talk about Apache SeaTunnel, our new partner in the field of data integration.
The official website defines it as a next-generation high-performance, distributed, massive data integration framework. This is also our goal: to solve the problems just mentioned and provide a simple, usable, high-performance data integration framework. Apache SeaTunnel has many interesting features, and I will introduce some of them.
First, as a data integration framework, it is important to be easy and quick to use. To that end, Apache SeaTunnel users do not need to write any code; in other words, it is no-code.
So how do users use Apache SeaTunnel? I will talk about it later.
To improve data throughput, a distributed design is essential: it greatly improves transmission efficiency and data processing capacity, and also enables good fault tolerance.
At the same time, Apache SeaTunnel supports running on multiple frameworks: it currently supports Spark and Flink, as well as Apache SeaTunnel's self-developed data integration engine. Users can choose the engine that best suits their requirements. Of course, we recommend our self-developed engine, which is purpose-built for data integration.
How To Use Apache SeaTunnel
Next, I will introduce how to use Apache SeaTunnel for data processing. Simple and efficient use has always been our goal in designing.
First, download the Apache SeaTunnel binary package from the official website; you can also run it via Docker. Here we take the most common method, the binary package, as an example.
Download: https://seatunnel.apache.org/download
Unzip the package, then modify an existing config file or create a new one to describe the job you want to run. Then, using the run command we provide, you can start the Apache SeaTunnel program. Running the bundled example prints the resulting data to the console.
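As a sketch, the steps above look roughly like the following for a 2.x release running on the Spark engine locally (the version number, script name, and config path are illustrative and may differ for your release):

```shell
# Download and unpack the binary release (version shown is illustrative)
wget https://archive.apache.org/dist/incubator/seatunnel/2.1.3/apache-seatunnel-incubating-2.1.3-bin.tar.gz
tar -xzf apache-seatunnel-incubating-2.1.3-bin.tar.gz
cd apache-seatunnel-incubating-2.1.3

# Run a bundled example config on a local Spark engine;
# the results of the example job are printed to the console
./bin/start-seatunnel-spark.sh \
  --master local[4] \
  --deploy-mode client \
  --config ./config/spark.batch.conf.template
```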
From the process just shown, you can see that the config file is very important: it defines the execution mode and logic of our tasks. Let's take a look at this config, which consists of four parts: env, source, transform, and sink. As their names indicate, these four components define a complete data processing pipeline: reading data from a source, transforming it, and finally writing it out through a sink. The whole process is shown in the picture on the right, with data flowing through the Apache SeaTunnel engine.
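To make the four-part structure concrete, here is a minimal sketch of a batch config (the connector names and options are illustrative; check the documentation for the exact options supported by your version):

```hocon
env {
  # Execution settings for the whole job
  execution.parallelism = 1
}

source {
  # FakeSource generates sample rows, handy for a first run
  FakeSource {
    result_table_name = "fake"
  }
}

transform {
  # Transforms are optional; an empty block passes data through unchanged
}

sink {
  # Print every row to the console
  Console {}
}
```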
Apache SeaTunnel Community
Such a powerful product can’t do without the community standing firm behind it. Next, I will tell you about community development.
Thanks to the power of the community, we support hundreds of connectors. If the data source you want is not yet on the support list, feel free to create an issue in the community.
Behind the numerous connectors and easy-to-use features is a strong team of contributors. If you are interested in the project, you are welcome to join us.
Although the project has been around for a long time, Apache SeaTunnel got a new start in 2022 after entering the Apache Incubator. The project has changed tremendously since its early days, and every change was brought about by our contributors.
Use Case
Finally, let's look at some use cases.
Currently, OPPO uses Apache SeaTunnel in its sample center and feature platform to support machine learning, and Bilibili uses Apache SeaTunnel for data warehouse ingestion and extraction. These are just two of many successful Apache SeaTunnel deployments; more cases can be found on our official website and Twitter.
Official website: https://seatunnel.apache.org/
Twitter: https://twitter.com/ASFSeaTunnel
If you have any questions about usage or development, you can join our Slack channel. We will be happy to answer your questions.
https://join.slack.com/t/apacheseatunnel/shared_invite/zt-1hso5n2tv-mkFKWxonc70HeqGxTVi34w
Thank you all, hope to see you in the Apache SeaTunnel community.
About Apache SeaTunnel
Apache SeaTunnel (formerly Waterdrop) is an easy-to-use, ultra-high-performance distributed data integration platform that supports real-time synchronization of massive amounts of data and can synchronize hundreds of billions of records per day in a stable and efficient manner.
Why do we need Apache SeaTunnel?
Apache SeaTunnel does everything it can to solve the problems you may encounter in synchronizing massive amounts of data.
- Data loss and duplication
- Task buildup and latency
- Low throughput
- Long application-to-production cycle time
- Lack of application status monitoring
Apache SeaTunnel Usage Scenarios
- Massive data synchronization
- Massive data integration
- ETL of large volumes of data
- Massive data aggregation
- Multi-source data processing
Features of Apache SeaTunnel
- Rich components
- High scalability
- Easy to use
- Mature and stable
How to get started with Apache SeaTunnel quickly?
Want to experience Apache SeaTunnel quickly? SeaTunnel 2.1.0 takes 10 seconds to get you up and running.
https://seatunnel.apache.org/docs/2.1.0/developement/setup
How can I contribute?
We invite all partners who are interested in making local open source global to join the Apache SeaTunnel contributor family and foster open source together!
Submit an issue:
https://github.com/apache/incubator-seatunnel/issues
Contribute code to:
https://github.com/apache/incubator-seatunnel/pulls
Subscribe to the community development mailing list:
dev-subscribe@seatunnel.apache.org
Development mailing list:
dev@seatunnel.apache.org
Join Slack:
https://join.slack.com/t/apacheseatunnel/shared_invite/zt-1kcxzyrxz-lKcF3BAyzHEmpcc4OSaCjQ
Follow Twitter:
https://twitter.com/ASFSeaTunnel
Come and join us!