Apache SeaTunnel (Incubating) makes data integration easy
At the global webinar held by the Apache SeaTunnel community and Shopee on September 24th, Apache SeaTunnel PPMC member Fan Jia shared his views on how the right data integration tool can help you get more done with less effort. The following is a summary of the speech for your reference.
Speaker Introduction
As you can see, the title of my talk is Apache SeaTunnel: make data integration easy, so let’s take a look at how Apache SeaTunnel does it.
First, I'll explain why we need data integration and what a data integration tool needs to provide. Then I'll introduce Apache SeaTunnel and our community, and end with some use cases.
Why We Need Data Integration
As an enterprise grows, it generates many data sources and data assets scattered across different locations and storage systems. These business systems are disconnected from each other, creating so-called data silos. If we want to process and analyze this data in a unified way, what should we do?
We need a tool to synchronize this data into a unified data store, so we can process and analyze it further. Next, let's look at another situation.
Suppose two databases store customer information, but their data differs, so executing the same query returns different results. This is certainly something we don't want to happen: we need to ensure that the information stored in the two databases is consistent.
The third situation is that, as cloud storage continues to develop, we need to migrate our data to the cloud, and we need a tool to do that as well.
What We Need
Given the scenarios above, you can already see why we need a data integration tool and what characteristics such a tool should have.
What capabilities does the solution need to have?
The tools should at least have three capabilities:
- Support for many data sources, so that data from multiple sources can be unified into the same data store.
- A data consistency guarantee, ensuring that the data in the source and target databases stays consistent.
- Efficient data synchronization: massive data volumes must be synchronized quickly to keep the data timely.
The three points above are only some of Apache SeaTunnel's characteristics as a synchronization tool; it offers many more data integration capabilities beyond these.
Apache SeaTunnel
Next, let’s talk about Apache SeaTunnel, our new partner in the field of data integration.
The official website defines it as a next-generation high-performance, distributed, massive data integration framework. This is also our goal: to solve the problems just mentioned and provide a simple, usable, high-performance data integration framework. Apache SeaTunnel has many interesting features, and I will introduce some of them.
First, as a data integration framework, it is important to be easy and quick to use. To that end, Apache SeaTunnel users do not need to write any code; in other words, it is no-code.
So how do users use Apache SeaTunnel? I will talk about it later.
To improve data throughput, a distributed design is essential: it greatly improves transmission efficiency and data processing capacity, and also enables good fault tolerance.
At the same time, Apache SeaTunnel supports running on multiple frameworks: it currently supports Spark and Flink, as well as Apache SeaTunnel's self-developed data integration engine. Users can choose the engine that best suits their requirements. Of course, we recommend our self-developed engine, which is purpose-built for data integration.
How To Use Apache SeaTunnel
Next, I will introduce how to use Apache SeaTunnel for data processing. Simple and efficient use has always been our goal in designing.
First, download the Apache SeaTunnel binary package from the official website; you can also run it via Docker. Here we take the most common method, the binary package, as an example.
Download: https://seatunnel.apache.org/download
Unzip the package, then modify an existing config file or create a new one to describe the job you want to run. Then, using the run command we provide, you can start the Apache SeaTunnel program. Running the bundled example prints the resulting data to the console.
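As a sketch, the steps above look roughly like the following for a 2.x release running on the Spark engine locally (the version number, script name, and config path are illustrative and may differ for your release):

```shell
# Download and unpack the binary release (version shown is illustrative)
wget https://archive.apache.org/dist/incubator/seatunnel/2.1.3/apache-seatunnel-incubating-2.1.3-bin.tar.gz
tar -xzf apache-seatunnel-incubating-2.1.3-bin.tar.gz
cd apache-seatunnel-incubating-2.1.3

# Run a bundled example config on a local Spark engine;
# the results of the example job are printed to the console
./bin/start-seatunnel-spark.sh \
  --master local[4] \
  --deploy-mode client \
  --config ./config/spark.batch.conf.template
```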
From the process just shown, you can see that the config file is very important: it defines the execution mode and logic of our tasks. Let's take a look at this config, which consists of four parts: env, source, transform, and sink. As their names indicate, these four components define a complete data processing pipeline: reading data from a source, transforming it, and finally writing it out through a sink. The whole process is shown in the picture on the right, with data flowing through the Apache SeaTunnel engine.
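To make the four-part structure concrete, here is a minimal sketch of a batch config (the connector names and options are illustrative; check the documentation for the exact options supported by your version):

```hocon
env {
  # Execution settings for the whole job
  execution.parallelism = 1
}

source {
  # FakeSource generates sample rows, handy for a first run
  FakeSource {
    result_table_name = "fake"
  }
}

transform {
  # Transforms are optional; an empty block passes data through unchanged
}

sink {
  # Print every row to the console
  Console {}
}
```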
Apache SeaTunnel Community
Such a powerful product can’t do without the community standing firm behind it. Next, I will tell you about community development.
Thanks to the power of the community, we support hundreds of connectors. If the data source you want is not yet on the support list, feel free to create an issue in the community.
Behind the numerous connectors and easy-to-use features is a strong team of contributors. If you are interested in the project, you are welcome to join us.
Although the project has been around for a long time, Apache SeaTunnel got a new start in 2022 after entering the Apache Incubator. The project has changed tremendously since its early days, and every change was brought about by our contributors.
Use Case
Finally, let's look at some use cases.
Currently, OPPO uses Apache SeaTunnel in its sample center and feature platform to support machine learning, and Bilibili uses Apache SeaTunnel for data warehouse ingestion and extraction. These are just two of many successful Apache SeaTunnel deployments; more cases can be found on our official website and Twitter.
Official website: https://seatunnel.apache.org/
Twitter: https://twitter.com/ASFSeaTunnel
If you have any questions about usage or development, you can join our Slack channel. We will be happy to answer your questions.
https://join.slack.com/t/apacheseatunnel/shared_invite/zt-1hso5n2tv-mkFKWxonc70HeqGxTVi34w
Thank you all, hope to see you in the Apache SeaTunnel community.
About Apache SeaTunnel
Apache SeaTunnel (formerly Waterdrop) is an easy-to-use, ultra-high-performance distributed data integration platform that supports real-time synchronization of massive amounts of data and can synchronize hundreds of billions of records per day in a stable and efficient manner.
Why do we need Apache SeaTunnel?
Apache SeaTunnel does everything it can to solve the problems you may encounter in synchronizing massive amounts of data.
- Data loss and duplication
- Task buildup and latency
- Low throughput
- Long application-to-production cycle time
- Lack of application status monitoring
Apache SeaTunnel Usage Scenarios
- Massive data synchronization
- Massive data integration
- ETL of large volumes of data
- Massive data aggregation
- Multi-source data processing
Features of Apache SeaTunnel
- Rich components
- High scalability
- Easy to use
- Mature and stable
How to get started with Apache SeaTunnel quickly?
Want to experience Apache SeaTunnel quickly? SeaTunnel 2.1.0 takes 10 seconds to get you up and running.
https://seatunnel.apache.org/docs/2.1.0/developement/setup
How can I contribute?
We invite all partners who are interested in making local open source global to join the Apache SeaTunnel contributor family and foster open source together!
Submit an issue:
https://github.com/apache/incubator-seatunnel/issues
Contribute code to:
https://github.com/apache/incubator-seatunnel/pulls
Subscribe to the community development mailing list:
dev-subscribe@seatunnel.apache.org
Development mailing list:
dev@seatunnel.apache.org
Join Slack:
https://join.slack.com/t/apacheseatunnel/shared_invite/zt-1kcxzyrxz-lKcF3BAyzHEmpcc4OSaCjQ
Follow Twitter:
https://twitter.com/ASFSeaTunnel
Come and join us!