Evolution and Planning of Apache SeaTunnel’s Data Processing Engine Adaptations

Apache SeaTunnel
Aug 20, 2024

Apache SeaTunnel, a high-performance data synchronization tool, brings innovation to the data integration field with efficient data processing capabilities. In addition to its own Zeta engine, Apache SeaTunnel also runs on Spark and Flink. At CommunityOverCode Asia 2024, Apache SeaTunnel PMC Member Chao Tian presented the evolution, architecture, core features, current progress, and plans of Apache SeaTunnel on Flink. Here is a summary of the key points from the presentation:

Evolution of Apache SeaTunnel Based on Flink

The evolution of Apache SeaTunnel is reflected in two API versions:

  • Flink API V1: The initial API version of SeaTunnel, closely coupled with Flink’s computation engine, with connectors tightly dependent on Flink’s interfaces.
  • Flink API V2: The new-generation SeaTunnel API. Plugins keep the same plug-in form, but the API is decoupled from the computation engine and does not rely on Flink’s native connectors. Sink adds Writer, Committer, and Aggregated Committer, while Source adds Reader, Split, and Split Enumerator. As a result, V2 supports more Flink versions, reduces the cost of upgrading Flink, and provides finer-grained interfaces that improve extensibility and meet diverse data source synchronization needs.
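
To make the Source-side split into Reader, Split, and Split Enumerator more concrete, here is a minimal sketch in plain Java. All interface and class names (`SketchSplit`, `SketchSplitEnumerator`, `SketchReader`, `RangeSplit`, and so on) are hypothetical and simplified for illustration; they are not the actual SeaTunnel API signatures. The idea shown is only that an enumerator discovers units of work, and parallel readers consume whichever splits they are assigned.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical, simplified interfaces; NOT the real SeaTunnel API.
interface SketchSplit {
    String splitId();
}

interface SketchSplitEnumerator<S extends SketchSplit> {
    // Discovers the units of work (for example, partitions or file ranges).
    List<S> discoverSplits();
}

interface SketchReader<T, S extends SketchSplit> {
    void addSplit(S split);

    // Returns the records of the next assigned split, or an empty list when idle.
    List<T> pollNext();
}

// A split covering the half-open integer range [start, end).
final class RangeSplit implements SketchSplit {
    final int start;
    final int end;

    RangeSplit(int start, int end) {
        this.start = start;
        this.end = end;
    }

    @Override
    public String splitId() {
        return start + "-" + end;
    }
}

// Cuts the range [0, total) into splits of at most splitSize elements.
final class RangeEnumerator implements SketchSplitEnumerator<RangeSplit> {
    private final int total;
    private final int splitSize;

    RangeEnumerator(int total, int splitSize) {
        this.total = total;
        this.splitSize = splitSize;
    }

    @Override
    public List<RangeSplit> discoverSplits() {
        List<RangeSplit> splits = new ArrayList<>();
        for (int s = 0; s < total; s += splitSize) {
            splits.add(new RangeSplit(s, Math.min(s + splitSize, total)));
        }
        return splits;
    }
}

// Emits every integer of each split it has been assigned.
final class RangeReader implements SketchReader<Integer, RangeSplit> {
    private final Queue<RangeSplit> pending = new ArrayDeque<>();

    @Override
    public void addSplit(RangeSplit split) {
        pending.add(split);
    }

    @Override
    public List<Integer> pollNext() {
        List<Integer> out = new ArrayList<>();
        RangeSplit split = pending.poll();
        if (split != null) {
            for (int i = split.start; i < split.end; i++) {
                out.add(i);
            }
        }
        return out;
    }
}

final class SourceSketchDemo {
    // Assigns splits to readers round-robin, then drains each reader in turn.
    static List<Integer> run(int total, int splitSize, int parallelism) {
        List<RangeReader> readers = new ArrayList<>();
        for (int i = 0; i < parallelism; i++) {
            readers.add(new RangeReader());
        }
        List<RangeSplit> splits = new RangeEnumerator(total, splitSize).discoverSplits();
        for (int i = 0; i < splits.size(); i++) {
            readers.get(i % parallelism).addSplit(splits.get(i));
        }
        List<Integer> all = new ArrayList<>();
        for (RangeReader reader : readers) {
            List<Integer> batch;
            while (!(batch = reader.pollNext()).isEmpty()) {
                all.addAll(batch);
            }
        }
        return all;
    }

    public static void main(String[] args) {
        System.out.println(run(10, 4, 2));
    }
}
```

Because none of these types mention an engine class, the same enumerator and reader could in principle be driven by any runtime that knows how to assign splits, which is the kind of decoupling the V2 design aims for.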

Architecture Design Based on Flink

From the perspective of job execution, Apache SeaTunnel’s architecture relies closely on Flink’s data processing capabilities.

At the Common API layer, SeaTunnel abstracts plugins, allowing SeaTunnel to interface with different computation engines based on this abstraction.

The interfacing layer in SeaTunnel is called the Translation Layer. For Flink, SeaTunnel implements Flink-proxy Source, Sink, and Transform, generating Flink engine job graphs to achieve efficient data transformation and synchronization on Flink.
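
The adapter idea behind a translation layer can be sketched in a few lines of Java. This is an illustration under stated assumptions, not SeaTunnel's implementation: `NeutralSource` stands in for an engine-independent source, while `EngineSource` and `EngineRow` are stand-ins for an engine's native source contract and row type. None of them are real Flink or SeaTunnel classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

// Engine-neutral side (hypothetical): a source that yields rows as Object arrays.
interface NeutralSource {
    Iterator<Object[]> open();
}

// Stand-in for an engine's native source contract; NOT a real Flink interface.
interface EngineSource<R> {
    void run(Consumer<R> collector);
}

// Stand-in for the engine's native row type.
final class EngineRow {
    final List<Object> fields;

    EngineRow(List<Object> fields) {
        this.fields = fields;
    }
}

// The "translation" step: wraps a NeutralSource so the engine can run it,
// converting each neutral row into the engine's row type on the way out.
final class TranslatedSource implements EngineSource<EngineRow> {
    private final NeutralSource delegate;

    TranslatedSource(NeutralSource delegate) {
        this.delegate = delegate;
    }

    @Override
    public void run(Consumer<EngineRow> collector) {
        Iterator<Object[]> rows = delegate.open();
        while (rows.hasNext()) {
            Object[] neutralRow = rows.next();
            List<Object> fields = new ArrayList<>(Arrays.asList(neutralRow));
            collector.accept(new EngineRow(fields));
        }
    }
}

final class TranslationSketchDemo {
    // Plays the role of the engine: runs the translated source and collects its output.
    static List<EngineRow> collect(NeutralSource source) {
        List<EngineRow> out = new ArrayList<>();
        new TranslatedSource(source).run(out::add);
        return out;
    }

    public static void main(String[] args) {
        NeutralSource source = () ->
                Arrays.<Object[]>asList(new Object[] {1, "a"}, new Object[] {2, "b"}).iterator();
        for (EngineRow row : collect(source)) {
            System.out.println(row.fields);
        }
    }
}
```

In this sketch the connector author only ever implements `NeutralSource`; supporting another engine would mean writing another adapter like `TranslatedSource`, not rewriting the connector.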

Core Features Based on Flink

Many data synchronization tools are available, such as Apache Flink CDC and Chunjun.

Compared to these, Apache SeaTunnel exhibits the following features:

  • Supported Flink Versions: SeaTunnel supports versions 1.13 and above, providing broader compatibility.
  • Flink Connectors: SeaTunnel does not rely on Flink’s native connectors, offering higher flexibility.
  • User-Defined Metrics: SeaTunnel allows users to define their own metrics, enhancing monitoring and analysis capabilities.
  • Data Transformation Support: SeaTunnel supports data transformation operations, including but not limited to mapping and filtering.
  • Flink-SQL: Although SeaTunnel does not currently support Flink-SQL, it is one of the community’s future focuses.

To summarize SeaTunnel’s features and usability based on Flink:

  1. Supports Flink’s native poll-push architecture, enabling real-time partition data retrieval, effectively solving parallelism issues, and maximizing resource utilization.
  2. Supports Flink’s native two-phase commit feature.
  3. Supports Flink’s native user-defined metrics capability.
  4. Supports using Flink’s native global-accumulator to record data synchronization job details.
  5. Supports all Flink job submission modes (application mode/session mode).
  6. Supports user-defined event communication between enumerators and readers.
  7. Supports all Flink versions from 1.13 through 1.18.
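
The two-phase commit in item 2 pairs naturally with the Writer and Committer roles introduced by API V2. The sketch below shows the general pattern only, using hypothetical classes (`BufferingWriter`, `ListCommitter`) and an in-memory list as the "external system". It assumes a single coordinator decides whether the checkpoint succeeded; it does not reproduce Flink's or SeaTunnel's actual interfaces.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical writer: buffers rows and seals them into a pending transaction.
final class BufferingWriter {
    private List<String> buffer = new ArrayList<>();

    void write(String row) {
        buffer.add(row);
    }

    // Phase 1 (pre-commit): hand over the buffered rows; nothing is visible yet.
    List<String> prepareCommit() {
        List<String> pending = buffer;
        buffer = new ArrayList<>();
        return pending;
    }
}

// Hypothetical committer: publishes or discards pending transactions.
final class ListCommitter {
    final List<String> visible = new ArrayList<>();

    // Phase 2 (commit): make every pending transaction visible.
    void commit(List<List<String>> pendingTransactions) {
        for (List<String> transaction : pendingTransactions) {
            visible.addAll(transaction);
        }
    }

    // Abort: drop the pending transactions so nothing becomes visible.
    void abort(List<List<String>> pendingTransactions) {
        pendingTransactions.clear();
    }
}

final class TwoPhaseCommitDemo {
    // Two parallel writers pre-commit; the outcome depends on the checkpoint result.
    static List<String> run(boolean checkpointSucceeds) {
        BufferingWriter first = new BufferingWriter();
        BufferingWriter second = new BufferingWriter();
        first.write("a");
        first.write("b");
        second.write("c");

        List<List<String>> pending = new ArrayList<>();
        pending.add(first.prepareCommit());
        pending.add(second.prepareCommit());

        ListCommitter committer = new ListCommitter();
        if (checkpointSucceeds) {
            committer.commit(pending);
        } else {
            committer.abort(pending);
        }
        return committer.visible;
    }

    public static void main(String[] args) {
        System.out.println(run(true));
        System.out.println(run(false));
    }
}
```

The point of the pattern is that rows written before a failed checkpoint never become visible, so a restart from the previous checkpoint does not produce duplicates in the target.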

Community Progress and Future Planning

Currently, the Apache SeaTunnel community is actively advancing the following work:

  • Multi-Table Read and Write Support: Developing functionality to support simultaneous read and write operations on multiple tables in Flink engines, accommodating scenarios such as multi-table routing and enhancing data processing efficiency and flexibility. This feature has already been implemented on the SeaTunnel Zeta engine.
  • Flink Proxy Source & Sink Refactoring: Currently, Flink Proxy data synchronization requires multiple conversions between Flink proxy Row and SeaTunnel Row data formats, which risks data precision loss and significantly reduces data transformation performance. Therefore, the community is working on refactoring the sources and sinks to optimize performance and stability.
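
As an illustration of how a round trip between two row formats can lose precision, consider a timestamp that passes through a coarser intermediate representation. The example below is a generic sketch, not the actual conversion code in SeaTunnel: the class `RowConversionSketch` and the choice of epoch milliseconds as the intermediate value are assumptions made purely for demonstration.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneOffset;

final class RowConversionSketch {
    // Hypothetical conversion into an intermediate format that keeps only milliseconds.
    static long toIntermediate(LocalDateTime timestamp) {
        return timestamp.toInstant(ZoneOffset.UTC).toEpochMilli();
    }

    // Hypothetical conversion back into the original field type.
    static LocalDateTime fromIntermediate(long epochMillis) {
        return LocalDateTime.ofInstant(Instant.ofEpochMilli(epochMillis), ZoneOffset.UTC);
    }

    // One full round trip; every extra hop costs CPU and may drop sub-millisecond digits.
    static LocalDateTime roundTrip(LocalDateTime timestamp) {
        return fromIntermediate(toIntermediate(timestamp));
    }

    public static void main(String[] args) {
        LocalDateTime original = LocalDateTime.of(2024, 8, 20, 12, 0, 0, 123_456_789);
        LocalDateTime converted = roundTrip(original);
        System.out.println("original nanos:  " + original.getNano());
        System.out.println("converted nanos: " + converted.getNano());
    }
}
```

Here the nanosecond field 123456789 comes back as 123000000. Removing intermediate conversions avoids both this kind of truncation and the per-row cost of performing them.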

The community is also planning:

  • Schema Evolution: Currently, SeaTunnel supports schema evolution only on Spark and Zeta engines. The community plans to support dynamic schema changes on Flink to adapt to evolving data needs.
  • SQL Transformation Support: Plans to support SQL transformations on Flink, including select projections, user-defined functions (UDFs), user-defined table functions (UDTFs), and filtering conditions, to provide richer data processing capabilities.

Conclusion

As an innovative tool in the data synchronization field, Apache SeaTunnel offers new solutions for data integration through its efficient data processing capabilities on Flink. The community’s continuous efforts and innovations will strengthen Apache SeaTunnel’s role in future data synchronization tasks. To learn more or to get involved in the Apache SeaTunnel project, join the community and take part in the discussions.

About Apache SeaTunnel

Apache SeaTunnel is an easy-to-use, ultra-high-performance distributed data integration platform that supports real-time synchronization of massive amounts of data and can stably and efficiently synchronize hundreds of billions of records per day.

To become a speaker for Apache SeaTunnel, please fill out this form: https://forms.gle/vtpQS6ZuxqXMt6DT6 :)

Why do we need Apache SeaTunnel?

Apache SeaTunnel does everything it can to solve the problems you may encounter in synchronizing massive amounts of data.

  • Data loss and duplication
  • Task buildup and latency
  • Low throughput
  • Long application-to-production cycle time
  • Lack of application status monitoring

Apache SeaTunnel Usage Scenarios

  • Massive data synchronization
  • Massive data integration
  • ETL of large volumes of data
  • Massive data aggregation
  • Multi-source data processing

Features of Apache SeaTunnel

  • Rich components
  • High scalability
  • Easy to use
  • Mature and stable

How to get started with Apache SeaTunnel quickly?

Want to experience Apache SeaTunnel quickly? SeaTunnel 2.1.0 takes 10 seconds to get you up and running.

https://seatunnel.apache.org/docs/2.1.0/developement/setup

How can I contribute?

We invite all partners who are interested in making local open-source global to join the Apache SeaTunnel contributors family and foster open-source together!

Submit an issue:

https://github.com/apache/seatunnel/issues

Contribute code to:

https://github.com/apache/seatunnel/pulls

Subscribe to the community development mailing list:

dev-subscribe@seatunnel.apache.org

Development mailing list:

dev@seatunnel.apache.org

Join Slack:

https://join.slack.com/t/apacheseatunnel/shared_invite/zt-1kcxzyrxz-lKcF3BAyzHEmpcc4OSaCjQ

Follow Twitter:

https://twitter.com/ASFSeaTunnel

Join us now!❤️❤️
