Apache SeaTunnel 2.3.3 Released with CDC Support for Schema Evolution!

Apache SeaTunnel
Aug 23, 2023

Translated and edited by Debra Chen

After two months, Apache SeaTunnel is back with a major update! The 2.3.3 release improves both functionality and performance. Highlights include long-awaited features such as CDC schema evolution (DDL change synchronization), primary key split optimization, JDBC Sink automatic table creation, SeaTunnel Zeta Engine enhancements, and variable substitution in job configuration. Together, these upgrades strengthen Apache SeaTunnel's data synchronization capabilities. Let's delve into the details of this update.

CDC Updates

CDC Schema Evolution Support:

In a pivotal update, SeaTunnel now abstracts DDL change events at the architecture level and adds the corresponding interfaces to both Source and Sink. The Zeta engine also handles DDL change events together with checkpoints. This lays the foundation for DDL change synchronization; individual connectors can now implement these interfaces to support it.
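To make the idea of "DDL changes as events" more concrete, here is a minimal sketch in Python. The event classes and the `apply_schema_change` function are invented for illustration and are not SeaTunnel's actual Source/Sink interfaces; the sketch only shows how a sink could update its view of a table schema when such events arrive.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass(frozen=True)
class AddColumn:
    """Hypothetical event: a column was added to the upstream table."""
    table: str
    column: str
    data_type: str


@dataclass(frozen=True)
class DropColumn:
    """Hypothetical event: a column was dropped from the upstream table."""
    table: str
    column: str


SchemaChangeEvent = Union[AddColumn, DropColumn]


def apply_schema_change(schema: Dict[str, str], event: SchemaChangeEvent) -> Dict[str, str]:
    """Return a new column -> type mapping with the event applied."""
    updated = dict(schema)  # copy so the caller's schema is left untouched
    if isinstance(event, AddColumn):
        updated[event.column] = event.data_type
    elif isinstance(event, DropColumn):
        updated.pop(event.column, None)
    else:
        raise TypeError(f"unsupported event: {event!r}")
    return updated


original = {"id": "BIGINT", "name": "VARCHAR(64)"}
schema = apply_schema_change(original, AddColumn("users", "email", "VARCHAR(128)"))
schema = apply_schema_change(schema, DropColumn("users", "name"))
print(schema)  # {'id': 'BIGINT', 'email': 'VARCHAR(128)'}
```

A real sink would translate each event into an ALTER TABLE statement on the target database rather than updating a dictionary.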

Primary Key Split Optimization:

Previously, CDC Source’s split was limited to numeric primary key columns. This update introduces two crucial features:

  • Support for unique indexes as split fields.
  • Support for splitting on string-type fields.

This means that as long as the source table has a primary key column or a unique index column, and that column's type is numeric or string, the table is split automatically for efficient CDC reading. The algorithm for splitting on string-type columns has also been optimized in the new version. In testing on a MySQL table with 400 million rows and 60 fields, the split time for string-type primary keys dropped from 3 hours to 20 minutes. The same algorithm has been incorporated into JDBC Source's partition splitting, so JDBC Source benefits from the faster string-type splits as well.
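For readers new to split-based reading, the following Python sketch shows the general idea of cutting a sortable key column (numeric or string) into ranges that can be read in parallel. It is not SeaTunnel's optimized algorithm; the function name `split_ranges` and the assumption that a sorted list of keys is already available are ours.

```python
from typing import List, Optional, Sequence, Tuple, TypeVar

K = TypeVar("K", int, str)


def split_ranges(sorted_keys: Sequence[K], chunk_size: int) -> List[Tuple[Optional[K], Optional[K]]]:
    """Cut sorted, unique keys into half-open ranges [start, end).

    None marks an unbounded side, so keys below the first boundary or
    above the last one still belong to exactly one range.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    # Every chunk_size-th key becomes a boundary between two ranges.
    boundaries = [sorted_keys[i] for i in range(chunk_size, len(sorted_keys), chunk_size)]
    starts = [None] + boundaries
    ends = boundaries + [None]
    return list(zip(starts, ends))


# Works the same way for string keys and numeric keys.
print(split_ranges(["a", "b", "c", "d", "e"], 2))
# [(None, 'c'), ('c', 'e'), ('e', None)]
print(split_ranges(list(range(1, 11)), 4))
# [(None, 5), (5, 9), (9, None)]
```

Each range maps to a query of the form `key >= start AND key < end`, which is why any column with a total order and a unique index can serve as the split field.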

MongoDB CDC Connector:

The 2.3.3 release introduces a new MongoDB CDC connector, extending CDC synchronization capabilities.

Transform Updates

SQL Transform now supports ‘select *’ and ‘like’ for fuzzy matching.

The “select *” query retrieves all fields from the source and allows adding additional fields after it to achieve the effect of adding custom columns during the synchronization process. For instance, consider the following example:

transform {
  Sql {
    source_table_name = "fake"
    result_table_name = "fake1"
    query = "select *, current_timestamp as sync_timestamp from fake"
  }
}

After this Transform runs, a “sync_timestamp” column is added to every row of input data from the source. The value of this column is the system timestamp at the moment the row passes through the Transform.
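If you want to try the shape of this query without a SeaTunnel job, the sketch below runs the same SQL against an in-memory SQLite table from Python. SQLite only stands in for the SQL Transform here, so details such as timestamp format and evaluation time may differ from SeaTunnel; the table contents are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fake (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO fake VALUES (?, ?)", [(1, "Demo_a"), (2, "other")])

# Same query shape as in the Transform config above.
cursor = conn.execute("SELECT *, current_timestamp AS sync_timestamp FROM fake")
columns = [description[0] for description in cursor.description]
rows = cursor.fetchall()

print(columns)  # ['id', 'name', 'sync_timestamp']
for row in rows:
    print(row)  # every row carries the extra timestamp value
```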

The “like” fuzzy matching is used for data filtering within the Transform. Consider the example below:

transform {
  Sql {
    source_table_name = "fake"
    result_table_name = "fake1"
    query = "select *, current_timestamp as sync_timestamp from fake where name like 'Demo_%'"
  }
}

This Transform adds the column as in the previous example and also filters the data: only rows whose “name” value starts with “Demo_” are output to downstream processing nodes (other Transform nodes or Sink nodes).
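Wildcard placement matters for “like” patterns. In standard SQL, `%` matches any sequence of characters and `_` matches exactly one character, so a prefix match is written as 'Demo_%', while '%Demo_' matches values that end with "Demo" plus one more character. The Python sketch below checks this with SQLite, which we use as a stand-in; whether SeaTunnel's SQL Transform supports an ESCAPE clause is not something this example assumes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fake (name TEXT)")
names = ["Demo_1", "Demo_", "DemoX", "MyDemo_", "MyDemoX"]
conn.executemany("INSERT INTO fake VALUES (?)", [(n,) for n in names])


def matching(pattern: str) -> list:
    """Return the names matching a LIKE pattern; '!' escapes a wildcard."""
    sql = "SELECT name FROM fake WHERE name LIKE ? ESCAPE '!' ORDER BY name"
    return [row[0] for row in conn.execute(sql, (pattern,))]


# Prefix match: "Demo", any single character, then anything.
print(matching("Demo_%"))   # ['DemoX', 'Demo_', 'Demo_1']
# Suffix-style match: anything, then "Demo" and exactly one character.
print(matching("%Demo_"))   # ['DemoX', 'Demo_', 'MyDemoX', 'MyDemo_']
# Escaped underscore: only names that literally start with "Demo_".
print(matching("Demo!_%"))  # ['Demo_', 'Demo_1']
```

Note that an unescaped `_` also lets "DemoX" through, which may or may not be what you want when filtering on a literal "Demo_" prefix.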

Enhanced Basic Capabilities

For CDC multi-table synchronization scenarios, JDBC Sink now offers automatic table creation. JDBC Sink generates DDL statements based on the upstream catalog table, streamlining table creation in target databases.

  • Please note that many databases can use the JDBC Sink connector, but not all of them support automatic table creation yet. In this release, the supported target databases are MySQL, Oracle, PostgreSQL, and SQL Server. Automatic table creation also places a requirement on the Source connector: it must implement Catalog, and in this release only CDC Source does. The feature therefore applies only when synchronizing from CDC Source to MySQL, Oracle, PostgreSQL, or SQL Server in multi-table synchronization mode.
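As a rough picture of what "generating DDL from the upstream catalog table" involves, here is a small Python sketch that renders a CREATE TABLE statement from a list of columns. The function name, the input shape, and the MySQL-style backtick quoting are assumptions made for this example; the real JDBC Sink also has to map data types between source and target databases, which is omitted here.

```python
from typing import Sequence, Tuple


def build_create_table(
    table: str,
    columns: Sequence[Tuple[str, str]],
    primary_key: Sequence[str],
) -> str:
    """Render a CREATE TABLE statement (MySQL-style quoting assumed)."""
    lines = [f"  `{name}` {sql_type}" for name, sql_type in columns]
    if primary_key:
        keys = ", ".join(f"`{key}`" for key in primary_key)
        lines.append(f"  PRIMARY KEY ({keys})")
    body = ",\n".join(lines)
    return f"CREATE TABLE IF NOT EXISTS `{table}` (\n{body}\n)"


ddl = build_create_table(
    "users",
    [("id", "BIGINT NOT NULL"), ("name", "VARCHAR(64)")],
    ["id"],
)
print(ddl)
```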

Zeta Engine Updates

  1. Support for Schema Evolution.
  2. The REST API now includes a job submission endpoint, so jobs can be submitted over REST without installing the SeaTunnel Client.

For example, the following network configuration enables the REST API:

network:
  rest-api:
    enabled: true
    endpoint-groups:
      CLUSTER_WRITE:
        enabled: true
      DATA:
        enabled: true
  join:
    tcp-ip:
      enabled: true
      member-list:
        - localhost
  port:
    auto-increment: true
    port-count: 100
    port: 5801

Refer to https://seatunnel.apache.org/docs/seatunnel-engine/rest-api/#submit-job for more details.
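With the REST API enabled, a job submission is an HTTP POST carrying the job definition as JSON. The Python sketch below only builds such a request. The endpoint path `/hazelcast/rest/maps/submit-job`, the `jobName` query parameter, and the shape of the JSON body are assumptions based on our reading of the linked documentation, so please verify them against the docs for your version before relying on them.

```python
import json
import urllib.request
from urllib.parse import urlencode

# Assumption: path and parameter name taken from the linked REST API docs.
SUBMIT_PATH = "/hazelcast/rest/maps/submit-job"


def build_submit_request(host: str, port: int, job_name: str, job_config: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request that submits a job as JSON."""
    query = urlencode({"jobName": job_name})
    url = f"http://{host}:{port}{SUBMIT_PATH}?{query}"
    body = json.dumps(job_config).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


# Hypothetical job definition; the plugin options are placeholders.
job = {
    "env": {"job.mode": "batch"},
    "source": [{"plugin_name": "FakeSource", "result_table_name": "fake"}],
    "transform": [],
    "sink": [{"plugin_name": "Console", "source_table_name": "fake"}],
}
request = build_submit_request("localhost", 5801, "demo_job", job)
print(request.get_method(), request.full_url)
# To actually submit against a running cluster:
# urllib.request.urlopen(request)
```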

  3. Job configuration supports variable substitution and parameter passing, enabling dynamic variable replacement during job submission.
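The effect of variable substitution can be pictured with a few lines of Python. This sketch uses `string.Template` from the standard library, whose `${name}` placeholder syntax is Python's own; SeaTunnel's exact placeholder format and the command-line option for passing values should be taken from its documentation. The config fragment and variable names are invented for the example.

```python
from string import Template

# Hypothetical job config fragment with two placeholders.
job_template = Template("""
source {
  FakeSource {
    result_table_name = "${table_name}"
    row.num = ${row_num}
  }
}
""")


def render_job(template: Template, variables: dict) -> str:
    """Replace every placeholder; raises KeyError if a value is missing."""
    return template.substitute(variables)


rendered = render_job(job_template, {"table_name": "fake", "row_num": 10})
print(rendered)
```

Using `substitute` rather than `safe_substitute` makes a missing variable fail loudly at submission time instead of producing a config with a leftover placeholder.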

Additional Enhancements, Optimizations, and Bug Fixes

The new version includes essential updates and optimizations across SeaTunnel Connector, Zeta Engine, Transform, and CI. Stubborn bugs have been fixed, and nearly 30 project documents have been updated, including detailed Connector usage guides.

Thanks to Contributors

Many thanks to @Liu Li for guidance and assistance in this release, and to the contributors for their support!

Contributor GitHub IDs

Visit our website for more information: https://seatunnel.apache.org

About Apache SeaTunnel

Apache SeaTunnel is an easy-to-use, ultra-high-performance distributed data integration platform that supports real-time synchronization of massive amounts of data and can synchronize hundreds of billions of records per day stably and efficiently.

Interested in becoming an Apache SeaTunnel speaker? Fill out this form: https://forms.gle/vtpQS6ZuxqXMt6DT6 :)

Why do we need Apache SeaTunnel?

Apache SeaTunnel does everything it can to solve the problems you may encounter in synchronizing massive amounts of data.

  • Data loss and duplication
  • Task buildup and latency
  • Low throughput
  • Long application-to-production cycle time
  • Lack of application status monitoring

Apache SeaTunnel Usage Scenarios

  • Massive data synchronization
  • Massive data integration
  • ETL of large volumes of data
  • Massive data aggregation
  • Multi-source data processing

Features of Apache SeaTunnel

  • Rich components
  • High scalability
  • Easy to use
  • Mature and stable

How to get started with Apache SeaTunnel quickly?

Want to experience Apache SeaTunnel quickly? SeaTunnel 2.1.0 takes 10 seconds to get you up and running.

https://seatunnel.apache.org/docs/2.1.0/developement/setup

How can I contribute?

We invite everyone interested in taking local open-source projects global to join the Apache SeaTunnel contributor family and grow open source together!

Submit an issue:

https://github.com/apache/seatunnel/issues

Contribute code to:

https://github.com/apache/seatunnel/pulls

Subscribe to the community development mailing list:

dev-subscribe@seatunnel.apache.org

Development mailing list:

dev@seatunnel.apache.org

Join Slack:

https://join.slack.com/t/apacheseatunnel/shared_invite/zt-1kcxzyrxz-lKcF3BAyzHEmpcc4OSaCjQ

Follow Twitter:

https://twitter.com/ASFSeaTunnel

Join us now!❤️❤️
