Apache SeaTunnel (Incubating) released version 2.1.1, and the ClickHouse Sink now supports fast data writing at the 10-billion-row scale

Apache SeaTunnel
Apr 29, 2022


In the month or so since the release of Apache SeaTunnel version 2.1.0, the community has merged hundreds of cumulative Pull Requests from teams and individuals around the world, giving birth to Apache SeaTunnel version 2.1.1, with performance greatly improved and features, tests, documentation, and examples optimized.

In this article, we will introduce the details of the Apache SeaTunnel 2.1.1 update.

Release Note: https://github.com/apache/incubator-seatunnel/blob/2.1.1/release-note.md

Download link: https://seatunnel.apache.org/download

01 Major Feature Updates

ClickHouse Connector Sink performance exponentially improved

When importing batches of more than 10 billion rows, the performance of traditional JDBC writes in massive data synchronization scenarios is unsatisfactory.

To make data writing faster, we communicated with SeaTunnel ClickHouse Sink users, and after a month of discussion, development, and testing, we delivered ClickhouseFile connector support. ClickhouseFile is a high-performance data writing connector. In the new version, when using data tables of the MergeTree engine family, the ClickhouseFile plugin can write data efficiently to both local tables and distributed tables.

The plug-in’s advantages lie in:

  • Efficient batch writing of data;
  • Far less stress on Clickhouse clusters than traditional write methods.

In short, the ClickhouseFile plugin can write data efficiently with low resource consumption.

Supported Features

  1. Support for the MergeTree family of local tables.
  2. Parsing of Distributed tables and sharding of data by the defined sharding_key.
  3. Multiple data file synchronization methods, with support for SCP and RSYNC.
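The sharding behavior in point 2 can be sketched as follows. This is a hypothetical illustration of hash-based sharding by a sharding_key column, not SeaTunnel's actual implementation; the real connector's hash function and null handling may differ:

```python
import zlib

def shard_for(row: dict, sharding_key: str, num_shards: int) -> int:
    """Map a row to a shard index by hashing its sharding_key column.

    Hypothetical sketch: ClickHouse Distributed tables route rows to
    shards via a hash of the sharding key modulo the shard count.
    """
    value = str(row[sharding_key]).encode("utf-8")
    return zlib.crc32(value) % num_shards

# Rows with the same sharding_key value always land on the same shard.
rows = [{"user_id": 1}, {"user_id": 2}, {"user_id": 1}]
shards = [shard_for(r, "user_id", 3) for r in rows]
```

Because the mapping is deterministic, the connector can pre-group data files per shard locally and ship each file only to the node that owns it.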

Principle of implementation

By calling the clickhouse-local component, operations such as data file generation and data compression are performed on the Apache SeaTunnel (Incubating) side. By communicating with the server, the generated data is then sent directly to the different ClickHouse nodes, and the data files are made available to those nodes for querying.
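In configuration terms, the sink is used like any other SeaTunnel sink plugin. Below is a minimal sketch of what a ClickhouseFile sink block might look like; the option names here are illustrative assumptions, so check the ClickhouseFile connector documentation for your version's exact parameters:

```
sink {
  ClickhouseFile {
    host = "clickhouse-node1:8123"
    database = "default"
    table = "events"
    username = "default"
    password = ""
    # Path to the clickhouse-local binary on the SeaTunnel machine
    clickhouse_local_path = "/usr/bin/clickhouse-local"
    # How generated data files are shipped to the ClickHouse nodes
    copy_method = "scp"   # or "rsync"
  }
}
```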

Processing flow

A video walking through the processing flow accompanies the original post.

Support for JDK 11

According to the official description from OpenJDK (https://adoptopenjdk.net/support.html), there are currently three LTS versions, mainly JDK 8, JDK 11, and JDK 17, so after community discussion we decided to add support for JDK 11, along with the corresponding CI. JDK 11 users can now quickly experience Apache SeaTunnel (Incubating).

CI&CD

CI&CD guarantees the code quality of an open-source project, and we have been committed to building a solid CI&CD "rampart" since the beginning of the Apache SeaTunnel (Incubating) project; community contributors are also working hard toward this. In this update, we added Spark & Flink e2e tests and Sonar automated code quality analysis, and we welcome community contributors to continue improving the automated testing of each component.

The following are the details of the update.

02 Specific updates

Features

【Connector】

  • Spark & Flink ClickHouse enhancements, with exponential performance improvements.
  • Flink supports Elasticsearch 7.x.
  • Spark supports HTTP.
  • Spark engine supports FeiShu.
  • JDBC Connector supports partitioning.
  • Spark Email Connector supports SSL/TLS parameters.

【Core】

  • Configuration file supports JSON format.
  • Support for JDK11.
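As an illustration of the new JSON configuration support, a job that was previously defined in HOCON can now also be expressed as JSON. The following is a hypothetical minimal example; the plugin names (FakeSource, Console) and field layout are assumptions based on SeaTunnel conventions, so consult your version's docs for the exact schema:

```json
{
  "env": { "execution.parallelism": 1 },
  "source": [ { "plugin_name": "FakeSource" } ],
  "sink": [ { "plugin_name": "Console" } ]
}
```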

【Bug Fix】

  • Fixed the problem that the ConsoleSink plugin under the Flink engine would not print output;
  • Fixed JDBC type compatibility issues across various data sources between JdbcSource and JdbcSink;
  • Fixed a bug where the Transform plugin does not execute when the data source is empty;
  • Fixed the issue that DateTime/date strings could not be converted to timestamp/date;
  • Fixed the problem that the table existence check did not account for temporary tables, causing the Kafka plugin to fail to write;
  • Fixed the issue that the FileSink plugin does not work in Flink streaming mode;
  • Fixed the required configuration parameters of the RedisSink plugin when using the Spark engine;
  • Fixed an SQL parsing error on table names;
  • Fixed a ClassCastException when outputting data to Doris.

【Optimization】

  • Upgraded the Log4j version to 2.17.1.
  • Unified management of third-party dependency versions.
  • Added the enableHiveSupport parameter to automatically identify whether the Spark engine uses Hive.
  • Removed useless job names from JobInfo.
  • Added a console output quantity limit, and added console support for Flink batch output.
  • Optimized how plugins are loaded.
  • Rewrote the Spark and Flink startup scripts.
  • Added logging to quickly locate incorrect SQL statements during Flink SQL conversion.
  • Removed the returned result of the Sink plugin.
  • Support for opening the Flink web UI in the Flink example.

【CI&CD】

  • Automatic code quality detection by Sonar.
  • Spark & Flink e2e tests.

03 Acknowledgements

Thanks to the following contributors (GitHub IDs, in no particular order); it is your dedication and hard work that allowed us to launch this version so quickly.

ruanwenjun, CalvinKirs, BenJFan, mans2singh, asdf2014, zhongjiajie, simon824, yx91490, wuchunfu, kyle-cx, dongzl, zhaomin1423, kone-net, tmljob, Rianico, GezimSejdiu, realdengziqi, kalencaya, tobezhou33, DingPengfei, chenhu, v-wx-v, 1996fanrui, lvshaokang, bigdataf, ououtt, bestcx, hf200012, kid-xiong


Written by Apache SeaTunnel

The next-generation high-performance, distributed, massive data integration tool.
