[Tutorial] Use SeaTunnel to synchronize Kafka data to ClickHouse

Apache SeaTunnel
Apr 1, 2024

Apache SeaTunnel Dependency Address

SeaTunnel Official Website’s Source/Sink Templates

SeaTunnel’s GitHub Address

Download the installation package from the official website. Use apache-seatunnel-2.3.3-bin.tar.gz; do not use apache-seatunnel-incubating-2.1.0-bin.tar.gz, which lacks dependencies and features. The 2.3.3 package still needs environment configuration and connector jars, and downloading those jars requires internet access.

Starting with version 2.2.0-beta, the binary package does not include connector dependencies by default, so you need to run the following command to install the connectors before first use. Alternatively, you can download the connectors manually from the Apache Maven Repository [https://repo.maven.apache.org/maven2/org/apache/seatunnel/] and move them to the connectors/seatunnel directory.

sh bin/install-plugin.sh

To install a specific connector version, pass it as an argument. For version 2.3.3, execute:

sh bin/install-plugin.sh 2.3.3
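
The install script downloads every connector listed in config/plugin_config, which can take a while. Below is a minimal sketch of trimming that list to the connectors used in this tutorial. It assumes the plugin_config layout shipped with 2.3.3 (connector names between the --connectors-v2-- and --end-- markers). The SEATUNNEL_HOME default is a placeholder for your own install path.

```shell
# Placeholder install path; point this at your own extraction directory.
SEATUNNEL_HOME="${SEATUNNEL_HOME:-$PWD/apache-seatunnel-2.3.3}"
mkdir -p "$SEATUNNEL_HOME/config"

# Keep a copy of the shipped list before overwriting it.
if [ -f "$SEATUNNEL_HOME/config/plugin_config" ]; then
  cp "$SEATUNNEL_HOME/config/plugin_config" "$SEATUNNEL_HOME/config/plugin_config.bak"
fi

# List only the connectors needed for MySQL (JDBC), Kafka and ClickHouse jobs.
cat > "$SEATUNNEL_HOME/config/plugin_config" <<'EOF'
--connectors-v2--
connector-jdbc
connector-kafka
connector-clickhouse
--end--
EOF

# Afterwards, from $SEATUNNEL_HOME, run: sh bin/install-plugin.sh 2.3.3
```

After the install finishes, the matching jars should appear under connectors/seatunnel.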

Dependencies can also be imported manually:

  1. Place the connectors in this directory:
apache-seatunnel-2.3.3/connectors/seatunnel
  2. Place the MySQL and ClickHouse connection drivers and SeaTunnel’s Source package in this directory:
/usr/local/mysql/module/seatunnel/apache-seatunnel-2.3.3/lib
  3. Place the configuration files in this directory:
/usr/local/mysql/module/seatunnel/apache-seatunnel-2.3.3/config
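
A quick pre-flight check of the three directories above can catch a missing jar before a job fails at startup. The helper below is a hypothetical sketch, not part of SeaTunnel. The glob patterns for the driver jars (*mysql*.jar, *clickhouse*.jar) are assumptions about how your driver files are named.

```shell
# check_layout HOME: report which of the expected files are missing.
check_layout() {
  home="$1"
  status=0
  # Connector jars, e.g. connector-kafka-2.3.3.jar
  ls "$home"/connectors/seatunnel/connector-*.jar >/dev/null 2>&1 \
    || { echo "no connector jars in $home/connectors/seatunnel"; status=1; }
  # Driver jars; the name patterns are assumptions, adjust to your files.
  ls "$home"/lib/*mysql*.jar >/dev/null 2>&1 \
    || { echo "no MySQL driver jar in $home/lib"; status=1; }
  ls "$home"/lib/*clickhouse*.jar >/dev/null 2>&1 \
    || { echo "no ClickHouse driver jar in $home/lib"; status=1; }
  # Job configuration files
  ls "$home"/config/*.conf >/dev/null 2>&1 \
    || { echo "no .conf files in $home/config"; status=1; }
  return "$status"
}

# Usage: check_layout /usr/local/mysql/module/seatunnel/apache-seatunnel-2.3.3
```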

Note: the Flink or Spark environment variables must be set in the seatunnel-env.sh file.
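
As a sketch, the relevant lines of config/seatunnel-env.sh look roughly like this. The two fallback paths are placeholders, not values from this setup; replace them with the directories where Flink or Spark is actually installed.

```shell
# config/seatunnel-env.sh (sketch)
# Keep any value already exported in the environment; otherwise fall back to
# the path after ":-". Both fallback paths below are placeholders.
FLINK_HOME=${FLINK_HOME:-/opt/flink}
SPARK_HOME=${SPARK_HOME:-/opt/spark}
```

Only the variable for the engine you actually use needs to point at a real installation.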

Execute command:

Before running a job, check three things: choose the startup script in the bin directory that matches your Flink version (different Flink versions require different startup scripts), make sure the environment variables in seatunnel-env.sh are configured, and make sure Flink's JobManager and TaskManager are running before the task is submitted.
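
The script choice can be sketched as a small helper. It assumes the mapping described in the SeaTunnel 2.3.3 documentation (the flink-13 script for Flink 1.12 to 1.14, the flink-15 script for Flink 1.15 and 1.16). The function name pick_start_script is hypothetical, so check the bin directory of your own package for the scripts it actually ships.

```shell
# pick_start_script VERSION: print the startup script for a Flink version
# such as "1.15.4". The version-to-script mapping is an assumption based on
# the 2.3.3 documentation.
pick_start_script() {
  minor=$(printf '%s' "$1" | cut -d. -f2)
  case "$minor" in
    12|13|14) echo "start-seatunnel-flink-13-connector-v2.sh" ;;
    15|16)    echo "start-seatunnel-flink-15-connector-v2.sh" ;;
    *)        echo "unsupported Flink version: $1" >&2; return 1 ;;
  esac
}

# Usage: ./bin/"$(pick_start_script 1.15.4)" --config ./config/example08.conf
```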

The env block in the example08.conf configuration file:

env {
  execution.parallelism = 1
  job.mode = "STREAMING"
  checkpoint.interval = 2000
}

When running this job on Flink, job.mode must be set to STREAMING, not BATCH.

For the specific configuration format, refer to the second link above (SeaTunnel Official Website’s Source/Sink Templates). Executing the task can take several minutes, so wait for it to finish before checking whether data has been transferred.
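
To make the format concrete, here is a minimal sketch of a Kafka to ClickHouse job file, written out with a shell heredoc. It is not the author's attached configuration. The broker address, topic, consumer group, field names, table name and the file name kafka_to_clickhouse.conf are all hypothetical. The option names follow the 2.3.3 Kafka source and ClickHouse sink templates as best understood here, so verify them against the official templates before use.

```shell
# Placeholder output directory; use your SeaTunnel config directory instead.
CONF_DIR="${CONF_DIR:-$PWD/config}"
mkdir -p "$CONF_DIR"

cat > "$CONF_DIR/kafka_to_clickhouse.conf" <<'EOF'
env {
  execution.parallelism = 1
  job.mode = "STREAMING"
  checkpoint.interval = 2000
}

source {
  Kafka {
    bootstrap.servers = "localhost:9092"   # placeholder broker
    topic = "demo_topic"                   # hypothetical topic
    consumer.group = "seatunnel_demo"      # hypothetical group id
    start_mode = "earliest"
    format = "json"
    schema = {
      fields {
        id = "bigint"
        name = "string"
      }
    }
    result_table_name = "kafka_rows"
  }
}

sink {
  Clickhouse {
    source_table_name = "kafka_rows"
    host = "localhost:8123"                # placeholder HTTP endpoint
    database = "default"
    table = "demo_table"                   # hypothetical; must already exist
    username = "default"
    password = ""
  }
}
EOF
```

The schema fields should match the JSON keys in the Kafka messages and the columns of the ClickHouse table.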

The first attachment shows the complete directory of Apache SeaTunnel 2.3.3, including the MySQL and ClickHouse connection drivers and configuration files such as seatunnel-env.sh. Adjust these to match your own data synchronization link and server parameters.

The second attachment contains the configuration files for three links: MySQL to ClickHouse, MySQL to Kafka, and Kafka to ClickHouse.

Each execution of the command runs one synchronization. Make sure both the source and target tables exist and that there is data to synchronize, so that the result can be observed in the target table after the sync command runs.

[root@172-xx-xxx-x bin]# ./start-seatunnel-flink-15-connector-v2.sh --config ../config/example07.conf

The difference between STREAMING and BATCH in SeaTunnel’s env { job.mode = "STREAMING" } configuration.
