Get Started with Apache SeaTunnel in 3 Minutes
Introduction
SeaTunnel is an open-source big data integration tool under the Apache Software Foundation. It provides flexible, easy-to-use, and easily extensible solutions for data integration scenarios and can handle workloads of hundreds of billions of records. SeaTunnel can run on its own SeaTunnel Zeta engine, on Apache Flink, or on Apache Spark, and it offers high-performance data synchronization for both real-time (CDC) and batch data. This guide helps you get started with SeaTunnel quickly so that you can use it in your own big data integration projects. For simplicity, this article uses SeaTunnel Zeta as the execution engine.
Environment preparation
1. Install a Java runtime if you do not already have one. Java 8 or 11 is recommended; versions newer than Java 8 should work in theory. To verify the installation, run
java -version
in the terminal; it should print the Java version information.
2. Download the latest SeaTunnel from the official website (https://seatunnel.apache.org/download) and extract it into a suitable directory.
3. Install the connector plug-ins. Install only the data source plug-ins you need [2] by listing them in the config/plugin_config file. If you are trying SeaTunnel for the first time, two plug-ins are enough: connector-fake (generates test data) and connector-console (prints rows to the console). Edit the plugin_config file so that it keeps only these entries:
--connectors-v2--
connector-fake
connector-console
--end--
Then run the following command to install the connectors. Note that starting from 2.2.0-beta, the binary package no longer ships connector dependencies by default, so you must download the connector plug-ins before first use:
sh bin/install-plugin.sh 2.3.1
After the command finishes, the corresponding connector JAR files appear in the connectors/seatunnel directory.
Note: This step requires network access. Alternatively, you can download the connector JARs manually from the Apache Maven repository and move them into the connectors/seatunnel directory.
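The preparation steps above can also be scripted. The following is a minimal shell sketch, not part of the official tooling: the function names java_major_version and write_minimal_plugin_config are made up for this illustration, and the version parsing assumes the usual version "..." banner that java -version prints on its first line.

```shell
#!/bin/sh
# Hypothetical helpers for the preparation steps; they are not shipped with SeaTunnel.

# Print the major Java version found in a `java -version` banner line.
# Handles the legacy "1.8.0_362" scheme (major = 8) and the newer "11.0.19" scheme.
java_major_version() {
    # $1: banner line, e.g.  openjdk version "11.0.19" 2023-04-18
    version=$(printf '%s\n' "$1" | sed -n 's/.*version "\([^"]*\)".*/\1/p')
    case "$version" in
        "")  return 1 ;;                                     # no version string found
        1.*) printf '%s\n' "$version" | cut -d. -f2 ;;       # 1.8.x -> 8
        *)   printf '%s\n' "$version" | cut -d. -f1 | sed 's/[^0-9].*//' ;;
    esac
}

# Overwrite a plugin_config file so that it lists only the two connectors used in this guide.
write_minimal_plugin_config() {
    # $1: path of the plugin_config file
    printf '%s\n' '--connectors-v2--' 'connector-fake' 'connector-console' '--end--' > "$1"
}

# Usage sketch, run from the SeaTunnel directory (java -version prints to stderr):
#   major=$(java_major_version "$(java -version 2>&1 | head -n 1)")
#   [ "$major" -ge 8 ] || echo "Java 8 or newer is required" >&2
#   write_minimal_plugin_config config/plugin_config
#   sh bin/install-plugin.sh 2.3.1
```

Note that write_minimal_plugin_config replaces the whole file, so keep a copy of the original plugin_config if you plan to add more connectors later.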
Configure SeaTunnel sync jobs
Add a job configuration. Edit the config/v2.batch.config.template file, which defines how data is read, processed, and written once SeaTunnel starts. Here is an example configuration file:
env {
  execution.parallelism = 2
  job.mode = "BATCH"
  #checkpoint.interval = 10000
}
source {
  FakeSource {
    parallelism = 2
    result_table_name = "fake"
    row.num = 16
    schema = {
      fields {
        name = "string"
        age = "int"
      }
    }
  }
}
sink {
  Console {}
}
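In this example the FakeSource runs with parallelism = 2 and row.num = 16. Judging by the log output shown later in this article (two splits of 16 rows each, 32 rows read in total), row.num applies to each parallel reader, so the expected row count is parallelism multiplied by row.num. The sketch below estimates that number from a config file. The names config_int and expected_fake_rows are hypothetical, and the line-based lookup is a simplification rather than a real HOCON parser.

```shell
#!/bin/sh
# Hypothetical helpers; a crude line-based lookup that only suits simple configs like the one above.

# Print the first integer assigned to a key that starts a line, e.g. "parallelism" or "row.num".
# "execution.parallelism" does not match the key "parallelism" because the key must start the line.
config_int() {
    # $1: key, $2: config file
    key=$(printf '%s' "$1" | sed 's/\./\\./g')   # escape dots so they match literally
    sed -n "s/^[[:space:]]*${key}[[:space:]]*=[[:space:]]*\([0-9][0-9]*\).*/\1/p" "$2" | head -n 1
}

# Print the number of rows the FakeSource is expected to generate:
# source parallelism multiplied by row.num.
expected_fake_rows() {
    # $1: config file
    p=$(config_int parallelism "$1")
    n=$(config_int row.num "$1")
    [ -n "$p" ] && [ -n "$n" ] || return 1       # give up if either value is missing
    echo $((p * n))
}

# Usage sketch:
#   expected_fake_rows config/v2.batch.config.template
```

For the configuration above this gives 2 multiplied by 16, which matches the Total Read Count reported in the job summary at the end of the run.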
Run the SeaTunnel job
On the command line, change to the directory where you extracted SeaTunnel and run the following commands to start the job:
cd "apache-seatunnel-incubating-${version}"
./bin/seatunnel.sh --config ./config/v2.batch.config.template -e local
This command will run your SeaTunnel job locally (local mode). If you need to use SeaTunnel Cluster (cluster mode), please refer to [3].
When you run the command above, its output appears in the console, and you can use it to judge whether the job ran successfully.
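If you want to keep the output for later inspection, you can wrap the command in a small function. This is only a sketch: run_local_job is a hypothetical name, and it assumes that the launcher script exits with a non-zero status when the job fails, which this article does not verify.

```shell
#!/bin/sh
# Hypothetical wrapper around the launcher; not part of SeaTunnel.

# Run a job in local mode, save the console output to a file, and report the outcome.
run_local_job() {
    # $1: launcher (./bin/seatunnel.sh in this guide)
    # $2: job config file
    # $3: file that receives the console output (stdout and stderr)
    if "$1" --config "$2" -e local > "$3" 2>&1; then
        echo "job finished, output saved to $3"
    else
        status=$?                                 # exit status of the launcher
        echo "job failed with exit status $status, see $3" >&2
        return "$status"
    fi
}

# Usage sketch, run from the SeaTunnel directory:
#   run_local_job ./bin/seatunnel.sh ./config/v2.batch.config.template job.log
#   cat job.log
```

Because the output is redirected to a file, nothing is printed while the job runs; use cat or tail on the file to read it.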
The SeaTunnel console will print the following logs:
2023-04-11 18:33:30,547 INFO org.apache.seatunnel.connectors.seatunnel.fake.source.FakeSourceSplitEnumerator - Assigning splits to readers 0 [FakeSourceSplit(splitId=0, rowNum=16)]
2023-04-11 18:33:30,551 INFO org.apache.seatunnel.connectors.seatunnel.fake.source.FakeSourceSplitEnumerator - Assigning splits to readers 1 [FakeSourceSplit(splitId=1, rowNum=16)]
2023-04-11 18:33:31,489 INFO org.apache.seatunnel.connectors.seatunnel.fake.source.FakeSourceReader - 16 rows of data have been generated in split(1). Generation time: 1681209211485
2023-04-11 18:33:31,489 INFO org.apache.seatunnel.connectors.seatunnel.console.sink.ConsoleSinkWriter - subtaskIndex=1 rowIndex=1: SeaTunnelRow#tableId= SeaTunnelRow#kind=INSERT : jBHJM, 62571717
2023-04-11 18:33:31,489 INFO org.apache.seatunnel.connectors.seatunnel.fake.source.FakeSourceReader - Closed the bounded fake source
2023-04-11 18:33:31,489 INFO org.apache.seatunnel.connectors.seatunnel.console.sink.ConsoleSinkWriter - subtaskIndex=1 rowIndex=2: SeaTunnelRow#tableId= SeaTunnelRow#kind=INSERT : hOPkY, 565194744
2023-04-11 18:33:31,489 INFO org.apache.seatunnel.connectors.seatunnel.console.sink.ConsoleSinkWriter - subtaskIndex=1 rowIndex=3: SeaTunnelRow#tableId= SeaTunnelRow#kind=INSERT : QRUsG, 706574302
..........
When the job finishes, a summary of the run is printed:
2023-04-11 18:33:32,639 INFO org.apache.seatunnel.core.starter.seatunnel.command.ClientExecuteCommand -
***************************************************
Job Statistic Information
***************************************************
Start Time: 2023-04-11 18:33:27
End Time: 2023-04-11 18:33:32
Total Time(s) : 4
Total Read Count : 32
Total Write Count : 32
Total Failed Count : 0
***************************************************
At this point, SeaTunnel has run successfully!
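The summary can also be checked by a script. The sketch below assumes that you saved the console output to a file (for example with shell redirection) and that the counters keep the "Label : number" layout shown above. The names job_stat and job_looks_healthy are illustrative, not SeaTunnel commands.

```shell
#!/bin/sh
# Hypothetical helpers for reading the "Job Statistic Information" block from a saved log.

# Print the value of one counter; if the log holds several summaries, the last one wins.
job_stat() {
    # $1: counter label, e.g. "Total Read Count"; $2: log file
    sed -n "s/^[[:space:]]*$1[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p" "$2" | tail -n 1
}

# Succeed only if a read count was found, every row read was written, and nothing failed.
job_looks_healthy() {
    # $1: log file
    read_count=$(job_stat "Total Read Count" "$1")
    write_count=$(job_stat "Total Write Count" "$1")
    failed_count=$(job_stat "Total Failed Count" "$1")
    [ -n "$read_count" ] && [ "$read_count" = "$write_count" ] && [ "$failed_count" = "0" ]
}

# Usage sketch:
#   job_looks_healthy job.log && echo "read and write counts match, no failures"
```

For the run shown in this article, the read and write counts are both 32 and the failed count is 0, so the check would pass.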
Summary
By following this guide, you have set up and run a basic SeaTunnel job. You can now start using SeaTunnel for your own data integration needs. Easy, isn’t it? Give it a try!
For more details about SeaTunnel, please visit the official documentation: https://seatunnel.apache.org/
You are also welcome to join our Slack channel to learn more about SeaTunnel: https://join.slack.com/t/apacheseatunnel/shared_invite/zt-1kcxzyrxz-lKcF3BAyzHEmpcc4OSaCjQ
Appendix:
[1] https://seatunnel.apache.org/docs/2.3.1/seatunnel-engine/about
[2] https://seatunnel.apache.org/docs/2.3.1/start-v2/locally/deployment#step-3-install-connectors-plugin
[3] https://seatunnel.apache.org/docs/2.3.1/seatunnel-engine/cluster-mode
About Apache SeaTunnel
Apache SeaTunnel (formerly Waterdrop) is an easy-to-use, ultra-high-performance distributed data integration platform that supports real-time synchronization of massive amounts of data and can synchronize hundreds of billions of records per day stably and efficiently.
Why do we need Apache SeaTunnel?
Apache SeaTunnel aims to solve the problems you may encounter when synchronizing massive amounts of data:
- Data loss and duplication
- Task buildup and latency
- Low throughput
- Long application-to-production cycle time
- Lack of application status monitoring
Apache SeaTunnel Usage Scenarios
- Massive data synchronization
- Massive data integration
- ETL of large volumes of data
- Massive data aggregation
- Multi-source data processing
Features of Apache SeaTunnel
- Rich components
- High scalability
- Easy to use
- Mature and stable
How to get started with Apache SeaTunnel quickly?
Want to experience Apache SeaTunnel quickly? SeaTunnel 2.1.0 takes 10 seconds to get you up and running.
https://seatunnel.apache.org/docs/2.1.0/developement/setup
How can I contribute?
We invite everyone interested in taking local open source global to join the Apache SeaTunnel contributor family and grow open source together!
Submit an issue:
https://github.com/apache/incubator-seatunnel/issues
Contribute code to:
https://github.com/apache/incubator-seatunnel/pulls
Subscribe to the community development mailing list:
dev-subscribe@seatunnel.apache.org
Development mailing list:
dev@seatunnel.apache.org
Join Slack:
https://join.slack.com/t/apacheseatunnel/shared_invite/zt-1kcxzyrxz-lKcF3BAyzHEmpcc4OSaCjQ
Follow Twitter:
https://twitter.com/ASFSeaTunnel
Come and join us!