Mastering SeaTunnel Engine Deployment: A Comprehensive Guide

Apache SeaTunnel
Sep 5, 2023

1. Deploying the SeaTunnel Engine

The SeaTunnel Engine is the default engine for SeaTunnel. The SeaTunnel installation package already includes all contents of the SeaTunnel Engine.

2. Configuring Environment Variables

Configure the environment variables in /etc/profile.d/seatunnel.sh.

export SEATUNNEL_HOME=/path/to/seatunnel  # replace with your SeaTunnel install path
export PATH=$PATH:$SEATUNNEL_HOME/bin
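After sourcing the profile script, it can help to confirm that the variable points at a real installation. The following is a minimal sketch, not part of SeaTunnel itself: the function name check_seatunnel_home is hypothetical, and the only thing it assumes is that bin/seatunnel-cluster.sh exists under the install path, as used later in this guide.

```shell
# Hypothetical helper: verify that SEATUNNEL_HOME (or the path given as $1)
# is set and contains an executable bin/seatunnel-cluster.sh.
check_seatunnel_home() (
  home=${1:-${SEATUNNEL_HOME:-}}
  [ -n "$home" ] || { echo "SEATUNNEL_HOME is not set" >&2; exit 1; }
  [ -x "$home/bin/seatunnel-cluster.sh" ] || {
    echo "$home/bin/seatunnel-cluster.sh is missing or not executable" >&2
    exit 1
  }
)

# Usage (after logging in again or sourcing /etc/profile.d/seatunnel.sh):
#   check_seatunnel_home && echo "SEATUNNEL_HOME looks good"
```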

3. Configuring the SeaTunnel Engine JVM

SeaTunnel Engine provides two methods to set JVM options:

1. Add JVM options to the first line of $SEATUNNEL_HOME/bin/seatunnel-cluster.sh.

JAVA_OPTS="-Xms2G -Xmx2G"

2. Add JVM options when starting the SeaTunnel Engine, for example:

seatunnel-cluster.sh -DJvmOption="-Xms2G -Xmx2G"

4. SeaTunnel Configuration

SeaTunnel Engine offers numerous features, which need to be configured in seatunnel.yaml.

  1. Backup

The SeaTunnel Engine employs Hazelcast IMDG for cluster management. The cluster’s status data (job run status, resource status) is stored in Hazelcast IMap.

Data saved in Hazelcast IMap will be distributed and stored across all nodes in the cluster. Hazelcast partitions the data stored in IMap. Each partition can specify the number of backups. Thus, the SeaTunnel Engine can achieve cluster HA without utilizing other services like ZooKeeper.

To define the number of synchronous backups, use backup-count. For instance, setting it to 1 places a backup of each partition on one other member. If set to 2, backups are stored on two other members.

We recommend a backup-count value of max(1, min(5, N/2)), where N is the number of cluster nodes. In other words, use half the node count, but never less than 1 or more than 5.

seatunnel:
  engine:
    backup-count: 1
    # other config
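As a quick way to pick a value for a given cluster size, the recommendation can be sketched as a small shell function. This assumes the recommendation is meant as "half the node count, clamped to the range 1 to 5" with integer division; the function name recommended_backup_count is hypothetical.

```shell
# Hypothetical helper: recommended backup-count for a cluster of N nodes,
# assuming the rule is N/2 (integer division) clamped to the range 1..5.
recommended_backup_count() (
  n=$1
  half=$((n / 2))
  [ "$half" -gt 5 ] && half=5
  [ "$half" -lt 1 ] && half=1
  echo "$half"
)

recommended_backup_count 3    # prints 1
recommended_backup_count 8    # prints 4
recommended_backup_count 20   # prints 5
```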

2. Slots

The number of slots determines the quantity of TaskGroups a cluster node can run concurrently. Since the SeaTunnel Engine is a data synchronization engine, most tasks are IO-intensive.

We recommend using dynamic slots.

seatunnel:
  engine:
    slot-service:
      dynamic-slot: true
    # other config

3. Checkpoints

Similar to Flink, the SeaTunnel Engine supports the Chandy–Lamport algorithm. As such, the SeaTunnel Engine can synchronize data without data loss or duplication.

Interval: The gap between two checkpoints, in milliseconds. If the checkpoint.interval parameter is set in the job configuration file, the value set here will be overridden.

Timeout: Checkpoint timeout. A checkpoint failure is triggered if a checkpoint cannot be completed within this period.

Maximum concurrency: Specifies the maximum number of checkpoints that can run concurrently.

Tolerable failures: Maximum number of retry attempts after a checkpoint failure.

seatunnel:
  engine:
    backup-count: 1
    print-execution-info-interval: 10
    slot-service:
      dynamic-slot: true
    checkpoint:
      interval: 300000
      timeout: 10000
      max-concurrent: 1
      tolerable-failure: 2

5. Configuring the SeaTunnel Engine Server

All SeaTunnel Engine server configurations are in hazelcast.yaml.

The SeaTunnel Engine nodes use the cluster name to determine if they are part of the same cluster. If two nodes have different cluster names, the SeaTunnel Engine will refuse service requests.

Using Hazelcast, the SeaTunnel Engine cluster comprises the network of cluster members running the SeaTunnel Engine Server. Cluster members automatically connect to form a cluster. This auto-joining is facilitated by various discovery mechanisms used by the cluster members to find each other.

Please note, once the cluster is formed, member-to-member communication is always via TCP/IP, regardless of the discovery method used.

The SeaTunnel Engine uses TCP/IP as its discovery mechanism. It can be configured as a full TCP/IP cluster like this:

hazelcast:
  cluster-name: seatunnel
  network:
    join:
      tcp-ip:
        enabled: true
        member-list:
          - hostname1
    port:
      auto-increment: false
      port: 5801
  properties:
    hazelcast.logging.type: log4j2

TCP is our recommended method for standalone SeaTunnel Engine clusters.

The data held in IMap can also be persisted to external storage through a map-store, which is configured with the following properties:

Type: The persistence type for IMap. Currently only hdfs is supported.

Namespace: Used to distinguish storage locations for different business data, such as the OSS bucket name.

Cluster Name: This parameter is primarily used for cluster isolation. It can distinguish different clusters, such as cluster1 and cluster2, as well as different businesses.

fs.defaultFS: We use the hdfs API for file read/write operations, so providing hdfs configuration is necessary for this storage.

For HDFS usage, you can configure it like this:

map:
  engine*:
    map-store:
      enabled: true
      initial-mode: EAGER
      factory-class-name: org.apache.seatunnel.engine.server.persistence.FileMapStoreFactory
      properties:
        type: hdfs
        namespace: /tmp/seatunnel/imap
        clusterName: seatunnel-cluster
        fs.defaultFS: hdfs://localhost:9000

If HDFS is not available and your cluster has only one node, you can configure it to use local files like this:

map:
  engine*:
    map-store:
      enabled: true
      initial-mode: EAGER
      factory-class-name: org.apache.seatunnel.engine.server.persistence.FileMapStoreFactory
      properties:
        type: hdfs
        namespace: /tmp/seatunnel/imap
        clusterName: seatunnel-cluster
        fs.defaultFS: file:///

6. Cluster Start and Stop

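The SeaTunnel Engine client is configured in hazelcast-client.yaml. The client's cluster-name must be the same as the SeaTunnel Engine's cluster-name. Otherwise, the SeaTunnel Engine will reject the client's requests.

Cluster members

The addresses of all SeaTunnel Engine Server nodes need to be added here.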

hazelcast-client:
  cluster-name: seatunnel
  properties:
    hazelcast.logging.type: log4j2
  network:
    cluster-members:
      - hostname1:5801
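Because a mismatched cluster name leads to rejected requests, it can be worth comparing the two files before submitting a job. The sketch below is one simple way to do that with awk. The function names are hypothetical, it reads only the first cluster-name key in each file, and the file locations in the usage comment are an assumption about where your installation keeps them.

```shell
# Hypothetical helper: print the value of the first "cluster-name:" key
# found in a YAML file, regardless of indentation.
cluster_name_of() {
  awk '$1 == "cluster-name:" { print $2; exit }' "$1"
}

# Succeeds only if both files declare the same cluster name.
same_cluster_name() {
  [ "$(cluster_name_of "$1")" = "$(cluster_name_of "$2")" ]
}

# Usage (paths are an assumption, adjust to your installation):
#   same_cluster_name "$SEATUNNEL_HOME/config/hazelcast.yaml" \
#                     "$SEATUNNEL_HOME/config/hazelcast-client.yaml" \
#     || echo "cluster-name differs between server and client" >&2
```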

Start SeaTunnel Engine Server

mkdir -p $SEATUNNEL_HOME/logs
nohup bin/seatunnel-cluster.sh 2>&1 &

Logs will be written to $SEATUNNEL_HOME/logs/seatunnel-engine-server.log
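Since the server is started in the background, a script may want to wait until the log file actually appears before moving on. The following is a minimal sketch under that assumption. The function name wait_for_log is hypothetical, and it only checks that the file exists and is non-empty, not that the server has finished starting.

```shell
# Hypothetical helper: wait until a log file exists and is non-empty,
# polling once per second, for at most $2 seconds (default 30).
wait_for_log() (
  log=$1
  limit=${2:-30}
  waited=0
  while [ ! -s "$log" ]; do
    [ "$waited" -ge "$limit" ] && exit 1
    sleep 1
    waited=$((waited + 1))
  done
)

# Usage:
#   wait_for_log "$SEATUNNEL_HOME/logs/seatunnel-engine-server.log" 60 \
#     && tail -n 20 "$SEATUNNEL_HOME/logs/seatunnel-engine-server.log"
```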

To install the SeaTunnel Engine client, copy the $SEATUNNEL_HOME directory from a SeaTunnel Engine node to the client node, and configure SEATUNNEL_HOME there in the same way as on the SeaTunnel Engine Server node.

7. Deploy SeaTunnel distributed cluster

This is the recommended way to use SeaTunnel Engine in a production environment. Cluster mode supports all functions of SeaTunnel Engine and offers better performance and stability.

In cluster mode, the SeaTunnel Engine cluster needs to be deployed first, and the client submits the job to the SeaTunnel Engine cluster for operation.

Submit command

$SEATUNNEL_HOME/bin/seatunnel.sh --config $SEATUNNEL_HOME/config/v2.batch.config.template

8. Checkpoint storage

Checkpoint is a fault-tolerant recovery mechanism. This mechanism ensures that even if an exception suddenly occurs while the program is running, it can recover on its own.

Checkpoint storage is a storage mechanism for storing checkpoint data.

SeaTunnel Engine supports the following checkpoint storage types:

HDFS (OSS, S3, HDFS, LocalFile)
LocalFile (native) (DEPRECATED: use HDFS (LocalFile) instead)

We use the microkernel design pattern to separate the checkpoint storage module from the engine. This allows users to implement their own checkpoint storage module.

checkpoint-storage-api is the checkpoint storage module API, which defines the interface of the checkpoint storage module.

If you want to implement your own checkpoint storage module, you need to implement CheckpointStorage and provide the corresponding CheckpointStorageFactory implementation.

The checkpoint storage module for seatunnel-server is configured in the seatunnel.yaml file.

seatunnel:
  engine:
    checkpoint:
      storage:
        type: hdfs #plugin name of checkpoint storage, we support hdfs(S3, local, hdfs), localfile (native local file) is the default, but this plugin is de
        # plugin configuration
        plugin-config:
          namespace: #checkpoint storage parent path, the default value is /seatunnel/checkpoint/
          K1: V1 # plugin other configuration
          K2: V2 # plugin other configuration

Note: Namespaces must end with “/”.
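If the namespace value comes from a script or an environment variable, the trailing slash is easy to forget. A small helper along these lines can normalize it before the value is written into seatunnel.yaml. This is only a sketch, and the function name normalize_namespace is hypothetical.

```shell
# Hypothetical helper: make sure a checkpoint namespace ends with "/".
normalize_namespace() (
  ns=$1
  case "$ns" in
    */) printf '%s\n' "$ns" ;;
    *)  printf '%s/\n' "$ns" ;;
  esac
)

normalize_namespace /seatunnel/checkpoint    # prints /seatunnel/checkpoint/
normalize_namespace /seatunnel/checkpoint/   # prints /seatunnel/checkpoint/
```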

If you use HDFS, you can configure it like this:

seatunnel:
  engine:
    checkpoint:
      storage:
        type: hdfs
        max-retained: 3
        plugin-config:
          storage.type: hdfs
          fs.defaultFS: hdfs://localhost:9000
          # if you use Kerberos, you can configure it like this:
          kerberosPrincipal: your-kerberos-principal
          kerberosKeytab: your-kerberos-keytab

If you use local files, you can configure it like this:

seatunnel:
  engine:
    checkpoint:
      interval: 6000
      timeout: 7000
      max-concurrent: 5
      tolerable-failure: 2
      storage:
        type: hdfs
        max-retained: 3
        plugin-config:
          storage.type: hdfs
          fs.defaultFS: file:/// # Ensure that the directory has write permission

9. TCP Networking

If multicasting is not the preferred discovery method for your environment, then you can configure SeaTunnel Engine as a full TCP/IP cluster. When you configure SeaTunnel Engine to discover members via TCP/IP, you must list all or some of the members' hostnames and/or IP addresses as cluster members. You do not have to list all of these cluster members, but at least one of the listed members must be active in the cluster when a new member joins.

To configure Hazelcast as a full TCP/IP cluster, set the following configuration elements. See the tcp-ip element section for a complete description of the TCP/IP discovery configuration elements.

Set the enabled attribute of the tcp-ip element to true.
Provide your member elements within the tcp-ip element. The following is a sample declarative configuration.

hazelcast:
  network:
    join:
      tcp-ip:
        enabled: true
        member-list:
          - machine1
          - machine2
          - machine3:5799
          - 192.168.1.0-7
          - 192.168.1.21

As shown above, you can provide IP addresses or hostnames for the member elements. You can also provide a range of IP addresses, such as 192.168.1.0-7.

Instead of providing the members line by line as shown above, you can choose to use the members element and write comma-separated IP addresses as shown below.

192.168.1.0-7,192.168.1.21

If you don’t provide a port for the member, Hazelcast will automatically try port 5701, 5702 and so on.
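To see which addresses a range entry such as 192.168.1.0-7 stands for, the expansion can be sketched in shell. This assumes the range applies to the last octet and includes both ends. The function name expand_member is hypothetical, and entries that are not a last-octet range (hostnames, single IPs, host:port) are printed unchanged.

```shell
# Hypothetical helper: expand a member-list entry whose last octet is a
# range (for example 192.168.1.0-7) into one address per line.
# Assumes the range is inclusive; any other entry is printed as-is.
expand_member() (
  entry=$1
  last_part=${entry##*.}            # text after the final dot, e.g. "0-7"
  case "$last_part" in
    *[!0-9-]* | -* | *- | *-*-*)    # not of the form digits-digits
      printf '%s\n' "$entry" ;;
    *-*)
      prefix=${entry%.*}            # e.g. "192.168.1"
      first=${last_part%-*}
      last=${last_part#*-}
      i=$first
      while [ "$i" -le "$last" ]; do
        printf '%s.%s\n' "$prefix" "$i"
        i=$((i + 1))
      done ;;
    *)
      printf '%s\n' "$entry" ;;
  esac
)

expand_member 192.168.1.0-7    # prints 192.168.1.0 through 192.168.1.7, one per line
expand_member machine3:5799    # prints machine3:5799
```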

Conclusion

Configuring and deploying the SeaTunnel Engine requires careful attention to details, especially in a clustered environment. Whether it’s setting JVM options, defining the number of backups, or utilizing Hazelcast for cluster management, the SeaTunnel Engine offers robust features to help maintain stability, performance, and high availability.

For more in-depth configuration details or troubleshooting steps, refer to the official SeaTunnel documentation.
