Mastering SeaTunnel Engine Deployment: A Comprehensive Guide

Sep 5, 2023

1. Deploying the SeaTunnel Engine

The SeaTunnel Engine is the default engine for SeaTunnel. The SeaTunnel installation package already contains everything the SeaTunnel Engine needs.

2. Configuring Environment Variables

Configure the environment variables in /etc/profile.d/seatunnel.sh.

export SEATUNNEL_HOME=/path/to/seatunnel  # replace with your SeaTunnel install path
export PATH=$PATH:$SEATUNNEL_HOME/bin

3. Configuring the SeaTunnel Engine JVM

The SeaTunnel Engine provides two ways to set JVM options:

1. Add JVM options to the first line of $SEATUNNEL_HOME/bin/seatunnel-cluster.sh.

JAVA_OPTS="-Xms2G -Xmx2G"

2. Add JVM options when starting the SeaTunnel Engine, for example:

seatunnel-cluster.sh -DJvmOption="-Xms2G -Xmx2G"

4. SeaTunnel Configuration

SeaTunnel Engine offers numerous features, which need to be configured in seatunnel.yaml.

1. Backup

The SeaTunnel Engine employs Hazelcast IMDG for cluster management. The cluster’s status data (job run status, resource status) is stored in Hazelcast IMap.

Data saved in Hazelcast IMap will be distributed and stored across all nodes in the cluster. Hazelcast partitions the data stored in IMap. Each partition can specify the number of backups. Thus, the SeaTunnel Engine can achieve cluster HA without utilizing other services like ZooKeeper.

Use backup-count to define the number of synchronous backups. For instance, setting it to 1 places each partition's backup on one other member; setting it to 2 places backups on two other members.

We recommend a backup-count value of max(1, min(5, N/2)), where N is the number of cluster nodes. In other words, use N/2 clamped to the range 1 to 5.

seatunnel:
  engine:
    backup-count: 1
# other config
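As a quick check of the recommended value, here is a minimal Python sketch. It assumes the intent of the recommendation is to clamp N/2 (rounded down) to the range 1 to 5. The function name recommended_backup_count is ours and is not part of SeaTunnel.

```python
def recommended_backup_count(node_count: int) -> int:
    """Clamp node_count // 2 to the range 1..5 (our reading of the recommendation)."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    return max(1, min(5, node_count // 2))


# Print the suggested value for a few example cluster sizes.
for nodes in (1, 3, 6, 20):
    print(nodes, recommended_backup_count(nodes))
```

For a 6-node cluster this gives 3. Very small clusters fall back to 1, and large clusters are capped at 5.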

2. Slots

The number of slots determines the quantity of TaskGroups a cluster node can run concurrently. Since the SeaTunnel Engine is a data synchronization engine, most tasks are IO-intensive.

We recommend using dynamic slots.

seatunnel:
    engine:
        slot-service:
            dynamic-slot: true
        # other config

3. Checkpoints

Similar to Flink, the SeaTunnel Engine supports the Chandy–Lamport algorithm. As such, the SeaTunnel Engine can synchronize data without data loss or duplication.

Interval: The gap between two checkpoints, in milliseconds. If the checkpoint.interval parameter is set in the job configuration file, the value set here will be overridden.

Timeout: Checkpoint timeout. A checkpoint failure is triggered if a checkpoint cannot be completed within this period.

Maximum concurrency: Specifies the maximum number of checkpoints that can run concurrently.

Tolerable failures: Maximum number of retry attempts after a checkpoint failure.

seatunnel:
    engine:
        backup-count: 1
        print-execution-info-interval: 10
        slot-service:
            dynamic-slot: true
        checkpoint:
            interval: 300000
            timeout: 10000
            max-concurrent: 1
            tolerable-failure: 2
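The precedence rule for the interval can be sketched in a few lines of Python. This only illustrates the rule described above (a checkpoint.interval in the job configuration wins over the engine-level value) and is not the engine's actual code. The function name and the plain dictionaries standing in for the two configuration files are assumptions.

```python
def effective_checkpoint_interval(engine_cfg: dict, job_env: dict) -> int:
    """Return the interval in milliseconds that would apply to a job."""
    job_value = job_env.get("checkpoint.interval")
    if job_value is not None:
        # The job-level setting overrides the value from seatunnel.yaml.
        return job_value
    return engine_cfg["seatunnel"]["engine"]["checkpoint"]["interval"]


# Dictionary mirroring the checkpoint part of the seatunnel.yaml example.
engine_cfg = {"seatunnel": {"engine": {"checkpoint": {"interval": 300000}}}}

print(effective_checkpoint_interval(engine_cfg, {}))
print(effective_checkpoint_interval(engine_cfg, {"checkpoint.interval": 10000}))
```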

5. Configuring the SeaTunnel Engine Server

All SeaTunnel Engine server configurations are in hazelcast.yaml.

The SeaTunnel Engine nodes use the cluster name to determine if they are part of the same cluster. If two nodes have different cluster names, the SeaTunnel Engine will refuse service requests.

Using Hazelcast, the SeaTunnel Engine cluster comprises the network of cluster members running the SeaTunnel Engine Server. Cluster members automatically connect to form a cluster. This auto-joining is facilitated by various discovery mechanisms used by the cluster members to find each other.

Please note, once the cluster is formed, member-to-member communication is always via TCP/IP, regardless of the discovery method used.

The SeaTunnel Engine supports TCP/IP discovery and can be configured as a full TCP/IP cluster as follows:

hazelcast:
  cluster-name: seatunnel
  network:
    join:
      tcp-ip:
        enabled: true
        member-list:
          - hostname1
    port:
      auto-increment: false
      port: 5801
  properties:
    hazelcast.logging.type: log4j2

TCP is our recommended method for standalone SeaTunnel Engine clusters.

The IMap data can also be persisted to external storage through a map-store, which is configured with the following properties:

Type: The IMap persistence type. Currently only hdfs is supported.

Namespace: Distinguishes the storage locations of different business data, such as the OSS bucket name.

Cluster Name (clusterName): Used primarily for cluster isolation. It differentiates clusters, such as cluster1 and cluster2, and can also separate different businesses.

fs.defaultFS: Files are read and written through the HDFS API, so an HDFS configuration must be provided for this storage.

For HDFS usage, you can configure it like this:

map:
    engine*:
       map-store:
         enabled: true
         initial-mode: EAGER
         factory-class-name: org.apache.seatunnel.engine.server.persistence.FileMapStoreFactory
         properties:
           type: hdfs
           namespace: /tmp/seatunnel/imap
           clusterName: seatunnel-cluster
           fs.defaultFS: hdfs://localhost:9000

If HDFS is not available and your cluster has only one node, you can configure it to use local files like this:

map:
    engine*:
       map-store:
         enabled: true
         initial-mode: EAGER
         factory-class-name: org.apache.seatunnel.engine.server.persistence.FileMapStoreFactory
         properties:
           type: hdfs
           namespace: /tmp/seatunnel/imap
           clusterName: seatunnel-cluster
           fs.defaultFS: file:///

6. Cluster Start and Stop

The client's cluster-name must be the same as the SeaTunnel Engine's. Otherwise, the SeaTunnel Engine will reject the client's requests.

Cluster members

All SeaTunnel Engine server node addresses must be listed under cluster-members.

hazelcast-client:
  cluster-name: seatunnel
  properties:
      hazelcast.logging.type: log4j2
  network:
    cluster-members:
      - hostname1:5801

Start the SeaTunnel Engine Server

mkdir -p $SEATUNNEL_HOME/logs
nohup $SEATUNNEL_HOME/bin/seatunnel-cluster.sh 2>&1 &

Logs are written to $SEATUNNEL_HOME/logs/seatunnel-engine-server.log.

To set up a client node, copy the $SEATUNNEL_HOME directory from a SeaTunnel Engine node to the client node, and configure SEATUNNEL_HOME there the same way as on the SeaTunnel Engine server node.

7. Deploy SeaTunnel distributed cluster

This is the recommended way to use the SeaTunnel Engine in a production environment. Cluster mode supports all SeaTunnel Engine features and offers better performance and stability.

In cluster mode, the SeaTunnel Engine cluster must be deployed first. The client then submits jobs to the SeaTunnel Engine cluster for execution.

Submit command

$SEATUNNEL_HOME/bin/seatunnel.sh --config $SEATUNNEL_HOME/config/v2.batch.config.template
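If you submit jobs from a script rather than by hand, the command can be assembled programmatically. The following is a minimal Python sketch that assumes the client script bin/seatunnel.sh and its --config option. The helper name build_submit_command and the /opt/seatunnel fallback path are hypothetical.

```python
import os


def build_submit_command(seatunnel_home: str, job_config: str) -> list:
    """Build the argument list for submitting one job config to the cluster."""
    script = f"{seatunnel_home}/bin/seatunnel.sh"
    return [script, "--config", job_config]


# The fallback path is only a placeholder for when SEATUNNEL_HOME is unset.
home = os.environ.get("SEATUNNEL_HOME", "/opt/seatunnel")
cmd = build_submit_command(home, f"{home}/config/v2.batch.config.template")
print(" ".join(cmd))
# To actually submit the job: import subprocess; subprocess.run(cmd, check=True)
```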

8. Checkpoint storage

Checkpoint is a fault-tolerant recovery mechanism. This mechanism ensures that even if an exception suddenly occurs while the program is running, it can recover on its own.

Checkpoint storage is the storage mechanism that holds checkpoint data.

SeaTunnel Engine supports the following checkpoint storage types:

HDFS (OSS, S3, HDFS, LocalFile)
LocalFile (native) (DEPRECATED: use HDFS (LocalFile) instead)

We use the microkernel design pattern to separate the checkpoint storage module from the engine. This allows users to implement their own checkpoint storage module.

checkpoint-storage-api is the checkpoint storage module API, which defines the interface of the checkpoint storage module.

If you want to implement your own checkpoint storage module, you need to implement CheckpointStorage and provide the corresponding CheckpointStorageFactory implementation.

The checkpoint storage module for seatunnel-server is configured in the seatunnel.yaml file.

seatunnel:
    engine:
        checkpoint:
            storage:
                # plugin name of the checkpoint storage; hdfs (S3, local, hdfs) is supported,
                # localfile (native local file) is the default, but that plugin is deprecated
                type: hdfs
                # plugin configuration
                plugin-config:
                    namespace: # checkpoint storage parent path, the default value is /seatunnel/checkpoint/
                    K1: V1 # other plugin configuration
                    K2: V2 # other plugin configuration

Note: Namespaces must end with “/”.
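If the namespace value is generated by a script or template, the trailing slash is easy to lose. A minimal Python sketch of a guard follows. The helper name normalize_namespace is hypothetical and not part of SeaTunnel.

```python
def normalize_namespace(namespace: str) -> str:
    """Return the namespace with exactly the trailing '/' that the config requires."""
    if not namespace:
        raise ValueError("namespace must not be empty")
    return namespace if namespace.endswith("/") else namespace + "/"


print(normalize_namespace("/seatunnel/checkpoint"))
print(normalize_namespace("/seatunnel/checkpoint/"))
```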

If you use HDFS, you can configure it like this:

seatunnel:
  engine:
    checkpoint:
      storage:
        type: hdfs
        max-retained: 3
        plugin-config:
          storage.type: hdfs
          fs.defaultFS: hdfs://localhost:9000
          # if you use Kerberos, you can configure it like this:
          kerberosPrincipal: your-kerberos-principal
          kerberosKeytab: your-kerberos-keytab

If you use local files, you can configure it like this:

seatunnel:
  engine:
    checkpoint:
      interval: 6000
      timeout: 7000
      max-concurrent: 5
      tolerable-failure: 2
      storage:
        type: hdfs
        max-retained: 3
        plugin-config:
          storage.type: hdfs
          fs.defaultFS: file:/// # Ensure that the directory has write permission
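Before starting the engine with local-file checkpoint storage, you may want to confirm that the target directory is writable. Here is a minimal Python sketch. The helper name is_writable_dir is hypothetical, and the check only reflects the user that runs the script, so run it as the same user that runs the SeaTunnel Engine.

```python
import os
import tempfile


def is_writable_dir(path: str) -> bool:
    """True if path is an existing directory the current user can create files in."""
    return os.path.isdir(path) and os.access(path, os.W_OK | os.X_OK)


# Demo on a throwaway directory; in practice pass your checkpoint namespace path.
with tempfile.TemporaryDirectory() as demo_dir:
    print(is_writable_dir(demo_dir))
    print(is_writable_dir(os.path.join(demo_dir, "missing")))
```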

9. TCP Networking

If multicasting is not the preferred discovery method for your environment, you can configure the SeaTunnel Engine as a full TCP/IP cluster. When you configure the SeaTunnel Engine to discover members via TCP/IP, you must list the hostnames and/or IP addresses of all or some of the members. You do not have to list every cluster member, but at least one of the listed members must be active in the cluster when a new member joins.

To configure Hazelcast as a full TCP/IP cluster, set the following configuration elements. See the tcp-ip element section of the Hazelcast documentation for a complete description of the TCP/IP discovery configuration elements.

Set the enabled attribute of the tcp-ip element to true.
Provide your member elements within the tcp-ip element. The following is a sample declarative configuration.

hazelcast:
  network:
    join:
      tcp-ip:
        enabled: true
        member-list:
          - machine1
          - machine2
          - machine3:5799
          - 192.168.1.0-7
          - 192.168.1.21

As shown above, you can provide IP addresses or hostnames for the member elements. You can also provide a range of IP addresses, such as 192.168.1.0-7.

Instead of providing the members line by line as shown above, you can choose to use the members element and write comma-separated IP addresses as shown below.

192.168.1.0-7,192.168.1.21

If you don’t provide a port for the member, Hazelcast will automatically try port 5701, 5702 and so on.
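To make the range notation concrete, here is a minimal Python sketch that expands a comma-separated members string into individual addresses. Hazelcast performs this expansion itself, so the sketch only illustrates what a last-octet range such as 192.168.1.0-7 stands for. The function names are ours.

```python
def expand_member(entry: str) -> list:
    """Expand one entry; a range like '192.168.1.0-7' becomes eight addresses."""
    host, sep, port = entry.strip().partition(":")
    prefix, dot, last = host.rpartition(".")
    start, dash, end = last.partition("-")
    if dot and dash and start.isdigit() and end.isdigit():
        # Last-octet range: keep the prefix and any port, vary the last octet.
        return [f"{prefix}.{i}{sep}{port}" for i in range(int(start), int(end) + 1)]
    # Plain hostname or single address: return it unchanged.
    return [host + sep + port]


def expand_members(members: str) -> list:
    """Expand a comma-separated members string into a flat list of entries."""
    result = []
    for entry in members.split(","):
        result.extend(expand_member(entry))
    return result


print(expand_members("192.168.1.0-7,192.168.1.21"))
```

Entries without a range, such as machine3:5799, pass through unchanged.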

Conclusion

Configuring and deploying the SeaTunnel Engine requires careful attention to details, especially in a clustered environment. Whether it’s setting JVM options, defining the number of backups, or utilizing Hazelcast for cluster management, the SeaTunnel Engine offers robust features to help maintain stability, performance, and high availability.

For more in-depth configuration details or troubleshooting steps, refer to the official SeaTunnel documentation.