Set up Apache SeaTunnel (Incubating) with Kubernetes

Apache SeaTunnel · Jun 30, 2022

This article is reprinted from Dr. Gezim Sejdiu, Tech Lead Data Engineer @DPDHL and Assistant Professor @UniversumCollege; PhD from @UniBonn | @SDA_Research.

This post provides a quick guide to using Apache SeaTunnel with Kubernetes.

Prerequisites

We assume that you have local installations of Docker, Kubernetes, and Helm, so that the kubectl and helm commands are available on your local system.

For Kubernetes, minikube is our choice; at the time of writing we are using version v1.23.3. You can start a cluster with the following command:

minikube start --kubernetes-version=v1.23.3
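Before moving on, it does not hurt to confirm that the cluster is up and that kubectl points at it:

minikube status
kubectl get nodes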

Installation

Apache SeaTunnel docker image

To build a Flink image that bundles Apache SeaTunnel, first create a Dockerfile:

FROM flink:1.13

ENV SEATUNNEL_VERSION="2.1.0"

RUN wget https://archive.apache.org/dist/incubator/seatunnel/${SEATUNNEL_VERSION}/apache-seatunnel-incubating-${SEATUNNEL_VERSION}-bin.tar.gz
RUN tar -xzvf apache-seatunnel-incubating-${SEATUNNEL_VERSION}-bin.tar.gz
RUN mkdir -p $FLINK_HOME/usrlib
RUN cp apache-seatunnel-incubating-${SEATUNNEL_VERSION}/lib/seatunnel-core-flink.jar $FLINK_HOME/usrlib/seatunnel-core-flink.jar
RUN rm -fr apache-seatunnel-incubating-${SEATUNNEL_VERSION}*

Then run the following command to build the image:

docker build -t seatunnel:2.1.0-flink-1.13 -f Dockerfile .
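As a quick check (not shown in the original walk-through), the new tag should now appear in your local Docker image list:

docker images | grep seatunnel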

The image seatunnel:2.1.0-flink-1.13 needs to be present on the host (minikube) so that the deployment can take place.

Load the image into minikube via:

minikube image load seatunnel:2.1.0-flink-1.13
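If you want to double-check that the load succeeded, minikube can list the images available inside the cluster:

minikube image ls | grep seatunnel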

Deploying Flink operator

The steps below provide a quick walk-through on setting up the Flink Kubernetes Operator.

Install the certificate manager on your Kubernetes cluster to enable adding the webhook component (only needed once per Kubernetes cluster):

kubectl create -f https://github.com/jetstack/cert-manager/releases/download/v1.7.1/cert-manager.yaml
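Give cert-manager a moment to start; its pods land in the cert-manager namespace and should all reach Running before you install the operator:

kubectl get pods -n cert-manager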

Now you can deploy the latest stable Flink Kubernetes Operator version using the included Helm chart:

helm repo add flink-operator-repo https://downloads.apache.org/flink/flink-kubernetes-operator-0.1.0/
helm install flink-kubernetes-operator flink-operator-repo/flink-kubernetes-operator

You may verify your installation via kubectl:

kubectl get pods
NAME READY STATUS RESTARTS AGE
flink-kubernetes-operator-5f466b8549-mgchb 1/1 Running 3 (23h ago) 16d
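The operator also registers the FlinkDeployment custom resource that the manifest below relies on; a quick way to confirm it is present (assuming the CRD name used by operator 0.1.0) is:

kubectl get crd flinkdeployments.flink.apache.org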

Run Apache SeaTunnel Application

Apache SeaTunnel already provides out-of-the-box configurations.

In this guide, we are going to use flink.streaming.conf:

env {
  execution.parallelism = 1
}

source {
  FakeSourceStream {
    result_table_name = "fake"
    field_name = "name,age"
  }
}

transform {
  sql {
    sql = "select name,age from fake"
  }
}

sink {
  ConsoleSink {}
}

This configuration needs to be present when we deploy the application (Apache SeaTunnel) to the Flink cluster on Kubernetes. We also need to configure a Pod to use a PersistentVolume for storage.

  • Create /mnt/data on your Node. Open a shell to the single Node in your cluster. How you open a shell depends on how you set up your cluster; in our case, since we are using minikube, you can open a shell to your Node by entering minikube ssh. In your shell on that Node, create a /mnt/data directory:

minikube ssh
# This assumes that your Node uses "sudo" to run commands as the superuser
sudo mkdir /mnt/data
  • Copy the application (Apache SeaTunnel) configuration file to your Node (a quick check of the result follows below):

minikube cp flink.streaming.conf /mnt/data/flink.streaming.conf
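To be sure the file ended up where the deployment expects it, you can list the directory on the Node (the exact minikube ssh invocation may vary slightly with your minikube version):

minikube ssh "ls -l /mnt/data"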

Once the Flink Kubernetes Operator is running, as seen in the previous steps, you are ready to submit a Flink (Apache SeaTunnel) job:

  • Create a seatunnel-flink.yaml FlinkDeployment manifest:
apiVersion: flink.apache.org/v1alpha1
kind: FlinkDeployment
metadata:
  namespace: default
  name: seatunnel-flink-streaming-example
spec:
  image: seatunnel:2.1.0-flink-1.13
  flinkVersion: v1_14
  flinkConfiguration:
    taskmanager.numberOfTaskSlots: "2"
  serviceAccount: flink
  jobManager:
    replicas: 1
    resource:
      memory: "2048m"
      cpu: 1
  taskManager:
    resource:
      memory: "2048m"
      cpu: 2
  podTemplate:
    spec:
      containers:
        - name: flink-main-container
          volumeMounts:
            - mountPath: /data
              name: config-volume
      volumes:
        - name: config-volume
          hostPath:
            path: "/mnt/data"
            type: Directory
  job:
    jarURI: local:///opt/flink/usrlib/seatunnel-core-flink.jar
    entryClass: org.apache.seatunnel.SeatunnelFlink
    args: ["--config", "/data/flink.streaming.conf"]
    parallelism: 2
    upgradeMode: stateless
  • Run the example application (a quick status check is sketched right after this step):
kubectl apply -f seatunnel-flink.yaml
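As a small status check that is not part of the original guide, you can watch the FlinkDeployment resource and its pods come up (the resource name matches the manifest above):

kubectl get flinkdeployment seatunnel-flink-streaming-example
kubectl get pods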

See The Output

You may follow the logs of your job; after a successful startup (which can take on the order of a minute in a fresh environment, but only seconds afterwards) you can run:

kubectl logs -f deploy/seatunnel-flink-streaming-example

To expose the Flink Dashboard, you may add a port-forward rule:

kubectl port-forward svc/seatunnel-flink-streaming-example-rest 8081

Now, the Flink Dashboard is accessible at localhost:8081.

Or launch minikube dashboard for a web-based Kubernetes user interface.

The content printed in the TaskManager Stdout log:

kubectl logs \
-l 'app in (seatunnel-flink-streaming-example), component in (taskmanager)' \
--tail=-1 \
-f

looks like the below (your content may be different since we use FakeSourceStream to automatically generate random stream data):

+I[Kid Xiong, 1650316786086]
+I[Ricky Huo, 1650316787089]
+I[Ricky Huo, 1650316788089]
+I[Ricky Huo, 1650316789090]
+I[Kid Xiong, 1650316790090]
+I[Kid Xiong, 1650316791091]
+I[Kid Xiong, 1650316792092]

To stop your job and delete your FlinkDeployment, you can simply run:

kubectl delete -f seatunnel-flink.yaml
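If you also want to tear down the supporting components installed earlier (this goes beyond the original guide), the operator and cert-manager can be removed the same way they were installed:

helm uninstall flink-kubernetes-operator
kubectl delete -f https://github.com/jetstack/cert-manager/releases/download/v1.7.1/cert-manager.yaml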

A simplified version of this post has been contributed to Apache SeaTunnel already.

Happy SeaTunneling!

About Apache SeaTunnel

Apache SeaTunnel (formerly Waterdrop) is an easy-to-use, ultra-high-performance distributed data integration platform that supports real-time synchronization of massive amounts of data and can stably and efficiently synchronize hundreds of billions of records per day.

Why do we need Apache SeaTunnel?

Apache SeaTunnel does everything it can to solve the problems you may encounter in synchronizing massive amounts of data.

  • Data loss and duplication
  • Task buildup and latency
  • Low throughput
  • Long application-to-production cycle time
  • Lack of application status monitoring

Apache SeaTunnel Usage Scenarios

  • Massive data synchronization
  • Massive data integration
  • ETL of large volumes of data
  • Massive data aggregation
  • Multi-source data processing

Features of Apache SeaTunnel

  • Rich components
  • High scalability
  • Easy to use
  • Mature and stable

How to get started with Apache SeaTunnel quickly?

Want to experience Apache SeaTunnel quickly? SeaTunnel 2.1.0 takes 10 seconds to get you up and running.

https://seatunnel.apache.org/docs/2.1.0/developement/setup

How can I contribute?

We invite everyone who is interested in open source to join the Apache SeaTunnel contributor family and foster open source together!

Submit an issue:

https://github.com/apache/incubator-seatunnel/issues

Contribute code to:

https://github.com/apache/incubator-seatunnel/pulls

Subscribe to the community development mailing list:

dev-subscribe@seatunnel.apache.org

Development mailing list:

dev@seatunnel.apache.org

Join Slack:

https://join.slack.com/t/apacheseatunnel/shared_invite/zt-1kcxzyrxz-lKcF3BAyzHEmpcc4OSaCjQ

Follow Twitter:

https://twitter.com/ASFSeaTunnel

Come and join us!
