Apache Open-source Projects in Modern Data Stacks

Editor: Detong, github.com/mischaZhang
The modern data stack is rebuilding the enterprise data ecosystem. It includes data engines such as Hadoop, Snowflake, and Databricks; data lake and real-time engines such as Flink, Spark Streaming, and Amazon Kinesis; and many other cloud services that support data processing. There are also many BI tools and modern data applications.
Together, these tools are the building blocks of a capable big-data platform.
DataOps technologies cover data orchestration, integration, transformation, and governance. DolphinScheduler, Airflow, Airbyte, Fivetran, Apache SeaTunnel, dbt, Collibra, and Bigeye are tools and platforms commonly used in DataOps.
Apache SeaTunnel (incubating) is a project focused on synchronizing data and connecting it to different systems, and Apache DolphinScheduler is a data orchestration system.
This is the big picture of Apache projects in the modern data stack and enterprise DataOps. There are two types of data: stream data and batch data.
Stream data can be generated by sensors, IoT devices, web logs, social media, and ERP and CRM systems. Apache SeaTunnel connectors deliver stream data to downstream consumers, the Apache SeaTunnel agent collects web logs and social-media feeds, and Apache SeaTunnel CDC (change data capture) captures changes from database binlogs.
Apache SeaTunnel Engine, Flink, and Spark are used for filtering, loading, normalizing, and sinking stream data to targets such as Kafka, alert systems, or other databases. Data in Kafka can then be loaded into databases such as Elasticsearch and ClickHouse and visualized with Kibana or Superset.
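To make this flow concrete, here is a minimal, hand-rolled sketch of the kind of glue code a stream sink connector replaces: it consumes events from a Kafka topic, filters them, and writes them into Elasticsearch. The topic name, index name, hosts, and filter rule are all hypothetical.

```python
# Hand-rolled sketch of what a Kafka-to-Elasticsearch sink does; the topic,
# index, and hosts are hypothetical. A SeaTunnel connector replaces this glue.
import json

from kafka import KafkaConsumer          # pip install kafka-python
from elasticsearch import Elasticsearch  # pip install elasticsearch

consumer = KafkaConsumer(
    "weblogs",                                   # hypothetical topic name
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)
es = Elasticsearch("http://localhost:9200")

for message in consumer:
    event = message.value
    if event.get("path") == "/healthz":          # drop health-check noise
        continue
    es.index(index="weblogs", document=event)    # 8.x client; 7.x uses body=
```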
In some cases, real-time data needs to be copied into batch storage. Apache SeaTunnel can do this by copying the real-time stream into Hudi or Hive; the data can then be used for real-time data warehousing or loaded further into batch data systems with Apache SeaTunnel.
Batch data can come from unstructured data such as files and message queues, or from relational databases, cloud databases, NoSQL databases, and big data systems. Batch data can be collected by Apache SeaTunnel and stored in HDFS or S3 for later use, or loaded by Apache SeaTunnel into Hive, Apache Hudi, and other data lakes for further transformation.
You can either load unstructured batch data into data lakes and from there into data warehouses such as Snowflake and Teradata, or use Apache SeaTunnel to load unstructured data into data warehouses directly, without a data lake in between.
To use the data and run computations, an orchestration system such as DolphinScheduler is required to orchestrate the jobs. DolphinScheduler can orchestrate batch workflows, streaming workflows, and MLOps workflows.
With DolphinScheduler scheduling Apache SeaTunnel sink jobs, batch data can be fed back to Oracle, SAP, SaaS systems, social media, and relational databases.
- Apache SeaTunnel (incubating)
Apache SeaTunnel (incubating) is a project for data synchronization, supporting 50+ data sources and destinations such as MySQL, Presto, PostgreSQL, TiDB, and Elasticsearch. Apache SeaTunnel supports Spark, Flink, and the Apache SeaTunnel Engine as execution engines.
Apache SeaTunnel uses declarative job definitions to generate Flink, Spark, and Apache SeaTunnel Engine jobs, supporting both streaming and batch data synchronization.
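As a purely conceptual sketch (not SeaTunnel's real configuration syntax or API), the idea is that one source/transform/sink definition can be handed to whichever engine is selected, rather than being rewritten per engine:

```python
# Conceptual sketch only -- not SeaTunnel's actual config syntax or API.
# One declarative source/transform/sink definition, dispatched to a chosen engine.
pipeline = {
    "source":    {"type": "jdbc", "url": "jdbc:mysql://db:3306/shop", "table": "orders"},
    "transform": [{"type": "sql", "query": "SELECT id, amount FROM orders WHERE amount > 0"}],
    "sink":      {"type": "elasticsearch", "index": "orders"},
}

def submit(job: dict, engine: str) -> None:
    """Pretend dispatcher: the same definition could target Flink, Spark,
    or the SeaTunnel Engine without being rewritten."""
    assert engine in {"flink", "spark", "seatunnel-engine"}
    print(f"submitting {job['source']['type']} -> {job['sink']['type']} job to {engine}")

submit(pipeline, engine="seatunnel-engine")
```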
Apache SeaTunnel (open source) has two major use cases.
1. High-Volume, High-Frequency Loading for Big Data Platforms
Bilibili performs high-frequency, multi-source data extraction and loading, especially daily synchronization between databases and the data warehouse. Average daily data volume exceeds 100 TB, with hundreds of billions of records.
Bilibili is using Apache SeaTunnel for bulk loading data into ClickHouse.
Loading data into ClickHouse row by row can be slow and inefficient. With Apache SeaTunnel, data is loaded into ClickHouse efficiently: instead of inserting each row, Apache SeaTunnel can generate ClickHouse data files directly and copy them into the ClickHouse cluster, which makes insertion much faster. We call this a bulk load into ClickHouse. Apache SeaTunnel can also bulk-load data into other databases and systems, such as HBase.
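The difference is easy to see with the ClickHouse Python driver: one INSERT per row pays a round trip for every record, while handing the server large batches (or, as SeaTunnel's bulk load does, prebuilt data files) is far faster. The host and table names below are hypothetical.

```python
from clickhouse_driver import Client  # pip install clickhouse-driver

client = Client(host="localhost")     # hypothetical ClickHouse server
rows = [(i, f"user_{i}") for i in range(100_000)]

# Slow pattern: one INSERT (and one round trip) per row.
# for row in rows:
#     client.execute("INSERT INTO events (id, user) VALUES", [row])

# Faster pattern: send the rows as one large batch so ClickHouse receives a
# few big blocks. SeaTunnel's bulk load goes further and writes data files
# that are copied into the cluster directly.
client.execute("INSERT INTO events (id, user) VALUES", rows)
```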
2. Real-time Heterogeneous Data Synchronization
Vip.com and Didi use Apache SeaTunnel to synchronize real-time data from various data sources like MySQL, log files, Presto, Kafka, Spark, ClickHouse, and Hudi to other data systems, covering dozens of clusters.
The design goals for Apache SeaTunnel are to deliver an easy-to-use, distributed, scalable data integration platform that supports ultra-large volumes of data with high throughput and low latency.
Apache SeaTunnel solves lots of problems:
- Various data sources
There are hundreds of data sources, their versions are often incompatible, new ones are constantly emerging, and enterprises typically run many data systems at once. Apache SeaTunnel integrates a wide range of data connectors, so using it saves the time that would otherwise go into building and maintaining connectors in-house.
- Conflicts between batch loading and CDC
To start synchronizing data between two systems, we first need to batch-load all existing data from the source into the destination. New data written to the source afterwards is then loaded into the destination through a CDC (change data capture) process.
Without Apache SeaTunnel, this requires two separate pipelines. Apache SeaTunnel can handle both the batch load and the CDC stream, so users only need to define the pipeline once.
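A toy sketch of this "snapshot first, then change stream" pattern, using in-memory dictionaries rather than real connectors:

```python
# In-memory stand-ins, not SeaTunnel APIs: they only illustrate why one
# pipeline definition can cover both the initial batch load and the CDC phase.
source_table = {1: "alice", 2: "bob"}                 # pretend upstream table
change_log = [("insert", 3, "carol"),                 # pretend binlog entries
              ("update", 1, "alice-renamed")]
target_table = {}                                     # pretend downstream table

def sync() -> None:
    # Phase 1: batch-load everything that already exists.
    target_table.update(source_table)
    # Phase 2: replay changes captured after the snapshot (the CDC phase).
    for op, key, value in change_log:
        if op in ("insert", "update"):
            target_table[key] = value
        elif op == "delete":
            target_table.pop(key, None)

sync()
print(target_table)  # {1: 'alice-renamed', 2: 'bob', 3: 'carol'}
```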
- The technology stack is complex
Users do not have to write Scala or SQL in Apache SeaTunnel; jobs are described in a simple scripting language, and UI support is planned for the future.
- Quality and monitoring
Apache SeaTunnel ensures data consistency. Users can roll back to a checkpoint and resume data processing from there.
- Difficult to manage and maintain
Managing many data processing jobs is hard. Different data synchronization pipelines are often scattered across different places, which makes them difficult to maintain.
Apache SeaTunnel not only helps users manage and maintain data synchronization jobs, but also manages data splitting and helps switch between offline batch synchronization and real-time CDC.
Apache SeaTunnel now supports 50+ connectors, including 20+ data sources, 20+ types of sinks, and 10+ types of transforms. Next year, the number of connectors is expected to reach 150.
The Apache SeaTunnel community is active. We doubled the number of supported connectors this year, covering data systems such as InfluxDB, Iceberg, MongoDB, ClickHouse, Doris, and Kudu.
Apache SeaTunnel supports JSON and many other data transformations, and Flink, Spark, and the Apache SeaTunnel Engine can all be used as processing engines. The Apache SeaTunnel Engine is more efficient than the others when no transformation is applied during synchronization.
Apache SeaTunnel provides low-latency data transformation based on the real-time or micro-batch processing of the Apache SeaTunnel Engine, and it parallelizes the Source/Transform/Sink stages to improve throughput. Apache SeaTunnel implements a distributed snapshot algorithm, two-phase commit (2PC), and idempotent writes to guarantee exactly-once delivery.
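For intuition only (this is not SeaTunnel code), keyed, idempotent writes are what make replays after a failure harmless: reapplying the same batch simply overwrites the same keys instead of duplicating rows.

```python
# Toy illustration of idempotent, keyed writes: replaying a batch after a
# checkpoint recovery overwrites the same keys rather than adding duplicates.
target = {}

def idempotent_write(records):
    for record in records:
        target[record["id"]] = record   # upsert by primary key

batch = [{"id": 1, "amount": 10}, {"id": 2, "amount": 20}]
idempotent_write(batch)
idempotent_write(batch)                 # simulated replay after a failure
print(len(target))                      # still 2, not 4
```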
Website: https://seatunnel.apache.org
GitHub: https://github.com/apache/incubator-seatunnel
Slack: https://apacheseatunnel.slack.com
Twitter: https://twitter.com/asfseatunnel
Video: https://space.bilibili.com/1542095008
- Apache DolphinScheduler
DolphinScheduler is an orchestration tool.
For Linux users, scheduling tasks and scripts with cron makes the scripts hard to manage, and deploying them can be complicated; frequent script updates can also lead to instability. Other schedulers such as Azkaban and Airflow perform poorly when managing very large volumes of tasks and lack multi-cloud data management.
DolphinScheduler provides visual task scheduling across many task types. Its decentralized design gives the scheduling system high stability and availability, and it can stably support millions of tasks running simultaneously.
DolphinScheduler has more than 1000 active users and 350+ community contributors.
DolphinScheduler lets users build workflows by dragging and dropping tasks, without coding. The picture below shows a workflow with multiple tasks, with a shell task connected to a SQL task. We call the whole graph a DAG. A DAG can be complicated: it can contain sub-workflows, and a DAG can be reused like a function by other workflows.
DolphinScheduler provides high performance and high stability with multiple masters and workers. It supports multiple master servers and worker servers, and each master manages its own tasks and jobs; the masters are not in an active-standby or master-slave arrangement. DolphinScheduler uses a hash algorithm to assign tasks and workflows to the different master servers.
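The following is an illustrative sketch of hash-based assignment, not DolphinScheduler's actual implementation: each workflow hashes to one of the registered masters, so scheduling work is spread across nodes instead of concentrating on a single active master. The master names are hypothetical.

```python
import zlib

masters = ["master-1", "master-2", "master-3"]   # hypothetical registered masters

def owning_master(workflow_id: str) -> str:
    # Hash the workflow id into one of the available master slots.
    slot = zlib.crc32(workflow_id.encode("utf-8")) % len(masters)
    return masters[slot]

print(owning_master("daily_etl"))
print(owning_master("ml_training"))
```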
DolphinScheduler is trusted by industry leaders around the world, such as IBM, Tencent, and Cisco.
DolphinScheduler is typically used in 3 cases.
1. High-Performance, High-Volume Task Scheduling
China Unicom originally used an enterprise scheduling system, combined with Shell (HiveSQL), to support data processing and task scheduling for its global data platform. After comparing Airflow, Azkaban, and other commercial schedulers, China Unicom finally chose DolphinScheduler.
DolphinScheduler lets China Unicom schedule tasks by province and city. The number of workflows and tasks is enormous, and DolphinScheduler fits the business and scheduling requirements well while supporting the large data volume cost-effectively.
2. Global Cloud Deployment with Ease of Use for Data Consumers
SHEIN originally used Airflow to schedule global tasks. But Airflow has a centralized design, lacks visualization and Kubernetes support, and is hard to deploy globally in cloud-native environments. SHEIN migrated from Airflow to DolphinScheduler for its global cloud deployment, Kubernetes support, decentralized architecture that ensures stability, and friendly design for data users.
DolphinScheduler supports SHEIN’s 50,000 tasks on Kubernetes. SHEIN’s data scientists and data analysts use DolphinScheduler to orchestrate data tasks and workflows without coding.
3. AI/ML Orchestration
DolphinScheduler supports Litchi FM and 360's data science platform. With DolphinScheduler's DAG execution engine, machine learning tasks, AI training tasks, and big data tasks can be managed and reused. With DolphinScheduler's low-code IDE, Litchi FM easily connects its data acquisition process to model training tasks.
DolphinScheduler is coming up with a new version, 3.1.0.
DolphinScheduler 2.x versions provide simple, WYSIWYG (What You See Is What You Get) workflows, high reliability, rich workflow functions, and a cloud-native, extensible design. Version 3.1.0 adds more machine-learning orchestration, data stream functions, and Python and YAML workflow definitions.
In 3.1.0, users can use DolphinScheduler for data preparation and MLOps, with support for MLflow, SageMaker, DVC, Jupyter, and PyTorch. DolphinScheduler will add Kubeflow, TensorFlow, and BentoML this year. DolphinScheduler 3.1.0 supports Flink and Spark streaming and data-stream workflow management. In 3.1.0, DolphinScheduler workflows can be generated directly from Python or YAML, which makes code review and version management easy.
PyDolphinScheduler is the Python API for Apache DolphinScheduler; users can define workflows with Python or YAML code, also known as workflow-as-code. Workflow instances can then be managed through DolphinScheduler.
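Below is a minimal workflow-as-code sketch modeled on the PyDolphinScheduler tutorial; module paths and class names differ between releases (newer versions rename ProcessDefinition to Workflow), so treat this as an approximation and check the documentation for the version you install. The task names and commands are hypothetical.

```python
# Sketch modeled on the PyDolphinScheduler tutorial; names may vary by version.
from pydolphinscheduler.core.process_definition import ProcessDefinition
from pydolphinscheduler.tasks.shell import Shell

with ProcessDefinition(name="tutorial", tenant="tenant_exists") as workflow:
    extract = Shell(name="extract", command="echo extract data")
    load = Shell(name="load", command="echo load data")
    extract >> load   # ">>" declares the dependency edge in the DAG
    workflow.run()    # submit the workflow to the DolphinScheduler API server
```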
DolphinScheduler MLOps Orchestration provides orchestration capabilities for MLOps, supports a variety of machine learning-related task plugins, and helps users build machine learning platforms and connect big data platforms efficiently and at a low cost.
DolphinScheduler can orchestrate a user's existing machine learning tasks, offers out-of-the-box orchestration for mainstream MLOps projects, provides preset algorithmic capabilities through open-source machine learning projects, and can connect machine learning platforms with big data platforms.
Users can use shell or EMR to prepare data, then use PyTorch for machine learning tasks.
Users can connect machine learning workflows to Spark and SageMaker. SageMaker is treated as a task in DolphinScheduler.
DolphinScheduler provides plugins supporting machine learning workflows: it helps data scientists manage data with DVC and SageMaker and manage models with MLflow and SageMaker. DolphinScheduler supports feature stores such as OpenMLDB and SageMaker. Users can train models with Shell, Python, Jupyter, MLflow, PyTorch, and SageMaker tasks and deploy models with Shell, Python, MLflow, and SageMaker tasks.
DolphinScheduler supports task plugins for MLOps orchestration:

DolphinScheduler will support more machine learning projects such as TensorFlow, BentoML, Kubeflow, and Core. If you are interested in MLOps and orchestration, please join our Slack channel.
Website: https://dolphinscheduler.apache.org
GitHub: https://github.com/apache/dolphinscheduler
WeChat: 海豚调度
Slack: https://s.apache.org/dolphinscheduler-slack
Twitter: @dolphinschedule
Video: https://space.bilibili.com/515596012
The positions of Apache SeaTunnel and DolphinScheduler in the modern data stack are highlighted below. They enable users to integrate data and orchestrate tasks more efficiently.
William Kwok:
Apache Software Foundation Member
Apache IPMC Member
PMC of Apache DolphinScheduler
Mentor of Apache SeaTunnel(incubating)
Founder of ClickHouse China Community
Track Chair of Workflow/Data Governance at ApacheCon Asia 2021/2022
William was the Senior Big Data Director of Lenovo. He worked as Big Data Director/manager at CICC, IBM, and Teradata. He has over 20 years of experience in big data technology and management.
https://www.linkedin.com/in/williamk2000/
E-mail: guowei@apache.org
Twitter: guowei_William
WeChat: guodaxia2999
About Apache SeaTunnel
Apache SeaTunnel (formerly Waterdrop) is an easy-to-use, ultra-high-performance distributed data integration platform that supports real-time synchronization of massive amounts of data and can synchronize hundreds of billions of records per day in a stable and efficient manner.
Why do we need Apache SeaTunnel?
Apache SeaTunnel does everything it can to solve the problems you may encounter in synchronizing massive amounts of data.
- Data loss and duplication
- Task buildup and latency
- Low throughput
- Long application-to-production cycle time
- Lack of application status monitoring
Apache SeaTunnel Usage Scenarios
- Massive data synchronization
- Massive data integration
- ETL of large volumes of data
- Massive data aggregation
- Multi-source data processing
Features of Apache SeaTunnel
- Rich components
- High scalability
- Easy to use
- Mature and stable
How to get started with Apache SeaTunnel quickly?
Want to experience Apache SeaTunnel quickly? SeaTunnel 2.1.0 takes 10 seconds to get you up and running.
https://seatunnel.apache.org/docs/2.1.0/developement/setup
How can I contribute?
We invite everyone interested in taking local open source global to join the Apache SeaTunnel contributor family and grow open source together!
Submit an issue:
https://github.com/apache/incubator-seatunnel/issues
Contribute code to:
https://github.com/apache/incubator-seatunnel/pulls
Subscribe to the community development mailing list:
dev-subscribe@seatunnel.apache.org
Development Mailing List:
dev@seatunnel.apache.org
Join Slack:
https://join.slack.com/t/apacheseatunnel/shared_invite/zt-1kcxzyrxz-lKcF3BAyzHEmpcc4OSaCjQ
Follow Twitter:
https://twitter.com/ASFSeaTunnel
Come and join us!