Data integration is key to the success of the modern data stack
According to the data life cycle, we usually divide Big Data technologies into seven parts: data integration, data storage, batch/stream processing, data query and analysis, data scheduling and orchestration, data development, and BI.
What Is Data Integration?
Data integration sits at the very front of the data lifecycle. It is responsible for aggregating data from multiple sources into a single data store (e.g. a data warehouse or data lake), combining them to provide users with a single unified view that accommodates data growth and all the different formats, and merging all types of data to facilitate subsequent analysis and mining.
Anyone who has been involved in data engineering knows that 90% or more of the tasks in a Big Data project are related to data integration. Data integration has a broad meaning, covering operations such as data cleansing, data extraction, data conversion, and data synchronization/replication. Nowadays, the Big Data ecosystem has become quite complex (as shown below) and there are many different sources of data. How to efficiently integrate data from so many sources into a data lake/warehouse is the key focus of data integration, and that is where its value lies.
Use Cases Of Data Integration In Business
Common use cases of data integration services in business are as follows:
1. Synchronization between homogeneous/heterogeneous data sources: the user’s raw data needs to be transferred to another store, or needs to take advantage of the query and analysis capabilities of the target storage system; for example, Hive data or local data needs to be synchronized to Snowflake, ClickHouse, etc. for fast querying.
2. Data on the cloud: users need to migrate offline data to cloud storage quickly and safely for further business analysis, such as moving offline MySQL, PostgreSQL, etc. to RDS in the cloud.
Based on these cases, data integration has always played the role of a data mover, providing a powerful and efficient solution for a wide range of data synchronization requirements.
Common strategies for data integration
Two common strategies for data integration: ETL and ELT
Data integration is one of the most time-consuming tasks that data engineers perform on a daily basis. So what is ETL? ETL is a specific set of processes in the traditional field of data integration, consisting of three important phases: Extract, Transform, and Load. It is the process of preparing data for analysis and mining.
To start with, let’s understand the concepts of ETL and ELT.
The ETL process is Extract → Transform → Load: data is extracted from the source and then transformed, and the result is written to the target (e.g. a data warehouse).
The ELT process is Extract → Load → Transform: the extracted data is first written to the target (data warehouse/data lake), and the transformation is then done using the analysis capabilities of the data warehouse or engines such as Spark or Presto.
Yes, it’s all about the order of Transform and Load, but the effect is very different. The biggest difference is that ELT focuses on extraction and loading rather than transformation, which makes it possible to build a data lake/warehouse platform in a much lighter and faster way. With the ELT strategy, the loading of data starts immediately after the extraction is completed.
On one hand, it is faster and more efficient; on the other hand, ELT allows data analysts to access the entire raw data set rather than the “secondary” outputs of data engineering, which gives analysts more flexibility to better support the business.
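To make the difference concrete, below is a minimal, illustrative sketch of an ELT-style job written in the configuration format of Apache SeaTunnel, an open-source data integration tool introduced later in this article. The connector names and option keys are assumptions for illustration and may differ between versions; the point is simply that the pipeline only extracts and loads, while transformation is deferred to the warehouse.
# ELT-style sketch: extract raw rows and load them unchanged; the "T" happens later in the warehouse.
# A classic ETL job would instead insert a transform step between the source and the sink.
env {
  job.mode = "BATCH"
}
source {
  # Extract: read raw rows from an operational database (illustrative options)
  Jdbc {
    url = "jdbc:mysql://localhost:3306/shop"
    driver = "com.mysql.cj.jdbc.Driver"
    user = "root"
    password = "******"
    query = "select * from orders"
  }
}
sink {
  # Load: a console sink is used here for simplicity; in practice this would be a warehouse or lake sink
  Console {}
}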
To put this in historical context: in the past, ETL methods were essential due to the high cost of computation and storage. Typical ETL tools back then included:
Commercial software: Informatica PowerCenter, IBM InfoSphere DataStage, Microsoft SQL Server Integration Services, etc.
Open source software: Kettle, Talend, Sqoop, etc.
All of these tools were once very popular, but ETL now faces the following problems:
1. Not flexible: ETL is inherently inflexible, requiring data engineers to process raw data layer by layer according to data warehouse specifications, and obliging data analysts to know in advance how they want to analyze the data and produce reports.
2. Non-intuitive: every transformation performed on the data causes some of the original information to be “lost”. Data analysts cannot view all the data in the data warehouse; they usually only see data at the aggregation and data mart levels. ETL processing is also very time-consuming and hurts data timeliness, as data has to pass through multiple ETL layers.
3. Not self-service: Building an ETL Data Pipeline is often beyond the technical capabilities of data analysts and requires the involvement of engineers, which undoubtedly increases the cost of doing so.
With the plummeting cost of hardware and storage, there is no longer a need to transform data (T) before it is loaded and used (L), and traditional ETL is gradually being transformed into ELT. This allows data analysts to work far more autonomously; specifically, the two obvious benefits of ELT are:
1. Support for faster decisions by data analysts. Raw data is loaded directly into the data warehouse/lake, constituting a ‘single source of truth’ that data analysts can transform as needed. They will always be able to go back to the original data and will not be affected by transformations that could compromise data integrity. This makes the business intelligence process incredibly flexible and secure.
2. ELT reduces the technical barrier for the whole organization. Combined with commercial or open-source BI tools such as Looker and Tableau, the ELT approach can be used even by non-technical users.
Open-source projects that have adopted the ELT route include Airbyte and Apache SeaTunnel. I believe you are already familiar with Airbyte, so here I will focus on the Apache Software Foundation’s Apache SeaTunnel (Incubating), a data integration project that has been in the Apache Incubator for almost a year.
What is Apache SeaTunnel?
Apache SeaTunnel is a very easy-to-use, ultra-high-performance data integration platform that supports real-time synchronization of massive data. It can synchronize tens of billions of records per day stably and efficiently, and has been used in production by nearly 100 companies.
Apache SeaTunnel’s features
1. Rich and extensible connectors: Apache SeaTunnel provides a Connector API that does not depend on a specific execution engine. Connectors (Source, Transform, Sink) developed against this API can run on a variety of different engines; the Apache SeaTunnel Engine, Flink, and Spark are currently supported.
2. Connector plug-ins: the plug-in design allows users to easily develop their own connectors and integrate them into the Apache SeaTunnel project. More than 80 connectors are currently supported, and this number is growing at a very rapid pace. The list of currently supported connectors is shown in the figure below.
3. Batch and streaming data integration: The connectors developed based on the Apache SeaTunnel Connector API are perfectly compatible with offline sync, real-time sync, full sync, incremental sync and many other cases. This reduces the difficulty of managing data integration tasks significantly.
4. Distributed snapshot algorithms are supported to ensure data consistency.
5. Multi-engine support: Apache SeaTunnel uses the Apache SeaTunnel Engine for data synchronization by default. At the same time, in order to fit in with an enterprise’s existing technology components, Apache SeaTunnel also supports Flink or Spark as the connectors’ runtime execution engine, across multiple Spark and Flink versions.
6. JDBC multiplexing and multi-table database-log parsing: Apache SeaTunnel supports multi-table and whole-database synchronization, solving the problem of too many JDBC connections, and supports multi-table and whole-database log reading and parsing, solving the problem of logs being read and parsed repeatedly in CDC multi-table synchronization scenarios.
7. High throughput and low latency: Apache SeaTunnel supports parallel reading and writing, providing stable and reliable data synchronization capability with high throughput and low latency.
8. Excellent real-time monitoring: Apache SeaTunnel currently supports detailed monitoring information for each step of the data synchronization process, allowing users to easily understand the number of data entries, data size, QPS and other information read and written by the synchronization task.
9. Support for both coding and canvas design job development: The Apache SeaTunnel web project provides the ability to visually manage, schedule, run and monitor tasks.
Apache SeaTunnel Connectors
Apache SeaTunnel connectors use a plug-in mechanism, which makes it very easy to add new connectors. Apache SeaTunnel currently supports 80+ Source and Sink (target) connectors, and the number is growing rapidly.
As a data integration product in the modern data stack, the Apache SeaTunnel product architecture is as follows:
Apache SeaTunnel Runtime Flow
The Apache SeaTunnel runtime flow is shown above: the user configures the job information and selects the execution engine to submit the task. The Source connector reads data and sends it in parallel to the downstream Transform or directly to the Sink, which writes the data to the destination. It is important to note that Source, Transform, and Sink can all be easily extended with your own implementations. In addition to using Apache SeaTunnel’s own engine, you can also choose the Flink or Spark engine, in which case Apache SeaTunnel wraps the connectors as a Flink or Spark application and submits it to run in a Flink or Spark cluster.
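To make this flow concrete, here is a minimal sketch of how a job configuration wires these pieces together, based on the SeaTunnel V2 config structure (env / source / transform / sink). The connector names and option keys below follow the project documentation but may vary between versions, so treat this as an illustration rather than a copy-paste recipe.
# Job-level settings: choose batch or streaming mode and the degree of parallelism
env {
  execution.parallelism = 2
  job.mode = "BATCH"   # use "STREAMING" for continuous synchronization
}
source {
  # Source connector: a built-in test source that generates rows; swap in Jdbc, Kafka, etc. for real jobs
  FakeSource {
    result_table_name = "raw_events"
    row.num = 16
    schema = {
      fields {
        name = "string"
        age = "int"
      }
    }
  }
}
transform {
  # Optional Transform connectors go here, reading from "raw_events";
  # leave this block empty for a pure extract-and-load (ELT-style) job
}
sink {
  # Sink connector: writes the (possibly transformed) rows to the destination
  Console {
    source_table_name = "raw_events"
  }
}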
Quick Start for Apache SeaTunnel
Please refer to the official website: https://seatunnel.apache.org/docs/2.3.0-beta/start-v2/local
You can start the application with the following commands, depending on the engine you choose; the Apache SeaTunnel Engine command is shown below, with Spark and Flink variants sketched after it.
- Spark
- Flink
- Apache SeaTunnel Engine
cd "apache-seatunnel-incubating-${version}"
./bin/seatunnel.sh \
--config ./config/seatunnel.streaming.conf.template -e local
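If you prefer to run the same job on Spark or Flink, the distribution also ships engine-specific launcher scripts. The following is a sketch based on the 2.3.0-beta quick-start documentation; script names and options may differ between versions:
# Run on local Spark (sketch)
./bin/start-seatunnel-spark-connector-v2.sh \
--master local[4] --deploy-mode client \
--config ./config/seatunnel.streaming.conf.template
# Run on Flink (sketch)
./bin/start-seatunnel-flink-connector-v2.sh \
--config ./config/seatunnel.streaming.conf.template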
And you can of course experience it through deployments such as Kubernetes.
Besides ETL and ELT, data integration also includes the data virtualization strategy, which has the advantage of providing access to different data sources through a unified “view”, without the need to reconfigure the architecture of the various data sources.
Data virtualization is a good solution for enterprises with strict data security requirements where replication of data is not allowed. However, data virtualization still has unresolved issues: it cannot solve performance and data quality problems. As the volume of enterprise data keeps growing, performance is a challenge for every kind of data integration, and due to design limitations data virtualization still lags behind other data integration technologies here, despite rapid progress in this area. Data quality control requires decisions to be made according to data validation rules, which is also not a priority for data virtualization. This makes the data virtualization model unsuitable for scenarios that require high data quality and extensive data transformation and processing, such as data governance.
That’s all for now on the topic of data virtualization.
Summary
- Data integration is key to the success of the modern data stack: it eliminates enterprise information silos and enables data sharing, providing a solid foundation for enterprises to achieve data governance.
- Data integration connects different ‘silos’ of data, such as local data and SaaS data, so that data is not isolated and can be harnessed for greater value.
- Data integration enables key elements of an organization, such as applications, processes, systems, organizations, and people, to work together and improve business efficiency.
- Data integration enables the aggregation of different types of data, allowing users to quickly access, analyze, and extract valuable information, thus increasing the success of digital decision-making.
About Apache SeaTunnel
Apache SeaTunnel (formerly Waterdrop) is an easy-to-use, ultra-high-performance distributed data integration platform that supports real-time synchronization of massive amounts of data and can synchronize hundreds of billions of records per day in a stable and efficient manner.
Why do we need Apache SeaTunnel?
Apache SeaTunnel does everything it can to solve the problems you may encounter in synchronizing massive amounts of data.
- Data loss and duplication
- Task buildup and latency
- Low throughput
- Long application-to-production cycle time
- Lack of application status monitoring
Apache SeaTunnel Usage Scenarios
- Massive data synchronization
- Massive data integration
- ETL of large volumes of data
- Massive data aggregation
- Multi-source data processing
Features of Apache SeaTunnel
- Rich components
- High scalability
- Easy to use
- Mature and stable
How to get started with Apache SeaTunnel quickly?
Want to experience Apache SeaTunnel quickly? SeaTunnel 2.1.0 takes 10 seconds to get you up and running.
https://seatunnel.apache.org/docs/2.1.0/developement/setup
How can I contribute?
We invite all partners who are interested in making local open-source global to join the Apache SeaTunnel contributors family and foster open-source together!
Submit an issue:
https://github.com/apache/incubator-seatunnel/issues
Contribute code to:
https://github.com/apache/incubator-seatunnel/pulls
Subscribe to the community development mailing list:
dev-subscribe@seatunnel.apache.org
Development Mailing List:
dev@seatunnel.apache.org
Join Slack:
https://join.slack.com/t/apacheseatunnel/shared_invite/zt-1kcxzyrxz-lKcF3BAyzHEmpcc4OSaCjQ
Follow Twitter:
https://twitter.com/ASFSeaTunnel
Come and join us!