Quickly Building a Data Integration Platform Based on Apache SeaTunnel by China Telecom Yikang Tech

Apache SeaTunnel
Aug 8, 2024

Author | Dai Lai, Engineer from China Telecom Yikang Tech

Translator&Editor | Debra Chen

I. Introduction

As a high-performance, easy-to-use data integration framework, Apache SeaTunnel is a solid foundation for quickly implementing a data integration platform. This article details how we built such a platform on Apache SeaTunnel, covering the strategic background of our data middle platform, the technology selection for the data integration platform, how we lowered the threshold for using Apache SeaTunnel, and our outlook for the future.

II. Strategic Background of the Data Middle Platform

With demand for data-driven decision-making growing in the healthcare industry, now is the time to tap the value of healthcare data and unlock new productivity. China Telecom Yikang has developed its own “data middle platform” to provide full-process management and one-stop empowerment of healthcare data elements, laying a foundation for operating those data elements and supporting both value extraction from healthcare data and the application of AI models. Against this strategic background, the data integration platform, as the “artery” of our data middle platform, had to be implemented quickly and had to handle the middle platform’s complex data integration scenarios.

III. Technology Selection for the Data Integration Platform

3.1. Key Considerations

When selecting the technology for the underlying data integration platform, the following key factors need to be considered:

  • Performance: The data integration engine needs to have high throughput and low latency, capable of efficiently processing large amounts of data.
  • Scalability: The data integration engine should have good scalability to dynamically expand processing capabilities according to business needs.
  • Usability: The data integration platform should be easy to use and maintain, reducing reliance on professional technical personnel.
  • Ecosystem Support: The data integration engine should support multiple data sources and targets, with good ecosystem support.

3.2. Advantages of Choosing Apache SeaTunnel

Currently, the mainstream data integration technologies on the market include Sqoop, DataX, Kettle, Flink CDC, Canal, and Airbyte. Compared with them, Apache SeaTunnel has the following advantages, which make it an ideal choice for our data integration platform:

1. Performance

According to the latest official data, Apache SeaTunnel is 40%-80% faster than DataX and 30 times faster than Airbyte in the same test scenarios, a clear performance advantage. We also tested it on-site with customers: in a JDBC-source-to-JDBC-sink performance test on an 8C32G server (8 cores, 32 GB of memory), our data integration platform was on average nearly 20,000 records per second faster than third-party platforms. This performance stems from SeaTunnel’s design. For example, the JDBC connector uses database connection reuse and dynamic sharding, and the Zeta engine implements dynamic thread sharing. Together these keep resource usage low during data synchronization and improve efficiency.
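
To make the dynamic-sharding idea concrete, here is a minimal Python sketch that splits a numeric primary-key range into non-overlapping shards that parallel readers could scan independently. It is an illustration only, not SeaTunnel’s actual implementation; the function names `split_key_range` and `shard_queries` and the table and column names are hypothetical.

```python
from typing import List, Tuple


def split_key_range(min_id: int, max_id: int, num_shards: int) -> List[Tuple[int, int]]:
    """Split the inclusive range [min_id, max_id] into at most num_shards
    contiguous, non-overlapping inclusive sub-ranges of near-equal size."""
    if num_shards < 1:
        raise ValueError("num_shards must be >= 1")
    if max_id < min_id:
        return []
    total = max_id - min_id + 1
    shards = min(num_shards, total)      # never create empty shards
    base, extra = divmod(total, shards)  # the first `extra` shards get one more key
    ranges = []
    start = min_id
    for i in range(shards):
        size = base + (1 if i < extra else 0)
        end = start + size - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


def shard_queries(table: str, key: str, ranges: List[Tuple[int, int]]) -> List[str]:
    """Build one range query per shard (hypothetical table and key names)."""
    return [
        f"SELECT * FROM {table} WHERE {key} BETWEEN {lo} AND {hi}"
        for lo, hi in ranges
    ]


if __name__ == "__main__":
    for sql in shard_queries("patient_visit", "id", split_key_range(1, 10, 3)):
        print(sql)
```

Each query can then be handed to a separate reader; an even split like this assumes the keys are roughly uniformly distributed.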

2. Deployment Method

In our customer scenarios, most hospitals can only provide physical machines for deploying collection services at the front-end collection nodes, while the platform itself is deployed centrally. Network access runs only from the collection end to the hospital database and from the central end to the front-end collection nodes; there is no cross-end communication. Only a few customers can deploy all services in one environment. This calls for very flexible deployment. SeaTunnel supports both distributed and standalone deployments, and its decentralized design ensures high availability and scalability. Each node can act as both Master and Worker, or Masters and Workers can be deployed separately. The former suits small to medium-scale deployments, the latter large-scale ones.

3. Fault Tolerance

SeaTunnel’s fault tolerance is also excellent.

From a cluster perspective, if a cluster node fails, its tasks automatically fail over to other cluster nodes. When IMap persistence is enabled in the cluster, the cluster can recover automatically from the persisted data on restart, even if all of its nodes failed. Note that the first node to start loads the persisted IMap data, so the nodes should be started close together in time; otherwise all tasks may be assigned to the node that started first.

From a job perspective, SeaTunnel also has a checkpoint mechanism. If a job fails unexpectedly, it can recover from the checkpoint, so an expensive data synchronization task does not have to start over. In addition, network latency, node failures, and other faults can cause data consistency issues in distributed systems, so SeaTunnel implements two-phase commit in the relevant connectors to ensure data consistency.
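
The checkpoint idea can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not SeaTunnel’s checkpoint implementation: `CheckpointStore` is a hypothetical in-memory stand-in for durable storage, and `fail_at` only exists to simulate a crash.

```python
from typing import Dict, List, Optional


class CheckpointStore:
    """In-memory stand-in for durable checkpoint storage (hypothetical)."""

    def __init__(self) -> None:
        self._offsets: Dict[str, int] = {}

    def save(self, job_id: str, offset: int) -> None:
        self._offsets[job_id] = offset

    def load(self, job_id: str) -> int:
        # A job with no checkpoint yet starts from the beginning.
        return self._offsets.get(job_id, 0)


def run_job(
    job_id: str,
    source: List[int],
    sink: List[int],
    store: CheckpointStore,
    batch_size: int = 2,
    fail_at: Optional[int] = None,
) -> int:
    """Copy records from source to sink in batches, checkpointing the offset
    after each batch. Raises RuntimeError when the simulated failure point
    is reached. Returns the final offset."""
    offset = store.load(job_id)  # resume from the last checkpoint, if any
    while offset < len(source):
        if fail_at is not None and offset >= fail_at:
            raise RuntimeError("simulated failure")
        batch = source[offset:offset + batch_size]
        sink.extend(batch)
        offset += len(batch)
        store.save(job_id, offset)
    return offset


if __name__ == "__main__":
    records, target, checkpoints = list(range(7)), [], CheckpointStore()
    try:
        run_job("demo", records, target, checkpoints, fail_at=4)
    except RuntimeError:
        pass  # the job crashed after writing part of the data
    run_job("demo", records, target, checkpoints)  # resumes, does not restart
    print(target)
```

In this sketch the sink write and the checkpoint save are two separate steps, so a crash between them would replay one batch on recovery. Closing that gap is the kind of problem two-phase commit addresses.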

4. Rich Ecosystem

SeaTunnel already supports over 100 types of data sources, and its connector ecosystem is easy to extend. It supports whole-database synchronization, multi-table synchronization, and resuming interrupted transfers from a breakpoint. It also supports automatic table creation, which is especially convenient when synchronizing many tables.

5. Integration Engine Architecture

SeaTunnel’s EtLT architecture is well suited to data middle platform scenarios. In 90% of these scenarios, the job is to move data from a source to a target. A transformation (Transform) step may be involved, but this “T” is a lowercase “t”: mainly column copying, column filtering, and field splitting rather than operations like join or group by. This pattern is very common in data middle platforms, where complex SQL-based association queries run only after the data has entered the data warehouse.
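
The three lightweight “t” operations named above can be illustrated with plain Python functions over row dictionaries. This is a sketch of the concept only; the function names and sample fields are hypothetical and do not correspond to SeaTunnel transform plugins.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def copy_column(row: Row, src: str, dest: str) -> Row:
    """Column copying: duplicate the value of src under a new name."""
    out = dict(row)
    out[dest] = row[src]
    return out


def keep_columns(row: Row, columns: List[str]) -> Row:
    """Column filtering: keep only the listed columns, in the listed order."""
    return {name: row[name] for name in columns if name in row}


def split_field(row: Row, src: str, separator: str, dest_names: List[str]) -> Row:
    """Field splitting: split one string field into several named fields.
    Any surplus separators stay in the last output field."""
    parts = str(row[src]).split(separator, len(dest_names) - 1)
    out = dict(row)
    for name, part in zip(dest_names, parts):
        out[name] = part
    return out


if __name__ == "__main__":
    row = {"id": 1, "full_name": "Li Wei", "dept": "cardiology", "tmp": "x"}
    row = split_field(row, "full_name", " ", ["family_name", "given_name"])
    row = copy_column(row, "dept", "department")
    row = keep_columns(row, ["id", "family_name", "given_name", "department"])
    print(row)
```

Each function works on one row at a time and needs no state shared across rows, which is what separates this lowercase “t” from joins or aggregations.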

6. Platform Architecture

If we do not choose SeaTunnel as the data integration engine, our platform architecture might look like this:

The disadvantage of this architecture is that using multiple data integration engines incurs high maintenance costs, and it also requires a Flink execution environment to complete real-time synchronization tasks. From the perspective of quickly implementing a data integration platform, this is not very user-friendly as it requires in-depth research into multiple data integration engines. When we adopt SeaTunnel, the architecture of the data integration platform can be optimized as follows:

We only need to study Apache SeaTunnel and quickly implement the data integration platform based on it. If there are unmet needs, we can also carry out secondary development based on it. The development and maintenance costs are much lower compared to the former.

IV. How to Lower the Threshold for Using SeaTunnel

To lower the threshold for using Apache SeaTunnel, our team has carried out a series of modifications to better meet our usage scenario requirements and reduce the difficulty of use, including:

1. A User-Friendly Interface

To lower the threshold for use, we developed a visual configuration interface that lets users configure data integration tasks graphically, without writing complex configuration files. It supports:

  • creating batch and streaming tasks;
  • selecting and configuring data sources;
  • parameter configuration (optional);
  • configuring the mapping relationships for synchronization tasks, including flexibly adjusting field order, setting custom field values, adding default fields, and deleting redundant fields;
  • running complex SQL data association queries;
  • periodic scheduling of batch tasks for timed full or incremental synchronization;
  • global parameter settings.

The above are some examples of our product features. The entire product’s functionality goes far beyond this. Through these examples, we aim to guide users on how to quickly implement a data integration platform.

2. Providing Rich Documentation and Examples

An excellent data integration platform needs thorough, high-quality documentation. Detailed usage documentation and rich example code help users get started quickly. The documentation covers how to install, configure, and debug the platform and how to solve common problems.

Main documents include environment requirements, project configuration, configuration file explanations, running tests, and common problem solutions. For example:

Common Problem Solutions

  • Data Source Connection Issues:
    Ensure the data source address, port, and authentication information are correct.
    Check network connections and firewall settings.
  • Data Transformation Errors:
    Check if the transformation rules are correct.
    Ensure all fields and types match.
  • Performance Issues:
    Adjust connector parameters and other configurations to improve performance.
    Optimize data transformation logic.
  • Plugin Issues:
    Ensure all necessary plugins are installed and correctly configured.
    Check the compatibility of plugin versions.
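
For the connection checks in the first item, a quick TCP reachability test from the collection node often narrows the problem down before any credentials are involved. The following is a minimal sketch using only the Python standard library; the function name `can_connect` and the host in the usage example are hypothetical.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened within
    the timeout. This checks network reachability and firewall rules only,
    not database authentication."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    # Hypothetical database host and port.
    print(can_connect("db.example.internal", 3306))
```

If this returns False, the cause is likely the address, the port, or a firewall rule; if it returns True but the task still fails, the authentication settings are the next thing to check.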

3. Integrating Automated Deployment Tools

Automated deployment and management of SeaTunnel further reduce the difficulty of use and maintenance. Our platform implements one-click deployment of the SeaTunnel service based on server address information.

The deployed SeaTunnel service can then be monitored in real time.

4. Community Support

During the development and implementation of the data integration platform, some problems are inevitable. The community already has experience with issues such as SeaTunnel cluster fault tolerance and recovery, and its members actively provide answers and help.

Additionally, some features may not meet our actual business needs. For example, in the lakehouse data middle platform architecture, we use Apache Paimon as the data lake, but the community’s Paimon connector cannot fully meet our business needs. We have successively made bug fixes and added many new features to the Paimon connector:

  • Support for CDC writing to Paimon.
  • Support for automatic table creation in Paimon sink, specifying partition key, primary key, and multiple buckets (improving write performance in large data write scenarios).
  • Support for a multi-table sink in Paimon.
  • Support for specifying formats for writing to Paimon (default is ORC, but Parquet and Avro formats can be specified).
  • Fix for incorrect date field writing and support for timestamp(n) types.
  • Support for Kerberos authentication and HA mode HDFS clusters.
  • Support for Hive catalog.
  • Support for pre-type conversion validation before writing to sink tables.
  • Fix for batch write data loss issues.

These are just glimpses of our contributions to the community. Since choosing Apache SeaTunnel as the data integration engine, we have benefited from the community and should actively contribute back to the community to help everyone improve together.

V. Future Prospects

As demand for big data and data-driven decision-making in the healthcare industry continues to grow, we expect SeaTunnel to play an important role in healthcare informatization, especially in data integration and processing, because its features and capabilities fit the needs of healthcare big data platforms well. Here are some prospects for SeaTunnel implementation in the healthcare industry:

1. Multi-Data Source Integration

Integration of hospital electronic medical record systems, imaging information systems (PACS), laboratory information systems (LIS), etc., to achieve cross-system data sharing.

2. Data Standards

Support for healthcare industry standards like HL7 FHIR (Fast Healthcare Interoperability Resources), improving data standardization and interoperability.
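
To give a sense of what FHIR-oriented output could involve, here is a minimal Python sketch that maps a flat source row to a FHIR `Patient` resource. It describes a possible direction only, not an existing SeaTunnel feature; the source column names and the `M`/`F` codes are hypothetical, and only a handful of `Patient` fields are shown.

```python
import json
from typing import Any, Dict

# Hypothetical source codes mapped to FHIR administrative gender values.
GENDER_CODES = {"M": "male", "F": "female"}


def row_to_fhir_patient(row: Dict[str, Any]) -> Dict[str, Any]:
    """Map a flat row (hypothetical column names) to a minimal FHIR
    Patient resource represented as a dictionary."""
    return {
        "resourceType": "Patient",
        "id": str(row["patient_id"]),
        "name": [{"family": row["family_name"], "given": [row["given_name"]]}],
        "gender": GENDER_CODES.get(row.get("sex"), "unknown"),
        "birthDate": row["birth_date"],  # expected as YYYY-MM-DD
    }


if __name__ == "__main__":
    row = {
        "patient_id": 1001,
        "family_name": "Li",
        "given_name": "Wei",
        "sex": "M",
        "birth_date": "1985-03-12",
    }
    print(json.dumps(row_to_fhir_patient(row), ensure_ascii=False, indent=2))
```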

3. Security and Privacy Protection

  • Data Encryption: Using encryption technology to protect data security, especially during transmission.
  • Anonymization and Desensitization: Implementing anonymization and desensitization of data to protect patient privacy.
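
Two common desensitization techniques are keyed hashing, which replaces an identifier with a stable pseudonym, and partial masking, which hides the middle of a value. The Python sketch below illustrates both with the standard library; the function names and sample values are hypothetical, and real deployments need proper key management and a review against the applicable privacy regulations.

```python
import hashlib
import hmac


def pseudonymize(value: str, secret_key: bytes) -> str:
    """Replace an identifier with a keyed SHA-256 digest. The same input and
    key always give the same pseudonym, so records can still be joined."""
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def mask_middle(value: str, keep_start: int = 3, keep_end: int = 4, mask_char: str = "*") -> str:
    """Hide the middle of a string, keeping a few leading and trailing
    characters. Values too short to mask partially are masked entirely."""
    if len(value) <= keep_start + keep_end:
        return mask_char * len(value)
    hidden = len(value) - keep_start - keep_end
    return value[:keep_start] + mask_char * hidden + value[len(value) - keep_end:]


if __name__ == "__main__":
    key = b"replace-with-a-managed-secret"  # placeholder, not a real key
    print(pseudonymize("patient-1001", key))
    print(mask_middle("13812345678"))
```

A keyed hash is used rather than a plain hash so that someone without the key cannot rebuild the mapping by hashing guessed identifiers.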

4. AI and Machine Learning Integration

The data integration platform will introduce more intelligent features, such as intelligent recommendation configurations, to help users more efficiently integrate and process data.

VI. Conclusion

As an efficient and flexible data integration platform, Apache SeaTunnel plays an important role in our data middle platform strategy. We hope this article has shown how to quickly build a data integration platform based on SeaTunnel and apply it flexibly in practice. As the technology continues to develop, SeaTunnel will keep playing an important role in the field of data integration, helping enterprises achieve data-driven business transformation.

About Apache SeaTunnel

Apache SeaTunnel is an easy-to-use, ultra-high-performance distributed data integration platform that supports real-time synchronization of massive amounts of data and can synchronize hundreds of billions of data per day stably and efficiently.

You are welcome to fill out this form to become an Apache SeaTunnel speaker: https://forms.gle/vtpQS6ZuxqXMt6DT6 :)

Why do we need Apache SeaTunnel?

Apache SeaTunnel does everything it can to solve the problems you may encounter in synchronizing massive amounts of data.

  • Data loss and duplication
  • Task buildup and latency
  • Low throughput
  • Long application-to-production cycle time
  • Lack of application status monitoring

Apache SeaTunnel Usage Scenarios

  • Massive data synchronization
  • Massive data integration
  • ETL of large volumes of data
  • Massive data aggregation
  • Multi-source data processing

Features of Apache SeaTunnel

  • Rich components
  • High scalability
  • Easy to use
  • Mature and stable

How to get started with Apache SeaTunnel quickly?

Want to experience Apache SeaTunnel quickly? SeaTunnel 2.1.0 takes 10 seconds to get you up and running.

https://seatunnel.apache.org/docs/2.1.0/developement/setup

How can I contribute?

We invite all partners who are interested in making local open-source global to join the Apache SeaTunnel contributors family and foster open-source together!

Submit an issue:

https://github.com/apache/seatunnel/issues

Contribute code to:

https://github.com/apache/seatunnel/pulls

Subscribe to the community development mailing list:

dev-subscribe@seatunnel.apache.org

Development Mailing List:

dev@seatunnel.apache.org

Join Slack:

https://join.slack.com/t/apacheseatunnel/shared_invite/zt-1kcxzyrxz-lKcF3BAyzHEmpcc4OSaCjQ

Follow Twitter:

https://twitter.com/ASFSeaTunnel

Join us now!❤️❤️
