The Evolution of Digital Guangdong’s Data Integration Platform from DataX to Apache SeaTunnel

Apache SeaTunnel
7 min read · Aug 28, 2023

Author | Meng Xiaopeng
Translator & Editor | Debra Chen

Background

In the era of big data, government agencies enjoy the management convenience that big data brings, but they also face challenges in data quality, data management, and data sharing. To make government decision-making more scientific and precise, improve administrative efficiency, optimize public services, and advance digital transformation, government data exchange platforms have become an indispensable “bridge” for data circulation among government departments.

A government data sharing and exchange platform is a platform through which government departments share data. It enables data sharing, exchange, and integration between departments, supporting cross-level, cross-regional, cross-department, cross-system, and cross-business sharing of government information resources.

Because a government data-sharing platform must integrate multiple data sources spanning different departments, databases, and formats, its construction and maintenance rely heavily on data integration technology.

Digital Guangdong has been at the forefront of digital transformation, continuously exploring data sharing and exchange scenarios, and has replaced DataX with Apache SeaTunnel to support its data integration.

What challenges does the government sector face in data exchange business scenarios? How can Apache SeaTunnel data integration technology be utilized to maximize the value of data? Digital Guangdong’s journey through data integration technology evolution answers these questions.

Challenges in Business

Digital Guangdong, jointly established by China’s three major telecom operators and Tencent, aims to become a leading platform-based technology company in China.

Its product “Yueshengshi” is a mobile government service platform, an essential tool for “fingertip” government services and an important foundation for the digital transformation of Guangdong Province’s government. “Yueshengshi” has two distinctive scenarios: comprehensive government services and epidemic prevention and control.

Government Service Scenario:

By May 2022, more than 2,500 services had been launched, including common services such as provident fund, social security, medical insurance, utility bills, electronic certificates, and tax-related services.

Epidemic Prevention and Control Scenario:

During the three years of the pandemic, “Yueshengshi” played a crucial role in epidemic prevention and control, providing 73 related services and delivering authoritative information in real time.

Supporting the existing “Yueshengshi” services, including digital government services, depends on real-time data, which demands high data quality and efficient data flow and ultimately tests the core capabilities of the data integration platform. Against this backdrop, data synchronization exposed several pain points, and Digital Guangdong faced challenges in building its data synchronization platform.

Pain Points in Data Synchronization

In the realm of data synchronization, both real-time and offline, Digital Guangdong encountered various challenges.

Pain Points in Offline Data Synchronization

For offline data synchronization, the focus was on the following four aspects:

  • Ecosystem Insufficiency
  • Compatibility Issues
  • Performance Shortcomings
  • Scalability Issues

Ecosystem insufficiency refers to the limitations in the available tools and libraries to handle diverse database sources, especially within the complex ecosystem of the government sector.

Compatibility issues arise when dealing with less common or specialized data types not adequately supported by DataX’s predefined mappings.

Since the existing DataX architecture essentially consists of a reader plugin + framework + writer plugin, adapting to various databases means being compatible with each database’s data types. DataX defines six internal types for data conversion and mapping, but it does not handle less common or specialized data types well.
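
To illustrate what this mapping problem looks like in practice, here is a minimal, illustrative sketch (not DataX source code) of how a reader-side helper might map JDBC column types onto six DataX-style internal categories, with a raw-bytes fallback for uncommon binary types such as BLOB. The class, enum, and method names are hypothetical.

```java
import java.sql.Types;

// Illustrative sketch only: maps JDBC column types onto six DataX-style
// internal categories (LONG, DOUBLE, STRING, DATE, BOOL, BYTES).
// The names here are hypothetical, not part of the DataX codebase.
public class InternalTypeMapper {

    public enum InternalType { LONG, DOUBLE, STRING, DATE, BOOL, BYTES }

    public static InternalType map(int jdbcType) {
        switch (jdbcType) {
            case Types.TINYINT:
            case Types.SMALLINT:
            case Types.INTEGER:
            case Types.BIGINT:
                return InternalType.LONG;
            case Types.FLOAT:
            case Types.DOUBLE:
            case Types.DECIMAL:
            case Types.NUMERIC:
                return InternalType.DOUBLE;
            case Types.DATE:
            case Types.TIME:
            case Types.TIMESTAMP:
                return InternalType.DATE;
            case Types.BOOLEAN:
            case Types.BIT:
                return InternalType.BOOL;
            case Types.BLOB:
            case Types.BINARY:
            case Types.VARBINARY:
            case Types.LONGVARBINARY:
                // Uncommon binary types fall back to raw bytes instead of failing.
                return InternalType.BYTES;
            default:
                // Anything unrecognized is carried as a string by default.
                return InternalType.STRING;
        }
    }
}
```

Types that fall through such a mapping still need explicit rules, which is exactly where the compatibility gaps described above show up.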

Performance shortcomings were observed when trying to achieve multi-channel capabilities, essential for efficient large-scale data synchronization.

DataX supports two modes: table mode and custom SQL (querySql) mode, the latter being the one commonly used in production because it adapts better to complex real-world scenarios. However, DataX’s custom SQL mode only supports a single channel, which limits synchronization performance.
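
One common workaround for this single-channel limitation, shown below as a hedged sketch rather than a DataX feature, is to wrap the custom SQL as a subquery and shard it across channels with a modulo predicate on a numeric split key. The names splitKey and channelCount are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: shard a user-supplied SQL statement into N channel
// queries by wrapping it as a subquery and filtering on a numeric split key.
// Assumes the split key is roughly evenly distributed; not an official DataX feature.
public class QuerySqlSplitter {

    public static List<String> split(String customSql, String splitKey, int channelCount) {
        List<String> shards = new ArrayList<>();
        for (int channel = 0; channel < channelCount; channel++) {
            shards.add(String.format(
                "SELECT * FROM (%s) t WHERE MOD(t.%s, %d) = %d",
                customSql, splitKey, channelCount, channel));
        }
        return shards;
    }

    public static void main(String[] args) {
        // Each shard can then be executed by an independent reader channel.
        split("SELECT id, name FROM person WHERE updated_at >= '2023-01-01'", "id", 4)
            .forEach(System.out::println);
    }
}
```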

Finally, scalability concerns stemmed from DataX’s single-node architecture, which cannot meet the demands of large-scale data synchronization in a distributed environment.

To address these pain points, Digital Guangdong formulated a comprehensive solution consisting of the following four aspects:

  • Architecture Upgrade

As we all know, the open-source version of DataX is single-node, so Digital Guangdong undertook the task of upgrading DataX to a distributed architecture.

  • Enrichment of Data Sources with Multi-Version Support

Addressing the ecosystem limitations, efforts were made to enrich the data sources supported by the existing ecosystem. This includes support for more databases to meet new requirements, as well as support for multiple versions of the mainstream databases already in use.

  • Task Modes

The open-source DataX cannot handle incremental tasks, i.e., tasks that only synchronize records changed since the previous run. Supporting them requires persisting the incremental watermark (for example, the last synchronized timestamp) from the previous run, which DataX does not do. Digital Guangdong therefore enhanced the task modes; a minimal sketch of the watermark idea appears after this list.

  • Data Type Compatibility Optimization

Data type compatibility mainly concerns how plugins handle specific data types. The focus was on uncommon types, such as BLOB fields and Sybase-specific types, and about 15 type-compatibility optimizations were made on top of the open-source version.
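
Returning to the incremental task mode mentioned above, the sketch below shows the core idea in a minimal, hypothetical form: persist a watermark (the maximum incremental timestamp from the previous run) and use it as the lower bound of the next run’s extraction query. The file-based store and all names are assumptions for illustration, not part of DataX or Digital Guangdong’s implementation.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Minimal sketch of an incremental task mode: persist the last watermark
// between runs and use it as the lower bound of the next extraction query.
// The file-based store and names are illustrative assumptions.
public class IncrementalWatermark {

    private final Path stateFile;

    public IncrementalWatermark(Path stateFile) {
        this.stateFile = stateFile;
    }

    public String load(String defaultValue) throws IOException {
        if (Files.exists(stateFile)) {
            return Files.readString(stateFile, StandardCharsets.UTF_8).trim();
        }
        return defaultValue;
    }

    public void save(String watermark) throws IOException {
        Files.writeString(stateFile, watermark, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        IncrementalWatermark wm = new IncrementalWatermark(Path.of("task_123.watermark"));
        String lowerBound = wm.load("1970-01-01 00:00:00");
        // Bound the extraction query by the previous run's watermark
        // (a real implementation would use a parameterized query).
        String sql = "SELECT * FROM orders WHERE updated_at > '" + lowerBound + "'";
        System.out.println(sql);
        // After a successful run, record the new high-water mark for the next run.
        wm.save("2023-08-28 00:00:00");
    }
}
```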

The optimized solution aimed to provide better support for offline data synchronization, aligning with Digital Guangdong’s business requirements.

Real-Time Data Synchronization Pain Points

As the demand for real-time data synchronization increased, Digital Guangdong encountered challenges in this area as well.

DataX, designed as an offline architecture, was inadequate for real-time synchronization. To address this, Digital Guangdong had two routes to consider: building a real-time CDC (Change Data Capture) module in-house, or integrating an industry-standard real-time framework such as Flink CDC.

While building a self-developed real-time module was theoretically feasible, it posed challenges in terms of manpower, technical difficulty, and delivery timeline. Moreover, such an approach might result in duplicating existing efforts and inefficiencies.

The integration route had its own issues, since it required integrating a real-time framework such as Flink CDC into the existing platform.

While leveraging the capabilities of Flink CDC for distributed CDC data synchronization and ingestion seems promising, challenges still arise during integration. Attempting to integrate real-time pipelines into existing platforms poses three significant hurdles for the Lambda architecture in data integration:

  • The first challenge is the proliferation of pipelines. Integrating a real-time pipeline into an existing setup, which is already established as a single-line pipeline, introduces the complexity of managing multiple pipelines.
  • Second, with multiple pipelines in place, the system requires substantial modification. This often forces parallel development efforts, such as building a real-time pipeline based on Flink CDC alongside the existing synchronization pipeline based on DataX.
  • Third, managing multiple pathways presents operational challenges. Operations teams must not only maintain the existing offline platform but also manage the real-time Flink technology stack, adding to the complexity of keeping the system running.

In this context, Digital Guangdong sought a data synchronization solution that could address both real-time and offline pain points, providing a unified technology stack, avoiding duplicated development efforts, and simplifying operations.

Data Integration Platform Selection and Evolution

Following a thorough analysis of real-time data synchronization pain points and requirements, Digital Guangdong embarked on the journey of selecting a data integration platform.

Their requirements could be summarized as follows: they needed a rich ecosystem, distributed architecture, support for both batch and stream processing, high performance, and an active community.

After an extended research period, Apache SeaTunnel emerged as the solution that best matched Digital Guangdong’s requirements. SeaTunnel has hundreds of enterprise users and supports three execution engines (its own Zeta engine, Flink, and Spark) within a distributed architecture that unifies batch and stream processing. In addition, the community is working on a web interface for SeaTunnel to improve its usability.

User Expectations

Looking ahead, Digital Guangdong anticipates several areas of focus for the future of its data synchronization platform.

Firstly, they will continue to iterate on their data synchronization platform based on Apache SeaTunnel. The platform is currently under development, and they plan to migrate existing offline pipelines to it smoothly.

Furthermore, Digital Guangdong has expectations regarding monitoring metrics and the CDC ecosystem. They hope for a richer CDC ecosystem that can cover the complex CDC scenarios in their production environment; at present, the community mainly supports two CDC sources, SQL Server and MySQL.

Regarding task metric monitoring, Digital Guangdong wants more comprehensive metrics beyond elapsed time, read/write record counts, and failure counts. They also want intermediate-state metrics, such as flow-control metrics and data rate, which give a more accurate view of task performance in real-world scenarios.

Author Introduction

Meng Xiaopeng
  • Technology Manager at Digital Guangdong
  • Professional Committee Member of Digital Guangdong Technical Committee
  • Apache SeaTunnel & DataX Contributor

About Apache SeaTunnel

Apache SeaTunnel is an easy-to-use, ultra-high-performance distributed data integration platform that supports real-time synchronization of massive amounts of data and can stably and efficiently synchronize hundreds of billions of records per day.

You are welcome to fill out this form to become an Apache SeaTunnel speaker: https://forms.gle/vtpQS6ZuxqXMt6DT6 :)

Why do we need Apache SeaTunnel?

Apache SeaTunnel does everything it can to solve the problems you may encounter in synchronizing massive amounts of data.

  • Data loss and duplication
  • Task buildup and latency
  • Low throughput
  • Long application-to-production cycle time
  • Lack of application status monitoring

Apache SeaTunnel Usage Scenarios

  • Massive data synchronization
  • Massive data integration
  • ETL of large volumes of data
  • Massive data aggregation
  • Multi-source data processing

Features of Apache SeaTunnel

  • Rich components
  • High scalability
  • Easy to use
  • Mature and stable

How to get started with Apache SeaTunnel quickly?

Want to experience Apache SeaTunnel quickly? SeaTunnel 2.1.0 takes 10 seconds to get you up and running.

https://seatunnel.apache.org/docs/2.1.0/developement/setup

How can I contribute?

We invite all partners who are interested in making local open-source global to join the Apache SeaTunnel contributors family and foster open-source together!

Submit an issue:

https://github.com/apache/seatunnel/issues

Contribute code to:

https://github.com/apache/seatunnel/pulls

Subscribe to the community development mailing list:

dev-subscribe@seatunnel.apache.org

Development Mailing List:

dev@seatunnel.apache.org

Join Slack:

https://join.slack.com/t/apacheseatunnel/shared_invite/zt-1kcxzyrxz-lKcF3BAyzHEmpcc4OSaCjQ

Follow Twitter:

https://twitter.com/ASFSeaTunnel

Join us now!❤️❤️
