【Experience sharing】Migrating from third-party data integration tools to Apache SeaTunnel

Apr 8, 2024

In today’s data-driven business environment, building an efficient and reliable data warehouse is not just a fundamental task for enterprises; it is key to driving business insights, decision support, and innovation. Data integration technology plays a crucial role in this context, as its timeliness and accuracy directly affect the efficiency and output of downstream businesses.

As business needs continue to evolve and the volume of data grows, choosing a data integration tool that meets current demands and adapts to future development becomes critical. This article shares some general lessons on migrating between data integration tools, using our migration from DataX to Apache SeaTunnel as the example. The same approach also applies to other data integration tools such as Sqoop. By sharing it, we hope to provide a reference for colleagues in the community and help teams facing similar migration decisions complete their transition more smoothly.

Migration Background

DataX, as a stable and efficient data synchronization tool, has long served our data integration needs. However, as the demand for data processing grows, DataX’s limitations in terms of data source richness and processing capacity have become apparent.

At the end of 2023, our team decided to migrate from DataX to Apache SeaTunnel, based on in-depth research and comparison of mainstream data integration technologies on the market. During the comparison, we focused on various aspects, including but not limited to architecture design, engine performance, community support and activity, feature richness, processing performance, data consistency guarantees, user-friendliness, extensibility, and system stability. Apache SeaTunnel stood out in many respects, especially in terms of data source richness, community activity, ease of use, and its support for stream-batch unified processing capabilities, providing a solid reason for our choice.

Migration Experience Sharing

Comprehensive Field Type Comparison and Special Character Conversion

In data migration, details determine success or failure. Many teams may focus on macro tests of the overall architecture and process during data integration tool migration, such as confirming the smooth running of several synchronization tasks, before initiating migration work.

However, this method might overlook subtle but crucial details, especially in terms of comprehensive testing of field types and special character handling.

Different data integration tools may behave differently when processing the same data type.

For instance, when handling array-type data, DataX might return a comma-separated string (e.g., “a,b,c”), while Apache SeaTunnel might return an array-format string (e.g., “[a,b,c]”). This difference might seem minor, but it can lead to significant data inconsistency downstream.
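One way to bridge a difference like this is a small normalization step applied before comparing or loading the data. The following is only a minimal sketch, not code from DataX or SeaTunnel: the function name `to_legacy_array_string` is hypothetical, and it assumes flat arrays whose elements contain no commas or brackets.

```python
def to_legacy_array_string(value: str) -> str:
    """Convert a bracketed array string such as "[a,b,c]" to "a,b,c".

    Hypothetical helper. Assumes a flat array whose elements contain no
    commas or brackets. Strings without surrounding brackets are returned
    unchanged.
    """
    text = value.strip()
    if len(text) >= 2 and text[0] == "[" and text[-1] == "]":
        inner = text[1:-1]
        # Trim whitespace around each element so "[a, b]" also maps to "a,b".
        return ",".join(part.strip() for part in inner.split(","))
    return value
```

Values that are already in the comma-separated form pass through untouched, so the same function can be applied to the output of both tools during comparison.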

Moreover, the handling of nested data structures also requires thorough testing.

For example, the document type in MongoDB might contain nested structures of various basic data types. Whether these nested data types remain consistent before and after migration needs careful comparison and testing. In our initial tests, we mainly focused on some basic data types without sufficient comparison and analysis of nested types, leading to some issues during migration.
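A check of this kind can be approximated by comparing the "type signature" of a nested document as read through each tool. The sketch below is an illustration under assumptions: plain Python dicts and lists stand in for MongoDB documents, and `type_signature` is a hypothetical helper, not part of any library.

```python
def type_signature(value, path="", sig=None):
    """Map every leaf path of a nested dict/list to the set of type names seen there.

    Hypothetical helper: plain dicts and lists stand in for MongoDB documents.
    List elements share one path suffix "[]" so mixed-type arrays show up as
    a set with more than one type name.
    """
    sig = {} if sig is None else sig
    if isinstance(value, dict):
        for key, child in value.items():
            type_signature(child, f"{path}.{key}" if path else str(key), sig)
    elif isinstance(value, list):
        for item in value:
            type_signature(item, f"{path}[]", sig)
    else:
        sig.setdefault(path, set()).add(type(value).__name__)
    return sig
```

Computing the signature of the same record as produced by the old and the new tool, and comparing the two dictionaries, surfaces cases where a nested field silently changed type, for example a number that became a string.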

The following image shows a portion of the source code we modified. In fact, just for MongoDB, our modifications related to data type handling involved up to eight different places.

Special character handling is another detail that needs attention.

For example, DataX might replace newline characters and other special characters with spaces, while SeaTunnel preserves these special characters. When syncing data to Hive and the target table is in textfile format, this can easily lead to data line misalignment problems.
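To keep the new pipeline consistent with the old behavior, row-breaking characters can be replaced with spaces before the data reaches a textfile table. This is a minimal sketch under assumptions: the exact character set replaced by a given DataX build is not stated here, so the set below (line feed, carriage return, and `\x01`, Hive's default field delimiter) is illustrative, and `sanitize_for_textfile` is a hypothetical name.

```python
import re

# Assumed set of characters that can break rows or columns in a Hive
# textfile table: carriage return, line feed, and \x01 (Hive's default
# field delimiter). Adjust this to match the legacy tool's actual behavior.
_ROW_BREAKING = re.compile(r"[\r\n\x01]")


def sanitize_for_textfile(value: str) -> str:
    """Replace each row-breaking character with a single space."""
    return _ROW_BREAKING.sub(" ", value)
```

Each offending character maps to exactly one space, so a Windows-style line ending becomes two spaces. Whether that matches the legacy output should be verified against real data.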

It’s important to note that these differences exist not only between the official code of DataX and Apache SeaTunnel but also between versions customized by individual companies. Even if some data conversion logic seems unreasonable, it must remain consistent with the original to avoid impacting business use.

In summary, through detailed comparisons, including the correct handling of numbers, strings, datetime, and other field types and special symbols, we strictly controlled the consistency of data before and after migration. Any differences found were recorded and adjusted until our tool, Apache SeaTunnel, achieved the same effect as DataX, ensuring the accuracy of data migration.

Gray Release/Shadow Running Scheme

Regarding the gray release of software components, it typically refers to the practice of replacing only a portion of the servers during an upgrade to mitigate the risks associated with a full upgrade. When upgrading Apache SeaTunnel components, such as from version 2.3.1 to 2.3.2, we adopted a similar strategy.

However, this approach has a limitation: which tasks end up on the new version is effectively random. Sometimes we prefer key tasks to stay on the old version while non-key tasks move to the new one, keeping key tasks stable while letting non-key tasks benefit from the new version’s improvements. We therefore added a version parameter to our general synchronization script to control graying at the task level, which comes down to a single line:

/home/q/dis/apache-seatunnel-${version}/bin/start-seatunnel-flink-13-connector-v2.sh

Combined with our platform’s features, this lets us select tasks in batches and fill in their version numbers. Depending on the risk or impact of a change, we can flexibly choose either of the two gray release schemes above.
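The task-level version selection can be pictured as a lookup with a default. The sketch below is hypothetical: the task names and the `TASK_VERSIONS` mapping are invented for illustration (in practice the version would come from the scheduling platform), while the path template and the version numbers 2.3.1 and 2.3.2 are taken from the text above.

```python
DEFAULT_VERSION = "2.3.1"  # the stable version that key tasks stay on

# Hypothetical per-task overrides, e.g. filled in via the platform's batch edit.
TASK_VERSIONS = {"non_key_task_a": "2.3.2"}


def launcher_path(task_name, overrides=TASK_VERSIONS, default=DEFAULT_VERSION):
    """Return the SeaTunnel start script for a task, honoring its version override."""
    version = overrides.get(task_name, default)
    return (
        f"/home/q/dis/apache-seatunnel-{version}"
        "/bin/start-seatunnel-flink-13-connector-v2.sh"
    )
```

Tasks without an override fall back to the default, so rolling a task back is a matter of deleting its entry.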

However, the main focus of this article is a more refined release strategy for data during the early stages of migration — task (or table) level gray release, or more accurately, “shadow running”. In the financial sector, the accuracy and completeness of data are crucial, especially when it comes to key SLA data tables, where no margin for error is tolerated.

Since the company has a dual data center architecture, all synchronization tasks run simultaneously in both data centers, and the backup data center’s tables are normally unused. This gave us an ideal environment for shadow running. In the early stages, SeaTunnel ran in the backup data center while DataX continued to run in the main data center, and the two ran in parallel for half a month. During that time, we continuously monitored both components, including but not limited to running time, resource consumption, record counts, and data values. Only after everything met expectations did we officially switch the main data center over.

For running efficiency, we required that for every task lasting more than 2 minutes, SeaTunnel’s processing time be lower than DataX’s, with no exceptions. (Why 2 minutes? DataX runs locally and skips steps such as submitting the job to a cluster, so for jobs with very few records DataX is inevitably faster than Apache SeaTunnel.) Whenever SeaTunnel’s running time exceeded DataX’s, we immediately started an in-depth analysis to identify the bottleneck and optimize it. One example was optimizing the split algorithm for MongoDB synchronization, which reduced task duration by more than 20%.
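The runtime rule can be expressed as a simple filter over per-task timings. This is a sketch under assumptions: it takes the DataX duration as the baseline for the 2-minute threshold (the text does not say which tool's duration is meant), and the `TaskTiming` structure and function name are hypothetical.

```python
from dataclasses import dataclass

THRESHOLD_SECONDS = 120  # the 2-minute cutoff described above


@dataclass
class TaskTiming:
    task: str
    datax_seconds: float
    seatunnel_seconds: float


def tasks_needing_analysis(timings):
    """Return names of tasks over the threshold where SeaTunnel was not faster.

    Assumption: the threshold is applied to the DataX duration.
    """
    return [
        t.task
        for t in timings
        if t.datax_seconds > THRESHOLD_SECONDS
        and t.seatunnel_seconds >= t.datax_seconds
    ]
```

Short tasks are skipped deliberately, since job submission overhead dominates their running time.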

Resource consumption is also an important metric to monitor. We were prepared to let Apache SeaTunnel consume somewhat more resources, but in practice it achieved shorter running times with essentially the same resources, with improvements of up to 40%.

More critically, in terms of data consistency verification, we performed precise comparisons of the data in Hive tables. This work was not limited to comparing quantities but also involved precise verification of data values. We ensured that only after thoroughly verifying that the data processed by both systems was completely consistent in all dimensions, with no deviations, did we proceed with the official system switch.

In this process, the role of data comparison tools was indispensable. Given that our primary target data storage platform is Hive, we specifically developed a data comparison tool for Hive tables. This tool could not only compare the number of data records but also compare the values of records column by column, with differences in comparison results output to a table for easy user review. This allowed us to comprehensively compare data for half a month, i.e., hundreds of millions of records, effectively avoiding the potential omissions of relying solely on partial data sampling comparisons.
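The core logic of such a comparison, record counts plus column-by-column value checks, can be sketched in a few lines. This is not the tool described above: it works on small in-memory lists of dicts rather than Hive tables, assumes each row has a unique, sortable key, and uses hypothetical names throughout.

```python
def compare_tables(baseline_rows, candidate_rows, key, columns):
    """Compare two row sets by key and report counts plus per-column differences.

    Each diff is a tuple (key, column, baseline_value, candidate_value).
    A column of None marks a row that exists on only one side.
    Assumes keys are unique within each row set and mutually sortable.
    """
    baseline = {row[key]: row for row in baseline_rows}
    candidate = {row[key]: row for row in candidate_rows}
    diffs = []
    for k in sorted(baseline.keys() | candidate.keys()):
        if k not in candidate:
            diffs.append((k, None, "present", "missing"))
        elif k not in baseline:
            diffs.append((k, None, "missing", "present"))
        else:
            for col in columns:
                b, c = baseline[k].get(col), candidate[k].get(col)
                if b != c:
                    diffs.append((k, col, b, c))
    return {
        "baseline_count": len(baseline_rows),
        "candidate_count": len(candidate_rows),
        "diffs": diffs,
    }
```

At the scale of hundreds of millions of records, the same idea would more plausibly be expressed as a full outer join on the key inside the warehouse, with the differing rows written to a result table.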

Master-Slave Switching Scheme

Even with such strict gray release and shadow running schemes, we were still not at ease.

Facing a still immature new technology, especially when the company lacks corresponding technical experts, encountering unforeseen technical challenges is particularly dangerous. If Apache SeaTunnel encounters unknown exceptions that cannot be quickly resolved, it will seriously threaten business decision-making and data analysis processes, and may even lead to major production incidents. Given the company’s strict SLA requirements, we have almost no leeway for in-depth research and troubleshooting in the event of a failure.

To address these risks, we designed a master-slave scheme from the component level. The core of this scheme is to ensure that, in the event of an Apache SeaTunnel failure, DataX can quickly and seamlessly take over data synchronization tasks to maintain business continuity.

As shown in the following image, whether it’s new table synchronization tasks or historical migration tasks, we have two sets of scripts for each, allowing us to quickly switch to another component for synchronization if the problem cannot be quickly resolved.
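The idea of keeping two interchangeable scripts per task can be illustrated with a small wrapper. Note the assumptions: the text describes a switch made when a problem cannot be resolved quickly, which may well be a manual decision, whereas this sketch automates the fallback for illustration. The function names are hypothetical, and the commands passed in would be the task's SeaTunnel and DataX scripts.

```python
import subprocess


def _succeeds(cmd):
    """Run a command and report whether it exited with status 0."""
    try:
        return subprocess.run(cmd).returncode == 0
    except OSError:
        # The command could not be started at all (e.g. script not found).
        return False


def run_with_fallback(primary_cmd, fallback_cmd):
    """Try the primary script; if it fails, run the fallback script.

    Returns "primary" or "fallback" for whichever succeeded, or None if
    both failed.
    """
    if _succeeds(primary_cmd):
        return "primary"
    if _succeeds(fallback_cmd):
        return "fallback"
    return None
```

A fallback like this is only safe if the second run overwrites whatever the failed first run may have partially written to the target, which the sketch does not handle.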

Conclusion

In the process of migrating data integration tools to Apache SeaTunnel, we focused on comprehensive detail comparison, such as field type and special character handling, and implemented strict gray release, shadow running, and master-slave switching schemes to ensure the timeliness, accuracy, and business continuity of data.

Overall, this migration thoroughly considered various potential challenges and risks, and made corresponding countermeasures during implementation, demonstrating the rigor of data integration tool migration.