JoyaData’s Exploration and Practice Based on Apache SeaTunnel

Apache SeaTunnel
9 min read · May 27, 2024


With the continual evolution of big data technologies, the role of data synchronization tools in enterprises has become increasingly crucial. To meet the complex and varied business demands, finding an efficient and flexible data synchronization tool is paramount.

In this article, we’ll share insights from Li Hongjun, the R&D manager at JoyaData Communications, on why Apache SeaTunnel was chosen, its application, and the experiences garnered. These practical insights will provide valuable references for new users to better understand and utilize SeaTunnel.

Why Choose SeaTunnel?

Initially, we opted for DataX and used it for about two to three years. However, as our business needs grew, we encountered several issues. For instance, DataX only supports single-node deployment and does not support clustering. Additionally, while DataX supports common databases such as Oracle and PostgreSQL, it does not support upsert writes and falls short in certain user-defined scenarios. These challenges led us to restart our research and select a new data synchronization tool.

After thorough research, we discovered the Apache SeaTunnel project on GitHub. SeaTunnel not only met our requirements for high availability, upsert support, and job pausing capabilities but also offered simpler configurations than DataX. Moreover, SeaTunnel’s scalability and the vibrancy of its community were also significant factors in our decision.

From research to testing and then to deployment, the whole process took about 2–3 months. At that time, we tested version 2.3.3 and were impressed with its performance. Currently, we have migrated from DataX to SeaTunnel and upgraded to the latest version 2.3.4.

What Problems Did SeaTunnel Solve?

Initially, we were using DataX. The upper layer was a web page that, through a scheduling engine (previously XXL-Job), assembled the job from the collected source data. The Java types of the source data were obtained from the configuration tables on the web page, with the source as the input and the target as the output, possibly with some transformations in between.

After migrating to Apache SeaTunnel, the process remained fundamentally unchanged, but we redesigned the web page style based on SeaTunnel.

Moreover, we replaced the scheduler with Apache DolphinScheduler. DolphinScheduler supports many task node types, such as Shell, SQL, dependent nodes, Hive, and common data synchronization tools on the market, so we replaced XXL-Job with it.

The final architecture assembles parameters through the web page. Once assembled, they are sent to the scheduling center for execution, and the scheduling center uses its own monitoring system to relay information to the lower layers.

Experience Sharing

Why We Use This Architecture

Our main task now involves data integration and synchronization through the web page. We transmit the information about the data source and the destination to the lower layer through drag-and-drop operations. For example, we transmit the names of the source and destination tables and then generate the table structures via auto-table creation, based on the Java types of the source data. Using FreeMarker templates, we assemble the source, sink, and transform components, including JDBC and Hive, into objects and dynamically generate the configuration files required by SeaTunnel. The lower layer executes command-line tasks scheduled by Apache DolphinScheduler.
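To make this concrete, here is a minimal sketch of the kind of job file such a template might produce, assuming a JDBC (MySQL-compatible) source and a Hive sink; the hosts, credentials, and table names are hypothetical placeholders, and the exact options depend on the connector versions in use.

```
env {
  parallelism = 2
  job.mode = "BATCH"
}

source {
  Jdbc {
    # Hypothetical MySQL-compatible source assembled from the web page parameters
    url = "jdbc:mysql://source-host:3306/demo_db"
    driver = "com.mysql.cj.jdbc.Driver"
    user = "demo_user"
    password = "demo_password"
    query = "SELECT id, name, updated_at FROM source_table"
    result_table_name = "source_table"
  }
}

transform {
  # Optional transforms (field mapping, filtering, and so on) would be filled in here
}

sink {
  Hive {
    # Hypothetical Hive target; the table can be pre-created by the auto-table-creation step
    source_table_name = "source_table"
    table_name = "default.target_table"
    metastore_uri = "thrift://metastore-host:9083"
  }
}
```

In our setup, a FreeMarker template fills in the env, source, transform, and sink blocks from the parameters collected on the web page, and DolphinScheduler then invokes the SeaTunnel command line with the generated file.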

During the data synchronization process, we focus on synchronization performance and ease of use. Task status and performance metrics are monitored and collected by our improved DolphinScheduler and sent to a Kafka message queue. Our alert center issues alerts based on task success or failure, monitors task types, and flags performance bottlenecks. Read/write throughput is extracted from the logs through interfaces and displayed on the web page, including real-time progress and curve charts.

For performance testing, we found that the data synchronization speed from TDSQL to Kafka is about 90,000 to 100,000 records per second when handling about 300 million records. From TDSQL to OSS, the speed can sometimes reach 200,000 records per second. These tests confirmed SeaTunnel's high efficiency.

Performance Issue Identification

After participating in the community and joining several user groups, I noticed many people asking about performance issues, such as why the speed is very slow. When identifying performance issues, we usually consider two scenarios: slow reading from the source, and slow writing to the target.

When both reading and writing might be slow, such as in a TDSQL-to-TDSQL job, we might first land the data in a file, because writing to a file is generally faster than writing to a database. This lets us first check the reading performance from TDSQL to a file, and then the writing performance from the file to TDSQL, to determine whether the issue lies in reading or writing.

Also, for HBase writing, we noticed that writing with put is slow, while the bulkload method is faster. When identifying synchronization performance issues, it is important to distinguish whether the problem is with reading or writing. We can use a Console sink to test the pure reading performance and then test the writing performance separately. In a combined read/write job, if writing is slow, reading will also slow down, making it difficult to tell from task monitoring alone whether reading or writing is the bottleneck, so we use testing aids such as local files or a Console sink to assess performance.
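As an illustration of the pure-read test, the sketch below points a hypothetical JDBC source at a Console sink, so almost all of the job's time is spent reading; swapping the Console sink for a file sink gives the "land the data in a file first" variant described above. Connection details and table names are placeholders, not our actual configuration.

```
env {
  parallelism = 4
  job.mode = "BATCH"
}

source {
  Jdbc {
    # Hypothetical TDSQL/MySQL-compatible source; only read throughput is being measured
    url = "jdbc:mysql://source-host:3306/demo_db"
    driver = "com.mysql.cj.jdbc.Driver"
    user = "demo_user"
    password = "demo_password"
    query = "SELECT id, name, updated_at FROM big_table"
    result_table_name = "big_table"
  }
}

sink {
  # Console discards rows cheaply, so the reported throughput reflects read speed;
  # replace it with a file sink to test the write side against the landed file
  Console {
    source_table_name = "big_table"
  }
}
```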

Problem Resolution

The most common issue encountered is JAR conflicts, especially when selecting drivers due to database version incompatibility.

For those familiar with SeaTunnel's Zeta engine, the lib directory contains Hadoop and Hive packages as well as database drivers, which are prone to conflicts. We are working on a new feature that provides strict classloader isolation for all connectors. Previously, the Hadoop packages were not isolated, leading to conflicts when using Hive or Hadoop. Once this feature is completed, each connector will have its own independent package directory, and the engine's Hadoop packages will also be stored independently. This will allow different versions of Hive, Hadoop, and databases to be supported in the same job or cluster.

This feature is planned for release in version 2.4, expected to bring significant changes. The current version is 2.3, so related changes will be implemented in 2.4.

Advice for New Users of SeaTunnel

For new users of SeaTunnel, here are some experiences that might help you avoid some pitfalls:

  • Read the official documentation: Start by thoroughly reading the official documents to understand the basic configurations and methods of use. The official documentation provides detailed installation, configuration, and operation guides, which are the best resources for beginners.
  • Download and run the official package: If you do not want to deal directly with the source code, you can download the official release package, run it on a server, and familiarize yourself with SeaTunnel's basic operational processes and mechanics.
  • Dive into the source code: If you wish to gain a deeper understanding of how SeaTunnel works, pull the source code, examine the configuration files, run and debug the code, and understand how each node operates and how data flows.
  • Adjust configurations and source code: During operation, if you find that certain functions do not meet your needs, you can adjust the configuration files or modify the source code. For instance, you may sometimes need to handle fields whose mapping relationships do not match, and modifying the source code can address this.
  • Organize the source code flow: While learning, it is advisable to draw flowcharts of the source code to better understand SeaTunnel's internal logic and the implementation of key features. For example, search for specific keywords (such as "sharding") to locate the relevant classes and methods, which helps you study and modify the source code more efficiently.

These suggestions should help new users get started with SeaTunnel more quickly and resolve issues more systematically. Hopefully, you will be able to use SeaTunnel smoothly and enhance your work efficiency.

How to Learn Quickly?

When learning and using SeaTunnel, the following methods and resources can help you master the tool more efficiently:

Use Examples for Debugging

Examples are a key resource for learning and debugging SeaTunnel. Almost all connectors and jobs can be run in the Examples module, except those requiring a cloud environment; if you have a cloud environment prepared, those can be debugged in the Examples as well. This helps you become familiar with and walk through the entire process.
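For instance, the simplest kind of job to run and step through in the Examples module needs no external environment at all; the minimal sketch below wires a FakeSource to a Console sink, with arbitrary field names and row counts.

```
env {
  parallelism = 1
  job.mode = "BATCH"
}

source {
  FakeSource {
    # Generates synthetic rows so the job can run without any external system
    row.num = 16
    schema = {
      fields {
        id = "bigint"
        name = "string"
        score = "double"
      }
    }
    result_table_name = "fake"
  }
}

sink {
  Console {
    source_table_name = "fake"
  }
}
```

Running a config like this inside the Examples module lets you set breakpoints and follow a record all the way from source to sink.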

The Importance of the E2E Module

The E2E module in SeaTunnel’s code contains the usage methods for all connectors and provides detailed test cases. By reviewing and running the test cases in the E2E module, you can gain a comprehensive understanding of the usage and processes of various connectors.

Learning Path and Reference Materials

  • Official documentation: Read the official documents to understand the various examples and parameter configurations. A Chinese version of the documentation will start to be offered from version 2.3.5; it may not be complete initially but will be gradually improved.

  • Required parameters: Focus on required parameters first when configuring; optional parameters usually have default values and can be omitted.
  • Local debugging: Use Docker to run E2E tests locally, facilitating quick familiarization.
  • Community and contributions: We also hope that community users and contributors will help improve the documentation to help more new users understand and use SeaTunnel more quickly.

By utilizing Examples and the E2E module, combined with official documentation and community resources, you can learn and use SeaTunnel efficiently. Hopefully, these suggestions will help you avoid detours and master this tool more quickly.

How Has Using SeaTunnel Impacted Your Personal Technical Growth?

Yes. Previously, we had not been involved in architecture at this depth. By delving into SeaTunnel's architecture, especially technologies like Hazelcast for distributed storage and task scheduling, we have improved our understanding of distributed systems and our ability to apply them.

Additionally, SeaTunnel's read/write plugins and transmission features employ technologies such as SPI and AutoService, which are not commonly encountered in everyday company coding. These technologies significantly expand our knowledge and enhance our skills. Overall, SeaTunnel not only enriches our technical experience but also broadens our knowledge base, providing strong support for personal career development.

Does the Community Plan to Support Bulkload?

Currently, we use the put method for writing to HBase, which is slow. I have seen some users in the community ask whether a bulkload approach could be supported, and I'm not sure if there are plans for this. Previously, a contributor discussed this issue with me, but I'm not clear on the subsequent progress. If the community has no plans to support bulkload, we plan to implement it ourselves first and then contribute it to the community.

How to Rename Columns?

When reading data from HBase, colons in column names cause conversion issues. We usually handle column names through a transform. For example, we can add rules in the transform to replace specific characters in column names with other characters. This is indeed how we currently implement column renaming, by capturing the parts before and after the colon.

Professor Gao: We can further discuss this solution. I suggest creating an issue or sending an email to detail your design proposal and see if it can be merged into the main branch.
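As a concrete illustration of the transform-based renaming discussed above, here is a hedged sketch using the FieldMapper transform available in recent 2.3.x releases; the HBase column family, qualifiers, and target names are hypothetical, and this is not necessarily the exact rule set we apply in production.

```
transform {
  FieldMapper {
    source_table_name = "hbase_source"
    result_table_name = "renamed"
    # Map "family:qualifier" column names to colon-free names before they reach the sink
    field_mapper = {
      rowkey = "rowkey"
      "info:name" = "info_name"
      "info:age" = "info_age"
    }
  }
}
```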

Is There a Tool in Hazelcast to View Underlying Executions and Specific Storage Actions?

I have a question about using Hazelcast before, as it seems to have a high barrier to entry. Is there a convenient tool to view the contents stored in the engine?

Actually, we use Hazelcast for three main purposes:

  • Cluster management: Hazelcast provides strong cluster management capabilities.
  • RPC communication: Hazelcast is used to implement RPC communication between cluster nodes.
  • Distributed memory grid: cluster status, monitoring data, and runtime states are stored in Hazelcast's distributed in-memory grid, effectively replacing ZooKeeper.

Through Hazelcast's management module, you can clearly view the current cluster's node information, the underlying IMap list, the amount of data stored in each IMap, request frequency, and response latency.

I recommend using Hazelcast Management Center. Although it is not open source, its deployment and configuration are simple, and it makes it easy to view and manage Hazelcast's internal information.

Moreover, Hazelcast provides interfaces that can retrieve detailed monitoring information. If you need to customize an interface or integrate third-party monitoring tools, you can use Hazelcast's JMX interface. If you prefer a ready-made tool, you can use Hazelcast Management Center directly.

In summary, Apache SeaTunnel not only solved many problems we encountered during data synchronization but also significantly enhanced our work efficiency. By sharing the practical application experiences of JoyaData Communications, we hope to help more users better understand and utilize SeaTunnel, promoting the application of open-source data synchronization tools in more scenarios. Thank you to every developer and user who has contributed to SeaTunnel. Let’s work together to make SeaTunnel even better!
