Handling 2TB of Daily Data in Internet Banking with Ease Using Apache SeaTunnel
Editor & Translator | Debra Chen
In China, the trend toward digitalization has driven the rapid development of Internet banking. In recent years, internet banks have been actively expanding their online business, leveraging big data technology to enhance risk control, and undergoing digital transformation. As emerging internet banks ride the wave of digital reform, the data integration platform Apache SeaTunnel gives billions of data points a fast-flowing pipeline.
At the community’s online user meeting in June, a big data engineer from an internet bank (hereafter referred to as “the bank”) shared their experience and practices in applying Apache SeaTunnel to internet banking. The following is a summary of the presentation for reference:
Presentation Overview
- Presenter: Chen Wei
- Date: June 26, 2024
- Topic: Application and Customization Practices of Apache SeaTunnel 2.1.3 in Data Integration
Background
With the growing demand for data integration in the bank, we needed a tool that could support configurable development, heterogeneous data source access, and provide high performance and efficiency in data integration. After thorough investigation and consideration, Apache SeaTunnel met these needs with its powerful data processing capabilities, leading us to adopt Apache SeaTunnel.
Application Scenarios
Apache SeaTunnel plays a crucial role in the following three main scenarios at the bank:
- Data Acceleration: The data warehouse processes models at the data model layer and, upon completion, pushes the data to the OLAP dedicated engine (currently configured with the ClickHouse engine) to support real-time user queries.
- Data Pushing: Processed result data from internal management systems, such as indicator management systems and label management systems, is pushed to target data sources (MySQL).
- Data Collection: Enhancing the timeliness of business system data by promptly collecting it to the target data source.
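For illustration, a data-acceleration job of the first kind could be described with a SeaTunnel 2.1.x (Spark engine) HOCON config along the lines of the sketch below. Hosts, table names, fields, and credentials are placeholders, not the bank's actual setup:

```hocon
# Sketch of a SeaTunnel 2.1.x Spark job: push a processed model-layer
# table from Hive to ClickHouse for real-time user queries.
env {
  spark.app.name         = "model_layer_to_clickhouse"
  spark.executor.instances = 2
  spark.executor.cores     = 2
  spark.executor.memory    = "2g"
}

source {
  hive {
    # Query against the finished model-layer table (placeholder name)
    pre_sql = "select id, metric, dt from dw.model_result"
    result_table_name = "model_result"
  }
}

sink {
  clickhouse {
    host     = "ck-node1:8123"
    database = "olap"
    table    = "model_result"
    fields   = ["id", "metric", "dt"]
    username = "writer"
    password = "******"
  }
}
```

The same source/sink structure covers the data-pushing scenario by swapping the ClickHouse sink for a Jdbc sink pointed at MySQL.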
SeaTunnel Customization (V2.1.3)
To better meet the bank’s needs, we made a series of custom improvements to SeaTunnel:
- Data Source Support: Added support for data sources not directly supported by Spark, such as Transwarp Inceptor and Hive transactional tables.
- Plugin Optimization:
  1. Added custom plugins.
  2. Iteratively optimized existing plugins such as Jdbc, ClickHouse, Hive, and ElasticSearch.
  3. Other runtime optimizations.
Customization of Specific Plugins
- Jdbc:
  - Added support for multiple queries and automatic partitioning based on specified fields.
  - Added PreSQL execution support to the Jdbc sink.
  - Added support for Inceptor transactional tables.
- ClickHouse & Hive:
  - Added PreSQL execution support.
  - Adjusted the way data is written to Hive.
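To make the Jdbc customizations concrete, a job using them might look like the sketch below. Since the partitioning and PreSQL features are the bank's in-house extensions, the option names (`partition_column`, `partition_num`, `pre_sql`) are hypothetical placeholders rather than stock SeaTunnel 2.1.3 options:

```hocon
# Sketch only: option names for the custom features are illustrative.
source {
  jdbc {
    driver = "com.mysql.cj.jdbc.Driver"
    url    = "jdbc:mysql://db-host:3306/biz"
    table  = "orders"
    user   = "reader"
    password = "******"
    # custom: automatic partitioning on a specified field
    partition_column = "order_id"
    partition_num    = 8
    result_table_name = "orders_src"
  }
}

sink {
  jdbc {
    driver  = "com.mysql.cj.jdbc.Driver"
    url     = "jdbc:mysql://target-host:3306/mart"
    dbtable = "orders_mart"
    user    = "writer"
    password = "******"
    # custom: SQL executed before the write, e.g. clearing today's partition
    pre_sql = "delete from orders_mart where dt = '${dt}'"
  }
}
```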
SeaTunnel Integration Application
Integration with Apache Livy
We integrated Apache SeaTunnel into the existing Apache Livy service, gaining in startup speed, security, and flexibility.
- Quick startup: Through the Livy Client, multiple SeaTunnel Jobs can run under the same SparkContext, improving startup efficiency.
- Security: Jobs access the big data platform through Livy with client security authentication, so the cluster itself is never exposed directly, protecting the big data cluster’s security.
- Flexibility: Integration with Livy allows submitting SeaTunnel tasks through Livy jobs without producing local configuration files, enhancing system flexibility.
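As an illustration of the Livy-based submission path, the sketch below builds the JSON payload a client might POST to Livy's `/batches` REST endpoint to launch a SeaTunnel Spark job. The jar path, main class, and config location are placeholder assumptions; the bank's actual integration uses the Livy Client API to share a SparkContext and avoids local config files:

```python
import json
from urllib import request

# Placeholder: adjust to your Livy host.
LIVY_URL = "http://livy-host:8998/batches"

def build_seatunnel_batch(jar_path, config_path):
    """Build a Livy /batches payload that launches a SeaTunnel Spark job.

    jar_path and config_path are assumed to be reachable by the cluster
    (e.g. on HDFS); the main class shown is the 2.1.x Spark entry point.
    """
    return {
        "file": jar_path,
        "className": "org.apache.seatunnel.SeatunnelSpark",
        "args": ["--config", config_path],
        "conf": {"spark.executor.instances": "2"},
    }

payload = build_seatunnel_batch(
    "hdfs:///apps/seatunnel/seatunnel-core-spark.jar",
    "hdfs:///apps/seatunnel/jobs/model_to_ck.conf",
)

req = request.Request(
    LIVY_URL,
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# request.urlopen(req)  # uncomment to actually submit the batch
print(payload["className"])
```

Batch submission like this trades the shared-SparkContext startup gain for simplicity; the interactive-session Job API is what allows multiple SeaTunnel jobs to reuse one context.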
Integration with Apache DolphinScheduler
- Shared data sources: Uses the same data source configuration as SQL and other tasks, reducing the complexity of configuration changes.
- Consistent parameters: Supports parameter configuration consistent with the scheduling system, making it easier for users to learn and use.
- Consistent metadata: The bank has developed support for lineage-related features, providing task-level metadata configuration comparable to SQL and other task types, which also enables automatic triggering by the system.
SeaTunnel Deployment
- Projects integrated: 7
- Tasks integrated: 2000+
- Daily instances: 2000+
- Daily data volume: 2TB
- Supported data sources: Transwarp Inceptor, MySQL, Oracle, ElasticSearch, remote HBase, ClickHouse
SeaTunnel Summary and Outlook
- SeaTunnel currently meets our data integration needs, mainly on the data application side. Future work includes extending support to data collection to improve overall data pipeline efficiency.
- SeaTunnel’s use in bulk data collection needs improvement, especially sharding support; on the scheduling system side, scheduling by markers (database markers, file markers, etc.) needs to be added.
- Enhancing the metrics data collection for SeaTunnel data integration;
- Optimizing the parallelism of SeaTunnel data integration (especially for ES write optimization).
Joining the SeaTunnel Community
We welcome developers and enterprises interested in data integration to join the SeaTunnel community to jointly discuss and promote the development of data integration technology.
About Apache SeaTunnel
Apache SeaTunnel is an easy-to-use, ultra-high-performance distributed data integration platform that supports real-time synchronization of massive amounts of data and can synchronize hundreds of billions of records per day stably and efficiently.
Welcome to fill out this form to be a speaker of Apache SeaTunnel: https://forms.gle/vtpQS6ZuxqXMt6DT6 :)
Why do we need Apache SeaTunnel?
Apache SeaTunnel does everything it can to solve the problems you may encounter in synchronizing massive amounts of data.
- Data loss and duplication
- Task buildup and latency
- Low throughput
- Long application-to-production cycle time
- Lack of application status monitoring
Apache SeaTunnel Usage Scenarios
- Massive data synchronization
- Massive data integration
- ETL of large volumes of data
- Massive data aggregation
- Multi-source data processing
Features of Apache SeaTunnel
- Rich components
- High scalability
- Easy to use
- Mature and stable
How to get started with Apache SeaTunnel quickly?
Want to experience Apache SeaTunnel quickly? SeaTunnel 2.1.0 takes 10 seconds to get you up and running.
https://seatunnel.apache.org/docs/2.1.0/developement/setup
How can I contribute?
We invite all partners who are interested in making local open-source global to join the Apache SeaTunnel contributors family and foster open-source together!
Submit an issue:
https://github.com/apache/seatunnel/issues
Contribute code to:
https://github.com/apache/seatunnel/pulls
Subscribe to the community development mailing list:
dev-subscribe@seatunnel.apache.org
Development mailing list:
dev@seatunnel.apache.org
Join Slack:
https://join.slack.com/t/apacheseatunnel/shared_invite/zt-1kcxzyrxz-lKcF3BAyzHEmpcc4OSaCjQ
Follow Twitter:
https://twitter.com/ASFSeaTunnel
Join us now!❤️❤️