Handling 2TB of Daily Data in Internet Banking with Ease Using Apache SeaTunnel
Editor & Translator | Debra Chen
In China, the trend toward digitalization has driven the rapid development of Internet banking. In recent years, internet banks have been actively expanding their online business, leveraging big data technology to enhance risk control, and undergoing digital transformation. As emerging internet banks ride the wave of digital reform, the data integration platform Apache SeaTunnel gives billions of data points a fast-flowing pipeline.
At the community’s online user meeting in June, a big data engineer from an internet bank (hereafter referred to as “the bank”) shared their experience and practices in applying Apache SeaTunnel to internet banking. The following is a summary of the presentation for reference:
Presentation Overview
- Presenter: Chen Wei
- Date: June 26, 2024
- Topic: Application and Customization Practices of Apache SeaTunnel 2.1.3 in Data Integration
Background
With the growing demand for data integration in the bank, we needed a tool that could support configurable development, heterogeneous data source access, and provide high performance and efficiency in data integration. After thorough investigation and consideration, Apache SeaTunnel met these needs with its powerful data processing capabilities, leading us to adopt Apache SeaTunnel.
Application Scenarios
Apache SeaTunnel plays a crucial role in the following three main scenarios at the bank:
- Data Acceleration: The data warehouse processes models at the data model layer and, upon completion, pushes the data to the OLAP dedicated engine (currently configured with the ClickHouse engine) to support real-time user queries.
- Data Pushing: Processed result data from internal management systems, such as indicator management systems and label management systems, is pushed to target data sources (MySQL).
- Data Collection: Enhancing the timeliness of business system data by promptly collecting it to the target data source.
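For illustration, a data-acceleration job of the first kind could be described with a SeaTunnel 2.1.x (Spark engine) HOCON config along the lines of the sketch below. Hosts, table names, fields, and credentials are placeholders, not the bank's actual setup:

```hocon
# Sketch of a SeaTunnel 2.1.x Spark job: push a processed model-layer
# table from Hive to ClickHouse for real-time user queries.
env {
  spark.app.name         = "model_layer_to_clickhouse"
  spark.executor.instances = 2
  spark.executor.cores     = 2
  spark.executor.memory    = "2g"
}

source {
  hive {
    # Query against the finished model-layer table (placeholder name)
    pre_sql = "select id, metric, dt from dw.model_result"
    result_table_name = "model_result"
  }
}

sink {
  clickhouse {
    host     = "ck-node1:8123"
    database = "olap"
    table    = "model_result"
    fields   = ["id", "metric", "dt"]
    username = "writer"
    password = "******"
  }
}
```

The same source/sink structure covers the data-pushing scenario by swapping the ClickHouse sink for a Jdbc sink pointed at MySQL.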
SeaTunnel Customization (V2.1.3)
To better meet the bank’s needs, we made a series of custom improvements to SeaTunnel:
- Data Source Support: Added support for data sources not directly supported by Spark, such as Transwarp Inceptor and Hive transactional tables.
- Plugin Optimization:
  1. Added custom plugins.
  2. Iteratively optimized existing plugins such as Jdbc, ClickHouse, Hive, and ElasticSearch.
  3. Other runtime optimizations.
Customization of Specific Plugins
- Jdbc:
  - Added support for multiple queries and automatic partitioning based on specified fields.
  - Added PreSQL execution support to the Jdbc sink.
  - Added support for Inceptor transactional tables.
- ClickHouse & Hive:
  - Added PreSQL execution support.
  - Adjusted the way data is written to Hive.
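To make the Jdbc customizations concrete, a job using them might look like the sketch below. Since the partitioning and PreSQL features are the bank's in-house extensions, the option names (`partition_column`, `partition_num`, `pre_sql`) are hypothetical placeholders rather than stock SeaTunnel 2.1.3 options:

```hocon
# Sketch only: option names for the custom features are illustrative.
source {
  jdbc {
    driver = "com.mysql.cj.jdbc.Driver"
    url    = "jdbc:mysql://db-host:3306/biz"
    table  = "orders"
    user   = "reader"
    password = "******"
    # custom: automatic partitioning on a specified field
    partition_column = "order_id"
    partition_num    = 8
    result_table_name = "orders_src"
  }
}

sink {
  jdbc {
    driver  = "com.mysql.cj.jdbc.Driver"
    url     = "jdbc:mysql://target-host:3306/mart"
    dbtable = "orders_mart"
    user    = "writer"
    password = "******"
    # custom: SQL executed before the write, e.g. clearing today's partition
    pre_sql = "delete from orders_mart where dt = '${dt}'"
  }
}
```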
SeaTunnel Integration Application
Integration with Apache Livy
We integrated Apache SeaTunnel into the existing Apache Livy service, gaining in startup speed, security, and flexibility.
- Quick startup: Through the Livy Client, multiple SeaTunnel Jobs can run under the same SparkContext, improving startup efficiency.
- Security: Jobs access the big data platform through Livy with client security authentication, so the cluster itself is never exposed directly, protecting the big data cluster’s security.
- Flexibility: Integration with Livy allows submitting SeaTunnel tasks through Livy jobs without producing local configuration files, enhancing system flexibility.
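As an illustration of the Livy-based submission path, the sketch below builds the JSON payload a client might POST to Livy's `/batches` REST endpoint to launch a SeaTunnel Spark job. The jar path, main class, and config location are placeholder assumptions; the bank's actual integration uses the Livy Client API to share a SparkContext and avoids local config files:

```python
import json
from urllib import request

# Placeholder: adjust to your Livy host.
LIVY_URL = "http://livy-host:8998/batches"

def build_seatunnel_batch(jar_path, config_path):
    """Build a Livy /batches payload that launches a SeaTunnel Spark job.

    jar_path and config_path are assumed to be reachable by the cluster
    (e.g. on HDFS); the main class shown is the 2.1.x Spark entry point.
    """
    return {
        "file": jar_path,
        "className": "org.apache.seatunnel.SeatunnelSpark",
        "args": ["--config", config_path],
        "conf": {"spark.executor.instances": "2"},
    }

payload = build_seatunnel_batch(
    "hdfs:///apps/seatunnel/seatunnel-core-spark.jar",
    "hdfs:///apps/seatunnel/jobs/model_to_ck.conf",
)

req = request.Request(
    LIVY_URL,
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# request.urlopen(req)  # uncomment to actually submit the batch
print(payload["className"])
```

Batch submission like this trades the shared-SparkContext startup gain for simplicity; the interactive-session Job API is what allows multiple SeaTunnel jobs to reuse one context.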
Integration with Apache DolphinScheduler
- Shared data sources: Uses the same data source configuration as SQL and other tasks, reducing the complexity of configuration changes.
- Consistent parameters: Supports parameter configuration consistent with the scheduling system, making it easier for users to learn and use.
- Consistent metadata: The bank has developed support for lineage-related features, providing task-level metadata configuration comparable to SQL and other task types, which also enables automatic triggering by the system.
SeaTunnel Deployment
- Projects integrated: 7
- Tasks integrated: 2000+
- Daily instances: 2000+
- Daily data volume: 2TB
- Supported data sources: Transwarp Inceptor, MySQL, Oracle, ElasticSearch, remote HBase, ClickHouse
SeaTunnel Summary and Outlook
- SeaTunnel currently meets our data integration needs, mainly on the data application side. Future work includes extending support to data collection to improve overall data pipeline efficiency.
- SeaTunnel’s use in bulk data collection needs improvement, especially sharding support; on the scheduling system side, scheduling by markers (database markers, file markers, etc.) needs to be added.
- Enhancing the metrics data collection for SeaTunnel data integration;
- Optimizing the parallelism of SeaTunnel data integration (especially for ES write optimization).
Joining the SeaTunnel Community
We welcome developers and enterprises interested in data integration to join the SeaTunnel community to jointly discuss and promote the development of data integration technology.
About Apache SeaTunnel
Apache SeaTunnel is an easy-to-use, ultra-high-performance distributed data integration platform that supports real-time synchronization of massive amounts of data and can synchronize hundreds of billions of records per day stably and efficiently.
Welcome to fill out this form to be a speaker of Apache SeaTunnel: https://forms.gle/vtpQS6ZuxqXMt6DT6 :)
Why do we need Apache SeaTunnel?
Apache SeaTunnel does everything it can to solve the problems you may encounter in synchronizing massive amounts of data.
- Data loss and duplication
- Task buildup and latency
- Low throughput
- Long application-to-production cycle time
- Lack of application status monitoring
Apache SeaTunnel Usage Scenarios
- Massive data synchronization
- Massive data integration
- ETL of large volumes of data
- Massive data aggregation
- Multi-source data processing
Features of Apache SeaTunnel
- Rich components
- High scalability
- Easy to use
- Mature and stable
How to get started with Apache SeaTunnel quickly?
Want to experience Apache SeaTunnel quickly? SeaTunnel 2.1.0 takes 10 seconds to get you up and running.
https://seatunnel.apache.org/docs/2.1.0/developement/setup
How can I contribute?
We invite all partners who are interested in making local open-source global to join the Apache SeaTunnel contributors family and foster open-source together!
Submit an issue:
https://github.com/apache/seatunnel/issues
Contribute code to:
https://github.com/apache/seatunnel/pulls
Subscribe to the community development mailing list:
dev-subscribe@seatunnel.apache.org
Development mailing list:
dev@seatunnel.apache.org
Join Slack:
https://join.slack.com/t/apacheseatunnel/shared_invite/zt-1kcxzyrxz-lKcF3BAyzHEmpcc4OSaCjQ
Follow Twitter:
https://twitter.com/ASFSeaTunnel
Join us now!❤️❤️