Apache SeaTunnel (Incubating) Examples module design and implementation practice
Wang Jianda, a big data engineer from ThinkingData, shares the topic of “The Use and Design of the Apache SeaTunnel Examples Module” based on his secondary development experience. In addition to introducing the intent behind the Apache SeaTunnel Examples module design and how he developed the module, Wang Jianda also shares his experiences and stories of participating in community activities as an open-source contributor.
Personal profile
Wang Jianda is a Senior Big Data Engineer at ThinkingData, where he works in the Big Data R&D Department and is mainly responsible for the secondary development and R&D of the data integration system.
This sharing is divided into three parts:
- Getting to Know Apache SeaTunnel (Incubating)
- Examples of module design and use
- My story with the open-source community
01 Getting to know Apache SeaTunnel for the first time
First, I will share with you how I came to know Apache SeaTunnel (Incubating). ThinkingData, where I work, mainly provides user behavior analysis for the game industry and offers customers a big data intelligent analysis platform.
The lower left corner of the figure above shows the functional module division and architecture hierarchy of our platform, which is divided into the Data Access Layer, Business Application Layer, Real-time Computing Layer, Data Persistence Layer, and O&M Monitoring Layer. At first, I mainly worked on real-time computing and data access, but as the demand for data integration grew, I began working on the secondary development of our data integration platform.
The data integration platform is positioned as the data router between the company’s products and the customer’s business systems: it not only moves data within the product but also exchanges data with external systems, meeting the user’s needs for efficient and stable data flow.
Data synchronization scenarios include advertising placement, ad monetization, and ROI calculation.
There is a strong demand for data integration in this industry, namely connecting in-game user behavior data with advertising cost and advertising monetization data to enable ROI and other analyses. The module in the middle of the figure holds the standard user behavior analysis data of our system; users generally report their in-game behavior data to it. The module on the left is the cost side of user acquisition channels such as AppsFlyer and Adjust, whose acquisition cost data is brought into our system through the integration platform.
The advertising revenue platforms on the right, such as ironSource and TopOn, likewise feed advertising revenue data into our analysis platform, closing the loop between in-game behavior data, advertising cost data, and monetization data. At that point, we only need to divide the paid advertising revenue in the game by the advertising cost to get the ROI metric we often talk about: for example, $5,000 of in-game ad revenue against $4,000 of ad spend gives an ROI of 1.25. For this synchronization we used DataX.
However, growing customer scale and user volume brought more problems. To solve them, I did some research and got to know Apache SeaTunnel (Incubating).
DataX vs Apache SeaTunnel
This is my summary comparing DataX with Apache SeaTunnel.
Because Apache SeaTunnel supports the Spark and Flink engines, it natively supports distributed reading and writing, while the distributed version of DataX is a paid offering.
In terms of performance, by relying on the Flink and Spark frameworks, Apache SeaTunnel can be several times faster than DataX in some specific scenarios, while DataX is limited by the performance of a single server.
In terms of data volume, Apache SeaTunnel scales horizontally as the computing engine scales out, while DataX can only scale vertically by upgrading the server when the data volume becomes particularly large, which is difficult in terms of both cost and operation.
Finally, in terms of extensibility, both Apache SeaTunnel and DataX support a rich set of data sources for reading and writing, but for transforms, the open-source version of DataX offers only simple single-column processing, and it cannot easily transform, aggregate, and process source data with SQL the way Apache SeaTunnel can.
First impression of Apache SeaTunnel
Firstly, I was impressed by its multi-engine support. Apache SeaTunnel supports both Flink and Spark, which is very friendly to users with a diverse technology stack, since there is no need to choose between the two. Through application-layer encapsulation, Apache SeaTunnel can shield the application side from the underlying computing engine and, combined with techniques such as dynamic resource analysis, dispatch data synchronization tasks to computing clusters with a relatively reasonable load.
Second, Apache SeaTunnel is simple and easy to use. It can perform data synchronization through simple configuration, which lays a good foundation for subsequent visual configuration.
Third, its rich data sources. Thanks to the active contributions of the community, it now supports reading and writing dozens of data sources.
Finally, Apache incubation. A stable team and an active community are important criteria for anyone choosing a technology.
02 Examples module design and use
Why did we add the Examples module to Apache SeaTunnel?
The first goal is to let developers run the code faster and get intuitive feedback.
The second is to provide convenient local debugging: instead of packaging, deploying, and then debugging, or remote-debugging code from an IDE, developers can debug the framework code directly. When adding data source plug-ins for your own business, this also makes it very convenient to debug and verify them.
Examples module design and implementation
So far, the Examples module is divided into three parts:
- Flink Examples module
- Spark Example module
- Flink SQL Examples module (recently added)
Demo
Next, I will show you how to run the Examples module after getting the code, as well as its configuration, file structure, and precautions.
Flink Example demo
First, let’s run the code of the Flink Example. The code is mainly divided into three parts: configuration loading, parameter loading, and running the Apache SeaTunnel context.
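As a rough sketch, an entry class of this kind looks something like the following; the class names, resource path, and run entry point here are illustrative approximations, not the exact SeaTunnel API.

```java
// Illustrative sketch of a local Flink example entry point.
// The Seatunnel entry class and the "--config" argument style are
// assumptions for illustration, not the exact SeaTunnel API.
public class LocalFlinkExample {

    public static void main(String[] args) throws Exception {
        // 1. Configuration loading: resolve the demo config file from resources
        String configFile = LocalFlinkExample.class
                .getResource("/examples/fake_to_console.conf")
                .getPath();

        // 2. Parameter loading: pass the config path the way the
        //    command-line launcher would
        String[] seatunnelArgs = {"--config", configFile};

        // 3. Run the SeaTunnel context, which parses the config and
        //    dynamically loads the configured source/transform/sink plug-ins
        Seatunnel.run(seatunnelArgs);
    }
}
```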
After starting it up, there is a lot of output. Where does it come from? Let’s take a look at the configuration file.
The configuration file of Apache SeaTunnel mainly consists of four parts (a minimal example follows the list).
- The environment. You can choose the environment parameters according to the Flink deployment mode, such as local, application, session, or YARN.
- The data source. This demo uses FakeSourceStream, which simulates a stream with two fields, name and age.
- The transform, where data can be transformed, filtered, and aggregated through SQL.
- The sink. Here it is a ConsoleSink, which prints the results you want to debug to the console.
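Putting the four parts together, a minimal demo config looks roughly like this. The section layout follows the 2.1.x Flink examples, but treat the exact keys and option names as illustrative rather than authoritative:

```
env {
  # Flink job settings, e.g. parallelism
  execution.parallelism = 1
}

source {
  # Simulates a stream with two fields: name and age
  FakeSourceStream {
    result_table_name = "fake"
  }
}

transform {
  # Transform/filter/aggregate the source data with SQL
  sql {
    sql = "select name, age from fake"
  }
}

sink {
  # Print the results to the console for debugging
  ConsoleSink {}
}
```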
Now take the value FakeSourceStream from the config and search for it globally in the code. You can see that every plug-in, when developed, overrides a method named getPluginName. During the loading of the Apache SeaTunnel context, the framework uses this name to dynamically load the relevant plug-ins according to the configuration file, which makes debugging easy and gives a much more concrete feel for how the framework works.
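In other words, each plug-in declares the name that the framework matches against the config file. A simplified sketch (the real plug-in also implements the engine-specific source interface, omitted here):

```java
// Simplified sketch: a source plug-in declares its name by overriding
// getPluginName; the framework matches this against the "FakeSourceStream"
// block in the config file when loading plug-ins dynamically.
public class FakeSourceStream /* implements the Flink source interface */ {

    public String getPluginName() {
        return "FakeSourceStream";
    }
}
```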
Continuing execution jumps into the FakeSourceStream plug-in, which simulates output and generates data named Gary. By stepping over the breakpoints, you can easily print information to the console and get what you need for debugging. The other modules run in a similar way.
Let’s take a look at the Flink SQL configuration file. It models the read end and the write end as two tables, so the job can be regarded as data synchronization between two tables: the read-end table is the events table, the write-end table is the print table, and the transform in between is an insert-select statement. Looking at the output, you can see various values in f_type. If you stop the program, add a where condition such as f_type = 1, and run the class again, only records with f_type = 1 are printed to the console; the other records are filtered out.
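The setup described above can be pictured with Flink SQL along these lines; the table schemas and connector options are assumptions for illustration:

```sql
-- Illustrative Flink SQL: schemas and connector options are assumed.
-- The read end is modeled as an "events" table, the write end as a print table.
CREATE TABLE events (
  f_type INT,
  f_name STRING
) WITH (
  'connector' = 'datagen'
);

CREATE TABLE print_table (
  f_type INT,
  f_name STRING
) WITH (
  'connector' = 'print'
);

-- The transform is an insert-select; the WHERE condition filters the
-- output so that only records with f_type = 1 reach the console.
INSERT INTO print_table
SELECT f_type, f_name FROM events
WHERE f_type = 1;
```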
Spark example demo
Now take a look at the Spark example. The batch job defined by the configuration file of the Spark example module exits automatically after the whole data flow finishes. This shows another feature of Apache SeaTunnel, unified stream and batch processing: it supports not only streaming data synchronization tasks but also batch ones.
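A batch config for the Spark example has a similar shape; here is a rough sketch, where the plug-in names follow the Spark examples but the exact keys should be treated as illustrative:

```
env {
  spark.app.name = "SeaTunnel-Spark-Example"
  spark.executor.instances = 1
  spark.executor.cores = 1
  spark.executor.memory = "1g"
}

source {
  # A bounded fake source: the batch job exits once the data is consumed
  Fake {
    result_table_name = "my_dataset"
  }
}

transform {
}

sink {
  # Print the synchronized rows to the console
  Console {}
}
```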
When you develop a new plug-in, make sure the pom file of the example module depends on the Apache SeaTunnel connector you developed. This makes debugging much easier; otherwise, errors such as “the plugin could not be loaded” may be reported.
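Concretely, that means adding a dependency like the following to the example module’s pom.xml; the artifactId below is a placeholder for your own connector:

```xml
<!-- Add your own connector to the example module so the plug-in can be
     found on the classpath during local debugging. The artifactId is a
     placeholder for the connector you developed. -->
<dependency>
    <groupId>org.apache.seatunnel</groupId>
    <artifactId>seatunnel-connector-flink-your-plugin</artifactId>
    <version>${project.version}</version>
</dependency>
```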
03 My story with the open-source community
Next, I will share my feelings and stories about joining the open-source community.
The first time I participated in open source was in 2020. At that time, my company needed to investigate scheduling systems, and I got to know DolphinScheduler for the first time. The configuration process was not smooth; I eventually traced the problem to the configuration files being case-sensitive, and if you ignore that, an error is reported when the configuration file is loaded.
At that time, I tried to submit an optimization. Before submitting, I did a lot of homework: how to run the style check, how to subscribe to the developer mailing list, how to understand the branch management of open-source projects, and how to submit a pull request. After submitting, I anxiously waited for my PR to be merged. A poem describes my state of mind well before the merge: “How can I forget? How can I not regret? My deep sorrow will last till with you I have met.” Later, I communicated with the community through WeChat groups and emails, and my PR was finally merged. It was so convenient and encouraging; at that moment my mood was more like “Laughing loud, I toss my head back and walk out the door; I will not be the grassroots always trampled down.” That was my first open-source contribution, and also the journey of a novice stepping into the world of open source.
From participating in the open-source community, I learned a few things:
- Every optimization counts. Don’t worry about whether an optimization is worth submitting; the progress of the community is driven by everyone’s code, line by line, and every idea and every optimization has its value;
- You grow while helping others. Your troubleshooting ability and professional skills are strengthened in the process of helping others, and at the same time you gain recognition from the community;
- You gain confidence and friendship. Seeing the features you contributed used by many people is undoubtedly a source of confidence for technicians, and we also gain friendships in the process of community communication and mutual help.
In the end, these are my words for you: there is a group of people in the community who are enthusiastic and kind, who don’t care about gains and losses, and who simply create. This is what I wanted to share this time, and I look forward to collaborating and creating with you in the future.
About Apache SeaTunnel
Apache SeaTunnel (formerly Waterdrop) is an easy-to-use, ultra-high-performance distributed data integration platform that supports real-time synchronization of massive amounts of data and can stably and efficiently synchronize hundreds of billions of records per day.
Why do we need Apache SeaTunnel?
Apache SeaTunnel does everything it can to solve the problems you may encounter in synchronizing massive amounts of data.
- Data loss and duplication
- Task buildup and latency
- Low throughput
- Long application-to-production cycle time
- Lack of application status monitoring
Apache SeaTunnel Usage Scenarios
- Massive data synchronization
- Massive data integration
- ETL of large volumes of data
- Massive data aggregation
- Multi-source data processing
Features of Apache SeaTunnel
- Rich components
- High scalability
- Easy to use
- Mature and stable
How to get started with Apache SeaTunnel quickly?
Want to experience Apache SeaTunnel quickly? SeaTunnel 2.1.0 takes 10 seconds to get you up and running.
https://seatunnel.apache.org/docs/2.1.0/developement/setup
How can I contribute?
We invite all partners who are interested in making local open source global to join the Apache SeaTunnel contributor family and foster open source together!
Submit an issue:
https://github.com/apache/incubator-seatunnel/issues
Contribute code to:
https://github.com/apache/incubator-seatunnel/pulls
Subscribe to the community development mailing list:
dev-subscribe@seatunnel.apache.org
Development Mailing List:
dev@seatunnel.apache.org
Join Slack:
https://join.slack.com/t/apacheseatunnel/shared_invite/zt-1kcxzyrxz-lKcF3BAyzHEmpcc4OSaCjQ
Follow Twitter:
https://twitter.com/ASFSeaTunnel
Come and join us!