Transforming Data Integration with Apache SeaTunnel: Insights from Qingfeng, a Leading Paper Brand

Apache SeaTunnel
6 min read · May 25, 2023


I am Han Shanfeng, from Jinhongye Paper Industry Group. Today, I will introduce the application scenarios of Apache SeaTunnel at Jinhongye Paper Industry Group, including why we chose Apache SeaTunnel and how we improved our internal data development efficiency with it.

Han Shanfeng

Jinhongye Paper Industry Data Analyst

01 Product Selection Journey

When I first joined Jinhongye, all of our data was in the Oracle database. At that time, we used Oracle views for data warehousing.

If one view could not meet a requirement, we created another view, and if two views were still not enough, we added a third. Over time, however, the inefficiency of this approach became apparent, so we started looking for new solutions.

The first task was to research data synchronization between Oracle and ClickHouse. At this stage, our goal was simple: push system table data directly to ClickHouse, and then have ClickHouse serve the front-end applications directly. After solving the Oracle-to-ClickHouse synchronization problem, we moved on to the second phase.

In the second phase, we started dealing with SAP data. As a traditional enterprise, our manufacturing, marketing, and supply chain production all depend on ERP systems, especially SAP, which posed the challenge of getting data out. We chose SAP’s RFC interface for data extraction, and conveniently, Kettle, the tool we used in the first phase, also supports this method.

In the third phase, we planned to use Hive to build our data warehouse. However, due to the problems and limitations of Kettle itself, we started looking for new tools that could import our Oracle database data and SAP interface data into Hive, where we could perform model processing and data cleaning.

In the fourth phase, we started exploring how to push cleaned Hive data to ClickHouse, because our BI, reports, and visualization applications rely more on ClickHouse. At this stage, we discussed how to integrate the entire data ecosystem.
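To give a sense of what this hop looks like today, here is a minimal sketch of a SeaTunnel (v2-style) batch job that reads a cleaned Hive table and writes it to ClickHouse. The host names, table names, and credentials are hypothetical placeholders, not our production setup:

```
env {
  # One-shot batch job
  job.mode = "BATCH"
}

source {
  Hive {
    # Hypothetical cleaned model table in the warehouse
    table_name = "dwd.sales_model"
    metastore_uri = "thrift://hive-metastore:9083"
  }
}

sink {
  Clickhouse {
    # Hypothetical ClickHouse target serving the BI layer
    host = "clickhouse-host:8123"
    database = "bi"
    table = "sales_model"
    username = "default"
    password = ""
  }
}
```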

In this process, we evaluated various solutions, including commercial solutions and the open-source Apache SeaTunnel. I remember when I first came across Apache SeaTunnel, it was called “Waterdrop.”

We read its documentation on GitHub in detail, performed in-depth code analysis, and finally decided to build our data integration tool based on Apache SeaTunnel.

After choosing the tool, we started facing new challenges. In addition to the standard business systems in the enterprise, we had also purchased some SaaS services. Many SaaS services only give you a client application or a web account and password, making it difficult to get the data you want.

This led us to our fifth phase: integrating all of this data into our platform with a single tool, and then providing services through ClickHouse. The process was difficult for us, but we eventually achieved our goal and improved our data development efficiency.

We went through five stages to choose and implement the appropriate data product. After comparing different products, we iteratively upgraded from Kettle to SeaTunnel.

We listed a few key steps:

First, we handled traditional offline data access. Second, we implemented data access through the RFC interface. This step is only partially supported by Apache SeaTunnel natively, so to fully support it, we studied the SAP documentation and Apache SeaTunnel’s source code, and developed and optimized the capability ourselves.

Throughout this process, SeaTunnel provided our technical team with detailed documentation, in both Chinese and English. When we encountered problems, we could look for solutions by opening GitHub issues or asking in Slack and the Apache SeaTunnel community group. The configuration files and documentation were easy for developers to understand, which is another reason we chose Apache SeaTunnel!

02 Product Application Scenarios

In terms of data access, we mainly have the following application scenarios:

Offline data access: This is our main application scenario. We synchronize different databases and tables from various business systems into Hive via Apache SeaTunnel, process the business logic within Hive, and output the processed data to ClickHouse. Finally, we present the data through our front-end BI or reporting tools.
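As an illustration, the first hop of this pipeline can be expressed as a SeaTunnel batch job like the sketch below, with a JDBC source and a Hive sink; the connection details and table names are hypothetical:

```
env {
  job.mode = "BATCH"
}

source {
  Jdbc {
    # Hypothetical Oracle source system
    url = "jdbc:oracle:thin:@//oracle-host:1521/ORCL"
    driver = "oracle.jdbc.OracleDriver"
    user = "etl_user"
    password = "etl_password"
    query = "SELECT ORDER_ID, CUSTOMER_ID, AMOUNT, ORDER_DATE FROM SALES.ORDERS"
  }
}

sink {
  Hive {
    # Landing table in the ODS layer of the warehouse
    table_name = "ods.orders"
    metastore_uri = "thrift://hive-metastore:9083"
  }
}
```

A second, analogous job (like the Hive-to-ClickHouse sketch shown earlier) then moves the processed tables into ClickHouse for the BI layer.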

RFC data access: This is another main application scenario. Here, we use SAP’s RFC interface for data access. This interface is one of SAP’s standard external interfaces and has both Java and Python implementations.
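To show what reading data over RFC involves, here is a minimal Python sketch using SAP’s PyRFC library and the standard RFC_READ_TABLE function module. The connection parameters and the queried table are hypothetical, and our actual implementation lives inside SeaTunnel rather than in standalone scripts:

```python
from pyrfc import Connection

# Hypothetical SAP connection parameters; replace with real values
conn = Connection(
    ashost="sap-host",   # application server host
    sysnr="00",          # system number
    client="100",        # SAP client
    user="rfc_user",
    passwd="rfc_password",
)

# RFC_READ_TABLE is a standard function module for reading table rows.
# Here we read two fields from the material master table MARA.
result = conn.call(
    "RFC_READ_TABLE",
    QUERY_TABLE="MARA",
    DELIMITER="|",
    FIELDS=[{"FIELDNAME": "MATNR"}, {"FIELDNAME": "MTART"}],
    ROWCOUNT=10,
)

# Each row comes back as a single delimited string in the WA field
for row in result["DATA"]:
    matnr, mtart = row["WA"].split("|")
    print(matnr.strip(), mtart.strip())

conn.close()
```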

Third-party data access: Although this scenario accounts for a smaller share of our applications, we access this data via HTTP or Kafka and then conduct internal data analysis.
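For the HTTP case, SeaTunnel’s Http source connector turns this into a configuration exercise. Below is a minimal sketch against a hypothetical SaaS endpoint; the URL and schema are assumptions for illustration:

```
env {
  job.mode = "BATCH"
}

source {
  Http {
    # Hypothetical SaaS endpoint returning JSON
    url = "https://saas.example.com/api/v1/orders"
    method = "GET"
    format = "json"
    schema = {
      fields {
        order_id = string
        amount = double
      }
    }
  }
}

sink {
  Console {
    # Print to stdout for testing; swap in Hive/ClickHouse for production
  }
}
```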

03 Improvement in Development Efficiency

Next, I want to talk about the improvement in our team’s development efficiency. We have basically turned into “copy and paste” engineers: every topic is built from a standardized directory structure and template, and the tasks are then divided up and handled. It has completely turned into an assembly-line workflow.
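As a purely hypothetical illustration (the actual layout is internal), a standardized topic template along these lines is what makes the “copy and paste” workflow possible:

```
topic_sales/          # one directory per topic, copied from a template
├── extract/          # SeaTunnel configs: source systems -> Hive
├── transform/        # Hive SQL for cleaning and model processing
├── load/             # SeaTunnel configs: Hive -> ClickHouse
└── schedule/         # scheduler job definitions and dependencies
```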

In this “assembly line” process, we have many fixed elements; some are internal normative constraints, and others are practices we learned from open-source projects and adapted to our own situation.

When our team first engaged with open source, our attitude was mainly to learn from projects and apply them. But we found that SeaTunnel was a highly active, well-documented project, so we decided to adopt it.

In using it, we found that it could indeed solve some problems within our company, including issues within our data team. We also ran into some issues along the way, which prompted us to submit PRs and offer constructive suggestions. That feedback helped us too, so we want to keep giving positive feedback to the community.

We were using version 1.0 in 2021, as version 2.0 was not out yet. We encountered many problems with version 1.0, but the 2.0 era brought support for many protocols, some of which we needed badly, such as HTTP. After internal testing, we found that it greatly reduced our development time.

04 Product Upgrades and Iteration

In our initial architecture, we actually used Azkaban. At the time, we didn’t have much time to test different products, so we chose the simplest option, Azkaban, to handle resource scheduling for our entire cluster.

But it had some problems; for example, it couldn’t manage our various scripts very well.

As we connected more and more topics, the dependencies between them also multiplied. Initially, one or two topics could coordinate between themselves, but now we have to adjust the scheduling time of each dependency very carefully. Sometimes a task with a large data volume suddenly takes one or two hours, and the downstream tasks fail. We have to handle the scheduling of every dependency with great care.
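To illustrate the pain point: in Azkaban, dependencies inside one flow can be declared in job files, but separately scheduled flows are coordinated only by their fixed start times. The job names and commands below are hypothetical:

```
# extract_orders.job -- flow A, scheduled (in the web UI) for 01:00
type=command
command=sh run_seatunnel.sh extract_orders.conf

# build_model.job -- flow B, scheduled for 03:00; it simply assumes
# flow A has finished by then, so a slow extract makes this job fail
type=command
command=sh run_hive.sh build_model.sql
```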

Then we came across Apache DolphinScheduler, which essentially solves both our scheduling dependency problem and our resource management problem quite well. Our plan is to replace the current Azkaban with DolphinScheduler.

Finally, I want to share a passage from a book about the software industry that I read recently; it gave me deep insights. The book describes the 1960s, the era of vacuum tubes, when chips were still in their infancy.

There is a profound statement in the book: the simpler the chip, the better its reliability and power efficiency. I think this also applies to our software. If our program is simple enough for its users, its reliability and the value it provides will be all the stronger.


Written by Apache SeaTunnel

The next-generation high-performance, distributed, massive data integration tool.
