From an independent developer to a contributor to the SeaTunnel community, what did I do right?

Apache SeaTunnel
May 20, 2024


Hello everyone, my name is Yan Chengyu, and I am currently an independent developer specializing in data development, machine learning, resource scheduling algorithms, and distributed systems.

GitHub ID: CheneyYin

Personal Website: https://cheneyyin.github.io/

Contributions to the Community

  • Enhanced support for SeaTunnel data types in Spark and Flink engines.
  • Fixed several bugs in the transformation layer of the Spark engine.
  • Improved the data type support in the Assert connector.
  • Addressed several CI-related bugs.
  • Enhanced various documentation pieces.

Contribution record: https://github.com/apache/seatunnel/pulls?q=is%3Apr+author%3ACheneyYin+is%3Aclosed

Initial Exploration

From 2022 to 2023, I was developing a visual data integration tool akin to StreamSets and NiFi.

Around March 2023, I completed a rudimentary version of this data integration tool, named Metal, and migrated it to my GitHub repository. Although simple, Metal successfully validated the feasibility of the design ideas and the technology stack.

It wasn’t until I read the article “The Evolution of Architecture from ETL to EtLT” published on the devops.dev community that I gained new insights into data integration, such as the concept of ‘little t’, the limitations of using general computing engines, and the value of data integration execution engines.

This was also my first encounter with Apache SeaTunnel, which is built upon these new principles. After my initial trial with Apache SeaTunnel, I decisively shifted my focus and chose to become active in the SeaTunnel community.

Submitting My First PR

I’d like to share the story of my first PR submission. During an early stress test of SeaTunnel, I noticed an OOM (Out of Memory) exception thrown by the Spark engine.

I first reproduced the issue, then debugged it and pinpointed the cause. It turned out that the TransformerProcessor in the Spark transformation layer was buffering output results in memory, which exhausted the heap when processing large data volumes.
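To illustrate the general shape of this kind of problem, here is a minimal sketch in plain Java. It is not the SeaTunnel code or the actual fix; the class and method names (LazyTransformSketch, transformEagerly, transformLazily) are hypothetical and exist only for this example. The eager variant collects every transformed row before returning, so memory use grows with the input, while the lazy variant transforms one row at a time as the consumer pulls it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Function;

// Hypothetical illustration only; these names do not come from SeaTunnel.
public class LazyTransformSketch {

    // Eager variant: all transformed rows are stored in a list before
    // anything is returned, so memory use grows with the input size.
    public static <I, O> Iterator<O> transformEagerly(Iterator<I> input, Function<I, O> transform) {
        List<O> buffer = new ArrayList<>();
        while (input.hasNext()) {
            buffer.add(transform.apply(input.next()));
        }
        return buffer.iterator();
    }

    // Lazy variant: a row is transformed only when the consumer calls next(),
    // so no intermediate collection of output rows is built.
    public static <I, O> Iterator<O> transformLazily(Iterator<I> input, Function<I, O> transform) {
        return new Iterator<O>() {
            @Override
            public boolean hasNext() {
                return input.hasNext();
            }

            @Override
            public O next() {
                return transform.apply(input.next());
            }
        };
    }

    public static void main(String[] args) {
        Iterator<Integer> rows = Arrays.asList(1, 2, 3).iterator();
        Iterator<Integer> doubled = LazyTransformSketch.<Integer, Integer>transformLazily(rows, x -> x * 2);
        while (doubled.hasNext()) {
            System.out.println(doubled.next());
        }
    }
}
```

Both variants produce the same rows; they differ only in when the work happens and how much output is held in memory at once.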

After analyzing the problem in depth and finding a solution, I submitted my first issue to the Apache SeaTunnel community (Issue #4502), where I described the symptoms, explained the cause, and proposed a solution. Subsequently, I submitted my first PR (#4503).

My first PR was merged within just four days, which shows how quickly the community responds. Still, the wait was filled with anticipation and felt long, especially when an anomaly in the CI environment caused the tests to fail.

Fortunately, seasoned community members promptly assisted, and the PR was successfully merged. So, when you first start contributing, seeking help from experienced contributors is crucial, and they are generally very willing to assist. Just be mindful not to take up too much of their time.

Ongoing Involvement

Over the past year, I have actively participated in community activities, absorbed insights from tech leaders, responded to community issues, and continuously monitored the Pull Request list.

Additionally, I have made several code contributions to the community.

For example:

  • Added support for SeaTunnel’s Time type in the Spark engine (#5188).
  • Introduced configurable precision and scale for the Decimal type in the Flink engine (#5419).
  • Enhanced HOCON-style generic declarations (#6187).
  • Completed coverage of all data types in the Assert connector (#6275).

These pull requests are mostly aimed at improving the user experience.

Impressions of the Community

My first impression of the Apache SeaTunnel community was that it is enthusiastic and active. The community quickly responds to issues and pull requests. It is also very friendly and patient with new contributors, making it easy for newcomers to get involved.

Future Aspirations

I hope the community will continue to grow, attracting more developers to propel SeaTunnel forward. I wish for an expanding user base, allowing more people to enjoy its convenient data integration solutions. I anticipate continuous improvements in user experience and new breakthroughs in SeaTunnel’s stability.

I also hope for more comprehensive and clearer documentation, with detailed usage guides and technical references that help users get started quickly and resolve issues.
