FLOSS Weekly: Data is surprisingly exciting!

Apache SeaTunnel
Aug 9, 2023 · 26 min read


Apache SeaTunnel mentor William Guo was invited onto last week's episode of the FLOSS Weekly show, where he introduced the Apache SeaTunnel project to the audience and had a great conversation with the hosts about data integration technology, open-source influence, and other interesting topics such as ChatGPT and what the future may look like as these technologies advance. Let's dive into their conversation. 🔽

Doc Searls: Hello again everyone, everywhere, this is FLOSS Weekly. I am Doc Searls, and this week I'm joined by Shawn Powers himself. Here he is. We're actually relatively close, just one state apart. We're in adjacent states: I'm in Indiana, you're in Michigan, and you're in green and I'm in orange, for those of you not watching.

Shawn Powers: That looks red.

Doc Searls: It's orange, it's actually orange. It's a Firefox shirt. Today I have one with an older Firefox logo, one of several older ones, very much like this.

Shawn Powers: All right, maybe it's the contrast with the fox that makes it look like that, right?

Doc Searls: It could be. So our guest today is William Guo from the Apache SeaTunnel project, which I have compiled a whole lot of stuff on and don’t understand well enough yet. So have you done your homework on this thing?

Shawn Powers: I mean, so yes and no. I mean, big data is big, right? I mean, there's a lot to figure out there. I have questions from my youth, from my yesterday.

Doc Searls: You just deleted trust.

Shawn Powers: Okay, all right. Well, I was a database manager at a university, and we had this incredibly archaic database that we had to tie in with another, more modern SQL database. And basically, all my questions are going to be based on: would this have made my job easier back in the day? And I'm pretty sure the answer is yes. But the extent of my knowledge of big data is that it's a big pain in the butt. So hopefully this makes it less painful.

Doc Searls: Yeah, I’m intrigued by some of the claims that I saw or some of the stories I saw about it saying that it’s actually cheap to run and you can use it on smaller projects. I’m very interested in using it personally. So that’s an interesting thing and we can go in lots of different directions with this.

Let me introduce our guest. It's William Guo. I'm hoping I get the pronunciation right. He's been an Apache Software Foundation Member, a mentor with the Apache SeaTunnel project, and an Apache DolphinScheduler PMC Member. DolphinScheduler is another topic of conversation today. He is the initiator of the ClickHouse Chinese community, a graduate of Peking University, used to work as big data director for the Lenovo Research Institute and General Manager of the Wanda e-commerce data department, and he's a visiting researcher at the Big Data Business Analysis Research Center of Renmin University. He has been committed to promoting the democratization of data capabilities and the development of open-source projects globally. And so, with that inadequate introduction: welcome, William.

William Guo: Hello, everyone. And yeah, glad to see you. I am William.

Doc Searls: So, tell us a bit about Apache SeaTunnel and what led to it. Because I'm reading DolphinScheduler is a part of it. So give us kind of the overall thing and we'll dive down into parts of it.

William Guo: Okay. So Apache SeaTunnel is a project for doing big data integration. Actually, you can extract data from different databases such as MySQL, Oracle, DB2, or AWS Aurora, or even SaaS or cloud databases, and then you can load the data into another database, such as Hive, Hadoop, Redshift, ClickHouse, or any database you want. So I think it is a very good open-source project that can help you extract data from your database into other databases.
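
To make that concrete, here is a rough sketch of what a SeaTunnel batch job config looks like, copying rows from a MySQL table into ClickHouse. The HOCON layout (env, source, sink) follows the project's documented job format, but the connector option names below are illustrative and can differ between SeaTunnel versions, so check the docs for your release before using them.

```
# Sketch of a SeaTunnel batch job (option names are illustrative; they vary by version)
env {
  job.mode = "BATCH"     # run once: extract everything, load it, then exit
  parallelism = 2
}

source {
  Jdbc {
    url = "jdbc:mysql://mysql-host:3306/shop"
    driver = "com.mysql.cj.jdbc.Driver"
    user = "reader"
    password = "secret"
    query = "SELECT id, name, amount FROM orders"
    result_table_name = "orders"    # label that later steps use to refer to this data
  }
}

sink {
  Clickhouse {
    host = "clickhouse-host:8123"
    database = "analytics"
    table = "orders"
    username = "writer"
    password = "secret"
    source_table_name = "orders"
  }
}
```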

Doc Searls: I am not a database expert at all, but I do know that companies have always had a hard time integrating these things, because there are many different fields, many different variables, many different conventions involved in them, many different ways of querying them. And I wonder how you pull all of those together and don't end up with a monster of some sort that is too hard to get into. How does that look when you're done with it?

William Guo: Actually, I ran into a problem before, because I just wanted to synchronize data from AWS Aurora to AWS Redshift. I used to use a tool called AWS DMS, a tool that AWS offers, but it wasn't workable for me at that time. I also found that there are many databases, not only AWS Aurora but also MongoDB and Neo4j, which are other kinds of databases, and many data warehouses such as Snowflake, Redshift, Teradata, and also Oracle. So there are so many databases, and what I wanted to do was just synchronize one database to the other, but I could not find a very good tool to do that. So we had to build a synchronization tool that we called Waterdrop, and that's the former name of the SeaTunnel project. And then we found that everyone needs a tool that can synchronize data between different data sources, perhaps between Kafka, MongoDB, or MySQL. So then we created an open-source project called SeaTunnel that lets you synchronize data very easily, and you can even use drag and drop to create a job.

So that's why we created SeaTunnel. I think it's easy for people with no technical background who want to synchronize one database to another; you can even synchronize data from, for example, Notion to Google Docs if you want. Those are different data sources, and SeaTunnel helps you do data synchronization from any data source to any other data source.

Shawn Powers: Okay. And this is, again, I wish I had known you 10 years ago, because we could have solved some major problems with my job. But when you talk about synchronizing two different database types, is it only a one-way synchronization? And if so, does that just mean you set up two synchronizations, one each way? Like, let's say I have a SQL database and I want to sync it to FoxPro (I don't want to try naming databases again, databases aren't what I do right now), but you want any changes on the other side to also be reflected. You know, you just want to keep them in sync with each other. Is that like a one-process thing, or do you have to set up two jobs, and whichever happens first wins? Is that two ways or one way?

William Guo: Yeah, very good question. For now, it is one-way synchronization. But for synchronization, we have two kinds of ways. One kind we call a batch job; that means you extract data and load data one time. The other we call real-time synchronization. That means you can just read the data from MySQL, for example, as it changes, and we call it CDC, that's Change Data Capture. Then you can load the data into AWS Redshift or Snowflake in real time. So this is another type of data synchronization, and we can do it in real time now, not only batch synchronization. I think that's why many users of SeaTunnel use it to do data synchronization.

We don't have a FoxPro connector now, but I think many people use Snowflake these days, and you can use SeaTunnel to extract data from Notion, Google Docs, or Excel and load it into Snowflake very easily.
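
For readers who want to see what the second mode looks like, here is a hedged sketch of a real-time CDC job: the structure mirrors the batch config above, but the source reads changes continuously. Connector and option names are again assumptions that may not match your SeaTunnel version exactly.

```
# Sketch of a real-time (CDC) job: capture MySQL changes as they happen
env {
  job.mode = "STREAMING"    # keep running and pick up every change, instead of a one-off copy
}

source {
  MySQL-CDC {
    base-url = "jdbc:mysql://mysql-host:3306/shop"
    username = "reader"
    password = "secret"
    table-names = ["shop.orders"]         # tables whose changes should be captured
    result_table_name = "orders_changes"
  }
}

sink {
  Console {
    # stand-in target for this sketch; in practice this would be Redshift, Snowflake, etc.
    source_table_name = "orders_changes"
  }
}
```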

Shawn Powers: Okay, so they're basically one-way, but you can set up multiples. Is that a fair answer to that question?

William Guo: Yes.

Shawn Powers: Okay. And then, in the back channel, Jonathan Bennett and I were both thinking the same sort of thing about conflicts: if there are two separate synchronizations, you know, passing in the night, how do you deal with conflicts? Is it timestamp-based, or how do those conflicts get handled?

William Guo: Yeah, it's a very good question. Sometimes we just want to load the data into the target database, but there is some data there already. So we have a mode called save mode, and you can choose to replace the records, or just update or delete the records. So you have the save mode to handle the issue that you met. I think it's very easy for you to choose the mode.

Shawn Powers: Okay, I guess that makes sense. And it probably depends on the use case. Like, if changes are made on both databases and the hope is to get the data appearing in both, I assume timestamps must be at play, like, okay, which one gets preference for being stored. Doc, do you have any questions? I have so many questions, but I don't want to dominate.

Doc Searls: I don't know, those are all good, but I'm not clear, and again, I'm not a database person: what is the user looking at? I mean, if you're used to your Oracle database, your Mongo, your MySQL, what are you looking at? Are you looking at the one your company normally uses for something? Are you looking at some other user interface that's unique to SeaTunnel?

William Guo: Yeah, the users of SeaTunnel now are, I think, data engineers, people who want to handle the data. In the old days, they had to write a lot of code to handle data synchronization, because there were no very good tools like SeaTunnel to do synchronization between different databases. Now they can just drag and drop a job, or they can just write SQL to do the synchronization with SeaTunnel. So I think Apache SeaTunnel is for data engineers, especially big data engineers.

Doc Searls: Okay, so you mentioned engineers are the ones kind of doing this integration. And I also noticed in your background information that there's this field called data integration. Is that what this is in? It's a new field. Who all is in it, and where does Apache SeaTunnel fit into it?

William Guo: Yeah, actually, we call that ETL, and that means extract, transform, and load. In the old days, in what we called the data warehouse period, we just extracted data from Oracle or DB2 and then loaded it into Teradata or DB2 Warehouse Edition. But nowadays we do this in different ways: we capture the data in real time, we load it into Snowflake in real time, and we do data analytics in real time. Now we have found another very interesting story: many developers try to use SeaTunnel to sync data from SaaS or from databases to a new kind of target, ChatGPT. It is very hard. As you know, ChatGPT only knows the knowledge from the internet, but it cannot chat with your data, because ChatGPT does not know your data or the data in your database. But SeaTunnel can extract data from more than 100 data sources, and we are developing a ChatGPT connector. When it's finished, I think ChatGPT can connect to your database, and then you can chat with your data, no matter whether the data is in Google Docs, Notion, Oracle, MySQL, or MongoDB. So it's very interesting.

Shawn Powers: Yeah, this connector idea is really cool. So part of the idea of a connector is that there's the source and then the sink: the source is where the data is coming from, the sink is where the data ends up. How much transformation can take place in that interim step? And I realize this is mostly a streaming or real-time kind of thing, and obviously there has to be some translation because these are different structures and such. But can other transforms happen? I mean, can you do stuff to make the data not just a different format, but have some transforms take place in the interim? And I have a follow-up question because I think I know the answer, but can stuff be done in the interim or not really? Or is it just a structure change?

William Guo: We can do some transformation between the source and the sink. Different databases have different data types, and the SeaTunnel engine will transform the source data type into the sink data type automatically, so you needn't do the data type transformation yourself. But if you have another requirement, for example, I want to change 0 to male and 1 to female, you can use a transformer in SeaTunnel to do that. Actually, you can use SQL-like code to do the transformation in the SeaTunnel Engine.
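
That "0 to male, 1 to female" example maps naturally onto the SQL-style transform step that sits between the source and the sink in a SeaTunnel job config. A small sketch follows, with the caveat that the transform plugin name and option keys here are assumptions based on the documented config layout rather than quotes from the conversation.

```
# Sketch of a transform step between source and sink
transform {
  Sql {
    source_table_name = "users_raw"     # output of the source step
    result_table_name = "users_clean"   # what the sink step will read
    query = """
      SELECT id,
             name,
             CASE gender WHEN 0 THEN 'male' ELSE 'female' END AS gender
      FROM users_raw
    """
  }
}
```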

Shawn Powers: Oh, that was the exact answer to my question then. So yeah, you can do transforms. And you thought of a much better example. I couldn’t think of an example off the top of my head, so that’s perfect.

My follow-up question leads right into that. This is something that the SeaTunnel Engine does. But I noticed that SeaTunnel, as the entire project, can use the newer SeaTunnel Engine, but can also use Flink or Spark. Flink and Spark already have people who are like, why would I ever use this over that? You know, why would I ever want to switch when Flink and Spark are everything I want? What does SeaTunnel add to that argument that makes it a better fit for this? And if it's so much better, why are there still options to use Spark or Flink?

William Guo: Yeah, it's a very good question. Actually, at first SeaTunnel supported the Flink and Spark engines. But we found that Spark and Flink are designed for computation, not for synchronization.

For example, in one of our use cases, the users have more than 1,000 tables and they want to synchronize them to another database. But if you use Spark or Flink, there will be a lot of JDBC connections to handle, and that's a very heavy load on the source database. SeaTunnel creates a connection pool so you can reuse the JDBC connections, something like that. And there's another feature we call schema evolution. That's a technical term. What does it mean? It means that if you change the data model of your source table, you want exactly the same thing to happen to the target table, which we call the sink table. Flink and Spark are not designed for that, because they're designed for computation: group by, joins, aggregation, and so on. So we had to design another synchronization engine, which we call SeaTunnel Zeta, to do this kind of thing. It is designed only for synchronization, so the performance will be better, because we do not care about computation or some complex functions. So the performance will be better than Flink and Spark under synchronization scenarios.

Shawn Powers: Okay, good. So my guess was that the ability to use Flink or Spark was just because there wasn't a highly developed SeaTunnel engine yet. The engine SeaTunnel designed seems like a good fit, and then when I saw that you could use Spark or Flink too, I was a little confused, like, that's not really what those do. Those are more for data processing, running real-time computation, whatever. You know, I was a little surprised. So ideally the SeaTunnel Engine itself is the better use case. Is that fair to say?

William Guo: Yeah. And also Spark and Flink do not have so many source and sink connectors, you know?

Shawn Powers: So I didn't realize that they did sinks at all. I mean, I thought it was like extraction on their own. And yeah, so that makes sense. So the source and the sink, I don't know if they're plugins, I don't know the terminology, but are those designed engine-specific? Like, there are SeaTunnel Engine sources and sinks in code, but then if you want to use Spark or Flink, you have to use something specifically designed for those engines. Is that so? And is the bulk of development toward the SeaTunnel Engine?

William Guo: Actually, the SeaTunnel connectors can also run on Spark and Flink, because people use Spark and Flink. But if you use Spark or Flink, you will not have features such as schema evolution, the better performance, or some other features. So SeaTunnel does support Flink and Spark, but if you want better performance or better functions, you can use the SeaTunnel Engine itself. The SeaTunnel Engine will help you do more and offer you more functions.
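
In practice, the engine choice is made when you submit the job rather than inside the connectors themselves: the same job file can, in principle, be handed to the SeaTunnel (Zeta) engine or to a Flink or Spark cluster through different launcher scripts. The script names below are an assumption based on recent releases and vary between versions, so treat this purely as an illustration.

```
# The job config itself stays engine-agnostic...
env {
  job.mode = "BATCH"
}

# ...and the engine is chosen by which launcher you invoke (illustrative names):
#   bin/seatunnel.sh --config job.conf                            # SeaTunnel (Zeta) engine
#   bin/start-seatunnel-flink-connector-v2.sh --config job.conf   # run the same job on Flink
#   bin/start-seatunnel-spark-connector-v2.sh --config job.conf   # run the same job on Spark
```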

Shawn Powers: All right, can the flow, whatever you call the connection from one database to another, be one-to-many? Or is that another thing where you just set up another connection? You know, like, say we have our SQL database and I want it sent, to use your examples, to tables in Google Docs, and I also want it synced to my personal Notion database. Is that one-to-many, or do I need to set up two different connections?

William Guo: Yeah, you can use one-to-many. We call it load once and sink many times: you load data from a Google Doc or Kafka once, and then you can sink it into Snowflake, Redshift, S3 on AWS, etc. That's a feature of SeaTunnel.
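
"Load once, sink many times" corresponds to giving the source one result table name and pointing several sinks at it in the same job. A sketch along those lines, with illustrative connector names and options that may not match a given release exactly:

```
# One source, multiple sinks: the data is read once and written to several targets
source {
  Kafka {
    bootstrap.servers = "kafka-host:9092"
    topic = "orders"
    result_table_name = "orders_stream"
  }
}

sink {
  Console {
    source_table_name = "orders_stream"   # stand-in for Snowflake/Redshift in this sketch
  }
  S3File {
    source_table_name = "orders_stream"   # the same data, landed on S3 as Parquet files
    bucket = "s3a://my-bucket"
    path = "/orders"
    file_format_type = "parquet"
  }
}
```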

Shawn Powers: So you don’t have to do 3 queries. You can, whatever the terminology would be, you don’t have to pull the data three times. Okay, that’s really nice.

William Guo: Yeah. So it's very interesting, because at first I thought synchronization was a very easy thing, but when we created SeaTunnel, we found that there are a lot of user scenarios that are quite different from Flink and Spark, because synchronization is another kind of user story, I think.

Shawn Powers: Yeah, I never thought about Spark and Flink as, like, cramming data back into a database. Again, that's why I didn't understand how they were engines in the same way that the SeaTunnel Engine would be. So yeah, thank you for the clarification there, because I feel the SeaTunnel Engine is definitely one of the most efficient ways to go. So thank you.

Doc Searls: I’m quickly reading about Spark and Flink because again, I’m new to this stuff. And I see in a piece about Spark and Flink, they are also earlier Apache projects, I guess, and they are called the third and fourth-generation data processing frameworks. Does that make SeaTunnel the fifth generation? I’m not sure. Or are they different species altogether?

William Guo: I don't think SeaTunnel is a fifth generation, because I think a fifth generation will be another story, for example, quantum computation. But Spark and Flink, I think, are focused on computation. Just like you said, it is about data processing, and data processing is quite different from data synchronization. So actually, I think our project goes a different way from Spark and Flink.

Doc Searls: So I'm wondering, and this may be a self-answering question given what you just said, that one is focused on processing and the other on synchronization: do any of the same people who worked on Spark and Flink also work on SeaTunnel? Or is it a different set of experts and forms of expertise? I guess a related question is where you are getting your developers and what the developers are working on. Do you know where they came from? I know DolphinScheduler was involved in part of this, so that may be a way to transition to that question as well.

William Guo: Yeah. For the developers, we call them the contributors of an open-source project. Some Flink and Spark contributors are in Apache SeaTunnel and contribute code, because they need SeaTunnel to synchronize data from a data source to the target database, and then they will use Flink or Spark to do the data processing. So actually our users are using SeaTunnel plus Flink or Spark to do the data synchronization and the data processing. Because when you do the data processing, you have to store the data in one database: you have to extract data from different databases and then load it into one database, such as Snowflake. That's what SeaTunnel is doing. And when they have loaded the data into that one database, they can use Spark to do the computation or use Flink to do the real-time processing. So actually we are not competitors.

Shawn Powers: So speaking of developers, again, I keep thinking about myself and how I had to struggle with databases when I was the manager of a database department. It was software developed by Datatel and bought by Ellucian. But nonetheless, this database, which was designed in the 80s, had a really, really weird structure that wasn't compatible with anything SQL. It had multi-value fields. It was just a weird database. Okay, and so there was this ridiculously complicated spaghetti of database interaction that was just custom designed, because something like SeaTunnel didn't exist. My question is, if I were to tackle this today, are the connectors modular? Is that something where somebody could develop their own connector to a database and use that with SeaTunnel? Or is it not modular, where a person could develop something for their own ridiculous backend database that they want to synchronize with a more modern version? We literally had to run it in this old VM. It was like Star Trek, you know, with Voyager in the middle of this enormous alien monstrosity. That's basically what this database thing was like: a tiny VM with this old database, and we had to figure out a way to connect to it. It was miserable. So is it modular enough to connect to things like that?

William Guo: Actually, everyone can contribute a connector to SeaTunnel, and we have a very good example. Informix is a very old database, which is older than me. Some people wanted to synchronize data from that old database to Oracle, so they could use Oracle instead of Informix and change their application design to be based on Oracle, and they contributed the Informix connector code back to the project. Everyone can contribute a connector to SeaTunnel because it's an open-source project, and everyone else who has the same issue can use this connector to solve their problem.

Shawn Powers: Okay, that's awesome. And in your example, there is this older database that some applications connected to directly, and they want to synchronize it with an Oracle database so some other, newer front end could connect to that. That kind of leads back to my original question about two-way synchronization. Would there just be two setups? I mean, if somebody is working on the Oracle database and makes a change and they want that reflected back in the older database, would there be two different connections, two different synchronizations taking place?

William Guo: We don't need it, and we don't suggest you do that, because people will do the new things in the new database and they do not insert or feed the data back to the old database; they only want the old data in the new database. And if you have two-way synchronization, I think that will confuse the system, because the system will not know where a new record is from: is it from the old database or from the new database? I haven't met that kind of scenario before.

Shawn Powers: I'm glad we talked about it, because that was my original question, and if it did two-way sync, how would it handle conflicts? Because, you know, there's a whole bunch of problems that could happen there. But in my use case, two-way synchronization would have been great, because when we developed programs or interfaces, we did not want to hook into that old archaic database directly, yet we needed to get data back and forth from it.

William Guo: Yeah. It's a very good discussion and a very good requirement. We can consider this at a later time. For now, I think there may be a solution: you have a third data store like Kafka, and you can do the two-way synchronization between Kafka and the two different databases. I think there will be some solution for the requirement that you mentioned.

Shawn Powers: For what it’s worth, I’m not going back to that job. So I don’t really need a solution. 10 years ago, I needed a solution and I left the job.

Doc Searls: This is an interesting question. I'm wondering, again as an outsider to this, a couple of things. There's a connector API somewhere, and somebody using this synchronized database must call it something. What do they call it? I mean, is it known as the SeaTunnel database? How does a company or a user or a customer use SeaTunnel, what do they call that, and where does it live? It may live in the cloud or somewhere; I'm not sure.

William Guo: Actually, many companies use SeaTunnel. In America there is J.P. Morgan, and another investment bank uses it to extract data from AWS Aurora to AWS Redshift. There are many other companies such as Bilibili, which is a video company similar to YouTube, and VIP.com, which is similar to Amazon, and internet companies such as TikTok; many other companies in Japan, China, Singapore, and America are using SeaTunnel. There are many new cloud databases, but there was no very good open-source project for cloud database synchronization, so they are using SeaTunnel to solve this problem. We know there is a lot of older software such as Informatica used in North America and Europe, but that kind of tool doesn't support cloud databases very well. So that's why there are many internet companies using SeaTunnel: because it's open source, and free, I think.

Doc Searls: Yeah. In some ways that makes it harder to track. I suppose somebody could be using it and, you know, they're not paying customers, but as I mentioned, they might have developers on it in that case. I know Shawn has a question, but we need to take a break, and we'll be back right after this.

Shawn Powers: Okay, so, again, I just have so many questions, and I didn't think I was going to, but I'm really enjoying the conversation here. So let's say I'm using SeaTunnel, and I assume I could run it in a Docker container; I assume it's just an engine that's running, and then it talks to the different data sources. If I have a database, do I have to keep the entire database in sync across the connector, or can I do it piecemeal? Can I sync only part of the database, say I want my contact database kept in sync with my Notion, or whatever, and just have that stay in sync? I assume yes; I mean, it would be silly if I could only do the entire database. But what triggers the sync? Does SeaTunnel stay connected to the data source watching for a change, or is there something that has to trigger SeaTunnel, you know, to send data to it? How does that connection actually happen?

William Guo: Yeah, a very good question. There are two ways. One way we call the batch job, which means you have to trigger SeaTunnel to extract data from the database and load it into the other databases, and you can use Airflow or DolphinScheduler (job orchestration tools) to trigger that batch job.

The other way, if you have a real-time job, we call real-time data synchronization. SeaTunnel will watch the original database, and if there is a new record or you change some records, SeaTunnel will know that and will synchronize the data into your target database in no more than one second.
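
In config terms, the difference between the two trigger styles is essentially the job mode: a batch job is started on demand (for example by DolphinScheduler or Airflow invoking the SeaTunnel launcher) and exits when the copy finishes, while a streaming job is submitted once and keeps watching the source. A sketch, with the usual caveat that the exact keys and script names are assumptions that vary by version:

```
# Batch style: an orchestrator (DolphinScheduler, Airflow, cron, ...) launches the job,
# e.g. by running something like  bin/seatunnel.sh --config nightly_sync.conf,
# and the job exits once all rows have been copied.
env {
  job.mode = "BATCH"
}

# Real-time style: submit once, and the job stays up, reading the source's change log.
# env {
#   job.mode = "STREAMING"
# }
```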

Shawn Powers: Yeah. Is there a downside to having SeaTunnel itself monitor directly for that real-time change? I mean, is there a performance hit on the source database? Or if you're already using something like DolphinScheduler to do these batch changes, would you just want to do that? Is there, I guess, a best practice, or does it all just depend on what kind of data you're working with?

William Guo: Yeah. If you don't need the real-time data, you needn't use the real-time mode, because I think the real-time mode will affect the original database's performance a little. But it's very tiny, because we do not read the database itself, we read the database binlog, which is a file, not the database. Reading the binlog will affect the disk to some extent, but I think it won't affect the database so much.

Shawn Powers: So it doesn't connect to the database unless it detects something it wants from the logs. Okay, yeah, that makes sense. And then, I guess with something like DolphinScheduler, there would be no connection at all; would DolphinScheduler actually pass the data, or would it just trigger SeaTunnel to come in and grab the data?

William Guo: Yeah, DolphinScheduler is just an orchestration tool and a trigger; it can also trigger Spark or other jobs, like EMR on AWS.

Shawn Powers: Yeah, alright. I guess I see why you could do it two different ways and probably the performance isn’t drastically better one way or the other.

William Guo: Yeah, if you have a lot of data, you have to extract the data in batch mode, because the data is too big.

Shawn Powers: Yeah, that makes sense.

William Guo: If the data is not so big, you can do it in real-time mode.

Shawn Powers: Yeah. If you're watching one field, you wouldn't have to run a batch job just because one name changed.

William Guo: Yes.

Doc Searls: There are a couple of questions in the back channel. Somebody says they want to work only with the coolest-named OSS products, and SeaTunnel is definitely on that list. Is there a chance of errors with a sync that fast? And have we already talked about rollback on transactions, especially if someone has some files open at the time the sync happens? So, can you address those?

William Guo: It is also a very good question. Actually, SeaTunnel can be deployed on one server or in cluster mode, and we have a global snapshot technique. That means if some error happens, the whole data synchronization process will roll back to the last global snapshot, which we call a checkpoint. So you don't need to worry that you will lose the data, because there will be many checkpoints, and you can define the checkpoint interval yourself to ensure the data will not be lost. We call it a distributed checkpoint. We also have some other functions to ensure that SeaTunnel will read the data again and resynchronize from the last checkpoint onward. So SeaTunnel assures that the data is synchronized from the source to the target exactly once, we call it Exactly-Once, which means when you load your data, your data will not be lost.
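
In recent SeaTunnel (Zeta) releases, the checkpointing William describes is configured per job, roughly as an interval in the env block; the exact key name here is an assumption and should be checked against the docs for your version.

```
# Sketch of checkpointing for a streaming job on the SeaTunnel (Zeta) engine
env {
  job.mode = "STREAMING"
  checkpoint.interval = 10000   # take a distributed snapshot roughly every 10 seconds (ms);
                                # on failure the job rolls back to the last checkpoint and
                                # replays from there, so each record lands exactly once
}
```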

Shawn Powers: Unless you're trying to sync data two ways at the same time, in which case all bets are off.

Doc Searls: So I have a question. There was a piece written by somebody with Apache DolphinScheduler with a very provocative headline: Train your own private ChatGPT model for the cost of a Starbucks coffee. And I read the paragraph that opens it: you can own your own trained open-source large-scale model. It can be fine-tuned according to different training data and directions to enhance various skills such as medical, programming, stock trading, and love advice, making your large-scale model more understanding of you. Let's try training. And then it goes into how you can do that.

And I have a particular question about that, because I want my own chatbot on my own data: my household, all my property, all my health records, all my financial records, all my contacts and calendars, my travels, where I've been. You know, like, where was I when that medical thing happened, and what doctors did I see? I mean, I'm just making that up, but those are the kinds of things I think about when the likes of ChatGPT become relevant to individuals. I think we're sort of at a moment now where we will start having our own databases in our houses, in our homes, that are not relevant to the rest of the world. We think of this as easy for companies, because companies have gigantic databases in most cases and wish to know a lot about themselves, and it would apply there; that's probably where most of the uses are going to be early on. We had a company on here a few weeks ago talking about control planes. It was called Crossplane, and it was about having multiple control planes within a company. And we have control planes in our own lives. I'm thinking this seems relevant to me, especially when you're saying it's cheap. Do you have any thoughts about that?

William Guo: Yeah, actually, we call it the private LLM, because what you need is only a GPU better than a 3090, and then you can use DolphinScheduler to train your own ChatGPT with that GPU. Yeah, I think it takes about 2 to 24 hours to train your own ChatGPT. What you need to do is prepare the data and pick an open-source large model, for example LLaMA or LLaMA 2, and DolphinScheduler will download LLaMA or LLaMA 2 automatically and help you train it with your personal data on your own personal computer or laptop. You needn't worry about a data leak, that your personal data will be uploaded somewhere else, because you can train on your data on your own laptop. This is very interesting.

Doc Searls: This is something I very much want. I mentioned this on earlier shows, but it's worth bringing up again. One of the hackers among us took everything I had written for Linux Journal, which Shawn and I both used to work for, over 24 years, I wrote many, many articles, and had a model query those. In other words, it trained on them. I don't know what model he used, whether LLaMA or ChatGPT, but there was one, and it gave good answers, and it put them in the form of a haiku as well. It gave you the complete answer and was remarkably right and helpful. And that was just one thing. I mean, one could actually look back through all of one's emails, you know? How many times did I talk to Shawn? When did this come up? I mean, I'd love to have it for this show; we've done it for 16 years. It would be really great to go back and say, hey, you know, when did we last talk to William? When did we last talk about this? Who would we want to have back, and what questions were left unanswered? I mean, there are lots of possibilities there. And until I started learning about this, I wasn't thinking about how possible this was at a relatively low cost. So that's intriguing.

William Guo: Yeah. Actually, DolphinScheduler lets everyone have their own ChatGPT. But I think the hard part is preparing the data. Some people are creating a connector in SeaTunnel for preparing the data; for example, you can extract data from your PowerPoint or Word files and then do the data preparation for LLaMA, and then you can train LLaMA with DolphinScheduler to have your own private ChatGPT. But I think it is hard. If they succeed, I think everyone will be happy, because everyone will be able to do the data preparation very easily.

Doc Searls: One last question, one that Jonathan Bennett, another co-host whom we mentioned earlier and who is in our chat, often brings up: what is the weirdest use you've seen so far, what's really unusual or stands out as an exception?

William Guo: Actually, I think the weirdest thing I've met in this project is that I thought the connectors would grow slowly. When we entered the Apache Incubator, we only had 20 connectors, and inside our company we had just doubled that in one year, from 20 to 40. Now the project has more than 100 connectors. I think the open-source community is more powerful than I thought. It is interesting, because I never thought the connectors would grow so fast, and I never thought there would be so many users who would love to contribute their connectors to this open-source project. That's why that was weird for me.

Doc Searls: Yeah, well, that's great. And given that it's growing that fast, and given that we may have SeaTunnel running ourselves in a year or two or less, maybe if we follow that path we'll be able to see how far it's gone and have you back on a future show? I hope so; that will be great.

William Guo: Yeah. Our goal is to connect SeaTunnel to every data source in the world.

Doc Searls: I think so. It will be really good.

Doc Searls: Thank you so much, William, for being on the show.

William Guo: Yeah, thank you, everyone.

About Apache SeaTunnel

Apache SeaTunnel (formerly Waterdrop) is an easy-to-use, ultra-high-performance distributed data integration platform that supports real-time synchronization of massive amounts of data and can synchronize hundreds of billions of records per day in a stable and efficient manner.

You are welcome to fill out this form to become an Apache SeaTunnel speaker: https://forms.gle/vtpQS6ZuxqXMt6DT6 :)

Why do we need Apache SeaTunnel?

Apache SeaTunnel does everything it can to solve the problems you may encounter in synchronizing massive amounts of data.

  • Data loss and duplication
  • Task buildup and latency
  • Low throughput
  • Long application-to-production cycle time
  • Lack of application status monitoring

Apache SeaTunnel Usage Scenarios

  • Massive data synchronization
  • Massive data integration
  • ETL of large volumes of data
  • Massive data aggregation
  • Multi-source data processing

Features of Apache SeaTunnel

  • Rich components
  • High scalability
  • Easy to use
  • Mature and stable

How to get started with Apache SeaTunnel quickly?

Want to experience Apache SeaTunnel quickly? SeaTunnel 2.1.0 takes 10 seconds to get you up and running.

https://seatunnel.apache.org/docs/2.1.0/developement/setup

How can I contribute?

We invite everyone interested in making local open source global to join the Apache SeaTunnel contributor family and foster open source together!

Submit an issue:

https://github.com/apache/seatunnel/issues

Contribute code to:

https://github.com/apache/seatunnel/pulls

Subscribe to the community development mailing list:

dev-subscribe@seatunnel.apache.org

Development Mailing List:

dev@seatunnel.apache.org

Join Slack:

https://join.slack.com/t/apacheseatunnel/shared_invite/zt-1kcxzyrxz-lKcF3BAyzHEmpcc4OSaCjQ

Follow Twitter:

https://twitter.com/ASFSeaTunnel

Join us now!❤️❤️

Written by Apache SeaTunnel

The next-generation high-performance, distributed, massive data integration tool.
