Which Data Synchronization Method Is Superior?

Apache SeaTunnel
9 min read · Sep 5, 2024

--

The importance of data synchronization methods is self-evident for practitioners in the field of data integration: choosing the right method lets synchronization work deliver twice the result for half the effort. Many data synchronization tools on the market offer multiple synchronization methods. What is the difference between these methods? How do you choose the one that suits your business needs? This article analyzes these questions in depth and details the functions and advantages of WhaleTunnel in data synchronization, to help readers better understand its application in enterprise data management.

Pros and cons of different data synchronization methods

Data synchronization refers to maintaining data consistency and synchronization between different systems, databases, or files. Different data synchronization methods are used depending on the application scenario, data volume, and requirements. Choosing the appropriate method has a crucial impact on the overall architecture, performance, stability, business requirements, and security of the system.

In general, a proper synchronization strategy not only ensures the consistency and integrity of data but also reduces the development and maintenance cost of the system and improves its reliability. Therefore, when designing and implementing data synchronization solutions, enterprises must fully consider business needs, data volume, real-time requirements, performance, resources, maintenance costs, security, and other factors to make the best choice.

The following are common ways to synchronize data:

1. Full Synchronization

  • Full synchronization refers to the transfer and update of all data at each synchronization. It is suitable for scenarios where the amount of data is small and does not need to be updated frequently.
  • Advantages: It is easy to implement and suitable for scenarios with a small amount of data.
  • Disadvantages: Low efficiency and high network and storage overhead when the amount of data is large.
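
Below is a minimal sketch of a full-synchronization job, written in the HOCON config style of Apache SeaTunnel (the project WhaleTunnel is based on); all connection details and table names are hypothetical:

```
# Full synchronization: read the entire source table and rewrite it downstream.
env {
  parallelism = 2
  job.mode = "BATCH"
}

source {
  Jdbc {
    url = "jdbc:mysql://source-host:3306/shop"    # hypothetical source
    driver = "com.mysql.cj.jdbc.Driver"
    user = "sync_user"
    password = "******"
    query = "SELECT * FROM orders"                # no filter: all rows, every run
  }
}

sink {
  Jdbc {
    url = "jdbc:mysql://target-host:3306/shop"    # hypothetical target
    driver = "com.mysql.cj.jdbc.Driver"
    user = "sync_user"
    password = "******"
    query = "INSERT INTO orders VALUES (?, ?, ?)" # assumes a three-column table
  }
}
```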

2. Incremental Synchronization

  • Incremental synchronization means that only the data that has changed since the last synchronization is transferred and updated at each synchronization. It is suitable for scenarios with a large amount of data and frequent changes.
  • Advantages: High efficiency, reduced data transfer volume and system burden.
  • Disadvantages: The implementation is complex and requires the ability to accurately detect changes in the data.
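
Relative to the full-synchronization sketch above, incremental synchronization only changes the source query: it filters on a change-tracking column. The `updated_at` column is an assumption, and the watermark value would have to be persisted and advanced between runs:

```
source {
  Jdbc {
    url = "jdbc:mysql://source-host:3306/shop"
    driver = "com.mysql.cj.jdbc.Driver"
    user = "sync_user"
    password = "******"
    # Only rows changed since the last successful run; the watermark
    # ('2024-09-01 00:00:00' here) must be stored and updated externally.
    query = "SELECT * FROM orders WHERE updated_at > '2024-09-01 00:00:00'"
  }
}
```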

3. Real-Time Synchronization

  • Real-time synchronization means that data is synchronized to the target system as soon as it changes. Message queues (such as Kafka and RabbitMQ) or Change Data Capture (CDC) are commonly used.
  • Advantages: Data stays nearly consistent in real time, which suits businesses with strict data freshness requirements.
  • Disadvantages: High requirements for network and system performance, and high implementation complexity.
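
A sketch of a real-time job that consumes a Kafka topic, again in SeaTunnel-style config; the brokers, topic, and schema are made up:

```
env {
  job.mode = "STREAMING"          # the job runs continuously instead of exiting
}

source {
  Kafka {
    bootstrap.servers = "kafka-1:9092,kafka-2:9092"
    topic = "order-events"        # hypothetical topic
    consumer.group = "sync-demo"
    format = "json"
    schema = {
      fields {
        order_id = bigint         # hypothetical event fields
        status = string
      }
    }
  }
}

sink {
  Console {}                      # replace with a real sink (Jdbc, Elasticsearch, ...)
}
```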

4. Scheduled Synchronization

  • Scheduled synchronization is performed based on a set time interval (for example, hourly or daily). It is suitable for scenarios that do not require real-time performance but have a large amount of data.
  • Advantages: Flexible and controllable, suitable for batch processing.
  • Disadvantages: Data synchronization is not real-time, and there may be data lag.
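
Scheduled synchronization is usually just an ordinary batch job whose schedule lives outside the job itself; a sketch (the cron entry is an example, not a requirement):

```
# Triggered by an external scheduler, e.g. a crontab entry such as:
#   0 * * * *  /opt/seatunnel/bin/seatunnel.sh --config /etc/sync/orders.conf
env {
  job.mode = "BATCH"   # each scheduled run is a normal batch execution
}
```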

5. Bidirectional Synchronization

  • Bidirectional synchronization keeps two or more systems in sync with each other, i.e., data changes are propagated in both directions. It is commonly used in scenarios such as distributed databases or active-active data centers.
  • Advantages: Consistency between systems, data can be written to multiple points.
  • Disadvantages: The implementation is complex and prone to data conflicts and consistency issues.

6. Log-Based Synchronization

  • Log-based synchronization leverages the transaction log or binlog of the database to capture data changes and synchronize them to the target system. Commonly used tools include Debezium, Canal, etc.
  • Advantages: Incremental capture of data changes with high real-time performance.
  • Disadvantages: Depends on the logging mechanism of the database, which may affect the performance of the database.
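
As one concrete example of log-based capture, Apache SeaTunnel (the foundation of WhaleTunnel) ships a MySQL-CDC connector that reads the binlog; a sketch with hypothetical connection details:

```
source {
  MySQL-CDC {
    base-url = "jdbc:mysql://source-host:3306/shop"
    username = "cdc_user"          # needs MySQL replication privileges
    password = "******"
    table-names = ["shop.orders"]  # changes are read from the binlog, not the table
  }
}
```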

7. File Synchronization

  • File synchronization transfers data through files, for example by exporting data to CSV, JSON, or XML files and then moving them via FTP or SFTP. It is suitable for scenarios where the data structure is not complex.
  • Advantages: Simple implementation and good compatibility.
  • Disadvantages: Poor real-time performance, not suitable for complex and frequently changing data.
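
A sketch of file-based synchronization: export a table to CSV files that are then shipped by FTP/SFTP out of band (paths and names are hypothetical):

```
env {
  job.mode = "BATCH"
}

source {
  Jdbc {
    url = "jdbc:mysql://source-host:3306/shop"
    driver = "com.mysql.cj.jdbc.Driver"
    user = "sync_user"
    password = "******"
    query = "SELECT * FROM orders"
  }
}

sink {
  LocalFile {
    path = "/data/export/orders"   # ship these files via FTP/SFTP separately
    file_format_type = "csv"
  }
}
```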

Different data synchronization methods suit different business scenarios, and choosing the appropriate method requires weighing factors such as data volume, real-time requirements, and network and system performance.

Explore WhaleTunnel’s data synchronization feature

In the modern data-driven enterprise, efficient data synchronization between systems and platforms is essential. WhaleTunnel is a data integration product developed by WhaleOps that aims to meet the challenges of modern enterprise data management with powerful data synchronization capabilities. Based on the Apache SeaTunnel project, WhaleTunnel provides a complete set of data integration capabilities, including batch, real-time, and change data capture (CDC) synchronization.

1. Batch Data Synchronization

Batch data synchronization refers to the periodic transfer of large amounts of data from the source system to the target system. The batch-stream unified architecture of WhaleTunnel supports offline full synchronization and incremental synchronization, which suits scenarios where large amounts of data must be loaded, such as data warehouses and data lakes. In batch mode, WhaleTunnel’s Zeta engine continuously creates distributed snapshots that can be used to restore or restart synchronization after a task failure, ensuring data consistency.
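
In SeaTunnel-style config (which WhaleTunnel follows), the snapshot behavior described above is driven by the checkpoint settings in the `env` block; the interval below is an arbitrary example:

```
env {
  job.mode = "BATCH"
  checkpoint.interval = 10000   # take a distributed snapshot every 10 s so a
                                # failed job can restart from the last snapshot
}
```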

2. Real-Time Data Synchronization

Real-time data synchronization transmits data from the source system to the target system as soon as it changes, and is suited to scenarios that require fast response, such as feeding message queues (for example, Kafka). In WhaleTunnel, this mode is enabled by setting the job type to STREAMING.
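
In config terms this is a one-line change in the job’s `env` block (SeaTunnel syntax, which WhaleTunnel inherits):

```
env {
  job.mode = "STREAMING"   # was "BATCH"; the job now runs until cancelled
}
```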

The WhaleTunnel Zeta engine continuously creates distributed snapshots during real-time synchronization to save processing checkpoints. If a task fails, the system rolls back to the last successful checkpoint, ensuring that data is processed only once, preventing data loss or duplication in the target database.

3. Change Data Capture (CDC) Synchronization

Change Data Capture (CDC) is a method that captures changes in data by reading database logs. WhaleTunnel supports CDC real-time synchronization and CDC offline incremental synchronization and can capture and apply all changes (insertion, deletion, and update) in the source data to the downstream target system. This approach is particularly useful for scenarios that require data to be continuously updated and consistent.

WhaleTunnel’s Zeta engine continuously takes distributed snapshots when performing CDC synchronization to ensure that the processing checkpoints for each task are preserved. When a task fails, the system rolls back to the last successful checkpoint to ensure data integrity and consistency.
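
Putting the pieces together, here is a hedged sketch of an end-to-end CDC job in SeaTunnel-style config; all hosts, credentials, and table names are hypothetical:

```
env {
  job.mode = "STREAMING"
  checkpoint.interval = 5000         # snapshots enable rollback on failure
}

source {
  MySQL-CDC {
    base-url = "jdbc:mysql://source-host:3306/shop"
    username = "cdc_user"
    password = "******"
    table-names = ["shop.orders"]
  }
}

sink {
  Jdbc {
    url = "jdbc:mysql://target-host:3306/shop_replica"
    driver = "com.mysql.cj.jdbc.Driver"
    user = "sync_user"
    password = "******"
    generate_sink_sql = true         # sink builds INSERT/UPDATE/DELETE from events
    database = "shop_replica"
    table = "orders"
    primary_keys = ["id"]            # assumes an `id` primary key
  }
}
```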

4. Incremental Data Integration Without a Primary Key

The traditional incremental synchronization integration method has the following drawbacks:

  • It requires the table to contain an auto-increment ID or another field that can identify new data, which limits the range of tables that can be synchronized incrementally.
  • Deleted and modified rows cannot be identified; only newly added rows are synchronized, so the destination data source can never be fully consistent with the source.

However, WhaleTunnel’s CDC integration mode records the latest position N in the database log when the offline synchronization job runs, and then starts incremental processing from the position M recorded at the end of the previous run.

Without offline CDC synchronization, incremental database synchronization requires specifying a field that identifies new data, such as an auto-increment ID or a write-timestamp field.

Offline CDC synchronization solves both of the above problems. Its core is to obtain data changes by reading and parsing the change log of the source database, which yields all change information (new, deleted, and modified data), and then to apply those changes to the target database. In this mode there are no restrictions on the source table, and every type of data operation can be captured, so the destination data source stays truly consistent with the source database.

WhaleTunnel supports both real-time and offline synchronization modes. In real-time mode, it first performs a full read of the historical data in the source table, then automatically switches to the incremental log, parses it, and writes the changes to the destination data source. If the source table has no further updates, the synchronization job does not stop; it keeps waiting for new data to arrive.
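
The "full read first, then switch to the log" behavior corresponds to the CDC source’s startup mode; in SeaTunnel syntax:

```
source {
  MySQL-CDC {
    base-url = "jdbc:mysql://source-host:3306/shop"
    username = "cdc_user"
    password = "******"
    table-names = ["shop.orders"]
    startup.mode = "initial"   # snapshot all historical rows, then follow the binlog
  }
}
```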

During real-time CDC synchronization, Zeta, the synchronization engine of WhaleTunnel, continuously takes distributed snapshots to record the position each task has processed; if a task fails, Zeta rolls it back to the last successfully processed point. In this way, WhaleTunnel ensures that data is processed exactly once and that there is no data loss or duplication in the target database.

5. Database-Wide Synchronization and Automatic Table Schema Changes

WhaleTunnel supports database-wide synchronization and automatic table schema changes.

  • Table schema changes
    Traditional CDC cannot detect upstream table schema changes, let alone apply them synchronously to the target data source. WhaleTunnel CDC solves this problem with a feature called Schema Evolution: once enabled, downstream engines that support table changes can selectively synchronize table structure changes to the downstream system, reducing manual intervention.
  • CDC multi-table synchronization and full-database synchronization
(Figure: CDC basic process)

At present, most synchronization products in the industry must start one job per table when doing CDC synchronization. When many tables need to be synchronized, this wastes a great deal of computing resources and consumes an excessive number of database connections. Database connections are a precious resource; if too many are opened without limit, the data source may become unstable.

WhaleTunnel CDC solves this problem with multi-table synchronization. Users can specify that data from multiple tables be synchronized in one job, and Zeta starts one or more synchronization threads to process the database log according to the user’s configuration, with each synchronization thread needing only one database connection. This greatly reduces connection usage: where synchronizing 10,000 tables used to require 10,000 database connections, WhaleTunnel CDC can complete the same synchronization with as little as a single connection, as the sketch below shows.
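
A sketch of multi-table CDC in a single job, in SeaTunnel syntax with hypothetical table names; the schema-change switch reflects the Schema Evolution feature described above (option name as in recent Apache SeaTunnel releases):

```
source {
  MySQL-CDC {
    base-url = "jdbc:mysql://source-host:3306/shop"
    username = "cdc_user"
    password = "******"
    # One job, many tables: a handful of connections instead of one per table.
    table-names = ["shop.orders", "shop.customers", "shop.payments"]
    schema-changes.enabled = true   # propagate upstream DDL downstream
  }
}
```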

  • CDC database and table sharding synchronization
    WhaleTunnel supports synchronization of sharded databases and tables: when the job is configured, each source task can synchronize the shards within one database instance. WhaleTunnel Zeta optimizes such a job into three pipelines, each of which is scheduled independently for fault tolerance, meaning that even if one MySQL instance goes down, data synchronization from the other two MySQL instances to the downstream ClickHouse is unaffected.
  • CDC synchronizes dynamic table addition in real-time
    During real-time whole-database or multi-table CDC synchronization, you sometimes need to add new tables to the job. WhaleTunnel CDC lets you dynamically add a synchronization table without stopping the running job: a new thread performs the historical-data snapshot read for the added table, and its incremental log reading is then handled by the incremental synchronization threads already serving the other tables.
(Figure: dynamically discovering newly added tables)

As you can see, WhaleTunnel provides a variety of data synchronization methods, allowing users to choose the most appropriate one for each business scenario.

Crucially, WhaleTunnel is positioned as a cloud-native data synchronization platform from the very beginning, supporting deployment on Kubernetes (K8S) and leveraging cloud storage as metadata storage. In addition, it provides a complete visual development interface to support task development, management, scheduling, and monitoring, simplifying the definition and execution of data integration tasks, making data synchronization operations easier, and greatly reducing users’ data management costs.

Conclusion

As a leading data synchronization and integration tool, WhaleTunnel provides an efficient, stable, and flexible solution for enterprises’ data integration needs by supporting multiple synchronization methods (batch, real-time, CDC), as well as incremental synchronization without primary keys and automatic table schema changes. For enterprises looking to optimize data management and improve data utilization, WhaleTunnel is undoubtedly an option worth considering.

👉👉WhaleTunnel on AWS Marketplace: https://aws.amazon.com/marketplace/pp/prodview-kzzy36f3blbxu?sr=0-2&ref_=beagle&applicationId=AWSMPContessa


Written by Apache SeaTunnel

The next-generation high-performance, distributed, massive data integration tool.