Innovations in Feature and Sample Centers: An Integrated Approach to Feature Engineering
Hello everyone, I'm Fan Weitai. I work on OPPO's Smart Recommendation System Algorithm Engineering Team, where I'm mainly responsible for the feature platform. Today's talk is presented jointly by me and my colleague Wang Zichao.
Wang Zichao
Senior Backend Development Engineer at OPPO
Fan Weitai
Senior Backend Development Engineer at OPPO
01 Background Overview
Today's content is divided into two parts: the first is a background overview and an introduction to the overall platform architecture; the second covers the two core modules of the feature platform, the Feature Center and the Sample Center, which my colleague Zichao will present.
First, let's look at the business panorama of the smart recommendation system, which can be divided into three modules. The first is the business layer, covering the App Store as well as the advertising, mall, information feed, user growth, and search businesses. Below the business layer are service layers such as the data pipeline, whose main purpose is to collect client-side tracking logs and report them to the data warehouse.
The next module is today's topic: the feature platform. It handles the production and processing of features and serves as the foundational data layer of the smart recommendation system, covering user profiles and sample services, and is consumed by the business engines.
The recommendation system also has two important components: experiments and reports. Experiments are used to run different experiments and evaluate their effects; reports display the recommendation results. The entire platform is supported by the underlying big data platform, the StarFire machine learning platform, and the device-side workbench, and is built on OPPO's Andes Smart Cloud.
With this platform in mind, let's look at how data flows through the feature platform, taking a user request as an example.
When a user initiates a request, it reaches the recommendation engine, which fetches profile information for the user and the candidate items (materials), combines it with recall results, and then pushes everything to the inference service for scoring before returning the results.
During this process, we dump the user's feature snapshot at that moment as the basic sample data. Users then give feedback on the recommended results, which is captured as user behavior logs.
We collect and process these logs: part of them becomes the labels attached to the sample snapshots, marking positive or negative feedback, and another part is used to compute dynamic profile features of users and items for recall or model training.
These features and labels are combined with the sample data as the basis for training machine learning models; the trained model is then deployed with the inference service back into the recommendation engine, forming a closed data loop. Throughout this system, the timeliness and consistency of the feature platform's data must be guaranteed.
Next, let's look at the development history of the feature platform. In the early days, tooling was chaotic: jobs were developed on MapReduce, Spark, and Flink, in languages such as Java and Python. There was no uniform definition of data formats, processing was duplicated, reusability was poor, and timeliness and data consistency were low.
Against this background, we introduced Apache SeaTunnel and, combining it with our own needs, built the feature and sample platforms. Today our development tools are unified, data formats are consistent, and the timeliness and consistency of data have improved.
Next, let's look at the advantages of Apache SeaTunnel and why we chose it.
Even early versions of Apache SeaTunnel supported mainstream engines such as Flink and Spark, offered strong data processing and computation capabilities, and had active community support.
In addition, Apache SeaTunnel has also launched its own engine specifically for data integration.
In some scenarios, general-purpose big data engines such as Spark or Flink have shortcomings for data integration and synchronization, so SeaTunnel designed its own engine to decouple from them: it no longer depends strongly on Spark or Flink, giving users more choices.
By sharing thread pools and connections, it reduces the pressure on Source and Sink servers and improves resource utilization. SeaTunnel also implements caching inside the engine, for example caching Source data in advance so that multiple Sinks can reuse it, and when a Sink runs into problems, fine-grained fault tolerance keeps processing efficient.
In addition, Apache SeaTunnel offers a high level of process abstraction with clear logic, so newcomers can understand it at low cost and use it easily. Task development is done simply by configuring the three top-level abstractions: Source, Transformer, and Sink.
SeaTunnel is also modular and plugin-based, which makes it easy to integrate and extend. In the V2 version, SeaTunnel introduced its own Source, Transformer, and Sink components and solved the problems of plugin extension and engine version adaptation.
The current version is already adapted to mainstream Flink and Spark versions.
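To make the three-component model concrete, here is a minimal, purely illustrative Java sketch of the Source, Transformer, and Sink abstraction. It only mimics the shape of the idea; it is not the actual Apache SeaTunnel API, and the demo pipeline is a toy.

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// Three top-level abstractions: where data comes from, how it is
// transformed, and where it goes.
interface Source<T> { List<T> read(); }
interface Transformer<I, O> extends Function<I, O> {}
interface Sink<T> { void write(List<T> rows); }

final class Pipeline<I, O> {
    private final Source<I> source;
    private final Transformer<I, O> transformer;
    private final Sink<O> sink;

    Pipeline(Source<I> source, Transformer<I, O> transformer, Sink<O> sink) {
        this.source = source;
        this.transformer = transformer;
        this.sink = sink;
    }

    // A job is fully described by its three components; "development by
    // configuration" means picking concrete implementations for each slot.
    void run() {
        sink.write(source.read().stream().map(transformer).collect(Collectors.toList()));
    }
}

public class PipelineDemo {
    public static void main(String[] args) {
        Source<String> src = () -> List.of("a", "b", "c");
        Transformer<String, String> upper = s -> s.toUpperCase();
        Sink<String> console = rows -> rows.forEach(System.out::println);
        new Pipeline<>(src, upper, console).run();
    }
}
```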
02 Feature Platform — Overall Architecture
Returning to the overall architecture of our feature platform: built on Apache SeaTunnel, the core services are divided into the Feature Center and the Sample Center. There are also common services, including business schema metadata management, task orchestration as the development mode, lineage management between task flows, and permission management.
The Feature Center is mainly responsible for feature registration, production, selection (evaluating feature quality), feature mapping, and feature sharing; the sharing function lets users reuse existing features instead of rebuilding them.
The Sample Center is responsible for sample storage management, sample generation and extraction, and sample backfilling (regenerating historical samples), which is particularly important when adding new features or running experiments.
Public services include a logging service and a monitoring system that run throughout the platform. Everything is built on OPPO's Andes Smart Cloud, relying on its big data storage, computation, and scheduling systems.
For feature development, we kept the highly abstract Source/Transformer processing flow from before and integrated Apache SeaTunnel, so development is done through configuration. We adopted Flink's Table/SQL approach, which makes metadata easier to manage.
Moreover, we provide front-end, visualized configuration as a one-stop IDE, offering a rich set of operators and flexible deployment modes. This includes common Source and Sink operators such as Kafka, Redis, and HDFS, Transformer operators adapted to algorithm needs such as Interval Join, CoGroup, and Session Window, plus custom components and latency monitoring.
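As an illustration of one such Transformer, below is a hedged Flink DataStream sketch of the Interval Join pattern: impressions are paired with clicks on the same request to produce positive labels. The Impression/Click classes, the requestId key, and the 30-minute bound are hypothetical, and both streams are assumed to already have event-time timestamps and watermarks assigned.

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.co.ProcessJoinFunction;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

public class SnapshotLabelJoin {

    public static DataStream<String> label(DataStream<Impression> impressions,
                                           DataStream<Click> clicks) {
        return impressions
                .keyBy(imp -> imp.requestId)
                .intervalJoin(clicks.keyBy(c -> c.requestId))
                // Only pair a click that arrives within 30 minutes after the impression.
                .between(Time.seconds(0), Time.minutes(30))
                .process(new ProcessJoinFunction<Impression, Click, String>() {
                    @Override
                    public void processElement(Impression imp, Click c, Context ctx,
                                               Collector<String> out) {
                        out.collect(imp.requestId + ",label=1"); // positive sample
                    }
                });
    }

    // Hypothetical event types; real snapshots carry full feature payloads.
    public static class Impression { public String requestId; }
    public static class Click { public String requestId; }
}
```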
Additionally, we provide plugin management and version control. Users can upload their own plugins and quickly assemble tasks by combining Sources and Transformers with metadata, meeting the demands of feature production.
To meet feature processing needs, besides custom plugins we also introduced Hive, mainly because many users are familiar with it and had already built some features with it. By rewriting the Hive module and combining it with a whitelist, we made Hive functions usable in Flink, so users can call them seamlessly during Flink development without extra schema settings.
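For reference, the standard open-source mechanism for this in Flink is the HiveModule, which loads Hive's built-in functions into the Table API. The sketch below shows only that mechanism; our whitelist-based rewrite is internal, and the Hive version string and table/query names here are just examples.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;
import org.apache.flink.table.module.hive.HiveModule;

public class HiveFunctionsDemo {
    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // Load Hive built-in functions alongside Flink's core module.
        tEnv.loadModule("hive", new HiveModule("3.1.2"));

        // Hive UDFs such as get_json_object are now usable in Flink SQL
        // (assumes a table named user_events has been registered).
        tEnv.executeSql(
                "SELECT get_json_object(payload, '$.user_id') FROM user_events")
            .print();
    }
}
```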
On top of this, we introduced Hive's syntax and rewrote the Catalog used when creating tables in Source and Transformer. The goal is to use Hive tables created elsewhere directly and to manage table schemas in memory, reducing access to the Hive Metastore.
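The schema-caching idea can be sketched as a thin wrapper over Flink's HiveCatalog that memoizes table lookups in memory. This is a simplified illustration of the approach, not the actual rewritten Catalog.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.flink.table.catalog.CatalogBaseTable;
import org.apache.flink.table.catalog.ObjectPath;
import org.apache.flink.table.catalog.exceptions.TableNotExistException;
import org.apache.flink.table.catalog.hive.HiveCatalog;

public class CachingHiveCatalog extends HiveCatalog {
    private final Map<ObjectPath, CatalogBaseTable> schemaCache = new ConcurrentHashMap<>();

    public CachingHiveCatalog(String name, String defaultDatabase, String hiveConfDir) {
        super(name, defaultDatabase, hiveConfDir);
    }

    @Override
    public CatalogBaseTable getTable(ObjectPath tablePath) throws TableNotExistException {
        CatalogBaseTable cached = schemaCache.get(tablePath);
        if (cached != null) {
            return cached; // served from memory, no Hive Metastore round trip
        }
        CatalogBaseTable table = super.getTable(tablePath);
        schemaCache.put(tablePath, table);
        return table;
    }
}
```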
At the same time, Hive's storage format support, such as ORC, CSV, and SequenceFile, brings strong compression and optimization capabilities.
In summary, by combining the strengths of Apache SeaTunnel and Hive, we enhanced the feature platform's development and production capabilities and realized a configurable, visualized development workflow, making feature registration, production, and processing convenient.
This is the overall architecture and development process of our feature platform.
03 Feature Platform — Feature Center
Good afternoon, everyone. I'm Wang Zichao, from the same team as Fan Weitai, and I'll introduce the Feature Center and Sample Center of our feature platform. The platform mainly consists of these two parts: the Feature Center is responsible for the registration, production, selection, and sharing of features, while the Sample Center is responsible for the storage management, generation, and backfilling of samples.
First, let's look at the computing architecture of the Feature Center. It is divided into three parts: online, near-line, and offline. The online part is mainly the online feature service for the recommendation engine; the near-line part is the real-time stream, with Flink as the compute engine and Kafka as the messaging component; the offline part uses Spark as the compute engine and HDFS for storage.
However, this architecture has two problems: first, a heavy operations burden, since two sets of computation logic must be maintained; second, the risk of data discrepancies between the near-line and offline paths, which can affect the recommendation results.
To solve these problems, we chose Flink as the technical foundation for a unified stream-batch computing architecture. The same code now serves both the offline and near-line logic, operations are simpler, and the risk of near-line/offline data discrepancies is reduced.
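A minimal sketch of what this unification looks like in Flink: the identical job body runs in STREAMING mode on the near-line path and BATCH mode on the offline path, switched by a single flag. The job body here is a toy example.

```java
import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class UnifiedFeatureJob {
    public static void main(String[] args) throws Exception {
        boolean offline = args.length > 0 && "batch".equals(args[0]);

        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();
        // One code path, two execution modes: no second engine to maintain.
        env.setRuntimeMode(offline ? RuntimeExecutionMode.BATCH
                                   : RuntimeExecutionMode.STREAMING);

        env.fromElements("u1:click", "u2:expose")
           .map(line -> line.split(":")[0]) // toy feature extraction
           .print();

        env.execute("unified-feature-job");
    }
}
```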
Next, let’s discuss feature storage.
After features are computed, they need to be stored for online use. Storage is divided into offline and online: offline storage uses HDFS, while online storage mainly serves real-time features and uses Redis as the storage engine.
To optimize offline feature storage, we merge writes and use separate read and write threads plus rate limiting to improve read/write performance. We have also developed a storage engine called Parker, built on RocksDB with both memory and file-system storage; compared with Redis, it cuts storage cost by more than 30%.
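A hypothetical sketch of the merged-write plus rate-limiting idea, using a Jedis pipeline to batch writes and a Guava RateLimiter to cap throughput. The batch size, the ops/s cap, and the key layout are made-up values for illustration.

```java
import com.google.common.util.concurrent.RateLimiter;
import java.util.List;
import java.util.Map;
import redis.clients.jedis.Jedis;
import redis.clients.jedis.Pipeline;

public class FeatureWriter {
    private static final int BATCH_SIZE = 500;
    // Cap writes at 10k ops/s so bulk feature loads don't starve online reads.
    private final RateLimiter limiter = RateLimiter.create(10_000);

    public void write(Jedis jedis, List<Map.Entry<String, String>> features) {
        Pipeline pipeline = jedis.pipelined();
        int inBatch = 0;
        for (Map.Entry<String, String> f : features) {
            limiter.acquire();
            pipeline.hset("feature:" + f.getKey(), "value", f.getValue());
            if (++inBatch == BATCH_SIZE) {
                pipeline.sync(); // flush one merged batch to Redis
                inBatch = 0;
            }
        }
        pipeline.sync();         // flush the remainder
    }
}
```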
Newly generated features are not put into online use immediately. The algorithm team first selects and evaluates them to ensure they are discriminative and contribute to model building. Our platform provides feature selection functionality that evaluates features, ranks them by importance, and presents the results to the algorithm team as reports or line charts. For features already online, we also provide monitoring and analysis tools, such as monitoring label discrimination and coverage.
To ensure feature timeliness, we have taken a series of monitoring and alerting measures. In the data link, we use Obus for standardized collection, and we add universal monitoring points in the task link, such as Kafka consumption delay and data processing delay.
Monitoring metrics are reported to TShouse and visualized through Grafana dashboards. Alert strategies are applied to key nodes, with notifications sent promptly to the relevant people. Feature monitoring is divided into universal monitoring and task-specific monitoring; we add uniform monitoring points by rewriting the transformations before job generation in the Flink architecture.
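One such uniform monitoring point can be sketched as a Flink operator that exposes per-record processing delay as a gauge; the Event class and its timestamp field are hypothetical, and shipping the metric to TShouse/Grafana is a matter of reporter configuration.

```java
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.metrics.Gauge;

public class DelayMonitorMap extends RichMapFunction<Event, Event> {
    private volatile long lastDelayMs;

    @Override
    public void open(Configuration parameters) {
        // Registered with Flink's metrics system; a configured reporter
        // forwards it to the external monitoring store.
        getRuntimeContext().getMetricGroup().gauge("processDelayMs", new Gauge<Long>() {
            @Override
            public Long getValue() {
                return lastDelayMs;
            }
        });
    }

    @Override
    public Event map(Event event) {
        lastDelayMs = System.currentTimeMillis() - event.eventTimeMs;
        return event;
    }
}

// Hypothetical event type carrying the upstream production timestamp.
class Event {
    long eventTimeMs;
}
```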
In addition, to assure the consistency of offline features, we provide a consistency check that compares hash values between offline and online storage and pushes the results to the relevant people.
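The hash comparison can be sketched as below; MD5 and the map-shaped inputs are assumptions for illustration (in practice each side computes its digest locally and only digests are shipped for comparison).

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;

public class ConsistencyChecker {

    static String md5(String featurePayload) throws NoSuchAlgorithmException {
        byte[] digest = MessageDigest.getInstance("MD5")
                .digest(featurePayload.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    /** Ratio of keys whose offline and online feature hashes agree. */
    static double consistencyRate(Map<String, String> offline,
                                  Map<String, String> online)
            throws NoSuchAlgorithmException {
        long matched = 0;
        for (Map.Entry<String, String> e : offline.entrySet()) {
            String onlineValue = online.get(e.getKey());
            if (onlineValue != null
                    && md5(e.getValue()).equals(md5(onlineValue))) {
                matched++;
            }
        }
        return offline.isEmpty() ? 1.0 : (double) matched / offline.size();
    }
}
```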
04 Feature Platform — Sample Center
In the Sample Center, sample generation is the process of combining user behavior logs with profile features and, after a series of processing steps, producing training samples.
We have optimized the sample pipeline by converting storage to a columnar format with compression and sorting, which saves storage space and improves read/write speed.
Furthermore, we have built dataset management on Flink's Catalog API: users can register data sources and build data tasks, simplifying the development process.
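A small sketch of what registration through the Catalog/Table API enables: once a sample table is declared, it can be referenced by name in SQL. The schema, path, and format below are illustrative.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class DatasetRegistry {
    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.inBatchMode());

        // Register a sample dataset once; downstream tasks refer to it by name.
        tEnv.executeSql(
                "CREATE TABLE training_samples (" +
                "  request_id STRING," +
                "  features   STRING," +
                "  label      INT" +
                ") WITH (" +
                "  'connector' = 'filesystem'," +
                "  'path'      = 'hdfs:///warehouse/samples/v1'," +
                "  'format'    = 'parquet'" +
                ")");

        tEnv.executeSql(
                "SELECT count(*) FROM training_samples WHERE label = 1").print();
    }
}
```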
For sample data backfilling, we have introduced data lake capabilities, optimized the sample pipeline, and ensured its stability.
05 Future Planning
Finally, we have three plans for the feature platform:
- Deepen metadata capabilities to achieve fine-grained data lineage;
- Provide a multi-engine, multi-mode development system to meet different user needs;
- Explore integration with ChatGPT, implementing the entire feature and sample generation process through chat dialogue.
That’s the end of our sharing, thank you, everyone!