This documentation is for an unreleased version of Apache Flink. We recommend you use the latest stable version.
Native Lineage Support #
As organisations look to govern their data ecosystems; understanding data lineage, where data is coming from and going to, becomes critical. As Apache Flink is widely used for data ingestion and ETL in Streaming Data Lakes, we need an end to end lineage solution for scenarios including but not limited to:
Data Quality Assurance: Identifying and rectifying data inconsistencies by tracing data errors back to their origin within the data pipeline.Data Governance: Establishing clear data ownership and accountability by documenting data origins and transformations.Regulatory Compliance: Ensuring adherence to data privacy and compliance regulations by tracking data flow and transformations throughout its lifecycle.Data Optimization: Identifying redundant data processing steps and optimizing data flows to improve efficiency.
Apache Flink provides a native lineage support by providing an internal lineage data model and Job Status Listener for developer to integrate lineage metadata into external lineage system, for example OpenLineage. When a job is created in Flink runtime, the JobCreatedEvent contains the Lineage Graph metadata that will be sent to Job Status Listeners.
Lineage Data Model #
Flink native lineage interfaces are defined in two layers. The first layer is the generic interface for all Flink jobs and connector, and the second layer defines the extended interfaces for Table and DataStream independently. The interface and class relationships are defined in the diagram below.
By default, Table related lineage interfaces or classes are used in Flink Table environment, thus Flink users doesn’t need to touch these interfaces. The Flink community will gradually support all of the common connectors, such as Kafka, JDBC, Cassandra, Hive. If you have a customized connector defined, you need to have customized source/sink implementations of the LineageVertexProvider interface. Within a LineageVertex, a list of Lineage Datasets are defined as metadata for Flink source/sink.
@PublicEvolving
public interface LineageVertexProvider {
LineageVertex getLineageVertex();
}
For the interface details, please refer to FLIP-314.