The OpenLineage Technical Steering Committee meets monthly on the second Wednesday, 9:00am to 10:00am US Pacific. Link to join the meeting
All are welcome.
Attendees:
Michael Collado: Datakin
Julien Le Dem: OpenLineage Project Lead, Datakin
Maciej Obuchowski: GetInData, OpenLineage
Willy Lulciuc: Marquez, OpenLineage
Mandy Chessell: Egeria Project Lead, working on OpenLineage
Ross Turk: VP of Marketing at Datakin, here to talk about the website
Minkyu Park: interested in contributing to Datakin
Peter Hicks: Marquez contributor, OpenLineage user
Open discussions:
Azure Purview team hackathon ongoing to consume OpenLineage events
Design docs discussion:
Proposal to add a design doc for each proposal.
goal:
Similar to the process of projects like Kafka, Flink: for specs and bigger features
not for bug fixes.
options:
proposal directory for docs as markdown
Open PRs against wiki pages: proposals wiki.
Manage status:
list of designs that are implemented vs pending.
table of open proposals.
vote for prioritization:
Every proposal design doc has an issue opened that links back to it.
good start for the blog talking about that feature
New committee on data ops: Mandy will be speaking about Egeria and OpenLineage
Scope:
How the foundation projects should work together around the topic.
Establish OpenLineage is important.
https://wiki.lfaidata.foundation/display/DL/DataOps+Committee
Mandy Chessell: Egeria Lead. Integrating OpenLineage in Egeria
Michael Collado: Datakin, OpenLineage
Update on OpenLineage latest release (0.2.1)
dbt integration demo
OpenLineage 0.3 scope discussion
Facet versioning mechanism (Issue #153)
OpenLineage Proxy Backend (Issue #152)
OpenLineage implementer test data and validation
Kafka client
Roadmap
Open discussion
Added to the agenda: a discussion of Iceberg requirements for OpenLineage.
Demo of dbt:
really easy to try
When running from Airflow, we can use the wrapper 'dbt-ol run' instead of 'dbt run'.
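The wrapper swap described in the demo can be sketched as a small helper that rewrites a dbt command line before it is handed to whatever runs it (an Airflow task, a shell script, etc.). The helper name `to_openlineage_command` is hypothetical and only illustrates the substitution; it is not part of dbt or OpenLineage.

```python
import shlex
from typing import List


def to_openlineage_command(dbt_command: str) -> List[str]:
    """Replace the leading `dbt` executable with the `dbt-ol` wrapper.

    Hypothetical helper for illustration; all other arguments are kept as-is.
    """
    args = shlex.split(dbt_command)
    if not args or args[0] != "dbt":
        raise ValueError("expected a command starting with 'dbt'")
    return ["dbt-ol", *args[1:]]


# Example: the resulting list could be passed to subprocess.run(...)
# on a machine where the dbt-ol wrapper is installed.
command = to_openlineage_command("dbt run --select my_model")
```

Here `command` is the same invocation with only the executable changed, which matches the "use 'dbt-ol run' instead of 'dbt run'" note above.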
Presentation of Proxy Backend design:
Egeria is less interested in instances (runs) and will keep track of OpenLineage events separately as Operational lineage
Two ways to use Egeria with OpenLineage:
Proxy Backend in OpenLineage: receives HTTP events and forwards them to Kafka. A consumer receives the Kafka events in Egeria.
Direct HTTP endpoint implementation in Egeria.
Depending on the user, they might pick one or the other; we'll document both:
Use a direct OpenLineage endpoint (like Marquez)
Deploy the Proxy Backend to write to a queue (ex: Kafka)
Follow up items:
The transport abstraction (Backend interface) could be usable directly from the client or from the Proxy Backend. The user can decide if they want the intermediate proxy. See #269
We should add a distribution client symmetric to the Proxy Backend. It reads from Kafka and sends events to an OpenLineage HTTP endpoint. Marquez would use it, for example, to consume OpenLineage events produced by Egeria. See #270
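A minimal sketch of the two follow-up items, under stated assumptions: the `Backend` interface, `ProxyBackend`, and `distribute` names are hypothetical, an in-process `queue.Queue` stands in for Kafka, and a recording class stands in for an HTTP endpoint such as Marquez. It only illustrates that the same transport abstraction can be used directly by a client or behind a proxy, and that a distribution client is the mirror image of the proxy.

```python
import json
import queue
from abc import ABC, abstractmethod
from typing import Any, Dict, List

Event = Dict[str, Any]


class Backend(ABC):
    """Hypothetical transport abstraction: anything that accepts one event."""

    @abstractmethod
    def emit(self, event: Event) -> None:
        ...


class QueueBackend(Backend):
    """Stand-in for a Kafka producer, backed by an in-process queue."""

    def __init__(self, q: "queue.Queue[str]") -> None:
        self._queue = q

    def emit(self, event: Event) -> None:
        self._queue.put(json.dumps(event))


class RecordingBackend(Backend):
    """Stand-in for an OpenLineage HTTP endpoint; keeps events in a list."""

    def __init__(self) -> None:
        self.received: List[Event] = []

    def emit(self, event: Event) -> None:
        self.received.append(event)


class ProxyBackend:
    """Receives events (over HTTP in a real deployment) and forwards them."""

    def __init__(self, backend: Backend) -> None:
        self._backend = backend

    def handle(self, event: Event) -> None:
        self._backend.emit(event)


def distribute(q: "queue.Queue[str]", target: Backend) -> int:
    """Distribution client: drain the queue and forward each event to target."""
    count = 0
    while True:
        try:
            payload = q.get_nowait()
        except queue.Empty:
            return count
        target.emit(json.loads(payload))
        count += 1


q: "queue.Queue[str]" = queue.Queue()
event = {"eventType": "COMPLETE", "job": {"namespace": "demo", "name": "daily_load"}}

QueueBackend(q).emit(event)                  # client uses the transport directly
ProxyBackend(QueueBackend(q)).handle(event)  # same transport behind the proxy

endpoint = RecordingBackend()
forwarded = distribute(q, endpoint)          # queue -> endpoint, mirror of the proxy
```

Because both the client and the proxy depend only on `Backend`, the choice of whether to deploy the intermediate proxy stays with the user.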
presentation of Iceberg model
Manifest and manifest list: 2-level tree structure tracking data files.
Root metadata version file: points to the manifest list (it knows all of the previous versions of the dataset that we want to keep).
Iceberg collects various metadata about scans and the data being produced and wants to expose it through OpenLineage. It can already expose metadata, but there is no listener yet.
Ryan: added the metadata list presented to the Iceberg ticket: See #167
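The metadata layout presented above can be pictured with a toy model. This is only an illustration of the tree shape from the notes (root metadata file, manifest list, manifests, data files); the class and field names are hypothetical and are not Iceberg's API or file format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Manifest:
    """Second level of the tree: tracks a group of data files."""
    data_files: List[str]


@dataclass
class ManifestList:
    """First level of the tree: the manifests that make up one version."""
    manifests: List[Manifest]


@dataclass
class Version:
    """One retained version of the dataset, pointing at its manifest list."""
    version_id: int
    manifest_list: ManifestList


@dataclass
class RootMetadata:
    """Root metadata file: knows every version we want to keep."""
    versions: List[Version] = field(default_factory=list)

    def current(self) -> Version:
        return self.versions[-1]


def files_in(version: Version) -> List[str]:
    """Walk the two-level tree and return every data file in a version."""
    return [f for m in version.manifest_list.manifests for f in m.data_files]


first = Manifest(["a.parquet"])
second = Manifest(["b.parquet"])
root = RootMetadata([
    Version(1, ManifestList([first])),
    Version(2, ManifestList([first, second])),  # new version reuses the old manifest
])
current_files = files_in(root.current())
```

A lineage listener could, in principle, read this kind of structure to report which files a scan or a write touched.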
Ryan Blue
Maciej Obuchowski
Michael Collado
Daniel Henneberger
Willy Lulciuc
Mandy Chessell
Julien Le Dem
Peter Hicks
Minkyu Park
Daniel Avancini
How do we define extension points for integrations? For example, hooks in Spark and Airflow that let the user add adapters/facets without having to modify OpenLineage.
Mission statement:
Overall consensus on the statement.
TODO: vote by commenting on the ticket
Spec versioning mechanism:
The goal is to commit to compatible changes once 0.1 is published
We need a follow-up to handle core facet versioning separately.
The lineage event should have a field that identifies what version of the spec it was produced with
=> TODO: create a github issue for this
TODO: Add issue to document version number semantics (SCHEMAVER)
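To make the versioning discussion concrete, here is a sketch under loudly stated assumptions: the field name `specVersion` is hypothetical (the notes leave the field to a follow-up issue), and the compatibility rule is a simplification of SchemaVer's MODEL-REVISION-ADDITION scheme in which only ADDITION bumps are treated as safe for a reader.

```python
from typing import NamedTuple


class SchemaVer(NamedTuple):
    model: int
    revision: int
    addition: int


def parse_schemaver(text: str) -> SchemaVer:
    """Parse a 'MODEL-REVISION-ADDITION' string such as '1-0-2'."""
    parts = text.split("-")
    if len(parts) != 3:
        raise ValueError(f"not a SchemaVer string: {text!r}")
    return SchemaVer(*(int(p) for p in parts))


def is_compatible(reader_version: str, event_version: str) -> bool:
    """Assumed rule: same MODEL and REVISION means the reader can consume it."""
    reader = parse_schemaver(reader_version)
    produced = parse_schemaver(event_version)
    return (reader.model, reader.revision) == (produced.model, produced.revision)


# Hypothetical event carrying the version of the spec it was produced with.
event = {"specVersion": "1-0-2", "eventType": "START"}
readable = is_compatible("1-0-0", event["specVersion"])
```

Carrying the version in every event lets a consumer make this decision per event rather than per deployment.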
Extend Event State notion:
where do we capture more precise state transitions like RESTART?
Discussion should happen here: https://github.com/OpenLineage/OpenLineage/issues/9
OpenLineage 0.1:
Finalize a few spec details for 0.1: a few items are left to discuss.
In particular job naming
parent job model
Importing Marquez integrations in OpenLineage
Open Discussion:
connecting the consumer and producer
TODO: ticket to track distribution mechanism
options:
Would we need a consumption client to make it easy for consumers to get events from Kafka for example?
OpenLineage provides client libraries to serialize/deserialize events as well as sending them.
proxy similar to OpenTelemetry Collector.
Send to Kafka: https://github.com/OpenLineage/OpenLineage/issues/70
We can have documentation on how to send to backends that are not Marquez using HTTP and existing gateway mechanism to queues.
Do we have a mutual third party, or does the client know where to send?
Source code location finalization
job naming convention
you don't always have a nested execution
can call a parent
parent job
You can have a job calling another one.
always distinguish a job and its run
need a separate notion for job dependencies
need to capture event driven: TODO: create ticket.
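The job/run distinction and the optional parent relationship discussed above can be sketched as follows. The class names and fields are hypothetical and are not the OpenLineage spec; the sketch only shows that a job is a stable identity, a run is one execution of it, and a parent is optional because execution is not always nested.

```python
import uuid
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Job:
    """Stable identity of a job: what runs, not when it ran."""
    namespace: str
    name: str


@dataclass(frozen=True)
class Run:
    """One execution of a job, optionally started by a parent run."""
    run_id: str
    job: Job
    parent: Optional["Run"] = None


def new_run(job: Job, parent: Optional[Run] = None) -> Run:
    """Create a run with a fresh identifier."""
    return Run(str(uuid.uuid4()), job, parent)


dag = Job("airflow", "daily_etl")
task = Job("airflow", "daily_etl.load")

dag_run = new_run(dag)
task_run = new_run(task, parent=dag_run)   # nested execution: task inside a DAG run
rerun = new_run(task, parent=dag_run)      # same job, different run
standalone = new_run(Job("cron", "cleanup"))  # no nested execution, so no parent
```

Dependencies between jobs (one job triggering or feeding another) are a separate notion and are deliberately not modeled here.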
Agenda:
project communication
Technical charter review
medium term roadmap discussion
Notes:
project communication
github: for specs, designs, reviews and building consensus (issues and PRs)
email: for announcements, notes, etc
Slack: transient discussions; it does not maintain history. Any decision making or notes should go to a persistent medium (email and github)
monthly meeting: recorded, notes and recording published on the wiki
Technical Charter review:
TODO: Finalize the mission statement. TSC members to comment in the doc.
Roadmap discussion:
TODO: please comment in the doc. Julien to update the OpenLineage project in github: https://github.com/OpenLineage/OpenLineage/projects/1