The OpenLineage Technical Steering Committee meets monthly on the second Wednesday, 9:00am to 10:00am US Pacific. Link to join the meeting: https://us02web.zoom.us/j/81831865546?pwd=RTladlNpc0FTTDlFcWRkM2JyazM4Zz09
All are welcome.
Aug 11th 2021
- Attendees:
- TSC:
Ryan Blue
Maciej Obuchowski
Michael Collado
Daniel Henneberger
Willy Lulciuc
Mandy Chessell
Julien Le Dem
- And:
Peter Hicks
Minkyu Park
Daniel Avancini
- Meeting recording:
- zoom link
- Passcode: =RBUj01C
- Meeting notes:
- Agenda:
- Coming in OpenLineage 0.1
- OpenLineage spec versioning
- Clients
- Marquez integrations imported in OpenLineage
- Apache Airflow:
- BigQuery
- Postgres
- Snowflake
- Redshift
- Great Expectations
- Apache Spark
- dbt
- OpenLineage 0.2 scope discussion
- Facet versioning mechanism (Issue #153)
- OpenLineage Proxy Backend (Issue #152)
- Kafka client
- Roadmap
- Open discussion
- Slides: https://docs.google.com/presentation/d/1Lxp2NB9xk8sTXOnT0_gTXicKX5FsktWa/edit#slide=id.ge80fbcb367_0_14
- Notes:
- OpenLineage 0.1 is being published
- Coming in OpenLineage 0.1
- OpenLineage spec versioning
- Clients (Java, Python)
- Marquez integrations imported in OpenLineage
- Apache Airflow:
- BigQuery
- Postgres
- Snowflake
- Redshift
- Great Expectations
- Apache Spark
- dbt
- Question: How is airflow capturing openlineage events?
- openlineage-airflow installed on the airflow instance
- adapters per operator
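The "adapters per operator" idea can be pictured as a registry keyed by operator class name. This is a minimal sketch for illustration only: `EXTRACTORS`, `register_extractor`, and `extract_lineage` are hypothetical names and not the openlineage-airflow API, and the task is modelled as a plain dict.

```python
from typing import Callable, Dict

# Hypothetical registry: operator class name -> adapter returning lineage metadata.
EXTRACTORS: Dict[str, Callable[[dict], dict]] = {}


def register_extractor(operator_name: str):
    """Decorator that registers an adapter for one operator type (assumed design)."""
    def wrap(fn):
        EXTRACTORS[operator_name] = fn
        return fn
    return wrap


@register_extractor("PostgresOperator")
def extract_postgres(task: dict) -> dict:
    # A real adapter would parse the SQL; here the task dict lists its tables directly.
    return {"inputs": task.get("reads", []), "outputs": task.get("writes", [])}


def extract_lineage(task: dict) -> dict:
    """Dispatch to the adapter for the task's operator; unknown operators yield no datasets."""
    extractor = EXTRACTORS.get(task["operator"])
    if extractor is None:
        return {"inputs": [], "outputs": []}
    return extractor(task)
```

With this shape, supporting a new operator means registering one more adapter rather than changing the dispatch code.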
- OpenLineage 0.2 scope discussion
- Facet versioning mechanism (Issue #153)
- OpenLineage Proxy Backend (Issue #152)
- Questions:
- What is the advantage of the proxy backend?
- The consumer does not need to implement an endpoint and can consume from kafka
- can configure what to do with events independently of various integrations
- first step to having a routing mechanism:
- to send events to multiple consumers
- to have rule-based routing
- to enable archiving the events in addition to sending them
- Is it included in OpenLineage?
- Yes (Otherwise it would have to be in Egeria)
- Does it include error management or retry policy? What if the proxy dies? Do we care about durability?
- Yes we care about durability
- first implementation to be synchronous. single transaction to Kafka per event.
- future might be configurable to adjust depending on context (guaranteed delivery vs performance batching)
- What technology should we use?
- Proposed: Java + spring boot (like Egeria)
- discussion to use Java + dropwizard like Marquez
- general consensus on using java. (framework TBD)
- In the future, might have a go implementation to enable lightweight sidecar pattern
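The routing ideas discussed for the proxy (multiple consumers, rule-based routing, archiving, synchronous delivery per event) could look roughly like the sketch below. It is not the proposed implementation: the meeting leaned towards Java, and `ProxyRouter`, `add_route`, and `handle` are hypothetical names. Plain Python lists stand in for a Kafka topic and an archive.

```python
from typing import Callable, List, Tuple

Event = dict
Sink = Callable[[Event], None]
Rule = Callable[[Event], bool]


class ProxyRouter:
    """Hypothetical router: forwards each event synchronously to every sink whose rule matches."""

    def __init__(self) -> None:
        self._routes: List[Tuple[Rule, Sink]] = []

    def add_route(self, rule: Rule, sink: Sink) -> None:
        self._routes.append((rule, sink))

    def handle(self, event: Event) -> int:
        """Deliver one event and return the number of sinks it was sent to.

        An exception raised by a sink propagates to the caller, so the
        producer learns about the failure (no retry or batching here).
        """
        delivered = 0
        for rule, sink in self._routes:
            if rule(event):
                sink(event)
                delivered += 1
        return delivered


# Stand-ins for a Kafka topic and an archive; real sinks would wrap a Kafka producer.
kafka_topic: List[Event] = []
archive: List[Event] = []

router = ProxyRouter()
router.add_route(lambda e: True, archive.append)  # archive every event
router.add_route(lambda e: e.get("eventType") == "COMPLETE", kafka_topic.append)  # rule-based route
```

Keeping delivery synchronous matches the "first implementation" described above; batching for throughput would replace the loop in `handle` with a buffer and a flush step.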
- Kafka client
- Roadmap
- Open discussion
- How do we define extension points for integrations? For example, hooks in Spark and Airflow for users to add adapters/facets without having to modify OpenLineage.
- TODO: create a ticket to track this
- Apache Iceberg interest in OpenLineage:
- Would want to add additional notifications
- how many files read or written
- How long a commit took.
- How many attempts to commit were needed?
- TODO: create ticket to enable Iceberg facets to be added to OpenLineage events
- Iceberg needs to send events independently of where the library is used. (example: plain java process or other)
- TODO: need ticket for this => #167 Iceberg integration
- TODO: ticket for PrestoDB/Trino integrations
- => #164 Trino and #165 PrestoDB
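As an illustration of the Iceberg metrics mentioned above, a custom facet might carry them as shown below. This is only a sketch: the facet key `iceberg_commit`, the field names, and the example.com URLs are hypothetical placeholders, and it assumes the facet carries `_producer` and `_schemaURL` like other facets in the spec.

```python
import json


def iceberg_commit_facet(files_written: int, commit_duration_ms: int, attempts: int) -> dict:
    """Build a hypothetical custom facet carrying the Iceberg commit metrics."""
    return {
        # Placeholder values; a real facet would point at its producer and JSON schema.
        "_producer": "https://example.com/iceberg-integration",
        "_schemaURL": "https://example.com/schemas/IcebergCommitFacet.json",
        "filesWritten": files_written,
        "commitDurationMs": commit_duration_ms,
        "commitAttempts": attempts,
    }


# Facets are keyed by name inside the event; the key used here is an assumption.
facets = {"iceberg_commit": iceberg_commit_facet(12, 840, 2)}
print(json.dumps(facets, indent=2))
```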
- Egeria has a weekly community call
- September 1st will be about OpenLineage
- Also an upcoming webinar
July 14th 2021
- Attendees:
- TSC:
- Julien Le Dem
- Mandy Chessell
- Michael Collado
- Willy Lulciuc
- Meeting recording:
- zoom link
- Passcode: =!oZ?&0A
- Meeting notes
- Agenda:
- Finalize the OpenLineage Mission Statement
- Review OpenLineage 0.1 scope
- Roadmap
- Open discussion
- Slides: https://docs.google.com/presentation/d/1fD_TBUykuAbOqm51Idn7GeGqDnuhSd7f/edit#slide=id.ge4b57c6942_0_46
- Notes:
Mission statement:
Overall consensus on the statement.
TODO: vote by commenting on the ticket
Spec versioning mechanism:
The goal is to commit to compatible changes once 0.1 is published
We need a follow up to separate core facet versioning
=> TODO: create a separate github ticket.
The lineage event should have a field that identifies what version of the spec it was produced with
=> TODO: create a github issue for this
TODO: Add issue to document version number semantics (SCHEMAVER)
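To make the version-number semantics concrete, here is a small sketch assuming SCHEMAVER refers to the MODEL-REVISION-ADDITION scheme, written with hyphens (for example `1-0-0`). The `SchemaVer` class and `change_kind` function are hypothetical helpers, not part of any OpenLineage client.

```python
from typing import NamedTuple


class SchemaVer(NamedTuple):
    model: int     # bumped for changes that break all existing data
    revision: int  # bumped for changes that may break some existing data
    addition: int  # bumped for changes compatible with all existing data

    @classmethod
    def parse(cls, text: str) -> "SchemaVer":
        model, revision, addition = (int(part) for part in text.split("-"))
        return cls(model, revision, addition)


def change_kind(old: SchemaVer, new: SchemaVer) -> str:
    """Name the most significant component that differs between two versions."""
    if new.model != old.model:
        return "MODEL"
    if new.revision != old.revision:
        return "REVISION"
    if new.addition != old.addition:
        return "ADDITION"
    return "NONE"
```

Under the goal stated above (only compatible changes once 0.1 is published), a consumer could accept events whose version differs from its own only by an `ADDITION` change.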
Extend Event State notion:
where do we capture more precise state transitions like RESTART?
Discussion should happen here: https://github.com/OpenLineage/OpenLineage/issues/9
OpenLineage 0.1:
finalize a few spec details for 0.1 : a few items left to discuss.
In particular job naming
parent job model
Importing Marquez integrations in OpenLineage
Open Discussion:
connecting the consumer and producer
TODO: ticket to track distribution mechanism
options:
Would we need a consumption client to make it easy for consumers to get events from Kafka for example?
OpenLineage provides client libraries to serialize/deserialize events as well as sending them.
proxy similar to OpenTelemetry Collector.
Send to Kafka: https://github.com/OpenLineage/OpenLineage/issues/70
We can have documentation on how to send to backends that are not Marquez using HTTP and existing gateway mechanism to queues.
Do we have a mutual third party, or does the client know where to send?
Source code location finalization
job naming convention
you don't always have a nested execution
can call a parent
parent job
You can have a job calling another one.
always distinguish a job and its run
need a separate notion for job dependencies
need to capture event driven: TODO: create ticket.
TODO(Julien): update job naming ticket to have the discussion.
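The distinction between a job and its run, with an optional parent, can be sketched as follows. This is an illustrative model only, not the spec: the `Job` and `Run` classes, the `start_run` helper, and the example namespaces and job names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional
import uuid


@dataclass(frozen=True)
class Job:
    namespace: str
    name: str


@dataclass(frozen=True)
class Run:
    run_id: str
    job: Job
    parent: Optional["Run"] = None  # set only when another run triggered this one


def start_run(job: Job, parent: Optional[Run] = None) -> Run:
    """Create a new run of a job; each call gets its own run id."""
    return Run(run_id=str(uuid.uuid4()), job=job, parent=parent)


dag = Job("airflow", "daily_etl")
task = Job("airflow", "daily_etl.load_orders")

dag_run = start_run(dag)
task_run = start_run(task, parent=dag_run)       # nested execution: task run under a DAG run
standalone = start_run(Job("cron", "cleanup"))   # no nested execution, so no parent
```

Because the parent is attached to the run and not to the job, the same job can run standalone one day and under a parent the next.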
June 9th 2021
- Attendees:
- TSC:
Julien Le Dem: Marquez, Datakin
Drew Banin: dbt, CPO at Fishtown Analytics
Maciej Obuchowski: Marquez, GetInData consulting company
Zhamak Dehghani: Data Mesh. An open protocol of observability for the data ecosystem is a big piece of Data Mesh
Daniel Henneberger: building a database, interested in lineage
Mandy Chessell: Lead of Egeria, metadata exchange. lineage is a great extension that volunteers lineage
Willy Lulciuc: co-creator of Marquez
Michael Collado: Datakin, OpenLineage end-to-end holistic approach.
- And:
Kedar Rajwade: consulting on distributed systems.
Barr Yaron: dbt, PM at Fishtown Analytics on metadata.
Victor Shafran: co-founder at databand.ai, a pipeline monitoring company. Lineage is a common issue.
- Excused: Ryan Blue, James Campbell
- Meeting recording:
- zoom link
- Passcode: +ge1Akp9
- Meeting notes:
Agenda:
project communication
Technical charter review
medium term roadmap discussion
Notes:
project communication
github: for specs, designs, reviews and building consensus (issues and PRs)
email: for announcements, notes, etc
Slack: transient discussions, does not maintain history. Any decision making or notes should go to a persistent medium (email and github)
monthly meeting: recorded, notes and recording published on the wiki
Technical Charter review:
TODO: Finalize the mission statement. TSC members to comment in the doc.
Roadmap discussion:
TODO: please comment in the doc. Julien to update the OpenLineage project in github: https://github.com/OpenLineage/OpenLineage/projects/1