The OpenLineage Technical Steering Committee meets monthly on the second Wednesday, 9:00am to 10:00am US Pacific. Link to join the meeting: https://us02web.zoom.us/j/81831865546?pwd=RTladlNpc0FTTDlFcWRkM2JyazM4Zz09
All are welcome.
Sept 8th 2021
- Attendees:
- TSC:
- Mandy Chessell: Egeria Lead. Integrating OpenLineage in Egeria
- Michael Collado: Datakin, OpenLineage
- Maciej Obuchowski: GetInData. OpenLineage integrations
- Willy Lulciuc: Marquez co-creator.
- Ryan Blue: Tabular, Iceberg. Interested in collecting lineage across Iceberg users with OpenLineage
- And:
- Venkatesh Tadinada: BMC workflow automation looking to integrate with Marquez
- Minkyu Park: Datakin. Learning about OpenLineage
- Arthur Wiedmer: Apple, lineage for Siri and AI ML. Interested in implementing Marquez and OpenLineage
- Meeting recording:
- zoom link
- Passcode:
- Meeting notes:
- Discussions:
Added a discussion of Iceberg requirements for OpenLineage to the agenda.
Demo of dbt:
- really easy to try
- when running from Airflow, we can use the wrapper 'dbt-ol run' instead of 'dbt run'
Presentation of Proxy Backend design:
- summary of discussions in Egeria
Egeria is less interested in instances (runs) and will keep track of OpenLineage events separately as Operational lineage
Two ways to use Egeria with OpenLineage:
- Proxy Backend in OpenLineage: it receives HTTP events and forwards them to Kafka; a consumer receives the Kafka events in Egeria
- Direct HTTP endpoint implementation in Egeria
Depending on the user, they might pick one or the other, and we'll document both.
Use a direct OpenLineage endpoint (like Marquez)
Deploy the Proxy Backend to write to a queue (ex: Kafka)
Follow up items:
The transport abstraction (Backend interface) could be usable directly from the client or from the Proxy Backend. The user can decide whether they want the intermediate proxy. TODO: add GitHub issue to track discussion.
We should add a distribution client symmetric to the Proxy Backend. It reads from Kafka and sends events to an OpenLineage HTTP endpoint. Marquez would use it, for example, to consume OpenLineage events produced by Egeria. TODO: add GitHub issue
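A minimal Python sketch of the two follow-up items above, assuming the transport abstraction is a single `emit` method. All names here (`Transport`, `QueueTransport`, `HttpTransport`, `ProxyBackend`, `distribute`) are hypothetical and not part of OpenLineage; in-memory lists stand in for Kafka and for the HTTP endpoint, and the URL is a placeholder.

```python
import json
from typing import List, Protocol


class Transport(Protocol):
    """Hypothetical transport abstraction: anything that can emit an event."""

    def emit(self, event: dict) -> None: ...


class QueueTransport:
    """Stand-in for a queue such as Kafka: stores serialized events in memory."""

    def __init__(self) -> None:
        self.messages: List[str] = []

    def emit(self, event: dict) -> None:
        self.messages.append(json.dumps(event))


class HttpTransport:
    """Stand-in for an OpenLineage HTTP endpoint: records what it would POST."""

    def __init__(self, url: str) -> None:
        self.url = url
        self.posted: List[dict] = []

    def emit(self, event: dict) -> None:
        self.posted.append(event)


class ProxyBackend:
    """Receives events and forwards them to whichever transport is configured."""

    def __init__(self, transport: Transport) -> None:
        self.transport = transport

    def handle(self, event: dict) -> None:
        self.transport.emit(event)


def distribute(queue: QueueTransport, target: Transport) -> int:
    """Symmetric 'distribution client': drain the queue into another transport."""
    count = 0
    while queue.messages:
        target.emit(json.loads(queue.messages.pop(0)))
        count += 1
    return count


event = {"eventType": "COMPLETE", "job": {"namespace": "demo", "name": "daily_load"}}

# Option 1: the client uses a transport directly, with no intermediate proxy.
direct = HttpTransport("http://localhost:5000/api/v1/lineage")  # placeholder URL
direct.emit(event)

# Option 2: events go through the proxy to a queue, then on to an endpoint.
queue = QueueTransport()
ProxyBackend(queue).handle(event)
endpoint = HttpTransport("http://localhost:5000/api/v1/lineage")  # placeholder URL
forwarded = distribute(queue, endpoint)
```

Because the client and the proxy share the same interface in this sketch, choosing whether to run the intermediate proxy becomes a configuration decision rather than a code change.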
Iceberg integration:
Presentation of the Iceberg model:
- Manifest and manifest list: 2-level tree structure tracking data files.
- Root metadata version file: points to the manifest list. (It knows all of the previous versions of the dataset that we want to keep.)
Iceberg collects various metadata about the scans and the data being produced and wants to expose it through OpenLineage. It can already expose metadata, but there is no listener yet. TODO(Ryan): add the metadata list presented to the Iceberg ticket.
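The tree described in the Iceberg presentation can be pictured with a small Python sketch. This is a simplified illustration, not Iceberg's actual classes or file formats: the class and field names are assumptions chosen for readability.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DataFile:
    path: str
    record_count: int


@dataclass
class Manifest:
    """Leaf level of the tree: tracks a group of data files."""
    data_files: List[DataFile]


@dataclass
class ManifestList:
    """Top level of the tree: one version of the dataset, as a list of manifests."""
    manifests: List[Manifest]


@dataclass
class RootMetadata:
    """Root metadata file: points to the current manifest list and keeps
    the previous versions that should be retained."""
    versions: List[ManifestList] = field(default_factory=list)

    def commit(self, manifest_list: ManifestList) -> None:
        self.versions.append(manifest_list)

    @property
    def current(self) -> ManifestList:
        return self.versions[-1]


def data_files(version: ManifestList) -> List[str]:
    """Walk the two-level tree to list every data file in one version."""
    return [f.path for m in version.manifests for f in m.data_files]


m1 = Manifest([DataFile("data/a.parquet", 100)])
m2 = Manifest([DataFile("data/b.parquet", 50)])

root = RootMetadata()
root.commit(ManifestList([m1]))      # first version
root.commit(ManifestList([m1, m2]))  # second version reuses m1 and adds m2
```

In this sketch a new version only adds a manifest list entry and reuses unchanged manifests, which is what lets the root keep earlier versions of the dataset readable.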
Aug 11th 2021
- Attendees:
- TSC:
- Ryan Blue
- Maciej Obuchowski
- Michael Collado
- Daniel Henneberger
- Willy Lulciuc
- Mandy Chessell
- Julien Le Dem
- And:
- Peter Hicks
- Minkyu Park
- Daniel Avancini
- Meeting recording:
- zoom link
- Passcode: =RBUj01C
- Meeting notes:
- Agenda:
- Coming in OpenLineage 0.1
- OpenLineage spec versioning
- Clients
- Marquez integrations imported in OpenLineage
- Apache Airflow:
- BigQuery
- Postgres
- Snowflake
- Redshift
- Great Expectations
- Apache Spark
- dbt
- OpenLineage 0.2 scope discussion
- Facet versioning mechanism (Issue #153)
- OpenLineage Proxy Backend (Issue #152)
- Kafka client
- Roadmap
- Open discussion
- Slides: https://docs.google.com/presentation/d/1Lxp2NB9xk8sTXOnT0_gTXicKX5FsktWa/edit#slide=id.ge80fbcb367_0_14
- Notes:
- OpenLineage 0.1 is being published
- Coming in OpenLineage 0.1
- OpenLineage spec versioning
- Clients (Java, Python)
- Marquez integrations imported in OpenLineage
- Apache Airflow:
- BigQuery
- Postgres
- Snowflake
- Redshift
- Great Expectations
- Apache Spark
- dbt
- Question: How is Airflow capturing OpenLineage events?
- openlineage-airflow installed on the airflow instance
- adapters per operator
- OpenLineage 0.2 scope discussion
- Facet versioning mechanism (Issue #153)
- OpenLineage Proxy Backend (Issue #152)
- Questions:
- What is the advantage of the proxy backend?
- The consumer does not need to implement an endpoint and can consume from Kafka
- can configure what to do with events independently of various integrations
- first step to having a routing mechanism:
- to send events to multiple consumers
- to have rule-based routing
- to enable archiving the events in addition to sending them
- Is it included in OpenLineage?
- Yes (Otherwise it would have to be in Egeria)
- Does it include error management or retry policy? What if the proxy dies? Do we care about durability?
- Yes we care about durability
- The first implementation will be synchronous: a single transaction to Kafka per event.
- In the future, it might be configurable to adjust depending on context (guaranteed delivery vs. performance batching)
- What technology should we use?
- Proposed: Java + Spring Boot (like Egeria)
- Discussion about using Java + Dropwizard, like Marquez
- General consensus on using Java (framework TBD)
- In the future, might have a Go implementation to enable a lightweight sidecar pattern
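The routing ideas raised in these questions (multiple consumers, rule-based routing, archiving) could look roughly like the Python sketch below. It is only an illustration under stated assumptions: the `Router` class, its rule format, and the list-backed sinks are hypothetical, and the consensus in the meeting was to build the actual proxy in Java. Delivery here is synchronous, one event at a time.

```python
from typing import Callable, Dict, List, Tuple

Event = Dict[str, object]
Sink = Callable[[Event], None]
Rule = Callable[[Event], bool]


class Router:
    """Hypothetical router: archive every event, then deliver it to each
    sink whose rule matches."""

    def __init__(self, archive: Sink) -> None:
        self.archive = archive
        self.routes: List[Tuple[Rule, Sink]] = []

    def add_route(self, rule: Rule, sink: Sink) -> None:
        self.routes.append((rule, sink))

    def handle(self, event: Event) -> int:
        """Synchronously process one event; returns the number of deliveries."""
        self.archive(event)
        delivered = 0
        for rule, sink in self.routes:
            if rule(event):
                sink(event)
                delivered += 1
        return delivered


# Lists stand in for real consumers (a Kafka topic, an HTTP endpoint, storage).
archive: List[Event] = []
all_events: List[Event] = []
failures: List[Event] = []

router = Router(archive.append)
router.add_route(lambda e: True, all_events.append)
router.add_route(lambda e: e.get("eventType") == "FAIL", failures.append)

router.handle({"eventType": "COMPLETE", "job": "daily_load"})
router.handle({"eventType": "FAIL", "job": "hourly_sync"})
```

Keeping the rules outside the integrations is the point made in the discussion: what happens to an event can be configured in one place, independently of which integration produced it.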
- Kafka client
- Roadmap
- Open discussion
How do we define extension points for integrations? For example, hooks in Spark and Airflow that let the user add adapters/facets without having to modify OpenLineage.
- TODO: create a ticket to track this
- Apache Iceberg interest in OpenLineage:
- Would want to add additional notifications
- how many files read or written
- How long a commit took.
- How many attempts to commit were needed?
- TODO: create ticket to enable Iceberg facets to be added to OpenLineage events
- Iceberg needs to send events independently of where the library is used. (example: plain java process or other)
- TODO: need ticket for this => #167 Iceberg integration
- TODO: ticket for PrestoDB/Trino integrations
- => #164 Trino and #165 PrestoDB
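As an illustration of the Iceberg metrics listed above, the Python sketch below attaches them to a run as a custom facet. The facet name `iceberg_commit`, its field names, and the example.com URLs are assumptions made for this example, not an agreed schema; only the general shape (a facet carrying `_producer` and `_schemaURL`) follows the OpenLineage spec, and the event is simplified.

```python
import json
from datetime import datetime, timezone
from uuid import uuid4

PRODUCER = "https://example.com/iceberg-integration"  # placeholder producer URI


def iceberg_commit_facet(files_written: int, duration_ms: int, attempts: int) -> dict:
    """Build a hypothetical custom run facet holding commit metrics."""
    return {
        "_producer": PRODUCER,
        "_schemaURL": "https://example.com/schemas/IcebergCommitRunFacet.json",
        "filesWritten": files_written,     # how many files were written
        "commitDurationMs": duration_ms,   # how long the commit took
        "commitAttempts": attempts,        # how many attempts were needed
    }


def build_event(facet: dict) -> dict:
    """Wrap the facet in a simplified OpenLineage-style run event."""
    return {
        "eventType": "COMPLETE",
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "producer": PRODUCER,
        "run": {"runId": str(uuid4()), "facets": {"iceberg_commit": facet}},
        "job": {"namespace": "demo", "name": "append_to_table"},
    }


event = build_event(iceberg_commit_facet(files_written=12, duration_ms=840, attempts=2))
payload = json.dumps(event)  # what a plain Java or Python process would send
```

Since the payload is plain JSON, it can be produced wherever the library runs, which matches the requirement that Iceberg send events independently of the host process.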
- Egeria has a weekly community call
- September 1st will be about OpenLineage
- Also an incoming webinar
...