...
- OpenLineage and Azure Purview [Shrikanth]
...
Nov 10th 2021 (9am PT)
Attendees:
- TSC
- Mike Collado: Eng
- Ryan Blue: Tabular, Apache Iceberg
- Mandy Chessell: Egeria project lead
- Maciej Obuchowski: Eng GetInData, OpenLineage contributor
- Willy Lulciuc: Co-creator of Marquez
- Julien Le Dem: OpenLineage Project lead
- And:
- Michael Robinson: dev rel
- Peter Hicks: Marquez contributor
- Ross Turk: VP marketing at Datakin
- John Thomas: Support eng at Datakin
- Minkyu Park: Dev at Datakin, learning about MQZ and OL.
Agenda:
- OL Client use cases for Apache Iceberg [Ryan]
- Proxy Backend and Egeria integration progress update (Issue #152) [Mandy]
- OpenLineage last release overview (0.3.1)
- Facet versioning
- Airflow 2 / Spark 3 support, dbt improvements
- OpenLineage 0.4 scope review
- Proxy Backend (Issue #152)
- Spark, Airflow, dbt improvements (documentation, coverage, ...)
- improvements to the OpenLineage model
- Open discussion
Meeting recording:
Notes:
SPDX tags:
shorter license headers => makes things easier.
https://spdx.org/licenses/
TODO: Mandy will propose something next time
Iceberg requirements:
ability for Iceberg to add facets without having to depend on the context it's running in.
Avoid depending on allowing the Sources to expose facets in the Spark API as it would be a hard change to get into Spark.
Ryan:
...
Proposal to have a logger style API.
similar to SLF4J or dropwizard metrics => Create a logging/metrics object. Independent of logging backend.
Facets can be emitted and the backend can be configured independently whether those facets are picked up or not.
Example: Have an OpenLineage API to add facets in a given context:
create facet for some context: Read datasets x, ... write dataset Y
=> broad agreement on principle
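A minimal sketch of what such a logger-style API could look like, assuming a facade that looks up a pluggable backend at call time, in the spirit of SLF4J. All names here (NoopBackend, CollectingBackend, configure_backend, get_facet_logger, add_facet) and the "snapshot" facet payload are hypothetical and are not part of the OpenLineage client.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical names throughout; this is not an existing OpenLineage API.


class NoopBackend:
    """Default backend: facets are silently dropped when nothing is configured."""

    def emit(self, context: str, name: str, facet: Dict[str, Any]) -> None:
        pass


class CollectingBackend:
    """Keeps facets in memory, standing in for a real transport."""

    def __init__(self) -> None:
        self.records: List[Tuple[str, str, Dict[str, Any]]] = []

    def emit(self, context: str, name: str, facet: Dict[str, Any]) -> None:
        self.records.append((context, name, facet))


_backend: Any = NoopBackend()


def configure_backend(backend: Any) -> None:
    """Swap the backend without touching the code that emits facets."""
    global _backend
    _backend = backend


class FacetLogger:
    """Facade used by a library such as Iceberg to add facets in a context."""

    def __init__(self, context: str) -> None:
        self.context = context

    def add_facet(self, name: str, facet: Dict[str, Any]) -> None:
        # The backend is resolved at call time, so configuration can change later.
        _backend.emit(self.context, name, facet)


def get_facet_logger(context: str) -> FacetLogger:
    return FacetLogger(context)


# Usage: the emitting side never learns whether anything is listening.
log = get_facet_logger("iceberg.scan")
log.add_facet("snapshot", {"snapshotId": 42})  # dropped by NoopBackend

backend = CollectingBackend()
configure_backend(backend)
log.add_facet("snapshot", {"snapshotId": 42})  # now picked up
```

The point of the design is that the library emitting facets depends only on the facade, while whoever deploys it decides whether and where facets are picked up.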
Open Questions:
When are facets sent?
Do we send them immediately? Do we wait?
Iceberg does not create a facet until Spark asks for the splits.
Preference for sending events as they go.
Does that fit with the OpenLineage view of the world? => yes
Spark, bound to a context thread:
loggers depend on thread
listener is on different thread
Report for a given job run
Ryan: the run context is thread-local; it sets the execution id.
The logger backend can grab the SQL execution id.
The client side should be able to choose between sending an event immediately and sending it when it gets a chance.
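The thread-local run context and the immediate-versus-deferred choice can be sketched as follows. This is an illustration under assumptions, not the design that was agreed: set_execution_id, current_execution_id, BufferingEmitter and the event shape are all hypothetical names. The execution id is captured on the emitting thread, so a flush triggered later from a different thread (for example a listener) still reports the right run.

```python
import threading
from typing import Any, Dict, List, Optional

# Hypothetical names; a sketch of a thread-local run context, not a real API.
_run_context = threading.local()


def set_execution_id(execution_id: str) -> None:
    """Bind an execution id to the calling thread only."""
    _run_context.execution_id = execution_id


def current_execution_id() -> Optional[str]:
    return getattr(_run_context, "execution_id", None)


class BufferingEmitter:
    """Sends events right away or holds them until flush() is called."""

    def __init__(self) -> None:
        self.sent: List[Dict[str, Any]] = []
        self._pending: List[Dict[str, Any]] = []
        self._lock = threading.Lock()

    def emit(self, facet: Dict[str, Any], immediate: bool = False) -> None:
        # Capture the execution id now, on the emitting thread.
        event = {"executionId": current_execution_id(), "facet": facet}
        with self._lock:
            if immediate:
                self.sent.append(event)
            else:
                self._pending.append(event)

    def flush(self) -> None:
        # May be called from any thread; pending events keep their captured id.
        with self._lock:
            self.sent.extend(self._pending)
            self._pending.clear()


emitter = BufferingEmitter()


def job(execution_id: str) -> None:
    set_execution_id(execution_id)
    emitter.emit({"name": "read"}, immediate=True)
    emitter.emit({"name": "write"})  # deferred until flush()


worker = threading.Thread(target=job, args=("exec-1",))
worker.start()
worker.join()
emitter.flush()  # called from the main thread, which has no execution id
```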
Need to have a guide to defining a facet.
Who needs to do this?
Michael C.: TODO: Design Doc on logging
Willy: Do we need a "RUNNING" event?
Flink:
how to handle long running jobs
[Ryan] [Mandy] long running jobs need to be defined
TODO: Julien, post a ticket for long running jobs
There is also a need for an OSS Trino integration; Tabular might contribute.
Proxy Backend update [Mandy]
- draft PR #500: Thanks Willy for the initial setup.
Looking for feedback
Issues:
Initial implementation was using the provided beans to deserialize but it didn't quite work (TODO: ticket)
Instead, just pass events through: faster, but no validation.
- OL is the dynamic lineage solution for Egeria
used postman for 3rd party
released in a few weeks
https://odpi.github.io/egeria-docs/features/lineage-management/overview/#the-openlineage-standard
- proposal for new facets.
RequestFacet => should be a runfacet, maps to the run args in Marquez
https://github.com/OpenLineage/OpenLineage/issues/256
Does the last version of a facet win? => yes
Need to document size constraints in OL (name length...) TODO: ticket
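A minimal sketch of the "last version of a facet wins" rule noted above, assuming it means that a later facet with the same name replaces the earlier one wholesale rather than being deep-merged. The flat event shape and the facet names ("request", "stats") are simplified, hypothetical stand-ins for real OpenLineage events.

```python
from typing import Any, Dict, Iterable


def merge_run_facets(events: Iterable[Dict[str, Any]]) -> Dict[str, Any]:
    """Fold facets from events in arrival order; later facets replace earlier ones."""
    merged: Dict[str, Any] = {}
    for event in events:
        # dict.update overwrites existing keys, which gives last-wins per facet name.
        merged.update(event.get("facets", {}))
    return merged


events = [
    {"eventType": "START",
     "facets": {"request": {"args": ["--date", "2021-11-09"]}}},
    {"eventType": "COMPLETE",
     "facets": {"request": {"args": ["--date", "2021-11-10"]},
                "stats": {"rows": 10}}},
]
merged = merge_run_facets(events)
```

Here the "request" facet from the COMPLETE event replaces the one sent with START, while "stats" is simply added.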
Oct 13th 2021
Attendees:
- TSC:
Michael Collado: Datakin
Julien Le Dem: OpenLineage Project Lead, Datakin
Maciej Obuchowski: GetInData, OpenLineage
Willy Lulciuc: Marquez, OpenLineage
Mandy Chessell: Egeria Project Lead, working on OpenLineage
- And:
Ross Turk: VP marketing at Datakin, here to talk about the website
Minkyu Park: interested in contributing to Datakin
Peter Hicks: Marquez contributor, OpenLineage user
...