...

OpenLineage and Azure Purview [Shrikanth]

...

Nov 10th 2021 (9am PT)

Attendees:

TSC
- Mike Collado: Eng
- Ryan Blue: Tabular, Apache Iceberg
- Mandy Chessell: Lead of the Egeria project
- Maciej Obuchowski: Eng GetInData, OpenLineage contributor
- Willy Lulciuc: Co-creator of Marquez
- Julien: OpenLineage Project lead
And:
- Michael Robinson: dev rel
- Peter Hicks: Marquez contributor
- Ross Turk: VP marketing, Datakin
- John Thomas: Support eng at Datakin
- Minkyu Park: Dev at Datakin, learning about Marquez and OpenLineage

Agenda:

OL Client use cases for Apache Iceberg [Ryan]
Proxy Backend and Egeria integration progress update (Issue #152) [Mandy]
OpenLineage last release overview (0.3.1)
- Facet versioning
- Airflow 2 / Spark 3 support, dbt improvements
OpenLineage 0.4 scope review
- Proxy Backend (Issue #152)
- Spark, Airflow, dbt improvements (documentation, coverage, ...)
- improvements to the OpenLineage model
Open discussion

Meeting recording:

zoom link
Passcode: $0?6ty6E
Slides

Notes:

SPDX tags:
shorter license headers => makes things easier.
https://spdx.org/licenses/
TODO: Mandy will propose something next time
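
For illustration, an SPDX tag replaces a multi-paragraph license header with a single comment line such as `# SPDX-License-Identifier: Apache-2.0`. The sketch below shows one way a repository check could look for that tag. The function name `spdx_license` and the five-line search window are assumptions made for this example, not an agreed project convention, and the pattern only captures simple single identifiers, not compound SPDX expressions.

```python
import re
from typing import Optional

# Matches the standard SPDX short-form tag inside any comment style.
SPDX_TAG = re.compile(r"SPDX-License-Identifier:\s*(\S+)")


def spdx_license(source: str, max_lines: int = 5) -> Optional[str]:
    """Return the license identifier found near the top of a file, or None.

    Hypothetical helper: only the first `max_lines` lines are inspected.
    """
    for line in source.splitlines()[:max_lines]:
        match = SPDX_TAG.search(line)
        if match:
            return match.group(1)
    return None


header = "# SPDX-License-Identifier: Apache-2.0\nimport os\n"
found = spdx_license(header)
```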

Iceberg requirements:

Ability for Iceberg to add facets without depending on the context it is running in.
Avoid relying on Spark sources exposing facets through the Spark API, as that would be a hard change to get into Spark.

Ryan:

...

Proposal to have a logger style API.

Similar to SLF4J or Dropwizard Metrics => create a logging/metrics object that is independent of the logging backend.
Facets can be emitted and the backend can be configured independently whether those facets are picked up or not.

Example: Have an OpenLineage API to add facets in a given context:
create facet for some context: Read datasets x, ... write dataset Y

=> broad agreement on principle
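
The logger-style proposal above can be sketched roughly as follows. This is a minimal illustration of the facade idea only, not an existing OpenLineage API: `FacetBackend`, `configure`, `get_facet_logger`, `add_facet`, the `icebergScan` facet name and its `snapshotId` field are all hypothetical names chosen for the example.

```python
from typing import Any, Dict, List, Protocol, Tuple


class FacetBackend(Protocol):
    """Hypothetical backend interface that receives emitted facets."""

    def emit(self, context: str, name: str, facet: Dict[str, Any]) -> None: ...


class NoOpBackend:
    """Default backend: facets are dropped when nothing is configured."""

    def emit(self, context: str, name: str, facet: Dict[str, Any]) -> None:
        pass


class InMemoryBackend:
    """Backend that records facets so a listener can pick them up later."""

    def __init__(self) -> None:
        self.records: List[Tuple[str, str, Dict[str, Any]]] = []

    def emit(self, context: str, name: str, facet: Dict[str, Any]) -> None:
        self.records.append((context, name, facet))


_backend: FacetBackend = NoOpBackend()


def configure(backend: FacetBackend) -> None:
    """Select the backend independently of the code that emits facets."""
    global _backend
    _backend = backend


class FacetLogger:
    """Facade handed to libraries; it knows the context, not the backend."""

    def __init__(self, context: str) -> None:
        self.context = context

    def add_facet(self, name: str, facet: Dict[str, Any]) -> None:
        # The backend is looked up at call time, so it can be configured late.
        _backend.emit(self.context, name, facet)


def get_facet_logger(context: str) -> FacetLogger:
    return FacetLogger(context)


# Library side: emit a facet without knowing whether anyone listens.
log = get_facet_logger("read:dataset_x")
log.add_facet("icebergScan", {"snapshotId": 42})  # dropped by NoOpBackend

# Application side: plug in a backend, after which facets are collected.
backend = InMemoryBackend()
configure(backend)
log.add_facet("icebergScan", {"snapshotId": 43})
```

As with SLF4J, the emitting library depends only on the facade, and whether a facet is picked up is decided entirely by which backend is configured.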

Open Questions:

When are facets sent?

do we send them immediately? Do we wait?
Iceberg does not create a facet until Spark asks for the splits.

Preference for sending events as they occur.
Does that fit with the OpenLineage view of the world? => yes

Spark, bound to a context thread:

loggers depend on thread
listener is on different thread
Report for a given job run
Ryan: the run context is thread-local; it sets the execution ID.

the "logger backend can grab the sql execution id"
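
A thread-local run context of the kind discussed above might look like the sketch below. It uses Python's `threading.local` as a stand-in for a Java `ThreadLocal`; `set_execution_id` and `current_execution_id` are hypothetical helper names, and the execution IDs are made up for the example.

```python
import threading
from typing import Dict, Optional

# Each thread sees its own attributes on this object.
_run_context = threading.local()


def set_execution_id(execution_id: str) -> None:
    """Bind an execution ID to the calling thread only."""
    _run_context.execution_id = execution_id


def current_execution_id() -> Optional[str]:
    """What a logger backend could call to tag a facet with the current run."""
    return getattr(_run_context, "execution_id", None)


results: Dict[str, Optional[str]] = {}


def worker(name: str, execution_id: str) -> None:
    set_execution_id(execution_id)
    # A facet emitted here would be attributed to this thread's execution ID.
    results[name] = current_execution_id()


threads = [
    threading.Thread(target=worker, args=(f"job-{i}", f"exec-{i}"))
    for i in range(2)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The main thread never sets an ID, so it sees none. This mirrors the concern in the notes: a listener running on a different thread cannot read the context directly and needs the ID handed to it.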

The client side should be able to choose between sending an event immediately and sending it when it gets a chance.
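
One possible shape for that choice is a client that either hands an event straight to its transport or buffers it until a later flush. This is a sketch under assumptions: `BufferingClient`, its `immediate` flag and `flush` method are hypothetical and do not describe an existing OpenLineage client.

```python
from typing import Any, Callable, Dict, List

Event = Dict[str, Any]


class BufferingClient:
    """Hypothetical client offering immediate and deferred delivery."""

    def __init__(self, transport: Callable[[Event], None]) -> None:
        self._transport = transport
        self._pending: List[Event] = []

    def emit(self, event: Event, immediate: bool = False) -> None:
        if immediate:
            # Delivered now, possibly ahead of events still in the buffer.
            self._transport(event)
        else:
            self._pending.append(event)

    def flush(self) -> None:
        """Deliver buffered events in the order they were emitted."""
        pending, self._pending = self._pending, []
        for event in pending:
            self._transport(event)


sent: List[Event] = []
client = BufferingClient(sent.append)
client.emit({"eventType": "START"}, immediate=True)
client.emit({"eventType": "COMPLETE"})  # buffered until flush()
sent_before_flush = list(sent)
client.flush()
```

Note that immediate events can overtake buffered ones, so a real design would need to decide whether ordering matters to the backend.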

Need to have a guide to defining a facet.

Who needs to do this?

Michael C.: TODO: Design Doc on logging
Willy: Do we need a "RUNNING" event?

Flink:

How to handle long-running jobs
[Ryan] [Mandy] Long-running jobs need to be defined
TODO: Julien to post a ticket for long-running jobs

There is also a need for an OSS Trino integration; Tabular might contribute.

Proxy Backend update [Mandy]

draft PR #500: Thanks Willy for the initial setup.
Looking for feedback
Issues:
The initial implementation used the provided beans to deserialize events, but that didn't quite work (TODO: ticket).
Instead, events are just passed through: faster, but no validation.
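
The trade-off between pass-through and deserializing can be illustrated with the sketch below. It is not the code from the draft PR: the `proxy` function, the `sink` callable and the `REQUIRED_FIELDS` subset are assumptions made for the example, and a real proxy would validate against the OpenLineage JSON schema rather than a hand-picked field list.

```python
import json
from typing import Callable, List

# Illustrative subset of top-level fields, assumed for this example only.
REQUIRED_FIELDS = ("eventTime", "run", "job", "producer")


def proxy(raw: bytes, sink: Callable[[bytes], None], validate: bool = False) -> None:
    """Forward an event body to `sink`, optionally checking it first."""
    if validate:
        event = json.loads(raw)  # raises ValueError on malformed JSON
        missing = [field for field in REQUIRED_FIELDS if field not in event]
        if missing:
            raise ValueError(f"missing fields: {missing}")
    # Pass-through: the original bytes are forwarded untouched.
    sink(raw)


forwarded: List[bytes] = []
good = b'{"eventTime": "2021-11-10T17:00:00Z", "run": {}, "job": {}, "producer": "x"}'
bad = b'{"run": {}}'

proxy(good, forwarded.append, validate=True)
proxy(bad, forwarded.append)  # pass-through accepts anything

try:
    proxy(bad, forwarded.append, validate=True)
    rejected = False
except ValueError:
    rejected = True
```

Pass-through skips parsing entirely, which is why it is faster, but it also means a malformed event reaches the downstream system unchanged.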

OL is the dynamic lineage solution for Egeria
Used Postman for 3rd party
To be released in a few weeks
https://odpi.github.io/egeria-docs/features/lineage-management/overview/#the-openlineage-standard

Proposal for new facets.
RequestFacet => should be a run facet; maps to the run args in Marquez

https://github.com/OpenLineage/OpenLineage/issues/256

Does the last version of a facet win? => yes
Need to document size constraints in OL (name length, ...). TODO: ticket
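
The "last version of a facet wins" answer above can be read as a simple merge keyed by facet name. The sketch below assumes facets arrive as plain dictionaries in event order; `merge_facets` is a hypothetical helper and the `request` facet content is made up for the example.

```python
from typing import Any, Dict, Iterable

Facets = Dict[str, Dict[str, Any]]


def merge_facets(facet_sets: Iterable[Facets]) -> Facets:
    """Combine facets from successive events; later facets replace earlier ones.

    Replacement is per facet name and whole-facet, not a field-by-field merge.
    """
    merged: Facets = {}
    for facets in facet_sets:
        merged.update(facets)
    return merged


first = {"request": {"args": ["--date", "2021-11-09"]}, "parent": {"runId": "a"}}
second = {"request": {"args": ["--date", "2021-11-10"]}}
merged = merge_facets([first, second])
```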

Oct 13th 2021

Attendees:

TSC:
- Michael Collado: Datakin
- Julien Le Dem: OpenLineage Project Lead, Datakin
- Maciej Obuchowski: GetInData, OpenLineage
- Willy Lulciuc: Marquez, OpenLineage
- Mandy Chessell: Egeria Project Lead, working on OpenLineage
And:
- Ross Turk: VP marketing at Datakin, here to talk about the website
- Minkyu Park: interested in contributing to Datakin
- Peter Hicks: Marquez contributor, OpenLineage user

...
