...

  • OpenLineage and Azure Purview [Shrikanth]

...

Nov 10th 2021 (9am PT)

Attendees:

  • TSC
    • Mike Collado: Eng
    • Ryan Blue: Tabular, Apache Iceberg
    • Mandy Chessell: Egeria project lead
    • Maciej Obuchowski: Eng GetInData, OpenLineage contributor
    • Willy Lulciuc: Co-creator of Marquez
    • Julien Le Dem: OpenLineage project lead
  • And:
    • Michael Robinson: dev rel
    • Peter Hicks: Marquez contributor
    • Ross Turk: VP marketing at Datakin
    • John Thomas: Support eng at Datakin
    • Minkyu Park: Dev at Datakin, learning about Marquez and OpenLineage.

Agenda:

  • OL Client use cases for Apache Iceberg [Ryan]
  • Proxy Backend and Egeria integration progress update (Issue #152) [Mandy]
  • OpenLineage last release overview (0.3.1)
    • Facet versioning
    • Airflow 2 / Spark 3 support, dbt improvements
  • OpenLineage 0.4 scope review
    • Proxy Backend (Issue #152)
    • Spark, Airflow, dbt improvements (documentation, coverage, ...)
    • improvements to the OpenLineage model
  • Open discussion

Meeting recording:

Notes:

SPDX tags:
Use SPDX tags for shorter license headers => makes things easier.
https://spdx.org/licenses/
TODO: Mandy will propose something next time
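
For illustration, an SPDX tag replaces the full license text at the top of a file with a single comment line. The sketch below is a hypothetical helper (not project tooling) that looks for such a tag in the first lines of a source file; `Apache-2.0` is used only as an example identifier from the SPDX license list.

```python
import re
from typing import Optional

# Matches the standard SPDX short-form tag inside any comment style.
SPDX_RE = re.compile(r"SPDX-License-Identifier:\s*(\S+)")


def spdx_identifier(source: str, max_lines: int = 5) -> Optional[str]:
    """Return the SPDX identifier found in the first max_lines lines, else None."""
    for line in source.splitlines()[:max_lines]:
        match = SPDX_RE.search(line)
        if match:
            return match.group(1)
    return None


tagged = "# SPDX-License-Identifier: Apache-2.0\nimport os\n"
untagged = "import os\n"
tagged_id = spdx_identifier(tagged)
untagged_id = spdx_identifier(untagged)
```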

Iceberg requirements:

  • Ability for Iceberg to add facets without having to depend on the context it's running in.

  • Avoid relying on Sources exposing facets through the Spark API, as that would be a hard change to get into Spark.

Ryan:

...

  Proposal: a logger-style API.

  • Similar to SLF4J or Dropwizard Metrics => create a logging/metrics object that is independent of the logging backend.

  • Facets can be emitted and the backend can be configured independently whether those facets are picked up or not.

Example: Have an OpenLineage API to add facets in a given context:
create facet for some context: Read datasets x, ... write dataset Y

=> broad agreement on principle

Open Questions:

  • When are facets sent?

    • Do we send them immediately? Do we wait?

    • Iceberg does not create a facet until Spark asks for the splits

      • preference for sending events as they go.

      • Does that fit with the OpenLineage view of the world? => yes

  • Spark, bound to a context thread:

    • loggers depend on thread

    • listener is on different thread

    • Report for a given job run

    • Ryan: the run context is thread-local and sets the execution ID.

      • the "logger backend" can grab the SQL execution ID

  • The client side should be able to choose between sending an event immediately and sending it when it gets a chance.

    • Need to have a guide to defining a facet.

      • Who needs to do this?

  • Michael C.: TODO: Design Doc on logging

  • Willy: Do we need a "RUNNING" event?
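
The threading and timing questions above can be sketched together. This is only an illustration under stated assumptions: the run context is modeled with `threading.local`, and the backend class, its `immediate` flag, and the record layout are all hypothetical. The execution ID is captured on the emitting thread, because a listener running on a different thread cannot see that thread's context.

```python
import threading
from typing import Any, Dict, List, Optional

_ctx = threading.local()  # hypothetical per-thread run context


def set_execution_id(execution_id: Optional[str]) -> None:
    _ctx.execution_id = execution_id


def current_execution_id() -> Optional[str]:
    return getattr(_ctx, "execution_id", None)


class BufferingBackend:
    """Hypothetical backend: sends right away, or buffers until flush()."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.sent: List[Dict[str, Any]] = []
        self._pending: List[Dict[str, Any]] = []

    def emit(self, name: str, facet: Dict[str, Any], immediate: bool = True) -> None:
        # Capture the ID here, on the caller's thread.
        record = {"executionId": current_execution_id(), "name": name, "facet": facet}
        with self._lock:
            (self.sent if immediate else self._pending).append(record)

    def flush(self) -> None:
        with self._lock:
            self.sent.extend(self._pending)
            self._pending.clear()


backend = BufferingBackend()


def job(execution_id: str) -> None:
    set_execution_id(execution_id)
    backend.emit("scan", {"splits": 3}, immediate=False)


threads = [threading.Thread(target=job, args=(f"exec-{i}",)) for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

buffered_before_flush = len(backend.sent)
backend.flush()
```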

Flink:

  • How to handle long-running jobs?

  • [Ryan] [Mandy] Long-running jobs need to be defined

  • TODO: Julien, post a ticket for long-running jobs

There is also a need for an OSS Trino integration; Tabular might contribute.


Proxy Backend update [Mandy]

  • Draft PR #500: thanks Willy for the initial setup.
    Looking for feedback.
    Issues:
    The initial implementation used the provided beans to deserialize, but that didn't quite work (TODO: ticket).
    Instead, events are just passed through: faster, but no validation.
  • Proposal for new facets.
    RequestFacet => should be a run facet; maps to the run args in Marquez
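
To make the pass-through trade-off concrete, here is a small sketch contrasting the two approaches. It is not the code in the draft PR: the `Sink` type, both function names, and the list of checked keys are assumptions for illustration, and the key check is a stand-in rather than the OpenLineage schema.

```python
import json
from typing import Callable, List

Sink = Callable[[bytes], None]  # hypothetical downstream transport

# Illustrative stand-in for real schema validation.
CHECKED_KEYS = ("eventTime", "run", "job")


def forward_passthrough(body: bytes, sink: Sink) -> None:
    """Forward the raw payload untouched: fast, but malformed events get through."""
    sink(body)


def forward_validated(body: bytes, sink: Sink) -> None:
    """Parse first and reject events missing the checked keys, then forward."""
    event = json.loads(body)
    missing = [key for key in CHECKED_KEYS if key not in event]
    if missing:
        raise ValueError(f"event is missing keys: {missing}")
    sink(body)


received: List[bytes] = []
good = json.dumps({"eventTime": "2021-11-10T17:00:00Z", "run": {}, "job": {}}).encode()
bad = b'{"job": {}}'

forward_passthrough(bad, received.append)   # accepted without any check
forward_validated(good, received.append)    # accepted after the check
try:
    forward_validated(bad, received.append)
    rejected = False
except ValueError:
    rejected = True
```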

https://github.com/OpenLineage/OpenLineage/issues/256

Does the last version of a facet win? => yes
Need to document size constraints in OL (name length...) TODO: ticket
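
The "last version of a facet wins" rule can be sketched as a simple merge over the events received for one run, in arrival order. The flat `facets` key used here is a simplification for illustration, not the actual event layout, and `merge_facets` is a hypothetical helper.

```python
from typing import Any, Dict, Iterable


def merge_facets(events: Iterable[Dict[str, Any]]) -> Dict[str, Any]:
    """Merge facets by name across events; a later facet replaces an earlier one."""
    merged: Dict[str, Any] = {}
    for event in events:  # events are assumed to be in arrival order
        merged.update(event.get("facets", {}))
    return merged


events = [
    {"facets": {"request": {"attempt": 1}, "schema": {"fields": ["a"]}}},
    {"facets": {"request": {"attempt": 2}}},
    {},  # an event without facets leaves the merged view unchanged
]
merged = merge_facets(events)
```

Note that the replacement is per facet name: the whole `request` facet is swapped out, while `schema` from the earlier event is kept.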

Oct 13th 2021

Attendees:

  • TSC:
    • Michael Collado: Datakin

    • Julien Le Dem: OpenLineage Project Lead, Datakin

    • Maciej Obuchowski: GetInData, OpenLineage

    • Willy Lulciuc: Marquez, OpenLineage

    • Mandy Chessell: Egeria Project Lead, working on OpenLineage

  • And:
    • Ross Turk: VP marketing at Datakin, here to talk about the website

    • Minkyu Park: interested in contributing to Datakin

    • Peter Hicks: Marquez contributor, OpenLineage user

...