...

All are welcome.

Next meeting: Apr 13th, 2022 (9am PT)

Mar 9th, 2022 (9am PT)

...

Attendees:

  • TSC:
    • Mike Collado: Staff Software Engineer, Datakin
    • Maciej Obuchowski: Software Engineer, GetInData, OpenLineage contributor
    • Julien Le Dem: OpenLineage Project lead
    • Mandy Chessell: Egeria Project Lead
    • Willy Lulciuc: Co-creator of Marquez
  • And:
    • Michael Robinson: Dev Rel Engineer
    • Ross Turk: VP of Marketing, Datakin
    • Minkyu Park: Senior Software Engineer, Datakin
    • Srikanth Venkat: Product Manager, Privacera
    • John Thomas: Support Engineer, Datakin
    • Will Johnson: Senior Cloud Solution Architect, Azure Cloud, Microsoft
    • Paweł Leszczyński: Software Engineer, GetInData
    • Sheeri Cabral: Product Manager, Collibra
    • Michal Bartos: Software Engineer, MANTA
    • Chandru
    • Caroline Fahrenkrog: Product Manager, MANTA Scanners
    • John Montroy: Backend Engineer

Agenda:

  • New committers [Julien]
  • Release overview (0.6.0-0.6.1) [Michael R.] 
  • Process for blog posts [Ross]
  • Retrospective: Spark integration [Willy et al.]
  • Open discussion  

Notes:

  • New committers [Julien]
    • 4 new committers were voted in last week
    • We had fallen behind
    • Congratulations to all
  • Release overview (0.6.0-0.6.1) [Michael R.]
    • Added
      • Extract source code of PythonOperator code similar to SQL facet @mobuchowski (0.6.0)
      • Airflow: extract source code from BashOperator @mobuchowski (0.6.0)
        • These first two additions are similar to SQL facet
        • Offer the ability to see top-level code
      • Add DatasetLifecycleStateDatasetFacet to spec @pawel-big-lebowski (0.6.0)
        • Captures when someone is conducting dataset operations (overwrite, create, etc.)
      • Add generic facet to collect environmental properties (EnvironmentFacet) @harishsune (0.6.0)
        • Collects environment variables
        • Depends on Databricks runtime but can be reused in other environments
      • OpenLineage sensor for OpenLineage-Dagster integration @dalinkim (0.6.0)
        • The first iteration of the Dagster integration to get lineage from Dagster
      • Java-client: make generator generate enums as well @pawel-big-lebowski (0.6.0)
        • Small addition to the Java client that provides better types (enums where strings were used before)
    • Fixed
      • Airflow: increase import timeout in tests, fix exit from integration @mobuchowski (0.6.0)
        • The former was a particular issue with the Great Expectations integration
      • Reduce logging level for import errors to info @rossturk (0.6.0)
        • Airflow users were seeing warnings about missing packages if they weren't using a part of an integration
        • This fix reduced the level to Info
      • Remove AWS secret keys and extraneous Snowflake parameters from connection URI @collado-mike (0.6.0)
        • Parses Snowflake connection URIs to exclude some parameters that broke lineage or posed security concerns (e.g., login data)
        • Some keys are Snowflake-specific, but more can be added from other data sources
      • Convert to LifecycleStateChangeDatasetFacet @pawel-big-lebowski (0.6.0)
        • Mandates the LifecycleStateChange facet from the global spec rather than the custom tableStateChange facet used in the past
      • Catch possible failures when emitting events and log them @mobuchowski (0.6.1)
        • Previously when an OL event failed to emit, this could break an integration
        • This fix catches possible failures and logs them
  • Process for blog posts [Ross]
    • Moving the process to GitHub Issues
    • Follow the release tracker there
    • Go to https://github.com/OpenLineage/website/tree/main/contents/blog to create posts
    • No one will have a monopoly
    • Proposals for blog posts are also welcome, and we can support your efforts with outlines and feedback
    • Throw your ideas on the issue tracker on GitHub
  • Retrospective: Spark integration [Willy et al.]
    • Willy: originally this was part of Marquez – the inspiration behind OL
      • OL was prototyped in Marquez with a few integrations, one of which was Spark (the other: Airflow)
      • Donated the integration to OL
    • Srikanth: #559 very helpful to Azure
    • Pawel: is anything missing from the Spark integration? E.g., column-level lineage?
    • Will: yes to column-level; also, Delta tables are an issue due to complexity; Spark 3.2 support also welcome
    • Maciej: we should be more active about tracking projects we have integrations with; add to test matrix
    • Julien: let’s open some issues to address these
  • Open Discussion
    • Flink updates? [Julien]
      • Maciej: initial exploration is done
        • challenge: Flink has 4 APIs
        • prioritizing Kafka lineage currently because most jobs are writing to/from Kafka
        • track this on GitHub milestones, contribute, ask questions there
      • Will: can you share thoughts on the data model? How would this show up in MZ? How often are you emitting lineage?
      • Maciej: trying to model entire Flink run as one event
      • Srikanth: proposed two separate streams, one for data updates and one for metadata
      • Julien: do we have an issue on this topic in the repo?
      • Michael C.: only a general proposal doc, not one on the overall strategy; this is worth a proposal doc
      • Julien: see notes for ticket number; MC will create the ticket
      • Srikanth: we can collaborate offline
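
The connection-URI fix mentioned in the release overview (removing secret keys and extraneous parameters before a URI is recorded in lineage) can be illustrated with a minimal sketch. This is not the integration's actual code: the function name `sanitize_uri`, the deny-list `SENSITIVE_PARAMS`, and the decision to also drop `user:password@` from the host part are all assumptions made for illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical deny-list; the parameters the real integration strips may differ.
SENSITIVE_PARAMS = {"aws_access_key_id", "aws_secret_access_key", "password"}


def sanitize_uri(uri: str) -> str:
    """Return the URI without login data or deny-listed query parameters."""
    parts = urlsplit(uri)
    # Assumption: drop "user:password@" and keep only host (and port).
    netloc = parts.netloc.rsplit("@", 1)[-1]
    # Keep every query parameter whose name is not on the deny-list.
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SENSITIVE_PARAMS
    ]
    return urlunsplit((parts.scheme, netloc, parts.path, urlencode(kept), ""))
```

Keeping the deny-list as a plain set reflects the note that more keys can be added for other data sources without changing the parsing logic.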

Feb 9th, 2022 (9am PT)

Attendees:

...

  • And:
    • Michael Robinson: Dev Rel Engineer
    • Ross Turk: VP of Marketing, Datakin
    • Minkyu Park: Senior Software Engineer, Datakin
    • Srikanth Venkat: Product Manager, Privacera
    • John Thomas: Support Engineer, Datakin
    • Peter Scharling: EI Group
    • Peter Hicks: Senior Software Engineer, Datakin
    • Dalin Kim: Data Engineer, Northwestern Mutual
    • Kevin Mellott: Data Engineer, Northwestern Mutual
    • Will Johnson: Senior Cloud Solution Architect, Azure Cloud, Microsoft
    • Kelsy Brennan: EI Group
    • Aaron Colcord: Data Engineer, Northwestern Mutual

Agenda:

  • OpenLineage recent release overview (0.5.1) [Julien]
  • TaskInstanceListener now official way to integrate with Airflow [Julien]
  • Apache Flink integration [Julien]
  • Dagster integration demo [Dalin]
  • Open Discussion

Notes:

  • OpenLineage recent release overview (0.5.1) [Julien]
    • No 0.5.0 due to bug
    • Support for dbt-spark adapter
    • New backend to proxy OL events
    • Support for custom facets
  • TaskInstanceListener now official way to integrate with Airflow [Julien]
    • Integration runs on worker side
    • Will be available with the next release of Airflow (2.3)
    • Thanks to Maciej for his work on this
  • Apache Flink integration [Julien]
    • Ticket for discussion available
    • Integration test setup
    • Early stages
  • Dagster integration demo [Dalin]
    • Initiated by Dalin Kim
    • OL used with Dagster on orchestration layer
    • Utilizes Dagster sensor
    • Introduces OL sensor that can be added to Dagster repo definition
    • Uses cursor to keep track of ID
    • Looking for feedback after review complete
    • Discussion:
      • Dalin: needed: way to interpret Dagster asset for OL
      • Julien: common code from Great Expectations/Dagster integrations
      • Michael C: do you pass parent run ID in child job when sending the job to MZ?
      • Hierarchy can be extended indefinitely – parent/child relationship can be modeled
      • Maciej: the sensor kept failing – does this mean the events persisted despite being down?
      • Dalin: yes - the sensor’s cursor is tracked, so even if repo goes down it should be able to pick up from last cursor
      • Dalin: hoping for more feedback
      • Julien: slides will be posted on the Slack channel, along with tickets
  • Open discussion
    • Will: how is OL ensuring consistency of datasets across integrations? 
    • Julien: (jokingly) Read the docs! Naming conventions for datasets can be found there
    • Julien: need for tutorial on creating integrations
    • Srikanth: have done some of this work in Atlas
    • Kevin: are there libraries on the horizon to play this role? (Julien: yes)
    • Srikanth: it would be good to have a model spec to provide an enforceable standard
    • Julien: agreed; currently models are based on the JSON schema spec
    • Julien: contributions welcome; opening a ticket about this makes sense
    • Will: Flink integration: MZ focused on batch jobs
    • Julien: we want to determine whether we need to add checkpointing
    • Julien: there will be discussion in the OL and MZ communities about this
      • In MZ, there are questions about what counts as a version or not
    • Julien: a consistent model is needed
    • Julien: one solution being looked into is Arrow
    • Julien: everyone should feel welcome to propose agenda items (even old projects)
    • Srikanth: who are you working with on the Flink comms side? Will get back to you.
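
The cursor behavior described in the Dagster demo notes (the sensor tracks a cursor, so it can pick up from the last position after downtime) can be sketched in plain Python. This does not use the Dagster or OpenLineage APIs; the `poll_once` function, the `(id, payload)` event shape, and the `emit` callback are hypothetical names chosen only to show the idea.

```python
from typing import Callable, Iterable, Optional, Tuple

# Hypothetical event shape: (monotonically increasing ID, payload).
Event = Tuple[int, str]


def poll_once(
    events: Iterable[Event],
    cursor: Optional[int],
    emit: Callable[[str], None],
) -> Optional[int]:
    """Emit every event newer than the cursor and return the advanced cursor."""
    for event_id, payload in sorted(events):
        if cursor is not None and event_id <= cursor:
            continue  # already emitted before the last stop
        emit(payload)
        cursor = event_id  # advance only after emit returns without raising
    return cursor
```

For example, if a first call handles events 1 and 2 and returns cursor 2, a later call over the full log with that cursor emits only event 3, which is the resume-after-downtime behavior Dalin described.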

...