...

  1. Announcements [Julien]:
    1. Our first annual ecosystem survey is live and accepting responses: https://bit.ly/ecosystem_survey. Your participation matters!
    2. We recently published the first issue of our monthly newsletter: https://mailchi.mp/18826f97904e/openlineage-news-may-2023. It's a great way to learn about upcoming meetups, recent blog posts, and more.
    3. Two meetups are happening soon:
      1. New York on 6/22 at Collibra's HQ: https://www.meetup.com/data-lineage-meetup/events/294065396/
      2. San Francisco on 6/27 at Astronomer: https://www.meetup.com/meetup-group-bnfqymxe/events/293448130/
    4. Upcoming talks:
      1. Paweł Leszczyński and Maciej Obuchowski, “Column Lineage is Coming to the Rescue,” Berlin Buzzwords, June 18-20, 2023
      2. Julien Le Dem and Willy Lulciuc, “Cross-platform Data Lineage with OpenLineage,” Data+AI Summit, June 28-29, 2023
      3. Maciej Obuchowski, “OpenLineage in Airflow: A Comprehensive Guide,” Airflow Summit, September 19-21, 2023
  2. Recent releases [Michael R.]:
    1. OpenLineage 0.25.0
    2. OpenLineage 0.26.0
    3. OpenLineage 0.27.1
    4. OpenLineage 0.27.2
  3. Static Lineage Progress Update [Paweł]:
    1. Overview
      1. Up to this point, operational/runtime metadata has been the focus of OpenLineage
      2. But there is also a need for lineage metadata about datasets not associated with runs
      3. To address this, a proposal has been created
        1. It answers the question: how can we add new data types to support static lineage?
        2. We decided to add two new types:
          1. job event
          2. dataset event
        3. A schemaURL provides a distinguishing mechanism
        4. Generic client code will not be affected
    2. Demo
      1. Approach taken: serialize and deserialize without modifying the database
    3. Conclusion
      1. This approach adds new event types without breaking existing usage scenarios
      2. Changes will be implemented in the clients and the spec
    4. Q&A
      1. Initial work on Marquez to support static lineage has also been completed (adding the capability to distinguish between the event types), but Marquez is not currently able to store static lineage metadata
      2. Ability to convert from static to dynamic anticipated?
        1. Formats not very different
        2. A job event is a subtype of a run event, making it easy to extract the data you care about
        3. Marquez UI should not change
      3. Ownership change notification possible?
        1. This data is accessible via the REST API, but notifications are not currently built in
        2. Contribution of such a feature would be welcome
        3. Alternative solution: add a listener
      4. Job events are static but not dataset events?
        1. Both are static events
  4. Discussion items
    1. Marquez search – how robust?
      1. Recommended: visit the GitHub repo and use GitPod to try it out (or use the up.sh script in the docker directory there to deploy locally)
        1. Tags are accessible in some facets in the UI, which would provide one way to search
    2. Row-based lineage – are there any facets or models that would help with this use case?
      1. We are trying to keep the metadata store smaller than the data itself
      2. Row-level lineage could be captured in a data model, which would be accessible in Marquez
      3. Challenge: the volume of data
      4. It might be helpful to have a doc about solutions for this in the project
    3. Another good forum for asking questions: https://bit.ly/OLslack
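
The static lineage update above notes that a schemaURL distinguishes the new job and dataset events from run events. A minimal sketch of how a consumer might branch on that field is shown below. It assumes the schemaURL ends in a fragment naming the event definition (for example `#/definitions/JobEvent`); the URL, the fragment layout, and the helper names `classify_event` and `is_static` are illustrative assumptions, not taken from the proposal or any OpenLineage client.

```python
import json
from urllib.parse import urlparse

# Assumption: the last path segment of the schemaURL fragment names the event type.
KNOWN_EVENT_TYPES = {"RunEvent", "JobEvent", "DatasetEvent"}
# Per the Q&A above, both job events and dataset events are static.
STATIC_EVENT_TYPES = {"JobEvent", "DatasetEvent"}


def classify_event(raw: str) -> str:
    """Return the event type named in the schemaURL fragment of a JSON event."""
    event = json.loads(raw)
    fragment = urlparse(event.get("schemaURL", "")).fragment
    name = fragment.rsplit("/", 1)[-1]
    if name not in KNOWN_EVENT_TYPES:
        raise ValueError(f"unrecognized schemaURL: {event.get('schemaURL')!r}")
    return name


def is_static(raw: str) -> bool:
    """True when the event describes static (non-run) lineage."""
    return classify_event(raw) in STATIC_EVENT_TYPES


# Hypothetical payload; the URL and job fields are placeholders.
example = json.dumps({
    "schemaURL": "https://example.org/spec/OpenLineage.json#/definitions/JobEvent",
    "job": {"namespace": "demo", "name": "nightly_load"},
})
print(classify_event(example), is_static(example))
```

Because only the schemaURL is inspected, code that treats events generically can leave the rest of the payload untouched, which matches the point that generic client code is not affected.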

May 11, 2023 (10am PT)

Attendees:

...