
The OpenLineage Technical Steering Committee meets monthly on the second Wednesday, 9:00am to 10:00am US Pacific. Link to join the meeting

All are welcome.

Next meeting: Nov 10th 2021 (9am PT)

Oct 13th 2021

Attendees:

  • TSC:
    • Michael Collado: Datakin

    • Julien Le Dem: OpenLineage Project Lead, Datakin

    • Maciej Obuchowski: GetInData, OpenLineage

    • Willy Lulciuc: Marquez, OpenLineage

    • Mandy Chessell: Egeria Project Lead, working on OpenLineage

  • And:
    • Ross Turk: VP of marketing at Datakin, here to talk about the website

    • Minkyu Park: interested in contributing to Datakin

    • Peter Hicks: Marquez contributor, OpenLineage user

  • Meeting recording:
  • Notes:
    • OpenLineage website: https://openlineage.io/
      • Gatsby-based (markdown) in the OpenLineage/website repo
      • generates a static site hosted on GitHub Pages: OpenLineage/OpenLineage.github.io
      • deployment is currently manual. Automation in progress
      • Please open PRs on /website to contribute blog posts.
        • Getting started with Egeria?
      • Suggestions:
        • Add page on open governance and how to join the project.
        • Add LF AI & Data banner to the website?
        • Egeria is using MkDocs: its documentation is very nice to navigate.
    • upcoming 0.3.0:
      • Facet versioning:
        • each facet schema is versioned individually.
        • client/server code generation to facilitate producing/consuming OpenLineage events
      • Spark 3.x support
      • new mechanism for Airflow 2.x
        • working with an Airflow maintainer to improve that.
    • Proxy Backend update (planned for OL 0.4.0):
      • mapping to Egeria backend
      • planning to release for the Egeria webinar on the 8th of November
      • Willy provided a base module for ProxyBackend
    • Monthly release is a good cadence
    • Open discussions:

      • Azure Purview team hackathon ongoing to consume OpenLineage events

      • Design docs discussion:

        • proposal to add a design doc for each proposal.

        • goal:

          • Similar to the process of projects like Kafka, Flink: for specs and bigger features

          • not for bug fixes.

        • options:

          • proposal directory for docs as markdown

          • Open PRs against wiki pages: proposals wiki.

        • Manage status:

          • list of designs that are implemented vs pending.

          • table of open proposals.

        • vote for prioritization:

          • Every proposal design doc has an issue opened that links back to it.

        • a design doc is a good start for a blog post about that feature

      • New committee on data ops: Mandy will be speaking about Egeria and OpenLineage

Sept 8th 2021

  • Attendees: 
    • TSC:
      • Mandy Chessell: Egeria Lead. Integrating OpenLineage in Egeria

      • Michael Collado: Datakin, OpenLineage

      • Maciej Obuchowski: GetInData. OpenLineage integrations
      • Willy Lulciuc: Marquez co-creator.
      • Ryan Blue: Tabular, Iceberg. Interested in collecting lineage across Iceberg users with OpenLineage
    • And:
      • Venkatesh Tadinada: BMC, workflow automation. Looking to integrate with Marquez
      • Minkyu Park: Datakin. learning about OpenLineage
      • Arthur Wiedmer: Apple, lineage for Siri and AI/ML. Interested in implementing Marquez and OpenLineage
  • Meeting recording:
  • Meeting notes:
    • agenda: 
      • Update on OpenLineage latest release (0.2.1)

        • dbt integration demo

      • OpenLineage 0.3 scope discussion

        • Facet versioning mechanism (Issue #153)

        • OpenLineage Proxy Backend (Issue #152)

        • OpenLineage implementer test data and validation

        • Kafka client

      • Roadmap

        • Iceberg integration
      • Open discussion

    • Slides 

    • Discussions:
      • added a discussion of Iceberg requirements for OpenLineage to the agenda.

    • Demo of dbt:

      • really easy to try

      • when running from Airflow, we can use the wrapper 'dbt-ol run' instead of 'dbt run'

    • Presentation of Proxy Backend design:

      • summary of discussions in Egeria
        • Egeria is less interested in instances (runs) and will keep track of OpenLineage events separately as Operational lineage

        • Two ways to use Egeria with OpenLineage

          • receives HTTP events and forwards to Kafka

          • A consumer receives the Kafka events in Egeria

      • Proxy Backend in OpenLineage:

        • direct HTTP endpoint implementation in Egeria

      • Depending on the user they might pick one or the other and we'll document both:

        • Use a direct OpenLineage endpoint (like Marquez)

        • Deploy the Proxy Backend to write to a queue (ex: Kafka)

      • Follow up items:

        • The transport abstraction (Backend interface) could be usable directly from the client or from the Proxy Backend. The user can decide if they want the intermediate proxy. See #269

        • We should add a distribution client symmetric to the Proxy Backend. It reads from Kafka and sends events to an OpenLineage HTTP endpoint. Marquez would use it, for example, to consume OpenLineage events produced by Egeria. See #270
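
A minimal sketch of the transport abstraction described in the follow-up items. All names here (`Backend`, `HttpBackend`, `QueueBackend`, `ProxyBackend`, the topic name and the URL) are hypothetical and are not the actual OpenLineage client API. The HTTP and queue transports are simulated with in-memory lists so the example runs on its own.

```python
from typing import Protocol


class Backend(Protocol):
    """Hypothetical transport interface: anything that can accept one event."""

    def emit(self, event: dict) -> None:
        ...


class HttpBackend:
    """Stand-in for an OpenLineage HTTP endpoint; records events instead of posting."""

    def __init__(self, url: str) -> None:
        self.url = url
        self.received = []

    def emit(self, event: dict) -> None:
        self.received.append(event)


class QueueBackend:
    """Stand-in for a queue such as Kafka; records events instead of producing."""

    def __init__(self, topic: str) -> None:
        self.topic = topic
        self.messages = []

    def emit(self, event: dict) -> None:
        self.messages.append(event)


class ProxyBackend:
    """Optional intermediate hop: accepts events and forwards them to a Backend."""

    def __init__(self, target: Backend) -> None:
        self.target = target

    def emit(self, event: dict) -> None:
        self.target.emit(event)


def run_job(backend: Backend) -> None:
    """A client emits the same way whether or not a proxy sits in between."""
    backend.emit({"eventType": "START", "job": "example"})


queue = QueueBackend(topic="openlineage-events")   # hypothetical topic name
direct = HttpBackend("http://example.invalid/lineage")  # placeholder URL

run_job(direct)                # client posts to an endpoint directly
run_job(queue)                 # client writes to the queue directly
run_job(ProxyBackend(queue))   # client goes through the proxy instead
```

Because the client only depends on `emit`, choosing between a direct endpoint and the intermediate proxy becomes a configuration decision rather than a code change.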

  • Iceberg integration:
    • presentation of Iceberg model

      • Manifest and manifest list: 2-level tree structure tracking data files.

      • root metadata version file. Points to manifest list (It knows all of the previous versions of the dataset that we want to keep)

    • Iceberg collects various metadata about the scans and data being produced and wants to expose it through OpenLineage. It can already expose metadata, but there is no listener yet.

      • Ryan: added the metadata list presented to the Iceberg ticket: See #167

Aug 11th 2021

  • Attendees: 
    • TSC:
      • Ryan Blue

      • Maciej Obuchowski

      • Michael Collado

      • Daniel Henneberger

      • Willy Lulciuc

      • Mandy Chessell

      • Julien Le Dem

    • And:
      • Peter Hicks

      • Minkyu Park

      • Daniel Avancini

  • Meeting recording:
  • Meeting notes:
    • Agenda:
      • Coming in OpenLineage 0.1
        • OpenLineage spec versioning
        • Clients
        • Marquez integrations imported in OpenLineage
          • Apache Airflow:
            • BigQuery 
            • Postgres
            • Snowflake
            • Redshift
            • Great Expectations
          • Apache Spark
          • dbt
      • OpenLineage 0.2 scope discussion
      • Roadmap
      • Open discussion
    • Slides: https://docs.google.com/presentation/d/1Lxp2NB9xk8sTXOnT0_gTXicKX5FsktWa/edit#slide=id.ge80fbcb367_0_14
    • Notes: 
      • OpenLineage 0.1 is being published 
      • Coming in OpenLineage 0.1
        • OpenLineage spec versioning
        • Clients (Java, Python)
        • Marquez integrations imported in OpenLineage
          • Apache Airflow:
            • BigQuery 
            • Postgres
            • Snowflake
            • Redshift
            • Great Expectations
          • Apache Spark
          • dbt
          • Question: How is Airflow capturing OpenLineage events?
            • openlineage-airflow installed on the Airflow instance
            • adapters per operator
      • OpenLineage 0.2 scope discussion
        • Facet versioning mechanism (Issue #153)
        • OpenLineage Proxy Backend (Issue #152)
          • Questions:
            • What is the advantage of the proxy backend?
              • The consumer does not need to implement an endpoint and can consume from Kafka
              • can configure what to do with events independently of various integrations
              • first step to having a routing mechanism:
                • to send events to multiple consumers
                • to have rule-based routing
                • to enable archiving the events in addition to sending them
            • Is it included in OpenLineage?
              • Yes (Otherwise it would have to be in Egeria)
            • Does it include error management or retry policy? What if the proxy dies? Do we care about durability?
              • Yes we care about durability 
              • first implementation to be synchronous. single transaction to Kafka per event.
              • future might be configurable to adjust depending on context (guaranteed delivery vs performance batching)
            • What technology should we use?
              • Proposed: Java + Spring Boot (like Egeria)
              • discussion to use Java + Dropwizard like Marquez
              • general consensus on using Java. (framework TBD)
              • In the future, might have a Go implementation to enable lightweight sidecar pattern
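
The routing ideas raised in the questions above (multiple consumers, rule-based routing, archiving) can be sketched as follows. This is an illustrative assumption, not the Proxy Backend design: the `Router` class and its methods are hypothetical, and Python lists stand in for Kafka topics and an archive. Delivery is synchronous, one event at a time, mirroring the first implementation described in the notes.

```python
from typing import Callable, Dict, List, Tuple

Event = Dict[str, object]
Sink = Callable[[Event], None]   # where an event is delivered
Rule = Callable[[Event], bool]   # decides whether a sink gets the event


class Router:
    """Hypothetical synchronous router: delivery finishes before route() returns."""

    def __init__(self) -> None:
        self._routes: List[Tuple[Rule, Sink]] = []

    def add_route(self, rule: Rule, sink: Sink) -> None:
        self._routes.append((rule, sink))

    def route(self, event: Event) -> int:
        """Send the event to every sink whose rule matches; return the sink count."""
        delivered = 0
        for rule, sink in self._routes:
            if rule(event):
                sink(event)
                delivered += 1
        return delivered


archive: List[Event] = []          # stand-in for an archive store
spark_consumer: List[Event] = []   # stand-in for one downstream consumer

router = Router()
router.add_route(lambda e: True, archive.append)  # archive everything
router.add_route(lambda e: e.get("producer") == "spark", spark_consumer.append)

router.route({"producer": "spark", "eventType": "COMPLETE"})
router.route({"producer": "airflow", "eventType": "START"})
```

Keeping rules and sinks as plain callables means routing can be configured independently of the integrations that produce the events.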
        • Kafka client
      • Roadmap
      • Open discussion
        • How do we define extension points for integrations? For example, hooks in Spark and Airflow for the user to add adapters/facets without having to modify OL.

          • TODO: create a ticket to track this
        • Apache Iceberg interest in OpenLineage:
          • Would want to add additional notifications
            • how many files read or written
            • How long a commit took.
            • How many attempts to commit were needed?
          • TODO: create ticket to enable Iceberg facets to be added to OpenLineage events
          • Iceberg needs to send events independently of where the library is used. (example: plain Java process or other)
          • TODO: ticket for PrestoDB/Trino integrations
        • Egeria has a weekly community call
          • September 1st will be about OpenLineage
          • Also an upcoming webinar

July 14th 2021

  • Attendees: 
    • TSC:
      • Julien Le Dem
      • Mandy Chessell
      • Michael Collado
      • Willy Lulciuc
  • Meeting recording:
  • Meeting notes:
    • Agenda:
    • Notes: 

      Mission statement:

      Spec versioning mechanism:

      • The goal is to commit to compatible changes once 0.1 is published

      • We need a follow up to separate core facet versioning

        • => TODO: create a separate GitHub ticket.

      • The lineage event should have a field that identifies what version of the spec it was produced with

        • => TODO: create a GitHub issue for this

      • TODO: Add issue to document version number semantics (SCHEMAVER)
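
To make the version-field idea concrete, here is a minimal sketch, not the project's actual design. It assumes a hypothetical `specVersion` field on the event and SchemaVer-style `MODEL-REVISION-ADDITION` numbering, with the assumed rule that a consumer accepts any event whose MODEL matches its own. Both the field name and the compatibility rule are illustrative.

```python
from typing import NamedTuple


class SchemaVer(NamedTuple):
    model: int
    revision: int
    addition: int


def parse_schemaver(text: str) -> SchemaVer:
    """Parse a 'MODEL-REVISION-ADDITION' string such as '1-0-2'."""
    parts = text.split("-")
    if len(parts) != 3 or not all(p.isdecimal() for p in parts):
        raise ValueError(f"not a SchemaVer string: {text!r}")
    return SchemaVer(*(int(p) for p in parts))


def consumer_accepts(event: dict, supported: str) -> bool:
    """Assumed rule: accept events whose MODEL matches the consumer's MODEL."""
    produced = parse_schemaver(event["specVersion"])  # hypothetical field name
    return produced.model == parse_schemaver(supported).model


event = {"eventType": "START", "specVersion": "1-0-2"}
same_model = consumer_accepts(event, "1-1-0")
other_model = consumer_accepts(event, "2-0-0")
```

With a version carried on every event, a consumer can decide up front whether it understands the payload instead of failing partway through parsing.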

      Extend Event State notion:

      OpenLineage 0.1:

      • finalize a few spec details for 0.1: a few items left to discuss.

        • In particular job naming

        • parent job model

      • Importing Marquez integrations in OpenLineage

      Open Discussion:

      • connecting the consumer and producer

        • TODO: ticket to track distribution mechanism

        • options:

          • Would we need a consumption client to make it easy for consumers to get events from Kafka for example?

          • OpenLineage provides client libraries to serialize/deserialize events as well as sending them.

        • We can have documentation on how to send to backends that are not Marquez using HTTP and existing gateway mechanism to queues.

        • Do we have a mutual third party, or does the client know where to send?

      • Source code location finalization

      • job naming convention

        • you don't always have a nested execution

          • can call a parent

        • parent job

        • You can have a job calling another one.

        • always distinguish a job and its run

      • need a separate notion for job dependencies

      • need to capture event driven: TODO: create ticket.


      TODO(Julien): update job naming ticket to have the discussion.

June 9th 2021

  • Attendees: 
    • TSC:
      • Julien Le Dem: Marquez, Datakin
      • Drew Banin: dbt, CPO at Fishtown Analytics
      • Maciej Obuchowski: Marquez, GetInData consulting company
      • Zhamak Dehghani: Datamesh, Open protocol of observability for data ecosystem is a big piece of Datamesh
      • Daniel Henneberger: building a database, interested in lineage
      • Mandy Chessell: Lead of Egeria, metadata exchange. lineage is a great extension that volunteers lineage
      • Willy Lulciuc: co-creator of Marquez
      • Michael Collado: Datakin, OpenLineage end-to-end holistic approach.
    • And:
      • Kedar Rajwade: consulting on distributed systems.
      • Barr Yaron: dbt, PM at Fishtown Analytics on metadata.
      • Victor Shafran: co-founder at databand.ai, a pipeline monitoring company. Lineage is a common issue
    • Excused: Ryan Blue, James Campbell
  • Meeting recording:
  • Meeting notes:

    Agenda:

    • project communication

    • Technical charter review

    • medium term roadmap discussion

    Notes:

    • project communication

      • GitHub: for specs, designs, reviews and building consensus (issues and PRs)

      • email: for announcements, notes, etc.

      • Slack: transient discussions, does not maintain history. Any decision making or notes should go to a persistent medium (email and GitHub)

      • monthly meeting: recorded, notes and recording published on the wiki

    • Technical Charter review:

      • TODO: Finalize the mission statement. TSC members to comment in the doc.

