The OpenLineage Technical Steering Committee meets monthly on the second Wednesday, from 9:00am to 10:00am US Pacific. Link to join the meeting

All are welcome.

Next meeting: May 11th, 2022 (9am PT)

Apr 13th, 2022 (9am PT)

Attendees:

  • TSC:
    • Maciej Obuchowski: Software Engineer, GetInData, OpenLineage contributor
    • Julien Le Dem: OpenLineage Project lead
    • Mandy Chessell: Egeria Project Lead
    • Willy Lulciuc: Co-creator of Marquez
  • And:
    • Sheeri Cabral: Technical Product Manager, Lineage, Collibra
    • Michael Robinson: Software Engineer, Developer Relations, Astronomer
    • John Thomas: Support Engineer, Astronomer
    • Ross Turk: Senior Director of Community, Astronomer
    • Minkyu Park: Senior Software Engineer, Astronomer
    • Ernie Ostic: SVP of Product, Manta
    • Kelsy Brennan: Lead Developer, Environmental Intelligence Group
    • Dalin Kim: Data Engineer, Northwestern Mutual
    • Will Johnson: Microsoft, OL contributor
    • Jorge
    • Jakub Moravec: Software Architect, Manta
    • Chandru Sugunan: Product Manager, Azure Cloud, Microsoft

Agenda:

  • 0.6.2 release overview [Michael R.]
  • Transports in OpenLineage clients [Maciej]
  • Airflow integration update [Maciej]
  • Dagster integration retrospective [Dalin]
  • Open discussion

Notes:

  • Introductions
  • Communication channels overview [Julien]
  • Agenda overview [Julien]
  • 0.6.2 release overview [Michael R.]

Added

    • CI: add integration tests for Airflow's SnowflakeOperator and dbt-Snowflake @mobuchowski
      • #611
      • Workaround necessitated by the fact we have only 1 schema in the Snowflake db
      • This creates conflicts between different Airflow versions
      • By contrast: in BigQuery, different schemas are prefixed with Airflow versions
    • Introduce DatasetVersion facet in spec @pawel-big-lebowski
      • #580
      • Problem: the spec did not support dataset versioning (which is needed for providers like Iceberg, Delta)
      • Solution: this change introduced a DatasetVersionFacet in spec
    • Airflow: add external query ID facet @mobuchowski
      • #546
      • Issue: jobs that ran on external systems like BigQuery or Snowflake were identified by their query IDs.
      • This change added a facet that exposes this collected query ID, so that an OpenLineage job run can be associated with that external job.
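
As a rough illustration of the two facets added in this release, the sketch below builds a run event as a plain Python dictionary. The facet keys (`version`, `externalQuery`) and field names (`datasetVersion`, `externalQueryId`, `source`) are assumptions based on the OpenLineage spec and should be checked against the published facet schemas. The producer URI, namespace, job, dataset, and ID values are made up, and required bookkeeping fields such as `_schemaURL` are omitted for brevity.

```python
import json
import uuid
from datetime import datetime, timezone

# Hypothetical producer URI; a real integration would point to its own repository.
PRODUCER = "https://example.com/hypothetical-producer"


def build_run_event(query_id: str, dataset_version: str) -> dict:
    """Build a COMPLETE run event carrying both facets (a sketch, not the official client)."""
    return {
        "eventType": "COMPLETE",
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "producer": PRODUCER,
        "run": {
            "runId": str(uuid.uuid4()),
            "facets": {
                # Run facet linking the OpenLineage run to the job on the external system.
                "externalQuery": {
                    "_producer": PRODUCER,
                    "externalQueryId": query_id,
                    "source": "snowflake",
                }
            },
        },
        "job": {"namespace": "example_namespace", "name": "example_job"},
        "inputs": [],
        "outputs": [
            {
                "namespace": "example_namespace",
                "name": "example_db.example_table",
                "facets": {
                    # Dataset facet recording which version of the dataset was written.
                    "version": {"_producer": PRODUCER, "datasetVersion": dataset_version}
                },
            }
        ],
    }


event = build_run_event(query_id="01a2b3c4", dataset_version="42")
print(json.dumps(event, indent=2))
```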

Fixed

    • Complete Fix of Snowflake Extractor get_hook() Bug @denimalpaca
      • #589
      • In #507, an incorrect fix was made to the Snowflake Extractor to allow for the operator's new get_db_hook() method.
      • Solution: this change checks for the existence of the get_db_hook() method in the underlying Operator; get_hook() then calls whichever version of the method is available
    • Update artwork @rossturk
      • #605
      • This change updated artwork in the README.md with the latest versions from recent presentations and other sources.
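
The get_hook() fix described above amounts to a feature check with a fallback. The following is a minimal sketch of that pattern, not the actual extractor code: `LegacyOperator`, `NewOperator`, and `ExtractorSketch` are hypothetical stand-ins, and the returned strings take the place of real hook objects.

```python
class LegacyOperator:
    """Stand-in for an operator that only exposes get_hook()."""

    def get_hook(self):
        return "legacy-hook"


class NewOperator:
    """Stand-in for an operator that exposes the newer get_db_hook()."""

    def get_db_hook(self):
        return "db-hook"


class ExtractorSketch:
    """Hypothetical extractor that works with either operator flavor."""

    def __init__(self, operator):
        self.operator = operator

    def get_hook(self):
        # Prefer the newer method when the operator has it, otherwise fall back.
        if hasattr(self.operator, "get_db_hook"):
            return self.operator.get_db_hook()
        return self.operator.get_hook()


print(ExtractorSketch(LegacyOperator()).get_hook())
print(ExtractorSketch(NewOperator()).get_hook())
```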


  • Transports in OpenLineage clients [Maciej]
    • Currently, OL clients can only read HTTP data
    • Common request: ability to read Kafka
    • This feature will offer a language-independent solution
    • Status: Python client implementation merged, Java implementation close to being merged
    • Timeline: next release (0.7.0)
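
One way to picture the transport idea: the client depends only on a small interface, and HTTP or Kafka become interchangeable implementations. The sketch below is an assumption about the general shape, not the merged Python client. All class names are hypothetical, and both transports record payloads in memory instead of contacting a real server or broker.

```python
import json
from typing import List, Protocol


class Transport(Protocol):
    """Anything that can deliver a serialized event somewhere."""

    def emit(self, event: dict) -> None: ...


class InMemoryHttpTransport:
    """Stands in for an HTTP transport; records the bodies it would POST."""

    def __init__(self, url: str) -> None:
        self.url = url
        self.sent: List[str] = []

    def emit(self, event: dict) -> None:
        self.sent.append(json.dumps(event))


class InMemoryKafkaTransport:
    """Stands in for a Kafka transport; records the messages it would produce."""

    def __init__(self, topic: str) -> None:
        self.topic = topic
        self.messages: List[bytes] = []

    def emit(self, event: dict) -> None:
        self.messages.append(json.dumps(event).encode("utf-8"))


class ClientSketch:
    """Hypothetical client: it only knows about the Transport interface."""

    def __init__(self, transport: Transport) -> None:
        self.transport = transport

    def emit(self, event: dict) -> None:
        self.transport.emit(event)


kafka = InMemoryKafkaTransport(topic="lineage-events")
ClientSketch(kafka).emit({"eventType": "START"})
http = InMemoryHttpTransport(url="http://localhost:5000")
ClientSketch(http).emit({"eventType": "START"})
```

Swapping the destination then means constructing the client with a different transport, with no change to the code that builds events.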
  • Airflow integration [Maciej]
    • TaskInstance listener-based plugin not ready yet
    • Status: waiting for Airflow 2.3 to be merged (due by April 18, 2022)
    • Ready upon Airflow 2.3 release
    • New SQL parser
      • Used in Snowflake, Postgres, GE integrations
      • Missing: API for SQL queries
      • Formerly had a SQL parser but based on guesswork and fragile reliance on language patterns
      • Solution: AST (abstract syntax trees), not guesswork
      • Features strong typing, Enums, encapsulation
      • Language: Rust
        • Disadvantages: additional language, distribution
        • Advantages: high-quality libraries, possible new applications, e.g. Spark
      • Unified API: previous implementation still exists for users of older architectures
      • Utilizable in Java
      • Makes all tasks using SQL easier
      • Will J.: can I inject a different SQL parser that I want to use?
        • Unified API would make this possible
        • Goal is to work with different dialects, implementations
  • Dagster integration [Dalin]
    • Initial proposal: use custom OL executor as thin wrapper over existing executors
    • Challenges:
      • OL handling tightly coupled with actual job runs
      • Requires multiple custom executors to maintain flexibility
      • Incomplete events (only op-level)
    • Solution: use Dagster's OL sensor that tails Dagster event logs for tracking metadata
    • Lessons learned:
      • Non-sharded event log storage must be used for sensor to access all event logs across runs
      • Sensor's cursor does not get updated on an exception. Typical use of cursors is to submit a run request while tracking some state. To guarantee atomic operation with the cursor, the cursor update gets processed only after the sensor function exits.
    • Event type conversion
      • Dagster event types converted to OpenLineage events 
    • Architecture
      • Sensor defined under a repository; events are then converted and sent to the OL backend
    • Lineage collected at job level only; dataset tracking being explored
      • Currently datasets being stored as Dagster assets
      • This is a manual/custom solution
    • 3M event logs processed, used as part of published telemetry report
    • Will J.: what's been the timeline since inception of the idea to now?
      • December 2021; integrated within ~1 month's time
      • Bulk of time was spent on understanding Dagster
      • OL sensor is configurable and can be started late while still catching the first events
    • Willy: do you remember the issue # or title you were waiting for?
    • Julien: Dalin reached out on Slack initially. We started a new channel, my small contribution was to reach out to the Dagster community to facilitate collaboration; we can support new integrations in this way. Thanks to Sandy from the Dagster community for help with this.
      • Don't hesitate to reach out for help!
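
The cursor behavior noted under the lessons learned above can be simulated without Dagster. The sketch below is plain Python with hypothetical names (`run_sensor_tick`, `event_log`, `handle`); it only models the rule that the stored cursor advances when the sensor function exits normally and stays put when an exception is raised.

```python
from typing import Callable, List


def run_sensor_tick(
    event_log: List[str],
    cursor: int,
    handle: Callable[[str], None],
) -> int:
    """Process events after `cursor`; return the cursor to persist for the next tick."""
    new_cursor = cursor
    try:
        for index in range(cursor, len(event_log)):
            handle(event_log[index])
            new_cursor = index + 1
    except Exception:
        # On an exception the stored cursor is not updated,
        # so the same events are seen again on the next tick.
        return cursor
    return new_cursor


log = ["run_start", "step_success", "run_success"]
processed: List[str] = []
next_cursor = run_sensor_tick(log, 0, processed.append)


def failing_handler(event: str) -> None:
    if event == "step_success":
        raise RuntimeError("backend unavailable")


cursor_after_failure = run_sensor_tick(log, 0, failing_handler)
```

In this sketch, events handled before the failure are replayed on the next tick, so a handler written this way would need to tolerate duplicates.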
  • Open discussion
    • Mandy: where do I submit my blog? Two website repos are a source of confusion.
    • Julien: Ross and Michael R. can help. 
    • Ross: branching could solve this problem. We welcome blog posts from anyone in the community.
    • Will J.: parent/child relationships in OL. Problem in Azure: Databricks connector has a parent execution inside Spark and a child execution that is not connected. Spark issues a parent ID that's not being caught. Currently using a workaround. What's the right way to emit a parent/child relationship?
      • Julien: this is relevant to the ParentRunFacet in OL. Michael C. is working on this in Marquez. Recommended: create an issue about this and ping Michael C.
      • Maciej: this is functional in the Airflow integration for Spark jobs.
      • Julien: this issue could be documented better.
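
For the parent/child question above, a child run can point at its parent through the parent run facet. The sketch below shows the `run` section of a child event as a Python dictionary. The `parent` key with nested `run.runId` and `job.namespace`/`job.name` follows the ParentRunFacet as we understand it from the spec; the namespace and job name are invented, and bookkeeping fields such as `_producer` and `_schemaURL` are left out.

```python
import uuid


def run_with_parent(parent_run_id: str, parent_namespace: str, parent_job: str) -> dict:
    """Return the `run` section of a child event that points at its parent run."""
    return {
        "runId": str(uuid.uuid4()),
        "facets": {
            "parent": {
                "run": {"runId": parent_run_id},
                "job": {"namespace": parent_namespace, "name": parent_job},
            }
        },
    }


parent_run_id = str(uuid.uuid4())
child_run = run_with_parent(parent_run_id, "example_namespace", "parent_notebook_job")
print(child_run)
```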

Mar 9th, 2022 (9am PT)

Attendees:

  • TSC:
    • Mike Collado: Staff Software Engineer, Datakin
    • Maciej Obuchowski: Software Engineer, GetInData, OpenLineage contributor
    • Julien Le Dem: OpenLineage Project lead
    • Mandy Chessell: Egeria Project Lead
    • Willy Lulciuc: Co-creator of Marquez
  • And:
    • Michael Robinson: Dev Rel Engineer
    • Ross Turk: VP of Marketing, Datakin
    • Minkyu Park: Senior Software Engineer, Datakin
    • Srikanth Venkat: Product Manager, Privacera
    • John Thomas: Support Engineer, Datakin
    • Will Johnson: Senior Cloud Solution Architect, Azure Cloud, Microsoft
    • Paweł Leszczyński, Software Engineer, GetinData
    • Sheeri Cabral, Technical Product Manager, Lineage, Collibra
    • Michal Bartos, Software Engineer, MANTA
    • Chandru Sugunan, Product Manager, Azure Cloud, Microsoft
    • Caroline Fahrenkrog, Product Manager, MANTA Scanners
    • John Montroy, Backend Engineer

Agenda:

  • New committers [Julien]
  • Release overview (0.6.0-0.6.1) [Michael R.] 
  • Process for blog posts [Ross]
  • Retrospective: Spark integration [Willy et al.]
  • Open discussion  

Notes:

  • New committers [Julien]
    • 4 new committers were voted in last week
    • We had fallen behind
    • Congratulations to all
  • Release overview (0.6.0-0.6.1) [Michael R.]
    • Added
      • Extract source code of PythonOperator code similar to SQL facet @mobuchowski (0.6.0)
      • Airflow: extract source code from BashOperator @mobuchowski (0.6.0)
        • These first two additions are similar to SQL facet
        • Offer the ability to see top-level code
      • Add DatasetLifecycleStateDatasetFacet to spec @pawel-big-lebowski (0.6.0)
        • Captures when someone is conducting dataset operations (overwrite, create, etc.)
      • Add generic facet to collect environmental properties (EnvironmentFacet) @harishsune (0.6.0)
        • Collects environment variables
        • Depends on Databricks runtime but can be reused in other environments
      • OpenLineage sensor for OpenLineage-Dagster integration @dalinkim (0.6.0)
        • The first iteration of the Dagster integration to get lineage from Dagster
      • Java-client: make generator generate enums as well @pawel-big-lebowski (0.6.0)
        • Small addition to the Java client for better typing; these values were previously plain strings
    • Fixed
      • Airflow: increase import timeout in tests, fix exit from integration @mobuchowski (0.6.0)
        • The former was a particular issue with the Great Expectations integration
      • Reduce logging level for import errors to info @rossturk (0.6.0)
        • Airflow users were seeing warnings about missing packages if they weren't using a part of an integration
        • This fix reduced the level to Info
      • Remove AWS secret keys and extraneous Snowflake parameters from connection URI @collado-mike (0.6.0)
        • Parses Snowflake connection URIs to exclude some parameters that broke lineage or posed security concerns (e.g., login data)
        • Some keys are Snowflake-specific, but more can be added from other data sources
      • Convert to LifecycleStateChangeDatasetFacet @pawel-big-lebowski (0.6.0)
        • Mandates the LifecycleStateChange facet from the global spec rather than the custom tableStateChange facet used in the past
      • Catch possible failures when emitting events and log them @mobuchowski (0.6.1)
        • Previously when an OL event failed to emit, this could break an integration
        • This fix catches possible failures and logs them
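
The connection URI fix listed above can be pictured as a small filter over the URI. This is a sketch under assumptions, not the code from the release: the deny-list in `SENSITIVE_PARAMS` and the function name are hypothetical, and the real change may treat different parameters.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical deny-list; the actual fix may exclude a different set of keys.
SENSITIVE_PARAMS = {"aws_access_key_id", "aws_secret_access_key", "password"}


def sanitize_connection_uri(uri: str) -> str:
    """Drop login data and sensitive query parameters from a connection URI."""
    parts = urlsplit(uri)
    # Rebuild the authority from host and port only, discarding user:password@.
    host = parts.hostname or ""
    netloc = f"{host}:{parts.port}" if parts.port else host
    # Keep every query parameter that is not on the deny-list.
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SENSITIVE_PARAMS
    ]
    return urlunsplit((parts.scheme, netloc, parts.path, urlencode(kept), ""))


clean = sanitize_connection_uri(
    "snowflake://user:secret@account.example/db/schema"
    "?warehouse=wh&aws_secret_access_key=abc"
)
print(clean)
```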
  • Process for blog posts [Ross]
    • Moving the process to Github Issues
    • Follow release tracker there

    • Go to https://github.com/OpenLineage/website/tree/main/contents/blog to create posts

    • No one will have a monopoly

    • Proposals for blog posts also welcome and we can support your efforts with outlines, feedback

    • Throw your ideas on the issue tracker on Github

  • Retrospective: Spark integration [Willy et al.]
    • Willy: originally this was part of Marquez – the inspiration behind OL

      • OL was prototyped in Marquez with a few integrations, one of which was Spark (other: Airflow)

      • Donated the integration to OL

    • Srikanth: #559 very helpful to Azure

    • Pawel: is anything missing from the Spark integration? E.g., column-level lineage?

    • Will: yes to column-level; also, delta tables are an issue due to complexity; Spark 3.2 support also welcome

    • Maciej: should be more active about tracking projects we have integrations with; add to test matrix 

    • Julien: let’s open some issues to address these

  • Open Discussion
    • Flink updates? [Julien]
      • Maciej: initial exploration is done

        • challenge: Flink has 4 APIs

        • prioritizing Kafka lineage currently because most jobs are writing to/from Kafka

        • track this on Github milestones, contribute, ask questions there

      • Will: can you share thoughts on the data model? How would this show up in MZ? How often are you emitting lineage? 

      • Maciej: trying to model entire Flink run as one event

      • Srikanth: proposed two separate streams, one for data updates and one for metadata

      • Julien: do we have an issue on this topic in the repo?

      • Michael C.: only a general proposal doc, not one on the overall strategy; this is worth a proposal doc

      • Julien: see notes for ticket number; MC will create the ticket

      • Srikanth: we can collaborate offline

Feb 9th 2022 (9am PT)

Attendees:

  • TSC:
    • Mike Collado: Staff Software Engineer, Datakin
    • Maciej Obuchowski: Software Engineer, GetInData, OpenLineage contributor
    • Julien Le Dem: OpenLineage Project lead
  • And:
    • Michael Robinson: Dev Rel Engineer
    • Ross Turk: VP of Marketing, Datakin
    • Minkyu Park: Senior Software Engineer, Datakin
    • Srikanth Venkat: Product Manager, Privacera
    • John Thomas: Support Engineer, Datakin
    • Peter Scharling: EI Group
    • Peter Hicks: Senior Software Engineer, Datakin
    • Dalin Kim: Data Engineer, Northwestern Mutual
    • Kevin Mellott: Data Engineer, Northwestern Mutual
    • Will Johnson: Senior Cloud Solution Architect, Azure Cloud, Microsoft
    • Kelsy Brennan: EI Group
    • Aaron Colcord: Data Engineer, Northwestern Mutual

Agenda:

  • OpenLineage recent release overview (0.5.1) [Julien]
  • TaskInstanceListener now official way to integrate with Airflow [Julien]
  • Apache Flink integration [Julien]
  • Dagster integration demo [Dalin]
  • Open Discussion

Notes:

  • OpenLineage recent release overview (0.5.1) [Julien]
    • No 0.5.0 due to bug
    • Support for dbt-spark adapter
    • New backend to proxy OL events
    • Support for custom facets
  • TaskInstanceListener now official way to integrate with Airflow [Julien]
    • Integration runs on worker side
    • Will be available with the next release of Airflow (2.3)
    • Thanks to Maciej for his work on this
  • Apache Flink integration [Julien]
    • Ticket for discussion available
    • Integration test setup
    • Early stages
  • Dagster integration demo [Dalin]
    • Initiated by Dalin Kim
    • OL used with Dagster on orchestration layer
    • Utilizes Dagster sensor
    • Introduces OL sensor that can be added to Dagster repo definition
    • Uses cursor to keep track of ID
    • Looking for feedback after review complete
    • Discussion:
      • Dalin: needed: way to interpret Dagster asset for OL
      • Julien: common code from Great Expectations/Dagster integrations
      • Michael C: do you pass parent run ID in child job when sending the job to MZ?
      • Hierarchy can be extended indefinitely – parent/child relationship can be modeled
      • Maciej: the sensor kept failing – does this mean the events persisted despite being down?
      • Dalin: yes - the sensor’s cursor is tracked, so even if repo goes down it should be able to pick up from last cursor
      • Dalin: hoping for more feedback
      • Julien: slides will be posted on slack channel, also tickets
  • Open discussion
    • Will: how is OL ensuring consistency of datasets across integrations? 
    • Julien: (jokingly) Read the docs! Naming conventions for datasets can be found there
    • Julien: need for tutorial on creating integrations
    • Srikanth: have done some of this work in Atlas
    • Kevin: are there libraries on the horizon to play this role? (Julien: yes)
    • Srikanth: it would be good to have model spec to provide enforceable standard
    • Julien: agreed; currently models are based on the JSON schema spec
    • Julien: contributions welcome; opening a ticket about this makes sense
    • Will: Flink integration: MZ focused on batch jobs
    • Julien: we want to make sure we need to add checkpointing
    • Julien: there will be discussion in OLMZ communities about this
      • In MZ, there are questions about what counts as a version or not
    • Julien: a consistent model is needed
    • Julien: one solution being looked into is Arrow
    • Julien: everyone should feel welcome to propose agenda items (even old projects)
    • Srikanth: who are you working with on the Flink comms side? Will get back to you.

Jan 12th 2022 (9am PT)

Attendees:

  • TSC:
    • Mike Collado: Eng, Datakin
    • Mandy Chessell: Lead Egeria project
    • Maciej Obuchowski: Eng GetInData, OpenLineage contributor
    • Willy Lulciuc: Co-creator of Marquez
    • Julien: OpenLineage Project lead
  • And:
    • Michael Robinson: Dev Rel
    • Ross Turk: VP Marketing Datakin
    • Minkyu Park: Dev at Datakin
    • Conor Beverland: Senior Dir of Product, Astronomer
    • Srikanth Venkat, Product Management, Privacera
    • Mark Taylor, Technical P.M., Microsoft
    • Harish Sune, Technical Architect, NE Analytics
    • Joshua Wankowski, Associate Data Engineer, Northwestern Mutual
    • Arpita Grange, Senior Technical Lead for Business Intelligence Solutions, Asurion

Agenda:

Meeting: 

Notes:

0.4 release [Willy]:

  • Databricks install README and init scripts (by Will)
  • Iceberg integration (Pawel)
    • Iceberg adoption already strong
  • Kafka read and write support (Olek and Mike)
  • Arbitrary parameters supported in HTTP URL construction (Will)
  • Increased coverage (Pawel and Maciej)

0.5 preview [Willy]:

  • Add Spark support to openlineage-dbt lib. (by Maciej)
  • New extensible API to handle Spark events for openlineage-spark lib (Mike)
  • New proxy HTTP backend to route events to event streams (Mandy and Willy)
  • Increase coverage of sparkV2 cmds for openlineage-spark lib. (Pawel)
  • Added HTTP client to openlineage-java lib. (Willy)
  • Thanks go to Mike Collado for work on PRs, proposal; also to Mandy for work on HTTP backend over last two months
  • HTTP client will decrease confusion about how to capture metadata

Tasklistener for OL Integration [Maciej]:

1.10 required modifying each DAG, which was cumbersome and not compatible with 2.1

2.1: lineage backend comparable to Apache Atlas’ old backend

  • benefit: provides all info about events
  • downside: cannot notify about task starts/failures

2.3: Airflow Event Listener

  • Status: not merged yet, in final reviews for deployment with 0.6
  • Improvements: transparent, less exposure, enables pull model using queue, enables Egeria and other projects in the future (e.g., DataHub)
  • Discussion [Julien, Maciej, Willy, Mike]:
    • generic: supports additional functionality 
    • extendable to different kinds of events, e.g., scheduling
    • makes more data available 
    • much less brittle because depends on public API
    • requires little configuration
    • will not do away with registration of listeners/extractors
    • entry point mechanism comparable to service loaded in Java, requires env variables
    • theoretically possible to back port it to earlier versions of Airflow (as far as 1.10)
    • possibly helpful to document that we have 3 approaches but are not recommending older ones, mention that this changes only how we collate
    • older approaches can be deprecated; it will be important to monitor the community to determine timing of this

Egeria Support for OpenLineage [Mandy]:

  • Monthly releases
  • OpenLineage support ready in recent release
  • Metaphor: Lego blocks
    • OL events can be brought in through API or proxy backend with Kafka
    • events augmentable in Egeria, storable or publishable in Marquez or Kafka for distribution or to log store (e.g., file system)
  • Can validate that a process is running correctly
  • See documentation in Egeria about proxy backend and extensions, API mechanism
  • Diagram in documentation illustrates capabilities
  • Discussion [Julien, Mandy, Srikanth, Mike]:
    • Egeria sees value of OpenLineage
    • Engine is uncoupled from receivers
    • Endpoint is simple, allowing independent management of processes
    • Some transformation of payload during storage
    • Kafka integration coming in 0.5
    • Customers expect ability to filter data
    • Varying granularity of metadata already possible through versioning with Marquez 

Open Discussion:

Proposal to convert licenses to SPDX [Michael]: no objections

Dec 8th 2021 (9am PT)

Attendees:

TSC:

  • Mike Collado, Staff Engineer, Datakin
  • Willy Lulciuc, Co-creator of Marquez, Datakin
  • Mandy Chessell, Egeria Project Lead
  • Julien Le Dem, OpenLineage Project Lead, CTO Datakin

And:

  • Peter Hicks, Software Engineer, Datakin
  • Srikanth Venkat, Product Management, Microsoft
  • Ross Turk, VP Marketing, Datakin
  • Maciej Obuchowski: Engineer GetInData, OpenLineage contributor
  • John Thomas, Support Engineer, Datakin
  • Minkyu Park, Engineer, Datakin
  • Michael Robinson, Dev Rel Engineer
  • Will Johnson, Senior Cloud Solution Architect, Azure Cloud, Microsoft
  • Mark Taylor, Principal Technical PM, Microsoft
  • Travis Hilbert, Associate Consultant, Microsoft

Agenda:

  • SPDX headers [Mandy]
  • Azure Purview + OpenLineage [Will and Mark]
  • Logging backend (OpenTelemetry) [Julien]
  • Open discussion

Meeting recording:

Notes:

Software Package Data Exchange (SPDX) Tags [Mandy]

  • Open standard for creating software bill of materials
  • Includes set of short identifiers for open source licenses
    • both human readable and machine processable
    • easy to maintain and validate
  • Full license added in License file at top of git repository
  • Each file includes the SPDX-License-Identifier tag
  • Proposed: we use this approach in OpenLineage
  • Becoming a best practice in open source development
  • Julien: "a no brainer" 
  • Next question: how to integrate (implement going forward or add tags throughout project?)
  • Willy: throughout existing; should also do with Marquez
  • Mike: update build check to check for tags in new source files? 
  • Julien: must find right build plugins, two passes might be necessary
  • Julien: all agreed?; adopted; someone should create issue
  • Julien: Maven plugins exist to check and add tag if missing
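
The build check Mike raised could look roughly like the sketch below. It is an illustration only: the function name, the five-line header window, and the sample file contents are assumptions, and the Maven plugins Julien mentioned would be the more likely route for Java sources.

```python
from typing import Dict, List

SPDX_TAG = "SPDX-License-Identifier:"


def files_missing_spdx_tag(sources: Dict[str, str], header_lines: int = 5) -> List[str]:
    """Return the paths whose first `header_lines` lines lack an SPDX tag."""
    missing = []
    for path, text in sources.items():
        header = text.splitlines()[:header_lines]
        if not any(SPDX_TAG in line for line in header):
            missing.append(path)
    return sorted(missing)


# File contents are passed in as strings so the check is easy to try out.
example_sources = {
    "client/emit.py": "# SPDX-License-Identifier: Apache-2.0\nprint('hello')\n",
    "client/util.py": "print('no header here')\n",
}
print(files_missing_spdx_tag(example_sources))
```

A build step could fail when the returned list is non-empty, which would cover new source files once the existing ones have been tagged.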

Azure Purview Integration [Srikanth, Will]

  • Overview of Azure Purview
    • Metadata and governance platform across MS, new 
    • End-to-end governance practices
    • Goal is to fill gaps in lineage
  • Database Lineage in Azure Purview
    • Began as hackathon project at Microsoft
    • Sought way to send lineage data directly to Purview (rather than use architecture of Marquez)
    • Azure Functions used to send data from Databricks through serverless compute and event hub to Purview
    • Required adapter pattern to make emissions conform to Atlas
    • Challenges:
      • automating getting most recent OL jar into Databricks; created PR for this with emit script
      • needed to use API key passed in URL parameter; support for this integrated with PR
    • Have goal of extending use of OpenLineage inside of Spark further 
    • Motivation: didn't want to be dependent on catalog API, particular flavor of Spark
    • Plans include other integrations, including dbt
    • Want to be respectful of OpenLineage's global scope, even if it means metadata on Purview side not real-time 
    • Want to incorporate filtering capability, make it customizable based on particular connector
    • Interest extends beyond Databricks (e.g., Snowflake)
    • Eager to see issue #181 addressed: ability to tack on a MS jar to installation where OpenLineage is
    • Possible PR in future: emit metadata outside a run (e.g., as dataset facets); would meet need at MS

Logging backends [Julien]

  • Open suggestion: add ability to send events to a logging aggregator (e.g., Datadog)
  • Mandy: needed in addition to proxy backend?
  • Proxy backend could be distribution endpoint, first location for this
  • Use case: experimentation
  • Proposed: open a ticket

Discussion

  • Azure PRs, other merged PRs will be in 0.4

Nov 10th 2021 (9am PT)

Attendees:

  • TSC
    • Mike Collado: Eng
    • Ryan Blue: Tabular, Apache Iceberg
    • Mandy Chessell: Lead Egeria project
    • Maciej Obuchowski: Eng GetInData, OpenLineage contributor
    • Willy Lulciuc: Co-creator of Marquez
    • Julien: OpenLineage Project lead
  • And:
    • Michael Robinson: dev rel
    • Peter Hicks: Marquez contributor
    • Ross Turk: VP marketing Datakin
    • John Thomas: Support eng at Datakin
    • Minkyu Park: Dev at Datakin, learning about MQZ and OL.

Agenda:

  • OL Client use cases for Apache Iceberg [Ryan]
  • Proxy Backend and Egeria integration progress update (Issue #152) [Mandy]
  • OpenLineage last release overview (0.3.1)
    • Facet versioning
    • Airflow 2 / Spark 3 support, dbt improvements
  • OpenLineage 0.4 scope review
    • Proxy Backend (Issue #152)
    • Spark, Airflow, dbt improvements (documentation, coverage, ...)
    • improvements to the OpenLineage model
  • Open discussion

Meeting recording:

Notes:

SPDX tags:
shorter license headers => makes things easier.
https://spdx.org/licenses/
TODO: Mandy will propose something next time

Iceberg requirements:

  • ability for Iceberg to add facets without having to depend on the context it's running in.

  • Avoid depending on allowing the Sources to expose facets in the Spark API as it would be a hard change to get into Spark.

Ryan:
  Proposal to have a logger style API.

  • similar to SLF4J or dropwizard metrics => Create a logging/metrics object. Independent of logging backend.

  • Facets can be emitted and the backend can be configured independently whether those facets are picked up or not.

Example: Have an OpenLineage API to add facets in a given context:
create facet for some context: Read datasets x, ... write dataset Y

=> broad agreement on principle
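
To make the logger-style proposal more concrete, here is a minimal sketch of what such a facade might look like. Nothing here is an agreed design: `configure_backend`, `get_facet_logger`, `FacetLogger`, and the facet name are all hypothetical. The point is only that the code emitting facets does not know, or care, whether a backend is configured.

```python
from typing import Callable, List, Optional, Tuple

# A backend is any callable taking (context, facet_name, facet_body).
FacetSink = Callable[[str, str, dict], None]

_sink: Optional[FacetSink] = None  # no backend configured by default


def configure_backend(sink: Optional[FacetSink]) -> None:
    """Install (or remove, with None) the backend that receives facets."""
    global _sink
    _sink = sink


class FacetLogger:
    """Hypothetical facade, in the spirit of a logging or metrics object."""

    def __init__(self, context: str) -> None:
        self.context = context

    def add_facet(self, name: str, facet: dict) -> None:
        # With no backend configured the call is a no-op, as with a logging facade.
        if _sink is not None:
            _sink(self.context, name, facet)


def get_facet_logger(context: str) -> FacetLogger:
    return FacetLogger(context)


collected: List[Tuple[str, str, dict]] = []
configure_backend(lambda context, name, facet: collected.append((context, name, facet)))
get_facet_logger("iceberg.scan").add_facet("scanMetrics", {"files": 3})

configure_backend(None)
get_facet_logger("iceberg.scan").add_facet("scanMetrics", {"files": 5})  # silently dropped
```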

Open Questions:

  • when are facets sent?

    • preference to sending events as they go.

    • does that fit with the OpenLineage view of the world? => yes

    • do we send them immediately? Do we wait?

    • iceberg not creating a facet until Spark asks for the splits

  • Spark, bound to a context thread:

    • the "logger backend can grab the sql execution id"

    • loggers depend on thread

    • listener is on different thread

    • Report for a given job run

    • Ryan: runcontext is threadlocal: sets the executionid.

  • The client side should be able to send an event immediately vs sent when you get a chance.

    • Who needs to do this?

    • Need to have a guide to defining a facet.

  • Michael C.: TODO: Design Doc on logging

  • Willy: Do we need a "RUNNING" event?

Flink:

  • how to handle long running job

  • [Ryan] [Mandy] long running jobs need to be defined

  • TODO: Julien, post a ticket for long running jobs

There is also a need for an OSS Trino integration; Tabular might contribute


Proxy Backend update [Mandy]

  • draft PR #500: Thanks Willy for the initial setup.
    Looking for feedback
    Issues:
    Initial implementation was using the provided beans to deserialize but it didn't quite work (TODO: ticket)
    Instead just pass through. faster, but no validation
  • proposal for new facets.
    RequestFacet => should be a runfacet, maps to the run args in Marquez

https://github.com/OpenLineage/OpenLineage/issues/256

Does the last version of a facet win? => yes
Need to document size constraint in OL (name length...) TODO: ticket

Oct 13th 2021

Attendees:

  • TSC:
    • Michael Collado: Datakin

    • Julien Le Dem: OpenLineage Project Lead, Datakin

    • Maciej Obuchowski: GetInData, OpenLineage

    • Willy Lulciuc: Marquez, OpenLineage

    • Mandy Chessel: Egeria Project Lead, working on OpenLineage

  • And:
    • Ross Turk: VP marketing at Datakin talk about the website

    • Minkyu Park: interested in contributing to Datakin

    • Peter Hicks: Marquez contributor, OpenLineage user

  • Meeting recording:
  • Notes:
    • OpenLineage website: https://openlineage.io/
      • Gatsby based (markdown) in OpenLineage/website repo 
      • generates a static site hosted in github pages. OpenLineage/OpenLineage.github.io
      • deployment is currently manual. Automation in progress
      • Please open PRs on /website to contribute blog posts.
        • Getting started with Egeria?
      • Suggestions:
        • Add page on open governance and how to join the project.
        • Add LFAI & data banner to the website?
        • Egeria is using MKdocs: very nice to navigate documentation.
    • upcoming 0.3.0:
      • Facet versioning:
        • each facet schema is versioned individually.
        • client/server code generation to facilitate producing/consuming openlineage events
      • Spark 3.x support
      • new mechanism for airflow 2.x
        • working with airflow maintainer to improve that.
    • Proxy Backend update (planned for OL 0.4.0):
      • mapping to egeria backend
      • planning to release for the Egeria webinar on the 8th of November
      • Willy provided a base module for ProxyBackend
    • Monthly release is a good cadence
    • Open discussions:

      • Azure Purview team hackathon ongoing to consume OpenLineage events

      • Design docs discussion:

        • proposal to add design doc for proposal.

        • goal:

          • Similar to the process of projects like Kafka, Flink: for specs and bigger features

          • not for bug fixes.

        • options:

          • proposal directory for docs as markdown

          • Open PRs against wiki pages: proposals wiki.

        • Manage status:

          • list of designs that are implemented vs pending.

          • table of open proposals.

        • vote for prioritization:

          • Every proposal design doc has an issue opened and link back to it.

        • good start for the blog talking about that feature

      • New committee on data ops: Mandy will be speaking about Egeria and OpenLineage

Sept 8th 2021

  • Attendees: 
    • TSC:
      • Mandy Chessell: Egeria Lead. Integrating OpenLineage in Egeria

      • Michael Collado: Datakin, OpenLineage

      • Maciej Obuchowski: GetInData. OpenLineage integrations
      • Willy Lulciuc: Marquez co-creator.
      • Ryan Blue: Tabular, Iceberg. Interested in collecting lineage across Iceberg users with OpenLineage
    • And:
      • Venkatesh Tadinada: BMC workflow automation looking to integrate with Marquez
      • Minkyu Park: Datakin. learning about OpenLineage
      • Arthur Wiedmer: Apple, lineage for Siri and AI ML. Interested in implementing Marquez and OpenLineage
  • Meeting recording:
  • Meeting notes:
    • agenda: 
      • Update on OpenLineage latest release (0.2.1)

        • dbt integration demo

      • OpenLineage 0.3 scope discussion

        • Facet versioning mechanism (Issue #153)

        • OpenLineage Proxy Backend (Issue #152)

        • OpenLineage implementer test data and validation

        • Kafka client

      • Roadmap

        • Iceberg integration
      • Open discussion

    • Slides 

    • Discussions:
      • added to the agenda a Discussion of Iceberg requirements for OpenLineage.

    • Demo of dbt:

      • really easy to try

      • when running from airflow, we can use the wrapper 'dbt-ol run' instead of 'dbt run'

    • Presentation of Proxy Backend design:

      • summary of discussions in Egeria

        • Egeria is less interested in instances (runs) and will keep track of OpenLineage events separately as Operational lineage

        • Two ways to use Egeria with OpenLineage

          • Proxy Backend in OpenLineage:

            • receives HTTP events and forwards to Kafka

            • A consumer receives the Kafka events in Egeria

          • direct HTTP endpoint implementation in Egeria

      • Depending on the user, they might pick one or the other, and we'll document both:

        • Use a direct OpenLineage endpoint (like Marquez)

        • Deploy the Proxy Backend to write to a queue (ex: Kafka)

      • Follow up items:

        • The transport abstraction (Backend interface) could be usable directly from the client or from the Proxy Backend. The user can decide if they want the intermediate proxy. See #269

        • We should add a distribution client symmetric to the Proxy Backend. It reads from Kafka and sends events to an OpenLineage HTTP endpoint. Marquez would use it, for example, to consume OpenLineage events produced by Egeria. See #270
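
The follow-up items above can be sketched roughly as follows. This is a minimal illustration, not the design tracked in #269 or #270: every class and function name here (Transport, InMemoryQueueTransport, ProxyBackend, ProxyTransport, RecordingEndpoint, distribute) is hypothetical, the queue is an in-memory list standing in for Kafka, and the proxy is called as a plain method instead of over HTTP.

```python
import json
from typing import List, Protocol


class Transport(Protocol):
    """Anything that can deliver one lineage event."""

    def emit(self, event: dict) -> None: ...


class InMemoryQueueTransport:
    """Stand-in for a queue such as a Kafka topic (no real broker involved)."""

    def __init__(self) -> None:
        self.messages: List[str] = []

    def emit(self, event: dict) -> None:
        # One serialized message per event, appended synchronously.
        self.messages.append(json.dumps(event, sort_keys=True))


class RecordingEndpoint:
    """Stand-in for an OpenLineage HTTP endpoint; it only records what it receives."""

    def __init__(self) -> None:
        self.received: List[dict] = []

    def emit(self, event: dict) -> None:
        self.received.append(event)


class ProxyBackend:
    """Receives events and forwards them to whichever transport it was configured with."""

    def __init__(self, transport: Transport) -> None:
        self.transport = transport

    def handle(self, event: dict) -> None:
        self.transport.emit(event)


class ProxyTransport:
    """Lets a client go through the intermediate proxy via the same interface."""

    def __init__(self, proxy: ProxyBackend) -> None:
        self.proxy = proxy

    def emit(self, event: dict) -> None:
        self.proxy.handle(event)


def distribute(source: InMemoryQueueTransport, target: Transport) -> int:
    """Distribution-client stand-in: drain queued messages into another transport."""
    sent = 0
    while source.messages:
        target.emit(json.loads(source.messages.pop(0)))
        sent += 1
    return sent


queue = InMemoryQueueTransport()
event = {"eventType": "COMPLETE", "job": {"namespace": "demo", "name": "daily_load"}}

# Option 1: the client writes to the queue directly.
queue.emit(event)
# Option 2: the client goes through the intermediate proxy.
ProxyTransport(ProxyBackend(queue)).emit(event)

# The distribution client then forwards queued events to an endpoint.
endpoint = RecordingEndpoint()
forwarded = distribute(queue, endpoint)
```

Because the client, the proxy, and the distribution client all depend only on the same emit interface, the user's choice of whether to deploy the intermediate proxy does not change the producer code.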

  • Iceberg integration:
    • presentation of Iceberg model

      • Manifest and manifest list: 2-level tree structure tracking data files.

      • Root metadata version file: points to the manifest list and knows all of the previous versions of the dataset that we want to keep

    • Iceberg collects various metadata about scans and the data being produced, and wants to expose it through OpenLineage. It can already expose this metadata, but there is no listener yet.

      • Ryan: added the metadata list presented to the Iceberg ticket: See #167

Aug 11th 2021

  • Attendees: 
    • TSC:
      • Ryan Blue

      • Maciej Obuchowski

      • Michael Collado

      • Daniel Henneberger

      • Willy Lulciuc

      • Mandy Chessell

      • Julien Le Dem

    • And:
      • Peter Hicks

      • Minkyu Park

      • Daniel Avancini

  • Meeting recording:
  • Meeting notes:
    • Agenda:
      • Coming in OpenLineage 0.1
        • OpenLineage spec versioning
        • Clients
        • Marquez integrations imported in OpenLineage
          • Apache Airflow:
            • BigQuery 
            • Postgres
            • Snowflake
            • Redshift
            • Great Expectations
          • Apache Spark
          • dbt
      • OpenLineage 0.2 scope discussion
      • Roadmap
      • Open discussion
    • Slides: https://docs.google.com/presentation/d/1Lxp2NB9xk8sTXOnT0_gTXicKX5FsktWa/edit#slide=id.ge80fbcb367_0_14
    • Notes: 
      • OpenLineage 0.1 is being published 
      • Coming in OpenLineage 0.1
        • OpenLineage spec versioning
        • Clients (Java, Python)
        • Marquez integrations imported in OpenLineage
          • Apache Airflow:
            • BigQuery 
            • Postgres
            • Snowflake
            • Redshift
            • Great Expectations
          • Apache Spark
          • dbt
          • Question: How is Airflow capturing OpenLineage events?
            • openlineage-airflow installed on the Airflow instance
            • adapters per operator
      • OpenLineage 0.2 scope discussion
        • Facet versioning mechanism (Issue #153)
        • OpenLineage Proxy Backend (Issue #152)
          • Questions:
            • What is the advantage of the proxy backend?
              • The consumer does not need to implement an endpoint and can consume from Kafka
              • can configure what to do with events independently of various integrations
              • first step to having a routing mechanism:
                • to send events to multiple consumers
                • to have rule-based routing
                • to enable archiving the events in addition to sending them
            • Is it included in OpenLineage?
              • Yes (Otherwise it would have to be in Egeria)
            • Does it include error management or retry policy? What if the proxy dies? Do we care about durability?
              • Yes we care about durability 
              • The first implementation will be synchronous: a single transaction to Kafka per event.
              • Future versions might be configurable to adjust to context (guaranteed delivery vs. performance batching)
            • What technology should we use?
              • Proposed: Java + Spring Boot (like Egeria)
              • Discussion of using Java + Dropwizard, like Marquez
              • General consensus on using Java (framework TBD)
              • In the future, there might be a Go implementation to enable a lightweight sidecar pattern
        • Kafka client
      • Roadmap
      • Open discussion
        • How do we define extension points for integrations? For example, hooks in Spark and Airflow that let the user add adapters/facets without having to modify OL.

          • TODO: create a ticket to track this
        • Apache Iceberg interest in OpenLineage:
          • Would want to add additional notifications
            • how many files read or written
            • How long a commit took.
            • How many attempts to commit were needed?
          • TODO: create ticket to enable Iceberg facets to be added to OpenLineage events
          • Iceberg needs to send events independently of where the library is used (example: a plain Java process or other).
          • TODO: ticket for PrestoDB/Trino integrations
        • Egeria has a weekly community call
          • September 1st will be about OpenLineage
          • Also an upcoming webinar
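
The Iceberg notifications listed above (files read or written, commit duration, commit attempts) could be attached to an event as a custom facet. The sketch below is only an illustration under assumptions: the facet name iceberg_commit, its field names, the job and namespace names, the placement under the run's facets, and the example.com URLs are all hypothetical. The outer event shape follows the OpenLineage run event as generally understood.

```python
import json
import uuid
from datetime import datetime, timezone

# Hypothetical identifiers, used only for illustration.
PRODUCER = "https://example.com/iceberg-listener"
FACET_SCHEMA_URL = "https://example.com/schemas/IcebergCommitRunFacet.json"


def iceberg_commit_facet(files_read: int, files_written: int,
                         commit_duration_ms: int, commit_attempts: int) -> dict:
    """Build a custom facet carrying the commit metrics listed in the notes."""
    return {
        "_producer": PRODUCER,
        "_schemaURL": FACET_SCHEMA_URL,
        "filesRead": files_read,
        "filesWritten": files_written,
        "commitDurationMs": commit_duration_ms,
        "commitAttempts": commit_attempts,
    }


def build_run_event(job_name: str, facet: dict) -> dict:
    """Wrap the facet in a run event; no scheduler or engine is required to build it."""
    return {
        "eventType": "COMPLETE",
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "producer": PRODUCER,
        "run": {"runId": str(uuid.uuid4()), "facets": {"iceberg_commit": facet}},
        "job": {"namespace": "iceberg-demo", "name": job_name},
        "inputs": [],
        "outputs": [],
    }


event = build_run_event("append_to_orders", iceberg_commit_facet(12, 3, 840, 2))
print(json.dumps(event, indent=2))
```

Since the event is a plain dictionary built without any integration-specific context, the same facet could be produced wherever the library runs.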

July 14th 2021

  • Attendees: 
    • TSC:
      • Julien Le Dem
      • Mandy Chessell
      • Michael Collado
      • Willy Lulciuc
  • Meeting recording:
  • Meeting notes
    • Agenda:
    • Notes: 

      Mission statement:

      Spec versioning mechanism:

      • The goal is to commit to compatible changes once 0.1 is published

      • We need a follow up to separate core facet versioning


      => TODO: create a separate GitHub ticket.
      • The lineage event should have a field that identifies what version of the spec it was produced with

        • => TODO: create a GitHub issue for this

      • TODO: Add issue to document version number semantics (SCHEMAVER)
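
A minimal sketch of the version-field idea above, under loud assumptions: the field name specVersion is hypothetical, the numbering is taken to be SchemaVer-style MODEL-REVISION-ADDITION, and the compatibility rule (same MODEL, event not newer than the consumer) is a conservative guess rather than anything decided in the meeting.

```python
from typing import NamedTuple


class SchemaVer(NamedTuple):
    model: int
    revision: int
    addition: int


def parse_schemaver(text: str) -> SchemaVer:
    """Parse a 'MODEL-REVISION-ADDITION' string such as '1-0-2'."""
    parts = text.split("-")
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        raise ValueError(f"not a SchemaVer string: {text!r}")
    return SchemaVer(*(int(p) for p in parts))


def can_consume(event: dict, supported: str) -> bool:
    """Assumed rule: same MODEL, and the event is not newer than the consumer."""
    produced = parse_schemaver(event["specVersion"])  # hypothetical field name
    known = parse_schemaver(supported)
    return produced.model == known.model and produced <= known


event = {"specVersion": "1-0-0", "eventType": "START"}
readable = can_consume(event, "1-0-2")
```

With a field like this on every event, a consumer can decide up front whether to process, downgrade, or reject an event instead of failing partway through parsing.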

      Extend Event State notion:

      OpenLineage 0.1:

      • finalize a few spec details for 0.1: a few items left to discuss.

        • In particular job naming

        • parent job model

      • Importing Marquez integrations in OpenLineage

      Open Discussion:

      • connecting the consumer and producer

        • TODO: ticket to track distribution mechanism

        • options:

          • Would we need a consumption client to make it easy for consumers to get events from Kafka for example?

          • OpenLineage provides client libraries to serialize/deserialize events as well as sending them.

        • We can have documentation on how to send to backends that are not Marquez using HTTP and existing gateway mechanism to queues.

        • Do we have a mutual third party, or does the client know where to send?

      • Source code location finalization

      • job naming convention

        • you don't always have a nested execution

          • can call a parent

        • parent job

        • You can have a job calling another one.

        • always distinguish a job and its run

      • need a separate notion for job dependencies

      • need to capture the event-driven case. TODO: create ticket.


      TODO(Julien): update job naming ticket to have the discussion.

June 9th 2021

  • Attendees: 
    • TSC:
      • Julien Le Dem: Marquez, Datakin
      • Drew Banin: dbt, CPO at Fishtown Analytics
      • Maciej Obuchowski: Marquez, GetInData (consulting company)
      • Zhamak Dehghani: Data Mesh. An open protocol of observability for the data ecosystem is a big piece of Data Mesh
      • Daniel Henneberger: building a database, interested in lineage
      • Mandy Chessell: Lead of Egeria, metadata exchange. Lineage is a great extension that volunteers lineage
      • Willy Lulciuc: co-creator of Marquez
      • Michael Collado: Datakin, OpenLineage end-to-end holistic approach.
    • And:
      • Kedar Rajwade: consulting on distributed systems.
      • Barr Yaron: dbt, PM at Fishtown Analytics on metadata.
      • Victor Shafran: co-founder at databand.ai, a pipeline monitoring company. Lineage is a common issue
    • Excused: Ryan Blue, James Campbell
  • Meeting recording:
  • Meeting notes:

    Agenda:

    • project communication

    • Technical charter review

    • medium term roadmap discussion

    Notes:

    • project communication

      • GitHub: for specs, designs, reviews and building consensus (issues and PRs)

      • email: for announcements, notes, etc.

      • Slack: transient discussions, does not maintain history. Any decision making or notes should go to a persistent medium (email and GitHub)

      • monthly meeting: recorded, notes and recording published on the wiki

    • Technical Charter review:

      • TODO: Finalize the mission statement. TSC members to comment in the doc.

