The OpenLineage Technical Steering Committee meets monthly on the second Thursday from 10:00am to 11:00am US Pacific. Here's the meeting info.

All are welcome.

Next meeting: June 8, 2023 (10am PT)

Tentative agenda:

  1. Announcements
  2. Recent releases
  3. Static lineage progress update
  4. Open discussion

May 11, 2023 (10am PT)

Attendees:

Agenda:

Meeting:

Notes:

  1. Announcements [Julien]:
    1. Upcoming meetups
      1. Boston Data Lineage Meetup (tentatively scheduled for June)
      2. San Francisco OpenLineage Meetup at Astronomer (tentatively scheduled for June 27)
    2. Upcoming talks
      1. Paweł Leszczyński and Maciej Obuchowski, “Column Lineage is Coming to the Rescue,” Berlin Buzzwords, June 18-20, 2023
      2. Julien Le Dem and Willy Lulciuc, “Cross-platform Data Lineage with OpenLineage,” Data+AI Summit, June 28-29, 2023
      3. Maciej Obuchowski, “OpenLineage in Airflow: A Comprehensive Guide,” Airflow Summit, September 19-21, 2023
  2. Recent releases [Michael R.]
    1. OpenLineage 0.24.0
      1. Additions
        1. Support custom transport types #1795 @nataliezeller1
        2. Airflow: dbt Cloud integration #1418 @howardyoo @JDarDagran
        3. Spark: support dataset name modification using regex #1796 @pawel-big-lebowski
      2. https://github.com/OpenLineage/OpenLineage/releases/tag/0.24.0
      3. https://github.com/OpenLineage/OpenLineage/compare/0.23.0...0.24.0
  3. Custom transport types support [Natalie]
    1. OpenLineage supports a set of predefined transport types (HTTP, Kafka, others)
    2. Previously, adding a new or custom type required changing the transport config and transport factory to recognize the new type
    3. This change allows for extending functionality without having to change anything in the OpenLineage codebase
    4. Example: my company, where we work with an OpenMetadata backend
      1. This required a custom transport type
      2. With this change I can do this without changing anything
    5. Implementation
      1. New interface: TransportBuilder
      2. Implementable via methods:
        1. getType() // set in transport.type config param
        2. getConfig() // extension of TransportConfig, containing the required configuration
        3. Transport build(TransportConfig config) // builds a custom Transport instance based on the custom configuration
      3. Additionally, a file (META-INF/services/io.openlineage.client.transports.TransportBuilder) containing the fully qualified name of the implementing class must be included in a jar on the classpath
      4. Using the service loader pattern, implementations of TransportBuilder will be discovered and loaded at runtime.
    6. Q&A
      1. What are some use cases for other cool transport mechanisms?
        1. Native cloud, your queue system to send events
        2. Preferred way: the provider, data catalog, or something to implement over the lineage
        3. Maybe someone wants to do MSMQ or MQSeries
        4. You can also apply some transformation logic as part of your transport provider, so you can have your own ways of transporting the data
      2. Should we have some sort of repository where people can put the custom transport types that they're building in a single place?
        1. They can put them in the repo; I don't think we need a separate place, at least right now
  4. dbt Cloud integration [Jakub]
    1. Previously:
      1. The dbt-ol script invoked dbt metadata processing and sent OpenLineage events
      2. Worked only with a local dbt project
      3. How events were created:
        1. each run was a separate supported dbt node
        2. parent run reflected dbt-ol command call
    2. New dbt Cloud integration:
      1. each run in dbt Cloud might have multiple steps, each producing separate JSON files
      2. Each step is considered a parent run
      3. DbtArtifactProcessor was separated as a parent for DbtCloudArtifactProcessor and DbtLocalArtifactProcessor classes; the naming convention stays the same
      4. Used with the DbtCloudRunJobOperator and DbtCloudJobRunSensor operators in the Airflow integration; also makes use of DbtCloudHook to retrieve metadata from the dbt Cloud API
    3. Artifact retrieval and processing
      1. Due to a 10-sec thread timeout in the OpenLineage-Airflow integration, there is the following process for fetching dbt metadata:
        1. each run is a separate supported dbt node (models, tests, sources, snapshots)
        2. parent run reflects dbt-ol command call
      2. The issue will be resolved with the Airflow OpenLineage provider release (learn more about AIP-53 here)
  5. Discussion items
    1. Can we help ensure efficiency by narrowing the scope in some pragmatic ways? For example: is validation necessary in the case that an OpenLineage client is being used to send events? Are there other similar cases where validation might not be necessary?
      1. Work on adding validation to the project is ongoing, e.g., in the proxy where there is some schema validation happening
      2. It would be useful to have some testing facility, e.g., for people consuming OpenLineage and potential implementers
      3. From a producer's point of view, we could check if the consumer consumes them; this would have to be specific to each consumer
      4. We could have a dataset of events that contain all the assets, which would be useful for anyone who wants to do their own testing – like examples of all the facets that exist (instead of having to create them by hand for internal teams)
      5. Maybe just pump demo payloads out to disk and keep them somewhere
    2. Improving column lineage: there are lots of other elements that would be useful
      1. People want to add selected rules and filters
        1. Is there an anticipated traffic level or typical volume in a plan for design lineage?
      2. Column metadata is well covered by other standards in the industry, but there are some lineage ones related to expected performance, flags that people want such as for PII data that's being managed on that edge, etc.
      3. One question: are those properties of a transformation itself, or just a property of a resulting column?
        1. In some cases, transformation; in others the actual edge, which is interesting. Option: have the ability to define the kinds of edges
        2. for PII, there is a tagging facet we were discussing that is still in progress
        3. Action item: get feedback on this and complete it
    3. Spark integration: merge into and aggregate functions don't provide column lineage
      1. A fix has recently been made, but when will this be released?
      2. Anyone can request a release in the #general Slack channel. You're encouraged to do this if you'd like a fix before the next regularly scheduled release (on the first work day of the month).
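
Illustration for note 3 above (custom transport types): the following is a minimal, self-contained sketch of the builder-plus-service-loader idea, not the actual OpenLineage client code. The nested TransportConfig, Transport and TransportBuilder interfaces are simplified stand-ins written for this example; only the method names getType(), getConfig() and build(TransportConfig) come from the notes. The "in-memory" type name, the InMemory* classes and the resolve helpers are hypothetical. In a real deployment the builders would be discovered through a META-INF/services file on the classpath; here a list is passed in directly so the example runs without one.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

public class CustomTransportSketch {

    // Simplified stand-ins for the client types named in the notes (assumed shapes).
    public interface TransportConfig {}

    public interface Transport {
        void emit(String eventJson);
    }

    public interface TransportBuilder {
        String getType();                         // matched against the transport.type config param
        TransportConfig getConfig();              // custom configuration object
        Transport build(TransportConfig config);  // builds the custom Transport
    }

    // Hypothetical custom transport that just keeps events in a list.
    public static final class InMemoryConfig implements TransportConfig {
        public String prefix = "";
    }

    public static final class InMemoryTransport implements Transport {
        public final List<String> sent = new ArrayList<>();
        private final String prefix;

        InMemoryTransport(InMemoryConfig config) {
            this.prefix = config.prefix;
        }

        @Override
        public void emit(String eventJson) {
            sent.add(prefix + eventJson);
        }
    }

    public static final class InMemoryTransportBuilder implements TransportBuilder {
        @Override
        public String getType() {
            return "in-memory";
        }

        @Override
        public TransportConfig getConfig() {
            return new InMemoryConfig();
        }

        @Override
        public Transport build(TransportConfig config) {
            return new InMemoryTransport((InMemoryConfig) config);
        }
    }

    // Picks the builder whose type matches the configured type and builds a transport.
    // Simplification: the config returned by getConfig() is used as-is, without
    // being populated from any configuration file.
    public static Transport resolve(String type, Iterable<TransportBuilder> builders) {
        for (TransportBuilder builder : builders) {
            if (builder.getType().equals(type)) {
                return builder.build(builder.getConfig());
            }
        }
        throw new IllegalArgumentException("No TransportBuilder registered for type: " + type);
    }

    // Service loader variant: finds builders listed in META-INF/services files on the classpath.
    public static Transport resolveFromClasspath(String type) {
        return resolve(type, ServiceLoader.load(TransportBuilder.class));
    }

    public static void main(String[] args) {
        List<TransportBuilder> builders = List.of(new InMemoryTransportBuilder());
        Transport transport = resolve("in-memory", builders);
        transport.emit("{\"eventType\":\"START\"}");
        System.out.println(((InMemoryTransport) transport).sent);
    }
}
```

The point of the design is that resolve never names a concrete transport class: adding a new transport type means adding a jar with a builder and a service file, with no change to the code that selects the transport.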

April 20, 2023 (10am PT)

Attendees:

Agenda:

  1. Announcements
  2. Updates (new!)
    1. OpenLineage in Airflow AIP
    2. Static lineage support
  3. Recent release overview
  4. A new consumer
  5. Caching support for column lineage
  6. Discussion items
    1. Snowflake tagging
  7. Open discussion

Meeting:

Notes:

  1. Announcements [Julien]
    1. A New York meetup will be happening on 4/26 at the Astronomer offices in the Flatiron District
    2. Julien Le Dem will be speaking at the Data+AI Summit in June: "Cross-platform Data Lineage with OpenLineage"
    3. Recent talks:
      1. Last month: Ross Turk, Paweł Leszczyński and Maciej Obuchowski all spoke at Big Data Technology Warsaw Summit 2023
      2. Also last month: Julien spoke at Data Council Austin
    4. Recent meetups:
      1. Last month: OpenLineage Meetup at Data Council Austin
      2. Last month: Data Lineage Meetup in Providence, RI
  2. Updates [Julien]
    1. OpenLineage in Airflow (AIP-53)
      1. Goal: make operators responsible for their own lineage
      2. Goal requires additions to the Airflow infrastructure
      3. Development process will progress in 3 phases
        1. add an OpenLineage library conforming to Airflow processes and coding style
        2. work on other providers, implementing OpenLineage methods
        3. add OpenLineage support to TaskFlow and Python operators
      4. Timeline: aiming for June Providers release
      5. We have begun with the Snowflake operator
      6. A significant benefit: operators will support OpenLineage themselves
    2. Static lineage support
      1. Next stage: add formal proposal to the OpenLineage repo, where it will be easier for members to comment
      2. To recap:
        1. OL is designed to capture lineage as pipelines run, as well as some info that is more static (schema, schema changes, etc.)
        2. Goal: capture lineage about views, etc., that have not run yet
        3. Focus will remain on everything that has been deployed
        4. Parallel discussion: lineage from job-less events, e.g., ad-hoc events
          1. challenge: these could pollute the namespace
        5. Basic proposal: to make the job name optional, which will require changes on the Marquez side, as well
      3. Comments are welcome
        1. See the #general channel in Slack for links to the two relevant docs
  3. Caching support for column lineage [Paweł]
    1. Personal opinion: the Spark integration is amazing because it extracts from the logical plan; also, it is easy to configure (requiring just 4 lines of code)
    2. Caching: a popular concept for Spark jobs
      1. a separate logical plan is used for cached datasets, meaning that two logical plans must be merged
      2. we will know how inputs are affecting outputs even when logical plans have been merged
  4. Open discussion
    1. A question about duplicated events when setting env variables [Anirudh]
      1. we have needed to employ filtering
      2. Spark reuses jobs for actions that are not really jobs

March 9, 2023 (10am PT)

Attendees:

Agenda:

Meeting:

Slides:

Notes:

February 9, 2023 (10am PT)

Attendees:

Agenda:

Meeting:

Notes:

January 12, 2023 (10am PT)

Attendees:

Agenda:

Meeting:

Notes:


December 8, 2022 (10am PT)

Attendees:

Agenda:

Meeting:

Notes:

November 10, 2022 (10am PT)

Attendees:

Agenda:

Meeting:

Notes:

October 13, 2022 (10am PT)

Attendees:

Agenda:

Meeting:

Notes:

September 8, 2022 (10am PT)

Attendees:

Agenda:

Meeting:

Notes:

August 11, 2022 (10am PT)

Attendees:

Agenda:

Meeting:

Notes:

July 14, 2022 (10am PT)

Attendees:

Agenda:

Meeting:

Slides: https://bit.ly/3c9o1U1

Notes:

June 9th, 2022 (10am PT)

Attendees:

Agenda:

Meeting:

Notes:

May 19th, 2022 (10am PT)

Agenda:

Attendees:

Meeting:

Notes:

Apr 13th, 2022 (9am PT)

Attendees:

Agenda:

Meeting info:

Notes:

Added

Fixed


Mar 9th, 2022 (9am PT)

Attendees:

Agenda:

Meeting:

Notes:

Feb 9th 2022 (9am PT)

Attendees:

Agenda:

Meeting:

Slides

Notes:


Jan 12th 2022 (9am PT)

Attendees:

Agenda:

Meeting: 

Slides

Notes:

0.4 release [Willy]:

0.5 preview [Willy]:

Tasklistener for OL Integration [Maciej]:

1.10 required modifying each DAG, which was cumbersome and not compatible with 2.1

2.1: lineage backend comparable to Apache Atlas’ old backend

2.3: Airflow Event Listener

Egeria Support for OpenLineage [Mandy]:

Open Discussion:

Proposal to convert licenses to SPDX [Michael]: no objections

Dec 8th 2021 (9am PT)

Attendees:

TSC:

And:

Agenda:

Meeting recording:

Slides

Notes:

Software Package Data Exchange (SPDX) Tags [Mandy]

Azure Purview Integration [Srikanth, Will]

Logging backends [Julien]

Discussion

Nov 10th 2021 (9am PT)

Attendees:

Agenda:

Meeting recording:

Slides

Notes:

SPDX tags:
shorter license headers => makes things easier.
https://spdx.org/licenses/
TODO: Mandy will propose something next time

Iceberg requirements:

Ryan:
  Proposal to have a logger style API.

Example: Have an OpenLineage API to add facets in a given context:
create facet for some context: Read datasets x, ... write dataset Y

=> broad agreement on principle

Open Questions:

Flink:

There is also a need for an OSS Trino integration; Tabular might contribute


Proxy Backend update [Mandy]

https://github.com/OpenLineage/OpenLineage/issues/256

Does the last version of a facet win? => yes
Need to document size constraint in OL (name length...) TODO: ticket

Oct 13th 2021

Attendees:

Slides

Sept 8th 2021

Aug 11th 2021

July 14th 2021

June 9th 2021