The OpenLineage Technical Steering Committee meets monthly on the second Thursday from 10:00am to 11:00am US Pacific. Here's the meeting info.

All are welcome.

Next meeting: December 14, 2023 (10am PT)

Tentative agenda:

November 9, 2023 (10am PT)

Attendees:

Agenda:

  1. Announcements
  2. Recent releases
  3. Recent additions to the Flink integration
  4. Recent additions to the Spark integration
  5. Proposal updates
  6. Discussion items
  7. Open discussion

Meeting:

Notes:

Announcements

Recent Releases

Recent Additions to the Flink Integration - Peter Huang (Apple)

Recent Additions to the Spark Integration - Paweł Leszczyński (GetInData)

Proposals in Discussion - Julien Le Dem (Project Lead)

October 12, 2023 (10am PT)

Attendees:

Agenda:

  1. Announcements
  2. Recent releases
  3. Airflow Summit recap
  4. Tutorial/demo: migrating to the OpenLineage Airflow Provider
  5. Discussion: observability for OpenLineage+Marquez
  6. Open discussion

Meeting:

Notes:

Announcements

Recent releases

Migration from the standalone OpenLineage package to the Airflow provider
    - Jakub explained how to migrate from the standalone openlineage-airflow package to the Airflow provider. He gave the reasons for becoming an Airflow provider, including making sure that collecting metadata in Airflow does not break Airflow itself.
    - Being a provider also keeps the integration code up to date with all the other providers and lets lineage extraction become part of those providers' operators. The provider package introduced a couple of changes, so the main question is how to migrate.
    - The simplest way is to install the provider package. One thing they would like to move away from is custom extractors, although it was and still is possible to write a custom extractor, controlled by the OPENLINEAGE_EXTRACTORS environment variable.
    - Jakub explained that if a user previously implemented a get_openlineage_facets method based on the old module and class, they do not need to worry because it is translated. However, once openlineage-airflow is uninstalled, importing the old class will fail, so the import path needs to change.
    - There are changes in configuration: the Airflow config now has a whole section called openlineage. Many of the features previously available in the openlineage-airflow package are also compatible with the provider.
    - People usually set OPENLINEAGE_URL, which is pretty common and still works. However, some entries in the openlineage config section take precedence over what was previously handled by environment variables.
    - Jakub gave examples of how the config takes precedence over OPENLINEAGE_URL. He mentioned that the documentation has more information on how it works.
    - He also explained how to add a new integration in the provider or in other providers that make use of the OpenLineage provider. They want to stop using the openlineage.common.dataset module and use just the classes from the OpenLineage Python client.
    - Jakub gave quick advice on how to grab information from the execution of an operator. Previously, when they had no control or influence over the operator, they needed to read its code and discover that, for example, the job ID was returned as an XCom.
    - Now that the integration lives in the query operator itself, they can change the code so that it saves the job IDs as attributes.
    - Jakub gave a quick demo of how it works. He used Breeze, which is the development environment and CLI for Airflow.
    - He ran Airflow 2.7.1 with the openlineage integration, which also starts Marquez; that is an option available in Airflow. The only extra package he used was Postgres, because the demo relies on the Postgres provider.
    - He showed how it works and mentioned that the beauty of a live demo is that he does not know whether it will work.
    - Jakub said that it should work in a minute.
    - Jakub typed in his password.
    - Jakub said that he doesn't need to run post scripts, but actually he doesn't have just to prove he doesn't have any.
    - Jakub said that it was working. He ran an example that uses Postgres as the backend.
    - Jakub said that previously there was nothing more to configure if a user had set OPENLINEAGE_URL.
    - Jakub explained that the name was changed because he had been experimenting with something in his development setup. Eventually, the events arrived in Marquez.
    - Jakub tried it again.
    - Jakub demonstrated three options for package installation and reran the example. Julien thanked Jakub and asked if there were any questions about migrating from the old OpenLineage integration to the new Airflow provider.
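
The precedence rule described above (entries in the openlineage config section win over the OPENLINEAGE_URL environment variable) can be illustrated with a minimal sketch. This is not the provider's code: the function name `resolve_openlineage_url` and the config option name `url` are hypothetical, chosen only for illustration. The provider documentation lists the real option names.

```python
import os
from typing import Mapping, Optional


def resolve_openlineage_url(
    config: Mapping[str, Mapping[str, str]],
    env: Mapping[str, str],
) -> Optional[str]:
    """Pick the backend URL, preferring the config section over the env var.

    `config` stands in for a parsed Airflow config (section -> option -> value).
    The option name "url" is a hypothetical placeholder, not a documented option.
    """
    section = config.get("openlineage", {})
    from_config = section.get("url")
    if from_config:
        # A value in the [openlineage] section takes precedence.
        return from_config
    # Fall back to the environment variable that still works after migration.
    return env.get("OPENLINEAGE_URL")


if __name__ == "__main__":
    parsed_config = {"openlineage": {"url": "http://config-host:5000"}}
    print(resolve_openlineage_url(parsed_config, os.environ))
```

Passing the environment as a plain mapping keeps the rule easy to test without touching the real process environment.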

Observability for OpenLineage and Marquez
    - Julien introduced the discussion topic of observability for OpenLineage and Marquez and invited Harel to start. Harel asked the audience how they ensure the reliability of lineage collection and what kind of observability they would like to see, such as distributed tracing.
    - He suggested gathering feedback in a Slack channel. Julien thought the metrics that Harel added to the Airflow integration were a good starting point for observability.
    - Hloomba mentioned enabling the retention policy on all environments and suggested observability on database retention to help with memory or CPU performance. Harel suggested enabling metrics out of the box and instrumenting more functions using Dropwizard, which serves as the web server.
    - Julien and William discussed having metrics on the retention job to track how well the data retention job keeps the database small.
    - Jeevan asked about the possibility of having an OpenLineage event for Spark applications, and Paweł Leszczyński explained the need for a parent run facet to identify each Spark action as part of a bigger entity, the Spark application. Jens suggested having unique job names for Spark actions and the parent Spark application.
    - Paweł explained that the current job name is constructed from the name of the operator or Spark logical node, with a dataset name appended, but they could make it optional to have a human-readable job name or use a hash of the logical plan to ensure uniqueness.
    - Harel mentioned having good news for Bob and suggested discussing it next week.
    - Jens added that unique job names would help distinguish each Spark action and its runs.
    - Julien asked if anyone had more comments on the topic.
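
The two naming options discussed above for Spark jobs (a human-readable name built from the logical node and dataset, or a hash of the logical plan for uniqueness) can be sketched as follows. The function `build_job_name`, the dot separator, and the eight-character digest are assumptions made for illustration; they are not the Spark integration's actual format.

```python
import hashlib
from typing import Optional


def build_job_name(
    app_name: str,
    node_name: str,
    dataset_name: Optional[str] = None,
    logical_plan: Optional[str] = None,
    use_plan_hash: bool = False,
) -> str:
    """Build a job name for a Spark action (hypothetical format)."""
    parts = [app_name, node_name]
    if use_plan_hash and logical_plan is not None:
        # A short digest of the logical plan makes the name unique per plan.
        digest = hashlib.sha256(logical_plan.encode("utf-8")).hexdigest()[:8]
        parts.append(digest)
    elif dataset_name:
        # Human-readable variant: append the output dataset name.
        parts.append(dataset_name)
    return ".".join(parts)


if __name__ == "__main__":
    print(build_job_name("my_app", "append_data", dataset_name="db_orders"))
    print(build_job_name("my_app", "append_data",
                         logical_plan="Project [id] +- Relation orders",
                         use_plan_hash=True))
```

The readable variant is easier to recognize in a lineage graph, while the hashed variant stays stable for identical plans and differs whenever the plan changes.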

Creating a registry for consumers and producers
    - Julien presented four items and discussed them in detail. The first item was about creating a registry for consumers and producers, which was summarized in a Google doc.
    - Two options were discussed, and the second proposal, with a self-contained repository, was preferred. Notes and open items were added to the document, and everyone was encouraged to contribute to it.

Proposing an optional contract for providers for Airflow operators
    - The second item was about proposing an optional contract for providers so that Airflow operators can expose their lineage. A proposal was also made to expose OpenLineage datasets directly in dbt's manifest file, and feedback was sought from dbt contributors.

Spark integration
    - The third item was about the Spark integration, which knows how to define unique datasets based on various data sources. However, custom data sources with their own implementation become opaque, so an optional contract was proposed to address this issue.

Certification process in the OpenLineage ecosystem
    - Julien discussed the need for a certification process in the OpenLineage ecosystem and suggested creating a document to start a discussion on how to implement it. He mentioned the possibility of providing dataset support for scan and action nodes, and creating a contract for data source implementations to expose lineage in relation nodes.
    - Julien also talked about the goal of having OpenLineage built into systems like Airflow, and encouraged attendees to share their opinions and ask questions on Slack.

September 14, 2023 (10am PT)

Attendees:

Agenda:

  1. Announcements
  2. Recent releases
  3. Demo: Spark integration tests in Databricks runtime
  4. Discussion items
  5. Open discussion

Meeting:

Notes:

  1. Announcements [Julien]
  2. Recent releases [Michael R.]
  3. Recent Releases
        - Michael shared a release update on 1.1.0, which includes support for configuring OpenLineage based on the Flink configuration, a fix for the problem of multiple Spark jobs writing to different datasets with the same job name, and missing Javadocs for the Java client. The default behavior can be turned off with an environment variable, and more information is available in the release notes.
        - Michael also thanked new contributors and mentioned bug fixes.
        - Maciej and Julien discussed the fact that Airflow changes are not included in the changelog because the OpenLineage Airflow integration is now part of the Airflow project.

  4. Demo: Spark integration tests in Databricks runtime [Pawel]
        - Pawel thanked the participants and introduced himself. He talked about upgrading the Spark version and the issues they faced with Databricks integration.
        - They had to test the changes manually, which was time-consuming. However, Databricks released a Java library that allowed them to run integration tests easily.
        - They also implemented a file transport to capture lineage events and verify that the events contain what they expected. This change helped speed up their work and improve code quality.
        - Julien asked if there were any questions.

  5. Discussion items
    1. OpenLineage Registry Proposal [Julien]
          - Julien explained the concept of OpenLineage and the need for a registry to define custom facets and producers. He shared a Google doc for feedback and listed the goals of the registry, including allowing third parties to register their implementation or custom extension and shortening the producer and schema URL values.
          - Custom facets are an easy way to extend the spec without requiring any approval, and producers and consumers can list the facets they produce or consume without requiring approval.
          - Mandy joined the call and expressed support for the idea of a registry but suggested that facets should be themed to avoid every producer defining their own facets. She proposed having a set of themes like data facets and meeting assets to cluster similar facets together in the registry.
          - Mandy expressed concern about naming custom facets after specific technologies, as it can lead to unnecessary duplication. Julien explained that the Airflow facet is specific to Airflow and provides benefits for generic things.
          - Core facets are sometimes added, and there are things specific to what people are doing. Mandy agreed and gave an example of how types are aligned with technologies, leading to duplication.
          - Ernie suggested adding a protocol for something in the registry to become a core facet. Julien explained that there is a template for adding to the spec and that custom facets can be defined as long as they have a prefix on the facet name and publish their schema.
          - To become a core facet, a proposal can be opened on the OpenLineage project, and usage of the custom facet can be leveraged to show that it works.
          - Mandy suggested having a state in the registry to show whether something is private, under proposal, or being adopted. Julien agreed and explained that some custom facets are specifically in the domain of the producer and should live in the registry, while others are shared.
          - Nick interjected and expressed his appreciation for the community aspect of OpenLineage. He suggested that producers provide examples and tests for consumers to use.
          - Mandy asked for clarification on what he meant by tests, and Nick explained that it could be a set of payloads or actually running the runtime to produce events.
          - Nick would like to see both examples and payloads for consumers and producers, respectively. He suggested that putting them in a registry would make things such as testing easier all around.
          - Julien explained that for the core spec, they have the definition of facets, a JSON schema for each facet, and documentation. They also added an example of each core facet and a test for the schema validation.
          - He suggested making it easier for producers to describe which facets they are producing.
          - Mandy asked who did the recent addition, and Julien explained that it was contributed by GetInData. Mandy thanked him for the information.
          - Nick agreed and suggested a framework for testing where producers can provide enough information for the tests to be generated.
          - Julien explained that they currently use schema validation, but it is just a small portion of what Nick is describing. Nick agreed that it is a start.
          - Julien suggested that producers need a registry mechanism to create their own facets and make them explicitly defined. Consumers would also benefit from a programmatic definition of the facets they are consuming.
          - He mentioned the OpenLineage website's ecosystem page and how it points to documentation, but a more programmatic definition would be great.
          - Nick agreed that it would be great to have a more programmatic definition of facets.
          - Julien proposed a registry and discussed the trade-offs between a self-contained registry and delegating to other registries. He also mentioned the benefits of using shorter URLs for custom facets.
          - Nick asked how other communities handle this and suggested looking at the successful practices of similar organizations. Paweł Leszczyński agreed.
          - There were questions about whether there should be a registry folder under spec or a registry repo in the OpenLineage GitHub organization, and how to handle core facets and versioning. The group discussed using an owners file in a repo to approve updates to the registry.
          - Julien emphasized that this was just to start the conversation and that there were many different ways to implement the registry.
          - Julien mentioned producing a list of schema URLs as a third party and discussed the benefits of a self-contained registry, including the ability to run checks against it and ensure consistency.
          - Julien explained that defining a name and attaching a list of information to it would allow for shorter URLs for custom facets.
          - Julien used ol: as an example of a shorter prefix for schema URLs.
          - Julien discussed using a JSON file to contain information about consumers and their defined names.
          - Julien compared the registry to the Maven repository and discussed using an owners file to approve updates to the registry.
          - Julien mentioned using CI to verify consistency and avoid breaking the registry.
          - Nick mentioned that smaller organizations might be more flexible, while larger organizations might have more legal requirements for using other registries.
          - Julien explained that data-driven decisions are important and mentioned the trade-off between how complicated a repository is to maintain and whether it is self-service for producers. He suggested that adding files to an existing open source repo works for small organizations, while big organizations may need legal approval to contribute.
          - He also mentioned the need for licensing and PR processes.
          - Nick responded with agreement.
          - Julien said he would share the draft doc on the OpenLineage Slack for feedback and follow the OpenLineage proposal process. He mentioned other models for the implementation, such as the Maven repository, and welcomed other examples.
          - He also asked if there were any questions or things people wanted to share about OpenLineage.
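
The idea from the registry discussion above, where a registered name such as ol: stands in for a longer schema URL, can be sketched like this. It is only an illustration of the concept: the registry layout, the `example` entry, the base URLs, and the function `expand_schema_ref` are assumptions, not part of the proposal.

```python
import json

# Hypothetical registry file content: registered name -> base URL for schemas.
REGISTRY_JSON = """
{
  "ol": "https://openlineage.io/spec/facets/",
  "example": "https://example.com/openlineage/facets/"
}
"""


def expand_schema_ref(ref: str, registry: dict) -> str:
    """Expand 'prefix:path' into a full URL when the prefix is registered."""
    prefix, sep, rest = ref.partition(":")
    if sep and prefix in registry:
        return registry[prefix] + rest
    # Unknown prefixes (including 'https') are returned unchanged.
    return ref


if __name__ == "__main__":
    registry = json.loads(REGISTRY_JSON)
    print(expand_schema_ref("ol:1-0-0/SchemaDatasetFacet.json", registry))
    print(expand_schema_ref("https://example.org/custom.json", registry))
```

Keeping the mapping in a single JSON file is what makes consistency checks in CI straightforward, since every registered name can be validated in one pass.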

August 10, 2023 (10am PT)

Attendees:

Agenda:

  1. Announcements
  2. OpenLineage 1.0 overview
  3. OpenLineage Airflow Provider update
  4. Discussion items
  5. Open discussion

Meeting:

Notes:

  1. Announcements [Julien]
    1. Ecosystem Survey still needs responses: https://bit.ly/ecosystem_survey
    2. OpenLineage graduated from the LF AI on 7/27
    3. The 3rd issue of our monthly newsletter shipped on 7/31. Sign up here: https://bit.ly/OL_news
    4. Upcoming meetups:
      1. 8/30 in S.F. at Astronomer
      2. 9/18 in Toronto at Airflow Summit
      3. Marquez meetup on 10/5 in S.F.
  2. LF AI Update [Michael R.]
    1. Topics covered by Julien in presentation to LF AI TAC for graduation included trends in adoption
  3. Recent releases [Michael R.]
    1. 1.0.0: Added

      Removed

      Changed

    2. 0.30.1: Added

      Changed

  4. Update on the OpenLineage Airflow Provider [Maciej]
    1. Pypi package version 1.0.1 available at: https://pypi.org/project/apache-airflow-providers-openlineage/1.0.1/
      1. installable with `pip install apache-airflow-providers-openlineage==1.0.1`
    2. Development progresses in the Airflow repo
    3. What's there already:
      1. Operator coverage:
        1. A lot of SQL-related operators, especially based on SQLExecuteQueryOperator
        2. Some GCP ones: BigQueryInsertJobOperator, GCSToGCSOperator
        3. Some Sagemaker-related operators
        4. FTP, SFTP operators
        5. Basic support for Python and Bash operators
      2. Changed:
        1. Airflow: do not run plugin if OpenLineage provider is installed #1999 @JDarDagran
        2. Python: rename config to config_class #1998 @mobuchowski
    4. Next steps
      1. Operator coverage:
        1. Popular operators around BigQuery: BigQueryUpsertTableOperator…
        2. Transport operators, like MySQLToSnowflakeOperator, GCSToBigQueryOperator
        3. S3 support, like S3CopyObjectOperator
        4. Add support for XCom-native operators like BigQueryGetDataOperator
        5. This list is not a promise
      2. "Core" changes
        1. Add interfaces around OpenLineage-implementing operators - making implementation more native
        2. XCom dataset support - this relates to XCom operators mentioned above
        3. Hook-level lineage support
  5. OpenLineage 1.0 with Static Lineage Update
    1. Putting things together for 1.0 release
      1. Important features and PRs
        1. Clarify docs on `RunEvent` lifecycle (link)


July 13, 2023 (8am PT)

Attendees:

Agenda:

  1. Announcements
  2. Updates
  3. Recent releases
  4. DataGalaxy integration demo
  5. Open discussion

Meeting:

June 8, 2023 (10am PT)

Attendees:

Agenda:

  1. Announcements
  2. Recent releases
  3. Static lineage progress update
  4. Open discussion

Meeting:

Notes:

  1. Announcements [Julien]:
    1. Our first annual ecosystem survey is live and accepting responses: https://bit.ly/ecosystem_survey. Your participation matters!
    2. We recently published the first issue of our monthly newsletter: https://mailchi.mp/18826f97904e/openlineage-news-may-2023. It's a great way to learn about upcoming meetups and recent blog posts, etc.
    3. Two meetups are happening soon:
      1. New York on 6/22 at Collibra's HQ: https://www.meetup.com/data-lineage-meetup/events/294065396/
      2. San Francisco on 6/27 at Astronomer: https://www.meetup.com/meetup-group-bnfqymxe/events/293448130/
    4. Upcoming talks:
      1. Paweł Leszczyński and Maciej Obuchowski, “Column Lineage is Coming to the Rescue,” Berlin Buzzwords, June 18-20, 2023
      2. Julien Le Dem and Willy Lulciuc, “Cross-platform Data Lineage with OpenLineage,” Data+AI Summit, June 28-29, 2023
      3. Maciej Obuchowski, “OpenLineage in Airflow: A Comprehensive Guide,” Airflow Summit, September 19-21, 2023
  2. Recent releases [Michael R.]:
    1. OpenLineage 0.25.0
    2. OpenLineage 0.26.0
    3. OpenLineage 0.27.1
    4. OpenLineage 0.27.2
  3. Static Lineage Progress Update [Paweł]:
    1. Overview
      1. Up to this point, operational/runtime metadata has been the focus of OpenLineage
      2. But there is also a need for lineage metadata about datasets not associated with runs
      3. To address this, a proposal has been created
        1. It answers the question: how can we add new data types to support static lineage?
        2. We decided to add two new types:
          1. job event
          2. dataset event
        3. A schemaURL provides a distinguishing mechanism
        4. Generic client code will not be affected
    2. Demo
      1. Approach taken: serialize and deserialize without modifying the database
    3. Conclusion
      1. This approach does not break existing usage scenarios while nonetheless adding new event types
      2. Changes will be implemented in the clients and the spec
    4. Q&A
      1. Initial work on Marquez to support static lineage has also been completed (adding the capability to distinguish between the event types), but Marquez is not currently able to store static lineage metadata
      2. Ability to convert from static to dynamic anticipated?
        1. Formats not very different
        2. Job event is subtype of a run event, making it easy to extract the data you care about
        3. Marquez UI should not change
      3. Ownership change notification possible?
        1. This data accessible via the REST API but not currently built in
        2. Contribution of such a feature would be welcome
        3. Alternative solution: add a listener
      4. Job events are static but not dataset events?
        1. Both are static events
  4. Discussion items
    1. Marquez search – how robust?
      1. Recommended: visit the GitHub repo and use GitPod to try it out (or use the up.sh script in the docker directory there to deploy locally)
        1. Tags are accessible in some facets in the UI, which would provide one way
    2. Row-based lineage – are there any facets or models that would help with this use case?
      1. We are trying to keep the metadata store smaller than the data itself
      2. Row-level lineage could be captured in a data model, which would be accessible in Marquez
      3. Challenge: the volume of data
      4. It might be helpful to have a doc about solutions for this in the project
    3. Another good forum for asking questions: https://bit.ly/OLslack
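
The static lineage update above notes that a schemaURL distinguishes the new job and dataset events from run events. A minimal sketch of how a consumer might branch on that field follows. The helper `event_kind` is hypothetical, and the sample URLs assume that the event type appears as the last segment of the URL fragment; check the published spec for the exact values.

```python
from urllib.parse import urlparse

KNOWN_EVENT_TYPES = {"RunEvent", "JobEvent", "DatasetEvent"}


def event_kind(event: dict) -> str:
    """Classify an event by the last segment of its schemaURL fragment."""
    fragment = urlparse(event.get("schemaURL", "")).fragment
    name = fragment.rsplit("/", 1)[-1]
    return name if name in KNOWN_EVENT_TYPES else "unknown"


if __name__ == "__main__":
    # Illustrative payloads; only the schemaURL field matters for this sketch.
    job_event = {
        "schemaURL": "https://openlineage.io/spec/2-0-0/OpenLineage.json#/$defs/JobEvent",
        "job": {"namespace": "demo", "name": "daily_orders"},
    }
    print(event_kind(job_event))
```

Because the check reads only one field, generic client code that ignores schemaURL keeps working, which matches the conclusion that existing usage is not broken.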

May 11, 2023 (10am PT)

Attendees:

Agenda:

Meeting:

Notes:

  1. Announcements [Julien]:
    1. Upcoming meetups
      1. Boston Data Lineage Meetup (tentatively scheduled for June)
      2. San Francisco OpenLineage Meetup at Astronomer (tentatively scheduled for June 27)
    2. Upcoming talks
      1. Paweł Leszczyński and Maciej Obuchowski, “Column Lineage is Coming to the Rescue,” Berlin Buzzwords, June 18-20, 2023
      2. Julien Le Dem and Willy Lulciuc, “Cross-platform Data Lineage with OpenLineage,” Data+AI Summit, June 28-29, 2023
      3. Maciej Obuchowski, “OpenLineage in Airflow: A Comprehensive Guide,” Airflow Summit, September 19-21, 2023
  2. Recent releases [Michael R.]
    1. OpenLineage 0.24.0
      1. Additions
        1. Support custom transport types #1795 @nataliezeller1
        2. Airflow: dbt Cloud integration #1418 @howardyoo @JDarDagran
        3. Spark: support dataset name modification using regex #1796 @pawel-big-lebowski
      2. https://github.com/OpenLineage/OpenLineage/releases/tag/0.24.0
      3. https://github.com/OpenLineage/OpenLineage/compare/0.23.0...0.24.0
  3. Custom transport types support [Natalie]
    1. OpenLineage supports a set of predefined transport types (HTTP, Kafka, others)
    2. Previously, adding a new or custom type required changing the transport config and transport factory to recognize the new type
    3. This change allows for extending functionality without having to change anything in the OpenLineage codebase
    4. Example: my company, where we work with an OpenMetadata backend
      1. This required a custom transport type
      2. With this change I can do this without changing anything
    5. Implementation
      1. New interface: TransportBuilder
      2. Implementable via methods:
        1. getType() // set in transport.type config param
        2. getConfig() // extension of TransportConfig, containing the required configuration
        3. Transport build(TransportConfig config) // builds a custom Transport instance based on the custom configuration
      3. Additionally you need to have a file (META-INF/services/io.openlineage.client.transports.TransportBuilder) that must be included in a jar in the class path, containing the fully qualified name of the implementing class
      4. Using the service loader pattern, implementations of TransportBuilder will be discovered and loaded at runtime.
    6. Q&A
      1. What are some use cases for other cool transport mechanisms?
        1. Native cloud, your queue system to send events
        2. Preferred way: the provider, data catalog, or something to implement over the lineage
        3. Maybe someone wants to do MSMQ or MQSeries
        4. You can also apply some transformation logic as part of your transport provider, so you can have your own ways of transporting the data
      2. Should we have some sort of repository where people can put the custom transport types that they're building in a single place?
        1. They can put them in the repo; I don't think we need a separate place, at least right now
  4. dbt Cloud integration [Jakub]
    1. Previously:
      1. The dbt-ol script invoked dbt metadata processing and sent OpenLineage events
      2. Worked only with a local dbt project
      3. How events were created:
        1. each run was a separate supported dbt node
        2. parent run reflected dbt-ol command call
    2. New dbt Cloud integration:
      1. each run in dbt Cloud might have multiple steps, each producing separate JSON files
      2. Each step is considered a parent run
      3. DbtArtifactProcessor was separated as a parent for DbtCloudArtifactProcessor and DbtLocalArtifactProcessor classes; the naming convention stays the same
      4. Used with DbtCloudRunJobOperator & DbtCloudJobRunSensor operators in Airflow integration, also makes use of DbtCloudHook to retrieve metadata from the dbt Cloud API
    3. Artifact retrieval and processing
      1. Due to a 10-sec thread timeout in the OpenLineage-Airflow integration, there is the following process for fetching dbt metadata:
        1. each run is a separate supported dbt node (models, tests, sources, snapshots)
        2. parent run reflects dbt-ol command call
      2. The issue will be resolved with the Airflow OpenLineage provider release (learn more about AIP-53 here)
  5. Discussion items
    1. Can we help ensure efficiency by narrowing the scope in some pragmatic ways? For example: is validation necessary in the case that an OpenLineage client is being used to send events? Are there other similar cases where validation might not be necessary?
      1. Work on adding validation to the project is ongoing, e.g., in the proxy where there is some schema validation happening
      2. It would be useful to have some testing facility, e.g., for people consuming OpenLineage and potential implementers
      3. From a producer's point of view, we could check if the consumer consumes them; this would have to be specific to each consumer
      4. We could have a dataset of events that contain all the assets, which would be useful for anyone who wants to do their own testing – like examples of all the facets that exist (instead of having to create them by hand for internal teams)
      5. Maybe just pump demo payloads out to disk and keep them somewhere
    2. Improving column lineage: there are lots of other elements that would be useful
      1. People want to add selected rules and filters
        1. Is there an anticipated traffic level, typical volume in a plan for design lineage
      2. Column metadata is well covered by other standards in the industry, but there are some lineage ones related to expected performance, flags that people want such as for PII data that's being managed on that edge, etc.
      3. One question: are those properties of a transformation itself, or just a property of a resulting column?
        1. In some cases, transformation; in others the actual edge, which is interesting. Option: have the ability to define the kinds of edges
        2. for PII, there is a tagging facet we were discussing that is still in progress
        3. Action item: get feedback on this and complete it
    3. Spark integration: merge into and aggregate functions don't provide column lineage
      1. A fix has recently been made, but when will this be released?
      2. Anyone can request a release in the #general Slack channel. You're encouraged to do this if you'd like a fix before the next regularly scheduled release (on the first work day of the month).
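
The custom transport mechanism described in item 3 above is a Java interface (TransportBuilder) whose implementations are discovered with the service loader pattern. As a rough illustration of the same idea in Python, the sketch below selects a builder by the configured transport type. All class and function names here are hypothetical stand-ins, and the explicit `builders` list takes the place of service-loader discovery; this is not the OpenLineage client API.

```python
from abc import ABC, abstractmethod
from typing import Iterable, List


class Transport(ABC):
    @abstractmethod
    def emit(self, event: dict) -> None:
        """Send one lineage event to the backend."""


class TransportBuilder(ABC):
    @abstractmethod
    def get_type(self) -> str:
        """Return the value matched against the configured transport type."""

    @abstractmethod
    def build(self, config: dict) -> Transport:
        """Create a transport instance from its configuration."""


class InMemoryTransport(Transport):
    """Toy transport that keeps events in a list (illustration only)."""

    def __init__(self, config: dict) -> None:
        self.config = config
        self.events: List[dict] = []

    def emit(self, event: dict) -> None:
        self.events.append(event)


class InMemoryTransportBuilder(TransportBuilder):
    def get_type(self) -> str:
        return "in-memory"

    def build(self, config: dict) -> Transport:
        return InMemoryTransport(config)


def create_transport(config: dict, builders: Iterable[TransportBuilder]) -> Transport:
    """Pick the builder whose type matches config['type'] and build with it."""
    by_type = {builder.get_type(): builder for builder in builders}
    transport_type = config.get("type")
    if transport_type not in by_type:
        raise ValueError(f"no transport builder registered for {transport_type!r}")
    return by_type[transport_type].build(config)


if __name__ == "__main__":
    transport = create_transport({"type": "in-memory"}, [InMemoryTransportBuilder()])
    transport.emit({"eventType": "START"})
```

The factory never names a concrete transport, so a new type can be added by supplying another builder without changing the factory code, which is the benefit described in the notes.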

April 20, 2023 (10am PT)

Attendees:

Agenda:

  1. Announcements
  2. Updates (new!)
    1. OpenLineage in Airflow AIP
    2. Static lineage support
  3. Recent release overview
  4. A new consumer
  5. Caching support for column lineage
  6. Discussion items
    1. Snowflake tagging
  7. Open discussion

Meeting:

Notes:

  1. Announcements [Julien]
    1. A New York meetup will be happening on 4/26 at the Astronomer offices in the Flatiron District
    2. Julien Le Dem will be speaking at the Data+AI Summit in June: "Cross-platform Data Lineage with OpenLineage"
    3. Recent talks:
      1. Last month: Ross Turk, Paweł Leszczyński and Maciej Obuchowski all spoke at Big Data Technology Warsaw Summit 2023
      2. Also last month: Julien spoke at Data Council Austin
    4. Recent meetups:
      1. Last month: OpenLineage Meetup at Data Council Austin
      2. Last month: Data Lineage Meetup in Providence, RI
  2. Updates [Julien]
    1. OpenLineage in Airflow (AIP-53)
      1. Goal: make operators responsible for their own lineage
      2. Goal requires additions to the Airflow infrastructure
      3. Development process will progress in 3 phases
        1. add an OpenLineage library conforming to Airflow processes and coding style
        2. work on other providers, implementing OpenLineage methods
        3. add OpenLineage support to TaskFlow and Python operators
      4. Timeline: aiming for June Providers release
      5. We have begun with the Snowflake operator
      6. A significant benefit: operators will support OpenLineage themselves
    2. Static lineage support
      1. Next stage: add formal proposal to the OpenLineage repo, where it will be easier for members to comment
      2. To recap:
        1. OL is designed to capture lineage as pipelines run, as well as some info that is more static (schema, schema changes, etc.)
        2. Goal: capture lineage about views, etc., that have not run yet
        3. Focus will remain on everything that has been deployed
        4. Parallel discussion: lineage from job-less events, e.g., ad-hoc events
          1. challenge: these could pollute the namespace
        5. Basic proposal: to make the job name optional, which will require changes on the Marquez side, as well
      3. Comments are welcome
        1. See the #general channel in Slack for links to the two relevant docs
  3. Caching support for column lineage [Paweł]
    1. Personal opinion: the Spark integration is amazing because it extracts from the logical plan; also, it is easy to configure (requiring just 4 lines of code)
    2. Caching: a popular concept for Spark jobs
      1. a separate logical plan is used for cached datasets, meaning that two logical plans must be merged
      2. we will know how inputs are affecting outputs even when logical plans have been merged
  4. Open discussion
    1. A question about duplicated events when setting env variables [Anirudh]
      1. we have needed to employ filtering
      2. Spark reuses jobs for actions that are not really jobs
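
For readers unfamiliar with the Spark integration's setup mentioned in the caching discussion above, here is a rough sketch of what a four-setting configuration might look like. The notes do not say which four lines were meant, so this is one plausible set based on the integration's documented settings in recent releases; key names and the Maven artifact coordinates have changed between versions (newer artifacts carry a Scala suffix), so check the docs for the release in use. `openlineage_spark_conf` and `apply_conf` are hypothetical helper names, and the version and URL are supplied by the caller.

```python
def openlineage_spark_conf(version, url):
    """Build Spark settings that attach the OpenLineage listener.

    `version` is the integration release and `url` is the lineage
    backend endpoint; both are supplied by the caller.
    """
    return {
        # Pull the integration jar from Maven (coordinates may differ by release).
        "spark.jars.packages": f"io.openlineage:openlineage-spark:{version}",
        # Register the listener that reads the logical plan.
        "spark.extraListeners": "io.openlineage.spark.agent.OpenLineageSparkListener",
        # Send events over HTTP to the given endpoint.
        "spark.openlineage.transport.type": "http",
        "spark.openlineage.transport.url": url,
    }


def apply_conf(builder, conf):
    """Apply each setting to any builder exposing .config(key, value)."""
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder
```

With PySpark installed, `apply_conf(SparkSession.builder.appName("demo"), conf).getOrCreate()` would start a session with these settings; the same keys can also be passed as `--conf` arguments to `spark-submit`.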

March 9, 2023 (10am PT)

Attendees:

Agenda:

Meeting:

Slides:

Notes:

February 9, 2023 (10am PT)

Attendees:

Agenda:

Meeting:

Notes:

January 12, 2023 (10am PT)

Attendees:

Agenda:

Meeting:

Notes:


December 8, 2022 (10am PT)

Attendees:

Agenda:

Meeting:

Notes:

November 10, 2022 (10am PT)

Attendees:

Agenda:

Meeting:

Notes:

October 13, 2022 (10am PT)

Attendees:

Agenda:

Meeting:

Notes:

September 8, 2022 (10am PT)

Attendees:

Agenda:

Meeting:

Notes:

August 11, 2022 (10am PT)

Attendees:

Agenda:

Meeting:

Notes:

July 14, 2022 (10am PT)

Attendees:

Agenda:

Meeting:

Slides: https://bit.ly/3c9o1U1

Notes:

June 9th, 2022 (10am PT)

Attendees:

Agenda:

Meeting:

Notes:

May 19th, 2022 (10am PT)

Agenda:

Attendees:

Meeting:

Notes:

Apr 13th, 2022 (9am PT)

Attendees:

Agenda:

Meeting info:

Notes:

Added

Fixed


Mar 9th, 2022 (9am PT)

Attendees:

Agenda:

Meeting:

Notes:

Feb 9th 2022 (9am PT)

Attendees:

Agenda:

Meeting:

Slides

Notes:


Jan 12th 2022 (9am PT)

Attendees:

Agenda:

Meeting: 

Slides

Notes:

0.4 release [Willy]:

0.5 preview [Willy]:

Tasklistener for OL Integration [Maciej]:

Airflow 1.10: required modifying each DAG, which was cumbersome and not compatible with 2.1

Airflow 2.1: lineage backend comparable to Apache Atlas’ old backend

Airflow 2.3: Airflow Event Listener

Egeria Support for OpenLineage [Mandy]:

Open Discussion:

Proposal to convert licenses to SPDX [Michael]: no objections

Dec 8th 2021 (9am PT)

Attendees:

TSC:

And:

Agenda:

Meeting recording:

Slides

Notes:

Software Package Data Exchange (SPDX) Tags [Mandy]

Azure Purview Integration [Srikanth, Will]

Logging backends [Julien]

Discussion

Nov 10th 2021 (9am PT)

Attendees:

Agenda:

Meeting recording:

Slides

Notes:

SPDX tags:
shorter license headers => makes things easier.
https://spdx.org/licenses/
TODO: Mandy will propose something next time

Iceberg requirements:

Ryan:
  Proposal to have a logger style API.

Example: Have an OpenLineage API to add facets in a given context:
create facet for some context: Read datasets x, ... write dataset Y

=> broad agreement on principle
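
To make the logger-style idea concrete, here is a purely illustrative sketch of what such an API could look like. Nothing in it is an existing OpenLineage or Iceberg API: `lineage_context`, `add_facet`, and `LineageContext` are hypothetical names, and the choice to silently ignore facets added outside a context (like a logger with no handler) is an assumption, not something decided in the meeting.

```python
from contextlib import contextmanager
from contextvars import ContextVar

# Holds the active lineage context, if any (hypothetical design).
_current = ContextVar("lineage_context", default=None)


class LineageContext:
    """Collects datasets and facets for one unit of work."""

    def __init__(self, reads, writes):
        self.reads = list(reads)
        self.writes = list(writes)
        self.facets = {}


@contextmanager
def lineage_context(reads, writes):
    """Open a context: 'read datasets x, ... write dataset Y'."""
    ctx = LineageContext(reads, writes)
    token = _current.set(ctx)
    try:
        yield ctx
    finally:
        # Restore whatever context was active before this one.
        _current.reset(token)


def add_facet(name, value):
    """Attach a facet to the active context; no-op when none is active."""
    ctx = _current.get()
    if ctx is None:
        return False
    ctx.facets[name] = value
    return True
```

Library code deep in a call stack could then call `add_facet("scan", {...})` without being handed the context explicitly, much as it would call a logger, while the integration that opened the context decides how to emit the collected facets.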

Open Questions:

Flink:

There is also a need for an OSS Trino integration; Tabular might contribute


Proxy Backend update [Mandy]

https://github.com/OpenLineage/OpenLineage/issues/256

Does the last version of a facet win? => yes
Need to document size constraint in OL (name length...) TODO: ticket
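
The "last version of a facet wins" answer can be illustrated with a small sketch. It assumes that "wins" means whole-facet replacement keyed by facet name (not a deep merge of facet contents) and that events are processed in arrival order; `merge_run_facets` is a hypothetical helper, not code from the proxy backend.

```python
def merge_run_facets(events):
    """Combine run facets across events; a later facet replaces an earlier one."""
    merged = {}
    for event in events:  # assumed to be in arrival order
        facets = event.get("run", {}).get("facets", {})
        # dict.update overwrites existing keys, so the last version wins.
        merged.update(facets)
    return merged
```

Under this reading, a facet sent in a START event and again in a COMPLETE event ends up with only the COMPLETE event's contents, while facets that appear in just one of the events are kept as sent.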

Oct 13th 2021

Attendees:

Slides

Sept 8th 2021

Aug 11th 2021

July 14th 2021

June 9th 2021