The OpenLineage Technical Steering Committee meets monthly on the second Wednesday from 9:30am to 10:30am US Pacific. Here's the meeting info.

All are welcome.

Table of Contents

Next meeting: May 8, 2024 (9:30am PT)

April 10, 2024 (9:30am PT)

Attendees:

TSC:
- Julien Le Dem, OpenLineage project lead, LF AI & Data
- Michael Robinson, Community Manager, Astronomer
- Harel Shein, Lineage at Datadog
And:
- Sheeri Cabral, Product Manager, ETL, Collibra
- Eric Veleker, Partnerships, Atlan
- Jens Pfau, Engineering Manager, Lineage, Google
- David Twaddell, Architect, HSBC

Agenda:

Announcements
Recent release highlights
Discussion items
- supporting job-to-job dependencies in the spec
- improving naming conventions
Open discussion

Meeting:

Slides

March 13, 2024 (9:30am PT)

Tentative agenda:

Announcements
Recent release 1.9.1 highlights
Scala 2.13 support in Spark overview by @Damien Hawes
Circuit breaker in Spark & Flink, built-in lineage in Spark @Paweł Leszczyński
Discussion items
Open discussion

February 8, 2024 (10am PT)

Attendees:

TSC:
- Julien Le Dem, OpenLineage project lead, LF AI & Data
- Michael Robinson, Community Manager, Astronomer
- Damien Hawes, Booking.com
- Harel Shein, Datadog
- Maciej Obuchowski, Software Engineer, GetInData, OpenLineage committer
- Mike Collado, Sr. Software Engineer, Snowflake
And:
- Suraj Gupta, Atlan
- Eric Veleker, Atlan
- Sheeri Cabral, Product Manager, Collibra
- Ernie Ostic, IBM/Manta

Agenda:

Recent releases
Announcements
Coming soon: simplified job hierarchy in the Spark integration
Discussion items
Open discussion

Meeting:

- http://youtube.com/watch?v=O7-ZNCbt880
- http://youtube.com/watch?v=z-MdLO3lxR8
- http://youtube.com/watch?v=hvUIaziS2TI
- http://youtube.com/watch?v=Ql7DR59wdpE

Notes:

Summary

1. We have added a new communication resource, a LinkedIn company page.
2. We announced a new committer, Damien Hawes, from Booking.com, who has made significant contributions to the project.
3. Astronomer and Collibra are co-sponsoring a data lineage meetup on March 19th at the Microsoft New England Conference Center.
4. Members have talks upcoming at Kafka Summit and Data Council.
5. We discussed upcoming improvements to job hierarchy in Spark and how this can help answer questions about job scheduling and dependencies.
6. Damien shared his contributions to the Apache Spark integration, specifically addressing versioning conflicts with Scala.
7. Eric provided a general update on the interest in and adoption of OpenLineage, particularly in the enterprise space.
8. Atlan is considering releasing a DAG (Directed Acyclic Graph) instead of a plugin to help users with configuration and troubleshooting.
9. The next monthly call will be held at a different "location," and participants were encouraged to look out for the updated Zoom link.

Outline

Welcome and Announcements
- Michael Robinson welcomes everyone to the monthly call of the OpenLineage TSC, which is recorded and archived on the wiki. He mentions that the list has one more person since the last meeting and teases an exciting announcement.
- Michael Robinson shares a new communication resource, the LinkedIn company page, and asks for quick introductions from the participants.
- Harel introduces himself as an OpenLineage committer and hints at an interesting workplace announcement for the next meeting.
- Other participants introduce themselves, including their roles and companies.

Introductions
- Maciej introduces himself as a software engineer and warns about possible background noise from copyrighted music.
- Eric, Suraj, and Damien introduce themselves and express their excitement to be part of the call.

Agenda Overview and New Committer Introduction
- Michael Robinson outlines the agenda for the call, including announcements, recent releases, and discussion items.
- Michael Robinson announces a change in the Zoom link and welcomes a new committer, Damien from Booking.com, who has made significant contributions to the project.
- Harel and Michael Robinson express their gratitude for Damien's contributions and explain how he added support for multiple Spark versions for the integration, which saved a lot of time and effort for the community.

Upcoming Events
- Michael Robinson announces a data lineage meetup on March 19th at the Microsoft New England Conference Center, co-sponsored by Astronomer and Collibra. More details and a sign-up link are available on Meetup.com.
- An updated agenda and information about speakers will be provided soon.
- Michael Robinson announces a talk at Kafka Summit on March 19th called "OpenLineage for Stream Processing" by Maciej and Paweł. There will also be a data standardization panel moderated by Julien at Data Council on March 27th, with participants to be finalized soon.

Recent Releases and Contributions
- Michael Robinson shares about the successful first London meetup with speakers from Decathlon, Confluent, and Astronomer. Decathlon's lineage graph was showcased, and more details about their architecture and use case will be shared in the future.
- OpenLineage 1.8 was released with contributions from Damien, Mata, Meta, Bertel, Peter, and Natalie.
- Michael Robinson thanks all contributors and welcomes Matea's first contributions to the project. Details on OpenLineage 1.8 are available in the GitHub repo and the docs.
- Maciej is asked if he would like to share his screen.

UI Feature and Streaming Integration
- Maciej introduces two topics for the call, beginning with a description of how they think about job hierarchy in Spark. He discusses the job hierarchy and how it can answer questions about why a job ran at a certain time.
- He gives an example of a parent job and how it schedules events. He explains that a Spark job can produce multiple events and actions, but they want to simplify this to one event at the start and one at the end, with each action having a parent job.
- He gives a more complex example of a sequence of events for a Spark application. He explains that OpenLineage consumers can collapse the information they receive into a simplified view of the Spark application.
- Maciej explains the new UI feature that allows for a top-level view of Spark jobs without distinguishing the internal actions. He also mentions the higher-level execution view that allows users to see what is scheduled across the platform.
- Harel praises the addition and mentions that it helps visualize dependencies and governance, making it easier to answer use cases visually. Maciej adds that COMPLETE events allow users to know when a Spark job ended.
- Michael Collado asks how well the feature works in the Databricks environment. Maciej acknowledges this as a great question and says they need to try it more in Databricks, as it always differs slightly from the standard.
- Maciej explains that they wanted to have a streaming integration with Flink, which is currently the most popular streaming system. They had an idea for a Flink integration, but the resulting integration code was not very clean and relied on a lot of reflection and instance checking.
- They decided to build a workaround to get as much value as they could and to propose an interface that would allow a better integration. They had other things to do in the meantime, but then they discovered that a Flink proposal to support a customized job lineage listener had been created, which introduced several interfaces.
- They realized that a nearly perfect interface had already been created, but it carried only one piece of information. The problem was that the FLIP had already passed, and the listener would have to know every connector implementation to get information from it, which is impossible.
- Maciej explains the limitations of open source connectors and how they affect the integration process. They resolved this issue by adding a dataset interface for connectors to implement and by having the lineage vertex expose the list of datasets it actually touches.
- This breaks the coupling between the connector and the listener, because both depend on an interface that basically doesn't change, and any changes are forward compatible only.
- The end result is an interface that closely resembles OpenLineage but is not named OpenLineage. This made it easier to convince the community to create the interface, since it avoids concerns about depending on a third-party library, while still allowing a clear one-to-one mapping without breaking anything.
- Maciej asks if there are any questions.
- Michael Robinson thanks Maciej for his contribution and acknowledges that he joined the call after work hours. Julien also thanks Maciej for coordinating with the Flink committers on this great collaboration.
- Eric offers to give a general update at the end of the call.
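
The collapsing of per-action events into a single application-level view, as described in this section, could be sketched roughly as follows. This is only an illustration under stated assumptions, not code from the Spark integration or from any consumer: the helper names collapse_by_parent and child_event and all sample values are hypothetical (real run IDs are UUIDs), and only the shape of the parent run facet (run.facets.parent with a nested run.runId) follows the OpenLineage spec.

```python
from collections import defaultdict


def collapse_by_parent(events):
    """Group job names under the run at the top of each hierarchy.

    Hypothetical consumer-side helper: events carrying a parent run facet
    are filed under the parent's runId; events without one are treated as
    top-level runs and filed under their own runId.
    """
    groups = defaultdict(list)
    for event in events:
        run = event["run"]
        parent = run.get("facets", {}).get("parent")
        key = parent["run"]["runId"] if parent else run["runId"]
        name = event["job"]["name"]
        if name not in groups[key]:  # START and COMPLETE share a job name
            groups[key].append(name)
    return dict(groups)


def child_event(event_type, run_id, job_name, parent_run_id, parent_job):
    """Build a sample event for one Spark action (illustrative values only)."""
    return {
        "eventType": event_type,
        "run": {
            "runId": run_id,
            "facets": {
                "parent": {
                    "run": {"runId": parent_run_id},
                    "job": {"namespace": "spark", "name": parent_job},
                }
            },
        },
        "job": {"namespace": "spark", "name": job_name},
    }


events = [
    {"eventType": "START", "run": {"runId": "app-run"},
     "job": {"namespace": "spark", "name": "nightly_app"}},
    child_event("START", "action-1", "nightly_app.load_orders", "app-run", "nightly_app"),
    child_event("COMPLETE", "action-1", "nightly_app.load_orders", "app-run", "nightly_app"),
    child_event("COMPLETE", "action-2", "nightly_app.write_report", "app-run", "nightly_app"),
    {"eventType": "COMPLETE", "run": {"runId": "app-run"},
     "job": {"namespace": "spark", "name": "nightly_app"}},
]

collapsed = collapse_by_parent(events)
```

Here every event ends up under the single application run, which is the kind of top-level view a consumer might display while still keeping the individual actions available.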

Open Discussion
- Michael Robinson moves on to open discussion and asks if anyone has any discussion items.

Update on Spark Integration
- Damien shared his experience adding Scala 2.13 support to the Apache Spark integration. They deployed the OpenLineage Spark integration into their own internal pipelines and it worked well.
- However, when they moved to new clusters running a different version of Scala, the jobs failed due to conflicting Scala major versions. The reason is that when Java code is compiled, the compiler embeds the full class names and full type signature of each method it calls, including its return type and its parameter types.
- When calling a method in Apache Spark, if the method's type signature differs between Scala versions, the JVM throws a runtime error. The solution is to compile the entire project against the Apache Spark libraries built for each Scala version.
- Damien explains how to configure the app to consume the relevant jars and run integration tests for different versions of Spark, with the exception of Spark 2.4, which only uses Scala 2.12. Maciej thanks Damien for his contribution and expresses a desire for faster reviews.
- Michael Robinson congratulates Damien on becoming a committer and thanks him for his contributions. Eric provides a general update on interest in the Airflow and Spark integrations, with a focus on enterprise adoption and versioning conflicts.
- They plan to release a DAG instead of a plugin to help with configuration. Michael Robinson concludes the call and announces the next meeting.

January 11, 2024 (10am PT)

Attendees:

TSC:
- Julien Le Dem, OpenLineage project lead, LF AI & Data
- Harel Shein, Datadog Engineering
- Michael Robinson, Community Manager, Astronomer
And:
- Tatiana Al-Chueyr, Staff Software Engineer, Astronomer
- Alex Jaglale, Executive, DataGalaxy
- Jens Pfau, Engineering Manager, Google
- Eric Veleker, Atlan

Agenda:

Recent releases
Announcements
Discussion items
Open discussion

Meeting:

- http://youtube.com/watch?v=6_XOON9kf6E
- http://youtube.com/watch?v=itbm8hHAtPQ

Notes:

Summary

1. We closed our first-ever annual ecosystem survey, and the results will be published soon.
2. There is a meetup coming up on January 31st in London, which will be our first in London. It will be an in-person event.
3. We have a talk at the Kafka Summit in London in March, with key contributors speaking.
4. We recently released version 1.7.0, with important compatibility notice for the Airflow integration.
5. There was a discussion about possible improvements to job hierarchy semantics in the Spark integration.
6. Julien updated the registry proposal and it is close to being implemented.
7. Eric (Atlan) shared that there is growing demand and adoption of OpenLineage, and organizations are pressing forward due to the perceived business value.
8. Eric mentioned the need for better documentation and support for different versions and integrations.
9. Jens suggested expanding the integration matrix to include more dimensions, such as types of data sources and facets.

Outline

Announcements and events
- Michael outlines the agenda for the meeting, including announcements, recent releases, updates on Airflow provider and Spark integration, and discussion items.
- Michael announces upcoming events, including the publication of the annual ecosystem survey results, a meetup in London on January 31st, and a talk at Kafka Summit in March.
- Alex asks if the meetup will be online as well, and Michael clarifies that it will be in person only.

Release updates
- Michael Robinson informs participants about the recent release of version 1.7.0 and mentions an important compatibility notice regarding the Airflow integration. He encourages use of the official OpenLineage Airflow Provider and explains that the transition is easy and straightforward.
- He also mentions the addition of the parent run facet to all events in the Airflow integration and the removal of support for Airflow 2.8 and later from the standalone integration. Michael thanks all contributors, including Kacper Muda, who provided fixes for the release.

Spark job hierarchy
- Michael Robinson plays a recorded update from Maciej, who provides important updates on the Airflow Provider and on ongoing discussions of possible improvements to job hierarchy semantics in the Spark integration. Julien acknowledges the recording, and Maciej provides updates on recent changes to the Airflow Provider.
- He mentions the addition of support for multiple GCS industry-related operators and bug fixes.
- He proposes having more granular event semantics and consensus on having a single parent run for all actions.
- The inputs and outputs of the job hierarchy for Spark, and the need for more information about how they are related, are discussed. They mention the OpenLineage feature called "parent" that allows specifying that a run was scheduled by something else or is a sub-run.
- They agree on having a single parent run that contains all actions but note that it is still being discussed.
- Maciej explains how the application run and parent run work, allowing customers to correlate jobs and understand execution. He mentions the power this gives to consumers who want to display aggregate data and make sure users understand what jobs look like.
- Maciej shares links to issue #1672 and the PyPI doc and download for the Airflow Provider, encouraging questions or contributions to the ongoing discussion.

Simplify job hierarchy
- Michael invites discussion on issue #1672 and asks if anyone wants to add a topic. Jens brings up the "simplify job hierarchy for Spark" issue and suggests having a quick discussion on it.
- Julien explains that the explanation they just saw was recorded because Maciej couldn't join the meeting. Jens realizes Maciej is not there and will discuss it with him separately.
- Michael asks if there are any other items for discussion.

Registry proposal
- Julien updates the registry proposal and shares his screen to show the recent updates, including clarification for consumers to independently discover and support custom facet opening, acceptance guidelines for claiming a name and entity, and examples of how to use them. He believes it's close enough to implement the first version and see where they're going.
- Julien reviewed the recording of a meeting and integrated feedback. The core facets will be moved in the registry under the core name and follow the same rules as all other custom facets.
- Examples for each facet will be moved to the registry as well, ensuring consistency and validation. Additional metadata is available to show documentation on use cases.
- The first version of the registry will be managed and improved over time.
- Michael Robinson expressed happiness with the progress and thanked Sheeri for driving the conversation.
- Jens asked about the format for spec versions and Julien explained that it's currently exact only but could be extended. He suggested tagging Sheeri to discuss further on the extension of these versions.

Learning since last call
- Eric shared some learning since the last call.
- Eric reports growing demand and adoption of OpenLineage, with no hesitancy from organizations due to its perceived business value. He mentions the need for better documentation to accelerate adoption and optimize for speed in two areas: proper versions of everything in place and diagnosing if there are needs that the community needs to build out for support.
- Eric suggests an Airflow plugin to provide a report on misconfigurations and help stakeholders understand the details. He also mentions the need for access to the boundary or threshold of support to get organizations up and running and showing business value.
- Michael Robinson asks Eric about a specific document that would be helpful for version requirements and coverage information. Eric explains that the plugin they developed will identify things that need to be done to press forward for the organization implementing the lineage.
- He gives an example of an organization using AWS Glue and how they had to throw on the brakes because they didn't have knowledge of the community's investment in building up support where it's needed. Eric puts out a problem statement about the need for all the folks adjacent to the core community to know the boundary or threshold of support to get organizations implemented and up and running.
- Michael Robinson and Julien acknowledge the information.

Column lineage in Spark
- Eric explains that they have been reaching out to the community for information about coverage, but having it in one place would be helpful. He encourages opening issues and shares that a new resource on the subject is available.
- Julien agrees.
- Michael Robinson asks if anyone else has similar experiences.
- Eric asks if anyone else has experienced the same.
- Jens confirms that he understands the question and suggests having the information in a single place would be helpful.
- Eric thanks Jens.
- Michael Robinson shares a new resource on the subject and encourages opening issues. He asks Eric about plans for a plugin.
- Eric was looking at the repo and asks Michael to repeat the question. Michael asks about plans for the plugin.
- Eric suggests following up in the community slack and promises to contribute.
- Michael Robinson acknowledges Eric's contribution.

Integration matrix
- Jens suggests expanding on the integration matrix and mentions issues with Iceberg support in Spark.
- Eric reflects on Jens' suggestion.
- Michael Robinson thanks Jens for the input.

December 14, 2023 (10am PT)

Attendees:

TSC:
- Julien Le Dem, OpenLineage project lead, LF AI & Data
- Harel Shein, Datadog Engineering
- Michael Robinson, Community Manager, Astronomer
- Mandy Chessell, Egeria Project Lead
- Pawel Leszczynski, Software Engineer, Astronomer/GetInData
And:
- Eric Veleker, Atlan

Agenda:

Recent releases
Announcements
Proposal updates
Open discussion

Meeting:

- http://youtube.com/watch?v=HW3Dd75UXLY
- http://youtube.com/watch?v=ozxLWjSOfiY
- http://youtube.com/watch?v=GN-ic0bjNoo

Notes:

Summary

1. Harel Shein provided announcements about upcoming meetups and shared metrics on community growth.
2. Harel Shein discussed the release of version 1.6.2, highlighting new features and bug fixes.
3. Harel Shein shared metrics on contributors and commits, showing an increase in both.
4. Jens Pfau presented two proposals for column-level lineage, focusing on transformation types and descriptions.
5. Mandy Chessell suggested including the name of the masking function as an additional property for masking transformations.
6. Harel Shein expressed appreciation for the proposals and encouraged community members to review and provide feedback.
7. Eric Veleker expressed gratitude for the momentum and adoption of open lineage, thanking the community for their hard work.
8. Harel Shein echoed Eric's sentiments and acknowledged the project's growth and industry standard status.
9. Harel Shein thanked all contributors and adopters for their contributions to the community.

Outline

- Michael Robinson from Astronomer welcomes everyone and goes through the agenda, which includes brief announcements, a release update, metrics on community growth, an update on dataset support in Spark, and open discussion items. He also reminds participants about the ecosystem survey and announces an upcoming meetup in London co-hosted with Confluent.
- He shares the success of a recent event in Warsaw and thanks contributors and attendees.
- Michael Robinson provides details on the recent release (1.6.2) which includes support for version 1.5, metadata sending without running a dbt command, and improvements to job listeners and lineage in Flink and Spark. He also mentions bug fixes and contributions from new contributors.
- He shares exciting news about streaming job support in the Marquez project and expects a larger release soon.
- Michael Robinson moves on to share some metrics on momentum and new partners added in the last year, including Google Cloud, Grai, and Metaphor. He directs participants to GitHub and the revamped ecosystem page at OpenLineage for more details.

Metrics on Community Growth
- Michael Robinson shares insights from the LF AI & Data dashboards, showing increases in total and active contributors as well as commits.
- Harel shared that there may be an issue with the way commits are being counted, but the general trend of 5,000+ commits per month is accurate. He also shared details about their global community membership and contributors using the Orbit tool.

New Job Facet: "Job Type"
- Pawel Leszczynski presented a new job facet called "job type", which contains information about the processing type, the integration, and the job type (such as query or command). This facet is used for streaming jobs and is already being implemented in their Flink integration.
- Harel thanked him for the presentation.
- Harel expressed excitement about seeing events stream into Marquez and Pawel shared that they are able to merge the PR, but there are still some issues with CI.
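
As a rough illustration of the job type facet discussed here, the snippet below builds one as a plain dictionary. The field names processingType, integration, and jobType follow the OpenLineage spec; the helper job_type_facet, the strict check on processing types, and the sample job namespace and name are assumptions made for this sketch, and the _producer and _schemaURL fields that real facets carry are left out for brevity.

```python
# Assumption for this sketch: only these two processing types are accepted.
ALLOWED_PROCESSING_TYPES = {"BATCH", "STREAMING"}


def job_type_facet(processing_type, integration, job_type=None):
    """Build a jobType job facet as a plain dict (hypothetical helper)."""
    if processing_type not in ALLOWED_PROCESSING_TYPES:
        raise ValueError(f"unexpected processingType: {processing_type!r}")
    facet = {"processingType": processing_type, "integration": integration}
    if job_type is not None:
        facet["jobType"] = job_type
    # Job facets are keyed by facet name inside job.facets.
    return {"jobType": facet}


# Illustrative streaming job; the namespace and name are made up.
flink_job = {
    "namespace": "flink_cluster",
    "name": "orders_enrichment",
    "facets": job_type_facet("STREAMING", "FLINK", "JOB"),
}
```

A consumer could then tell streaming jobs from batch jobs by reading flink_job["facets"]["jobType"]["processingType"].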

Open Proposals Discussion
- Harel expressed excitement for an upcoming release and suggested that encouraging messages on Marquez might help. The next item on the agenda is discussing open proposals.
- Jens discusses two proposals related to the column-level lineage facet, which have been discussed with Aba. He explains the current state of column-level lineage and its issues, including the lack of a clear contract between producers and consumers regarding transformation types.
- The first proposal is to create a taxonomy of types to address this issue. The second proposal addresses situations where the transformation type would be different for a given pair of input field and output field.
- Jens presents a document with more details on transformation types for column-level lineage, which should be complete, disjoint, unambiguous, and optional. He also proposes adding a transformation subtype for further extensibility.
- Jens proposes adding a subtype and a separate field for masked transformations, creating a transformation object, and moving the fields related to transformation into their own object. Pawel suggests adding a masked field to allow users to send information if they wish to.
- Harel asks about adding the name of the masking function as its own property, and Mandy suggests it could be a free form name or an extra property for masking algorithm. They agree to swap the masked field into the name of the mask, and recognize that masking can mean different things in different use cases.
- They discuss the possibility of coalescing on some naming convention or using reference data management to control values.
- Jens asks Mandy to check the GitHub issue with the proposal and provides the slide number. Harel links both proposals in the chat.
- Mandy thanks Harel for doing the proposal.
- Harel expresses gratitude for the proposals and invites others to open a proposal on the project. The next item on the agenda is discussion, but there are no items for this month.
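
To make the shape of the proposal more concrete, here is a minimal sketch of a column lineage facet in which each input field carries its own transformation object with a type, an optional subtype, and the name of the mask. It is an illustration of the idea discussed above and not the final spec: the key names transformation, subtype, and masking, the values DIRECT, IDENTITY, and TRANSFORMATION, and the helpers input_field and masked_columns are all placeholders chosen for this example.

```python
def input_field(namespace, dataset, field, type_, subtype=None, masking=None):
    """Describe one input column plus how it was transformed (hypothetical)."""
    transformation = {"type": type_}
    if subtype is not None:
        transformation["subtype"] = subtype
    if masking is not None:
        # Name of the masking function, per the discussion above.
        transformation["masking"] = masking
    return {
        "namespace": namespace,
        "name": dataset,
        "field": field,
        "transformation": transformation,
    }


# Output columns mapped to the input columns they were derived from.
column_lineage = {
    "fields": {
        "email_hash": {
            "inputFields": [
                input_field("warehouse", "users", "email",
                            "DIRECT", "TRANSFORMATION", masking="sha256"),
            ]
        },
        "country": {
            "inputFields": [
                input_field("warehouse", "users", "country", "DIRECT", "IDENTITY"),
            ]
        },
    }
}


def masked_columns(facet):
    """List output columns that have at least one masked input."""
    return sorted(
        output
        for output, spec in facet["fields"].items()
        if any("masking" in f["transformation"] for f in spec["inputFields"])
    )


masked = masked_columns(column_lineage)
```

Keeping the transformation on each input field, rather than on the output column, is what lets one output column have different transformation types for different inputs.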

Reviewing New Core Facets
- Jens asks about the process for reviewing new core facets and suggests discussing them before they get merged. Pawel Leszczynski explains the process of creating a JSON file and creating a PR, and suggests waiting a few weeks for others to review the proposal.
- Jens agrees and suggests highlighting spec changes more frequently.
- Pawel suggests asking Julien for review and acknowledges that it may take longer during Christmas time. Harel emphasizes the responsibility of the community to each other and suggests allowing for more duration before merging and releasing.
- Eric presents another item on the agenda.
- Harel thanks everyone for their input and moves on to the next item on the agenda.

Adoption of OpenLineage
- Eric shares details on adoption of Atlan supporting different flavors of implementation and how brands adopting OpenLineage speak to the momentum of the community building. He thanks all committers for backing something that's making a difference in the data ecosystem.
- Harel echoes Eric's words and appreciates everyone who contributed over the past few years, making this project an industry standard. He thanks all contributors and adopters like Atlan, Google, and everyone else on the call and in the ecosystem.

November 9, 2023 (10am PT)

Attendees:

TSC:
- Julien Le Dem, OpenLineage project lead
- Michael Robinson, Community team, Astronomer
- Jakub Dardziński, Software Engineer, GetInData
- Harel Shein, Engineering Manager, Astronomer
- Maciej Obuchowski, Software Engineer, Astronomer/GetInData, OpenLineage committer
- Paweł Leszczyński, Software Engineer, Astronomer/GetInData
And:
- Eric Veleker, Atlan
- Harsh Loomba, Engineer, Upgrade
- Sheeri Cabral, Product Manager, Collibra
- Peter Huang, Software Engineer, Apple
- Jens Pfau, Engineering Manager, Google
- Shubhambharadwaj, Associate Manager

Agenda:

Announcements
Recent releases
Recent additions to the Flink integration
Recent additions to the Spark integration
Proposal updates
Discussion items
Open discussion

Meeting:

- http://youtube.com/watch?v=Lc-IVvMleJU

Notes:

Announcements

A warm welcome to new committer Harel Shein (harels)! Harel's main contributions have been to project leadership, facilitating discussions, and advocating for the project. Thanks, Harel!
Upcoming talks include one by Paweł Leszczyński at the Data Science Summit in Warsaw/online, November 23-24, and another by Julien Le Dem at Scale By The Bay in Oakland, CA, on November 15.
The call for papers deadline for Data Council has been extended to November 17th.

Recent Releases

OpenLineage 1.5.0
- Added
  - Flink: add Flink lineage for Cassandra Connectors #2175 @HuangZhenQiu
  - Spark: support rdd and toDF operations available in Spark Scala API #2188 @pawel-big-lebowski
  - Spark: support Databricks Runtime 13.3 #2185 @pawel-big-lebowski
- Changed
  - Airflow: loosen attrs and requests versions #2107 @JDarDagran
  - dbt: render yaml configs lazily #2221 @JDarDagran
- Thanks to all the contributors, including new contributor @sophiely!

Recent Additions to the Flink Integration - Peter Huang (Apple)

I work on the Flink team at Apple with a focus on meeting legal requirements
Current priorities include improving lineage from Iceberg
Users here also employ Cassandra, so we have contributed Cassandra support
Apple has an open-source contribution review process, and I can't contribute more at the moment
I hope that the review process will be completed in the coming weeks, so we can make more contributions
Planned improvements include:
- addition of more catalog information to Iceberg lineage
- support for Flink 1.18

Recent Additions to the Spark Integration - Paweł Leszczyński (GetInData)

Added support for Spark 3.5
Added support for Databricks Runtime (most recent version)
2188: fix in Scala integration
- RDD issue was hard to reproduce
2233: Jackson library upgrade
- Jackson library in the project was an old version
- upgrade includes a security vulnerability fix
- merged but not yet released
Planned:
- Support for Iceberg and Delta for Spark 3.5
- Spark parentRun AKA Spark Application Events (by mobuchowski)
- Meetup talk: "How to become a spark-openlineage contributor in 5 steps?"

Proposals in Discussion - Julien Le Dem (Project Lead)

Open proposals:
- 2187: ColumnLineageDatasetFacet
  - privacy use cases
- 2186: formalizing transformation types
  - column lineage facet improvements
- 2163: define an integration certification process for OpenLineage
  - defines integration certification process
  - currently collecting use cases
  - related to registry proposal
  - input/feedback needed
- 2162: dataset support in Spark LogicalPlan Nodes
  - optional API we could add to the Nodes
  - prototype coming soon
- 2161: registry of producers and consumers
  - comments welcome on the PR on GitHub
  - producers would be able to register custom facet prefix, URI and link to documentation, etc.
  - consumers would be able to declare the facets you consume, link to documentation, etc.
  - name registration:
    - unique naming
    - name would be used in shorter URI prefixes
  - CI validation would enforce consistent facet naming and validate facet schemas
  - documentation would be published automatically
  - additional documentation for specific use cases
  - self-contained registry containing all facets for producers and consumers
    - name path in registry with CODEOWNERS file for delegation to circumvent review process
    - path for facet JSON
    - more information
  - Pros:
    - producers and consumers would be able to define codeowners to approve changes to the registry
    - CI could guarantee that changes would not produce inconsistencies
    - producers would not need to host and maintain their own subset of the registry
    - publication would be automated
    - freedom and independence for defining custom facets without the project being a bottleneck
  - Cons:
    - registered entities would have to maintain their list of codeowners
  - Q&A:
    - producers that define multiple facets?
      - granularity of this and other aspects might or might not be desirable
    - consumed facets: mandatory or optional?
      - always optional
    - custom facets or core facets?
      - core facets currently in a different dir, but it would be nice to move them to the registry
    - add tests as with core facets?
      - would be useful as examples and for validation
      - could be optional
      - please add this to the PR
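
The CI validation idea from the registry proposal (enforcing consistent facet naming against registered names) might look something like the check below. This is a sketch under stated assumptions only: the registry layout, the rule that a custom facet key is a registered name followed by an underscore, and the names unclaimed_facets, acme, and other are hypothetical and are not taken from the proposal's PR.

```python
def unclaimed_facets(facet_names, registry):
    """Return facet keys whose prefix is not a registered name.

    Hypothetical rule for this sketch: a custom facet key looks like
    "<registered name>_<facet name>".
    """
    problems = []
    for name in facet_names:
        prefix, sep, rest = name.partition("_")
        if not (sep and rest and prefix in registry):
            problems.append(name)
    return problems


# Illustrative registry with a single registered producer.
registry = {
    "acme": {
        "codeowners": ["@acme/lineage"],
        "docs": "https://example.com/acme-facets",
    },
}

emitted = ["acme_costEstimate", "other_retries", "acme_"]
problems = unclaimed_facets(emitted, registry)
```

A CI job could fail the build whenever the returned list is non-empty, which keeps naming consistent without a manual review step.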

October 12, 2023 (10am PT)

Attendees:

TSC:
- Paweł Leszczyński, Software Engineer, GetInData
- Julien Le Dem, OpenLineage project lead
- Michael Robinson, Community team, Astronomer
- Jakub Dardziński, Software Engineer, GetInData
- Willy Lulciuc, Marquez Project Lead
And:
- Harel Shein, Engineering Manager, Astronomer
- Harsh Loomba, Upgrade
- Sheeri Cabral, Product Manager, Collibra
- Ernie Ostic, Manta Software
- Jeevan Paul, Accel Data
- Ann Mary Justine, Research Engineer, HP Enterprise's CMF team
- Jason Yip, Grainger
- Sunder, JLR
- Peter Huang, engineer at <>, on Flink team
- Jens Pfau, engineering manager at Google working on GCP
- Martin Foltin, member, HP Enterprise's CMF team
- Austin Bennett, architect at Chartboost
- Eric Veleker, Atlan

Agenda:

Announcements
Recent releases
Airflow Summit recap
Tutorial/demo: migrating to the OpenLineage Airflow Provider
Discussion: observability for OpenLineage+Marquez
Open discussion

...

Migration from standalone Open Lineage package to Airflow provider
- Jakub explained how to migrate from the standalone openly the flow package to the airflow provider. He gave reasons why they wanted to become an airflow provider, including making sure that the metadata collected in airflow is not breaking airflow itself.
- As a provider, the integration stays up to date with the latest code of all the other providers, and lineage support becomes part of those providers' operators. A couple of changes were introduced in the provider package, so the main question is how to migrate.
- The simplest way is to just install the provider package. One of the things they would like to move away from is custom extractors, although there was, and still is, the possibility of writing a custom extractor, registered through the OPENLINEAGE_EXTRACTORS environment variable.
- Jakub explains that if a user previously implemented a get_openlineage_facets method based on the old module and class, they do not need to worry about it because it is translated. However, if they uninstall openlineage-airflow, importing the old class will fail and they will need to change the import path.
- There are changes in configuration: the Airflow config now has a whole section called openlineage. Many of the features that were previously available in the openlineage-airflow package are also compatible with the provider.
- People usually set OPENLINEAGE_URL, which is pretty common and still works. But some entries in the openlineage config section take precedence over what was previously handled by environment variables.
- Jakub gives examples of how the config takes precedence over OPENLINEAGE_URL. He mentions that the documentation has more information on how it works.
- He also explains how to add a new integration in the provider, or in other providers that make use of the OpenLineage provider. They want to stop using the openlineage.common.dataset module and use only the classes from the OpenLineage Python client.
- Jakub gives quick advice on how to grab information from the execution of an operator. Previously, when they had no control or influence over the operator, they needed to read its code and see that, for example, the job ID is returned as an XCom.
- Now that they have added the integration in the query operator itself, they can just change the code so that it saves the job IDs as attributes.
- Jakub gives a quick demo of how it works. He is using Breeze, which is a development environment and CLI for Airflow.
- He is using Airflow 2.7.1 and also the openlineage integration, which starts Marquez as well; that's an option they have in Airflow. The only package he is using is Postgres, because he'll be using the provider.
- He shows how it works and mentions that the beauty of a live demo is that he doesn't know if it will work.
- Jakub says that it should work in a minute.
- Jakub types in his password.
- Jakub says that he doesn't need to run post scripts, but actually he doesn't have just to prove he doesn't have any.
- Jakub says that it's working. He is running an example that uses Postgres as the backend.
- Jakub says that previously there was nothing more to configure if a user had OPENLINEAGE_URL set.
- Jakub explains that he changed the namespace, since this is a development setup, and that the name is different because he has been experimenting with something. Eventually, the events arrived in Marquez.
- Jakub tries it again.
- Jakub gives a quick demo of three options for package installation, rerunning the example. Julien thanks Jakub and asks if there are any questions about migrating from the old OpenLineage integration to the new Airflow provider.
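
The precedence rule described above (an entry in the openlineage config section wins over the OPENLINEAGE_URL environment variable) can be illustrated with a minimal sketch. This is not the provider's code: the function name and the "url" config key are hypothetical and chosen only for illustration; the provider documentation describes the real options.

```python
from typing import Mapping, Optional


def resolve_openlineage_url(
    config_section: Mapping[str, str],
    environ: Mapping[str, str],
) -> Optional[str]:
    """Pick the endpoint for lineage events, preferring config over environment.

    `config_section` stands in for the openlineage section of the Airflow
    config; the "url" key is a hypothetical name used for this sketch.
    """
    # An entry in the config section takes precedence when it is set.
    url_from_config = config_section.get("url")
    if url_from_config:
        return url_from_config
    # Otherwise fall back to the environment variable, which still works.
    return environ.get("OPENLINEAGE_URL")
```

Called with both sources set, the sketch returns the config value; with an empty config section it returns whatever OPENLINEAGE_URL holds, and None if neither is set.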

Observability for OpenLineage+Marquez
- Julien introduces the discussion topic of observability for OpenLineage and Marquez and invites Harel to start. Harel asks the audience about ensuring the reliability of lineage collection and what kind of observability they would like to see, such as distributed tracing.
- He suggests gathering feedback in a Slack channel. Julien thinks the metrics added to the Airflow integration by Harel are a good starting point for observability.
- Harsh mentions enabling the retention policy on all environments and suggests observability on database retention to help with memory or CPU performance. Harel suggests enabling metrics out of the box and instrumenting more functions, using Dropwizard as the web server.
- Julien and Willy discuss having metrics on the retention job to track how it keeps the database small.
- Jeevan asked about the possibility of having an OpenLineage event for Spark applications, and Paweł explained the need for a parent run facet to identify each Spark action as part of a bigger entity, the Spark application. Jens suggested having unique job names for Spark actions and the parent Spark application.
- Paweł explained that the current job name is constructed from the name of the operator or Spark logical node with a dataset name appended, but they could make it optional to have a human-readable job name or to use a hash of the logical plan to ensure uniqueness.
- Harel mentioned having good news for Bob and suggested discussing it next week.
- Jens added that having unique job names would help distinguish each Spark action and its runs, and Paweł explained the current job naming convention and the possibility of making it unique using a hash of the logical plan.
- Julien asked if anyone had more comments on the topic.
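
The two job naming options discussed above (a human-readable name versus a hash of the logical plan) can be sketched as follows. This is only an illustration under stated assumptions, not the Spark integration's implementation: the function name, the dot-separated name layout, the use of SHA-256, and the 12-character truncation are all hypothetical.

```python
import hashlib


def spark_action_job_name(
    app_name: str,
    node_name: str,
    dataset_name: str,
    logical_plan: str,
    human_readable: bool = True,
) -> str:
    """Build a job name for a Spark action under its parent application.

    All naming details here are hypothetical choices made for this sketch.
    """
    if human_readable:
        # Readable form: logical node name with the dataset name appended.
        suffix = f"{node_name}.{dataset_name}"
    else:
        # Unique form: hash the textual logical plan, so that actions with
        # different plans get different names, and keep a short prefix.
        digest = hashlib.sha256(logical_plan.encode("utf-8")).hexdigest()[:12]
        suffix = f"{node_name}.{digest}"
    return f"{app_name}.{suffix}"
```

With the hashed form, two actions that share a node and dataset name but differ in their logical plans get distinct job names, while the same plan always maps to the same name.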

...
