...

All are welcome.

Next meeting: November 9, 2023 (10am PT)

October 12, 2023 (10am PT)

Agenda:

  1. Announcements
  2. Recent releases
  3. Airflow Summit recap
  4. Tutorial/demo: migrating to the OpenLineage Airflow Provider
  5. Discussion: observability for OpenLineage+Marquez
  6. Open discussion

Meeting:

http://youtube.com/watch?v=LMuS0DJoOtc

Notes:

Announcements

Recent releases

  • 1.2.2
    • Added

      • Spark: publish the ProcessingEngineRunFacet as part of the normal operation of the OpenLineageSparkEventListener (#2089 @d-m-h)
      • Spark: capture and emit the spark.databricks.clusterUsageTags.clusterAllTags variable from the Databricks environment (#2099 @Anirudh181001)

      Thanks to all the contributors, including new contributors @d-m-h, @tati and @xli-1026!

  • 1.3.1
    • Added

      Thanks to all the contributors, including new contributor @erikalfthan!

  • 1.4.1

Migration from the standalone OpenLineage package to the Airflow provider
    - Jakub explained how to migrate from the standalone openlineage-airflow package to the Airflow provider. He gave the reasons for becoming an Airflow provider, including making sure that collecting metadata in Airflow does not break Airflow itself.
    - Becoming a provider also keeps the integration code up to date, and the lineage logic ships as part of the providers of the operators themselves. A couple of changes were introduced in the provider package, and the main question is how to migrate.
    - The simplest way is to just install the provider package (see the sketches after this list). One of the things they would like to move away from is custom extractors; it was, and still is, possible to write a custom extractor controlled by the OpenLineage extractors environment variable.
    - Jakub explained that users who implemented a get_openlineage_facets method based on the old module and class do not need to worry, because it is translated. However, once the old package is gone, imports of the old class will fail, and the import path needs to change.
    - The provider introduces configuration changes: there is a whole section called openlineage in the Airflow config. Many of the features that were previously available in the openlineage-airflow package are also supported by the provider.
    - People usually use OPENLINEAGE_URL, which is pretty common and still works, but some entries in the openlineage config section take precedence over what was previously handled by environment variables.
    - Jakub gave examples of how the config section takes precedence over OPENLINEAGE_URL, and mentioned that the documentation has more detail on how this works.
    - He also explained how to add new integrations in the provider, or in other providers that make use of the OpenLineage provider. They want to move away from the openlineage.common datasets module and use just the classes from the OpenLineage Python client.
    - Jakub gave quick advice on how to grab information from the execution of an operator. Previously, with no control or influence over the operator, they had to read its code and discover that, for example, a job ID is returned as an XCom.
    - Now that the integration lives in the operator itself, they can simply change the code so that it saves the job IDs as attributes.
    - Jakub gave a quick demo of how it works, using Breeze, a development environment and CLI for Airflow.
    - He used Airflow 2.7.1 with the openlineage integration, which also spins up Marquez, an option available in Breeze. The only extra package he used was Postgres, because he would be using the Postgres provider.
    - He joked that the beauty of a live demo is that he doesn't know whether it will work, then said it should work in a minute.
    - While typing in his password, Jakub noted that he didn't need to run any setup scripts, and showed that he didn't have any, just to prove it.
    - It worked: he ran an example that uses Postgres as a backend. As before, there is nothing more to configure for a user who already has OPENLINEAGE_URL set.
    - He explained that he had changed the namespace during development, which is why the name differed; after trying again, the events arrived in Marquez.
    - Jakub closed with a quick demo of the three options for package installation and rerunning history. Julien thanked Jakub and asked if there were any questions about migrating from the old OpenLineage integration to the new Airflow provider.
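
A rough sketch of the import-path change described above, assuming the provider is installed with pip install apache-airflow-providers-openlineage; the operator, its job_id attribute, and the empty facet lists are hypothetical placeholders:

    # Before (standalone package), lineage classes lived under openlineage.airflow.
    # After migrating, import them from the provider instead:
    from airflow.models.baseoperator import BaseOperator
    from airflow.providers.openlineage.extractors import OperatorLineage

    class MyJobOperator(BaseOperator):  # hypothetical operator
        def execute(self, context):
            # Save the job id as an attribute instead of only pushing it to XCom,
            # so the lineage method below can read it after execution.
            self.job_id = "job-123"

        def get_openlineage_facets_on_complete(self, task_instance):
            # Inputs, outputs, and facets would be built from what execute() saved.
            return OperatorLineage(inputs=[], outputs=[], run_facets={}, job_facets={})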

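A sketch of the openlineage config section mentioned above, as it might appear in airflow.cfg; the values are placeholders:

    [openlineage]
    namespace = my_namespace
    transport = {"type": "http", "url": "http://localhost:5000"}

If both are set, an entry like transport takes precedence over the legacy OPENLINEAGE_URL environment variable; Airflow's standard AIRFLOW__OPENLINEAGE__* environment variables map to the same section.
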
Observability for OpenLineage+Marquez
    - Julien introduced the discussion topic of observability for OpenLineage+Marquez and invited Harel to start. Harel asked the audience about ensuring the reliability of lineage collection and what kind of observability they would like to see, such as distributed tracing.
    - He suggested gathering feedback in a Slack channel. Julien thought the metrics Harel added to the Airflow integration are a good starting point for observability.
    - Hloomba mentioned enabling a retention policy on all environments and suggested observability on database retention to help with memory and CPU performance. Harel suggested enabling metrics out of the box and instrumenting more functions using Dropwizard as the web server.
    - Julien and William discussed having metrics on the retention job to track how well the data retention job keeps the database small.
    - Jeevan asked about the possibility of having an OpenLineage event for Spark applications, and Paweł Leszczyński explained the need for a parent run facet to identify each Spark action as part of a bigger entity, the Spark application (see the sketch after this list). Jens suggested having unique job names for the Spark actions and the parent Spark application.
    - Paweł explained that the current job name is constructed from the name of the operator or Spark logical node, with a dataset name appended, but they could make a human-readable job name optional or use a hash of the logical plan to ensure uniqueness.
    - Harel mentioned having good news for Bob and suggested discussing it next week.
    - Jens added that unique job names would help distinguish each Spark action and its runs, and Paweł reiterated the current job naming convention and the possibility of making names unique with a hash of the logical plan.
    - Julien asked if anyone had more comments on the topic.
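
A minimal sketch of the parent run facet Paweł described, using the OpenLineage Python client; the namespace, job names, and run IDs are hypothetical:

    from datetime import datetime, timezone
    from uuid import uuid4

    from openlineage.client import OpenLineageClient
    from openlineage.client.facet import ParentRunFacet
    from openlineage.client.run import Job, Run, RunEvent, RunState

    # One run id represents the whole Spark application (the "parent").
    app_run_id = str(uuid4())
    parent = ParentRunFacet.create(
        runId=app_run_id,
        namespace="spark",        # hypothetical namespace
        name="my_spark_app",      # hypothetical application-level job name
    )

    # Each Spark action is emitted as its own job whose run carries the
    # parent facet, tying the action back to the application run.
    event = RunEvent(
        eventType=RunState.START,
        eventTime=datetime.now(timezone.utc).isoformat(),
        run=Run(runId=str(uuid4()), facets={"parent": parent}),
        job=Job(namespace="spark", name="my_spark_app.save_orders"),
        producer="https://example.com/demo-producer",
    )
    OpenLineageClient().emit(event)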

Creating a registry for consumers and producers
    - Julien presented four items and discussed them in detail. The first was about creating a registry for consumers and producers, summarized in a Google doc.
    - Two options were discussed, and the second proposal, a self-contained repository, was preferred. Notes and open items were added to the document, and everyone was encouraged to contribute to it.

Proposing an optional contract for providers for Airflow operators
    - The second item was a proposal for an optional contract for Airflow operators to expose their lineage. A proposal was also made to expose OpenLineage datasets directly in dbt's manifest file, and feedback was sought from dbt contributors.

Spark integration
    - The third item concerned the Spark integration, which knows how to define unique datasets for the built-in data sources. Custom data sources with their own implementations are opaque to it, so an optional contract was proposed to let them expose lineage.

Certification process in the OpenLineage ecosystem
    - Julien discussed the need for a certification process in the OpenLineage ecosystem and suggested creating a document to start a discussion on how to implement it. He mentioned the possibility of providing dataset support for scan and action nodes, and creating a contract for implementing data sources to expose lineage in relation nodes.
    - Julien also talked about the goal of OpenLineage being built into systems like Airflow, and encouraged attendees to share their opinions and ask questions on Slack.

September 14, 2023 (10am PT)

...