...
Notes:
- Announcements [Julien]
- A New York meetup will be happening on 4/26 at the Astronomer offices in the Flatiron District
- Julien Le Dem will be speaking at the Data+AI Summit in June: "Cross-platform Data Lineage with OpenLineage"
- Recent talks:
- Last month: Ross Turk, Paweł Leszczyński and Maciej Obuchowski all spoke at Big Data Technology Warsaw Summit 2023
- Also last month: Julien Le Dem spoke at Data Council Austin
- Recent meetups:
- Last month: OpenLineage Meetup at Data Council Austin
- Last month: Data Lineage Meetup in Providence, RI
- Updates [Julien]
- OpenLineage in Airflow (AIP-53)
- Goal: make operators responsible for their own lineage
- Goal requires additions to the Airflow infrastructure
- Development process will progress in 3 phases
- add an OpenLineage library conforming to Airflow processes and coding style
- work on other providers, implementing OpenLineage methods
- add OpenLineage support to TaskFlow and Python operators
- Timeline: aiming for June Providers release
- We have begun with the Snowflake operator
- A significant benefit: lineage support will live in the operators themselves and be maintained alongside them (see the sketch below)
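To make operator-owned lineage concrete, here is a minimal sketch of the pattern AIP-53 proposes, assuming the `get_openlineage_facets_on_complete` hook and the `OperatorLineage` container used by the OpenLineage provider convention; the operator itself and the Snowflake namespace and table names are hypothetical.

```python
from airflow.models.baseoperator import BaseOperator


class CopyTableOperator(BaseOperator):
    """Hypothetical operator that copies one table into another."""

    def __init__(self, *, source_table: str, target_table: str, **kwargs):
        super().__init__(**kwargs)
        self.source_table = source_table
        self.target_table = target_table

    def execute(self, context):
        ...  # the operator's actual work goes here

    def get_openlineage_facets_on_complete(self, task_instance):
        # Lazy imports keep the operator usable even when the
        # OpenLineage provider is not installed.
        from airflow.providers.openlineage.extractors import OperatorLineage
        from openlineage.client.run import Dataset

        # The operator reports its own inputs and outputs; no external
        # extractor needs to know about its internals.
        return OperatorLineage(
            inputs=[Dataset(namespace="snowflake://my-account", name=self.source_table)],
            outputs=[Dataset(namespace="snowflake://my-account", name=self.target_table)],
        )
```

With this pattern, lineage extraction sits next to the operator code, so provider maintainers can keep it up to date rather than relying on separately maintained extractors.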
- Static lineage support
- Next stage: add formal proposal to the OpenLineage repo, where it will be easier for members to comment
- To recap:
- OpenLineage is designed to capture lineage as pipelines run, along with some information that is more static (schema, schema changes, etc.)
- Goal: capture lineage about views, etc., that have not run yet
- Focus will remain on everything that has been deployed
- Parallel discussion: lineage from job-less events, e.g., ad-hoc events
- challenge: these could pollute the namespace
- Basic proposal: make the job name optional, which will also require changes on the Marquez side
- Comments are welcome
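To make the proposal concrete, here is a hedged sketch of emitting an event with the OpenLineage Python client as it works today; the commented line marks the job name that the proposal would make optional for ad-hoc, job-less events. The endpoint, namespace, producer URI, and dataset names are placeholders.

```python
import uuid
from datetime import datetime, timezone

from openlineage.client import OpenLineageClient
from openlineage.client.run import Dataset, Job, Run, RunEvent, RunState

client = OpenLineageClient(url="http://localhost:5000")  # placeholder endpoint

# Today every event must carry a job; the proposal would let ad-hoc events
# omit the job name instead of polluting the namespace with one-off jobs.
event = RunEvent(
    eventType=RunState.COMPLETE,
    eventTime=datetime.now(timezone.utc).isoformat(),
    run=Run(runId=str(uuid.uuid4())),
    job=Job(namespace="example", name="adhoc_query"),  # would become optional
    producer="https://example.com/adhoc-client",  # placeholder producer URI
    inputs=[Dataset(namespace="example", name="public.source_table")],
    outputs=[Dataset(namespace="example", name="public.derived_view")],
)
client.emit(event)
```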
- Caching support for column lineage [Paweł]
- Personal opinion: the Spark integration is amazing because it extracts lineage from the logical plan; it is also easy to configure, requiring just four lines of configuration (see the sketch after this list)
- Caching: a popular concept for Spark jobs
- Spark uses a separate logical plan for cached datasets, so two logical plans must be merged
- once merged, we can still see how inputs affect outputs across the cache boundary
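For reference, a sketch of the Spark setup being discussed: the four configuration lines that enable the integration, plus a cached DataFrame that produces the second logical plan described above. The package version, endpoint URL, and file paths are placeholders, not values from the meeting.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("openlineage-caching-example")
    # The four configuration lines mentioned above: attach the OpenLineage
    # listener and point it at a backend (version and URL are placeholders).
    .config("spark.jars.packages", "io.openlineage:openlineage-spark:0.23.0")
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
    .config("spark.openlineage.transport.type", "http")
    .config("spark.openlineage.transport.url", "http://localhost:5000")
    .getOrCreate()
)

# Caching splits the work into two logical plans: one that produces the
# cached dataset and one that consumes it. Column-level lineage has to
# merge the two so inputs still map to the final outputs.
source = spark.read.parquet("/data/source")  # hypothetical input path
cached = source.select("id", "amount").cache()
cached.groupBy("id").sum("amount").write.mode("overwrite").parquet("/data/output")
```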
- Open discussion
- A question about duplicated events when setting env variables [Anirudh]
- we have needed to employ filtering to deduplicate these events
- Spark reuses jobs for actions that are not really jobs, which is what produces the duplicates
...