...
Notes:
- Announcements [Julien]
- A New York meetup will be happening on 4/26 at the Astronomer offices in the Flatiron District
- Julien Le Dem will be speaking at the Data+AI Summit in June: "Cross-platform Data Lineage with OpenLineage"
- Recent talks:
- Last month: Ross Turk, Paweł Leszczyński and Maciej Obuchowski all spoke at Big Data Technology Warsaw Summit 2023
- Also last month: Julien Le Dem spoke at Data Council Austin
- Recent meetups:
- Last month: OpenLineage Meetup at Data Council Austin
- Last month: Data Lineage Meetup in Providence, RI
- Updates [Julien]
- OpenLineage in Airflow (AIP-53)
- Goal: make operators responsible for their own lineage
- Goal requires additions to the Airflow infrastructure
- Development process will progress in 3 phases
- add an OpenLineage library conforming to Airflow processes and coding style
- work on other providers, implementing OpenLineage methods
- add OpenLineage support to TaskFlow and Python operators
- Timeline: aiming for June Providers release
- We have begun with the Snowflake operator
- A significant benefit: lineage support will live in the operators themselves and be maintained alongside them (see the sketch below)
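To make operator-owned lineage concrete, here is a minimal sketch of the pattern AIP-53 proposes, assuming the `get_openlineage_facets_on_complete` hook and the `OperatorLineage` container used by the OpenLineage provider convention; the operator itself and the Snowflake namespace and table names are hypothetical.

```python
from airflow.models.baseoperator import BaseOperator


class CopyTableOperator(BaseOperator):
    """Hypothetical operator that copies one table into another."""

    def __init__(self, *, source_table: str, target_table: str, **kwargs):
        super().__init__(**kwargs)
        self.source_table = source_table
        self.target_table = target_table

    def execute(self, context):
        ...  # the operator's actual work goes here

    def get_openlineage_facets_on_complete(self, task_instance):
        # Lazy imports keep the operator usable even when the
        # OpenLineage provider is not installed.
        from airflow.providers.openlineage.extractors import OperatorLineage
        from openlineage.client.run import Dataset

        # The operator reports its own inputs and outputs; no external
        # extractor needs to know about its internals.
        return OperatorLineage(
            inputs=[Dataset(namespace="snowflake://my-account", name=self.source_table)],
            outputs=[Dataset(namespace="snowflake://my-account", name=self.target_table)],
        )
```

With this pattern, lineage extraction sits next to the operator code, so provider maintainers can keep it up to date rather than relying on separately maintained extractors.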
- Static lineage support
- Next stage: add formal proposal to the OpenLineage repo, where it will be easier for members to comment
- To recap:
- OpenLineage is designed to capture lineage as pipelines run, along with some information that is more static (schema, schema changes, etc.)
- Goal: capture lineage about views, etc., that have not run yet
- Focus will remain on everything that has been deployed
- Parallel discussion: lineage from job-less events, e.g., ad-hoc events
- challenge: these could pollute the namespace
- Basic proposal: make the job name optional, which will also require changes on the Marquez side
- Comments are welcome
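To make the proposal concrete, here is a hedged sketch of emitting an event with the OpenLineage Python client as it works today; the commented line marks the job name that the proposal would make optional for ad-hoc, job-less events. The endpoint, namespace, producer URI, and dataset names are placeholders.

```python
import uuid
from datetime import datetime, timezone

from openlineage.client import OpenLineageClient
from openlineage.client.run import Dataset, Job, Run, RunEvent, RunState

client = OpenLineageClient(url="http://localhost:5000")  # placeholder endpoint

# Today every event must carry a job; the proposal would let ad-hoc events
# omit the job name instead of polluting the namespace with one-off jobs.
event = RunEvent(
    eventType=RunState.COMPLETE,
    eventTime=datetime.now(timezone.utc).isoformat(),
    run=Run(runId=str(uuid.uuid4())),
    job=Job(namespace="example", name="adhoc_query"),  # would become optional
    producer="https://example.com/adhoc-client",  # placeholder producer URI
    inputs=[Dataset(namespace="example", name="public.source_table")],
    outputs=[Dataset(namespace="example", name="public.derived_view")],
)
client.emit(event)
```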
- Caching support for column lineage [Paweł]
- Personal opinion: the Spark integration is amazing because it extracts lineage from the logical plan; it is also easy to configure, requiring just four lines of configuration (see the sketch after this list)
- Caching: a popular concept for Spark jobs
- Spark uses a separate logical plan for cached datasets, so two logical plans must be merged
- once merged, we can still see how inputs affect outputs across the cache boundary
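For reference, a sketch of the Spark setup being discussed: the four configuration lines that enable the integration, plus a cached DataFrame that produces the second logical plan described above. The package version, endpoint URL, and file paths are placeholders, not values from the meeting.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("openlineage-caching-example")
    # The four configuration lines mentioned above: attach the OpenLineage
    # listener and point it at a backend (version and URL are placeholders).
    .config("spark.jars.packages", "io.openlineage:openlineage-spark:0.23.0")
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
    .config("spark.openlineage.transport.type", "http")
    .config("spark.openlineage.transport.url", "http://localhost:5000")
    .getOrCreate()
)

# Caching splits the work into two logical plans: one that produces the
# cached dataset and one that consumes it. Column-level lineage has to
# merge the two so inputs still map to the final outputs.
source = spark.read.parquet("/data/source")  # hypothetical input path
cached = source.select("id", "amount").cache()
cached.groupBy("id").sum("amount").write.mode("overwrite").parquet("/data/output")
```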
- Open discussion
- A question about duplicated events when setting env variables [Anirudh]
- we have needed to employ filtering to deduplicate these events
- Spark reuses jobs for actions that are not really jobs, which is what produces the duplicates
...