
...

All are welcome.

Next meeting: Apr 13th, 2022 (9am PT)

Mar 9th, 2022 (9am PT)

...

Attendees:
TSC:
Mike Collado: Staff Software Engineer, Datakin
Maciej Obuchowski: Software Engineer, GetInData, OpenLineage contributor
Julien Le Dem: OpenLineage Project lead
Mandy Chessell: Egeria Project Lead
Willy Lulciuc: Co-creator of Marquez
And:
Michael Robinson: Dev Rel Engineer
Ross Turk: VP of Marketing, Datakin
Minkyu Park: Senior Software Engineer, Datakin
Srikanth Venkat: Product Manager, Privacera
John Thomas: Support Engineer, Datakin
Will Johnson: Senior Cloud Solution Architect, Azure Cloud, Microsoft
Paweł Leszczyński: Software Engineer, GetInData
Sheeri Cabral: Product Manager, Collibra
Michal Bartos: Software Engineer, MANTA
Chandru
Caroline Fahrenkrog: Product Manager, MANTA Scanners
John Montroy: Backend Engineer
Agenda:
New committers [Julien]
Release overview (0.6.0-0.6.1) [Michael R.]
Process for blog posts [Ross]
Retrospective: Spark integration [Willy et al.]
Open discussion
Notes:
New committers [Julien]
4 new committers were voted in last week
We had fallen behind
Congratulations to all
Release overview (0.6.0-0.6.1) [Michael R.]
Added
Extract source code of PythonOperator code similar to SQL facet @mobuchowski (0.6.0)
Airflow: extract source code from BashOperator @mobuchowski (0.6.0)
These first two additions are similar to SQL facet
Offer the ability to see top-level code
Add DatasetLifecycleStateDatasetFacet to spec @pawel-big-lebowski (0.6.0)
Captures when someone is conducting dataset operations (overwrite, create, etc.)
Add generic facet to collect environmental properties (EnvironmentFacet) @harishsune (0.6.0)
Collects environment variables
Depends on Databricks runtime but can be reused in other environments
OpenLineage sensor for OpenLineage-Dagster integration @dalinkim (0.6.0)
The first iteration of the Dagster integration to get lineage from Dagster
Java-client: make generator generate enums as well @pawel-big-lebowski (0.6.0)
Small addition to the Java client that provides better types; these values were previously strings
Fixed
Airflow: increase import timeout in tests, fix exit from integration @mobuchowski (0.6.0)
The former was a particular issue with the Great Expectations integration
Reduce logging level for import errors to info @rossturk (0.6.0)
Airflow users were seeing warnings about missing packages if they weren't using a part of an integration
This fix reduced the level to Info
Remove AWS secret keys and extraneous Snowflake parameters from connection URI @collado-mike (0.6.0)
Parses Snowflake connection URIs to exclude some parameters that broke lineage or posed security concerns (e.g., login data)
Some keys are Snowflake-specific, but more can be added from other data sources
Convert to LifecycleStateChangeDatasetFacet @pawel-big-lebowski (0.6.0)
Mandates the LifecycleStateChange facet from the global spec rather than the custom tableStateChange facet used in the past
Catch possible failures when emitting events and log them @mobuchowski (0.6.1)
Previously when an OL event failed to emit, this could break an integration
This fix catches possible failures and logs them
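Two of the fixes above (stripping sensitive parameters from connection URIs, and catching emission failures) can be illustrated with a minimal sketch. This is not the code from the release: the names `sanitize_uri`, `safe_emit`, and the `SENSITIVE_PARAMS` deny-list are hypothetical, and the parameters the actual integration removes may differ.

```python
import logging
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

log = logging.getLogger("lineage_sketch")

# Hypothetical deny-list; the real integration's list may differ.
SENSITIVE_PARAMS = {"aws_access_key_id", "aws_secret_access_key", "password"}


def sanitize_uri(uri: str) -> str:
    """Drop login data and deny-listed query parameters from a connection URI."""
    parts = urlsplit(uri)
    # Rebuild the host portion without "user:password@".
    # Note: urlsplit lowercases the hostname.
    host = parts.hostname or ""
    if parts.port is not None:
        host = f"{host}:{parts.port}"
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SENSITIVE_PARAMS
    ]
    return urlunsplit((parts.scheme, host, parts.path, urlencode(kept), parts.fragment))


def safe_emit(emit, event) -> bool:
    """Call emit(event); log and swallow any failure so the host job keeps running."""
    try:
        emit(event)
        return True
    except Exception:
        log.exception("Failed to emit lineage event")
        return False
```

Wrapping the emit call this way trades visibility for robustness: a lineage outage is only visible in the logs, but it can no longer fail the job that is being observed.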
Process for blog posts [Ross]
Moving the process to GitHub Issues
Follow release tracker there
Go to https://github.com/OpenLineage/website/tree/main/contents/blog to create posts
No one will have a monopoly
Proposals for blog posts are also welcome, and we can support your efforts with outlines and feedback
Post your ideas on the issue tracker on GitHub
Retrospective: Spark integration [Willy et al.]
Willy: originally this was part of Marquez, the inspiration behind OL
OL was prototyped in Marquez with a few integrations, one of which was Spark (other: Airflow)
Donated the integration to OL
Srikanth: #559 very helpful to Azure
Pawel: is anything missing from the Spark integration? E.g., column-level lineage?
Will: yes to column-level; also, Delta tables are an issue due to complexity; Spark 3.2 support would also be welcome
Maciej: should be more active about tracking projects we have integrations with; add to test matrix
Julien: let’s open some issues to address these
Open Discussion
Flink updates? [Julien]
Maciej: initial exploration is done
challenge: Flink has 4 APIs
prioritizing Kafka lineage currently because most jobs read from or write to Kafka
track this on GitHub milestones; contribute and ask questions there
Will: can you share thoughts on the data model? How would this show up in MZ? How often are you emitting lineage?
Maciej: trying to model entire Flink run as one event
Srikanth: proposed two separate streams, one for data updates and one for metadata
Julien: do we have an issue on this topic in the repo?
Michael C.: only a general proposal doc, not one on the overall strategy; this is worth a proposal doc
Julien: see notes for ticket number; MC will create the ticket
Srikanth: we can collaborate offline

Feb 9th, 2022 (9am PT)

Attendees:

...

And:

Michael Robinson: Dev Rel Engineer
Ross Turk: VP of Marketing, Datakin
Minkyu Park: Senior Software Engineer, Datakin
Srikanth Venkat: Product Manager, Privacera
John Thomas: Support Engineer, Datakin
Peter Scharling: EI Group
Peter Hicks: Senior Software Engineer, Datakin
Dalin Kim: Data Engineer, Northwestern Mutual
Kevin Mellott: Data Engineer, Northwestern Mutual
Will Johnson: Senior Cloud Solution Architect, Azure Cloud, Microsoft
Kelsy Brennan: EI Group
Aaron Colcord: Data Engineer, Northwestern Mutual

Agenda:

OpenLineage recent release overview (0.5.1) [Julien]
TaskInstanceListener now official way to integrate with Airflow [Julien]
Apache Flink integration [Julien]
Dagster integration demo [Dalin]
Open Discussion

Notes:

OpenLineage recent release overview (0.5.1) [Julien]
- No 0.5.0 due to bug
- Support for dbt-spark adapter
- New backend to proxy OL events
- Support for custom facets
TaskInstanceListener now official way to integrate with Airflow [Julien]

Integration runs on worker side
Will be available in the next Airflow release (2.3)
Thanks to Maciej for his work on this

Apache Flink integration [Julien]
- Ticket for discussion available
- Integration test setup
- Early stages
Dagster integration demo [Dalin]
- Initiated by Dalin Kim
- OL used with Dagster on orchestration layer
- Utilizes Dagster sensor
- Introduces OL sensor that can be added to Dagster repo definition
- Uses cursor to keep track of ID
- Looking for feedback after review complete
- Discussion:
  - Dalin: needed: way to interpret Dagster asset for OL
  - Julien: common code from Great Expectations/Dagster integrations
  - Michael C: do you pass parent run ID in child job when sending the job to MZ?
  - Hierarchy can be extended indefinitely – parent/child relationship can be modeled
  - Maciej: the sensor kept failing – does this mean the events persisted despite being down?
  - Dalin: yes - the sensor’s cursor is tracked, so even if repo goes down it should be able to pick up from last cursor
  - Dalin: hoping for more feedback
  - Julien: slides will be posted on slack channel, also tickets
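The cursor behavior Dalin describes can be sketched in plain Python. This is only an illustration of the idea and does not use the Dagster API: `process_since`, the `(id, payload)` event shape, and the `handle` callback are hypothetical names.

```python
from typing import Callable, Iterable, Optional, Tuple

# Hypothetical event shape: (monotonically increasing id, payload).
Event = Tuple[int, str]


def process_since(
    events: Iterable[Event],
    cursor: Optional[int],
    handle: Callable[[str], None],
) -> Optional[int]:
    """Handle every event whose id is greater than the cursor; return the new cursor."""
    last = cursor
    for event_id, payload in sorted(events):
        if last is not None and event_id <= last:
            continue  # already handled in an earlier tick
        handle(payload)
        last = event_id
    return last
```

Because the returned cursor is stored between ticks, a restart resumes from the last handled event rather than from the beginning or from "now".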
Open discussion
- Will: how is OL ensuring consistency of datasets across integrations?
- Julien: (jokingly) Read the docs! Naming conventions for datasets can be found there
- Julien: need for tutorial on creating integrations
- Srikanth: have done some of this work in Atlas
- Kevin: are there libraries on the horizon to play this role? (Julien: yes)
- Srikanth: it would be good to have model spec to provide enforceable standard
- Julien: agreed; currently models are based on the JSON schema spec
- Julien: contributions welcome; opening a ticket about this makes sense
- Will: Flink integration: MZ focused on batch jobs
- Julien: we want to determine whether we need to add checkpointing
- Julien: there will be discussion in the OL and MZ communities about this
- Julien: a consistent model is needed
- Julien: one solution being looked into is Arrow
- Julien: everyone should feel welcome to propose agenda items (even old projects)
- Srikanth: who are you working with on the Flink comms side? Will get back to you.

...
