...
All are welcome.
Table of Contents |
---|
Next meeting: Apr 13th, 2022 (9am PT)
Mar 9th, 2022 (9am PT)
...
Attendees:
- TSC:
- Mike Collado: Staff Software Engineer, Datakin
- Maciej Obuchowski: Software Engineer, GetInData, OpenLineage contributor
- Julien Le Dem: OpenLineage Project lead
- Mandy Chessel: Egeria Project Lead
- Willy Lulciuc: Co-creator of Marquez
- And:
- Michael Robinson: Dev Rel Engineer
- Ross Turk: VP of Marketing, Datakin
- Minkyu Park: Senior Software Engineer, Datakin
- Srikanth Venkat: Product Manager, Privacera
- John Thomas: Support Engineer, Datakin
- Will Johnson: Senior Cloud Solution Architect, Azure Cloud, Microsoft
- Paweł Leszczyński, Software Engineer, GetinData
- Sheeri Cabral, Product Manager, Collibra
- Michal Bartos, Software Engineer, MANTA
- Chandru
- Caroline Fahrenkrog, Product Manager, MANTA Scanners
- John Montroy, Backend Engineer
Agenda:
- New committers [Julien]
- Release overview (0.6.0-0.6.1) [Michael R.]
- Process for blog posts [Ross]
- Retrospective: Spark integration [Willy et al.]
- Open discussion
Notes:
- New committers [Julien]
- 4 new committers were voted in last week
- We had fallen behind
- Congratulations to all
- Release overview (0.6.0-0.6.1) [Michael R.]
- Added
- Extract source code of PythonOperator code similar to SQL facet @mobuchowski (0.6.0)
- Airflow: extract source code from BashOperator @mobuchowski (0.6.0)
- These first two additions are similar to SQL facet
- Offer the ability to see top-level code
- Add DatasetLifecycleStateDatasetFacet to spec @pawel-big-lebowski (0.6.0)
- Captures when someone is conducting dataset operations (overwrite, create, etc.)
- Add generic facet to collect environmental properties (EnvironmentFacet) @harishsune (0.6.0)
- Collects environment variables
- Depends on Databricks runtime but can be reused in other environments
- OpenLineage sensor for OpenLineage-Dagster integration @dalinkim (0.6.0)
- The first iteration of the Dagster integration to get lineage from Dagster
- Java-client: make generator generate enums as well @pawel-big-lebowski (0.6.0)
- Small addition to Java client feat. better types; was string
- Fixed
- Airflow: increase import timeout in tests, fix exit from integration @mobuchowski (0.6.0)
- The former was a particular issue with the Great Expectations integration
- Airflow: increase import timeout in tests, fix exit from integration @mobuchowski (0.6.0)
- Reduce logging level for import errors to info @rossturk (0.6.0)
- Airflow users were seeing warnings about missing packages if they weren't using a part of an integration
- This fix reduced the level to Info
- Remove AWS secret keys and extraneous Snowflake parameters from connection URI @collado-mike (0.6.0)
- Parses Snowflake connection URIs to exclude some parameters that broke lineage or posed security concerns (e.g., login data)
- Some keys are Snowflake-specific, but more can be added from other data sources
- Convert to LifecycleStateChangeDatasetFacet @pawel-big-lebowski (0.6.0)
- Mandates the LifecycleStateChange facet from the global spec rather than the custom tableStateChange facet used in the past
- Catch possible failures when emitting events and log them @mobuchowski (0.6.1)
- Previously when an OL event failed to emit, this could break an integration
- This fix catches possible failures and logs them
- Reduce logging level for import errors to info @rossturk (0.6.0)
- Added
- Process for blog posts [Ross]
- Moving the process to Github Issues
Follow release tracker there
Go to https://github.com/OpenLineage/website/tree/main/contents/blog to create posts
No one will have a monopoly
Proposals for blog posts also welcome and we can support your efforts with outlines, feedback
Throw your ideas on the issue tracker on Github
- Retrospective: Spark integration [Willy et al.]
Willy: originally this part of Marquez – the inspiration behind OL
OL was prototyped in Marquez with a few integrations, one of which was Spark (other: Airflow)
Donated the integration to OL
Srikanth: #559 very helpful to Azure
Pawel: is anything missing from the Spark integration? E.g., column-level lineage?
Will: yes to column-level; also, delta tables are an issue due to complexity; Spark 3.2 support also welcome
Maciej: should be more active about tracking projects we have integrations with; add to test matrix
Julien: let’s open some issues to address these
- Open Discussion
- Flink updates? [Julien]
Maciej: initial exploration is done
challenge: Flink has 4 APIs
prioritizing Kafka lineage currently because most jobs are writing to/from Kafka
track this on Github milestones, contribute, ask questions there
Will: can you share thoughts on the data model? How would this show up in MZ? How often are you emitting lineage?
Maciej: trying to model entire Flink run as one event
Srikanth: proposed two separate streams, one for data updates and one for metadata
Julien: do we have an issue on this topic in the repo?
Michael C.: only a general proposal doc, not one on the overall strategy; this worth a proposal doc
Julien: see notes for ticket number; MC will create the ticket
Srikanth: we can collaborate offline
- New committers
- Release overview (0.6.0-0.6.1)
- Process for blog posts
- Retrospective: Spark integration
- Open discussion
- Flink updates? [Julien]
Feb 9th 2022 (9am PT)
Attendees:
...
- And:
- Michael Robinson: Dev Rel Engineer
- Ross Turk: VP of Marketing, Datakin
- Minkyu Park: Senior Software Engineer, Datakin
- Srikanth Venkat: Product Manager, Privacera
- John Thomas: Support Engineer, Datakin
- Peter Scharling: EI Group
- Peter Hicks: Senior Software Engineer, Datakin
- Dalin Kim: Data Engineer, Northwestern Mutual
- Kevin Mellott: Data Engineer, Northwestern Mutual
- Will Johnson: Senior Cloud Solution Architect, Azure Cloud, Microsoft
- Kelsy Brennan: EI Group
- Aaron Colcord: Data Engineer, Northwestern Mutual
Agenda:
- OpenLineage recent release overview (0.5.1) [Julien]
- TaskInstanceListener now official way to integrate with Airflow [Julien]
- Apache Flink integration [Julien]
- Dagster integration demo [Dalin]
- Open Discussion
Notes:
- OpenLineage recent release overview (0.5.1) [Julien]
- No 0.5.0 due to bug
- Support for dbt-spark adapter
- New backend to proxy OL events
- Support for custom facets
- TaskInstanceListener now official way to integrate with Airflow [Julien]
- Integration runs on worker side
- Will be in next OL release of airflow (2.3)
- Thanks to Maciej for his work on this
- Apache Flink integration [Julien]
- Ticket for discussion available
- Integration test setup
- Early stages
- Dagster integration demo [Dalin]
- Initiated by Dalin Kim
- OL used with Dagster on orchestration layer
- Utilizes Dagster sensor
- Introduces OL sensor that can be added to Dagster repo definition
- Uses cursor to keep track of ID
- Looking for feedback after review complete
- Discussion:
- Dalin: needed: way to interpret Dagster asset for OL
- Julien: common code from Great Expectations/Dagster integrations
- Michael C: do you pass parent run ID in child job when sending the job to MZ?
- Hierarchy can be extended indefinitely – parent/child relationship can be modeled
- Maciej: the sensor kept failing – does this mean the events persisted despite being down?
- Dalin: yes - the sensor’s cursor is tracked, so even if repo goes down it should be able to pick up from last cursor
- Dalin: hoping for more feedback
- Julien: slides will be posted on slack channel, also tickets
- Open discussion
- Will: how is OL ensuring consistency of datasets across integrations?
- Julien: (jokingly) Read the docs! Naming conventions for datasets can be found there
- Julien: need for tutorial on creating integrations
- Srikanth: have done some of this work in Atlas
- Kevin: are there libraries on the horizon to play this role? (Julien: yes)
- Srikanth: it would be good to have model spec to provide enforceable standard
- Julien: agreed; currently models are based on the JSON schema spec
- Julien: contributions welcome; opening a ticket about this makes sense
- Will: Flink integration: MZ focused on batch jobs
- Julien: we want to make sure we need to add checkpointing
- Julien: there will be discussion in OLMZ communities about this
- In MZ, there are questions about what counts as a version or not
- Julien: a consistent model is needed
- Julien: one solution being looked into is Arrow
- Julien: everyone should feel welcome to propose agenda items (even old projects)
- Srikanth: who are you working with on the Flink comms side? Will get back to you.
...