...
All are welcome.
Next meeting: May 11th, 2022 (9am PT)
Apr 13th, 2022 (9am PT)
Attendees:
- TSC:
- Maciej Obuchowski: Software Engineer, GetInData, OpenLineage contributor
- Julien Le Dem: OpenLineage Project lead
- Mandy Chessell: Egeria Project Lead
- Willy Lulciuc: Co-creator of Marquez
- And:
- Sheeri Cabral: Project Manager, Lineage, Collibra
- Michael Robinson: Software Engineer, Developer Relations, Astronomer
- John Thomas: Support Engineer, Astronomer
- Ross Turk: Senior Director of Community, Astronomer
- Minkyu Park: Senior Software Engineer, Astronomer
- Ernie Ostic: SVP of Product, Manta
- Kelsy Brennan: Lead Developer, Environmental Intelligence Group
- Dalin Kim: Data Engineer, Northwestern Mutual
- Will Johnson: Microsoft, OL contributor
- Jorge
- Jakub Moravec: Software Architect, Manta
- Chandru Sugunan: Product Manager, Azure Cloud, Microsoft
Agenda:
- 0.6.2 release overview [Michael R.]
- Transports in OpenLineage clients [Maciej]
- Airflow integration update [Maciej]
- Dagster integration retrospective [Dalin]
- Open discussion
Notes:
- Introductions
- Communication channels overview [Julien]
- Agenda overview [Julien]
- 0.6.2 release overview [Michael R.]
Added
- CI: add integration tests for Airflow's SnowflakeOperator and dbt-Snowflake @mobuchowski
- #611
- Workaround necessitated by the fact we have only 1 schema in the Snowflake db
- This creates conflicts between different Airflow versions
- By contrast: in BigQuery, different schemas are prefixed with Airflow versions
- Introduce DatasetVersion facet in spec @pawel-big-lebowski
- #580
- Problem: the spec did not support dataset versioning (which is needed for providers like Iceberg, Delta)
- Solution: this change introduced a DatasetVersionFacet in spec
- Airflow: add external query ID facet @mobuchowski
- #546
- Issue: jobs that ran on external systems like BigQuery or Snowflake were identified by their query IDs.
- This change added a facet that exposes this collected query ID, so that an OpenLineage job run can be associated with that external job.
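For illustration, the two facets added in this release might appear in an event roughly as follows. This is a minimal sketch using plain Python dictionaries, not the client library. The facet keys (`version`, `externalQuery`) and field names (`datasetVersion`, `externalQueryId`, `source`) are assumptions based on a reading of the spec and should be checked against it; required bookkeeping fields are omitted, and every namespace, name, and ID is made up.

```python
import json


def dataset_with_version(namespace: str, name: str, version: str) -> dict:
    """Build a dataset entry that carries a dataset version facet."""
    return {
        "namespace": namespace,
        "name": name,
        # Assumed facet key and field name; verify against the spec.
        "facets": {"version": {"datasetVersion": version}},
    }


def run_with_external_query(run_id: str, query_id: str, source: str) -> dict:
    """Build a run entry that exposes the query ID of the external system."""
    return {
        "runId": run_id,
        # Assumed facet key and field names; verify against the spec.
        "facets": {
            "externalQuery": {"externalQueryId": query_id, "source": source}
        },
    }


# All values below are hypothetical.
event = {
    "eventType": "COMPLETE",
    "run": run_with_external_query(
        "hypothetical-run-id", "hypothetical-query-id", "bigquery"
    ),
    "job": {"namespace": "example", "name": "daily_load"},
    "outputs": [dataset_with_version("example-warehouse", "db.orders", "42")],
}

print(json.dumps(event, indent=2))
```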
Fixed
- Complete Fix of Snowflake Extractor get_hook() Bug @denimalpaca
- #589
- In #507, an incorrect fix was made to the Snowflake Extractor to allow for the operator's new get_db_hook() method.
- Solution: this change checks whether the underlying Operator has a get_db_hook() method; the extractor's get_hook() then calls whichever of the two methods the Operator actually provides
- Update artwork @rossturk
- #605
- This change updated artwork in the README.md with the latest versions from recent presentations and other sources.
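The Snowflake Extractor fix in #589 follows a common pattern: check for the newer method and fall back to the older one. The sketch below shows that pattern only; it is not the extractor's actual code, and the two operator classes are hypothetical stand-ins.

```python
class OldStyleOperator:
    """Hypothetical operator that only exposes get_hook()."""

    def get_hook(self):
        return "hook from get_hook()"


class NewStyleOperator:
    """Hypothetical operator that exposes the newer get_db_hook()."""

    def get_db_hook(self):
        return "hook from get_db_hook()"


def get_hook(operator):
    """Return the operator's hook, preferring get_db_hook() when it exists."""
    if hasattr(operator, "get_db_hook"):
        return operator.get_db_hook()
    return operator.get_hook()


print(get_hook(OldStyleOperator()))
print(get_hook(NewStyleOperator()))
```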
- Transports in OpenLineage clients [Maciej]
- Currently, OL clients can only send events over HTTP
- Common request: ability to send events to Kafka
- This feature will offer a language-independent solution
- Status: Python client implementation merged, Java implementation close to being merged
- Timeline: next release (0.7.0)
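The idea of a transport can be sketched as a small interface that the client delegates to. This is an illustration of the concept only, not the API of the Python or Java client: all class names are hypothetical, and the actual network calls are injected as plain callables so the example runs without an HTTP server or a Kafka broker.

```python
import json
from abc import ABC, abstractmethod
from typing import Callable, Dict


class Transport(ABC):
    """Hypothetical interface: anything that can deliver one event."""

    @abstractmethod
    def emit(self, event: Dict) -> None:
        ...


class HttpLikeTransport(Transport):
    """Delivers events as JSON text to a URL via an injected post function."""

    def __init__(self, url: str, post: Callable[[str, str], None]):
        self.url = url
        self.post = post

    def emit(self, event: Dict) -> None:
        self.post(self.url, json.dumps(event))


class KafkaLikeTransport(Transport):
    """Delivers events as JSON bytes to a topic via an injected producer."""

    def __init__(self, topic: str, produce: Callable[[str, bytes], None]):
        self.topic = topic
        self.produce = produce

    def emit(self, event: Dict) -> None:
        self.produce(self.topic, json.dumps(event).encode("utf-8"))


class Client:
    """The client only knows about the Transport interface."""

    def __init__(self, transport: Transport):
        self.transport = transport

    def emit(self, event: Dict) -> None:
        self.transport.emit(event)


# Capture deliveries in lists instead of talking to real services.
http_sent, kafka_sent = [], []
http_client = Client(
    HttpLikeTransport("http://localhost:5000", lambda u, b: http_sent.append((u, b)))
)
kafka_client = Client(
    KafkaLikeTransport("lineage-events", lambda t, b: kafka_sent.append((t, b)))
)

sample = {"eventType": "START", "job": {"namespace": "example", "name": "job"}}
http_client.emit(sample)
kafka_client.emit(sample)
```

Because the client depends only on the interface, swapping HTTP for Kafka becomes a configuration choice rather than a code change in each integration.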
- Airflow integration [Maciej]
- TaskInstance listener-based plugin not ready yet
- Status: waiting for Airflow 2.3 to be merged (due by April 18, 2022)
- Ready upon Airflow 2.3 release
- New SQL parser
- Used in Snowflake, Postgres, GE integrations
- Missing: API for SQL queries
- The former SQL parser was based on guesswork and a fragile reliance on language patterns
- Solution: AST (abstract syntax trees), not guesswork
- Features strong typing, Enums, encapsulation
- Language: Rust
- Disadvantages: additional language, distribution
- Advantages: high-quality libraries, possible new applications, e.g. Spark
- Unified API: previous implementation still exists for users of older architectures
- Utilizable in Java
- Makes all tasks using SQL easier
- Will J.: can I inject a different SQL parser that I want to use?
- Unified API would make this possible
- Goal is to work with different dialects, implementations
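The answer to Will J.'s question can be pictured as a parser hidden behind one small interface, so that callers never depend on a particular implementation. The sketch below is an assumption about what such an interface could look like, not the project's API: `SqlMeta`, `SqlParser`, and `extract_lineage` are hypothetical names, and the regex parser is a deliberately naive stand-in of the pattern-matching kind that the AST-based parser is meant to replace.

```python
import re
from dataclasses import dataclass, field
from typing import List, Protocol


@dataclass
class SqlMeta:
    """Tables read and written by one SQL statement."""

    in_tables: List[str] = field(default_factory=list)
    out_tables: List[str] = field(default_factory=list)


class SqlParser(Protocol):
    """Hypothetical unified interface that any parser can satisfy."""

    def parse(self, sql: str) -> SqlMeta:
        ...


class NaiveRegexParser:
    """Pattern-based stand-in; fragile on subqueries, CTEs, and dialects."""

    _INPUT = re.compile(r"\b(?:FROM|JOIN)\s+([\w.]+)", re.IGNORECASE)
    _OUTPUT = re.compile(r"\bINSERT\s+INTO\s+([\w.]+)", re.IGNORECASE)

    def parse(self, sql: str) -> SqlMeta:
        return SqlMeta(
            in_tables=self._INPUT.findall(sql),
            out_tables=self._OUTPUT.findall(sql),
        )


def extract_lineage(sql: str, parser: SqlParser) -> SqlMeta:
    """Callers pass in whichever parser they want to use."""
    return parser.parse(sql)


meta = extract_lineage(
    "INSERT INTO reports.daily "
    "SELECT * FROM raw.orders o JOIN raw.users u ON o.uid = u.id",
    NaiveRegexParser(),
)
print(meta)
```

Any object with a matching `parse` method, including a wrapper around an AST-based parser, could be passed to `extract_lineage` in place of the regex version.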
- Dagster integration [Dalin]
- Initial proposal: use custom OL executor as thin wrapper over existing executors
- Challenges:
- OL handling tightly coupled with actual job runs
- Requires multiple custom executors to maintain flexibility
- Incomplete events (only op-level)
- Solution: use Dagster's OL sensor that tails Dagster event logs for tracking metadata
- Lessons learned:
- Non-sharded event log storage must be used for sensor to access all event logs across runs
- Sensor's cursor does not get updated on an exception. Typical use of cursors is to submit a run request while tracking some state. To guarantee atomic operation with the cursor, the cursor update gets processed only after the sensor function exits.
- Event type conversion
- Dagster event types converted to OpenLineage events
- Architecture
- Sensor is defined under a repository; the events it reads are converted and sent to the OL backend
- Lineage collected at job level only; dataset tracking being explored
- Currently, datasets are being stored as Dagster assets
- This is a manual/custom solution
- 3M event logs processed, used as part of published telemetry report
- Will J.: what's been the timeline since inception of the idea to now?
- December 2021; integrated within ~1 month's time
- Bulk of time was spent on understanding Dagster
- OL sensor is configurable and can be started late while still catching the first events
- Willy: do you remember the issue # or title you were waiting for?
- Julien: Dalin reached out on Slack initially. We started a new channel, my small contribution was to reach out to the Dagster community to facilitate collaboration; we can support new integrations in this way. Thanks to Sandy from the Dagster community for help with this.
- Don't hesitate to reach out for help!
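The cursor lesson and the event type conversion can be illustrated with a small simulation. It does not use Dagster: `SensorRunner` is a hypothetical stand-in for the framework that commits the cursor only when the sensor body exits normally, and the event type names in `EVENT_TYPE_MAP` are assumptions rather than the integration's actual mapping.

```python
from typing import Callable, Dict, List, Optional

# Assumed mapping from Dagster-style event types to OpenLineage event types.
EVENT_TYPE_MAP = {
    "PIPELINE_START": "START",
    "PIPELINE_SUCCESS": "COMPLETE",
    "PIPELINE_FAILURE": "FAIL",
}


def convert(record: Dict) -> Optional[Dict]:
    """Convert one event log record, or return None if it is not tracked."""
    ol_type = EVENT_TYPE_MAP.get(record["type"])
    if ol_type is None:
        return None
    return {"eventType": ol_type, "run": {"runId": record["run_id"]}}


def sensor_body(records: List[Dict], cursor: int, emit: Callable[[Dict], None]) -> int:
    """Process every record after the cursor and return the new cursor."""
    for idx in range(cursor, len(records)):
        event = convert(records[idx])
        if event is not None:
            emit(event)
        cursor = idx + 1
    return cursor


class SensorRunner:
    """Simulated framework: the cursor is committed only on a clean exit."""

    def __init__(self) -> None:
        self.cursor = 0

    def tick(self, records: List[Dict], emit: Callable[[Dict], None]) -> bool:
        try:
            new_cursor = sensor_body(records, self.cursor, emit)
        except Exception:
            # Cursor untouched: the same records are read again next tick.
            return False
        self.cursor = new_cursor
        return True


records = [
    {"type": "PIPELINE_START", "run_id": "r1"},
    {"type": "ENGINE_EVENT", "run_id": "r1"},
    {"type": "PIPELINE_SUCCESS", "run_id": "r1"},
]
emitted: List[Dict] = []
calls = {"n": 0}


def flaky_emit(event: Dict) -> None:
    """Fails on the second call only, to simulate a backend hiccup."""
    calls["n"] += 1
    if calls["n"] == 2:
        raise RuntimeError("backend unavailable")
    emitted.append(event)


runner = SensorRunner()
first_ok = runner.tick(records, flaky_emit)
cursor_after_failure = runner.cursor
second_ok = runner.tick(records, flaky_emit)
```

In this simulation the failed tick leaves the cursor where it was, so the retry emits the first event a second time. A sensor written this way needs either a backend that tolerates duplicates or per-record error handling inside the sensor body.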
- Open discussion
- Mandy: where do I submit my blog? Two website repos are a source of confusion.
- Julien: Ross and Michael R. can help.
- Ross: branching could solve this problem. We welcome blog posts from anyone in the community.
- Will J.: parent/child relationships in OL. Problem in Azure: Databricks connector has a parent execution inside Spark and a child execution that is not connected. Spark issues a parent ID that's not being caught. Currently using a workaround. What's the right way to emit a parent/child relationship?
- Julien: this is relevant to the ParentRunFacet in OL. Michael C. is working on this in Marquez. Recommended: create an issue about this and ping Michael C.
- Maciej: this is functional in the Airflow integration for Spark jobs.
- Julien: this issue could be documented better.
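As a rough illustration of the parent/child relationship discussed here, a child run can point at its parent through a `parent` run facet that names the parent's run ID and job. The sketch below builds that structure as a plain dictionary; the helper name and all identifiers are hypothetical, required bookkeeping fields are omitted, and the exact shape should be confirmed against the ParentRunFacet definition in the spec.

```python
import uuid


def child_run(parent_run_id: str, parent_namespace: str, parent_job: str) -> dict:
    """Build a run entry whose parent facet points at the parent run and job."""
    return {
        "runId": str(uuid.uuid4()),
        "facets": {
            "parent": {
                "run": {"runId": parent_run_id},
                "job": {"namespace": parent_namespace, "name": parent_job},
            }
        },
    }


# Hypothetical parent: a notebook job that triggers a Spark execution.
parent_run_id = str(uuid.uuid4())
run = child_run(parent_run_id, "example-workspace", "notebook_job")
print(run)
```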
Mar 9th, 2022 (9am PT)
Attendees:
...