All are welcome.

Next meeting: May 11th, 2022 (9am PT)

Apr 13th, 2022 (9am PT)

Attendees:

  • TSC:
    • Maciej Obuchowski: Software Engineer, GetInData, OpenLineage contributor
    • Julien Le Dem: OpenLineage Project lead
    • Mandy Chessell: Egeria Project Lead
    • Willy Lulciuc: Co-creator of Marquez
  • And:
    • Sheeri Cabral: Project Manager, Lineage, Collibra
    • Michael Robinson: Software Engineer, Developer Relations, Astronomer
    • John Thomas: Support Engineer, Astronomer
    • Ross Turk: Senior Director of Community, Astronomer
    • Minkyu Park: Senior Software Engineer, Astronomer
    • Ernie Ostic: SVP of Product, Manta
    • Kelsy Brennan: Lead Developer, Environmental Intelligence Group
    • Dalin Kim: Data Engineer, Northwestern Mutual
    • Will Johnson: Microsoft, OL contributor
    • Jorge
    • Jakub Moravec: Software Architect, Manta
    • Chandru Sugunan: Product Manager, Azure Cloud, Microsoft

Agenda:

  • 0.6.2 release overview [Michael R.]
  • Transports in OpenLineage clients [Maciej]
  • Airflow integration update [Maciej]
  • Dagster integration retrospective [Dalin]
  • Open discussion

Notes:

  • Introductions
  • Communication channels overview [Julien]
  • Agenda overview [Julien]
  • 0.6.2 release overview [Michael R.]

Added

    • CI: add integration tests for Airflow's SnowflakeOperator and dbt-Snowflake @mobuchowski
      • #611
      • Workaround needed because we have only one schema in the Snowflake database
      • This creates conflicts between different Airflow versions
      • By contrast: in BigQuery, different schemas are prefixed with Airflow versions
    • Introduce DatasetVersion facet in spec @pawel-big-lebowski
      • #580
      • Problem: the spec did not support dataset versioning (which is needed for providers like Iceberg, Delta)
      • Solution: this change introduced a DatasetVersionFacet in spec
    • Airflow: add external query ID facet @mobuchowski
      • #546
      • Issue: jobs that ran on external systems like BigQuery or Snowflake were identified by their query IDs.
      • This change added a facet that exposes this collected query ID, so that an OpenLineage job run can be associated with that external job.
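As a rough illustration of the two facets listed above, here is a minimal sketch of an event payload built as a plain Python dict. The facet keys and field names (`version`/`datasetVersion`, `externalQuery`/`externalQueryId`/`source`) reflect one reading of the spec and should be checked against it; the job, dataset, namespace, and producer values are hypothetical placeholders, and required facet metadata such as `_producer` and `_schemaURL` is left out for brevity.

```python
import uuid
from datetime import datetime, timezone


def build_run_event(query_id: str, dataset_version: str) -> dict:
    """Build an illustrative run event carrying both facets.

    All names and values here are placeholders for illustration only.
    """
    return {
        "eventType": "COMPLETE",
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "producer": "https://example.com/hypothetical-producer",
        "run": {
            "runId": str(uuid.uuid4()),
            "facets": {
                # Links this run to the job that ran on the external system.
                "externalQuery": {
                    "externalQueryId": query_id,
                    "source": "snowflake",
                },
            },
        },
        "job": {"namespace": "example", "name": "load_orders"},
        "outputs": [
            {
                "namespace": "example",
                "name": "orders",
                # Records which version of the dataset this run produced.
                "facets": {"version": {"datasetVersion": dataset_version}},
            }
        ],
    }


event = build_run_event(query_id="hypothetical-query-id", dataset_version="42")
```

Keeping the query ID on the run and the version on the dataset mirrors how the notes describe them: one identifies the external job, the other identifies the state of the data.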

Fixed

    • Complete Fix of Snowflake Extractor get_hook() Bug @denimalpaca
      • #589
      • In #507, an incorrect fix was made to the Snowflake Extractor to allow for the operator's new get_db_hook() method.
      • Solution: this change checks whether the underlying Operator has a get_db_hook() method, so that the extractor's get_hook() calls the correct version of the underlying method
    • Update artwork @rossturk
      • #605
      • This change updated artwork in the README.md with the latest versions from recent presentations and other sources.
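The Snowflake Extractor fix (#589) above amounts to a capability check before delegating. The sketch below shows that pattern in isolation; the operator and extractor classes are hypothetical stand-ins, not the actual Airflow or OpenLineage classes, and the real change may differ in detail.

```python
class LegacyOperator:
    """Hypothetical operator that only exposes the older get_hook()."""

    def get_hook(self):
        return "hook-from-get_hook"


class NewerOperator:
    """Hypothetical operator that also exposes the newer get_db_hook()."""

    def get_hook(self):
        return "hook-from-get_hook"

    def get_db_hook(self):
        return "hook-from-get_db_hook"


class ExtractorSketch:
    """Stand-in for an extractor that wraps an operator."""

    def __init__(self, operator):
        self.operator = operator

    def get_hook(self):
        # Prefer the newer method when the operator provides it,
        # otherwise fall back to the older one.
        if hasattr(self.operator, "get_db_hook"):
            return self.operator.get_db_hook()
        return self.operator.get_hook()


legacy_hook = ExtractorSketch(LegacyOperator()).get_hook()
newer_hook = ExtractorSketch(NewerOperator()).get_hook()
```

Checking for the method rather than for a version number keeps the extractor working across operator versions without hard-coding which release introduced the new method.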


  • Transports in OpenLineage clients [Maciej]
    • Currently, OL clients can only send events over HTTP
    • Common request: ability to send events to Kafka
    • This feature will offer a language-independent solution
    • Status: Python client implementation merged, Java implementation close to being merged
    • Timeline: next release (0.7.0)
  • Airflow integration [Maciej]
    • TaskInstance listener-based plugin not ready yet
    • Status: waiting for Airflow 2.3 to be released (due by April 18, 2022)
    • Ready upon Airflow 2.3 release
    • New SQL parser
      • Used in Snowflake, Postgres, GE integrations
      • Missing: API for SQL queries
      • The former SQL parser was based on guesswork and a fragile reliance on language patterns
      • Solution: AST (abstract syntax trees), not guesswork
      • Features strong typing, Enums, encapsulation
      • Language: Rust
        • Disadvantages: additional language, distribution
        • Advantages: high-quality libraries, possible new applications, e.g. Spark
      • Unified API: previous implementation still exists for users of older architectures
      • Utilizable in Java
      • Makes all tasks using SQL easier
      • Will J.: can I inject a different SQL parser that I want to use?
        • Unified API would make this possible
        • Goal is to work with different dialects, implementations
  • Dagster integration [Dalin]
    • Initial proposal: use custom OL executor as thin wrapper over existing executors
    • Challenges:
      • OL handling tightly coupled with actual job runs
      • Requires multiple custom executors to maintain flexibility
      • Incomplete events (only op-level)
    • Solution: use Dagster's OL sensor that tails Dagster event logs for tracking metadata
    • Lessons learned:
      • Non-sharded event log storage must be used for sensor to access all event logs across runs
      • Sensor's cursor does not get updated on an exception. Typical use of cursors is to submit a run request while tracking some state. To guarantee atomic operation with the cursor, the cursor update gets processed only after the sensor function exits.
    • Event type conversion
      • Dagster event types converted to OpenLineage events 
    • Architecture
      • Sensor is defined under a repository; Dagster events are then converted and sent to the OL backend
    • Lineage collected at job level only; dataset tracking being explored
      • Datasets are currently stored as Dagster assets
      • This is a manual/custom solution
    • 3M event logs processed, used as part of published telemetry report
    • Will J.: what's been the timeline since inception of the idea to now?
      • December 2021; integrated within ~1 month's time
      • Bulk of time was spent on understanding Dagster
      • OL sensor is configurable and can be started late while still catching the first events
    • Willy: do you remember the issue # or title you were waiting for?
    • Julien: Dalin reached out on Slack initially. We started a new channel, my small contribution was to reach out to the Dagster community to facilitate collaboration; we can support new integrations in this way. Thanks to Sandy from the Dagster community for help with this.
      • Don't hesitate to reach out for help!
  • Open discussion
    • Mandy: where do I submit my blog? Two website repos are a source of confusion.
    • Julien: Ross and Michael R. can help. 
    • Ross: branching could solve this problem. We welcome blog posts from anyone in the community.
    • Will J.: parent/child relationships in OL. Problem in Azure: Databricks connector has a parent execution inside Spark and a child execution that is not connected. Spark issues a parent ID that's not being caught. Currently using a workaround. What's the right way to emit a parent/child relationship?
      • Julien: this is relevant to the ParentRunFacet in OL. Michael C. is working on this in Marquez. Recommended: create an issue about this and ping Michael C.
      • Maciej: this is functional in the Airflow integration for Spark jobs.
      • Julien: this issue could be documented better.
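The cursor behavior noted in the Dagster lessons learned above (the cursor is not updated on an exception and is only committed after the sensor function exits) can be pictured with a small simulation. This is a sketch in plain Python under simplifying assumptions, not Dagster's sensor API: `run_sensor_tick`, the list-based event log, and the integer cursor are all hypothetical.

```python
from typing import Callable, List


def run_sensor_tick(
    event_log: List[str],
    cursor: int,
    handle: Callable[[str], None],
) -> int:
    """Process events at or after `cursor` and return the cursor to persist.

    The advanced cursor is returned only if every event was handled.
    If `handle` raises, the original cursor is returned, so the same
    events are picked up again on the next tick.
    """
    new_cursor = cursor
    try:
        for offset, event in enumerate(event_log[cursor:], start=cursor):
            handle(event)  # e.g. convert and send to a lineage backend
            new_cursor = offset + 1
    except Exception:
        return cursor  # failure: nothing is committed
    return new_cursor


log = ["RUN_START", "STEP_SUCCESS", "RUN_SUCCESS"]
sent: List[str] = []


def flaky_handler(event: str) -> None:
    # Simulates a backend that fails on the second event.
    if event == "STEP_SUCCESS":
        raise RuntimeError("backend unavailable")
    sent.append(event)


cursor_after_failure = run_sensor_tick(log, 0, flaky_handler)
cursor_after_retry = run_sensor_tick(log, cursor_after_failure, sent.append)
```

In this sketch the first event is sent twice, once before the failure and once on the retry, so a consumer of such events would need to tolerate duplicates.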

Mar 9th, 2022 (9am PT)

Attendees:

...