...

  • Recent releases [Michael R.]
  • Rust implementation of the SQL integration [Piotr]
    • About me: dev with GetInData
    • Goal of project: to make it easier to add support for more languages in the future
    • Separated into components: a separate backend package that integrates with the language bindings, plus a new Java interface
    • Components
      • openlineage_sql: main implementation with table + column lineage extraction
      • openlineage_sql_python: Python bindings, uses the pyo3 crate, produces a Python wheel
      • openlineage_sql_java: Java bindings, using JNI, produces a jar
    • Changes
      • switch to a visitor pattern to traverse the AST
      • introduce Context Frames (like scopes) to resolve aliases, implicit contexts and shadowing
      • column lineage is a synthesized attribute over the tree – easy to compute with a visitor
    • Demo
    • Shout outs
      • Maciej Obuchowski (@mobuchowski)
      • Will Johnson (@wjohnson)
      • Hannah Moazam (@hmoazam)
  • Open discussion
    • Spark implementation: where do deps need to be added? [Will]
      • it depends on which sub-project you want to modify
      • if you want to modify all of them, import the dependency in shared
    • Implementing the spec discussion [Sheeri]
      • 100% compliance is not required – it's a spec, after all, just like "standard" SQL
      • bottom line: compatibility between producers and consumers
      • minimum viable lineage
        • at least one circle
        • zero or more lines
        • associated information
      • data model: event runs a job on a dataset
      • What's required by the spec?
        • run: UUID
        • run state: transition, event time
        • job: namespace, job name
        • datasets: namespace, dataset name
      • But what is a run?
        • all the events for one UUID
      • Necessary per run:
        • at least one box
        • at least one line
        • everything else is optional
          • eventTime, etc.
      • OL query example:
        • run ID required for a run (but not a job, which can/should be a view)
        • inputs
        • outputs
        • producer
        • schemaURL
        • start event
        • complete event
      • Needed: discussion of what it means to be compliant with the spec, perhaps a test/self-test
        • maybe the test outputs categories (e.g., "design lineage") for compatibility between producers and consumers
      • Following up on main threads here [Julien]:
        • create Slack channel, Google docs
          • Sheeri will take the lead
          • we'll write a proposal that we eventually add to the spec
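
As a rough illustration of the "What's required by the spec?" and "OL query example" items above, here is a minimal sketch of a start and a complete event for one run, built as plain Python dicts. The field names follow the OpenLineage run event layout; the producer URL, the schemaURL version, and the namespace, job and dataset names are placeholder assumptions, not values from the meeting.

```python
import json
import uuid
from datetime import datetime, timezone

# Placeholder values (assumptions for illustration only).
PRODUCER = "https://example.com/my-producer"
SCHEMA_URL = "https://openlineage.io/spec/1-0-5/OpenLineage.json#/definitions/RunEvent"


def make_event(event_type, run_id, job, inputs=(), outputs=()):
    """Build one run event carrying the fields listed in the notes."""
    return {
        "eventType": event_type,                              # run state transition
        "eventTime": datetime.now(timezone.utc).isoformat(),  # event time
        "run": {"runId": run_id},                             # run: UUID
        "job": job,                                           # job: namespace, name
        "inputs": list(inputs),                               # datasets: namespace, name
        "outputs": list(outputs),
        "producer": PRODUCER,
        "schemaURL": SCHEMA_URL,
    }


def events_for_run(events, run_id):
    """A run, as described in the notes: all the events for one UUID."""
    return [e for e in events if e["run"]["runId"] == run_id]


run_id = str(uuid.uuid4())
job = {"namespace": "example_namespace", "name": "example_job"}
source = {"namespace": "example_db", "name": "public.orders"}
target = {"namespace": "example_db", "name": "public.order_totals"}

start = make_event("START", run_id, job, inputs=[source])
complete = make_event("COMPLETE", run_id, job, inputs=[source], outputs=[target])

print(json.dumps(complete, indent=2))
```

Both events share the same runId, so `events_for_run([start, complete], run_id)` returns the pair that together make up the run.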

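The "Changes" items under the Rust SQL implementation above (visitor traversal, Context Frames, column lineage as a synthesized attribute) can be sketched in Python as follows. This is a toy illustration under stated assumptions, not the openlineage_sql code: the AST node classes, the `LineageVisitor` name and the frame layout are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


# Hypothetical toy AST, just large enough to show the idea.
@dataclass
class Table:
    name: str
    alias: Optional[str] = None


@dataclass
class ColumnRef:
    name: str
    qualifier: Optional[str] = None


@dataclass
class BinaryOp:
    left: object
    right: object


@dataclass
class Select:
    projections: list  # (output_name, expression) pairs
    sources: list      # Table or Subquery nodes


@dataclass
class Subquery:
    select: Select
    alias: str


class LineageVisitor:
    """Returns, for each output column, the set of (table, column) it derives from."""

    def __init__(self):
        self.frames = []  # stack of context frames: {alias: table name or subquery lineage}

    def visit(self, node):
        return getattr(self, "visit_" + type(node).__name__)(node)

    def visit_Select(self, node):
        frame = {}
        for src in node.sources:
            if isinstance(src, Table):
                frame[src.alias or src.name] = src.name
            else:  # Subquery: bind its alias to the lineage of its own columns
                frame[src.alias] = self.visit(src.select)
        self.frames.append(frame)
        try:
            return {out: self.visit(expr) for out, expr in node.projections}
        finally:
            self.frames.pop()

    def visit_BinaryOp(self, node):
        # Synthesized attribute: a node's lineage is the union of its children's.
        return self.visit(node.left) | self.visit(node.right)

    def visit_ColumnRef(self, node):
        # Innermost frame first, so inner aliases shadow outer ones.
        for frame in reversed(self.frames):
            if node.qualifier is None and len(frame) == 1:
                binding = next(iter(frame.values()))  # implicit context
            elif node.qualifier in frame:
                binding = frame[node.qualifier]
            else:
                continue
            if isinstance(binding, str):
                return {(binding, node.name)}
            return binding[node.name]
        raise KeyError(f"cannot resolve column {node.name!r}")


# SELECT t.total AS amount
# FROM (SELECT o.price + o.tax AS total FROM orders AS o) AS t
inner = Select(
    projections=[("total", BinaryOp(ColumnRef("price", "o"), ColumnRef("tax", "o")))],
    sources=[Table("orders", alias="o")],
)
outer = Select(
    projections=[("amount", ColumnRef("total", "t"))],
    sources=[Subquery(inner, alias="t")],
)
lineage = LineageVisitor().visit(outer)
print(lineage)
```

Here `amount` resolves through the alias `t` to the subquery's `total` column, and from there to `orders.price` and `orders.tax`.
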
November 10, 2022 (10am PT)

...