...

  • TSC:
    • Mike Collado, Staff Software Engineer, Astronomer
    • Julien Le Dem, OpenLineage Project lead
    • Willy Lulciuc, Co-creator of Marquez
    • Michael Robinson, Software Engineer, Dev. Rel., Astronomer
    • Maciej Obuchowski, Software Engineer, GetInData, OpenLineage contributor
    • Mandy Chessell, Egeria Project Lead
    • Daniel Henneberger, Database Engineer
    • Will Johnson, Senior Cloud Solution Architect, Azure Cloud, Microsoft
    • Jakub "Kuba" Dardziński, Software Engineer, GetInData, OpenLineage contributor
  • And:
    • Petr Hajek, Information Management Professional, Profinit
    • Harel Shein, Director of Engineering, Astronomer
    • Minkyu Park, Senior Software Engineer, Astronomer
    • Sam Holmberg, Software Engineer, Astronomer
    • Ernie Ostic, SVP of Product, MANTA
    • Sheeri Cabral, Technical Product Manager, Lineage, Collibra
    • John Thomas, Software Engineer, Dev. Rel., Astronomer
    • Brahma Aelem, BigData/Cloud/ML and AI Architect, Tiger Analytics

Agenda:

  • Announcements
  • Recent release 0.19.2
  • Update on column-level lineage
  • Overview of recent improvements to the Airflow integration
  • Discussion topic: real-world implementation of OpenLineage (i.e., "What IS lineage, anyway?")
  • Announcement & discussion topic: the thinking behind namespaces

...

  • Announcements
    • OpenLineage earned Incubation status with the LF AI & Data Foundation at their December TAC meeting!
      • Represents our maturation in terms of governance, code quality assurance practices, documentation, and more
      • Required earning the OpenSSF Silver Badge, sponsorship, and at least 300 GitHub stars
      • Next up: Graduation (expected in early summer)
  • Recent release 0.19.2 [Michael R.]
  • Column-level lineage update [Maciej]
    • What is the OpenLineage SQL parser?
      • At its core, it’s a Rust library that parses SQL statements and extracts lineage data from them
      • An 80/20 solution: we won’t be able to parse all possible SQL statements, because each database has custom extensions and different syntax, so we focus on standard SQL.
      • Good example of complicated extension: Snowflake COPY INTO https://docs.snowflake.com/en/sql-reference/sql/copy-into-table.html
      • We primarily use the parser in the Airflow and Great Expectations integrations
      • Why? Airflow does not “understand” a lot of what some operators do, for example the PostgreSqlOperator
      • We also have a Java support package for the parser
    • What changed previously?
      • The parser in the current release can emit column-level lineage!
      • At the last OL meeting, Piotr Wojtczak, the primary author of this change, presented the new parser core that enabled this functionality:
        https://www.youtube.com/watch?v=Lv_bODeAVYQ
      • Still, the fact that the Rust code can do this does not mean we get it for free everywhere
    • What has changed recently?
      • We wrote “glue code” that allows us to use the new parser constructs in the Airflow integration
      • Error handling just got much easier: the SQL parser can “partially” parse a SQL construct and report the errors it encountered, along with the particular statements that caused them (see the sketch after this list)
    • Usage
      • Airflow integration extractors based on SqlExtractor (e.g., PostgreSqlExtractor, SnowflakeExtractor, TrinoExtractor…) are now able to extract column-level lineage
      • Near future: the Spark integration will be able to extract lineage from JDBCRelation.
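
    A minimal sketch of what using the parser looks like from Python, via its Python bindings (the openlineage-sql package). The function and attribute names below follow the bindings’ general shape but are assumptions from memory and may differ between releases:

        from openlineage_sql import parse

        # parse() takes a list of SQL statements and returns the metadata
        # extracted from them, including partial results plus any errors
        metadata = parse(["INSERT INTO dst SELECT a, b FROM src"])

        print(metadata.in_tables)       # tables read from, e.g. src
        print(metadata.out_tables)      # tables written to, e.g. dst
        print(metadata.column_lineage)  # column-level lineage, e.g. dst.a <- src.a
        print(metadata.errors)          # statements the parser could not handle
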
  • Recent improvements to the Airflow integration [Kuba]
    • OpenLineage facets
      • Facets are pieces of metadata that can be attached to the core entities: run, job, or dataset
      • Facets provide context to OpenLineage events
      • They can be defined either as part of the OpenLineage spec or as custom facets (a sketch of a custom facet follows this section)
    • Airflow generic facet
      • Previously, there were multiple custom facets with no standard
        • AirflowVersionRunFacet is an example of a rapidly growing facet full of version-unrelated information
      • Introduced AirflowRunFacet with Task, DAG, TaskInstance, and DagRun properties
      • Old facets are going to be deprecated soon; currently, both old and new facets are emitted
        • AirflowRunArgsRunFacet, AirflowVersionRunFacet, and AirflowMappedTaskRunFacet will be removed
        • All information from the above is moved to AirflowRunFacet
    • Other improvements (added in 0.19.2)
      • SQL extractors now send column-level lineage metadata
      • Further facets standardization
        • Introduced ProcessingEngineRunFacet
          • provides processing engine information, e.g. Airflow or Spark version
        • Improved support for nominal start & end times
          • makes use of data interval (introduced in Airflow 2.x)
          • nominal end time now matches next schedule time
        • DAG owner added to OwnershipJobFacet
        • Added support for S3FileTransformOperator and TrinoOperator (@sekikn’s great contribution)
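
    A hedged sketch of how a custom run facet can be defined and attached with the Python client (openlineage-python). MyPipelineRunFacet and its fields are hypothetical, and the base-class API shown here may differ between client versions:

        from uuid import uuid4

        import attr
        from openlineage.client.facet import BaseFacet
        from openlineage.client.run import Run

        @attr.s
        class MyPipelineRunFacet(BaseFacet):
            # hypothetical fields, analogous to how AirflowRunFacet carries
            # Task, DAG, TaskInstance, and DagRun properties
            deploymentId: str = attr.ib()
            retryCount: int = attr.ib()

            @staticmethod
            def _get_schema() -> str:
                # custom facets reference their own JSON schema
                return "https://example.com/schemas/MyPipelineRunFacet.json"

        # attach the facet to a run under a producer-chosen key
        run = Run(
            runId=str(uuid4()),
            facets={"myPipeline": MyPipelineRunFacet(deploymentId="d-123", retryCount=0)},
        )
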
  • Discussion: what does it mean to implement the spec? [Sheeri]
    • What does it mean to meet the spec?
      • 100% compliance is not required
      • OL ecosystem page
        • doesn't say exactly what each tool does
        • operational lineage is not well defined
        • what does a payload look like? this info is hard to find
      • Compatibility between producers/consumers is unclear
    • Important if standard is to be adopted widely [Mandy]
      • Egeria: uses compliance test with reports and badging; clarifies compatibility
      • test and test cases available in the Egeria repo, including profiles and clear rules about compliant ways to support Egeria
      • a badly behaving producer or consumer will create problems
      • have to be able to trust what you get
    • What about consumers? [Mike C.]
      • can we determine if they have done the correct thing with facets? [John]
      • what do we call "compliant"?
      • custom facets shouldn't be subject to this – they are by definition custom (and private) [Maciej]
      • only complete events (not start events) should be required – start events not desired outside of operational use cases [Maciej]
    • There's a simple baseline on the one hand and facets on the other [Julien]
    • Note: perfection isn't the goal
      • instead: shared test cases, data such as sample schema that can be tested against
    • Marquez doesn't explain which facets it's using or how [Willy]
      • communication by consumers could be better
    • Effort at documenting this: matrix [Julien]
    • How would we define failing tests? [Maciej]
      • at a minimum we could have a validation mode [Julien] (see the sketch after these notes)
      • challenge: the spec is always moving, growing [Maciej]
      • e.g., in the case of JSON schema validation, facets are versioned individually, but there's a versioned reference schema that might not be the current schema. Facets can be dereferenced, but the right way to do this is not clear [Danny]
      • one solution could be to split out base types, or we could add a tool that would force us to clean this up
      • client-side proxy presents same problem; tried different validators in Go; a workaround is to validate against the main doc first; by continually validating against the client proxy we can make sure it stays compliant with the spec [Minkyu]
      • if Marquez says it's "OK," it's OK; we've been doing it manually [Mandy]
      • Marquez doesn't do any validation for consumers [Mike C.]
      • manual validation is not good enough [Mandy]
      • I like the idea of compliance badges – it would be cool if we had a way to validate consumers, a way to prove compliance, and a way to extend validation to integrations like the Airflow integration [Mike C.]
    • Let's follow up on Slack and use the notes from this discussion to collaborate on a proposal [Julien]
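
    One concrete shape the validation-mode idea could take, sketched with the jsonschema Python library; the pinned spec URL and the payload below are illustrative assumptions, not an agreed design:

        import json
        import urllib.request

        import jsonschema

        # one published version of the core spec; facets are versioned
        # individually, which is the dereferencing problem noted above
        SPEC_URL = "https://openlineage.io/spec/1-0-2/OpenLineage.json"

        with urllib.request.urlopen(SPEC_URL) as response:
            schema = json.load(response)

        event = {
            "eventType": "COMPLETE",
            "eventTime": "2023-01-19T10:00:00Z",
            "run": {"runId": "0176a8c2-fe01-7439-87e6-56a1a1b4029f"},
            "job": {"namespace": "example-namespace", "name": "example-job"},
            "producer": "https://example.com/producer",
            "schemaURL": SPEC_URL,
        }

        # raises jsonschema.ValidationError if the event does not conform
        jsonschema.validate(instance=event, schema=schema)
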


December 8, 2022 (10am PT)

...