  • Announcements
    • OpenLineage earned Incubation status with the LFAI & Data Foundation at their December TAC meeting!
      • Represents our maturation in terms of governance, code quality assurance practices, documentation, and more
      • Required earning the OpenSSF Silver Badge, sponsorship, and at least 300 GitHub stars
      • Next up: Graduation (expected in early summer)
  • Recent release 0.19.2 [Michael R.]
  • Column-level lineage update [Maciej]
    • What is the OpenLineage SQL parser?
      • At its core, it’s a Rust library that parses SQL statements and extracts lineage data from them
      • An 80/20 solution: we will not be able to parse every possible SQL statement, because each database has custom extensions and its own syntax, so we focus on standard SQL.
      • A good example of a complicated extension: Snowflake COPY INTO https://docs.snowflake.com/en/sql-reference/sql/copy-into-table.html
      • We primarily use the parser in the Airflow and Great Expectations integrations
      • Why? Airflow does not “understand” much of what some operators do, for example PostgreSqlOperator, so lineage has to come from parsing the SQL the operator runs
      • We also have a Java support package for the parser
    • What changed previously?
      • The parser in the current release can emit column-level lineage!
      • At the last OL meeting, Piotr Wojtczak, the primary author of this change, presented the new parser core that enables this functionality
        https://www.youtube.com/watch?v=Lv_bODeAVYQ
      • Still, the fact that the Rust code can do this does not mean we get it for free everywhere
    • What has changed recently?
      • We wrote “glue code” that lets us use the new parser constructs in the Airflow integration
      • Error handling just got much easier: the SQL parser can “partially” parse a SQL construct and report the errors it encountered, along with the particular statements that caused them.
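
The partial-parsing behaviour can be pictured with a toy sketch. This is not the Rust parser or its API: `parse_statements`, `ParseResult`, and the regex are hypothetical stand-ins that only recognise `INSERT INTO ... SELECT ... FROM ...`. What matters is the shape of the result: lineage from the statements that parsed, plus each error paired with the statement that caused it.

```python
import re
from dataclasses import dataclass, field

# Toy pattern (assumption): only "INSERT INTO <table> SELECT ... FROM <table>" is understood.
_INSERT_SELECT = re.compile(
    r"^\s*INSERT\s+INTO\s+(\w+(?:\.\w+)*)\s+SELECT\s+.+?\s+FROM\s+(\w+(?:\.\w+)*)",
    re.IGNORECASE | re.DOTALL,
)


@dataclass
class ParseResult:
    """Hypothetical result type: tables found so far plus per-statement errors."""
    in_tables: list = field(default_factory=list)
    out_tables: list = field(default_factory=list)
    errors: list = field(default_factory=list)  # (statement index, statement, message)


def parse_statements(sql: str) -> ParseResult:
    result = ParseResult()
    # Naive split on ";" (does not handle semicolons inside string literals).
    statements = [s.strip() for s in sql.split(";") if s.strip()]
    for index, statement in enumerate(statements):
        match = _INSERT_SELECT.match(statement)
        if match is None:
            # Record the failure and keep going instead of aborting the whole parse.
            result.errors.append((index, statement, "unsupported or invalid statement"))
            continue
        out_table, in_table = match.groups()
        result.out_tables.append(out_table)
        result.in_tables.append(in_table)
    return result


result = parse_statements(
    "INSERT INTO reports.daily SELECT id, total FROM sales.orders;"
    "COPY INTO reports.daily FROM @stage;"
)
for index, statement, message in result.errors:
    print(f"statement {index}: {message}: {statement}")
```

Here the first statement still yields lineage even though the second one (a vendor-specific `COPY INTO`) is reported as an error.
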
    • Usage
      • Airflow integration extractors based on SqlExtractor (e.g. PostgreSqlExtractor, SnowflakeExtractor, TrinoExtractor…) are now able to extract column-level lineage
      • Near future: Spark will be able to extract lineage from JDBCRelation.
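
As a rough illustration of what column-level lineage looks like once extracted, the sketch below builds a dictionary shaped like the `columnLineage` dataset facet from the OpenLineage spec (`fields` mapping each output column to its `inputFields`). The helper name `column_lineage_facet`, the namespace, and the table and column names are assumptions made for the example, and the `_producer` and `_schemaURL` keys are omitted for brevity.

```python
def column_lineage_facet(mapping, namespace):
    """Build a columnLineage-style facet.

    mapping: output column name -> list of (input table, input column) pairs.
    namespace: dataset namespace shared by all input tables (simplifying assumption).
    """
    fields = {}
    for output_column, sources in mapping.items():
        fields[output_column] = {
            "inputFields": [
                {"namespace": namespace, "name": table, "field": column}
                for table, column in sources
            ]
        }
    return {"columnLineage": {"fields": fields}}


# Hypothetical lineage for: INSERT INTO totals SELECT customer_id, amount FROM public.orders
facet = column_lineage_facet(
    {
        "customer_id": [("public.orders", "customer_id")],
        "customer_total": [("public.orders", "amount")],
    },
    namespace="postgres://db.example.com:5432",
)
```
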
  • Recent improvements to the Airflow integration [Kuba]
    • OpenLineage facets
      • Facets are pieces of metadata that can be attached to the core entities: run, job or dataset
      • Facets provide context to OpenLineage events
      • They can be defined as either part of the OpenLineage spec or custom facets
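
A minimal sketch of where facets sit in an event, written as a plain Python dictionary. The `nominalTime`, `sql`, and `schema` facets come from the OpenLineage spec; the `example_team` custom facet, the producer URL, and all names and values are made up for illustration.

```python
import json
import uuid
from datetime import datetime, timezone

event = {
    "eventType": "COMPLETE",
    "eventTime": datetime(2023, 1, 12, 10, 0, tzinfo=timezone.utc).isoformat(),
    "producer": "https://example.com/my-producer",  # hypothetical producer URI
    "run": {
        "runId": str(uuid.uuid4()),
        # Run facets: context about this particular execution.
        "facets": {
            "nominalTime": {"nominalStartTime": "2023-01-12T00:00:00Z"},
            "example_team": {"owner": "data-platform"},  # custom facet (assumption)
        },
    },
    "job": {
        "namespace": "example",
        "name": "daily_orders",
        # Job facets: context about the job definition.
        "facets": {"sql": {"query": "SELECT id, amount FROM public.orders"}},
    },
    "inputs": [
        {
            "namespace": "postgres://db.example.com:5432",
            "name": "public.orders",
            # Dataset facets: context about the dataset itself.
            "facets": {
                "schema": {
                    "fields": [
                        {"name": "id", "type": "INTEGER"},
                        {"name": "amount", "type": "NUMERIC"},
                    ]
                }
            },
        }
    ],
    "outputs": [],
}

payload = json.dumps(event, indent=2)
```
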
    • Airflow generic facet
      • Previously there were multiple custom facets with no common standard
        • AirflowVersionRunFacet is an example of a rapidly growing facet carrying information unrelated to the version
      • Introduced AirflowRunFacet with Task, DAG, TaskInstance and DagRun properties
      • Old facets will be deprecated soon. Currently both old and new facets are emitted
        • AirflowRunArgsRunFacet, AirflowVersionRunFacet, AirflowMappedTaskRunFacet will be removed
        • All information from the facets above has moved to AirflowRunFacet
    • Other improvements (added in 0.19.2)
      • SQL extractors now send column-level lineage metadata
      • Further facets standardization
        • Introduced ProcessingEngineRunFacet
          • provides processing engine information, e.g. Airflow or Spark version
        • Improved support for nominal start & end times
          • makes use of data interval (introduced in Airflow 2.x)
          • nominal end time now matches next schedule time
        • DAG owner added to OwnershipJobFacet
        • Added support for S3FileTransformOperator and TrinoOperator (@sekikn’s great contribution)
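
To make the nominal-time and processing-engine items concrete, here is a small sketch that derives both facets as plain dictionaries. The field names follow the OpenLineage spec (`nominalStartTime`, `nominalEndTime`, `name`, `version`, `openlineageAdapterVersion`); the helper functions, the `processing_engine` key, the Airflow version, and the dates are assumptions for illustration, and the real integration reads the interval from the DAG run rather than taking it as arguments.

```python
from datetime import datetime, timedelta, timezone


def nominal_time_facet(data_interval_start, data_interval_end):
    """Map an Airflow data interval onto nominal start and end times."""
    return {
        "nominalStartTime": data_interval_start.isoformat(),
        # For a regular schedule, the interval end coincides with the next scheduled run.
        "nominalEndTime": data_interval_end.isoformat(),
    }


def processing_engine_facet(name, version, adapter_version):
    """Describe the engine that executed the run and the OpenLineage adapter used."""
    return {
        "name": name,
        "version": version,
        "openlineageAdapterVersion": adapter_version,
    }


# Hypothetical daily DAG run covering 2023-01-11.
start = datetime(2023, 1, 11, tzinfo=timezone.utc)
end = start + timedelta(days=1)

run_facets = {
    "nominalTime": nominal_time_facet(start, end),
    "processing_engine": processing_engine_facet("Airflow", "2.5.0", "0.19.2"),
}
```
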