...
- Recent releases
- Demo: creating a new OpenLineage consumer [Daniel]
- Discussion topic: real-world implementation of OpenLineage (i.e., "What IS lineage, anyway?")
- Announcement & discussion topic: the thinking behind namespaces
- Open discussion
January 12, 2023 (10am PT)
...
Attendees:
- TSC:
...
- Mike Collado, Staff Software Engineer, Astronomer
- Julien Le Dem, OpenLineage Project lead
- Willy Lulciuc, Co-creator of Marquez
- Michael Robinson, Software Engineer, Dev. Rel., Astronomer
- Maciej Obuchowski, Software Engineer, GetInData, OpenLineage contributor
- Mandy Chessell, Egeria Project Lead
- Daniel Henneberger,
- Will Johnson, Senior Cloud Solution Architect, Azure Cloud, Microsoft
- Jakub "Kuba" Dardziński, Software Engineer, GetInData, OpenLineage contributor
- And:
- Petr Hajek, Information Management Professional, Profinit
- Harel Shein, Director of Engineering, Astronomer
- Minkyu Park, Senior Software Engineer, Astronomer
- Sam Holmberg, Software Engineer, Astronomer
- Ernie Ostic, SVP of Product, MANTA
- Sheeri Cabral, Technical Product Manager, Lineage, Collibra
- John Thomas, Software Engineer, Dev. Rel., Astronomer
Agenda:
- Announcements
- Recent release 0.19.2
- Update on column-level lineage
- Overview of recent improvements to the Airflow integration
- Discussion topic: real-world implementation of OpenLineage (i.e., "What IS lineage, anyway?")
- Announcement & discussion topic: the thinking behind namespaces
Meeting:
Notes:
- Announcements
- OpenLineage earned Incubation status with the LFAI & Data Foundation at their December TAC meeting!
- Represents our maturation in terms of governance, code quality assurance practices, documentation, and more
- Required earning the OpenSSF Silver Badge, sponsorship, and at least 300 GitHub stars
- Next up: Graduation (expected in early summer)
- Recent release 0.19.2 [Michael R.]
Added
- SQL: add column-level lineage to SQL parser #1432 #1461 @mobuchowski @StarostaGit
- SQL: add ExtractionErrorRunFacet #1442 @mobuchowski
- Airflow: add Trino extractor #1288 @sekikn
- Airflow: add S3FileTransformOperator extractor #1450 @sekikn
- Airflow: add standardized run facet #1413 @JDarDagran
- Airflow: add NominalTimeRunFacet and OwnershipJobFacet #1410 @JDarDagran
- dbt: add support for postgres datasources #1417 @julienledem
- Proxy: add client-side proxy (skeletal version) #1439 #1420 @fm100
- Proxy: add CI job to publish Docker image #1086 @wslulciuc
- Spark: pass config parameters to the OL client #1383 @tnazarew
Fixed
- Airflow: fix collect_ignore, add flags to Pytest for cleaner output #1437 @JDarDagran
- Spark & Java client: fix README typos @versaurabh
- Thanks to all the contributors, including new contributor @versaurabh!
- Bug fixes and more details: https://github.com/OpenLineage/OpenLineage/blob/main/CHANGELOG.md
- Column-level lineage update [Maciej]
- What is the OpenLineage SQL parser?
- At its core, it’s a Rust library that parses SQL statements and extracts lineage data from them
- 80/20 solution: we won’t be able to parse all possible SQL statements, since each database has custom extensions and different syntax, so we focus on standard SQL.
- Good example of a complicated extension: Snowflake COPY INTO https://docs.snowflake.com/en/sql-reference/sql/copy-into-table.html
- We primarily use the parser in the Airflow integration and the Great Expectations integration
- Why? Airflow does not “understand” a lot of what some operators do, for example PostgresOperator
- We also have a Java support package for the parser
- What changed previously?
- The parser in the current release can emit column-level lineage!
- At the last OL meeting, Piotr Wojtczak, the primary author of this change, presented the new parser core that enabled this functionality: https://www.youtube.com/watch?v=Lv_bODeAVYQ
- Still, the fact that the Rust code can do this does not mean we get it for free everywhere
- What has changed recently?
- We wrote “glue code” that allows us to use the new parser constructs in the Airflow integration
- Error handling just got much easier: the SQL parser can “partially” parse a SQL construct and report the errors it encountered, along with the particular statements that caused them.
- Usage
- Airflow integration extractors based on SqlExtractor (e.g., PostgresExtractor, SnowflakeExtractor, TrinoExtractor…) are now able to extract column-level lineage
- Near future: Spark will be able to extract lineage from JDBCRelation.
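
To make the idea of column-level lineage concrete, here is a minimal sketch of the kind of metadata that could be attached to an output dataset. It does not call the OpenLineage SQL parser: the column mappings are written by hand for a hypothetical `INSERT ... SELECT` statement, the helper `column_lineage_facet` is an illustrative name, and the `columnLineage` / `fields` / `inputFields` layout is an assumption based on the column-level lineage dataset facet in the OpenLineage spec. The namespace and table names are made up.

```python
from collections import defaultdict


def column_lineage_facet(mappings, namespace, input_table):
    """Build a columnLineage-style facet dict (hypothetical helper).

    mappings: iterable of (output_column, input_column) pairs.
    All input columns are assumed to come from a single input table.
    """
    fields = defaultdict(lambda: {"inputFields": []})
    for out_col, in_col in mappings:
        fields[out_col]["inputFields"].append(
            {"namespace": namespace, "name": input_table, "field": in_col}
        )
    return {"columnLineage": {"fields": dict(fields)}}


# Hand-written mapping for a hypothetical statement:
#   INSERT INTO public.orders_summary (customer, total)
#   SELECT customer_id, price * quantity FROM public.orders
facet = column_lineage_facet(
    [("customer", "customer_id"), ("total", "price"), ("total", "quantity")],
    namespace="postgres://db.example.com:5432",  # made-up namespace
    input_table="public.orders",
)
```

In this sketch, `total` is traced back to both `price` and `quantity`, which is the extra detail that table-level lineage alone would not capture.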
- Recent improvements to the Airflow integration [Kuba]
- OpenLineage facets
- Facets are pieces of metadata that can be attached to the core entities: run, job or dataset
- Facets provide context to OpenLineage events
- They can be defined as either part of the OpenLineage spec or custom facets
- Airflow generic facet
- Previously there were multiple custom facets with no standard
- AirflowVersionRunFacet is an example of a rapidly growing facet that accumulated information unrelated to the version
- Introduced AirflowRunFacet with Task, DAG, TaskInstance and DagRun properties
- Old facets are going to be deprecated soon. Currently both old and new facets are emitted
- AirflowRunArgsRunFacet, AirflowVersionRunFacet, AirflowMappedTaskRunFacet will be removed
- All of the information from the facets above moves to AirflowRunFacet
- Other improvements (added in 0.19.2)
- SQL extractors now send column-level lineage metadata
- Further facets standardization
- Introduced ProcessingEngineRunFacet
- provides processing engine information, e.g. Airflow or Spark version
- Improved support for nominal start & end times
- makes use of data interval (introduced in Airflow 2.x)
- nominal end time now matches next schedule time
- DAG owner added to OwnershipJobFacet
- Added support for S3FileTransformOperator and TrinoOperator (@sekikn’s great contribution)
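
As a rough illustration of the standardized facets mentioned above, the sketch below builds run facets from an Airflow-style data interval. It is not the integration's actual code: `build_run_facets` is a hypothetical helper, the facet keys and field names (`nominalTime`, `processing_engine`, `openlineageAdapterVersion`) are assumptions based on the OpenLineage spec, the version strings are example values, and required bookkeeping fields such as `_producer` and `_schemaURL` are omitted for brevity.

```python
from datetime import datetime, timezone


def build_run_facets(data_interval_start, data_interval_end,
                     airflow_version, adapter_version):
    """Build a dict of run facets from a data interval (hypothetical helper)."""

    def iso(ts):
        # Normalize to UTC and render as an ISO 8601 timestamp.
        return ts.astimezone(timezone.utc).isoformat()

    return {
        # The nominal end time is the end of the data interval, which for a
        # regular schedule coincides with the next scheduled run time.
        "nominalTime": {
            "nominalStartTime": iso(data_interval_start),
            "nominalEndTime": iso(data_interval_end),
        },
        # Engine information lives in one dedicated facet.
        "processing_engine": {
            "name": "Airflow",
            "version": airflow_version,
            "openlineageAdapterVersion": adapter_version,
        },
    }


# Example: a daily DAG run covering January 12, 2023 (UTC).
facets = build_run_facets(
    datetime(2023, 1, 12, tzinfo=timezone.utc),
    datetime(2023, 1, 13, tzinfo=timezone.utc),
    airflow_version="2.5.0",   # example value
    adapter_version="0.19.2",  # example value
)
```

Keeping engine details in a single facet, rather than spreading them across several Airflow-specific ones, is the kind of standardization the notes describe.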
December 8, 2022 (10am PT)
...
- TSC:
- Mike Collado, Staff Software Engineer, Astronomer
- Julien Le Dem, OpenLineage Project lead
- Maciej Obuchowski, Software Engineer, GetInData, OpenLineage contributor
- Mandy Chessell, Egeria Project Lead
- Willy Lulciuc, Co-creator of Marquez
- Paweł Leszczyński, Software Engineer, GetInData
- Ross Turk, Senior Director of Community, Astronomer
- Howard Yoo, Staff Product Manager, Astronomer
- Tomasz Nazarewicz, Software Engineer, GetInData
- Michael Robinson, Software Engineer, Dev. Rel., Astronomer
- And:
- Ann Mary Justine, Research Engineer, HP Enterprise
- Martin Foltin, Master Technologist, HP Enterprise
- Sam Holmberg, Software Engineer, Astronomer
- Aalap Tripathy, Principal Research Engineer, HP Enterprise
- Petr Hajek, Information Management Professional, Profinit
- Harel Shein, Director of Engineering, Astronomer
- Minkyu Park, Senior Software Engineer, Astronomer
- Benji Lampel, Ecosystem Engineer, Astronomer
- Suparna Bhattacharya, Distinguished Technologist, HP Enterprise
- John Thomas, Software Engineer, Dev. Rel., Astronomer
- Sergey Serebryakov, Research Engineer, HP Enterprise
- Glyn Bowden, Chief Technologist, HP Enterprise, CMF
- Nigel Jones, Maintainer, Egeria/IBM
- Sheeri Cabral, Technical Product Manager, Lineage, Collibra
- Prachi Mishra, Senior Software Engineer, Astronomer
...