...
- Recent releases (0.13.0, 0.13.1, 0.14.0, 0.14.1) [Michael R.]
- 0.13.0
Added
- Add BigQuery check support #960 @denimalpaca
- Add RUNNING EventType in spec and Python client #972 @mzareba382
- Use databases & schemas in SQL Extractors #974 @JDarDagran
- Implement Event forwarding feature via HTTP protocol #995 @howardyoo
- Introduce SymlinksDatasetFacet to spec #936 @pawel-big-lebowski
- Add Azure Cosmos Handler to Spark integration #983 @hmoazam
- Support OL Datasets in manual lineage inputs/outputs #1015 @conorbev
- Create ownership facets #996 @julienledem
Changed
- Use RUNNING EventType in Flink integration for currently running jobs #985 @mzareba382
- Convert task object into JSON encodable when creating Airflow version facet #1018 @fm100
Fixed
- Add support for custom SQL queries in v3 Great Expectations API #1025 @collado-mike
- 0.13.1
Fixed
- Rename all parentRun occurrences to parent from Airflow integration #1037 @fm100
- Do not change task instance during on_running event #1028 @JDarDagran
- 0.14.0
Added
- Support ABFSS and Hadoop Logical Relation in Column-level lineage #1008 @wjohnson
- Add Kusto relation visitor #939 @hmoazam
- Add ColumnLevelLineage facet doc #1020 @julienledem
- Include symlinks dataset facet #935 @pawel-big-lebowski
- Add support for dbt 1.3 beta's metadata changes #1051 @mobuchowski
- Support Flink 1.15 #1009 @mzareba382
- Add Redshift dialect to the SQL integration #1066 @mobuchowski
Fixed
- Add a dialect parameter to Great Expectations SQL parser calls #1049 @collado-mike
- Fix Delta 2.1.0 with Spark 3.3.0 #1065 @pawel-big-lebowski
- 0.14.1
Fixed
- Fix Spark integration issues including error when no openlineage.timeout #1069 @pawel-big-lebowski
- Notes:
- Thank you to all the contributors! And a special shout out to new contributor Hanna Moazam!
- Native data quality in Airflow with OpenLineage [Benji]
- Related webinar: https://www.astronomer.io/events/webinars/implementing-data-quality-checks-in-airflow/
- Why Airflow?
- In-pipeline checks
- Immediate alerts
- Lineage support
- Use case
- static checks
- typed values
- data ranges
- temporal intervals
- Two providers
- SQL column check operator (see sketch below)
- an "on rails" operator
- supports tolerance
- supports partitioning via a parameter
- available checks:
- min
- max
- unique check
- distinct check
- null check
- qualifiers:
- greater_than
- geq_to
- less_than
- leq_to
- equal_to
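A minimal sketch of how the column check operator might be configured, assuming the apache-airflow-providers-common-sql package is installed; the connection, table, and column names below are made up for illustration:

```python
# Sketch only: column-level data quality checks via the common-sql provider.
# The connection "example_db", the table, and the columns are hypothetical.
from airflow.providers.common.sql.operators.sql import SQLColumnCheckOperator

orders_column_checks = SQLColumnCheckOperator(
    task_id="orders_column_checks",
    conn_id="example_db",
    table="orders",
    partition_clause="order_date >= '2022-01-01'",  # limit checks to a partition
    column_mapping={
        "order_id": {
            "null_check": {"equal_to": 0},           # no NULLs allowed
            "distinct_check": {"geq_to": 1},         # at least one distinct value
        },
        "amount": {
            "min": {"greater_than": 0},              # static range check
            "max": {"leq_to": 100_000, "tolerance": 0.1},  # bound with tolerance
        },
    },
)
```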
- SQL table check operator (see sketch below)
- flexible
- supports static checks
- supports partitioning via a parameter
- use cases:
- checks that include aggregate values using the whole table
- row count checks
- schema checks
- comparisons between multiple columns, both aggregated and not aggregated
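A corresponding sketch for the table check operator, under the same assumptions; each check_statement below is just an example:

```python
# Sketch only: table-level data quality checks via the common-sql provider.
# Same hypothetical connection and table as the column check sketch above.
from airflow.providers.common.sql.operators.sql import SQLTableCheckOperator

orders_table_checks = SQLTableCheckOperator(
    task_id="orders_table_checks",
    conn_id="example_db",
    table="orders",
    checks={
        # aggregate check over the whole table
        "row_count_check": {"check_statement": "COUNT(*) >= 1000"},
        # aggregated comparison between columns
        "amount_sum_check": {"check_statement": "SUM(net_amount) <= SUM(gross_amount)"},
        # non-aggregated comparison, evaluated per row
        "date_order_check": {"check_statement": "created_at <= shipped_at"},
    },
)
```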
- Innovation: operators can now give data quality data directly to a lineage consumer (e.g., Marquez)
- Note: the UI in the demo is part of the Datakin product
- Can you talk about the OL packets?
- the existing OL data quality facets are being used (see the facet sketch below)
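For reference, the existing data quality facets in the spec (DataQualityAssertionsDatasetFacet and DataQualityMetricsInputDatasetFacet) have roughly the shape sketched below; this is an illustrative Python dict with made-up values, not output captured from the demo:

```python
# Rough shape of the existing OpenLineage data quality facets on an input
# dataset; values are invented and _producer/_schemaURL fields are omitted.
input_dataset_facets = {
    "dataQualityAssertions": {
        "assertions": [
            {"assertion": "null_check", "success": True, "column": "order_id"},
            {"assertion": "row_count_check", "success": True},
        ]
    },
    "dataQualityMetrics": {
        "rowCount": 1042,
        "columnMetrics": {
            "amount": {"nullCount": 0, "min": 0.5, "max": 980.0},
        },
    },
}
```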
- MANTA integrations using OpenLineage [Petr]
- MANTA & MANTA Flow tools
- unique column-level lineage parser covering most data technologies
- parses code to model databases and reconstruct detailed column-level lineage based on static analysis
- represents end-to-end dependencies (direct and indirect) across technologies at the enterprise level
- challenge: integrating runtime lineage
- MANTA connectors
- reverse-engineer code
- integration gets lineage from OpenLineage producers
- e.g., Keboola, dbt, Airflow, Snowflake, Spark
- converts the OpenLineage JSON files to MANTA objects (see the sketch below)
- currently limited to the table level
- for some technologies, Marquez libraries were used
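Purely as an illustration of what a table-level conversion could look like (MANTA's connector code is not public; the Node/Edge classes and file name below are invented for this sketch):

```python
# Hypothetical illustration only: reduce an OpenLineage run event to
# table-level nodes and edges. Node and Edge are invented for this sketch
# and are not MANTA classes.
import json
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class Node:
    namespace: str
    name: str                # table-level object, e.g. "db.schema.table"

@dataclass(frozen=True)
class Edge:
    source: Node
    target: Node
    job: str                 # the job that produced the dependency

def table_level_edges(event: dict) -> List[Edge]:
    job = f'{event["job"]["namespace"]}.{event["job"]["name"]}'
    inputs = [Node(d["namespace"], d["name"]) for d in event.get("inputs", [])]
    outputs = [Node(d["namespace"], d["name"]) for d in event.get("outputs", [])]
    return [Edge(src, dst, job) for src in inputs for dst in outputs]

# e.g. for a COMPLETE run event exported to a JSON file:
with open("openlineage_event.json") as f:
    edges = table_level_edges(json.load(f))
```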
- MANTA repository model
- underlying graph database
- nodes: hierarchically organized objects
- edges: relations
- layers: physical, logical, runtime...
- resources: all integration OL metadata sources
- used to distinguish the sources of metadata
- column-level project
- we can currently get column-level lineage if it is provided in facets
- idea: extend the OpenLineage model with facet extensions that MANTA can then analyze statically
- passes source code, encoded in Base64, as artifacts in job facets (see the hypothetical sketch below)
- status: in testing, beginning with Keboola
- hope: to use the integration to increase the number of producers we can consume lineage from
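A purely hypothetical sketch of the proposed facet extension; the facet name and fields below are invented for illustration and are not part of the OpenLineage spec:

```python
# Hypothetical sketch: ship source code as Base64-encoded artifacts inside
# a custom job facet so a consumer such as MANTA can analyze it statically.
# The facet name and fields are invented and not part of the spec.
import base64

transformation_sql = "INSERT INTO analytics.orders SELECT * FROM staging.orders"

job_facets = {
    "sourceCodeArtifacts": {                      # hypothetical facet name
        "artifacts": [
            {
                "name": "load_orders.sql",
                "language": "SQL",
                "encoding": "base64",
                "content": base64.b64encode(
                    transformation_sql.encode("utf-8")
                ).decode("ascii"),
            }
        ]
    }
}
```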
- Q & A
- Have you used json files for metadata in the past?
- No, but we are using them now, along with API calls
- Egeria was in a similar situation
- Open Discussion
- the common metadata framework project at HP Enterprise will be added to the agenda for a future meeting
August 11, 2022 (10am PT)
...