...
- Recent talks
- Ross, “What Is Data Lineage and Why Should I Care?”
- Maciej & Paweł, “OpenLineage & Airflow: Data Lineage has never been Easier”
- Willy, “Automating Airflow Backfills with Marquez”
- Michael C., “Data Lineage with Apache Airflow and Apache Spark”
- Ross & Michael R., “An Introduction to Data Lineage with Airflow and Marquez”
- Julien, “Observability for Data Pipelines with OpenLineage”
- Michael C., “Cross-platform Lineage with OpenLineage”
- Release 0.10.0
Added:
- Extend SaveIntoDataSourceCommandVisitor to extract schema from LocalRelation and LogicalRdd in Spark integration (#794) @pawel-big-lebowski
- Add InMemoryRelationInputDatasetBuilder for InMemory datasets to Spark integration (#818) @pawel-big-lebowski
- Add SnowflakeOperatorAsync extractor support to Airflow integration (#869) @denimalpaca
- Add PMD analysis to proxy project (#889) @howardyoo
- Add static code analysis tool mypy to run in CI against all Python modules (#802) @howardyoo
- Add copyright to source files (#755) @merobi-hub
Changed:
- Skip FunctionRegistry.class serialization in Spark integration (#828) @mobuchowski
- Reduce OL event payload size by excluding local data and including output node in start events (#881) @collado-mike
- Install new Rust-based SQL parser by default in Airflow integration (#835) @mobuchowski
- Improve overall pytest and integration tests for Airflow integration (#851, #858) @denimalpaca
- Split Spark integration into submodules (#834, #890) @tnazarew @mobuchowski
- Flink integration
- Entry point: built a Flink example app to find out whether metadata and schema are extractable
- Maciej also successfully read data from Iceberg
- Flink provides two APIs
- Created integration tests for all use cases, added them to CircleCI
- New Java client: different configs for HTTP, Kafka endpoints
- Missing feature: make sure crashing integration doesn't kill a Flink job
- Coming soon: experimental version
- not focused on streaming currently
- focus: how to extract info from Flink
- feedback from community desired
- Q & A
- Will: is the code an extension of OL or an integration?
- an integration akin to the dbt integration
- Willy: any changes to the spec/schema? Is the state part of the payload?
- new state should be added (currently "other")
- New docs site
- Up until today, docs have been on the website and spread throughout READMEs
- Docusaurus deployment now available
- Changes to structure as well as content welcome
- Not currently live but will be soon
- Can be hosted at docs.openlineage.io
- Everything is in Markdown
- Another motivation: Keboola use case not part of the codebase, so a docs site could describe it
- Next milestone: we all decide to publish it
- Q & A
- Willy: let's add a section on defining custom facets
- Ross: feel free to add another page stub
- Ross: also need a FAQ
- Julien: we could autogenerate some docs
- Ross: there are downsides to such an approach
- Julien: let's open issues when answers aren't good enough
- Willy: descriptions of facets could be improved
- Julien: we could version them
- Ross: I'll look for signs that people are not finding docs on the version they are using
- Discussion: streaming in Flink integration
- Has there been any evolution in the thinking on support for streaming?
- Julien: start event, complete event, and snapshots in between, limited to a certain number per time interval
- Paweł: we can make the snapshot volume configurable
- Does Flink support sending data to multiple tables like Spark?
- Yes, multiple outputs supported by OpenLineage model
- Marquez, the reference implementation of OL, combines the outputs
- Looking forward to seeing this documented on the new docs site
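The snapshot throttling discussed above (a start event, a complete event, and a capped, configurable number of snapshots per time interval) could look roughly like the sketch below. This is only an illustration, not the integration's code: `SnapshotLimiter`, `emit_events`, and the parameter names are hypothetical, and it assumes snapshots use the `OTHER` event type, as the earlier Q & A suggests is currently the case.

```python
import time
from typing import Callable, List, Optional


class SnapshotLimiter:
    """Hypothetical helper: allow at most `max_snapshots` per `interval_s` seconds."""

    def __init__(self, max_snapshots: int, interval_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_snapshots = max_snapshots  # the configurable "snapshot volume"
        self.interval_s = interval_s
        self._clock = clock
        self._window_start: Optional[float] = None
        self._sent = 0

    def allow(self) -> bool:
        now = self._clock()
        # Open a new fixed window on first use or once the interval has elapsed.
        if self._window_start is None or now - self._window_start >= self.interval_s:
            self._window_start = now
            self._sent = 0
        if self._sent < self.max_snapshots:
            self._sent += 1
            return True
        return False


def emit_events(event_types: List[str], limiter: SnapshotLimiter) -> List[str]:
    """Return the event types that would be sent.

    START and COMPLETE always pass; intermediate snapshots (assumed here
    to be OTHER events) pass only while the limiter allows them.
    """
    sent = []
    for event_type in event_types:
        if event_type != "OTHER" or limiter.allow():
            sent.append(event_type)
    return sent
```

A fixed window is the simplest choice here; a sliding window or token bucket would smooth bursts at window boundaries at the cost of a little more state.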
- Open discussion
- What's the logical approach to avoid overloading the backend with lineage events? [Colin]
- Paweł: we only send events when checkpoints change; configurable for more events
- Will: at Microsoft we're working on a fix that caches and consolidates OL events
- It'd be awesome to see example payloads for streaming in the docs [Colin]
- Ross: they're currently spread out; it'd be nice to have them in one place
- How can we create custom facets? [Sandeep]
- Julien: two options; anyone can create a custom facet without asking permission, or open a proposal/issue
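For the first option Julien mentions (creating a custom facet without asking permission), a minimal sketch of what such a facet might look like in an event payload follows. It assumes, per our reading of the OpenLineage spec, that every facet carries `_producer` and `_schemaURL` fields and that custom facet keys use a distinct project prefix. The `myorg_` prefix, the URLs, the `team` field, and the `build_custom_facet` helper are all made up for illustration.

```python
import json


def build_custom_facet(producer: str, schema_url: str, **fields) -> dict:
    """Hypothetical helper: wrap custom fields with the base facet fields."""
    facet = {"_producer": producer, "_schemaURL": schema_url}
    facet.update(fields)
    return facet


# The key is prefixed ("myorg_") so it cannot collide with standard facets.
run_facets = {
    "myorg_pipelineOwner": build_custom_facet(
        producer="https://example.com/my-pipeline",  # placeholder URL
        schema_url="https://example.com/schemas/PipelineOwnerRunFacet.json",  # placeholder URL
        team="data-platform",  # illustrative custom field
    )
}

# Serialized, this dict would sit under the "facets" key of a run in an event.
payload_fragment = json.dumps({"facets": run_facets}, indent=2)
```

The second option, opening a proposal or issue, is the route to take when a facet is general enough to belong in the spec itself.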
...