...
- TSC:
- Willy Lulciuc: Co-creator of Marquez
- Mike Collado: Staff Software Engineer, Astronomer
- Julien Le Dem: OpenLineage Project lead
- And:
- Ernie Ostic: SVP of Product, Manta
- Ross Turk: Senior Director of Community, Astronomer
- Minkyu Park: Senior Software Engineer, Astronomer
- Peter Hicks: Senior Software Engineer, Astronomer
- Michael Robinson: Software Engineer, Dev. Rel., Astronomer
- Sandeep Adwankar: Senior Technical Product Manager, AWS
- Will Johnson: Senior Cloud Solution Architect, Azure Cloud, Microsoft
- John Thomas: Software Engineer, Dev. Rel., Astronomer
- Chandru Sugunan: Product Manager, Azure Cloud, Microsoft
- Petr Hajek: Information Management Professional, Profinit
- Colin Schaub: Lead API Engineer, API Platform Lead, Cargill
- Mark Chiarelli: Senior Consultant, MarkLogic
- Sam Holmberg: Software Engineer, Astronomer
- Paweł Leszczyński, Software Engineer, GetInData
Agenda:
- Recent talks [Julien]
- Recent release: 0.10.0 [Michael R.]
- Flink integration [Paweł, Maciej]
- New docs site [Ross]
- Discuss: streaming services in Flink integration [Will]
- Open discussion
- OL philosophy for streaming in general
...
- Recent talks
- Ross, “What Is Data Lineage and Why Should I Care?”
- Maciej & Paweł, “OpenLineage & Airflow: Data Lineage has never been Easier”
- Willy, “Automating Airflow Backfills with Marquez”
- Michael C., “Data Lineage with Apache Airflow and Apache Spark”
- Ross & Michael R., “An Introduction to Data Lineage with Airflow and Marquez”
- Julien, “Observability for Data Pipelines with OpenLineage”
- Michael C., “Cross-platform Lineage with OpenLineage”
- Release 0.10.0
Added:
- Extend SaveIntoDataSourceCommandVisitor to extract schema from LocalRelation and LogicalRdd in Spark integration (#794) @pawel-big-lebowski
- Add InMemoryRelationInputDatasetBuilder for InMemory datasets to Spark integration (#818) @pawel-big-lebowski
- Add SnowflakeOperatorAsync extractor support to Airflow integration (#869) @denimalpaca
- Add PMD analysis to proxy project (#889) @howardyoo
- Add static code analysis tool mypy to run in CI against all Python modules (#802) @howardyoo
- Add copyright to source files (#755) @merobi-hub
Changed:
- Skip FunctionRegistry.class serialization in Spark integration (#828) @mobuchowski
- Reduce OL event payload size by excluding local data and including output node in start events (#881) @collado-mike
- Install new Rust-based SQL parser by default in Airflow integration (#835) @mobuchowski
- Improve overall pytest and integration tests for Airflow integration (#851, #858) @denimalpaca
- Split Spark integration into submodules (#834, #890) @tnazarew @mobuchowski
- Flink integration
- Entry point: built a Flink example app to find out whether metadata and schema are extractable
- Maciej also successfully read data from Iceberg
- Flink provides two APIs
- Created integration tests for all use cases, added them to CircleCI
- New Java client: different configs for HTTP, Kafka endpoints
- Missing feature: make sure crashing integration doesn't kill a Flink job
- Coming soon: experimental version
- not focused on streaming currently
- focus: how to extract info from Flink
- feedback from community desired
- Q & A
- Will: is the code an extension of OL or an integration?
- an integration akin to the dbt integration
- Willy: any changes to the spec/schema? Is the state part of the payload?
- new state should be added (currently "other")
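
The "missing feature" noted in the Flink section (a crashing integration must not kill the job) can be illustrated with a small sketch. It is written in Python for brevity and is not the Flink integration's code; `safe_emit`, `failing_emit`, and the event dictionaries are hypothetical names used only for illustration. The idea is to catch and log any failure in lineage emission so that it never propagates into the job.

```python
import logging
from typing import Any, Callable

logger = logging.getLogger("lineage")


def safe_emit(emit: Callable[[dict], Any], event: dict) -> bool:
    """Call emit(event); log and swallow any exception.

    Returns True if the event was emitted, False if emission failed.
    The caller (the job) keeps running in both cases.
    """
    try:
        emit(event)
        return True
    except Exception:
        # Lineage is best-effort here: record the failure and move on.
        logger.exception("lineage emission failed; continuing job")
        return False


def failing_emit(event: dict) -> None:
    # Hypothetical transport that is always down.
    raise ConnectionError("lineage backend unreachable")


sent = []
ok = safe_emit(sent.append, {"eventType": "START"})
failed = safe_emit(failing_emit, {"eventType": "COMPLETE"})
```

In this sketch the second call fails inside the transport, but the caller only sees a `False` return value, so the surrounding job would continue.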
- New docs site
- Up until today, docs have been on the website and spread throughout READMEs
- Docusaurus deployment now available
- Changes to structure as well as content welcome
- Not currently live but will be soon
- Can be hosted at docs.openlineage.io
- Everything is in Markdown
- Another motivation: Keboola use case not part of the codebase, so a docs site could describe it
- Next milestone: we all decide to publish it
- Q & A
- Willy: let's add a section on defining custom facets
- Ross: feel free to add another page stub
- Ross: also need a FAQ
- Julien: we could autogenerate some docs
- Ross: there are downsides to such an approach
- Julien: let's open issues when answers aren't good enough
- Willy: descriptions of facets could be improved
- Julien: we could version them
- Ross: I'll look for signs that people are not finding docs on the version they are using
- Streaming in Flink integration
- Has there been any evolution in the thinking on support for streaming?
- Julien: start event, complete event, and snapshots in between, limited to a certain number per time interval
- Paweł: we can make the snapshot volume configurable
- Does Flink support sending data to multiple tables like Spark?
- yes, multiple outputs supported by OpenLineage model
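
The streaming model discussed above (a start event, a complete event, and a capped number of snapshot events in between, with the cap configurable) could look roughly like the sketch below. It is an assumption-laden illustration, not the integration's implementation: `SnapshotThrottle`, `FakeClock`, and the parameters `max_per_interval` and `interval_seconds` are hypothetical, the limiter is a simple fixed-window counter, and the snapshots are labeled `OTHER` only because the notes say that is the state currently used.

```python
import time
from typing import Callable


class SnapshotThrottle:
    """Fixed-window limiter: at most max_per_interval snapshots per window."""

    def __init__(
        self,
        max_per_interval: int,
        interval_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_per_interval = max_per_interval  # configurable snapshot volume
        self.interval_seconds = interval_seconds
        self._clock = clock
        self._window_start = clock()
        self._sent_in_window = 0

    def allow(self) -> bool:
        now = self._clock()
        # Start a new window once the current one has elapsed.
        if now - self._window_start >= self.interval_seconds:
            self._window_start = now
            self._sent_in_window = 0
        if self._sent_in_window < self.max_per_interval:
            self._sent_in_window += 1
            return True
        return False


class FakeClock:
    """Deterministic clock so the example does not depend on real time."""

    def __init__(self) -> None:
        self.now = 0.0

    def __call__(self) -> float:
        return self.now


clock = FakeClock()
throttle = SnapshotThrottle(max_per_interval=2, interval_seconds=60.0, clock=clock)

events = ["START"]
# Hypothetical moments (in seconds) at which the job could report a snapshot.
for t in [1.0, 2.0, 3.0, 61.0, 62.0, 63.0]:
    clock.now = t
    if throttle.allow():
        events.append("OTHER")
events.append("COMPLETE")
```

With a cap of two snapshots per 60-second window, the third candidate in each window is dropped, so the job emits one start event, four snapshots, and one complete event.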
June 9th, 2022 (10am PT)
Attendees:
...