The OpenLineage Technical Steering Committee meets monthly on the second Wednesday, 9:00am to 10:00am US Pacific. Link to join the meeting
All are welcome.
Next meeting: Feb 9th 2022 (9am PT)
Jan 12th 2022 (9am PT)
Attendees:
- TSC:
- Mike Collado: Eng, Datakin
- Mandy Chessell: Lead of the Egeria project
- Maciej Obuchowski: Eng GetInData, OpenLineage contributor
- Willy Lulciuc: Co-creator of Marquez
- Julien: OpenLineage Project lead
- And:
- Michael Robinson: dev rel
- Ross Turk: VP marketing Datakin
- Minkyu Park: Dev at Datakin, learning about Marquez and OpenLineage.
- Conor Beverland: Senior Dir of Product, Astronomer
- Srikanth Venkat, Product Management, Privacera
Agenda:
- OpenLineage recent releases overview [Julien]
- OpenLineage 0.4 release overview: https://github.com/OpenLineage/OpenLineage/releases/tag/0.4.0
- Databricks install README and init scripts (by Will)
- Iceberg integration (by Pawel)
- Kafka read and write support (by Olek and Mike)
- Arbitrary parameters supported in HTTP URL construction (by Will)
- Increased coverage (Pawel/Maciej)
- OpenLineage 0.5 release overview
- Egeria support for OpenLineage [Mandy]
- Airflow TaskListener for OpenLineage integration [Maciej]
- Open discussion
Meeting:
- Slides
- Passcode:
- Zoom link
Notes:
0.4 release [Willy]:
- Databricks install README and init scripts (by Will)
- Iceberg integration (Pawel)
- Iceberg adoption already strong
- Kafka read and write support (Olek and Mike)
- Arbitrary parameters supported in HTTP URL construction (Will)
- Increased coverage (Pawel and Maciej)
0.5 preview [Willy]:
- Add Spark support to openlineage-dbt lib. (by Maciej)
- New extensible API to handle Spark events for openlineage-spark lib (Mike)
- New proxy HTTP backend to route events to event streams (Mandy and Willy)
- Increase coverage of sparkV2 cmds for openlineage-spark lib. (Pawel)
- Added HTTP client to openlineage-java lib. (Willy)
- Thanks go to Mike Collado for work on PRs, proposal; also to Mandy for work on HTTP backend over last two months
- HTTP client will decrease confusion about how to capture metadata
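As a rough illustration of what emitting an event through such a client involves, here is a minimal Python sketch. This is not the openlineage-java API itself; the endpoint path and producer URI are assumptions, loosely modeled on a Marquez-style lineage endpoint:

```python
import json
import uuid
from datetime import datetime, timezone
from urllib.request import Request, urlopen

def make_run_event(event_type, job_namespace, job_name, run_id=None):
    """Build a minimal OpenLineage run event (eventType is START/COMPLETE/FAIL)."""
    return {
        "eventType": event_type,
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": run_id or str(uuid.uuid4())},
        "job": {"namespace": job_namespace, "name": job_name},
        "inputs": [],
        "outputs": [],
        # hypothetical producer URI identifying the emitting integration
        "producer": "https://example.com/my-integration",
    }

def emit(event, url="http://localhost:5000/api/v1/lineage"):
    """POST one event to an OpenLineage-compatible HTTP endpoint (assumed path)."""
    req = Request(url, data=json.dumps(event).encode("utf-8"),
                  headers={"Content-Type": "application/json"})
    return urlopen(req)  # network call; not exercised in this sketch

event = make_run_event("START", "my_namespace", "my_job")
```

A built-in client along these lines means integrations only construct events; delivery configuration lives in one place.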
Tasklistener for OL Integration [Maciej]:
Airflow 1.10 required modifying each DAG, which was cumbersome and not compatible with 2.1
Airflow 2.1: lineage backend comparable to Apache Atlas's old backend
- benefit: provides all info about events
- downside: cannot notify about task starts/failures
2.3: Airflow Event Listener
- Status: not merged yet, in final reviews for deployment with 0.6
- Improvements: transparent, less exposure, enables pull model using queue, enables Egeria and other projects in the future (e.g., DataHub)
- Discussion [Julien, Maciej, Willy, Mike]:
- generic: supports additional functionality
- extendable to different kinds of events, e.g., scheduling
- makes more data available
- much less brittle because depends on public API
- requires little configuration
- will not do away with registration of listeners/extractors
- entry point mechanism comparable to the service loader in Java; requires env variables
- theoretically possible to backport it to earlier versions of Airflow (as far back as 1.10)
- possibly helpful to document that we have 3 approaches but are not recommending older ones, mention that this changes only how we collate
- older approaches can be deprecated; it will be important to monitor the community to determine timing of this
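The queue-based pull model mentioned above can be sketched in plain Python. This illustrates only the pattern; the hook names below are placeholders and do not match the real Airflow listener API:

```python
import queue

class LineageEventQueue:
    """Listener callbacks push task-state changes; a consumer pulls and emits.
    This decouples event capture (inside the scheduler/worker) from delivery."""

    def __init__(self):
        self._q = queue.Queue()

    # Placeholder hooks, loosely modeled on listener callbacks for task
    # instances starting and finishing; real Airflow hook signatures differ.
    def on_task_start(self, dag_id, task_id):
        self._q.put({"eventType": "START", "dag": dag_id, "task": task_id})

    def on_task_success(self, dag_id, task_id):
        self._q.put({"eventType": "COMPLETE", "dag": dag_id, "task": task_id})

    def drain(self):
        """Consumer side: pull all queued events (e.g., to POST to a backend)."""
        events = []
        while not self._q.empty():
            events.append(self._q.get())
        return events

listener = LineageEventQueue()
listener.on_task_start("example_dag", "extract")
listener.on_task_success("example_dag", "extract")
events = listener.drain()
```

The queue is what makes the listener "transparent": task execution never blocks on the lineage backend.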
Egeria Support for OpenLineage [Mandy]:
- Monthly releases
- OpenLineage support ready in recent release
- Metaphor: Lego blocks
- OL events can be brought in through the API or the proxy backend with Kafka
- events augmentable in Egeria, storable or publishable in Marquez or Kafka for distribution or to log store (e.g., file system)
- Can validate that a process is running correctly
- See documentation in Egeria about proxy backend and extensions, API mechanism
- Diagram in documentation illustrates capabilities
- Discussion [Julien, Mandy, Srikanth, Mike]:
- Egeria sees value of OpenLineage
- Engine is uncoupled from receivers
- Endpoint is simple, allowing independent management of processes
- Some transformation of payload during storage
- Kafka integration coming in 0.5
- Customers expect ability to filter data
- Varying granularity of metadata already possible through versioning with Marquez
Open Discussion:
Proposal to convert licenses to SPDX [Michael]: no objections
Dec 8th 2021 (9am PT)
TODO: add notes
Tentative agenda:
- SPDX headers [Mandy Chessel]
- Azure Purview + OpenLineage [Will Johnson, Mark Taylor]
- Logging backend (OpenTelemetry, ...) [Julien Le Dem]
- Open discussion
Nov 10th 2021 (9am PT)
Attendees:
- TSC
- Mike Collado: Eng
- Ryan Blue: Tabular, Apache Iceberg
- Mandy Chessell: Lead of the Egeria project
- Maciej Obuchowski: Eng GetInData, OpenLineage contributor
- Willy Lulciuc: Co-creator of Marquez
- Julien: OpenLineage Project lead
- And:
- Michael Robinson: dev rel
- Peter Hicks: Marquez contributor
- Ross Turk: VP marketing, Datakin
- John Thomas: Support eng at Datakin
- Minkyu Park: Dev at Datakin, learning about Marquez and OpenLineage.
Agenda:
- OL Client use cases for Apache Iceberg [Ryan]
- Proxy Backend and Egeria integration progress update (Issue #152) [Mandy]
- OpenLineage last release overview (0.3.1)
- Facet versioning
- Airflow 2 / Spark 3 support, dbt improvements
- OpenLineage 0.4 scope review
- Proxy Backend (Issue #152)
- Spark, Airflow, dbt improvements (documentation, coverage, ...)
- improvements to the OpenLineage model
- Open discussion
Meeting recording:
Notes:
SPDX tags:
shorter license headers => makes things easier.
https://spdx.org/licenses/
TODO: Mandy will propose something next time
Iceberg requirements:
ability for Iceberg to add facets without having to depend on the context it's running in.
Avoid depending on the Sources exposing facets in the Spark API, as that would be a hard change to get into Spark.
Ryan:
Proposal to have a logger style API.
similar to SLF4J or Dropwizard Metrics => create a logging/metrics object, independent of the logging backend.
Facets can be emitted and the backend can be configured independently whether those facets are picked up or not.
Example: Have an OpenLineage API to add facets in a given context:
create facet for some context: Read datasets x, ... write dataset Y
=> broad agreement on principle
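A rough sketch of what such a logger-style API could look like (illustrative Python only; the `FacetLogger` and `InMemoryBackend` names are invented for this example, not an agreed design):

```python
class FacetLogger:
    """SLF4J-style facade: integrations emit facets against a context;
    whether and where they are delivered is decided by the configured backend."""

    _backend = None  # no-op unless a backend is configured, like an unbound logger

    @classmethod
    def configure(cls, backend):
        cls._backend = backend

    def __init__(self, context_id):
        self.context_id = context_id  # e.g., a Spark SQL execution id

    def add_facet(self, name, facet):
        if FacetLogger._backend is not None:
            FacetLogger._backend.collect(self.context_id, name, facet)

class InMemoryBackend:
    """Example backend; a real one might buffer facets and attach them to run events."""
    def __init__(self):
        self.collected = []
    def collect(self, context_id, name, facet):
        self.collected.append((context_id, name, facet))

backend = InMemoryBackend()
FacetLogger.configure(backend)
FacetLogger("exec-42").add_facet("scan_metrics", {"files_read": 10})
```

The point of the facade is exactly the one Ryan raised: Iceberg could call `add_facet` without depending on the context it runs in, and with no backend configured the calls cost nothing.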
Open Questions:
when are facets sent?
preference for sending events as they go.
does that fit with the OpenLineage view of the world? => yes
do we send them immediately? Do we wait?
iceberg not creating a facet until Spark asks for the splits
Spark, bound to a context thread:
the logger backend can grab the SQL execution ID
loggers depend on the thread
the listener is on a different thread
report for a given job run
Ryan: the run context is thread-local; it sets the execution ID.
The client side should be able to send an event immediately vs sent when you get a chance.
Who needs to do this?
Need to have a guide to defining a facet.
Michael C.: TODO: Design Doc on logging
Willy: Do we need a "RUNNING" event?
Flink:
how to handle long running job
[Ryan] [Mandy] long running jobs need to be defined
TODO: Julien, post a ticket for long running jobs
Also need for OSS trino integration, tabular might contribute
Proxy Backend update [Mandy]
- draft PR #500: Thanks Willy for the initial setup.
Looking for feedback
Issues:
Initial implementation used the provided beans to deserialize, but it didn't quite work (TODO: ticket)
Instead it just passes events through: faster, but no validation
- OL is the dynamic lineage solution for Egeria
used Postman for 3rd-party testing
to be released in a few weeks
https://odpi.github.io/egeria-docs/features/lineage-management/overview/#the-openlineage-standard
- proposal for new facets.
RequestFacet => should be a run facet; maps to the run args in Marquez
https://github.com/OpenLineage/OpenLineage/issues/256
Does the last version of a facet win? => yes
Need to document size constraints in OL (name length, ...) TODO: ticket
Oct 13th 2021
Attendees:
- TSC:
Michael Collado: Datakin
Julien Le Dem: OpenLineage Project Lead, Datakin
Maciej Obuchowski: GetInData, OpenLineage
Willy Lulciuc: Marquez, OpenLineage
Mandy Chessel: Egeria Project Lead, working on OpenLineage
- And:
Ross Turk: VP marketing at Datakin, to talk about the website
Minkyu Park: interested in contributing to Datakin
Peter Hicks: Marquez contributor, OpenLineage user
- Meeting recording:
- Notes:
- OpenLineage website: https://openlineage.io/
- Gatsby based (markdown) in OpenLineage/website repo
- generates a static site hosted in github pages. OpenLineage/OpenLineage.github.io
- deployment is currently manual. Automation in progress
- Please open PRs on /website to contribute blog posts.
- Getting started with Egeria?
- Suggestions:
- Add page on open governance and how to join the project.
- Add LFAI & data banner to the website?
- Egeria is using MkDocs: very nice for navigating documentation.
- upcoming 0.3.0:
- Facet versioning:
- each facet schema is versioned individually.
- client/server code generation to facilitate producing/consuming openlineage events
- Spark 3.x support
- new mechanism for airflow 2.x
- working with airflow maintainer to improve that.
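Individually versioned facets show up in the event payload as per-facet schema pointers; a sketch of one such facet, with illustrative values, shown as a Python dict:

```python
# Illustrative dataset facet: every facet instance carries its own schema
# pointer (_schemaURL) and a producer URI, so each facet schema can evolve
# independently of the core run-event spec.
schema_facet = {
    "_producer": "https://example.com/my-integration",  # hypothetical producer
    "_schemaURL": "https://openlineage.io/spec/facets/1-0-0/SchemaDatasetFacet.json",
    "fields": [
        {"name": "id", "type": "BIGINT"},
        {"name": "created_at", "type": "TIMESTAMP"},
    ],
}
```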
- Proxy Backend update (planned for OL 0.4.0):
- mapping to egeria backend
- planning to release for the Egeria webinar on the 8th of November
- Willy provided a base module for ProxyBackend
- Monthly release is a good cadence
Open discussions:
Azure Purview team hackathon ongoing to consume OpenLineage events
Design docs discussion:
proposal to add design docs for proposals.
goal:
similar to the process of projects like Kafka and Flink: for specs and bigger features,
not for bug fixes.
options:
proposal directory for docs as markdown
Open PRs against wiki pages: proposals wiki.
Manage status:
list of designs that are implemented vs pending.
table of open proposals.
vote for prioritization:
Every proposal design doc has an issue opened and linked back to it.
a good start for a blog post about that feature
New committee on data ops: Mandy will be speaking about Egeria and OpenLineage
Scope:
How the foundation projects should work together around the topic.
Establishing OpenLineage is important.
https://wiki.lfaidata.foundation/display/DL/DataOps+Committee
Sept 8th 2021
- Attendees:
- TSC:
Mandy Chessell: Egeria Lead. Integrating OpenLineage in Egeria
Michael Collado: Datakin, OpenLineage
- Maciej Obuchowski: GetInData. OpenLineage integrations
- Willy Lulciuc: Marquez co-creator.
- Ryan Blue: Tabular, Iceberg. Interested in collecting lineage across Iceberg users with OpenLineage
- And:
- Venkatesh Tadinada: BMC workflow automation looking to integrate with Marquez
- Minkyu Park: Datakin. learning about OpenLineage
- Arthur Wiedmer: Apple, lineage for Siri and AI ML. Interested in implementing Marquez and OpenLineage
- Meeting recording:
- zoom link
- Passcode: +UsC8emL
- Meeting notes:
- agenda:
Update on OpenLineage latest release (0.2.1)
dbt integration demo
OpenLineage 0.3 scope discussion
Facet versioning mechanism (Issue #153)
OpenLineage Proxy Backend (Issue #152)
OpenLineage implementer test data and validation
Kafka client
Roadmap
- Iceberg integration
Open discussion
- Discussions:
added to the agenda a discussion of Iceberg requirements for OpenLineage.
Demo of dbt:
really easy to try
when running from airflow, we can use the wrapper 'dbt-ol run' instead of 'dbt run'
Presentation of Proxy Backend design:
- summary of discussions in Egeria:
- Egeria is less interested in instances (runs) and will keep track of OpenLineage events separately as operational lineage
- Two ways to use Egeria with OpenLineage:
- 1) Use a direct OpenLineage endpoint (like Marquez): a direct HTTP endpoint implementation in Egeria
- 2) Deploy the Proxy Backend in OpenLineage to write to a queue (ex: Kafka): it receives HTTP events and forwards them to Kafka; a consumer in Egeria receives the Kafka events
- Depending on the user, they might pick one or the other; we'll document both
Follow up items:
The transport abstraction (Backend interface) could be usable directly from the client or from the Proxy Backend. The user can decide if they want the intermediate proxy. See #269
We should add a distribution client symmetric to the Proxy Backend. It reads from Kafka and sends events to an OpenLineage HTTP endpoint. Marquez would use it, for example, to consume OpenLineage events produced by Egeria. See #270
- Iceberg integration:
presentation of Iceberg model
Manifest and manifest list: 2-level tree structure tracking data files.
root metadata version file. Points to the manifest list (it knows all of the previous versions of the dataset that we want to keep)
Iceberg collects various metadata about the scans and data being produced and wants to expose it through OpenLineage. It can already expose metadata, but there is no listener yet.
Ryan: added the metadata list presented to the Iceberg ticket: See #167
Aug 11th 2021
- Attendees:
- TSC:
Ryan Blue
Maciej Obuchowski
Michael Collado
Daniel Henneberger
Willy Lulciuc
Mandy Chessell
Julien Le Dem
- And:
Peter Hicks
Minkyu Park
Daniel Avancini
- Meeting recording:
- zoom link
- Passcode: =RBUj01C
- Meeting notes:
- Agenda:
- Coming in OpenLineage 0.1
- OpenLineage spec versioning
- Clients
- Marquez integrations imported in OpenLineage
- Apache Airflow:
- BigQuery
- Postgres
- Snowflake
- Redshift
- Great Expectations
- Apache Spark
- dbt
- OpenLineage 0.2 scope discussion
- Facet versioning mechanism (Issue #153)
- OpenLineage Proxy Backend (Issue #152)
- Kafka client
- Roadmap
- Open discussion
- Coming in OpenLineage 0.1
- Slides: https://docs.google.com/presentation/d/1Lxp2NB9xk8sTXOnT0_gTXicKX5FsktWa/edit#slide=id.ge80fbcb367_0_14
- Notes:
- OpenLineage 0.1 is being published
- Coming in OpenLineage 0.1
- OpenLineage spec versioning
- Clients (Java, Python)
- Marquez integrations imported in OpenLineage
- Apache Airflow:
- BigQuery
- Postgres
- Snowflake
- Redshift
- Great Expectations
- Apache Spark
- dbt
- Question: How is Airflow capturing OpenLineage events?
- openlineage-airflow is installed on the Airflow instance
- adapters per operator
- OpenLineage 0.2 scope discussion
- Facet versioning mechanism (Issue #153)
- OpenLineage Proxy Backend (Issue #152)
- Questions:
- What is the advantage of the proxy backend?
- The consumer does not need to implement an endpoint and can consume from Kafka
- can configure what to do with events independently of various integrations
- first step to having a routing mechanism:
- to send events to multiple consumers
- to have rule-based routing
- to enable archiving the events in addition to sending them
- Is it included in OpenLineage?
- Yes (Otherwise it would have to be in Egeria)
- Does it include error management or retry policy? What if the proxy dies? Do we care about durability?
- Yes we care about durability
- first implementation to be synchronous: a single transaction to Kafka per event.
- future might be configurable to adjust depending on context (guaranteed delivery vs performance batching)
- What technology should we use?
- Proposed: Java + Spring Boot (like Egeria)
- discussion of using Java + Dropwizard like Marquez
- general consensus on using Java (framework TBD)
- In the future, might have a go implementation to enable lightweight sidecar pattern
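The synchronous, one-transaction-per-event routing discussed above can be sketched as follows (plain Python illustration of the design, not the real implementation; `InMemoryTransport` stands in for a real Kafka producer):

```python
class Transport:
    """Backend interface: where a received event gets routed."""
    def send(self, event: bytes) -> None:
        raise NotImplementedError

class InMemoryTransport(Transport):
    """Stand-in for e.g. a Kafka producer doing one synchronous send per event."""
    def __init__(self):
        self.sent = []
    def send(self, event):
        self.sent.append(event)

class ProxyBackend:
    """Accepts OpenLineage events over HTTP (not shown) and routes each one
    synchronously to every configured transport before acknowledging,
    favoring durability over batched performance."""
    def __init__(self, transports):
        self.transports = transports
    def handle(self, raw_event: bytes):
        # Pass-through: the raw payload is forwarded without deserialization,
        # so routing is fast but unvalidated.
        for t in self.transports:
            t.send(raw_event)

kafka_like = InMemoryTransport()
archive = InMemoryTransport()
ProxyBackend([kafka_like, archive]).handle(b'{"eventType": "START"}')
```

The transport list is what enables the routing ideas above: multiple consumers, rule-based routing, and archiving are all just additional transports.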
- Kafka client
- Roadmap
- Open discussion
How do we define extension points for integrations? For example, hooks in Spark and Airflow for the user to add adapters/facets without having to modify OL.
- TODO: create a ticket to track this
- Apache Iceberg interest in OpenLineage:
- Would want to add additional notifications
- how many files read or written
- How long a commit took.
- How many attempts to commit were needed?
- TODO: create ticket to enable Iceberg facets to be added to OpenLineage events
- Iceberg needs to send events independently of where the library is used. (example: plain java process or other)
- TODO: need ticket for this => #167 Iceberg integration
- TODO: ticket for PrestoDB/Trino integrations
- => #164 Trino and #165 PrestoDB
- Egeria has a weekly community call
- September 1st will be about OpenLineage
- Also an incoming webinar
July 14th 2021
- Attendees:
- TSC:
- Julien Le Dem
- Mandy Chessell
- Michael Collado
- Willy Lulciuc
- Meeting recording:
- zoom link
- Passcode: =!oZ?&0A
- Meeting notes
- Agenda:
- Finalize the OpenLineage Mission Statement
- Review OpenLineage 0.1 scope
- Roadmap
- Open discussion
- Slides: https://docs.google.com/presentation/d/1fD_TBUykuAbOqm51Idn7GeGqDnuhSd7f/edit#slide=id.ge4b57c6942_0_46
- Notes:
Mission statement:
Overall consensus on the statement.
TODO: vote by commenting on the ticket
Spec versioning mechanism:
The goal is to commit to compatible changes once 0.1 is published
We need a follow up to separate core facet versioning
=> TODO: create a separate github ticket. The lineage event should have a field that identifies what version of the spec it was produced with
=> TODO: create a github issue for this
TODO: Add issue to document version number semantics (SCHEMAVER)
Extend Event State notion:
where do we capture more precise state transitions like RESTART?
Discussion should happen here: https://github.com/OpenLineage/OpenLineage/issues/9
OpenLineage 0.1:
finalize a few spec details for 0.1: a few items left to discuss.
In particular job naming
parent job model
Importing Marquez integrations in OpenLineage
Open Discussion:
connecting the consumer and producer
TODO: ticket to track distribution mechanism
options:
Would we need a consumption client to make it easy for consumers to get events from Kafka for example?
OpenLineage provides client libraries to serialize/deserialize events as well as sending them.
proxy similar to OpenTelemetry Collector.
Send to Kafka: https://github.com/OpenLineage/OpenLineage/issues/70
We can have documentation on how to send to backends that are not Marquez using HTTP and existing gateway mechanism to queues.
Do we have a mutual third party or the client know where to send?
Source code location finalization
job naming convention
you don't always have a nested execution; a job can call a parent job
You can have a job calling another one.
always distinguish a job and its run
need a separate notion for job dependencies
need to capture event-driven dependencies: TODO: create ticket.
TODO(Julien): update the job naming ticket to continue the discussion.
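The parent job model under discussion corresponds to what the spec expresses as a parent run facet: a run references the job and run that triggered it, roughly like this (illustrative values, shown as a Python dict):

```python
# Illustrative "parent" run facet: links a run to the job and run that
# triggered it, keeping jobs and their runs distinct while capturing nesting.
parent_facet = {
    "run": {"runId": "d46e465b-d358-4d32-83d4-df660ff614dd"},  # hypothetical id
    "job": {"namespace": "my_scheduler", "name": "parent_job"},
}
```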
June 9th 2021
- Attendees:
- TSC:
Julien Le Dem: Marquez, Datakin
Drew Banin: dbt, CPO at Fishtown Analytics
Maciej Obuchowski: Marquez, GetInData consulting company
Zhamak Dehghani: Datamesh; an open protocol of observability for the data ecosystem is a big piece of Datamesh
Daniel Henneberger: building a database, interested in lineage
Mandy Chessell: Lead of Egeria, metadata exchange; lineage is a great extension
Willy Lulciuc: co-creator of Marquez
Michael Collado: Datakin, OpenLineage end-to-end holistic approach.
- And:
Kedar Rajwade: consulting on distributed systems.
Barr Yaron: dbt, PM at Fishtown Analytics on metadata.
Victor Shafran: co-founder at databand.ai, a pipeline monitoring company; lineage is a common issue
- Excused: Ryan Blue, James Campbell
- Meeting recording:
- zoom link
- Passcode: +ge1Akp9
- Meeting notes:
Agenda:
project communication
Technical charter review
medium term roadmap discussion
Notes:
project communication
github: for specs, designs, reviews and building consensus (issues and PRs)
email: for announcements, notes, etc
Slack: transient discussions; does not maintain history. Any decision making or notes should go to a persistent medium (email and github)
monthly meeting: recorded, notes and recording published on the wiki
Technical Charter review:
TODO: Finalize the mission statement. TSC members to comment in the doc.
Roadmap discussion:
TODO: please comment in the doc. Julien to update the OpenLineage project in github: https://github.com/OpenLineage/OpenLineage/projects/1