...
- Announcements [Julien]
- OpenLineage earned the OSSF Core Infrastructure Silver Badge!
- Happening soon: OpenLineage to apply formally for Incubation status with the LFAI
- Blog: a post by Ernie Ostic about MANTA’s OpenLineage integration
- Website: a new Ecosystem page
- Workshops repo: An Intro to Dataset Lineage with Jupyter and Spark
- Airflow docs: guidance on creating custom extractors to support external operators
- Spark docs: improved documentation of column lineage facets and extensions
- Recent release 0.16.1 [Michael R.]
Added
- Airflow: add dag_run information to Airflow version run facet #1133 @fm100
Adds the Airflow DAG run ID to the taskInfo facet, making this additional information available to the integration.
- Airflow: add LoggingMixin to extractors #1149 @JDarDagran
Adds a LoggingMixin class to the custom extractor to make the output consistent with general Airflow and OpenLineage logging settings.
- Airflow: add default extractor #1162 @mobuchowski
Adds a DefaultExtractor to support the default implementation of OpenLineage for external operators without the need for custom extractors.
- Airflow: add on_complete argument in DefaultExtractor #1188 @JDarDagran
Adds support for running another method on extract_on_complete.
- SQL: reorganize the library into multiple packages #1167 @StarostaGit @mobuchowski
Splits the SQL library into a Rust implementation and foreign language bindings, easing the process of adding language interfaces. Also contains a CI fix.
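As an illustration of the default-extractor idea in the items above, here is a minimal sketch, not the library's implementation. It assumes an operator can describe its own lineage through methods named get_openlineage_facets_on_start and get_openlineage_facets_on_complete (names taken from these release notes). The Lineage container, SketchDefaultExtractor, CopyTableOperator and PlainOperator are hypothetical names, and the task instance is stood in for by a plain dict.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Lineage:
    """Hypothetical container for what an operator reports about itself."""
    inputs: List[str] = field(default_factory=list)
    outputs: List[str] = field(default_factory=list)
    run_facets: Dict[str, Any] = field(default_factory=dict)


class SketchDefaultExtractor:
    """Simplified stand-in for a default extractor (not the library class)."""

    def __init__(self, operator: Any) -> None:
        self.operator = operator

    def extract(self) -> Optional[Lineage]:
        # At task start, ask the operator for lineage if it knows how to give it.
        method = getattr(self.operator, "get_openlineage_facets_on_start", None)
        return method() if callable(method) else None

    def extract_on_complete(self, task_instance: Any) -> Optional[Lineage]:
        # Prefer the completion hook when the operator defines one,
        # otherwise fall back to the start-time lineage.
        method = getattr(self.operator, "get_openlineage_facets_on_complete", None)
        if callable(method):
            return method(task_instance)
        return self.extract()


class CopyTableOperator:
    """Hypothetical external operator that describes its own lineage."""

    def __init__(self, source: str, target: str) -> None:
        self.source = source
        self.target = target

    def get_openlineage_facets_on_start(self) -> Lineage:
        return Lineage(inputs=[self.source], outputs=[self.target])

    def get_openlineage_facets_on_complete(self, task_instance: Any) -> Lineage:
        lineage = self.get_openlineage_facets_on_start()
        # Row counts are only known after the task has run.
        lineage.run_facets["rowCount"] = task_instance["rows_copied"]
        return lineage


class PlainOperator:
    """Operator without lineage methods: the extractor yields nothing."""
```

With this shape, an operator author only adds methods to the operator and no separate extractor class is needed, which is the convenience the DefaultExtractor entries describe.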
Changed
- Airflow: move get_connection_uri as extractor's classmethod #1169 @JDarDagran
The get_connection_uri method allowed for too many params, resulting in unnecessarily long URIs. This changes the logic to whitelisting per extractor.
- Airflow: change get_openlineage_facets_on_start/complete behavior #1201 @JDarDagran
Splits up the method for greater legibility and easier maintenance.
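The per-extractor whitelisting described for get_connection_uri can be pictured with a small sketch. This is an assumption-laden illustration, not the integration's code: the real method's signature is not given in these notes, so the version below simply takes a URI string, and filter_uri_params, allowed_query_params, BaseSketchExtractor and WarehouseSketchExtractor are hypothetical names.

```python
from typing import Iterable, Tuple
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def filter_uri_params(uri: str, allowed: Iterable[str]) -> str:
    """Return `uri` with only the query parameters named in `allowed`."""
    allowed_set = set(allowed)
    parts = urlsplit(uri)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key in allowed_set
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


class BaseSketchExtractor:
    # Hypothetical per-extractor allowlist; empty means "drop every parameter".
    allowed_query_params: Tuple[str, ...] = ()

    @classmethod
    def get_connection_uri(cls, uri: str) -> str:
        # Assumed signature for illustration only.
        return filter_uri_params(uri, cls.allowed_query_params)


class WarehouseSketchExtractor(BaseSketchExtractor):
    # Each subclass declares the few parameters that belong in its URIs.
    allowed_query_params = ("warehouse", "role")
```

Keeping the allowlist on the class means each extractor decides which parameters are meaningful for its connection type, and everything else is dropped by default.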
Removed
- Airflow: remove support for Airflow 1.10 #1128 @mobuchowski
Removes the code structures and tests enabling support for Airflow 1.10.
Bug fixes and more details
- Update on LFAI & Data progress [Michael R.]
- LFAI & Data: a single funding effort to support technical projects hosted under the [Linux] foundation
- Current status: applying soon for Incubation, will be ready to apply for Graduation soon (dates TBD).
- Incubation stage requirements:
- 2+ organizations actively contributing to the project: 23 organizations
- A sponsor who is an existing LFAI & Data member: to do
- 300+ stars on GitHub: 1.1K GitHub stars
- A Core Infrastructure Initiative Best Practices Silver Badge: Silver Badge earned on November 2
- Affirmative vote of the TAC and Governing Board: pending
- A defined TSC with a chairperson: TSC with chairperson Julien Le Dem
- Graduation stage requirements:
- 5+ organizations actively contributing to the project: 23 organizations
- Substantial flow of commits for 12 months: commit growth rate (12 mo.) 155.53%; avg commits pushed by active contributors (12 mo.) 2.18K
- 1000+ stars on GitHub: 1.1K GitHub stars
- Core Infrastructure Initiative Best Practices Gold Badge: Gold Badge in progress (57%)
- Affirmative vote of the TAC and Governing Board: pending
- 1+ collaboration with another LFAI project: Marquez, Egeria, Amundsen
- Technical lead appointed on the TAC: to do
- Implementing OpenLineage proposal and discussion [Julien]
- Procedure for implementing OpenLineage is under-documented
- Goal: provide a better guide on the multiple approaches that exist
- Contributions are welcome
- Expect more information about this at the next meeting
- MANTA integration update [Petr]
- Project: MANTA OpenLineage Connector
- Straightforward solution:
- Agent installed on customer side to set up an API endpoint for MANTA
- MANTA Agent will hand over OpenLineage events to the MANTA OpenLineage Extractor, which will save the data in a MANTA OpenLineage Event Repository
- Use the MANTA Admin UI to run/schedule the MANTA OpenLineage Reader to generate an OpenLineage Graph and produce the final MANTA Graph using a MANTA OpenLineage Generator
- The whole process will be parameterized
- Demo:
- Example dataset produced by Keboola integration
- All dependencies visualized in UI
- Some information about columns is available, but not true column lineage
- Possible to draw lineage across range of tools
- Looking for volunteers willing to test the integration
- Q&A
- Are you using the Column-level Lineage Facet from OpenLineage?
- Not yet, but we would like to test it
- Find a good example of this in the OpenLineage/workshops/Spark GitHub repo
- What would be great would be a real example/real environment for testing
- Linking CMF (a common ML metadata framework) and OpenLineage [Suparna & Ann Mary]
- https://github.com/HewlettPackard/cmf
- Where CMF will fit in the OpenLineage ecosystem
- linkage needed between forms of metadata for conducting AI experiments
- concept: "git for AI metadata" consumable by tools such as Marquez and Egeria after publication by an OpenLineage-CMF publisher
- challenges:
- multiple stages with interlinked dependencies
- executing asynchronously
- data centricity requires artifact lineage and tracking influence of different artifacts and data slices on model performance
- pipelines should be Reproducible, Auditable and Traceable
- end-to-end visibility is necessary to identify biases, etc.
- AI for Science example:
- training loop in complex pipeline with multiple models optimized concurrently
- e.g., an embedding model, edge selection model and graph neural model in same pipeline
- CMF used to capture metadata across pipeline stages
- Manufacturing quality monitoring pipeline
- iterative retraining with new samples added to the dataset every iteration
- CMF tracks lineage across training and deployment stages
- Q: is the recording of metadata automatic, or does the data scientist have control over it?
- there are both explicit (e.g., APIs) and implicit modes of tracking
- the data scientist can choose which "branches" to "push" a la Git
- 3 columns of reproducibility
- metadata store (MLMD/MLFlow)
- Artifact Store (DVC/Others)
- Query Cache Layer (Graph Database)
- GIT
- optimization
- Comparison with other AI metadata infrastructure
- Git-like support and ability to collaborate across teams distinguish CMF from alternatives
- Metrics and lineage also make CMF comparable to model-centric and pipeline-centric tools
- Lineage tracking and decentralized usage model
- complete view of data model lineage for reproducibility, optimization, explainability
- decentralized usage model, easily cloned in any environment
- What does it look like?
- explicit tracking via Python library
- tracking of dataset, model and metrics
- offers end-to-end visibility
- API
- abstractions: pipeline state, context/stage of execution, execution
- Automated logging, heterogeneous software stacks and distributed teams
- enables collaboration of distributed teams of scientists using a diverse set of libraries
- automatic logging in command line interface
- POC implementations
- allows for integration with existing frameworks
- compatible with ML/DL frameworks and ML tracking platforms
- Translation between CMF and OpenLineage
- export of metadata in OpenLineage format
- mapping of abstractions onto OpenLineage
- Run ~ Execution with Run facet
- Job ~ Context with Job facet
- Dataset ~ Dataset with Dataset facet
- Namespace ~ Pipeline
- Q&A
- Pipeline might map to Job name
- Context might map to Pipeline as Parent job
- Model, like Dataset, could map to an OpenLineage Dataset
- Metric as a model could map to a Dataset facet
- 2 levels of dataset facet, one static and one tied to Job Runs
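The mapping of CMF abstractions onto OpenLineage presented above (Namespace ~ Pipeline, Job ~ Context, Run ~ Execution, Dataset ~ Dataset) could look roughly like the sketch below. It is an illustration under assumptions, not the CMF publisher: the function name and its parameters are hypothetical, the CMF side is represented by plain strings rather than any real CMF API, and the producer and schema URLs are left for the caller to supply. The alternatives raised in the Q&A (for example, Pipeline as the job name) are not implemented here.

```python
import uuid
from datetime import datetime, timezone
from typing import Any, Dict, List


def cmf_to_openlineage_event(
    pipeline: str,
    context: str,
    execution_id: str,
    input_artifacts: List[str],
    output_artifacts: List[str],
    producer: str,
    schema_url: str,
    event_type: str = "COMPLETE",
) -> Dict[str, Any]:
    """Build an OpenLineage-shaped run event from CMF-style identifiers.

    Hypothetical helper: Pipeline -> namespace, Context -> job name,
    Execution -> run, artifacts -> input and output datasets.
    """
    # OpenLineage run IDs are UUIDs, so derive a stable one from the
    # CMF identifiers instead of using the raw execution ID.
    run_id = str(uuid.uuid5(uuid.NAMESPACE_URL, f"{pipeline}/{context}/{execution_id}"))

    def dataset(name: str) -> Dict[str, Any]:
        # Models are treated as datasets too, as suggested in the Q&A.
        return {"namespace": pipeline, "name": name, "facets": {}}

    return {
        "eventType": event_type,
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "producer": producer,      # caller-supplied identifier of the emitting tool
        "schemaURL": schema_url,   # caller-supplied URL of the targeted spec version
        "run": {"runId": run_id, "facets": {}},
        "job": {"namespace": pipeline, "name": context, "facets": {}},
        "inputs": [dataset(name) for name in input_artifacts],
        "outputs": [dataset(name) for name in output_artifacts],
    }
```

Deriving the run ID deterministically means the same CMF execution always maps to the same OpenLineage run, so events for one execution published at different times can still be correlated by a consumer such as Marquez.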
October 13, 2022 (10am PT)
...