...

Announcements [Julien]
- OpenLineage earned the OSSF Core Infrastructure Silver Badge!
- Happening soon: OpenLineage to apply formally for Incubation status with the LFAI
- Blog: a post by Ernie Ostic about MANTA’s OpenLineage integration
- Website: a new Ecosystem page
- Workshops repo: An Intro to Dataset Lineage with Jupyter and Spark
- Airflow docs: guidance on creating custom extractors to support external operators
- Spark docs: improved documentation of column lineage facets and extensions
Recent release 0.16.1 [Michael R.]
- Added
  - Airflow: add dag_run information to Airflow version run facet #1133 @fm100
    Adds the Airflow DAG run ID to the taskInfo facet, making this additional information available to the integration.
  - Airflow: add LoggingMixin to extractors #1149 @JDarDagran
    Adds a LoggingMixin class to the custom extractor to make the output consistent with general Airflow and OpenLineage logging settings.
  - Airflow: add default extractor #1162 @mobuchowski
    Adds a DefaultExtractor to support the default implementation of OpenLineage for external operators without the need for custom extractors.
  - Airflow: add on_complete argument in DefaultExtractor #1188 @JDarDagran
    Adds support for running another method on extract_on_complete.
  - SQL: reorganize the library into multiple packages #1167 @StarostaGit @mobuchowski
    Splits the SQL library into a Rust implementation and foreign language bindings, easing the process of adding language interfaces. Also contains a CI fix.
- Changed
  - Airflow: move get_connection_uri as extractor's classmethod #1169 @JDarDagran
    The get_connection_uri method allowed for too many params, resulting in unnecessarily long URIs. This changes the logic to whitelisting per extractor.
  - Airflow: change get_openlineage_facets_on_start/complete behavior #1201 @JDarDagran
    Splits up the method for greater legibility and easier maintenance.
- Removed
  - Airflow: remove support for Airflow 1.10 #1128 @mobuchowski
    Removes the code structures and tests enabling support for Airflow 1.10.
- Bug fixes and more details
  - https://github.com/OpenLineage/OpenLineage/blob/main/CHANGELOG.md
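
Illustration of the DefaultExtractor items above: a minimal sketch, assuming the default extractor looks for get_openlineage_facets_on_start / get_openlineage_facets_on_complete methods on the operator and uses whatever they return. The Dataset and OperatorLineage classes here are simplified stand-ins, CopyTableOperator and extract() are hypothetical, and the real method signatures in the Airflow integration may take extra arguments.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Dataset:
    """Stand-in for an OpenLineage dataset reference (namespace + name)."""
    namespace: str
    name: str


@dataclass
class OperatorLineage:
    """Stand-in for the lineage container an operator hands back."""
    inputs: List[Dataset] = field(default_factory=list)
    outputs: List[Dataset] = field(default_factory=list)
    run_facets: Dict[str, dict] = field(default_factory=dict)
    job_facets: Dict[str, dict] = field(default_factory=dict)


class CopyTableOperator:
    """Hypothetical external operator; not a real Airflow class."""

    def __init__(self, source: str, target: str):
        self.source = source
        self.target = target
        self.rows_copied: Optional[int] = None

    def execute(self) -> None:
        # Placeholder value for illustration; a real operator would do work here.
        self.rows_copied = 42

    def get_openlineage_facets_on_start(self) -> OperatorLineage:
        # Lineage known before the task runs: just inputs and outputs.
        return OperatorLineage(
            inputs=[Dataset("example-db", self.source)],
            outputs=[Dataset("example-db", self.target)],
        )

    def get_openlineage_facets_on_complete(self) -> OperatorLineage:
        # Lineage known after the task runs: add a (made-up) run facet.
        lineage = self.get_openlineage_facets_on_start()
        lineage.run_facets["copyStats"] = {"rowsCopied": self.rows_copied}
        return lineage


def extract(operator, on_complete: bool = False) -> Optional[OperatorLineage]:
    """Toy dispatch: call the matching method if the operator defines it."""
    name = (
        "get_openlineage_facets_on_complete"
        if on_complete
        else "get_openlineage_facets_on_start"
    )
    method = getattr(operator, name, None)
    return method() if callable(method) else None
```

With this shape, an operator that defines neither method simply yields no lineage, and no custom extractor class is needed for operators that do define them.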

Update on LFAI & Data progress [Michael R.]

LFAI & Data: a single funding effort to support technical projects hosted under the [Linux] foundation
Current status: applying soon for Incubation, will be ready to apply for Graduation soon (dates TBD).

Incubation stage requirements:

- 2+ organizations actively contributing to the project: 23 organizations
- A sponsor who is an existing LFAI & Data member: To do
- 300+ stars on GitHub: 1.1K GitHub stars
- A Core Infrastructure Initiative Best Practices Silver Badge: Silver Badge earned on November 2
- Affirmative vote of the TAC and Governing Board: Pending
- A defined TSC with a chairperson: TSC with chairperson Julien Le Dem

Graduation stage requirements:

- 5+ organizations actively contributing to the project: 23 organizations
- Substantial flow of commits for 12 months: commit growth rate (12 mo.) 155.53%; avg. commits pushed by active contributors (12 mo.) 2.18K
- 1000+ stars on GitHub: 1.1K GitHub stars
- Core Infrastructure Initiative Best Practices Gold Badge: Gold Badge in progress (57%)
- Affirmative vote of the TAC and Governing Board: Pending
- 1+ collaboration with another LFAI project: Marquez, Egeria, Amundsen
- Technical lead appointed on the TAC: To do

Implementing OpenLineage proposal and discussion [Julien]
- Procedure for implementing OpenLineage is under-documented
- Goal: provide a better guide on the multiple approaches that exist
- Contributions are welcome
- Expect more information about this at the next meeting
MANTA integration update [Petr]
- Project: MANTA OpenLineage Connector
- Straightforward solution:
  - Agent installed on customer side to set up an API endpoint for MANTA
  - MANTA Agent will hand over OpenLineage events to the MANTA OpenLineage Extractor, which will save the data in a MANTA OpenLineage Event Repository
  - Use the MANTA Admin UI to run/schedule the MANTA OpenLineage Reader to generate an OpenLineage Graph and produce the final MANTA Graph using a MANTA OpenLineage Generator
  - The whole process will be parameterized
- Demo:
  - Example dataset produced by Keboola integration
  - All dependencies visualized in UI
  - Some information about columns is available, but not true column lineage
  - Possible to draw lineage across range of tools
- Looking for volunteers willing to test the integration
- Q&A
  - Are you using the Column-level Lineage Facet from OpenLineage?
    - Not yet, but we would like to test it
    - Find a good example of this in the OpenLineage/workshops/Spark GitHub repo
    - What would be great would be a real example/real environment for testing
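
For anyone wanting to experiment with the Column-level Lineage Facet raised in the Q&A: a minimal sketch, assuming the facet sits under the columnLineage key of an output dataset and lists inputFields (namespace, name, field) per output column. The dataset and column names are invented for illustration, column_edges() is a hypothetical helper, and nothing here reflects how the MANTA connector works.

```python
from typing import List, Tuple

# Invented example of one output dataset from an OpenLineage event.
event_output = {
    "namespace": "warehouse",
    "name": "public.orders_summary",
    "facets": {
        "columnLineage": {
            "fields": {
                "total": {
                    "inputFields": [
                        {"namespace": "warehouse", "name": "public.orders", "field": "price"},
                        {"namespace": "warehouse", "name": "public.orders", "field": "quantity"},
                    ]
                },
                "customer_id": {
                    "inputFields": [
                        {"namespace": "warehouse", "name": "public.orders", "field": "customer_id"},
                    ]
                },
            }
        }
    },
}


def column_edges(output_dataset: dict) -> List[Tuple[str, str]]:
    """Flatten the facet into (source column, target column) pairs."""
    facet = output_dataset.get("facets", {}).get("columnLineage", {})
    target = f'{output_dataset["namespace"]}/{output_dataset["name"]}'
    edges = []
    for column, spec in facet.get("fields", {}).items():
        for src in spec.get("inputFields", []):
            source = f'{src["namespace"]}/{src["name"]}.{src["field"]}'
            edges.append((source, f"{target}.{column}"))
    return edges


for source, target in column_edges(event_output):
    print(source, "->", target)
```

A dataset without the facet produces an empty edge list, which matches the situation described in the demo: dataset-level dependencies are available, but no true column lineage.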
Linking CMF (a common ML metadata framework) and OpenLineage [Suparna & Ann Mary]
- https://github.com/HewlettPackard/cmf
- Where CMF will fit in the OpenLineage ecosystem
  - linkage needed between forms of metadata for conducting AI experiments
  - concept: "git for AI metadata" consumable by tools such as Marquez and Egeria after publication by an OpenLineage-CMF publisher
  - challenges:
    - multiple stages with interlinked dependencies
    - executing asynchronously
    - data centricity requires artifact lineage and tracking influence of different artifacts and data slices on model performance
    - pipelines should be Reproducible, Auditable and Traceable
    - end-to-end visibility is necessary to identify biases, etc.
  - AI for Science example:
    - training loop in complex pipeline with multiple models optimized concurrently
      - e.g., an embedding model, edge selection model and graph neural model in same pipeline
      - CMF used to capture metadata across pipeline stages
  - Manufacturing quality monitoring pipeline
    - iterative retraining with new samples added to the dataset every iteration
    - CMF tracks lineage across training and deployment stages
    - Q: is the recording of metadata automatic, or does the data scientist have control over it?
      - there are both explicit (e.g., APIs) and implicit modes of tracking
      - the data scientist can choose which "branches" to "push" a la Git
  - 3 columns of reproducibility
    - metadata store (MLMD/MLFlow)
    - Artifact Store (DVC/Others)
    - Query Cache Layer (Graph Database)
    - GIT
    - optimization
  - Comparison with other AI metadata infrastructure
    - Git-like support and ability to collaborate across teams distinguish CMF from alternatives
    - Metrics and lineage also make CMF comparable to model-centric and pipeline-centric tools
  - Lineage tracking and decentralized usage model
    - complete view of data model lineage for reproducibility, optimization, explainability
    - decentralized usage model, easily cloned in any environment
  - What does it look like?
    - explicit tracking via Python library
    - tracking of dataset, model and metrics
    - offers end-to-end visibility
  - API
    - abstractions: pipeline state, context/stage of execution, execution
  - Automated logging, heterogeneous SW stacks and distributed teams
    - enables collaboration of distributed teams of scientists using a diverse set of libraries
    - automatic logging in command line interface
  - POC implementations
    - allows for integration with existing frameworks
    - compatible with ML/DL frameworks and ML tracking platforms
  - Translation between CMF and OpenLineage
    - export of metadata in OpenLineage format
    - mapping of abstractions onto OpenLineage
    - Run ~ Execution with Run facet
    - Job ~ Context with Job facet
    - Dataset ~ Dataset with Dataset facet
    - Namespace ~ Pipeline
  - Q&A
    - Pipeline might map to Job name
    - Context might map to Pipeline as Parent job
    - A Model could map to a Dataset, just as a CMF Dataset does
    - Metric as a model could map to a Dataset facet
    - 2 levels of dataset facet, one static and one tied to Job Runs
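
The mapping presented above (Run ~ Execution, Job ~ Context, Dataset ~ Dataset, Namespace ~ Pipeline) can be sketched as follows. This is only an illustration under stated assumptions: cmf_to_openlineage() is a hypothetical function, not part of CMF or OpenLineage; the producer URL is a placeholder; and the event omits fields a complete OpenLineage event would carry (such as schemaURL and any facets). The alternatives raised in the Q&A (Pipeline as Job name, Context as parent job) are not modeled.

```python
import uuid
from datetime import datetime, timezone
from typing import List


def cmf_to_openlineage(
    pipeline: str,
    context: str,
    execution_id: str,
    inputs: List[str],
    outputs: List[str],
    event_type: str = "COMPLETE",
) -> dict:
    """Build an OpenLineage-style run event from CMF-style abstractions."""

    def dataset(name: str) -> dict:
        # Dataset ~ Dataset, placed in the Pipeline's namespace.
        return {"namespace": pipeline, "name": name, "facets": {}}

    return {
        "eventType": event_type,
        "eventTime": datetime.now(timezone.utc).isoformat(),
        # Placeholder producer URI, an assumption for this sketch.
        "producer": "https://example.com/cmf-openlineage-publisher",
        # Run ~ Execution
        "run": {"runId": execution_id, "facets": {}},
        # Namespace ~ Pipeline, Job ~ Context
        "job": {"namespace": pipeline, "name": context, "facets": {}},
        "inputs": [dataset(name) for name in inputs],
        "outputs": [dataset(name) for name in outputs],
    }


event = cmf_to_openlineage(
    pipeline="quality-monitoring",
    context="train",
    execution_id=str(uuid.uuid4()),
    inputs=["samples.csv"],
    outputs=["model.pkl"],
)
```

Here the trained model appears as an output dataset, in line with the Q&A suggestion that a Model could map to a Dataset.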

October 13, 2022 (10am PT)

...
