...

  • Announcements [Julien]
    • OpenLineage earned the OSSF Core Infrastructure Silver Badge!
    • Happening soon: OpenLineage to apply formally for Incubation status with the LFAI
    • Blog: a post by Ernie Ostic about MANTA’s OpenLineage integration
    • Website: a new Ecosystem page
    • Workshops repo: An Intro to Dataset Lineage with Jupyter and Spark
    • Airflow docs: guidance on creating custom extractors to support external operators
    • Spark docs: improved documentation of column lineage facets and extensions
  • Recent release 0.16.1 [Michael R.] 
    • Added

      • Airflow: add dag_run information to Airflow version run facet #1133 @fm100
        Adds the Airflow DAG run ID to the taskInfo facet, making this additional information available to the integration.
      • Airflow: add LoggingMixin to extractors #1149 @JDarDagran
        Adds a LoggingMixin class to the custom extractor to make the output consistent with general Airflow and OpenLineage logging settings.
      • Airflow: add default extractor #1162 @mobuchowski
        Adds a DefaultExtractor to support the default implementation of OpenLineage for external operators without the need for custom extractors.
      • Airflow: add on_complete argument in DefaultExtractor #1188 @JDarDagran
        Adds support for running another method on extract_on_complete.
      • SQL: reorganize the library into multiple packages #1167 @StarostaGit @mobuchowski
        Splits the SQL library into a Rust implementation and foreign language bindings, easing the process of adding language interfaces. Also contains a CI fix.

    • Changed

      • Airflow: move get_connection_uri as extractor's classmethod #1169 @JDarDagran
        The get_connection_uri method allowed for too many params, resulting in unnecessarily long URIs. This changes the logic to whitelisting per extractor.
      • Airflow: change get_openlineage_facets_on_start/complete behavior #1201 @JDarDagran
        Splits up the method for greater legibility and easier maintenance.
    • Removed

      • Airflow: remove support for Airflow 1.10 #1128 @mobuchowski
        Removes the code structures and tests enabling support for Airflow 1.10.
    • Bug fixes and more details 
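
The default-extractor idea from the release notes above can be sketched roughly as follows. This is a simplified, standard-library-only illustration and not the openlineage-airflow implementation: `Lineage`, `CopyTableOperator` and `SketchDefaultExtractor` are hypothetical stand-ins, and the two method names are taken from the `get_openlineage_facets_on_start/complete` wording in the notes. The point is only that an operator can describe its own lineage, so that a generic extractor can pick it up without a custom extractor per operator.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Lineage:
    """Hypothetical stand-in for the lineage metadata an operator reports."""
    inputs: List[str] = field(default_factory=list)
    outputs: List[str] = field(default_factory=list)


class CopyTableOperator:
    """Hypothetical external operator that describes its own lineage."""

    def __init__(self, source: str, target: str) -> None:
        self.source = source
        self.target = target

    def get_openlineage_facets_on_start(self) -> Lineage:
        # Known before the task runs: where the data comes from.
        return Lineage(inputs=[self.source])

    def get_openlineage_facets_on_complete(self) -> Lineage:
        # Known after the task runs: the output that was written.
        return Lineage(inputs=[self.source], outputs=[self.target])


class SketchDefaultExtractor:
    """Asks the operator for lineage if it exposes the expected methods."""

    def __init__(self, operator: object) -> None:
        self.operator = operator

    def _call(self, method_name: str) -> Optional[Lineage]:
        method = getattr(self.operator, method_name, None)
        return method() if callable(method) else None

    def extract(self) -> Optional[Lineage]:
        return self._call("get_openlineage_facets_on_start")

    def extract_on_complete(self) -> Optional[Lineage]:
        # Prefer the on-complete method; fall back to the on-start one.
        result = self._call("get_openlineage_facets_on_complete")
        return result if result is not None else self.extract()


operator = CopyTableOperator("db.raw_orders", "db.orders")
extractor = SketchDefaultExtractor(operator)
print(extractor.extract())
print(extractor.extract_on_complete())
```

An operator that defines neither method simply yields `None` here, which is where a hand-written custom extractor would still be needed.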

  • Update on LFAI & Data progress [Michael R.]
    • LFAI & Data: a single funding effort to support technical projects hosted under the [Linux] Foundation
    • Current status: applying soon for Incubation, will be ready to apply for Graduation soon (dates TBD).
    • Incubation stage requirements (requirement: current status):
      • 2+ organizations actively contributing to the project: 23 organizations
      • A sponsor who is an existing LFAI & Data member: To do
      • 300+ stars on GitHub: 1.1K GitHub stars
      • A Core Infrastructure Initiative Best Practices Silver Badge: Silver Badge earned on November 2
      • Affirmative vote of the TAC and Governing Board: Pending
      • A defined TSC with a chairperson: TSC with chairperson Julien Le Dem

    • Graduation stage requirements (requirement: current status):
      • 5+ organizations actively contributing to the project: 23 organizations
      • Substantial flow of commits for 12 months: commit growth rate (12 mo.) 155.53%; avg commits pushed by active contributors (12 mo.) 2.18K
      • 1000+ stars on GitHub: 1.1K GitHub stars
      • Core Infrastructure Initiative Best Practices Gold Badge: Gold Badge in progress (57%)
      • Affirmative vote of the TAC and Governing Board: Pending
      • 1+ collaboration with another LFAI project: Marquez, Egeria, Amundsen
      • Technical lead appointed on the TAC: To do


  • Implementing OpenLineage proposal and discussion [Julien]
    • Procedure for implementing OpenLineage is under-documented
    • Goal: provide a better guide on the multiple approaches that exist
    • Contributions are welcome
    • Expect more information about this at the next meeting
  • MANTA integration update [Petr]
    • Project: MANTA OpenLineage Connector
    • Straightforward solution:
      • Agent installed on customer side to set up an API endpoint for MANTA
      • MANTA Agent will hand over OpenLineage events to the MANTA OpenLineage Extractor, which will save the data in a MANTA OpenLineage Event Repository
      • Use the MANTA Admin UI to run/schedule the MANTA OpenLineage Reader to generate an OpenLineage Graph and produce the final MANTA Graph using a MANTA OpenLineage Generator
      • The whole process will be parameterized
    • Demo:
      • Example dataset produced by Keboola integration
      • All dependencies visualized in UI
      • Some information about columns is available, but not true column lineage
      • Possible to draw lineage across range of tools
    • Looking for volunteers willing to test the integration
    • Q&A
      • Are you using the Column-level Lineage Facet from OpenLineage?
        • Not yet, but we would like to test it
        • Find a good example of this in the OpenLineage/workshops/Spark GitHub repo
        • What would be great would be a real example/real environment for testing
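
For readers unfamiliar with the column-level lineage facet raised in the Q&A above, here is a minimal sketch of an OpenLineage event carrying one. It uses only the standard library and builds the JSON by hand. The producer URI, dataset names and the two schema URL versions are assumptions for illustration, and the facet field names (`inputFields`, `transformationDescription`, `transformationType`) reflect one reading of the OpenLineage spec, so check the published spec before relying on them. It says nothing about how the MANTA connector itself consumes events.

```python
import json
import uuid
from datetime import datetime, timezone

# Assumptions for illustration: placeholder producer and assumed schema versions.
PRODUCER = "https://example.com/hypothetical-producer"
FACET_SCHEMA_URL = "https://openlineage.io/spec/facets/1-0-1/ColumnLineageDatasetFacet.json"
EVENT_SCHEMA_URL = "https://openlineage.io/spec/1-0-5/OpenLineage.json#/definitions/RunEvent"


def column_lineage_facet(columns):
    """Build a columnLineage dataset facet.

    `columns` maps each output column to a dict with keys
    "inputs" (list of (namespace, dataset, column) tuples),
    "description" and "type".
    """
    fields = {}
    for output_column, info in columns.items():
        fields[output_column] = {
            "inputFields": [
                {"namespace": ns, "name": dataset, "field": column}
                for ns, dataset, column in info["inputs"]
            ],
            "transformationDescription": info["description"],
            "transformationType": info["type"],
        }
    return {"_producer": PRODUCER, "_schemaURL": FACET_SCHEMA_URL, "fields": fields}


def complete_event(job_namespace, job_name, output_namespace, output_name, columns):
    """Build a COMPLETE run event whose output dataset carries column lineage."""
    # Input datasets are derived from the column-level sources.
    sources = sorted(
        {(ns, dataset) for info in columns.values() for ns, dataset, _ in info["inputs"]}
    )
    return {
        "eventType": "COMPLETE",
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "producer": PRODUCER,
        "schemaURL": EVENT_SCHEMA_URL,
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": job_namespace, "name": job_name},
        "inputs": [{"namespace": ns, "name": dataset} for ns, dataset in sources],
        "outputs": [
            {
                "namespace": output_namespace,
                "name": output_name,
                "facets": {"columnLineage": column_lineage_facet(columns)},
            }
        ],
    }


example_columns = {
    "customer_id": {
        "inputs": [("warehouse", "raw.orders", "cust_id")],
        "description": "copied unchanged",
        "type": "IDENTITY",
    },
    "email_hash": {
        "inputs": [("warehouse", "raw.customers", "email")],
        "description": "hash of the e-mail address",
        "type": "MASKED",
    },
}
event = complete_event("etl", "build_orders", "warehouse", "clean.orders", example_columns)
print(json.dumps(event, indent=2))
```

An event like this, sent to any OpenLineage-compatible endpoint, is the kind of input a consumer would need in order to draw true column-to-column edges rather than only dataset-level ones.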
  • Linking CMF (a common ML metadata framework) and OpenLineage [Suparna & Ann Mary]
    • https://github.com/HewlettPackard/cmf
    • Where CMF will fit in the OpenLineage ecosystem
      • linkage needed between forms of metadata for conducting AI experiments
      • concept: "git for AI metadata" consumable by tools such as Marquez and Egeria after publication by an OpenLineage-CMF publisher
      • challenges:
        • multiple stages with interlinked dependencies
        • executing asynchronously
        • data centricity requires artifact lineage and tracking influence of different artifacts and data slices on model performance
        • pipelines should be Reproducible, Auditable and Traceable
        • end-to-end visibility is necessary to identify biases, etc.
      • AI for Science example:
        • training loop in complex pipeline with multiple models optimized concurrently
          • e.g., an embedding model, edge selection model and graph neural model in same pipeline
          • CMF used to capture metadata across pipeline stages
      • Manufacturing quality monitoring pipeline
        • iterative retraining with new samples added to the dataset every iteration
        • CMF tracks lineage across training and deployment stages
        • Q: is the recording of metadata automatic, or does the data scientist have control over it?
          • there are both explicit (e.g., APIs) and implicit modes of tracking
          • the data scientist can choose which "branches" to "push" a la Git
      • 3 columns of reproducibility
        • metadata store (MLMD/MLFlow)
        • Artifact Store (DVC/Others)
        • Query Cache Layer (Graph Database)
        • GIT
        • optimization
      • Comparison with other AI metadata infrastructure
        • Git-like support and ability to collaborate across teams distinguish CMF from alternatives
        • Metrics and lineage also make CMF comparable to model-centric and pipeline-centric tools
      • Lineage tracking and decentralized usage model
        • complete view of data model lineage for reproducibility, optimization, explainability
        • decentralized usage model, easily cloned in any environment
      • What does it look like?
        • explicit tracking via Python library
        • tracking of dataset, model and metrics
        • offers end-to-end visibility
      • API
        • abstractions: pipeline state, context/stage of execution, execution
      • Automated logging, heterogeneous software stacks and distributed teams
        • enables collaboration of distributed teams of scientists using a diverse set of libraries
        • automatic logging in command line interface
      • POC implementations
        • allows for integration with existing frameworks
        • compatible with ML/DL frameworks and ML tracking platforms
      • Translation between CMF and OpenLineage
        • export of metadata in OpenLineage format
        • mapping of abstractions onto OpenLineage
        • Run ~ Execution with Run facet
        • Job ~ Context with Job facet
        • Dataset ~ Dataset with Dataset facet
        • Namespace ~ Pipeline
      • Q&A
        • Pipeline might map to Job name
        • Context might map to Pipeline as Parent job
        • A CMF Model, as well as a CMF Dataset, could map to an OpenLineage Dataset
        • Metric as a model could map to a Dataset facet
        • 2 levels of dataset facet, one static and one tied to Job Runs
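
The mapping presented under "Translation between CMF and OpenLineage" (Run ~ Execution, Job ~ Context, Dataset ~ Dataset, Namespace ~ Pipeline) can be sketched as a small translation function. This is an illustration only: `CmfExecution` is a hypothetical, simplified record and not the cmf library's data model, and the output shows just the fields relevant to the mapping (a valid OpenLineage event would also need `eventTime`, `producer` and `schemaURL`). The alternatives raised in the Q&A, such as mapping Pipeline to the Job name, would change the two marked lines.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CmfExecution:
    """Hypothetical, simplified view of one CMF execution record."""
    pipeline: str
    context: str
    execution_id: str
    inputs: List[str] = field(default_factory=list)
    outputs: List[str] = field(default_factory=list)


def to_openlineage(execution: CmfExecution, event_type: str = "COMPLETE") -> Dict:
    """Translate one execution into an OpenLineage-shaped dict."""
    namespace = execution.pipeline  # Namespace ~ Pipeline
    job_name = execution.context    # Job ~ Context
    # OpenLineage run IDs are UUIDs, so derive a stable one from the CMF identifiers.
    run_id = str(
        uuid.uuid5(uuid.NAMESPACE_URL, f"{namespace}/{job_name}/{execution.execution_id}")
    )

    def dataset(name: str) -> Dict:
        # Dataset ~ Dataset; facets would carry CMF artifact metadata.
        return {"namespace": namespace, "name": name, "facets": {}}

    return {
        "eventType": event_type,
        "run": {"runId": run_id, "facets": {}},  # Run ~ Execution
        "job": {"namespace": namespace, "name": job_name, "facets": {}},
        "inputs": [dataset(name) for name in execution.inputs],
        "outputs": [dataset(name) for name in execution.outputs],
    }


record = CmfExecution(
    pipeline="quality-monitoring",
    context="train",
    execution_id="exec-7",
    inputs=["samples.csv"],
    outputs=["model.pkl", "metrics.json"],
)
print(to_openlineage(record))
```

Deriving the run ID with `uuid5` keeps repeated exports of the same execution pointing at the same OpenLineage run, which is a design choice of this sketch rather than something stated in the presentation.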

October 13, 2022 (10am PT)

...