Marquez Monthly Community Meeting

The Marquez Community Meeting occurs on the fourth Thursday of each month. Meetings are held on Zoom.

August 25, 2022

July 28, 2022

Attendees:

  • TSC:
    • Willy Lulciuc, Co-creator of Marquez
    • Michael Collado, Staff Software Engineer, Astronomer
  • And:
    • Michael Robinson, Software Engineer, Dev. Rel., Astronomer
    • Minkyu Park, Senior Engineer, Astronomer
    • John Thomas, Software Engineer, Dev. Rel., Astronomer
    • Ross Turk, Senior Director of Community, Astronomer
    • Ryan Hatter, Customer Reliability Engineer, Astronomer
    • Howard Yoo, Staff Product Manager, Astronomer

Agenda:

  1. Announcements
  2. Introducing the Marquez blog
  3. Architecture review: the lineage graph
  4. Discussion
    1. Marquez issue #2048

Meeting:

http://youtube.com/watch?v=YMdhWpA3f3k

Notes:

  1. Announcements [Willy]

  2. Introducing the Marquez Blog [Michael R. and Ross]

    1. new blog can be found at marquezproject.ai/blog
    2. designed and built by Ross
    3. to contribute a blog post on GitHub:
      1. write post in Markdown, place it in new directory in OpenLineage/website/contents/blog 
      2. OR: open an issue first to suggest a topic or get feedback on your idea
      3. artwork: Ross happy to make the images; tag him
      4. Ross also happy to document the artwork creation process for others
  3. Architecture review: the lineage graph [Willy]

    1. What is Marquez doing in the background to surface lineage metadata at the run level during execution?
    2. What is a current lineage graph?
      1. bipartite graph (bigraph) with nodes for jobs and datasets
      2. run-level lineage is collected from OpenLineage events
      3. a job is represented by its datasets: the inputs it consumes and the outputs it produces
      4. datasets stitched together using OpenLineage `ID` (global and unique)
      5. versioning of jobs enabled by OpenLineage `JobVersion`
        1. Marquez keeps track of changes to code and datasets behind the scenes
    3. Marquez data model
      1. Marquez keeps track of:
        1. job versions
        2. runs of each version
        3. sources
      2. each node represents the latest, or current, version of the job's lineage
      3. `Job` is `ID` and arrays representing input and output datasets
    4. Demo
      1. UI defaults to latest/current graph
      2. prior versions accessible via `version history` tab
      3. selecting a version makes another job node/datasets visible
      4. makes "time travel" possible in your pipeline
      5. all of this possible thanks to the OpenLineage spec
    5. Q & A
      1. If a job has not completed, will you not see metadata? [Howard]
        • no – a job has to complete in order for versioning logic to be applied 
      2. Is a job version associated with the code that produced it? [Ryan]
        • yes – if the code is provided as a source location facet
        • Marquez will determine if the code has changed
        • schema changes are also monitored through dataset versioning, which is tied to the job version
  4. Discussion

    1. Howard: issue 2048
      1. There is an edge case (using a custom extractor) where the TaskMetadata's given input or output dataset would NOT have the fields populated (`dataset.fields = []`).
      2. Having this type of metadata makes Marquez overwrite the existing version of the dataset with empty fields
      3. Proposal: Marquez should try to reuse the dataset instead of rewriting 
    2. Agreed; question remains about how to do it [Willy]
      1. behavior reflects versioning logic
      2. possible solution: use `null` value in OL spec rather than empty array
      3. challenge: we want to avoid making assumptions
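
The graph review and the issue 2048 discussion above can be sketched in a few lines of Python. This is an illustration only, not Marquez's implementation: the event shape is a simplified stand-in for real OpenLineage events, the class and method names are hypothetical, and the "reuse the existing version when fields are empty" rule is one possible reading of Howard's proposal rather than an agreed design.

```python
from collections import defaultdict


class CurrentLineageGraph:
    """Toy bipartite graph: job nodes connect only to dataset nodes."""

    def __init__(self):
        self.jobs = {}  # job name -> {"inputs": [...], "outputs": [...]}
        self.dataset_versions = defaultdict(list)  # dataset name -> list of field lists

    def apply(self, event):
        # Per the Q & A, versioning logic is applied only once a run completes.
        if event["eventType"] != "COMPLETE":
            return
        inputs = [d["name"] for d in event.get("inputs", [])]
        outputs = [d["name"] for d in event.get("outputs", [])]
        # The latest completed run defines the job's current node and edges.
        self.jobs[event["job"]] = {"inputs": inputs, "outputs": outputs}
        for dataset in event.get("outputs", []):
            self._version_dataset(dataset["name"], dataset.get("fields"))

    def _version_dataset(self, name, fields):
        versions = self.dataset_versions[name]
        if not fields and versions:
            # Assumed reading of the proposal: empty or missing fields
            # reuse the existing version instead of overwriting it.
            return
        fields = list(fields or [])
        if not versions or versions[-1] != fields:
            versions.append(fields)

    def edges(self):
        edges = []
        for job, io in self.jobs.items():
            edges += [(name, job) for name in io["inputs"]]
            edges += [(job, name) for name in io["outputs"]]
        return edges


graph = CurrentLineageGraph()
graph.apply({"eventType": "START", "job": "load_orders", "inputs": [], "outputs": []})
graph.apply({
    "eventType": "COMPLETE", "job": "load_orders",
    "inputs": [{"name": "raw.orders"}],
    "outputs": [{"name": "public.orders", "fields": ["id", "total"]}],
})
# A second run from an extractor that reports no fields (the edge case).
graph.apply({
    "eventType": "COMPLETE", "job": "load_orders",
    "inputs": [{"name": "raw.orders"}],
    "outputs": [{"name": "public.orders", "fields": []}],
})
print(graph.edges())
print(dict(graph.dataset_versions))
```

Treating an empty list and a missing value the same way is exactly the kind of assumption Willy flagged; distinguishing `null` from `[]` in the spec would let the two cases be handled separately.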

June 23, 2022

Attendees:

  • TSC
    • Willy Lulciuc, Co-creator of Marquez
    • Julien Le Dem, Chief Architect, Astronomer
  • And
    • Martin Fiser, Head of Professional Services, Keboola
    • Michael Robinson, Software Engineer, Dev. Rel., Astronomer
    • Minkyu Park, Senior Engineer, Astronomer
    • John Thomas, Support Engineer, Astronomer
    • Naga Raghavarapu, Principal Software Engineer, Oracle
    • Ross Turk, Senior Director of Community, Astronomer

Agenda:

  • Announcements
  • Recent release: 0.23.0
  • User story by Martin Fiser (Keboola)
  • Open discussion

Meeting:

http://youtube.com/watch?v=vdA8xOhFZCg

Notes:

  • Announcements [Willy]

    • Mqz/OL swag is still available!
    • Willy gave a talk on Mqz at Open Source Summit (LinuxCon)
  • Recent Release 0.23.0 [Michael R.]

  • Keboola Use Case [Martin]

    • Topic: OL integration with the Keboola platform
      • Overview of platform
        • modern data experience: data stack as a service
        • all-in-one service
        • writers/reverse ETL through component framework
        • enables version control, governance, etc., in workspaces
        • much metadata produced and collected, permitting visibility across entire pipeline
          • pipeline jobs
          • storage events
          • data loads/unloads
          • user-generated metadata
      • Purpose of OL integration
        • data governance to support users' feeding data to external tools
        • OL a "language" for speaking to various tools
        • offer API for OL information
        • native Keboola component
          • feeds OL information to an endpoint (e.g., Marquez)
          • can be orchestrated on customizable interval 
          • supports SSH
          • exports full job information to the endpoint
      • Demo
        • users have multiple projects on the platform
        • a few hundred components are offered to users out of the box (e.g., Google Drive, SQL, Python, Google Sheets)
        • metadata manually pushable to OpenLineage endpoint
        • orchestrator could benefit from parent/job support
      • Challenges
        • need: richer metadata 
          • component config
          • info about tables
        • lighter UI
          • reflects feedback about legibility
          • icon customizability
        • namespaces
          • connectivity between projects
        • more integrations
        • rounded logo
      • Q & A
        • Are you interested in contributing? [Julien]
          • would like to; possibly in the future
        • Would you like to open issues? (custom facets, UI) [Willy]
          • not currently able to
        • Are you using any integrations (Java or Python)? [Willy]
          • component can be anything in the docker container
          • multiple languages used in development
        • Customers using it already? [Conor]
          • some testing is going on
          • not in production yet
          • no plans to offer Marquez to customers
        • Does it work for every connector? [Conor]
          • each will produce at least a job
        • Auth model [Willy]
          • problem: slippery slope [Martin]
          • recommended at ingress level [Willy]
          • not a focus at the moment
          • contributions to related issues welcome
        • Is data discovery offered? [Naga]
          • built in with API 
          • additional tools can be added if integration would be seamless

May 26, 2022

Attendees:

TSC:

  • Willy Lulciuc, Co-creator of Marquez
  • Peter Hicks, Senior Engineer, Astronomer

And:

  • Ross Turk, Senior Director of Community, Astronomer
  • Minkyu Park, Senior Engineer, Astronomer
  • John Thomas, Support Engineer, Astronomer
  • Michael Robinson, Developer Relations Engineer, Astronomer
  • Joshua Wankowski, Associate Data Engineer, Northwestern Mutual
  • Sam Holmberg, Software Engineer, Astronomer
  • Dako Dakov, R&D Manager, VMware
  • Agita Jaunzeme, Community Manager, VMware
  • Radmila Radovanvic, Senior Data Engineer, Northwestern Mutual
  • Gage Russell, Data Engineer, Q2
  • Rae Green, Developer, Q2ebanking
  • Dimira Petrova, Supervisor of Data Analytics, VMware
  • Martin Fiser, Head of Professional Services, Keboola
  • Naga Raghavarapu, Principal Software Engineer, Oracle
  • Antoni Ivanov, Staff Engineer, VMware

Agenda:

  • Announcements
  • Use cases from Northwestern Mutual and VMware
  • New feature: linking job runs and datasets

Meeting:

Notes:

  • Announcements [Willy]

  • Northwestern Mutual Use Case [Joshua]

    • Big-picture role of Mqz at NWM
      • Mqz used to track data usage as a whole
      • Mqz critical at NWM to data ops, has special future here
    • Company background
      • Massive insurance co. with investment management arm
      • 150+ year history with many customer touch points
      • Massive data with lots of users
    • Rationale for adoption
      • OL is where I spend most of my time
      • These tools will be the industry standards for dataset usage going forward
      • We desired one data standard, not random internal standards
    • Breakdown of use case
      • We track the HOW of usage from initial consumption to end usage
      • We record data product usage over time
      • Bonus: improved security
        • can see how/which users are actually using data
        • allows comparison to security frameworks, double-checking of work
      • Visualization is key
        • helps in building reports and modeling huge data systems
        • we can check the entire platform stack from ingest to updates, normalization, end-usage
    • Personal perspective
      • Mqz is data ops for data processing
      • Will we have a data ops center in the future like we have currently with NOCs?
      • The visual language is the key strength of the tool
      • This is the future of data
    • Q & A
      • Are screenshots available? Do you use Spark? [Naga]
        • Can't share due to proprietary concerns
      • How much data? [Naga]
        • Can't be specific, but it's a lot!
      • It's exciting to see others excited about the project. Are you using any custom integrations? [Willy]
        • Yes, custom integrations support streaming and ingestions across the platform
  • VMware use case [Antoni]

    • Demo of VDK
    • Our motivation
      • Verification problems
      • OL/Mqz was the solution
      • The common standard provided by OL is essential
    • Why Mqz?
      • It's helpful in debugging complex jobs, troubleshooting
      • It's key to understanding usage for maintenance – e.g., enabling removal of irrelevant datasets, jobs
      • The shared metadata is useful
    • Diagram of architecture
    • Code demo
    • Suggestions
      • Add visualization of parent/child relationships [note: see PR 1935]
      • Make output searchable by metadata (e.g., make it possible to find all late jobs)
    • Our stack
      • Postgres, Presto, Snowflake, Greenplum db, Trino
    • Q & A
      • How many integrations in use? [Gage]
        • 100 teams, 1000s of tables
      • Are you using the Python client? [Willy]
        • Yes
      • It's amazing to get this feedback [Willy]
      • The grouping of jobs is hard, but we're addressing this
      • Feel free to open issues and contribute
  • New feature linking job runs to datasets [Peter]
    • Recently added to jobs: created_by available on dataset views
    • Dataset versions also now available on version history tab
      • Allows for historical introspection in case of an issue
        • Allows for seeing if the code changed, for example
  • Open discussion
    • Is anyone using the Python client for OL? [Gage]
      • Based on today's discussion, the answer is yes
    • Projects, docs are coming [Willy]
      • You can also use the Airflow integration for insight into the Python client
    • Column-level lineage has been added to OL [Willy]
      • We worked with Microsoft on the spec
      • Look for this in the API in the next few months
      • Feedback on this appreciated
    • What's in the roadmap for multi-tenancy? How can this be used in Mqz? [Naga]
      • For every event, route it through Kafka – we're working with a company to help us document this a bit more [Willy]
      • Alternate approach: use a namespace to add metadata
      • Issue with this: access control (see the project roadmap for more info) 

April 28, 2022

Attendees:

TSC:

  • Willy Lulciuc, Co-creator of Marquez
  • Michael Collado, Staff Software Engineer, Astronomer
  • Julien Le Dem, Chief Architect, Astronomer

And:

  • Ross Turk, Senior Director of Community, Astronomer
  • Minkyu Park, Senior Engineer, Astronomer
  • John Thomas, Support Engineer, Astronomer
  • Michael Robinson, Developer Relations Engineer, Astronomer
  • Gage Russell, Data Engineer, Q2
  • Paweł Leszczyński, Data Engineer, GetInData
  • Joshua Wankowski, Associate Data Engineer, Northwestern Mutual
  • Dillon Stadther

Agenda:

  • 0.22.0 preview [Willy]
  • lifecycleStateChange support [Pawel]
  • Updates to job renaming and symlinking [Michael C.]

Meeting:

https://www.youtube.com/watch?v=s49wnzS6xLE

Notes:

  • Announcements [Willy]:

  • 0.22.0 Preview [Willy]:

    • lifecycleStateChange support will offer visibility into dataset lifecycle changes, including deleting of tables
    • Pawel:
      • change motivated by desire for more information about datasets
      • approach started out with the Spark integration
      • still more information about lifecycle changes is possible/desirable
      • additional feature idea: notification console friendly to backend developers
    • Additional possibility: grayed out nodes on graph for deleted datasets, logging to show lifecycle history
    • Pawel: panel on website could display changes to dataset over X days
      • Agreed. Create an issue and we can build on that idea.
    • Helm chart addition
      • allows annotations, e.g. Prometheus metrics
    • Support for renaming and redirection
      • introducing job hierarchy
      • symlink will permit visibility into name changes to datasets
  • Updates to job renaming and symlinking [Michael C.]

    • stemmed from desire to tie linked jobs together, e.g., jobs called by DAGs, even in cases where identical code is part of different chains
    • challenge: linking old jobs to fully qualified version
    • motivating factor: changes to job names result in junk nodes on the graph
    • there was no way to remove the old job names from the graph
    • but there is frequently a need to keep track of old job names
    • hence the idea of symlinking a job
    • currently there's no API for this, so updating must be done manually
      • add the UUID of the new job to the db
      • from that point on, the job history will redirect to the new job (with a 301)
    • future: API will make this possible programmatically
    • Willy: is documentation needed for this?
      • Yes, I will post a change to the README
      • We want to do the same thing for datasets
  • Open discussion
    • Gage: is a Helm repo coming?
      • Willy: Minkyu has looked into this
      • Willy: we want to add the Helm chart to the new website
      • Willy: this is on our radar
    • New release coming soon!
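
The job symlinking described by Michael C. above can be illustrated with a small Python sketch. It is not Marquez's code: the `Job` class, the `symlink_target_uuid` attribute name, and the `lookup` helper are all hypothetical, and the status codes only mimic the redirect behavior mentioned in the notes.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple
from uuid import uuid4


@dataclass
class Job:
    uuid: str
    name: str
    symlink_target_uuid: Optional[str] = None  # hypothetical column name


def lookup(jobs: Dict[str, Job], name: str) -> Tuple[int, str]:
    """Return an HTTP-like status and the job name that should be served."""
    by_uuid = {job.uuid: job for job in jobs.values()}
    job = jobs[name]
    if job.symlink_target_uuid is None:
        return 200, job.name
    # Follow the chain so that repeated renames resolve to the newest job.
    seen = {job.uuid}
    while job.symlink_target_uuid is not None:
        job = by_uuid[job.symlink_target_uuid]
        if job.uuid in seen:
            raise ValueError("symlink cycle detected")
        seen.add(job.uuid)
    return 301, job.name


old = Job(str(uuid4()), "daily_etl")
new = Job(str(uuid4()), "daily_dag.daily_etl")
jobs = {old.name: old, new.name: new}

# The manual step from the notes: record the new job's UUID on the old job.
old.symlink_target_uuid = new.uuid

print(lookup(jobs, "daily_etl"))
print(lookup(jobs, "daily_dag.daily_etl"))
```

Keeping the old job row and pointing it at the new one preserves the old name for history while removing the need for a separate junk node on the graph.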

March 31, 2022

Attendees:

TSC:

  • Willy Lulciuc, Co-creator of Marquez
  • Michael Collado, Staff Engineer, Astronomer
  • Julien Le Dem, Chief Architect, Astronomer
  • Peter Hicks, Senior Engineer, Astronomer

And:

  • Ross Turk, Sr. Director of Community, Astronomer
  • Minkyu Park, Senior Engineer, Astronomer
  • John Thomas, Support Engineer, Astronomer
  • Michael Robinson, Developer Relations Engineer, Astronomer
  • Howard Yoo, Staff Product Manager, Astronomer

Agenda:

  • Website update
  • Backlog and roadmap discussion
  • Open discussion

Meeting:

Slides

https://www.youtube.com/watch?v=WfPOKQxs68U

Notes:

Announcements [Michael R.]

  • Marquez stickers are now available: https://www.astronomer.io/datakin-swag
  • Willy and Julien gave a talk on OpenLineage, Airflow and Marquez at Data Council Austin on March 23
  • The project's GitHub star count stands at 983. Have you starred the project yet?
  • 1k stars are a requirement for graduation status from LF AI. The project is nearing completion of all requirements, so formal application will be possible soon.

Website [Ross]

  • The project now has a new website.
  • Appropriately, it's an open-source project; PRs are welcome.
  • Tech: Gatsby, GitHub Projects
  • Dev: run `yarn deploy` to work on it
  • Plans: blog page. Proposals for posts welcome – post them in Slack or open a PR if you prefer.

Backlog and roadmap [Willy]

  • Issue: currently, PRs are driven by a small team (e.g., Peter's view for dataset versions, Pawel's lifecycle PR)
  • How to get the broader community involved? Want people to have more input/control over the issues we take up.
  • Solution: GitHub's Roadmap feature. Milestones and releases visible there. Choose Marquez on the Projects tab.
  • Process: review issues on monthly basis, move to roadmap, then release.
  • Question from Howard about how to propose new features
  • Follow-up work: discussion of how to prioritize issues; documentation needed about how to label new issues (e.g., as "features")
  • Comment from Michael C.: it's possible to add new columns to the roadmap, in addition to new issues.

Open discussion

  • Michael C.: please note issue #1928: supporting job grouping and hierarchy.
    • Problem: the project does not track parent/child job relationships, despite this nomenclature being used in OpenLineage to describe related jobs.
    • Proposal: a `parent_job_id` column should be added to the jobs table and to the runs table, both of type UUID.
  • Michael R.: please note that the meeting typically takes place on the 4th Thursday of each month.
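
The proposal in issue #1928 can be pictured with a short Python sketch. It uses an in-memory SQLite database as a stand-in for the project's database, stores UUIDs as text, and reduces both tables to a few columns; apart from `parent_job_id`, the table layouts and helper names are assumptions made for illustration.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE jobs (
        uuid          TEXT PRIMARY KEY,
        name          TEXT NOT NULL,
        parent_job_id TEXT REFERENCES jobs (uuid)  -- proposed column
    );
    CREATE TABLE runs (
        uuid          TEXT PRIMARY KEY,
        job_uuid      TEXT NOT NULL REFERENCES jobs (uuid),
        parent_job_id TEXT REFERENCES jobs (uuid)  -- proposed column
    );
""")


def add_job(name, parent=None):
    """Insert a job row and return its generated UUID."""
    job_id = str(uuid.uuid4())
    conn.execute("INSERT INTO jobs VALUES (?, ?, ?)", (job_id, name, parent))
    return job_id


def add_run(job_id, parent=None):
    """Insert a run row for a job and return the run's UUID."""
    run_id = str(uuid.uuid4())
    conn.execute("INSERT INTO runs VALUES (?, ?, ?)", (run_id, job_id, parent))
    return run_id


dag = add_job("daily_dag")
extract = add_job("daily_dag.extract", parent=dag)
add_job("daily_dag.load", parent=dag)
add_run(extract, parent=dag)

# With the new column, a parent's child jobs are a single lookup away.
children = [row[0] for row in conn.execute(
    "SELECT name FROM jobs WHERE parent_job_id = ? ORDER BY name", (dag,))]
print(children)
```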

February 24, 2022

Attendees:

TSC:

  • Willy Lulciuc, Co-creator of Marquez
  • Michael Collado, Staff Engineer, Datakin

And:

  • Minkyu Park, Senior Engineer, Datakin
  • Michael Robinson, Developer Relations Engineer, Datakin
  • Ross Turk, VP of Marketing, Datakin

Agenda:

  • Review of integrations to create runs and associate metadata with runs (replaced with OpenLineage)
  • Demo: How to collect OpenLineage events with the lineage API to send metadata to Marquez
  • Demo: OL Java client
  • Dataset lifecycle management
  • Open discussion

Meeting:

Slides

http://youtube.com/watch?v=5_rrEfatULE

Notes:

  • Announcements [Willy]

    • Release date of 0.21.0 is now 2/28
    • Confusion in the community about which Java client to use is being addressed in OpenLineage PR #480
      • We hope to have this merged for the next OL release
  • Integrations and OL demo [Willy]

    • OL integration
      • Available at openlineage.io/integration/, where you can also find instructions for installing and configuring it
      • `requirements.txt` needs to install Airflow
      • Set OpenLineage URL to local instance of Marquez
      • Marquez is moving towards using a task listener to pull metadata in real time 
      • For now use the OL Airflow DAG
      • You can still use the OL backend; there are limitations there, however
    • Spark integration
      • When running the `spark-submit` command, you need to provide configuration: specify the extra listener (thanks to Michael C. for his work on this)
      • Point the host to your deployment
      • See the OL website for more details (openlineage.io/integration/spark-spark)
    • Upcoming: Flink and Kafka
    • Your feedback on these integrations appreciated
    • There are many connections you can use in your platform by switching over to OL to collect metadata
  • OL Java client demo [Willy]

  • Dataset lifecycle management [Willy]

    • Marquez can now capture changes to dataset names
    • Community voiced desire for this feature
    • Marquez now supports soft deletes of datasets
    • See PR #1847
    • Support of lifecycle now more concrete: can see the phases datasets go through
  • Open discussion

    • Julien and Willy will be speaking in-person at the Data Council conference in Austin next month (March 23-24)
    • Michael C. will be presenting virtually at the Subsurface LIVE conference (March 2-3); topic: Spark 
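
As a rough companion to the lineage API demo listed in the agenda, the following Python sketch builds a minimal OpenLineage-style run event and posts it to Marquez. It assumes a local Marquez instance reachable at `http://localhost:5000` and the `/api/v1/lineage` endpoint; the helper names, the namespace, the job and dataset names, and the producer URL are hypothetical. The demo itself may have used different tooling.

```python
import json
import urllib.request
import uuid
from datetime import datetime, timezone

MARQUEZ_URL = "http://localhost:5000"  # assumption: local Marquez instance


def build_event(event_type, run_id, namespace, job_name, inputs=(), outputs=()):
    """Build a minimal OpenLineage-style run event as a plain dict."""
    return {
        "eventType": event_type,
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": run_id},
        "job": {"namespace": namespace, "name": job_name},
        "inputs": [{"namespace": namespace, "name": n} for n in inputs],
        "outputs": [{"namespace": namespace, "name": n} for n in outputs],
        "producer": "https://example.com/hypothetical-producer",
    }


def send_event(event, base_url=MARQUEZ_URL):
    """POST one event to the lineage endpoint and return the HTTP status."""
    request = urllib.request.Request(
        f"{base_url}/api/v1/lineage",
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status


run_id = str(uuid.uuid4())
start = build_event("START", run_id, "demo", "load_orders", inputs=["raw.orders"])
complete = build_event("COMPLETE", run_id, "demo", "load_orders",
                       inputs=["raw.orders"], outputs=["public.orders"])
print(json.dumps(complete, indent=2))
```

With an instance running, calling `send_event(start)` followed by `send_event(complete)` would submit the pair; both events share one `runId` so they are treated as the same run.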

January 27, 2022

Attendees:

TSC:

  • Willy Lulciuc, Co-creator of Marquez
  • Julien Le Dem, CTO of Datakin
  • Michael Collado, Staff Engineer, Datakin
  • Peter Hicks, Senior Engineer, Datakin
  • Kevin Mellott, Assistant Director of Data Engineering, Northwestern Mutual

And:

  • Ross Turk, VP of Marketing, Datakin
  • Minkyu Park, Senior Engineer, Datakin
  • John Thomas, Support Engineer, Datakin
  • Michael Robinson, Developer Relations Engineer, Datakin

Agenda:

  • Marquez recent releases overview [Willy] 
    • Marquez release 0.21.0 overview
      • Upgrade to Java 17
  • Migrating integrations to OpenLineage [Willy]
  • Cloud-based development instance of Marquez via Gitpod [Peter]
  • Open discussion

Meeting:

Slides

http://youtube.com/watch?v=-BVpdDi77sY

Notes:

  • 0.21.0 overview [Willy]

    • Features:
      • Bug fixes
      • Removal of excess code
      • Upgrade to Java 17
        • API image migrated
        • Eclipse Temurin integrated
        • All CI deployments updated to support Java 17
    • Discussion [Kevin, Willy, Michael C.]:
      • Supporting the Java client on a lower Java version is possible
      • Proposed: schedule separate meeting about this
  • Migrating integrations to OpenLineage [Willy]

    • Spark library in Marquez now deprecated
    • Use of OpenLineage Spark integration recommended going forward
      • review the docs about how to configure your instance
      • remember to add underscore to marquez_airflow
    • OpenLineage integration allows task listener
      • workaround: import DAG from OpenLineage
    • See the changelog: environment variables for the Airflow instance have changed
  • Cloud-based development instance of Marquez [Peter]

    • Enabled by integration of Gitpod
    • Docker image in the cloud with Marquez and UI
    • Ideal for those not ready to install everything locally or who are having issues with their OS
    • Fast (30 seconds), eliminates risk
    • API also available
    • Can be made private or public
    • Big advantage: shareable within organizations via URL
    • Supports everything one could do locally in VS Code or similar IDE
    • Discussion [Willy, Peter, Kevin, Julien]:
      • common use case: potential users want to see metadata from their org and share the tool
      • potential side-effect: increase in Docker pulls
      • availability of metrics unknown
      • email address required
  • Open Discussion

    • Advantages of a possible move from CircleCI to GitHub Actions
      • CircleCI downsides: outages, billing issues [Willy]
      • Julien proposed: moving to GitHub Actions eventually after running both in parallel
      • Kevin asked to experiment with GitHub Actions and report back
    • Issue #1800: add support for table operations reported from OpenLineage
      • Formal solution needed [Willy]
      • Willy proposed: deploy in two modes and use flags (Julien agreed)
    • NodeID
      • An easy win: add a field that returns a nodeID [Willy]
      • Willy proposed: prioritize in next release

Marquez Workflow Group Calendar Overview

Effective March 22, 2019: Group calendars are managed within LF AI Foundation Groups.io subgroups (mail lists); with each sub-group (mail list) having a unique group calendar. Meeting invites from these group calendars are sent to the applicable sub-group (mail list). In order to see the various group calendars you must:

View Instructions on How to Subscribe to LF AI Group Calendars

For detailed information on LF AI meeting management processes view this page: LF AI Foundation - Community Meetings and Calendars



Marquez Meetings List

| Schedule | Title | Owner | Subgroup (mail list) | Purpose | Dial In Link |
| --- | --- | --- | --- | --- | --- |
| Day of Week (frequency) 00:00 AM/PM - 00:00 AM/PM (timezone) | Meeting Title (Zoom Account Used) | Meeting Owner/Moderator | marquez-mail-list@lists.lfai.foundation | Meeting Purpose | Zoom Name: https://zoom.us/... |

Marquez Group Calendar 
