...

  • Recent releases
  • Announcements
  • Coming soon: simplified job hierarchy in the Spark integration
  • Discussion items
  • Open discussion

Notes:

Summary

    1. We have added a new communication resource, a LinkedIn company page.
    2. We announced a new committer, Damien Hawes, from Booking.com, who has made significant contributions to the project.
    3. Astronomer and Collibra are co-sponsoring a data lineage meetup on March 19th at the Microsoft New England Conference Center.
    4. Members have talks upcoming at Kafka Summit and Data Council.
    5. We discussed upcoming improvements to the job hierarchy in Spark and how this can help answer questions about job scheduling and dependencies.
    6. Damien shared his contributions to the Apache Spark integration, specifically addressing versioning conflicts with Scala.
    7. Eric provided a general update on the interest in and adoption of OpenLineage, particularly in the enterprise space.
    8. Atlan is considering releasing a DAG (Directed Acyclic Graph) instead of a plugin to help users with configuration and troubleshooting.
    9. The next monthly call will be held at a different "location," and participants were encouraged to look out for the updated Zoom link.

Outline

Welcome and Announcements
    - Michael Robinson welcomes everyone to the monthly call of the OpenLineage TSC, which is recorded and archived on the wiki. He mentions that the committer list has one more person than at the last meeting and teases an exciting announcement.
    - Michael Robinson shares a new communication resource, the LinkedIn company page, and asks for quick introductions from the participants.
    - Harel introduces himself as an OpenLineage committer and hints at an interesting workplace announcement for the next meeting.
    - Other participants introduce themselves, including their roles and companies.

Introductions
    - Maciej introduces himself as a software engineer and warns about possible background noise due to copyrighted music.
    - Eric, Suraj, and Damien introduce themselves and express their excitement to be part of the call.

Agenda Overview and New Committer Introduction
    - Michael Robinson outlines the agenda for the call, including announcements, recent releases, and discussion items.
    - Michael Robinson announces a change in the Zoom link and welcomes a new committer, Damien from Booking.com, who has made significant contributions to the project.
    - Harel and Michael Robinson express their gratitude for Damien's contributions and explain how he added support for multiple Spark versions for the integration, which saved a lot of time and effort for the community.

Upcoming Events
    - Michael Robinson announces a data lineage meetup on March 19th at the Microsoft New England Conference Center, co-sponsored by Astronomer and Collibra. More details and a sign-up link are available on Meetup.com.
    - An updated agenda and information about speakers will be provided soon.
    - Michael Robinson highlights an exciting talk at Kafka Summit on March 19th, "OpenLineage for Stream Processing," by Maciej and Paweł. There will also be a data standardization panel moderated by Julien at Data Council on March 27th, with participants to be finalized soon.

Recent Releases and Contributions
    - Michael Robinson shares about the successful first London meetup with speakers from Decathlon, Confluent, and Astronomer. Decathlon's lineage graph was showcased, and more details about their architecture and use case will be shared in the future.
    - OpenLineage 1.8 was released with contributions from Damien, Mattia, Peter, and Natalie.
    - Michael Robinson thanks all contributors and welcomes Mattia's first contributions to the project. OpenLineage 1.8 release notes are available on the GitHub repo and in the docs.
    - Maciej is asked if he would like to share his screen.

UI Feature and Streaming Integration
    - Maciej introduces two topics for the call: the simplified job hierarchy in the Spark integration and the streaming integration work. He discusses the job hierarchy and how it can answer questions about why a job ran at a certain time.
    - He gives an example of a parent job and how it schedules events. He explains that a Spark job can produce multiple events and actions, but they want to simplify this to one event at the start and one at the end, with each action having a parent job.
    - He gives a complex example of a sequence of events for a Spark application and explains that OpenLineage consumers can collapse the information they receive into a simplified view of the application.
    - Maciej explains the new UI feature that allows a top-level view of Spark jobs without distinguishing the internal actions. He also mentions a higher-level execution view that lets users see what is scheduled across the platform.
    - Harel praises the addition and mentions that it helps visualize dependencies and governance, making it easier to answer use cases visually. Maciej adds that the COMPLETE events allow users to know when a Spark job ended.
    - Michael Collado asks how well the feature works in the Databricks environment. Maciej acknowledges it as a great question and says they need to test it more in Databricks, as it is always slightly different from standard Spark.
    - Maciej explains that they wanted a streaming integration with Flink, currently the most popular streaming system. Their initial attempt at a Flink integration relied on copied code that was not very clean, with a lot of reflection and instance checking.
    - They decided to create a workaround to get as much value as they could and to propose an interface that would allow a better integration. Other work intervened in the meantime, but they then discovered that a Flink proposal supporting customized job listeners had been created, which introduced several interfaces.
    - They realized that the needed interface essentially already existed, but it exposed only one piece of information. The problem was that the proposal had already passed, and the listener would have had to know every connector implementation to extract information from it, which is impossible.
    - Maciej explains the limitations of open source connectors and how they affect the integration process. They resolved this by adding a dataset interface that connectors implement, and by making the lineage vertex expose the list of datasets it actually attaches.
    - This breaks the coupling between the connector and the listener, because both depend on an interface that essentially does not change; any changes are forward compatible only.
    - The end result is an interface that resembles OpenLineage but is not named OpenLineage. This made it easier to convince the community to create the interface, since it avoids concerns about depending on a third-party library, and there is a clear one-to-one mapping without breaking anything.
    - Maciej asks if there are any questions.
    - Michael Robinson thanks Maciej for his contribution and acknowledges that he joined the call after work hours. Julien also thanks Maciej for coordinating with the Flink committers on this great collaboration.
    - Eric offers to give a general update at the end of the call.
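The simplified job hierarchy described above can be sketched as a pair of application-level START/COMPLETE events with each internal action linking back to the parent run. This is an illustrative sketch only; the event shapes are simplified stand-ins, not the exact OpenLineage spec payloads.

```python
# Hypothetical sketch of the simplified job hierarchy: one START and one
# COMPLETE event for the Spark application, and internal actions whose
# events carry a parent facet pointing at the application run.
import uuid

def event(event_type, job_name, parent_run=None):
    e = {"eventType": event_type, "job": {"name": job_name}}
    if parent_run:
        # Child events reference the parent run via a facet.
        e["run"] = {"facets": {"parent": {"run": {"runId": parent_run}}}}
    return e

app_run = str(uuid.uuid4())
events = [
    event("START", "spark_app"),                      # single START for the app
    event("START", "spark_app.action_1", app_run),    # internal action begins
    event("COMPLETE", "spark_app.action_1", app_run), # internal action ends
    event("COMPLETE", "spark_app"),                   # single COMPLETE at the end
]

# A consumer can collapse every event that has a parent facet into its
# parent, leaving a one-START/one-COMPLETE view of the application.
top_level = [e for e in events if "run" not in e]
print([e["eventType"] for e in top_level])  # ['START', 'COMPLETE']
```

This is what lets consumers answer "why did this job run at this time": the child actions fold into one parent run with a clear start and end.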
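The dataset-interface decoupling Maciej describes can be sketched as follows: connectors implement a small, stable dataset interface, and the lineage listener consumes only that interface, so it never needs to know each connector's concrete class. The interface and class names here mirror the discussion but are assumptions, and the sketch is in Python rather than the Flink integration's actual JVM code.

```python
# Hypothetical sketch of decoupling connectors from the lineage listener
# via a minimal, forward-compatible dataset interface.
from abc import ABC, abstractmethod

class LineageDataset(ABC):
    """Stable interface a connector's dataset implements."""
    @abstractmethod
    def namespace(self) -> str: ...
    @abstractmethod
    def name(self) -> str: ...

class LineageVertex(ABC):
    """Exposes the list of datasets a vertex actually attaches."""
    @abstractmethod
    def datasets(self) -> list[LineageDataset]: ...

# Connector side: a concrete implementation the listener never sees directly.
class KafkaSourceDataset(LineageDataset):
    def namespace(self) -> str:
        return "kafka://broker:9092"
    def name(self) -> str:
        return "orders"

class KafkaVertex(LineageVertex):
    def datasets(self) -> list[LineageDataset]:
        return [KafkaSourceDataset()]

def listener_collect(vertex: LineageVertex) -> list[str]:
    # The listener depends only on the interface, not on any connector class,
    # so new connectors work without changes on the listener side.
    return [f"{d.namespace()}/{d.name()}" for d in vertex.datasets()]

print(listener_collect(KafkaVertex()))  # ['kafka://broker:9092/orders']
```

The key design point is the direction of the dependency: connectors depend on the small interface, the listener depends on the same interface, and neither depends on the other.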

Open Discussion
    - Michael Robinson moves on to open discussion and asks if anyone has any discussion items.

Update on Spark Integration
    - Damien shares his experience adding Scala 2.13 support to the Apache Spark integration. They deployed the OpenLineage Spark integration into their own internal pipelines and it worked well.
    - However, when they moved to new clusters running a different Scala version, the jobs failed due to conflicting Scala major versions. The reason is that when Java code is compiled, the compiler records the full type signature of each method call, including its return type and its parameter types.
    - When calling a method in Apache Spark, if the same method exists with two different type signatures across Scala versions, the JVM throws a runtime error. The solution is to compile the entire project against the Apache Spark libraries for each Scala version separately.
    - Damien explains how to configure the build to consume the relevant jars and run integration tests for different versions of Spark, with the exception of Spark 2.4, which only supports Scala 2.12. Maciej thanks Damien for his contribution and expresses a desire for faster reviews.
    - Michael Robinson congratulates Damien on becoming a committer and thanks him for his contributions. Eric provides a general update on interest in the Airflow and Spark integrations, with a focus on enterprise adoption and versioning conflicts.
    - They plan to release a DAG instead of a plugin to help with configuration. Michael Robinson concludes the call and announces the next meeting.
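Because of the binary incompatibility Damien describes, the integration artifact must match the cluster's Scala binary version. A minimal sketch of that selection logic, assuming the common JVM convention of a Scala-binary-version suffix on artifact names (the exact coordinates are an assumption, not the project's published names):

```python
# Hypothetical sketch: pick an integration artifact matching the cluster's
# Scala binary version. Spark 2.4 only ships with Scala 2.12, per the call.
def artifact_for(spark_version: str, scala_version: str) -> str:
    # Scala binary compatibility is decided by the major.minor version.
    scala_binary = ".".join(scala_version.split(".")[:2])
    if spark_version.startswith("2.4") and scala_binary != "2.12":
        raise ValueError("Spark 2.4 only supports Scala 2.12")
    return f"openlineage-spark_{scala_binary}"

print(artifact_for("3.5.0", "2.13.8"))   # openlineage-spark_2.13
print(artifact_for("2.4.8", "2.12.15"))  # openlineage-spark_2.12
```

Mixing artifacts across Scala binary versions is exactly what produced the runtime signature errors on the new clusters, so the check fails fast rather than at call time on the JVM.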

January 11, 2024 (10am PT)

...