You are viewing the 2025 edition of FOSDEM.
Data Prep Kit: Open Source Data Engineering for LLMs
UB2.252A (Lameere) | Day 2 | 12:00 - 12:05 | Speakers: Joe Olson
Abstract
Introducing Data Prep Kit (DPK), an open source data engineering framework for LLMs. DPK was developed internally by IBM to assist with the development of its open source Granite family of LLMs and was released in 2024. DPK is built on Kubeflow Pipelines and runs on scalable compute ranging from a developer's laptop to massive clusters. In addition, Kubeflow Pipelines allows the community to collaborate on common LLM data engineering workflow problems, such as determining the licensing state of the data and the state of its GDPR compliance.
Speakers
Joe Olson
