Document interoperability and conversion: it shouldn’t be that hard!

K.3.401 | Day 1 | 14:45 - 15:10 | Speakers: Stephan Meijer, Albert Krewinkel

Abstract

This talk is presented by Stephan Meijer (NL government, NLdoc/La Suite Docs) and Albert Krewinkel, maintainer of Pandoc.

Public administrations hold millions of documents trapped in formats that are hard to reuse and often fail WCAG requirements: PDFs, legacy Word templates, ad-hoc styles. At Logius, with the NLdoc project, we were tasked with turning those documents into accessible, reusable HTML and other open formats. Our first instinct was the obvious one: use Pandoc and wrap it with some pre- and post-processing. It worked… until it didn’t. Every new edge case, every new target editor, every new accessibility rule meant more custom glue code and brittle filters.

So we flipped the problem: instead of chaining converters, we designed a JSON-based document Abstract Syntax Tree (AST) with an OpenAPI specification and built dedicated conversion services around it. That AST now sits at the centre of a small ecosystem: PDFs and DOCX files are converted into the AST, and from there into editors such as Tiptap and BlockNote, or directly into formats such as HTML. Support for ODT, Markdown and EPUB is on the way.
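As a rough illustration of the idea, and not of the NLdoc specification itself, a JSON document tree and one converter out of it might look like the sketch below. The node types (`document`, `heading`, `paragraph`, `strong`, `text`) and field names are assumptions invented for this example.

```python
import json
from html import escape

# Hypothetical node shapes for illustration only; NOT the NLdoc specification.
doc = {
    "type": "document",
    "children": [
        {"type": "heading", "level": 1,
         "children": [{"type": "text", "value": "Annual report"}]},
        {"type": "paragraph", "children": [
            {"type": "text", "value": "Costs & benefits, "},
            {"type": "strong",
             "children": [{"type": "text", "value": "summarised"}]},
        ]},
    ],
}


def render_html(node):
    """Recursively render one hypothetical AST node to an HTML string."""
    kind = node["type"]
    if kind == "text":
        # Leaf node: escape so that text can never inject markup.
        return escape(node["value"])
    inner = "".join(render_html(child) for child in node.get("children", []))
    if kind == "document":
        return inner
    if kind == "heading":
        return f"<h{node['level']}>{inner}</h{node['level']}>"
    if kind == "paragraph":
        return f"<p>{inner}</p>"
    if kind == "strong":
        return f"<strong>{inner}</strong>"
    raise ValueError(f"unknown node type: {kind}")


# Round-trip through JSON to show the tree is plain, serialisable data.
rendered = render_html(json.loads(json.dumps(doc)))
print(rendered)
```

Because every input format is converted into the same tree, each new output format needs only one renderer of this kind rather than one converter per input/output pair.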

The same AST also powers the NLdoc Tiptap-based editor, where authors get real-time accessibility validation and can export to accessible formats. It also powers the import functionality in La Suite (Docs), the French-German-Dutch (FR–DE–NL) sovereign collaboration stack.
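To give a feel for what validating accessibility on a tree can mean, here is a minimal sketch with two illustrative checks: images without alternative text and heading levels that skip a step. The rules, node shapes and messages are assumptions made for this example; they are not the NLdoc validator or a complete WCAG check.

```python
def check_accessibility(doc):
    """Return a list of findings for a hypothetical document AST."""
    findings = []
    last_level = 0  # level of the most recent heading seen, 0 if none yet

    def visit(node):
        nonlocal last_level
        kind = node.get("type")
        # Illustrative rule 1: every image needs non-empty alternative text.
        if kind == "image" and not node.get("alt", "").strip():
            findings.append("image without alternative text")
        # Illustrative rule 2: headings should not skip levels (h1 then h3).
        if kind == "heading":
            level = node["level"]
            if level > last_level + 1:
                findings.append(
                    f"heading level jumps from {last_level} to {level}"
                )
            last_level = level
        for child in node.get("children", []):
            visit(child)

    visit(doc)
    return findings


# Hypothetical sample document with two problems.
sample = {
    "type": "document",
    "children": [
        {"type": "heading", "level": 1, "children": []},
        {"type": "heading", "level": 3, "children": []},
        {"type": "image", "src": "chart.png", "alt": ""},
    ],
}
findings = check_accessibility(sample)
print(findings)
```

Running checks on the tree rather than on rendered output means the same rules can apply in the editor and in a batch conversion.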

In this talk we’ll walk through that journey: why "just use Pandoc" wasn’t enough, what our AST looks like, how we wired it into a queue-based microservice architecture, and how this approach turns document conversion from a one-off migration hack into an interoperability layer for accessible, sovereign collaboration tools.
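The queue-based wiring can be pictured with a small in-process sketch. It uses Python's standard `queue.Queue` as a stand-in for a real message broker; the stage names, job fields and the trivial "conversion" logic are all hypothetical and say nothing about how the NLdoc services are actually implemented.

```python
import queue
from html import escape

# In-process stand-ins for broker queues; names are invented for this sketch.
to_ast = queue.Queue()    # jobs waiting to be converted into the AST
to_html = queue.Queue()   # AST jobs waiting to be rendered to HTML
results = {}              # finished output, keyed by job id


def source_to_ast_worker():
    """Drain the first queue, producing a hypothetical AST per job."""
    while not to_ast.empty():
        job = to_ast.get()
        ast = {
            "type": "document",
            "children": [{"type": "paragraph", "text": job["payload"]}],
        }
        # Hand the result to the next stage instead of calling it directly.
        to_html.put({"id": job["id"], "ast": ast})


def ast_to_html_worker():
    """Drain the second queue, rendering each AST to an HTML string."""
    while not to_html.empty():
        job = to_html.get()
        body = "".join(
            f"<p>{escape(child['text'])}</p>"
            for child in job["ast"]["children"]
        )
        results[job["id"]] = body


to_ast.put({"id": "job-1", "payload": "Hello & welcome"})
source_to_ast_worker()
ast_to_html_worker()
print(results)
```

Each stage only knows the queue it reads from and the queue it writes to, so a new source or target format can be added as another worker without touching the existing ones.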

Recent versions of the document specification are available on the Releases page of its repository.

