Document interoperability and conversion: it shouldn’t be that hard!
K.3.401 | Day 1 | 14:45 - 15:10 | Speakers: Stephan Meijer, Albert Krewinkel
Abstract
This talk is presented by Stephan Meijer (NL government, NLdoc/La Suite Docs) and Albert Krewinkel, maintainer of Pandoc.
Public administrations hold millions of documents trapped in formats that are hard to reuse and often fail WCAG requirements: PDFs, legacy Word templates, ad-hoc styles. At Logius, with the NLdoc project, we were tasked with turning those documents into accessible, reusable HTML and other open formats. Our first instinct was the obvious one: use Pandoc and wrap it with some pre- and post-processing. It worked… until it didn’t. Every new edge case, every new target editor, every new accessibility rule meant more custom glue code and brittle filters.
So we flipped the problem: instead of chaining converters, we designed a JSON-based document Abstract Syntax Tree (AST) with an OpenAPI specification and built dedicated conversion services around it. That AST now sits at the centre of a small ecosystem: PDFs and DOCX files are converted into the AST, and from there into editors such as Tiptap and BlockNote, or directly into formats such as HTML. Support for ODT, Markdown and EPUB is on the way.
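To make the idea of a JSON document AST concrete, here is a minimal sketch of rendering such a tree to HTML. The node types and field names (`document`, `heading`, `paragraph`, `text`, `level`, `children`) are assumptions chosen for illustration and are not taken from the NLdoc specification, which may be structured quite differently.

```python
import json
from html import escape


def render_html(node: dict) -> str:
    """Render a hypothetical JSON document AST node to an HTML string."""
    kind = node["type"]
    if kind == "text":
        # Escape text so characters such as & and < stay valid HTML.
        return escape(node["text"])

    # Every other node type is a container: render its children first.
    children = "".join(render_html(child) for child in node.get("children", []))

    if kind == "document":
        return children
    if kind == "heading":
        # Clamp to h1..h6 so the output is always a valid heading element.
        level = min(max(int(node.get("level", 1)), 1), 6)
        return f"<h{level}>{children}</h{level}>"
    if kind == "paragraph":
        return f"<p>{children}</p>"
    raise ValueError(f"unsupported node type: {kind}")


# Illustrative input only; the field names are assumptions, not the real spec.
doc = json.loads("""
{
  "type": "document",
  "children": [
    {"type": "heading", "level": 1,
     "children": [{"type": "text", "text": "Annual report"}]},
    {"type": "paragraph",
     "children": [{"type": "text", "text": "Costs & benefits"}]}
  ]
}
""")

html_output = render_html(doc)
```

Because every source format is converted to the same tree, a target such as HTML needs only one renderer like this, rather than one converter per source and target pair.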
The same AST also powers the NLdoc Tiptap-based editor, where authors get real-time accessibility validation and can export to accessible formats. It also powers the import functionality in La Suite (Docs), the FR–DE–NL sovereign collaboration stack.
In this talk we’ll walk through that journey: why "just use Pandoc" wasn’t enough, what our AST looks like, how we wired it into a queue-based microservice architecture, and how this approach turns document conversion from a one-off migration hack into an interoperability layer for accessible, sovereign collaboration tools.
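As a rough illustration of a queue-based conversion pipeline, the sketch below chains two stages with in-process queues. It rests on several assumptions: the abstract does not say which message broker or service boundaries NLdoc uses, so Python's `queue.Queue` stands in for the broker, and the stage functions `docx_to_ast` and `ast_to_html` are hypothetical placeholders rather than real converters.

```python
import queue
import threading


def stage(inbox: queue.Queue, outbox: queue.Queue, convert) -> None:
    """Consume jobs from inbox, apply convert, and pass results to outbox."""
    while True:
        job = inbox.get()
        if job is None:
            # Sentinel: forward the shutdown signal to the next stage and stop.
            outbox.put(None)
            break
        outbox.put(convert(job))


def docx_to_ast(job: dict) -> dict:
    # Hypothetical placeholder: a real service would parse the DOCX bytes.
    text_node = {"type": "text", "text": job["body"]}
    paragraph = {"type": "paragraph", "children": [text_node]}
    return {"id": job["id"], "ast": {"type": "document", "children": [paragraph]}}


def ast_to_html(job: dict) -> dict:
    # Hypothetical placeholder: handles only the single-paragraph AST above.
    text = job["ast"]["children"][0]["children"][0]["text"]
    return {"id": job["id"], "html": f"<p>{text}</p>"}


def run_pipeline(jobs: list) -> list:
    """Push jobs through both stages and collect the finished results."""
    uploads, asts, results = queue.Queue(), queue.Queue(), queue.Queue()
    workers = [
        threading.Thread(target=stage, args=(uploads, asts, docx_to_ast)),
        threading.Thread(target=stage, args=(asts, results, ast_to_html)),
    ]
    for worker in workers:
        worker.start()
    for job in jobs:
        uploads.put(job)
    uploads.put(None)  # signal that no more jobs will arrive
    for worker in workers:
        worker.join()

    finished = []
    while True:
        item = results.get()
        if item is None:
            break
        finished.append(item)
    return finished
```

Each stage only knows its input and output queue, so a new source format or target editor can be added as another worker without touching the existing ones.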
Recent versions of the document specification are available on the Releases page of its repository.
- The Elixir project is available at github.com/docspec/docspec-ex.
- The import API for La Suite Docs is published at github.com/docspecio/api.
- The La Suite Docs application itself is available at github.com/suitenumerique/docs.
- The Pandoc website (pandoc.org).
- The Pandoc repository (github.com/jgm/pandoc).
Speakers
Stephan Meijer is a software engineer working on NLdoc at Logius (the digital government service of the Dutch Ministry of the Interior) and on La Suite Docs, the joint French–German–Dutch sovereign collaboration stack. His work focuses on document conversion, accessibility, and interoperability between modern web editors and legacy office formats, with a particular interest in turning messy, real-world documents into standards-compliant, reusable content.
Notice: The placeholder video image is licensed under CC BY-SA 4.0. The original image can be found here. Changes made to the image: cropped to a new ratio; part of the image was cut off.
