Closed data, open software: building new ways into the French web archives
Day 1 | 12:35 | 00:25 | AW1.126 | Guillaume Levrier
Note: I'm reworking this at the moment, some things won't work.
This presentation aims at presenting a fully open-source pipeline to extract, curate, and explore web archives, a captive data source whose access is restricted. Its purpose is to detail both the technical pipeline and the socio-institutional setting that made it possible to emerge, highlighting the challenges of developing open tools for closed sources.
The French web archives is an institutional repository of data maintained by the French National Library (BnF). It contains more than 2 petabytes of data spanning over close to 30 years, which accounts to more than 50 billion web pages. Access to this data is restricted under the heritage and legal deposit law: academic researchers willing to work on web archives as data are expected to submit research projects that, upon a formal or informal agreement, will enable them to access this data. But what then? Building the methodological means to pursue epistemological goals in that context is particularly challenging. Web archivists do provide toolkits for exploration. Recent initiatives have scaled up the effort to make these sources more accessible. The RESPADON project has successfully managed to build a “captive web archive” capacity into the Hyphe software, and in doing so has opened a new way into developing tools for such data.
In this presentation, we will present a new solution to extensively study, at the qualitative level, specific topics in the full-text indexed collections of the French web archives. Built around the PANDORÆ software, this pipeline has been designed to interrogate the captive data source on site, but also extract relevant metadata in a compliant manner to enable its exploration off-site, while ensuring reproducibility by publishing the code. In doing so, this pipeline provides an up-to-date example of the “one-way mirror” situation of building open tools that are fit to operate on closed data sources.