Defoe: A spark-based toolbox for analysing digital historical textual data

Rosa Filgueira, Michael Jackson, Anna Roubickova, Amrey Krause, Ruth Ahnert, Tessa Hauswedell, Julianne Nyhan, David Beavan, Timothy Hobson, Mariona Coll Ardanuy, Giovanni Colavizza, James Hetherington, Melissa Terras

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

This work presents defoe, a new scalable and portable digital eScience toolbox that enables historical research. It allows for running text mining queries across large datasets, such as historical newspapers and books, in parallel via Apache Spark. It handles queries against collections that comprise several XML schemas and physical representations. The proposed tool has been successfully evaluated using five different large-scale historical text datasets and two HPC environments, as well as on desktops. Results show that defoe allows researchers to query multiple datasets in parallel from a single command-line interface and in a consistent way, without any HPC environment-specific requirements.

Original language: English
Title of host publication: Proceedings - IEEE 15th International Conference on eScience, eScience 2019
Publisher: IEEE
Pages: 235-242
Number of pages: 8
ISBN (Electronic): 9781728124513
Publication status: Published - Sep 2019
Event: 15th IEEE International Conference on eScience 2019 - San Diego, United States
Duration: 24 Sep 2019 to 27 Sep 2019

Conference

Conference: 15th IEEE International Conference on eScience 2019
Abbreviated title: eScience 2019
Country/Territory: United States
City: San Diego
Period: 24/09/19 to 27/09/19

Keywords

  • Apache Spark
  • Digital tools
  • Distributed queries
  • High-Performance Computing
  • Historical sources
  • Humanities research
  • Text mining
  • XML schemas

ASJC Scopus subject areas

  • Computer Science Applications
  • Software
  • Ecological Modelling
  • Modelling and Simulation
