Systematically corrupting data to assess data linkage quality

Ahmad Alsadeeqi, Alasdair J. G. Gray

Research output: Contribution to conference › Abstract

Abstract

Computer algorithms use string matching techniques to assess how likely two historical records are to refer to the same entity. The quality of linkage is unclear without knowing the correct links, i.e. the ground truth. Synthetically generated datasets for which the ground truth is known are helpful, but the data are typically too clean to be representative of historical records. We assess data linkage algorithms under different data quality scenarios, e.g. with errors typical of historical transcriptions. A data corrupting model injects four types of mistakes: character level (e.g. an f misread as an s, as in OCR errors), attribute level (e.g. male changed to female due to an entry error), record level (e.g. missing records), and group-of-records level (e.g. coffee spilt over a page, or parish records lost in a fire). We then evaluate record linkage algorithms over synthetically generated datasets with known ground truth and data corruptions matching a given profile.
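The four corruption levels described in the abstract can be illustrated with a minimal sketch. The following Python snippet is not the authors' implementation; the record fields, character confusions, and helper names are hypothetical examples chosen only to show how corruptions at character, attribute, record, and group-of-records level might be injected into a synthetic dataset.

```python
# Illustrative sketch of a four-level data corruption model.
# All field names, confusion tables, and example records are hypothetical.
import random

# Example character confusions typical of OCR / transcription errors.
OCR_CONFUSIONS = {"f": "s", "l": "1", "o": "0"}


def corrupt_character(value: str, rng: random.Random) -> str:
    """Character level: replace one character with a typical OCR confusion."""
    positions = [i for i, ch in enumerate(value) if ch in OCR_CONFUSIONS]
    if not positions:
        return value
    i = rng.choice(positions)
    return value[:i] + OCR_CONFUSIONS[value[i]] + value[i + 1:]


def corrupt_attribute(record: dict, field: str, alternatives, rng: random.Random) -> dict:
    """Attribute level: swap a whole field value, e.g. 'male' -> 'female'."""
    corrupted = dict(record)
    corrupted[field] = rng.choice([v for v in alternatives if v != record[field]])
    return corrupted


def corrupt_record(records: list, rng: random.Random) -> list:
    """Record level: remove a single record, simulating a missing entry."""
    i = rng.randrange(len(records))
    return records[:i] + records[i + 1:]


def corrupt_group(records: list, group_size: int, rng: random.Random) -> list:
    """Group level: remove a contiguous block, e.g. a damaged or lost page."""
    start = rng.randrange(max(1, len(records) - group_size + 1))
    return records[:start] + records[start + group_size:]


if __name__ == "__main__":
    rng = random.Random(42)
    records = [
        {"name": "ffiona", "sex": "female"},
        {"name": "alfred", "sex": "male"},
        {"name": "wolfe", "sex": "male"},
        {"name": "josef", "sex": "male"},
    ]
    records[0]["name"] = corrupt_character(records[0]["name"], rng)
    records[1] = corrupt_attribute(records[1], "sex", ["male", "female"], rng)
    records = corrupt_record(records, rng)
    records = corrupt_group(records, group_size=2, rng=rng)
    print(records)
```

In a full evaluation, corruptions like these would be applied to a synthetic dataset according to a chosen error profile, and the linkage algorithm's output compared against the known ground truth.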
Original language: English
Publication status: Published - May 2017
Event: The Systematic Linking of Historical Records - University of Guelph, Guelph, Canada
Duration: 11 May 2017 – 13 May 2017
http://recordlink.org/

Workshop

Workshop: The Systematic Linking of Historical Records
Country/Territory: Canada
City: Guelph
Period: 11/05/17 – 13/05/17
Internet address: http://recordlink.org/
