Systematically corrupting data to assess data linkage quality

Ahmad Alsadeeqi, Alasdair J. G. Gray, Peter Christen, Özgür Akgün, Tom Dalton

Research output: Contribution to conference, Abstract


Various algorithms have been developed to automatically link historical records based on a variety of string matching techniques. These generate an assessment of how likely two records are to refer to the same entity. However, it remains unclear how to assess the quality of the computed linkages, because absolute knowledge of the correct linkage of real historical records (the ground truth) is absent. Synthetically generated datasets, for which the ground truth linkage is known, help with the assessment of linkage algorithms, but the data generated is commonly too clean to be representative of historical records.

We are interested in assessing record linkage algorithms under different data quality scenarios, e.g. with errors typically introduced by a transcription process, or with physical damage such as books being nibbled by mice. We are developing a data corruption model that injects corruptions into datasets based on given corruption methods and probabilities. We have classified the different forms of corruption found in historical records into four types, based on the scope of the corruption's effect. Those types are character level (e.g. OCR corruptions, where an ‘f’ is represented as an ‘s’), attribute level (e.g. a gender swap, where male is changed to female through a false entry), record level (e.g. records missing for reasons such as the loss of a certificate), and group-of-records level (e.g. parish records lost in a fire). This will give us the ability to evaluate record linkage algorithms over synthetically generated datasets with known ground truth and with data corruptions matching a given profile. In this paper, we describe these four types of corruption in detail, with corresponding examples.
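
The four corruption scopes described above can be illustrated with a minimal Python sketch. This is not the authors' model: the function names, the record fields (`id`, `parish`, `surname`, `gender`), the `profile` dictionary of per-level probabilities, and the single ‘f’ to ‘s’ confusion are all assumptions made for illustration only.

```python
import random

# Hypothetical OCR-style confusion table (assumption: only 'f' -> 's').
OCR_CONFUSIONS = {"f": "s"}


def corrupt_character(value, rng):
    """Character level: replace one confusable character in a string."""
    positions = [i for i, ch in enumerate(value) if ch in OCR_CONFUSIONS]
    if not positions:
        return value
    i = rng.choice(positions)
    return value[:i] + OCR_CONFUSIONS[value[i]] + value[i + 1:]


def corrupt_attribute(record):
    """Attribute level: swap the recorded gender (a false entry)."""
    swapped = {"male": "female", "female": "male"}
    out = dict(record)
    out["gender"] = swapped.get(out["gender"], out["gender"])
    return out


def corrupt_dataset(records, profile, seed=0):
    """Apply all four corruption levels with the probabilities in `profile`.

    The `id` field is never altered, so the ground truth linkage of the
    surviving records stays known after corruption.
    """
    rng = random.Random(seed)
    # Group-of-records level: lose every record of a parish (e.g. a fire).
    parishes = sorted({r["parish"] for r in records})
    lost = {p for p in parishes if rng.random() < profile["group"]}
    corrupted = []
    for record in records:
        if record["parish"] in lost:
            continue
        # Record level: an individual record goes missing.
        if rng.random() < profile["record"]:
            continue
        out = dict(record)
        if rng.random() < profile["attribute"]:
            out = corrupt_attribute(out)
        if rng.random() < profile["character"]:
            out["surname"] = corrupt_character(out["surname"], rng)
        corrupted.append(out)
    return corrupted


# Illustrative synthetic records (made up for this sketch).
records = [
    {"id": 1, "parish": "A", "surname": "Crawford", "gender": "male"},
    {"id": 2, "parish": "B", "surname": "Stewart", "gender": "female"},
]
profile = {"group": 0.0, "record": 0.0, "attribute": 0.0, "character": 1.0}
corrupted = corrupt_dataset(records, profile)
```

Because each level has its own probability, a profile can be tuned to mimic, for example, a heavily OCR-damaged but otherwise complete register, and the same seed reproduces the same corrupted dataset for repeated evaluation runs.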
Original language: English
Publication status: Accepted/In press - 2017
Event: The UK Administrative Data Research Network Annual Research Conference 2017 - Edinburgh, United Kingdom
Duration: 1 Jun 2017 to 2 Jun 2017


Conference: The UK Administrative Data Research Network Annual Research Conference 2017
Abbreviated title: ADRN 2017
Country/Territory: United Kingdom