Computer algorithms use string matching techniques to assess how likely two historical records are to refer to the same entity. The quality of linkage is unclear without knowing the correct links, or ground truth. Synthetically generated datasets for which ground truth is known are helpful, but such data are typically too clean to be representative of historical records. We assess data linkage algorithms under different data quality scenarios, e.g. with errors typical of historical transcriptions. A data corrupting model injects four types of mistakes: character level (e.g. an f transcribed as an s, a typical OCR corruption), attribute level (e.g. male recorded as female through a false entry), record level (e.g. missing records), and group-of-records level (e.g. coffee spilt over a page, or parish records lost in a fire). We then evaluate record linkage algorithms over synthetically generated datasets with known ground truth and data corruptions matching a given profile.
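To make the four corruption types concrete, the following is a minimal sketch of such a corrupting model, assuming records are represented as Python dictionaries; every function name, confusion pair, and rate below is illustrative rather than taken from the implementation described above.

```python
import random

# Illustrative OCR confusion pairs (e.g. a long s misread); not the actual confusion table.
OCR_CONFUSIONS = {"f": "s", "l": "1", "O": "0"}

def corrupt_characters(record, field, rate=0.1):
    """Character-level mistake: swap characters according to an OCR confusion table."""
    record[field] = "".join(
        OCR_CONFUSIONS[c] if c in OCR_CONFUSIONS and random.random() < rate else c
        for c in record[field]
    )

def corrupt_attribute(record, field, values, rate=0.05):
    """Attribute-level mistake: replace a whole value, e.g. male changed to female."""
    if random.random() < rate:
        record[field] = random.choice([v for v in values if v != record[field]])

def drop_records(records, rate=0.02):
    """Record-level mistake: individual records go missing."""
    return [r for r in records if random.random() >= rate]

def drop_record_group(records, group_size=25):
    """Group-of-records mistake: a contiguous block is lost, e.g. a damaged page."""
    if len(records) <= group_size:
        return records
    start = random.randrange(len(records) - group_size + 1)
    return records[:start] + records[start + group_size:]

# Hypothetical usage: corrupt a clean synthetic dataset before running linkage.
records = [{"name": "Alfred Smith", "sex": "male"},
           {"name": "Mary Jones", "sex": "female"}]
for r in records:
    corrupt_characters(r, "name", rate=0.2)
    corrupt_attribute(r, "sex", values=["male", "female"], rate=0.05)
records = drop_record_group(drop_records(records), group_size=1)
```

In practice, the corruption rates and the choice of which record groups to drop would be tuned to match the error profile of the historical source being simulated.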
Publication status: Published - May 2017
Event: The Systematic Linking of Historical Records (workshop), University of Guelph, Guelph, Canada, 11 May 2017 to 13 May 2017