Using Metric Space Indexing for Complete and Efficient Record Linkage
Record linkage is the process of identifying records that refer to the same real-world entities in situations where entity identifiers are unavailable. Records are linked on the basis of similarity between common attributes, with every pair being classified as a link or non-link depending on its similarity. Linkage is usually performed in a three-step process: first, groups of similar candidate records are identified using indexing; then pairs within the same group are compared in more detail; and finally each pair is classified. Even state-of-the-art indexing techniques, such as locality sensitive hashing, have potential drawbacks. They may fail to group together some truly matching records with high similarity, or they may group records with low similarity, leading to high computational overhead. We propose using metric space indexing (MSI) to perform complete linkage, resulting in a parameter-free process that combines indexing, comparison and classification into a single step and delivers complete and efficient record linkage. An evaluation on real-world data from several domains shows that linkage using MSI can yield better quality than current indexing techniques, with similar execution cost, without the need for domain knowledge or trial and error to configure the process.
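As a rough illustration of the idea behind metric space indexing, and not the implementation evaluated in the talk, the sketch below runs a range query over a small set of names using a single pivot. It assumes Levenshtein edit distance as the metric. The class `PivotIndex`, the choice of pivot, the sample records and the distance threshold are all hypothetical, chosen only for this example. Because edit distance obeys the triangle inequality, a record can be discarded without a full comparison only when it provably lies outside the threshold, so no true match within the threshold is missed.

```python
from typing import Callable, List, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings; a true metric."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


class PivotIndex:
    """Hypothetical single-pivot index: stores d(pivot, record) per record."""

    def __init__(self, records: List[str],
                 distance: Callable[[str, str], int], pivot: str) -> None:
        self.distance = distance
        self.pivot = pivot
        self.entries: List[Tuple[int, str]] = [
            (distance(pivot, r), r) for r in records
        ]
        self.comparisons = 0  # full distance computations made by queries

    def range_query(self, query: str, threshold: int) -> List[str]:
        """Return every record within `threshold` of `query`."""
        d_pq = self.distance(self.pivot, query)
        matches = []
        for d_pr, record in self.entries:
            # Triangle inequality: d(query, record) >= |d_pq - d_pr|.
            # If that lower bound exceeds the threshold, skip safely.
            if abs(d_pq - d_pr) > threshold:
                continue
            self.comparisons += 1
            if self.distance(query, record) <= threshold:
                matches.append(record)
        return matches


# Illustrative data only; not taken from the evaluation.
records = ["jonathan smith", "jonathon smith", "john smyth", "mary jones", "al"]
index = PivotIndex(records, levenshtein, pivot="jonathan smith")
links = index.range_query("jonathan smith", threshold=2)
print(links, "full comparisons:", index.comparisons)
```

Practical metric indexes use many pivots or tree structures to prune far more aggressively; the single pivot here only shows why pruning never loses a match that lies within the threshold.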
Dr Özgür Akgün is a Lecturer in Computer Science at the University of St Andrews, UK. He has worked on constraint programming, covering both applications and automated modelling methods. He has recently started working in the area of record linkage, where he focuses on developing methods for efficiently computing high-quality links within historical population data. Dr Akgün's talk will be on using metric space indexing methods for record linkage.