PriMedLink

Description
Information technology is increasingly used to support healthcare applications and clinical research. Medical data are recorded electronically so that patient care, resource management, advanced treatments, detection and prevention of diseases, risk management, clinical trials, and health surveillance can be carried out more efficiently and effectively. Such digital medical data also support clinical research through the linkage and aggregation of records corresponding to the same patients, matching of similar patients, and statistical analysis of observations gathered from populations of patients.
Because different databases lack unique entity identifiers, personal identifying attributes such as names, addresses, ages, or dates of birth of patients, or medical attributes such as blood pressure, cholesterol level, or body weight, often have to be used for the linkage, matching, or analytics. Increasing concerns about privacy and confidentiality, however, preclude the exchange or sharing of such medical data across different organizations for data aggregation and analysis. Techniques are therefore required to conduct data matching or linkage on masked (encoded) medical data such that no sensitive information is revealed to any party involved in the linkage or to any external party.
Several data masking (encoding) techniques have been developed in the literature (privacy-preserving data mining and privacy-preserving record linkage) to allow matching and linking of masked data without compromising privacy. However, medical data bring their own challenges, which need to be tackled to make Private Medical Data Linkage (PriMedLink) research feasible and practical in real settings. The ultimate goal of this project is to develop open source software that addresses these challenges for PriMedLink, allowing researchers and practitioners to learn and apply techniques for private matching and linking of medical data, and to develop enhanced techniques in this novel and emerging research direction.
License: GPL-3.0
Code Repository: Privacy-Preserving Similar Patient Matching Code
Project ideas
Space/time efficient privacy-preserving medical data representation
Description
A medical datum is a single observation of a patient that generally comprises four elements: 1) the patient in question (generally identified by personal identifying values such as names, addresses, and contact details), 2) the parameter being observed (such as blood pressure, cholesterol level, age, or body weight), 3) the value of this parameter, and 4) the time of the observation (the date and time it was recorded). Medical data are multiple such observations. This includes several different observations made concurrently, observations of the same patient parameter made at several points in time, or both. Therefore, medical data are often longitudinal and of different types, ranging from narrative, textual data to numerical measurements, recorded signals, drawings, images, and videos.
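As a minimal sketch of the four-element structure described above, a single observation could be modelled as a small record type, with longitudinal data obtained by grouping observations per patient and parameter. The class name `Observation`, its field names, the helper `group_longitudinal`, and the sample values are all hypothetical and chosen only for illustration; they do not come from the PriMedLink code.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime
from typing import Union


@dataclass(frozen=True)
class Observation:
    """One medical datum: who, what, which value, and when (illustrative only)."""
    patient_id: str            # element 1: the patient (hypothetical local identifier)
    parameter: str             # element 2: the parameter being observed
    value: Union[float, str]   # element 3: the observed value (numeric or textual)
    observed_at: datetime      # element 4: the time of the observation


def group_longitudinal(observations):
    """Group observations by (patient, parameter) and sort each series by time."""
    series = defaultdict(list)
    for obs in observations:
        series[(obs.patient_id, obs.parameter)].append(obs)
    for points in series.values():
        points.sort(key=lambda o: o.observed_at)
    return dict(series)


# Made-up sample data: two blood pressure readings and one cholesterol reading.
observations = [
    Observation("p001", "systolic_bp", 135.0, datetime(2016, 3, 1, 9, 30)),
    Observation("p001", "systolic_bp", 128.0, datetime(2016, 1, 12, 14, 0)),
    Observation("p001", "cholesterol", 5.4, datetime(2016, 1, 12, 14, 0)),
]
series = group_longitudinal(observations)
for (patient, parameter), points in series.items():
    print(patient, parameter, [p.value for p in points])
```

A flat list of such records is simple but not necessarily space efficient; finding compact and maskable alternatives is exactly the open question this project idea poses.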
Representing such complex data efficiently in terms of space and time is a challenging aspect that has been researched over several decades. However, efficient and privacy-preserving (i.e. masked) medical data representation is an interesting research direction that requires more attention for PriMedLink research and applications to enable medical data linkage and matching without compromising patient data privacy.
The aim of this project is to research and develop efficient (in terms of memory space and computational complexities) data structures and masking functions for representing medical data in a privacy-preserving manner that will allow novel forms of linkage, matching, and analysis for PriMedLink.
Benefit for the Student:
Private medical data linkage is an emerging research field that is increasingly required in real health applications. This project gives the student exposure to medical data storage and processing, and to the privacy aspects of medical data linkage, which will help them contribute to applied research in healthcare applications.
Benefit for the Project:
This contributes a baseline for private medical data linkage (PriMedLink) that would have a high impact in the healthcare and research industries.
Requirements/ Prerequisites:
Interested students should have good programming skills (ideally including Python) and background knowledge in algorithms and data structures, data mining, and privacy.
It is an advantage if students have knowledge of medical data storage, data management, analysis, and mining and/or have successfully completed courses on databases, data structures and algorithms, data mining, cryptography, or health informatics.
Mentors
Dinusha Vatsalan, Peter Christen
More Information:
The following materials provide specific background literature on medical data, different data structures used for medical data storage and representation, and different masking functions for privacy-preservation that will be required to conduct the project.
- A taxonomy of privacy-preserving record linkage techniques. Dinusha Vatsalan, Peter Christen, and Vassilios S. Verykios, Elsevier Journal of Information Systems 2013, (http://www.sciencedirect.com/science/article/pii/S...)
- Data driven analytics in Healthcare: Problems, Challenges, and Future Directions (Fei Wang, ACM CIKM 2014, https://sites.google.com/site/feiwang03/cikm14-tut...)
- Medical Data: Their acquisition, storage, and use. Edward H. Shortliffe and G. Octo Barnett, ACM Medical Informatics: Computer Applications in Healthcare, (http://dl.acm.org/citation.cfm?id=87788)
- Standardized vectorial representation of medical data in patient records. Wolfgang Orthuber and Efthymios Papavramidis, Medical and Care Compunetics 2010, (http://www.orthuber.com/wICMCC2010.pdf)
Private medical data comparison functions for similar patient matching
Description
Privacy-preserving similar patient matching (PPSPM) is a core component of PriMedLink. Identifying patients with similar characteristics or conditions is required in several healthcare applications such as clinical trials, inpatient bed management, and advanced or personalized treatment. Due to privacy and confidentiality concerns, similar patient matching needs to be conducted using masked (encoded) records.
Bloom filter based encoding is an efficient data masking technique that has been widely used in the literature because it allows approximate matching of attribute values (i.e. errors and variations in the attribute values are tolerated when matching) while preserving privacy.
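The idea can be sketched as follows, assuming the common approach of hashing the character q-grams of a string into a bit array and comparing two bit arrays with the Dice coefficient. This is a simplified illustration, not the project's implementation: the function names, the shared secret key, the filter length of 1000 bits, the 20 hash functions, and the use of HMAC with double hashing are all assumptions made for this example.

```python
import hashlib
import hmac


def qgrams(value, q=2):
    """Split a string into its set of padded character q-grams."""
    pad = "_" * (q - 1)
    padded = pad + value.lower() + pad
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def keyed_hash(secret_key, token, digestmod):
    """Keyed hash (HMAC) of a token, returned as an integer."""
    digest = hmac.new(secret_key, token.encode("utf-8"), digestmod).digest()
    return int.from_bytes(digest, "big")


def bloom_encode(value, secret_key, length=1000, num_hashes=20, q=2):
    """Return the set of bit positions that are 1 in the value's Bloom filter."""
    positions = set()
    for gram in qgrams(value, q):
        h1 = keyed_hash(secret_key, gram, hashlib.sha1)
        h2 = keyed_hash(secret_key, gram, hashlib.sha256)
        for i in range(num_hashes):
            # Double hashing: derive num_hashes positions from two base hashes.
            positions.add((h1 + i * h2) % length)
    return positions


def dice_similarity(bits_a, bits_b):
    """Dice coefficient of two Bloom filters given as sets of 1-bit positions."""
    if not bits_a and not bits_b:
        return 1.0
    return 2.0 * len(bits_a & bits_b) / (len(bits_a) + len(bits_b))


key = b"shared-secret"  # hypothetical key agreed on by the data owners
smith = bloom_encode("smith", key)
smyth = bloom_encode("smyth", key)
jones = bloom_encode("jones", key)
print(dice_similarity(smith, smyth))  # spelling variant: high similarity
print(dice_similarity(smith, jones))  # different name: low similarity
```

Because "smith" and "smyth" share most of their q-grams, their Bloom filters share most of their 1-bits, so the masked values can still be matched approximately without exchanging the names themselves.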
However, all existing Bloom filter encoding based privacy-preserving record linkage (PPRL) techniques support approximate matching only for string or categorical data types. Since matching of other data types such as integer, float, date, time, scan image, textual data, medical reports, and geographical data is commonly required in PriMedLink, developing approximate matching techniques for these data types using Bloom filter based encoding is an important research direction.
We have recently conducted some initial work on approximate matching of Bloom filter encoded integer, float, and modulus data types (published in the Journal of Biomedical Informatics). The aim of this project is to research and implement advanced techniques for approximate matching of other types of medical data (such as textual data, image data, and geographical data) that have been masked using Bloom filter based encoding.
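One way to picture approximate matching of masked numerical values is to hash a value together with its neighbouring values, so that two nearby numbers share many hashed tokens. The sketch below illustrates that general idea only and should not be read as the method of the published work; the function names, the `step` and `num_neighbours` parameters, and the filter settings are assumptions made for this example.

```python
import hashlib
import hmac


def neighbour_tokens(value, step=1.0, num_neighbours=5):
    """Snap a number to a grid of width `step` and list its neighbours as tokens."""
    base = round(value / step)
    return {str(base + k) for k in range(-num_neighbours, num_neighbours + 1)}


def bloom_encode_tokens(tokens, secret_key, length=1000, num_hashes=10):
    """Hash each token num_hashes times; return the set of 1-bit positions."""
    positions = set()
    for token in tokens:
        for i in range(num_hashes):
            message = f"{i}:{token}".encode("utf-8")
            digest = hmac.new(secret_key, message, hashlib.sha256).digest()
            positions.add(int.from_bytes(digest, "big") % length)
    return positions


def dice_similarity(bits_a, bits_b):
    """Dice coefficient of two Bloom filters given as sets of 1-bit positions."""
    if not bits_a and not bits_b:
        return 1.0
    return 2.0 * len(bits_a & bits_b) / (len(bits_a) + len(bits_b))


def masked_numeric_similarity(a, b, secret_key, step=1.0, num_neighbours=5):
    """Compare two numbers using only their Bloom filter encodings."""
    bits_a = bloom_encode_tokens(neighbour_tokens(a, step, num_neighbours), secret_key)
    bits_b = bloom_encode_tokens(neighbour_tokens(b, step, num_neighbours), secret_key)
    return dice_similarity(bits_a, bits_b)


key = b"shared-secret"  # hypothetical key agreed on by the data owners
close = masked_numeric_similarity(120, 122, key)  # e.g. two similar blood pressures
far = masked_numeric_similarity(120, 160, key)    # two dissimilar blood pressures
print(close, far)
```

In this sketch, `step` sets the resolution at which values are compared and `num_neighbours` sets how far apart two values can be while still overlapping; extending such an encoding to text, images, or geographical data is the open part of the project.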
Benefit for the Student:
Privacy-preserving techniques are an evolving and challenging research topic because of increasing privacy concerns in big data. In this project the student will study novel and viable data masking and matching techniques for practical PriMedLink applications.
Benefit for the Project:
This project will extend the scope of PriMedLink and provide a baseline for matching different types of medical data in a privacy-preserving setting.
Requirements/ Prerequisites:
Interested students should have good programming skills (ideally including Python) and background knowledge in algorithms and data structures, data mining, and string comparison.
It is an advantage if students working on this project have knowledge of the privacy aspects of data analysis and data mining and/or have successfully completed courses on databases, data structures and algorithms, data mining, cryptography, or health informatics.
Mentors
Dinusha Vatsalan, Peter Christen
More Information:
The following background materials provide a basic understanding of data masking and similarity calculations for approximate matching.
- Privacy-preserving matching of similar patients. Dinusha Vatsalan and Peter Christen, Elsevier Journal of Biomedical Informatics, 2016, (http://www.sciencedirect.com/science/article/pii/S...)
- A taxonomy of privacy-preserving record linkage techniques. Dinusha Vatsalan, Peter Christen, and Vassilios S. Verykios, Elsevier Journal of Information Systems 2013, (http://www.sciencedirect.com/science/article/pii/S...)
- Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. Peter Christen, Springer Data-Centric Systems and Applications 2012, (http://www.springer.com/gp/book/9783642311635)
Flexible and realistic synthetic medical data generator
Description
Clinical research and development requires medical data for evaluation of new algorithms and systems. However, privacy and confidentiality concerns impede the collection or sharing of such medical data across different organizations. An alternative is to generate synthetic medical test data for the evaluation of clinical research. Few data generators have been developed so far for medical test data. An important aspect of such medical data generators is that they should be able to generate data that exhibit real-world characteristics. To the best of our knowledge, there is no such realistic medical test data generator freely available for clinical research evaluation and development.
Medical data are longitudinal and contain different types, ranging from narrative, textual data to numerical measurements, recorded signals, drawings, images, and videos. Several internationally accepted coding standards for diseases, drugs, etc. are used in healthcare systems and applications. Developing a flexible and extensible tool that can generate realistic longitudinal medical data incorporating different data types and standard codes would be a useful direction for clinical research and health data analytics.
The project aims to develop a synthetic data generator tool for medical data that preserves various characteristics of the original data. We have previously developed an online synthetic data generator for personal data (https://dmm.anu.edu.au/geco/). The goals of this project are:
1. Study and analyze standards, codes, and different types of medical data.
2. Design and develop a tool for synthetic medical data generation by modelling real data characteristics and relationships.
3. Test the tool by generating and analyzing different sets of medical data.
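To make the second goal more concrete, here is a minimal sketch of a generator that produces longitudinal observations for a few numerical parameters. Everything in it is an assumption made for illustration: the parameter names, the means and standard deviations (which are not clinical reference values), the visit intervals, and the function names. A realistic tool would instead learn distributions and relationships from real data and support standard codes and further data types.

```python
import random
from datetime import date, timedelta

# Hypothetical parameters with (mean, standard deviation); NOT clinical values.
PARAMETERS = {
    "systolic_bp": (120.0, 15.0),
    "cholesterol": (5.0, 1.0),
}


def generate_patient_series(patient_id, num_visits, rng, start=date(2015, 1, 1)):
    """Generate (patient, parameter, value, date) rows for one synthetic patient."""
    # Each patient gets a personal baseline per parameter, so that repeated
    # observations of the same patient are correlated over time.
    baselines = {name: rng.gauss(mean, sd) for name, (mean, sd) in PARAMETERS.items()}
    rows = []
    visit_day = start
    for _ in range(num_visits):
        for name, baseline in baselines.items():
            noise = rng.gauss(0.0, 0.05 * abs(baseline))  # visit-to-visit variation
            rows.append((patient_id, name, round(baseline + noise, 1), visit_day.isoformat()))
        visit_day += timedelta(days=rng.randint(30, 180))  # gap until the next visit
    return rows


def generate_dataset(num_patients, num_visits, seed=42):
    """Generate a reproducible synthetic data set for several patients."""
    rng = random.Random(seed)
    rows = []
    for i in range(num_patients):
        rows.extend(generate_patient_series(f"patient-{i:04d}", num_visits, rng))
    return rows


dataset = generate_dataset(num_patients=3, num_visits=4, seed=1)
for row in dataset[:4]:
    print(row)
```

Seeding the random number generator makes a generated data set reproducible, which matters when the data are used to compare algorithms across experiments.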
It is an advantage if students have knowledge of medical data representation, analysis, and mining and/or have successfully completed courses on databases, data structures and algorithms, data mining, or health informatics.
Benefit for the Student:
Medical data linkage and research has emerged as a promising field in the healthcare industry. This project allows the student to learn the basics of medical data generation, representation, and storage, and to contribute to an important problem in medical data research and development.
Benefit for the Project:
Much research in medical data linkage, mining, and analytics relies on medical test data for evaluating and comparing new techniques and algorithms. The proposed freely available online synthetic medical data generator would greatly help researchers in this field.
Requirements/ Prerequisites:
Interested students should have good programming skills (ideally including Python) and background knowledge in algorithms and data structures, and software engineering.
Mentors
Dinusha Vatsalan, Peter Christen
More Information:
The following materials provide specific background literature on medical data, different data types, and characteristics of synthetic data generators that will be required to conduct the project.
- Flexible and extensible generation and corruption of personal data. Peter Christen and Dinusha Vatsalan, ACM CIKM 2013, (http://dl.acm.org/citation.cfm?id=2507815)
- GeCo: an online personal data generator and corruptor. Khoi-Nguyen Tran, Dinusha Vatsalan, and Peter Christen, ACM CIKM 2013, (http://dl.acm.org/citation.cfm?id=2508207)
- Medical Data: Their acquisition, storage, and use. Edward H. Shortliffe and G. Octo Barnett, ACM Medical Informatics: Computer Applications in Healthcare, (http://dl.acm.org/citation.cfm?id=87788)
- A Method for Generation and Distribution of Synthetic Medical Record Data for Evaluation of Disease-Monitoring Systems. Joseph S. Lombardo and Linda J. Moniz, Johns Hopkins APL Technical Digest 2008, (http://techdigest.jhuapl.edu/TD/td2704/LombardoMet...)
- Customized test data generator for HL7v3 based healthcare information systems. Alexandru Egner et al., Journal of Control Engineering and Applied Informatics 2013, (http://www.ceai.srait.ro/index.php/ceai/article/vi...)