Researchers working on clinical and Electronic Health Record (EHR) databases such as MIMIC, eICU and the new AmsterdamUMCdb database will be familiar with the challenges posed by the lack of standardisation between datasets. Common differences include: labelling of variables (“HR”, “heart rate”, “pulse”); units (“bpm”, “Hz”); recording equipment; hospital standards and protocols; languages (“hartslag”); and epidemiological variation in the patient population. A model that works well on an American dataset may well perform differently on a European dataset, if it is applicable at all. It is therefore often necessary to test on multiple datasets to validate the robustness of new models. Typically, this involves a huge amount of work hand-picking the features that are common to each dataset.
We aim to develop a formalised and general framework for the integration of data across EHR datasets. We focus on the MIMIC-III and AmsterdamUMCdb databases because they are extensively used by the community and because they span continents and languages. So far, we have developed a tool that automatically generates likely matches between variables based on their names and their data distributions. Dutch variable strings are first translated into English using the Google Translate API and then compared to the MIMIC-III variable strings. String similarity is ranked using the Levenshtein distance. The data distributions of likely string matches are then compared to check the plausibility of each match (we are currently testing the use of t-tests and interquartile range overlap). Finally, the matches are presented to the clinician or researcher via an interactive tool in reverse likelihood order. The clinician can accept or reject each match; accepted matches are stored as variable links, which are then used to generate a harmonised dataset.
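The name-ranking and distribution-checking steps described above can be sketched as follows. This is a minimal illustration and not the project's actual code: the function names `levenshtein`, `rank_candidates` and `iqr_overlap` are hypothetical, the translation step is omitted (the inputs are assumed to be already in English), and the overlap score shown is only one possible way of defining interquartile range overlap.

```python
from statistics import quantiles


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        previous = current
    return previous[-1]


def rank_candidates(name, candidates):
    """Sort candidate names by ascending edit distance to `name`, ignoring case."""
    key = name.strip().lower()
    return sorted(candidates, key=lambda c: levenshtein(key, c.strip().lower()))


def iqr_overlap(xs, ys):
    """Share of the narrower interquartile range covered by the overlap of both.

    Assumed definition for illustration: 1.0 means the narrower IQR lies
    entirely inside the other, 0.0 means the two IQRs do not overlap.
    """
    q1x, _, q3x = quantiles(xs, n=4)
    q1y, _, q3y = quantiles(ys, n=4)
    overlap = min(q3x, q3y) - max(q1x, q1y)
    narrower = min(q3x - q1x, q3y - q1y)
    if narrower <= 0:
        return 0.0
    return max(0.0, overlap) / narrower
```

In a sketch like this, a translated AmsterdamUMCdb variable name would be passed to `rank_candidates` together with the MIMIC-III variable names, and the top few candidates would then be scored with `iqr_overlap` on their recorded values. Note that a distribution check of this kind only makes sense once the units agree, so a unit conversion step would be needed first.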
In future work, we hope to improve the matching algorithm by having it learn from past mistakes. For example, if the clinician tells the system that “Calcium ion” and “Calcium (urine)” are not the same, then the system could update the probability of a match between “Kreatinine” and “Creatinine (urine)”. We would also like to incorporate the eICU dataset and to improve the documentation.
The software used for this project can be found here. The project was initially developed at the Milan Critical Care Datathon 2020 and is still in active development. We welcome advice and contributions from the community.