Multiple Barriers Identified Which May Hamper Use of Artificial Intelligence in the Clinical Setting
Reporter: Stephen J. Williams, PhD.
From the Journal Science: Science 21 Jun 2019: Vol. 364, Issue 6446, pp. 1119-1120
By Jennifer Couzin-Frankel
3.3.21 Multiple Barriers Identified Which May Hamper Use of Artificial Intelligence in the Clinical Setting, Volume 2 (Volume Two: Latest in Genomics Methodologies for Therapeutics: Gene Editing, NGS and BioInformatics, Simulations and the Genome Ontology), Part 2: CRISPR for Gene Editing and DNA Repair
In a commentary article entitled “Medicine contends with how to use artificial intelligence,” Jennifer Couzin-Frankel discusses the barriers to the efficient and reliable adoption of artificial intelligence and machine learning in the hospital setting. In summary, these barriers stem from a lack of reproducibility across hospitals. For instance, a major concern among radiologists is that AI software developed to read images and magnify small changes, such as in cardiac imaging, is typically built within a single hospital and may not reflect the equipment or standard practices used in other hospital systems. To address this issue, US scientists and government regulators just recently issued guidance describing how to convert research-based AI into improved medical imaging tools, published in the Journal of the American College of Radiology. The group suggested greater collaboration among the relevant parties in the development of AI practices, including software engineers, scientists, clinicians, radiologists, etc.
As thousands of images are fed into AI algorithms, according to neurosurgeon Eric Oermann at Mount Sinai Hospital, the signals they recognize can have less to do with disease than with other patient characteristics, the brand of MRI machine, or even how a scanner is angled. For example, Oermann and colleagues at Mount Sinai developed an AI algorithm to detect spots on a lung scan indicative of pneumonia; when tested on a group of new patients, the algorithm detected pneumonia with 93% accuracy.
However, when the Mount Sinai group tested their algorithm on tens of thousands of scans from other hospitals, including the NIH, the success rate fell to 73-80%, indicative of bias within the training set: in other words, there was something unique about the way Mount Sinai performs its scans relative to other hospitals. Indeed, many of the patients Mount Sinai sees are too sick to get out of bed, so radiologists image them with portable scanners, which generate different images than standalone scanners.
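To make the internal-versus-external generalization gap concrete, here is a minimal sketch using a synthetic stand-in dataset and a simple scikit-learn classifier; the feature construction, site names, and the confound strength are illustrative assumptions, not the Mount Sinai group's actual data or code:

```python
# Illustrative sketch (not the authors' code): train a stand-in classifier on
# synthetic "site A" data that contains a site-specific confound (e.g., portable
# scanners used mostly for sicker patients), then compare internal vs. external AUC.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def make_site(n, confound_strength):
    """Synthetic features: one true disease signal plus one site-dependent confound."""
    y = rng.integers(0, 2, size=n)                                  # pneumonia label
    disease_feature = y + rng.normal(0, 1.0, size=n)                # present at every site
    site_feature = confound_strength * y + rng.normal(0, 1.0, size=n)  # informative only at site A
    return np.column_stack([disease_feature, site_feature]), y

X_train, y_train = make_site(2000, confound_strength=2.0)   # training hospital (site A)
X_int, y_int = make_site(1000, confound_strength=2.0)       # internal test set (site A)
X_ext, y_ext = make_site(1000, confound_strength=0.0)       # external hospital (site B)

model = LogisticRegression().fit(X_train, y_train)
print("Internal AUC:", roc_auc_score(y_int, model.predict_proba(X_int)[:, 1]))
print("External AUC:", roc_auc_score(y_ext, model.predict_proba(X_ext)[:, 1]))
# The external AUC is typically noticeably lower, mirroring the drop seen when
# an algorithm trained at one hospital is tested on scans from other hospitals.
```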
The results were published in PLoS Medicine, as seen below:
PLoS Med. 2018 Nov 6;15(11):e1002683. doi: 10.1371/journal.pmed.1002683. eCollection 2018 Nov.
Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: A cross-sectional study.
Zech JR, Badgeley MA, Liu M, Costa AB, Titano JJ, Oermann EK.
Abstract
BACKGROUND:
There is interest in using convolutional neural networks (CNNs) to analyze medical imaging to provide computer-aided diagnosis (CAD). Recent work has suggested that image classification CNNs may not generalize to new data as well as previously believed. We assessed how well CNNs generalized across three hospital systems for a simulated pneumonia screening task.
METHODS AND FINDINGS:
A cross-sectional design with multiple model training cohorts was used to evaluate model generalizability to external sites using split-sample validation. A total of 158,323 chest radiographs were drawn from three institutions: National Institutes of Health Clinical Center (NIH; 112,120 from 30,805 patients), Mount Sinai Hospital (MSH; 42,396 from 12,904 patients), and Indiana University Network for Patient Care (IU; 3,807 from 3,683 patients). These patient populations had an age mean (SD) of 46.9 years (16.6), 63.2 years (16.5), and 49.6 years (17) with a female percentage of 43.5%, 44.8%, and 57.3%, respectively. We assessed individual models using the area under the receiver operating characteristic curve (AUC) for radiographic findings consistent with pneumonia and compared performance on different test sets with DeLong’s test. The prevalence of pneumonia was high enough at MSH (34.2%) relative to NIH and IU (1.2% and 1.0%) that merely sorting by hospital system achieved an AUC of 0.861 (95% CI 0.855-0.866) on the joint MSH-NIH dataset. Models trained on data from either NIH or MSH had equivalent performance on IU (P values 0.580 and 0.273, respectively) and inferior performance on data from each other relative to an internal test set (i.e., new data from within the hospital system used for training data; P values both <0.001). The highest internal performance was achieved by combining training and test data from MSH and NIH (AUC 0.931, 95% CI 0.927-0.936), but this model demonstrated significantly lower external performance at IU (AUC 0.815, 95% CI 0.745-0.885, P = 0.001). To test the effect of pooling data from sites with disparate pneumonia prevalence, we used stratified subsampling to generate MSH-NIH cohorts that only differed in disease prevalence between training data sites. When both training data sites had the same pneumonia prevalence, the model performed consistently on external IU data (P = 0.88). When a 10-fold difference in pneumonia rate was introduced between sites, internal test performance improved compared to the balanced model (10× MSH risk P < 0.001; 10× NIH P = 0.002), but this outperformance failed to generalize to IU (MSH 10× P < 0.001; NIH 10× P = 0.027). CNNs were able to directly detect hospital system of a radiograph for 99.95% NIH (22,050/22,062) and 99.98% MSH (8,386/8,388) radiographs. The primary limitation of our approach and the available public data is that we cannot fully assess what other factors might be contributing to hospital system-specific biases.
CONCLUSION:
Pneumonia-screening CNNs achieved better internal than external performance in 3 out of 5 natural comparisons. When models were trained on pooled data from sites with different pneumonia prevalence, they performed better on new pooled data from these sites but not on external data. CNNs robustly identified hospital system and department within a hospital, which can have large differences in disease burden and may confound predictions.
PMID: 30399157 PMCID: PMC6219764 DOI: 10.1371/journal.pmed.1002683
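The stratified-subsampling experiment described in the abstract, equalizing pneumonia prevalence between the two training sites before pooling, can be sketched roughly as follows; the column names, per-site tables, and target prevalence are illustrative assumptions, not the authors' actual code:

```python
# Illustrative sketch (not the authors' code): equalize pneumonia prevalence
# between two training sites by stratified subsampling before pooling.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

def fake_site(n, prevalence, name):
    """Hypothetical per-site label table with a 'pneumonia' column."""
    return pd.DataFrame({"site": name, "pneumonia": rng.binomial(1, prevalence, size=n)})

def subsample_to_prevalence(df, target_prev, seed=0):
    """Downsample positives or negatives until the pneumonia rate equals target_prev."""
    pos = df[df["pneumonia"] == 1]
    neg = df[df["pneumonia"] == 0]
    if len(pos) / len(df) > target_prev:
        # too many positives: keep all negatives, trim positives
        pos = pos.sample(n=int(round(target_prev * len(neg) / (1 - target_prev))), random_state=seed)
    else:
        # too many negatives: keep all positives, trim negatives
        neg = neg.sample(n=int(round(len(pos) * (1 - target_prev) / target_prev)), random_state=seed)
    return pd.concat([pos, neg]).sample(frac=1, random_state=seed)

msh_df = fake_site(5000, prevalence=0.342, name="MSH")   # high-prevalence site
nih_df = fake_site(5000, prevalence=0.012, name="NIH")   # low-prevalence site

target = 0.05  # illustrative common prevalence
pooled = pd.concat([subsample_to_prevalence(msh_df, target),
                    subsample_to_prevalence(nih_df, target)])
print(pooled.groupby("site")["pneumonia"].mean())  # both sites now ~0.05
```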
Surprisingly, few researchers have begun to train their algorithms on data obtained from multiple hospitals. The FDA has issued some guidance on the matter, but so far it regulates only “locked,” or unchanging, AI software as a medical device. However, the agency recently announced that it is developing a framework for regulating more cutting-edge software that continues to learn over time.
Still, the key point is that collaboration across multiple health systems, in various countries, may be necessary to develop AI software that can be used in multiple clinical settings. Otherwise, each hospital would need to develop its own software for use only within its own system, which would create a regulatory headache for the FDA.
The ECG sensor patch is a diagnostic tool used by clinicians for early detection of atrial fibrillation and to ensure timely treatment of such patients. With advances in device miniaturization and wireless technologies, and with changing consumer expectations, wearable “on-body” ECG patch devices have evolved to meet contemporary needs. The wearable patch continuously records the user’s ECG, which aids in arrhythmia detection and management at the point of care. It can also alert the cardiac patient to elevated stress levels, thereby increasing patient compliance.
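As a rough illustration of the kind of rhythm check such a patch might support, the sketch below flags highly irregular beat-to-beat (R-R) intervals from detected R-peak times; the input format, threshold, and function name are simplified assumptions, not any vendor's validated algorithm:

```python
# Sketch: crude atrial-fibrillation screen from R-peak timestamps (in seconds).
# The peak-time input and the irregularity threshold are illustrative assumptions,
# not a validated clinical algorithm.
import numpy as np

def rr_irregularity(r_peak_times_s, threshold=0.15):
    """Return True if beat-to-beat (R-R) intervals are highly irregular."""
    rr = np.diff(np.asarray(r_peak_times_s))   # R-R intervals in seconds
    if len(rr) < 5:
        return False                           # too few beats to judge
    cv = rr.std() / rr.mean()                  # coefficient of variation of R-R intervals
    return cv > threshold                      # high variability -> flag for clinical review

# Example: regularly spaced beats (~75 bpm) vs. irregular beats
np.random.seed(1)
regular = np.arange(0, 20, 0.8)
irregular = np.cumsum(np.random.uniform(0.5, 1.2, size=25))
print(rr_irregularity(regular))    # False
print(rr_irregularity(irregular))  # likely True
```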