
Performance and Usability of Machine Learning for Screening in Systematic Reviews: A Comparative Evaluation of Three Tools

Methods Research Report

Investigators: Gates A, Ph.D., Guitard S, M.Sc., Pillay J, M.Sc., Elliott SA, Ph.D., Dyson MP, Ph.D., Newton AS, Ph.D., R.N., and Hartling L, Ph.D.

Rockville (MD): Agency for Healthcare Research and Quality (US); November 2019.
Report No.: 19(20)-EHC027-EF

Structured Abstract

Background:

Machine learning tools can expedite systematic review (SR) completion by reducing manual screening workloads, yet their adoption has been slow. Evidence of their reliability and usability may improve their acceptance within the SR community. We explored the performance of three tools when used to: (a) eliminate irrelevant records (Automated Simulation) and (b) complement the work of a single reviewer (Semi-automated Simulation). We evaluated the usability of each tool.

Methods:

We subjected three SRs to two retrospective screening simulations. In each tool (Abstrackr, DistillerSR, and RobotAnalyst), we screened a 200-record training set and downloaded the predicted relevance of the remaining records. We calculated the proportion missed and the workload and time savings compared to dual independent screening. To test usability, eight research staff undertook a screening exercise in each tool and completed a survey, including the System Usability Scale (SUS).
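The abstract names the performance metrics but not their formulas. The sketch below shows one common way such screening metrics are defined; the function names, definitions, and all numbers are illustrative assumptions, not the authors' actual calculations.

```python
# Hedged sketch: plausible definitions of the screening metrics named in the
# Methods. The report's exact formulas are not given in this abstract, so
# treat these as assumed, illustrative definitions.

def proportion_missed(relevant_ids, predicted_relevant_ids):
    """Share of truly relevant records the tool would have excluded."""
    relevant = set(relevant_ids)
    missed = relevant - set(predicted_relevant_ids)
    return len(missed) / len(relevant)

def workload_savings(total_records, records_screened_manually):
    """Share of records never screened by a human, relative to screening all."""
    return 1 - records_screened_manually / total_records

# Made-up example: 100 relevant records, of which the tool flags 95;
# reviewers manually screen 200 of 2,000 total records.
print(proportion_missed(range(100), range(95)))  # 0.05 -> 5% missed
print(workload_savings(2000, 200))               # 0.9  -> 90% saved
```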

Results:

Using Abstrackr, DistillerSR, and RobotAnalyst respectively, the median (range) proportion missed was 5 (0 to 28) percent, 97 (96 to 100) percent, and 70 (23 to 100) percent in the Automated Simulation and 1 (0 to 2) percent, 2 (0 to 7) percent, and 2 (0 to 4) percent in the Semi-automated Simulation. The median (range) workload savings was 90 (82 to 93) percent, 99 (98 to 99) percent, and 85 (85 to 88) percent for the Automated Simulation and 40 (32 to 43) percent, 49 (48 to 49) percent, and 35 (34 to 38) percent for the Semi-automated Simulation. The median (range) time savings was 154 (91 to 183), 185 (95 to 201), and 157 (86 to 172) hours for the Automated Simulation and 61 (42 to 82), 92 (46 to 100), and 64 (37 to 71) hours for the Semi-automated Simulation. Abstrackr identified 33 to 90 percent of records erroneously excluded by a single reviewer, while RobotAnalyst performed less well and DistillerSR provided no relative advantage. Based on reported SUS scores, Abstrackr fell in the usable, DistillerSR the marginal, and RobotAnalyst the unacceptable usability range. Usability depended on six interdependent properties: user friendliness, qualities of the user interface, features and functions, trustworthiness, ease and speed of obtaining predictions, and practicality of the export file(s).

Conclusions:

The workload and time savings afforded in the Automated Simulation came with increased risk of erroneously excluding relevant records. Supplementing a single reviewer’s decisions with relevance predictions (Semi-automated Simulation) improved upon the proportion missed in some cases, but performance varied by tool and SR. Designing tools based on reviewers’ self-identified preferences may improve their compatibility with present workflows.

Prepared for: Agency for Healthcare Research and Quality, U.S. Department of Health and Human Services, 5600 Fishers Lane, Rockville, MD 20857; www.ahrq.gov
Contract No.: 290-2015-00001-I
Prepared by: University of Alberta Evidence-based Practice Center, Edmonton, Alberta, Canada

Suggested citation:

Gates A, Guitard S, Pillay J, Elliott SA, Dyson MP, Newton AS, Hartling L. Performance and Usability of Machine Learning for Screening in Systematic Reviews: A Comparative Evaluation of Three Tools. (Prepared by the University of Alberta Evidence-based Practice Center under Contract No. 290-2015-00001-I.) AHRQ Publication No. 19(20)-EHC027-EF. Rockville, MD: Agency for Healthcare Research and Quality; November 2019. Posted final reports are located on the Effective Health Care Program search page. DOI: https://doi.org/10.23970/AHRQEPCMETHMACHINEPERFORMANCE

This report is based on research conducted by the University of Alberta Evidence-based Practice Center under contract to the Agency for Healthcare Research and Quality (AHRQ), Rockville, MD (Contract No. 290-2015-00001-I). The findings and conclusions in this document are those of the authors, who are responsible for its contents; the findings and conclusions do not necessarily represent the views of AHRQ. Therefore, no statement in this report should be construed as an official position of AHRQ or of the U.S. Department of Health and Human Services.

None of the investigators have any affiliations or financial involvement that conflicts with the material presented in this report.

The information in this report is intended to help healthcare decision makers—patients and clinicians, health system leaders, and policymakers, among others—make well-informed decisions and thereby improve the quality of healthcare services. This report is not intended to be a substitute for the application of clinical judgment. Anyone who makes decisions concerning the provision of clinical care should consider this report in the same way as any medical reference and in conjunction with all other pertinent information, i.e., in the context of available resources and circumstances presented by individual patients.

This report is made available to the public under the terms of a licensing agreement between the author and the Agency for Healthcare Research and Quality. This report may be used and reprinted without permission except those copyrighted materials that are clearly noted in the report. Further reproduction of those copyrighted materials is prohibited without the express permission of copyright holders. AHRQ or U.S. Department of Health and Human Services endorsement of any derivative products that may be developed from this report, such as clinical practice guidelines, other quality enhancement tools, or reimbursement or coverage policies, may not be stated or implied.

Persons using assistive technology may not be able to fully access information in this report. For assistance contact EPC@ahrq.hhs.gov.

Bookshelf ID: NBK550175; PMID: 31790164
