The prospect of improved clinical outcomes and more efficient health systems has fueled a rapid rise in the development and evaluation of AI systems over the last decade. Because most AI systems within healthcare are complex interventions designed as clinical decision support systems, rather than autonomous agents, the interactions among the AI systems, their users and the implementation environments are defining components of the AI interventions’ overall potential effectiveness. Therefore, bringing AI systems from mathematical performance to clinical utility needs an adapted, stepwise implementation and evaluation pathway, addressing the complexity of this collaboration between two independent forms of intelligence, beyond measures of effectiveness alone1. Despite indications that some AI-based algorithms now match the accuracy of human experts within preclinical in silico studies2, there is little high-quality evidence for improved clinician performance or patient outcomes in clinical studies3,4. Reasons proposed for this so-called AI chasm5 are lack of necessary expertise needed for translating a tool into practice, lack of funding available for translation, a general underappreciation of clinical research as a translation mechanism6 and, more specifically, a disregard for the potential value of the early stages of clinical evaluation and the analysis of human factors7.
The challenges of early-stage clinical AI evaluation (Box 1) are similar to those of complex interventions, as reported by the Medical Research Council dedicated guidance1, and surgical innovation, as described by the IDEAL Framework8,9. For example, in all three cases, the evaluation needs to consider the potential for iterative modification of the interventions and the characteristics of the operators (or users) performing them. In this regard, the IDEAL framework offers readily implementable and stage-specific recommendations for the evaluation of surgical innovations under development. IDEAL stages 2a and 2b, for example, are described as development and exploratory stages, during which the intervention is refined, operators’ learning curves are analyzed and the influence of patient and operator variability on effectiveness are explored prospectively, before large-scale efficacy testing.
Early-stage clinical evaluation of AI systems should also place a strong emphasis on validation of performance and safety, in a similar manner to phase 1 and phase 2 pharmaceutical trials, before efficacy evaluation at scale in phase 3. For example, small changes in the distribution of the underlying data between the algorithm training and clinical evaluation populations (so-called dataset shift) can lead to substantial variation in clinical performance and expose patients to potential unexpected harm10,11.
Human factors (or ergonomics) evaluations are commonly conducted in safety-critical fields such as aviation, military and energy sectors12,13,14. Their assessments evaluate the effect of a device or procedure on their users’ physical and cognitive performance and vice-versa. Human factors, such as usability evaluation, are an integral part of the regulatory process for new medical devices15,16, and their application to AI-specific challenges is attracting growing attention in the medical literature17,18,19,20. However, few clinical AI studies have reported on the evaluation of human factors3, and usability evaluation of related digital health technology is often performed with inconstant methodology and reporting21.
Other areas of suboptimal reporting of clinical AI studies have also recently been highlighted3,22, such as implementation environment, user characteristics and selection process, training provided, underlying algorithm identification and disclosure of funding sources. Transparent reporting is necessary for informed study appraisal and to facilitate reproducibility of study results. In a relatively new and dynamic field such as clinical AI, comprehensive reporting is also key to construct a common and comparable knowledge base to build upon.
Guidelines already exist, or are under development, for the reporting of preclinical, in silico studies of AI systems, their offline validation and their evaluation in large comparative studies23,24,25,26; but there is an important stage of research between these, namely studies focusing on the initial clinical use of AI systems, for which no such guidance currently exists (Fig. 1 and Table 1). This early clinical evaluation provides a crucial scoping evaluation of clinical utility, safety and human factors challenges in live clinical settings. By investigating the potential obstacles to clinical evaluation at scale and informing protocol design, these studies are also important stepping stones toward definitive comparative trials.
To address this gap, we convened an international, multi-stakeholder group of experts in a Delphi exercise to produce the DECIDE-AI reporting guideline. Focusing on AI systems supporting, rather than replacing, human intelligence, DECIDE-AI aims to improve the reporting of studies describing the evaluation of AI-based decision support systems during their early, small-scale implementation in live clinical settings (that is, the supported decisions have an actual effect on patient care). Whereas TRIPOD-AI, STARD-AI, SPIRIT-AI and CONSORT-AI are specific to particular study designs, DECIDE-AI is focused on the evaluation stage and does not prescribe a fixed study design.
Box 1 Methodological challenges of the AI-based decision support system evaluation
The clinical evaluation of AI-based decision support systems presents several methodological challenges, all of which will likely be encountered at early stage. These are the needs to: