Editorial Commentary

The rise of multimodal foundation models in medicine and ophthalmology

Samyyia Ashraf1,2, Edward Adams1, Livia Faes1, Dun Jack Fu1

1NIHR Biomedical Research Centre, Moorfields Eye Hospital NHS Foundation Trust, UCL Institute of Ophthalmology, London, UK; 2NNEdPro Global Centre for Nutrition and Health, Cambridge, UK

Correspondence to: Dun Jack Fu, MD, PhD. NIHR Biomedical Research Centre, Moorfields Eye Hospital NHS Foundation Trust, UCL Institute of Ophthalmology, 162 City Road, London EC1V 2PD, UK. Email: d.fu@nhs.net.

Comment on: Wu Y, Qian B, Li T, et al. An eyecare foundation model for clinical assistance: a randomized controlled trial. Nat Med 2025;31:3404-13.


Keywords: EyeFM; multimodal foundation models; clinical assistance; ophthalmology


Received: 10 December 2025; Accepted: 06 March 2026; Published online: 20 March 2026.

doi: 10.21037/aes-2025-1-73


The convergence of human and artificial intelligence (AI) is rapidly transforming medical research, with recent advances in multimodal foundation models demonstrating the capacity to integrate imaging, clinical metadata, and longitudinal patient information (1-3). A recent study by Wu et al. (4), published in Nature Medicine in August 2025 and titled “An eyecare foundation model for clinical assistance: a randomized controlled trial”, presents a rigorous evaluation of a multimodal foundation model, EyeFM, including findings from a randomised clinical trial designed to assess real-world clinical utility. Building on prior work such as RETFound (5), which demonstrated the potential of large-scale self-supervised learning in retinal imaging, the EyeFM study advances the field by showing that a multimodal foundation model can deliver measurable clinical benefit in prospective evaluation. The authors provide evidence that integrating several imaging modalities with clinical information improves diagnostic performance beyond what unimodal, image-only systems have achieved in retrospective settings. Importantly, the study directly tests the authors’ argument that rigorous, transparent, and clinically grounded evaluation is essential for AI intended for high-stakes decisions.

Recent years have seen a rapid acceleration in the development of medical foundation models: large-scale neural networks pretrained on extensive, heterogeneous datasets and adaptable to a wide range of downstream clinical tasks (2,3). RETFound (5), published in Nature in 2023, represents a landmark example. Trained using self-supervised learning on 1.6 million unlabelled retinal images, it demonstrated that foundation models can learn highly generalisable visual representations from a single medical domain such as the retina. When fine-tuned, RETFound exceeded the performance of conventional models across multiple ocular and systemic disease-prediction tasks, reducing reliance on expert-labelled datasets. Despite these advances, most evaluations of medical foundation models remain restricted to retrospective analyses. A recent systematic review of AI interventions highlighted that, despite tens of thousands of AI-focused publications, only a small fraction involves prospective clinical evaluation, and even fewer are conducted as randomised controlled trials (RCTs) (6). This evidence gap limits clinicians’ and policymakers’ ability to determine the real-world effectiveness, safety, and implementation value of AI-enabled decision-support tools (7,8).
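
RETFound’s core recipe, self-supervised pretraining followed by task-specific fine-tuning, can be made concrete with a toy sketch. The PyTorch snippet below uses a masked-reconstruction objective in the spirit of masked autoencoders; the architecture, masking ratio, shapes and training loop are illustrative placeholders, not RETFound’s actual configuration.

```python
import torch
import torch.nn as nn

# Minimal masked-image-modelling loop: hide random patches, train the
# network to reconstruct them, then reuse the encoder for a labelled task.
# All shapes and hyperparameters below are illustrative placeholders.

PATCHES, DIM = 196, 64          # e.g., a 14x14 grid of flattened patch embeddings

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=DIM, nhead=4, batch_first=True),
    num_layers=2,
)
decoder = nn.Linear(DIM, DIM)   # reconstruct each masked patch embedding
opt = torch.optim.AdamW(
    list(encoder.parameters()) + list(decoder.parameters()), lr=1e-4
)

for _ in range(5):                              # a few toy pretraining steps
    imgs = torch.randn(8, PATCHES, DIM)         # stand-in for unlabelled patch embeddings
    mask = torch.rand(8, PATCHES) < 0.75        # MAE-style high masking ratio
    corrupted = imgs.masked_fill(mask.unsqueeze(-1), 0.0)
    recon = decoder(encoder(corrupted))
    loss = ((recon - imgs)[mask]).pow(2).mean() # reconstruction loss on masked patches only
    opt.zero_grad(); loss.backward(); opt.step()

# Fine-tuning: attach a small classification head to the pretrained encoder.
head = nn.Linear(DIM, 2)                        # e.g., disease vs. no disease
logits = head(encoder(torch.randn(8, PATCHES, DIM)).mean(dim=1))
```

The key design point, which RETFound demonstrated at scale, is that the expensive pretraining stage requires no labels; only the lightweight head needs expert-labelled data.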

EyeFM is a multimodal vision-language foundation model developed to function as an AI co-pilot in ophthalmic practice (4). The model was pretrained on 14.5 million ocular images spanning five modalities [fundus photography, optical coherence tomography (OCT), optical coherence tomography angiography (OCTA), fundus autofluorescence and ultra-widefield imaging] and paired with accompanying clinical text such as examination notes and diagnostic impressions from global, multiethnic datasets. To rigorously assess model performance and clinical utility, the authors implemented a structured three-phase validation framework comprising:

  • Retrospective benchmarking: comparing EyeFM with existing unimodal and multimodal architectures across single-modality, multi-modality and vision-language tasks, typically using metrics such as area under the receiver operating characteristic curve (AUROC) and sensitivity for disease detection (a toy computation of these metrics is sketched after this list).
  • Reader studies and real-world evaluations: conducted across multiple countries to determine the model’s value as a decision-support tool for ophthalmologists, assessed using diagnostic accuracy, sensitivity and the quality of written reports.
  • A parallel-group, single-centre, double-masked RCT: designed to quantify the model’s impact on clinical decision-making and patient-level outcomes, with primary endpoints including correct diagnosis and referral decisions.
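
To make these benchmarking metrics concrete, the sketch below computes AUROC and sensitivity for a binary disease-detection task using scikit-learn; the labels and scores are synthetic stand-ins, not data from the EyeFM evaluation.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, confusion_matrix

rng = np.random.default_rng(0)

# Synthetic ground truth and model scores for a binary disease-detection task.
y_true = rng.integers(0, 2, size=500)
y_score = np.clip(y_true * 0.3 + rng.normal(0.4, 0.25, size=500), 0, 1)

auroc = roc_auc_score(y_true, y_score)          # threshold-free discrimination

y_pred = (y_score >= 0.5).astype(int)           # sensitivity requires an operating point
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)                    # recall for the diseased class

print(f"AUROC={auroc:.3f}, sensitivity={sensitivity:.3f}")
```

Note that AUROC summarises discrimination across all thresholds, whereas sensitivity depends on the chosen operating point; both are routinely reported in studies like this one.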

This tiered evaluation strategy positions EyeFM among the first ophthalmic foundation models to undergo prospective and randomised clinical testing, marking a significant step toward evidence-based integration of AI systems in eye care.

The RCT enrolled 668 participants at a high-risk retinal screening centre in China, with 16 ophthalmologists randomised equally to either the EyeFM-assisted or the standard-care arm. Diagnostic correctness was assessed across common retinal diseases against an expert-adjudicated reference standard. Under these conditions, EyeFM increased the rate of correct diagnoses to 92.2%, compared with 75.4% in the control arm (P<0.001), with a similar improvement in referral accuracy.
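
As a rough, illustrative sanity check of the headline result, the snippet below applies a chi-squared test to a two-by-two table reconstructed from the reported rates. The per-arm denominators are not stated in this commentary, so an even 334/334 split of the 668 participants is assumed, and the test ignores the clustering introduced by randomising ophthalmologists rather than patients; the trial’s own analysis will have been more careful.

```python
from scipy.stats import chi2_contingency

# Illustrative check of the headline comparison (92.2% vs. 75.4% correct
# diagnoses). Per-arm denominators are assumed, not reported here.
n_eyefm, n_control = 334, 334
correct_eyefm = round(0.922 * n_eyefm)      # 308
correct_control = round(0.754 * n_control)  # 252

table = [
    [correct_eyefm, n_eyefm - correct_eyefm],
    [correct_control, n_control - correct_control],
]
chi2, p, dof, _ = chi2_contingency(table)
print(f"chi2={chi2:.1f}, p={p:.2e}")        # p falls well below 0.001 under these assumptions
```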

The intervention also enhanced report quality, with the standardisation score rising from 33 to 37 on a structured reporting scale, indicating greater consistency and completeness of clinical documentation. This was accompanied by improved patient compliance with self-management (70.1% vs. 49.1%) and referral suggestions (33.7% vs. 20.2%). Notably, patient satisfaction with the screening process remained similar between groups, and post-deployment surveys indicated strong clinician acceptance.

In retrospective tests the model matched or exceeded state-of-the-art algorithms for disease detection, lesion segmentation and cross-modality tasks; for example, it achieved an AUROC of 0.883 for detecting centre-involved diabetic macular oedema from fundus photographs used as a surrogate for OCT, indicating good discrimination, implying that the model can infer OCT-relevant features from fundus images, and outperforming ImageNet-based and prior foundation models (P<0.001; details in the supplementary data). In the reader-study phases, 44 ophthalmologists from six countries showed higher sensitivity for common eye diseases when aided by EyeFM and generated higher-quality written reports while saving time compared with unaided assessment.
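
Point estimates such as the 0.883 AUROC are typically accompanied by uncertainty bounds. One common approach, sketched below on synthetic data, is a non-parametric bootstrap over cases; the interval estimation method actually used by the study is not described in this commentary.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Bootstrap confidence interval for an AUROC estimate: one common way to
# quantify uncertainty around figures like the 0.883 reported for the
# fundus-to-OCT surrogate task. Data here are synthetic placeholders.
rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, size=400)
y_score = np.clip(y_true * 0.35 + rng.normal(0.35, 0.2, size=400), 0, 1)

boot = []
for _ in range(2000):
    idx = rng.integers(0, len(y_true), size=len(y_true))  # resample cases with replacement
    if y_true[idx].min() == y_true[idx].max():
        continue                                          # skip single-class resamples
    boot.append(roc_auc_score(y_true[idx], y_score[idx]))

lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"AUROC={roc_auc_score(y_true, y_score):.3f} (95% CI {lo:.3f}-{hi:.3f})")
```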

One of EyeFM’s most significant contributions lies in its translational framework, which bridges retrospective and prospective evidence. By moving from retrospective benchmarking through reader studies and real-world implementation to a properly masked RCT, the authors supply rare prospective evidence for an AI co-pilot (7). This is crucial because the gap between algorithmic performance and clinical impact is wide (6); models may perform well on curated datasets yet falter when paired with clinicians under real-world constraints such as variable image quality or greater case complexity. EyeFM’s RCT shows that augmenting ophthalmologists with AI can significantly improve diagnostic accuracy and referral decisions while maintaining patient satisfaction.

EyeFM’s design integrates five imaging modalities with clinical language, reflecting how ophthalmologists synthesise diverse data. It also incorporates a human-in-the-loop component, whereby clinicians provide feedback or corrections during assessment that can be used to refine the system’s performance (a schematic of such a feedback loop is sketched below). Such multimodal and interactive approaches may enhance generalisability compared with single-modality foundation models. The system’s ability to operate in resource-constrained settings, using lower-cost fundus photographs to infer diagnoses that typically require more advanced imaging, addresses equity and may broaden access to high-quality eyecare.
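
What human-in-the-loop feedback capture might look like in code is sketched here: clinician reviews are logged against model outputs so that disagreements can seed a later fine-tuning round. The record structure, field names and workflow are hypothetical and are not drawn from EyeFM’s implementation.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Schematic human-in-the-loop feedback capture: each clinician review is
# stored alongside the model's output so corrections can inform retraining.
# Field names and workflow are hypothetical, not EyeFM's implementation.

@dataclass
class FeedbackRecord:
    case_id: str
    model_diagnosis: str
    clinician_diagnosis: str
    agreed: bool
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

feedback_log: list[FeedbackRecord] = []

def record_review(case_id: str, model_dx: str, clinician_dx: str) -> None:
    feedback_log.append(
        FeedbackRecord(case_id, model_dx, clinician_dx, model_dx == clinician_dx)
    )

record_review("case-001", "diabetic macular oedema", "diabetic macular oedema")
record_review("case-002", "normal", "early AMD")

# Disagreements become candidate examples for the next fine-tuning round.
corrections = [r for r in feedback_log if not r.agreed]
print(f"{len(corrections)} correction(s) queued for review")
```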

EyeFM also presents a step towards clinician-AI partnership. The post-trial analysis highlighted complementary strengths: ophthalmologists were more accurate when confirming normal findings, whereas EyeFM excelled at detecting pathological fundus abnormalities. This synergy underscores the importance of designing AI tools that augment rather than supplant clinicians (2). EyeFM improved standardisation of reports and adherence to management plans, suggesting that AI assistance can influence not only diagnostic accuracy but also patient adherence and short-term follow-up behaviours, an important yet underexplored dimension of AI impact.

Despite its multiple strengths, the study has limitations that constrain its broader applicability and generalisability:

  • Single-centre RCT and homogeneous population: the RCT was conducted at one screening centre in China with mainly middle-aged participants; external validity to other populations, ethnicities and healthcare systems remains untested. Multi-centre and multiethnic trials are needed to assess generalisability.
  • Short-term outcomes: the trial measured immediate diagnostic accuracy, report quality and short-term compliance but did not assess long-term visual outcomes, progression of diseases or healthcare utilisation.
  • Limited task scope: EyeFM primarily targets common retinal diseases such as diabetic retinopathy and age-related macular degeneration; its utility for less-common pathologies, including inherited retinal diseases (9) or more complex management scenarios, warrants further evaluation.
  • Operational considerations: implementing EyeFM clinically will require robust integration with electronic health records, user-interface optimisation and reliable network infrastructure. Broader adoption must also address data privacy, algorithmic bias and regulatory oversight. The study’s authors acknowledge these gaps and advocate for larger, more diverse trials and sustained post-deployment monitoring.

The EyeFM study provides compelling evidence that multimodal foundation models can yield measurable clinical benefit when subjected to rigorous, prospective evaluation. These benefits relate primarily to diagnostic performance, referral accuracy and report quality; the impact on longer-term visual outcomes was not assessed and remains an important area for future study. In a landscape where RCTs of AI interventions remain strikingly uncommon (6-8), this work represents an important milestone for the field.

Looking ahead, the development and deployment of medical AI systems will require sustained attention to several core principles: the use of diverse and representative training datasets; transparent reporting and reproducible evaluation; incorporation of human-in-the-loop workflows; and prospective, multi-centre trials capable of assessing generalisability across populations and healthcare settings (10).

In parallel, regulators and payers should encourage evaluation frameworks that extend beyond diagnostic accuracy to include patient-centred outcomes, workflow integration, safety and broader health-system impact, such as implications for cost, clinic efficiency and scalability. Clinicians, for their part, will play a crucial role in shaping responsible adoption by engaging with AI tools as collaborative partners while remaining attentive to their limitations, biases and potential failure modes, including clinically relevant errors such as false positives and false negatives.

Taken together, the EyeFM findings move the field beyond proof-of-concept demonstrations toward evidence-based clinical deployment. By showing that a vision-language foundation model, which integrates imaging with clinical text, can enhance ophthalmologists’ diagnostic performance and improve short-term patient follow-up behaviour, the authors offer a persuasive template for how AI co-pilots may be safely integrated into clinical workflows.


Acknowledgments

None.


Footnote

Provenance and Peer Review: This article was commissioned by the editorial office, Annals of Eye Science. The article did not undergo external peer review.

Funding: None.

Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://aes.amegroups.com/article/view/10.21037/aes-2025-1-73/coif). L.F. received grants from Bayer, Roche, AbbVie and received support for attending meetings from Apellis. D.J.F. received grants from Roche, Boehringer Ingelheim, Galimedix and NIHR, received consulting fees from Roche, Boehringer Ingelheim and Galimedix, and received payment or honoraria for lectures, presentations, speakers bureaus, manuscript writing or educational events and support for attending meetings and/or travel from Roche. The other authors have no conflicts of interest to declare.

Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved.

Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.


References

  1. Topol EJ. High-performance medicine: the convergence of human and artificial intelligence. Nat Med 2019;25:44-56. [Crossref] [PubMed]
  2. Tanno R, Barrett DGT, Sellergren A, et al. Collaboration between clinicians and vision-language models in radiology report generation. Nat Med 2025;31:599-608. [Crossref] [PubMed]
  3. Lu MY, Chen B, Williamson DFK, et al. A visual-language foundation model for computational pathology. Nat Med 2024;30:863-74. [Crossref] [PubMed]
  4. Wu Y, Qian B, Li T, et al. An eyecare foundation model for clinical assistance: a randomized controlled trial. Nat Med 2025;31:3404-13. [Crossref] [PubMed]
  5. Zhou Y, Chia MA, Wagner SK, et al. A foundation model for generalizable disease detection from retinal images. Nature 2023;622:156-63. [Crossref] [PubMed]
  6. Lam TYT, Cheung MFK, Munro YL, et al. Randomized Controlled Trials of Artificial Intelligence in Clinical Practice: Systematic Review. J Med Internet Res 2022;24:e37188. [Crossref] [PubMed]
  7. Longhurst CA, Singh K, Chopra A, et al. A call for artificial intelligence implementation science centers to evaluate clinical effectiveness. NEJM AI 2024;1.
  8. You JG, Hernandez-Boussard T, Pfeffer MA, et al. Clinical trials informed framework for real world clinical implementation and deployment of artificial intelligence applications. NPJ Digit Med 2025;8:107. [Crossref] [PubMed]
  9. Pontikos N, Woof WA, Lin S, et al. Next-generation phenotyping of inherited retinal diseases from multimodal imaging with Eye2Gene. Nat Mach Intell 2025;7:967-78. [Crossref] [PubMed]
  10. Norden JG, Shah NR. What AI in health care can learn from the long road to autonomous vehicles. NEJM Catalyst 2022. [Crossref]

Cite this article as: Ashraf S, Adams E, Faes L, Fu DJ. The rise of multimodal foundation models in medicine and ophthalmology. Ann Eye Sci 2026;11:3.
