Research Letter
Abstract
This study analyzed the ability of GPT-4o to identify knee osteoarthritis on X-ray images and found that the model had good sensitivity but poor specificity; patients and clinicians should exercise caution when using GPT-4o for image analysis in knee osteoarthritis.
JMIR Biomed Eng 2025;10:e67481; doi:10.2196/67481
Introduction
Osteoarthritis often affects the knee, causing pain and disability, and is typically diagnosed by X-ray [ ]. Advancements in artificial intelligence (AI) offer the potential to automate image analysis, reducing diagnostic burden [ ]. Given their widespread availability, tools like ChatGPT have potential as point-of-care diagnostic aids. AI has already been incorporated on the physician side through clinical decision support systems and robotic surgery. On the patient side, AI is used in applications such as virtual health assistants [ ].
Orthopedic surgeons, radiologists, and primary care physicians can use AI tools to streamline their workflows and reduce errors while analyzing imaging for pathologies like osteoarthritis. Moreover, patients use ChatGPT to analyze their imaging to further understand their condition [ ]. The ability of AI to read other radiological images (eg, computed tomography angiograms) has been shown to be subpar [ ]. However, studies have shown that AI can perform well with X-rays [ ]. As such, it is increasingly important for physicians to understand AI’s strengths and limitations to assess its use in imaging and guide patients using AI for self-diagnosis.
Methods
We queried ChatGPT (GPT-4o) and assessed its performance in classifying 500 X-ray images of normal knees and 500 X-ray images of knees with osteoarthritis from a publicly available Kaggle database [ ]. Image labels were verified by consensus among radiologists. A single standardized prompt was used: “This is an x-ray image found on examination, the multiple-choice question is as follows. Based on the x-ray image, does the patient have A) no osteoarthritis, B) osteoarthritis.” Key metrics included accuracy, sensitivity, and specificity. No images were rejected by ChatGPT. The code used for statistical analysis is included in the supplementary file.
Results
The model’s performance in distinguishing osteoarthritis from nonosteoarthritis knee X-rays was mixed. Recall (sensitivity) was high (0.950, 95% CI 0.943-0.964), indicating that the model identified most osteoarthritis cases, whereas specificity was low (0.114, 95% CI 0.104-0.134), indicating a poor ability to correctly identify nonosteoarthritis cases. Precision was 0.517 (95% CI 0.501-0.548), meaning that only about half of the predicted osteoarthritis cases were correct. The F1-score, which balances precision and recall, was 0.670 (95% CI 0.655-0.699), indicating moderate effectiveness. Accuracy was 0.532 (95% CI 0.516-0.563).
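The relationship between these metrics can be made concrete with a 2×2 confusion matrix. The sketch below is illustrative only: the counts (475 true positives, 25 false negatives, 57 true negatives, 443 false positives) are not reported in the letter; they are back-calculated here from the reported sensitivity and specificity and the 500 images per class, and the function name is hypothetical.

```python
def classification_metrics(tp, fn, tn, fp):
    """Return standard binary-classification metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # recall: share of osteoarthritis images labeled osteoarthritis
    specificity = tn / (tn + fp)  # share of normal images labeled normal
    precision = tp / (tp + fp)    # share of osteoarthritis predictions that were correct
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "accuracy": accuracy,
        "f1": f1,
    }


# Assumed counts, back-calculated from the reported rates (not taken from the authors' data).
metrics = classification_metrics(tp=475, fn=25, tn=57, fp=443)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Under these assumed counts, the point estimates match the reported values to three decimal places, which shows how a model that labels nearly every image as osteoarthritis can reach high sensitivity while accuracy stays close to 50% on a balanced dataset.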
The accompanying figure shows sensitivity and specificity.
The binomial test, where the null hypothesis assumed the model’s accuracy was 50% or less, indicated that the model performed significantly better than random chance (P=.02). Additionally, the χ² test (P<.001) indicated a strong dependence between the model’s predictions and the actual labels, demonstrating that its classifications were not purely random. However, the significance of this test should be interpreted with caution, as it does not necessarily reflect high accuracy or clinical reliability.
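Both tests can be sketched with the Python standard library alone. This is not the authors' analysis code; it is an independent illustration that assumes the same back-calculated counts (475 and 25 for osteoarthritis images, 443 and 57 for normal images, by predicted label), an exact one-sided binomial test, and a Pearson χ² test without continuity correction. Whether the authors used these exact variants is not stated, and the function names are hypothetical.

```python
import math


def binomial_upper_tail(successes, trials, p0=0.5):
    """Exact one-sided P(X >= successes) for X ~ Binomial(trials, p0)."""
    total = 0.0
    for k in range(successes, trials + 1):
        # Work in log space so large binomial coefficients do not overflow.
        log_pmf = (
            math.lgamma(trials + 1)
            - math.lgamma(k + 1)
            - math.lgamma(trials - k + 1)
            + k * math.log(p0)
            + (trials - k) * math.log(1 - p0)
        )
        total += math.exp(log_pmf)
    return total


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    statistic = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom, the upper-tail probability equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Assumed counts: 532 correct classifications out of 1000 images.
p_binomial = binomial_upper_tail(successes=532, trials=1000, p0=0.5)

# Rows: true osteoarthritis, true normal. Columns: predicted osteoarthritis, predicted normal.
chi2_stat, p_chi2 = chi_square_2x2(475, 25, 443, 57)

print(f"binomial P = {p_binomial:.3f}")
print(f"chi-square = {chi2_stat:.2f}, P = {p_chi2:.5f}")
```

With these assumed counts, the sketch gives a one-sided binomial P of roughly .02 and a χ² statistic of about 13.6 (P<.001), in line with the reported values. The small binomial P value reflects the sample size of 1000 rather than a large margin over chance, which is why the text cautions against reading significance as clinical reliability.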

Discussion
The model had difficulty distinguishing between “not arthritis” and “arthritis.” While the recall for arthritis was high (0.950), indicating strong performance in identifying true arthritis cases, the low specificity (0.114) reflects a large number of false positives, with many nonarthritis cases misclassified as arthritis. This bias toward predicting arthritis lowered precision (0.517) and accuracy (0.532); similar misclassification issues have been reported in other ChatGPT studies [ ].
This study has limitations. First, the prompt was binary; a binary prompt was used because responses to an open-ended prompt would have been difficult to analyze. Second, the dataset was small; a larger dataset would have yielded more robust conclusions.
Even with its limitations, this study presents important data on GPT-4o’s use in imaging for diagnosing osteoarthritis. Such data are needed because our understanding of how tools like this perform in health care contexts is limited. These results suggest a need for better class balance and improved feature differentiation. Similar misclassification patterns have been noted in previous studies, where overlapping features led to false positives [ ]. A higher-resolution, more comprehensively annotated osteoarthritis dataset could improve model training, enhancing overall accuracy, sensitivity, and specificity. Future work should therefore focus on analyzing larger datasets and refining the model to handle nuanced cases more effectively and improve performance. Using image preprocessing techniques, such as contrast enhancement and noise reduction, and including metadata like medical history and clinical presentation could also help distinguish osteoarthritis from anatomical variations.
Our results suggest that clinicians should use ChatGPT cautiously, and only as a screening tool whose output they validate themselves, to help mitigate misclassification. Clinicians should also educate patients about the risks of using AI for self-diagnosis of osteoarthritis based on X-rays. Despite its shortcomings, AI has potential for developing more reliable diagnostic models for osteoarthritis.
Conflicts of Interest
None declared.
Code for analysis and prompting.
DOCX File, 17 KB
References
- Choi MS, Lee DK. The effect of knee joint traction therapy on pain, physical function, and depression in patients with degenerative arthritis. J Kor Phys Ther. Oct 31, 2019;31(5):317-321.
- Bejarano A. The benefits of artificial intelligence in radiology: transforming healthcare through enhanced diagnostics and workflow efficiency. Rev Contemp Sci Acad Stud. Aug 30, 2023;3(8):1-4.
- Chatterjee I, Ghosh R, Sarkar S, Das K, Kundu M. Revolutionizing innovations and impact of artificial intelligence in healthcare. Int J Multidiscip Res. May 14, 2024;6(3):19333.
- Zhang Z, Citardi D, Wang D, Genc Y, Shan J, Fan X. Patients' perceptions of using artificial intelligence (AI)-based technology to comprehend radiology imaging data. Health Informatics J. 2021;27(2):14604582211011215.
- Young A, Tan K, Tariq F, Jin MX, Bluestone AY. Rogue AI: cautionary cases in neuroradiology and what we can learn from them. Cureus. Mar 2024;16(3):e56317.
- Wu JT, Wong KCL, Gur Y, Ansari N, Karargyris A, Sharma A, et al. Comparison of chest radiograph interpretations by artificial intelligence algorithm vs radiology residents. JAMA Netw Open. Oct 01, 2020;3(10):e2022779.
- Kabir F. Osteoarthritis prediction. Kaggle. URL: https://www.kaggle.com/datasets/farjanakabirsamanta/osteoarthritis-prediction [accessed 2024-09-01]
- Dalalah D, Dalalah OM. The false positives and false negatives of generative AI detection tools in education and academic research: the case of ChatGPT. Int J Manag Educ. Jul 2023;21(2):100822.
- Truhn D, Weber CD, Braun BJ, Bressem K, Kather JN, Kuhl C, et al. A pilot study on the efficacy of GPT-4 in providing orthopedic treatment recommendations from MRI reports. Sci Rep. Dec 17, 2023;13(1):20159.
Abbreviations
AI: artificial intelligence
Edited by S Rizvi, T Leung; submitted 12.10.24; peer-reviewed by Y Chaibi, A Jahnen, M Nayak; comments to author 25.02.25; revised version received 13.03.25; accepted 25.03.25; published 23.04.25.
Copyright © Mihir Tandon, Nitin Chetla, Adarsh Mallepally, Botan Zebari, Sai Samayamanthula, Jonathan Silva, Swapna Vaja, John Chen, Matthew Cullen, Kunal Sukhija. Originally published in JMIR Biomedical Engineering (http://biomedeng.jmir.org), 23.04.2025.
This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Biomedical Engineering, is properly cited. The complete bibliographic information, a link to the original publication on https://biomedeng.jmir.org/, as well as this copyright and license information must be included.