TY - JOUR
T1 - “Can We Trust Them?” An Expert Evaluation of Large Language Models to Provide Sleep and Jet Lag Recommendations for Athletes
AU - Athlete Travel and Sleep Interest Group (ATSIG)
AU - Vitale, Jacopo
AU - McCall, Alan
AU - Cina, Andrea
AU - Skorski, Sabrina
AU - Sargent, Charli
AU - Rossiter, Antonia
AU - Roach, Gregory D.
AU - Jansen van Rensburg, Audrey
AU - Nedelec, Mathieu
AU - Miller, Dean
AU - Lastella, Michele
AU - Gupta, Luke
AU - Grandner, Michael
AU - Fullagar, Hugh
AU - Filip-Stachnik, Aleksandra
AU - Dohi, Michiko
AU - Charest, Jonathan
AU - Biggins, Michelle
AU - Bender, Amy
AU - Alonso, Juan Manuel
AU - Janse van Rensburg, Dina C.
AU - Halson, Shona
N1 - Publisher Copyright:
© The Author(s), under exclusive licence to Springer Nature Switzerland AG 2025.
PY - 2025
Y1 - 2025
N2 - Background: With the increasing use of artificial intelligence in healthcare and sports science, large language models (LLMs) are being explored as tools for delivering personalized, evidence-based guidance to athletes. Objective: This study evaluated the capabilities of LLMs (ChatGPT-3.5, ChatGPT-4, and Google Bard) to deliver evidence-based advice on sleep and jet lag for athletes. Methods: Conducted in two phases between January and June 2024, the study first identified ten frequently asked questions on these topics with input from experts and LLMs. In the second phase, 20 experts (mean age 43.9 ± 9.0 years; ten females, ten males) assessed LLM responses using Google Forms surveys administered at two intervals (T1 and T2). Inter-rater reliability was evaluated using Fleiss' Kappa, intra-rater agreement using the Jaccard Similarity Index (JSI), and content validity through the content validity ratio (CVR). Differences among LLMs were analyzed using Friedman and Chi-square tests. Results: Experts' response rates were high (100% at T1 and 95% at T2). Inter-rater reliability was minimal (Fleiss' Kappa: 0.21–0.39), while intra-rater agreement was high, with 53% of experts achieving a JSI ≥ 0.75. ChatGPT-4 had the highest CVR for sleep (0.67) and was the only model with a valid CVR for jet lag (0.68). Google Bard showed the lowest CVR for jet lag (0%), with significant differences compared with ChatGPT-3.5 (p = 0.0073) and ChatGPT-4 (p < 0.0001). Reasons for inappropriate responses varied significantly for jet lag (p < 0.0001), with Google Bard criticized for insufficient information and frequent errors. ChatGPT-4 outperformed the other models overall. Conclusions: This study highlights the potential of LLMs, particularly ChatGPT-4, to provide evidence-based advice on sleep but underscores the need for improved accuracy and validation for jet lag recommendations.
UR - https://www.scopus.com/pages/publications/105018212492
U2 - 10.1007/s40279-025-02303-5
DO - 10.1007/s40279-025-02303-5
M3 - Article
AN - SCOPUS:105018212492
SN - 0112-1642
JO - Sports Medicine
JF - Sports Medicine
ER -