STEMM Institute Press
Science, Technology, Engineering, Management and Medicine
Assessing the Effectiveness of Large-Scale Artificial Intelligence Models in the Field of Medicine Using Statistical Methods
DOI: https://doi.org/10.62517/jmhs.202605119
Author(s)
Zirui Guo
Affiliation(s)
Shandong Normal University Affiliated Middle School, Jinan, Shandong, China
Abstract
With the remarkable progress of Large Language Models (LLMs) in natural language processing, artificial intelligence is undergoing significant technological advances and paradigm shifts in the medical field. These developments highlight the immense potential of LLMs for optimizing medical service workflows and improving patient treatment outcomes. Despite this progress, however, LLMs still face numerous challenges in medical scenarios, including limited reasoning capability, the "model hallucination" problem, and the safety risks involved in consultations. This study therefore explores the application potential and limitations of LLMs in practical medical consultations. Building on current evaluation methods for large language models and drawing on real clinical cases, the research focuses on the consultation process in surgical outpatient clinics and comprehensively assesses the performance of mainstream domestic and international LLMs (Tongyi Qianwen, Doubao, ERNIE Bot, HuatuoGPT, Zuoshou GPT, and Dr. ChatGPT) in surgical consultation scenarios across three distinct stages and multiple dimensions. In addition, because safety assessments of medical-specific LLMs remain relatively scarce, this study designed 30 safety evaluation questions to probe the potential risks of using these models in real consultations. Through comparative experimental analysis, the research reveals the potential advantages of current LLMs in surgical consultations while also identifying their flaws and performance bottlenecks. The study provides a valuable reference for future research on medical LLMs and recommends expanding the scale of test datasets and increasing the diversity of test subjects to further promote the development of domestic LLMs.
Keywords
Large Language Models; Medical Consultation; Safety Assessment; Surgical Outpatient Clinic
Copyright © 2020-2035 STEMM Institute Press. All Rights Reserved.