A Corpus-Linguistics-Based Comparison of AI-Aided Writing and Students' Writing
DOI: https://doi.org/10.62517/jbdc.202401420
Author(s)
Zheng Wang
Affiliation(s)
Xiamen University Tan Kah Kee College, Xiamen, Fujian, China
Abstract
This study investigates the linguistic and stylistic differences between AI-aided writing and students' writing through a corpus-linguistics-based analysis. Two corpora were constructed, each consisting of 20 essays. The corpora were analyzed using AntConc 4.2.0 and compared against the Brown and Frown reference corpora, which serve as a benchmark for modern American English. Quantitative indicators such as type-token ratio (TTR), word length, sentence length, high-frequency words, and entropy were calculated to identify distinctive linguistic features and patterns. The results reveal notable differences between the two corpora. AI-aided writing uses longer words than students' writing, indicating a more sophisticated vocabulary in AI-generated texts. In contrast, students' writing exhibits greater lexical variety and more syntactically flexible sentences. These findings suggest that AI-aided writing aligns more closely with the lexical sophistication of standard American English, as represented by the Brown and Frown corpora, while students' writing reflects a simpler and more conversational style. The study offers insights into the linguistic characteristics of AI-aided writing and its implications for education, language learning, and the evolving role of AI in writing practices.
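The indicators named above can be illustrated with a short Python sketch. This is not the study's procedure, which relied on AntConc 4.2.0. The function names, the regex tokenizer, and the naive sentence splitter are all assumptions made for illustration, and entropy is taken here to mean Shannon entropy over the word-frequency distribution, which may differ from the exact measure the study computed.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumption: a token is a lowercase alphabetic string, optionally with
    # internal apostrophes (e.g. "students'" -> "students", "don't" -> "don't").
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())


def split_sentences(text):
    # Naive splitter on ., ! and ?; abbreviations are not handled.
    return [s for s in re.split(r"[.!?]+", text) if s.strip()]


def indicators(text):
    """Return basic lexical indicators for one text (illustrative only)."""
    tokens = tokenize(text)
    if not tokens:
        raise ValueError("text contains no word tokens")
    counts = Counter(tokens)
    n = len(tokens)
    sentences = split_sentences(text)
    # Shannon entropy (in bits) of the word-frequency distribution.
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return {
        "tokens": n,
        "types": len(counts),
        "ttr": len(counts) / n,  # type-token ratio
        "mean_word_length": sum(len(t) for t in tokens) / n,
        "mean_sentence_length": n / len(sentences),  # words per sentence
        "entropy_bits": entropy,
        "top_words": counts.most_common(5),  # high-frequency words
    }


if __name__ == "__main__":
    sample = "The cat sat. The dog sat!"
    for name, value in indicators(sample).items():
        print(name, value)
```

Because TTR tends to fall as a text gets longer, values from this sketch are only comparable across texts or corpora of similar size.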
Keywords
Corpus-Linguistics-Based; Comparison; AI-Aided Writing; Students' Writing