Ilam University Research System | Comparison of Human and AI Ratings of Essays and Papers: A Correlational Study

Title	Comparison of Human and AI Ratings of Essays and Papers: A Correlational Study
Research type	Conference paper
Keywords	Academic writing assessment, AI-powered rating, essay rating, paper rating, rating by humans
Abstract	Generative AI and NLP have opened new horizons in language-related fields such as ELT, and the comparison of human and AI processing of natural language has been of focal interest to researchers. This study aimed to compare the ratings given by human raters and AI chatbots and to determine the correlation between them. To this end, each of the 45 students enrolled in the three classes of an academic writing course at Ilam University, Iran, was asked to write an essay and a paper, each of 1500 words, at the end of the course. All 45 essays and 45 papers were given to 10 human raters and submitted to the ChatGPT and Microsoft Copilot chatbots for rating. A rating rubric was developed and presented in identical form to the human raters and the chatbots. Microsoft Copilot did not provide numerical scores, citing “the subjectivity of the rating process”; the detailed reviews of strengths and weaknesses it generated were therefore given to a separate team of 10 human raters, who converted them into numerical ratings based on the rubric without applying their own opinion or assessment. High inter-rater reliability was confirmed separately within the human rating group and within the conversion group. The human rating was conducted in a double-blind manner, and the chatbots received no input other than the full texts of the essays and papers and the rating prompt. The data were analyzed using the Pearson correlation coefficient and the t-test. The analysis revealed a strong positive correlation between human ratings and the ratings by ChatGPT (r = 0.87), and a strong positive correlation between human ratings and the ratings based on Microsoft Copilot reviews (r = 0.85). The results suggest a high degree of positive correlation between human ratings and ratings by, or based on the outputs of, AI chatbots.
Further investigation with larger samples, a more diverse set of statistical techniques and methods, and different AI tools could shed light on the nature, aspects, and components of the statistical relationship between human ratings and assessments and AI-generated ratings and assessments. More insight into the similarities and differences between human and AI ratings in different contexts and areas can create opportunities for enhanced testing and assessment and support the development of more accurate AI-powered rating and grading tools.
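As a rough illustration of the analysis the abstract describes, the following minimal sketch computes a Pearson correlation coefficient and a t statistic for two sets of scores. It rests on assumptions that the abstract does not confirm: that the 10 human ratings of each text are averaged into a single score, and that the t-test is a paired-samples test (the abstract says only "t-test"). The function names `pearson_r` and `paired_t` and all scores are hypothetical and are not taken from the study.

```python
import math
from statistics import mean, stdev


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def paired_t(x, y):
    """Paired-samples t statistic and degrees of freedom for the differences x - y.

    Assumption: a paired test is used here only for illustration;
    the abstract does not say which t-test the authors ran.
    """
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1


# Hypothetical scores for six texts (NOT data from the study).
# human_mean: assumed average of the human raters' scores for each text.
# ai_score: assumed numerical rating from (or derived from) a chatbot.
human_mean = [14.0, 16.5, 12.0, 18.0, 15.5, 10.0]
ai_score = [13.5, 17.0, 12.5, 17.0, 16.0, 11.0]

r = pearson_r(human_mean, ai_score)
t, df = paired_t(human_mean, ai_score)
print(f"r = {r:.2f}, t = {t:.2f}, df = {df}")
```

A high r indicates only that the two sets of scores rise and fall together; it does not show that they agree in absolute level, which is the kind of difference a test on the score means would address.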
Researchers	Reza Khany (رضا خانی) (first author), محمدمهدی معادی خواه (second author)

Research details