Enhancing Alignment Between Large Language Models and Teacher in Open-Ended Assessment through In-Context Learning
- Brain·AI-based Education Research Institute, Korea National University of Education
- Brain, Digital, & Learning
- Vol. 15, No. 3
- 2025.09, pp. 375-401 (27 pages)
- DOI: 10.31216/BDL.2025.15.3.4
This study investigates the effectiveness of in-context learning (ICL) in enhancing the agreement between human teachers and large language models (LLMs) in the context of open-ended assessments. Using a dataset of 485 student responses to six open-ended questions from Korean, Technology, and Social Studies subjects administered in 2024, teacher-generated scores and feedback were collected alongside LLM-generated outputs under varying ICL conditions. Specifically, we provided GPT-4.1 with 0 to 20 examples in prompts to examine whether increasing example count improves agreement between the model and human raters. Quadratic Weighted Kappa (QWK) was used to assess score alignment, and BERTScore measured semantic similarity between teacher and model feedback. Regression and mixed-effects analyses revealed that increasing the number of examples generally improved alignment up to a certain threshold. The strongest improvements occurred with fewer than six examples, beyond which the benefits plateaued or even declined. Additionally, prompt length negatively moderated the effect of example count, suggesting that longer prompts may reduce the model’s capacity to focus on relevant information. These results provide practical guidance for teachers using LLMs in open-ended assessments. Including teacher-generated examples in prompts helps models align more closely with human scoring and feedback. However, the optimal number of examples depends on the type of question and expected answer length: more examples benefit shorter responses, while fewer examples (five or fewer) are more effective for longer or more complex answers.
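The score-alignment metric used in the abstract, Quadratic Weighted Kappa, can be sketched in pure Python. This is an illustrative implementation of the standard QWK formula, not the paper's own code; the function name and the integer score range are assumptions for the example.

```python
def quadratic_weighted_kappa(rater_a, rater_b, min_rating, max_rating):
    """Quadratic Weighted Kappa between two raters' integer scores.

    QWK = 1 - sum(w * O) / sum(w * E), where O is the observed
    agreement matrix, E the chance-expected matrix (outer product of
    the raters' marginal histograms), and w_ij = (i - j)^2 / (n - 1)^2.
    """
    n = max_rating - min_rating + 1
    # Observed agreement matrix: O[i][j] counts pairs (a=i, b=j)
    O = [[0.0] * n for _ in range(n)]
    for a, b in zip(rater_a, rater_b):
        O[a - min_rating][b - min_rating] += 1
    total = len(rater_a)
    # Marginal score histograms for each rater
    hist_a = [sum(row) for row in O]
    hist_b = [sum(O[i][j] for i in range(n)) for j in range(n)]
    num = den = 0.0
    for i in range(n):
        for j in range(n):
            w = ((i - j) ** 2) / ((n - 1) ** 2)  # quadratic penalty
            e = hist_a[i] * hist_b[j] / total    # chance-expected count
            num += w * O[i][j]
            den += w * e
    return 1.0 - num / den
```

Perfect agreement yields 1.0, and chance-level agreement yields 0; the quadratic weights penalize large score discrepancies more heavily than off-by-one disagreements, which is why QWK is a common choice for comparing human and model raters on ordinal rubric scores. Libraries such as scikit-learn offer an equivalent via `cohen_kappa_score(..., weights="quadratic")`.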
Introduction
Materials and Methods
Results
Discussion
Conclusions
References