Enhancing Alignment Between Large Language Models and Teacher in Open-Ended Assessment through In-Context Learning
- Brain·AI-based Education Research Institute, Korea National University of Education
- Brain, Digital, & Learning
- Vol. 15, No. 3
- 2025.09, pp. 375-401 (27 pages)
- DOI: 10.31216/BDL.2025.15.3.4
This study investigates the effectiveness of in-context learning (ICL) in enhancing the agreement between human teachers and large language models (LLMs) in the context of open-ended assessments. Using a dataset of 485 student responses to six open-ended questions from Korean, Technology, and Social Studies subjects administered in 2024, teacher-generated scores and feedback were collected alongside LLM-generated outputs under varying ICL conditions. Specifically, we provided GPT-4.1 with 0 to 20 examples in prompts to examine whether increasing example count improves agreement between the model and human raters. Quadratic Weighted Kappa (QWK) was used to assess score alignment, and BERTScore measured semantic similarity between teacher and model feedback. Regression and mixed-effects analyses revealed that increasing the number of examples generally improved alignment up to a certain threshold. The strongest improvements occurred with fewer than six examples, beyond which the benefits plateaued or even declined. Additionally, prompt length negatively moderated the effect of example count, suggesting that longer prompts may reduce the model’s capacity to focus on relevant information. These results provide practical guidance for teachers using LLMs in open-ended assessments. Including teacher-generated examples in prompts helps models align more closely with human scoring and feedback. However, the optimal number of examples depends on the type of question and expected answer length: more examples benefit shorter responses, while fewer examples (five or fewer) are more effective for longer or more complex answers.
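The score-alignment metric used in the abstract, Quadratic Weighted Kappa, can be sketched in pure Python. This is an illustrative implementation of the standard QWK formula, not the paper's own code; the function name and the integer score range are assumptions for the example.

```python
def quadratic_weighted_kappa(rater_a, rater_b, min_rating, max_rating):
    """Quadratic Weighted Kappa between two raters' integer scores.

    QWK = 1 - sum(w * O) / sum(w * E), where O is the observed
    agreement matrix, E the chance-expected matrix (outer product of
    the raters' marginal histograms), and w_ij = (i - j)^2 / (n - 1)^2.
    """
    n = max_rating - min_rating + 1
    # Observed agreement matrix: O[i][j] counts pairs (a=i, b=j)
    O = [[0.0] * n for _ in range(n)]
    for a, b in zip(rater_a, rater_b):
        O[a - min_rating][b - min_rating] += 1
    total = len(rater_a)
    # Marginal score histograms for each rater
    hist_a = [sum(row) for row in O]
    hist_b = [sum(O[i][j] for i in range(n)) for j in range(n)]
    num = den = 0.0
    for i in range(n):
        for j in range(n):
            w = ((i - j) ** 2) / ((n - 1) ** 2)  # quadratic penalty
            e = hist_a[i] * hist_b[j] / total    # chance-expected count
            num += w * O[i][j]
            den += w * e
    return 1.0 - num / den
```

Perfect agreement yields 1.0, and chance-level agreement yields 0; the quadratic weights penalize large score discrepancies more heavily than off-by-one disagreements, which is why QWK is a common choice for comparing human and model raters on ordinal rubric scores. Libraries such as scikit-learn offer an equivalent via `cohen_kappa_score(..., weights="quadratic")`.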
Introduction
Materials and Methods
Results
Discussion
Conclusions
References