Exploring the Potential of Generative AI in Essay-Based Assessments: Evidence from the Regents Exam Data

안해연

doi:10.31158/JEEV.2025.38.3.823

This study empirically examined the potential of generative AI in constructed-response (short-answer and essay) assessment by comparing the scoring performance of GPT-4o, Gemini 2.0, and Gemini 2.5 on written responses from the New York State Regents Examinations. Agreement with the reference scores was measured with the Quadratic Weighted Kappa (QWK), Mean Absolute Error (MAE), and Pearson Correlation Coefficient (PCC). All models showed high accuracy, with QWK ranging from 0.889 to 0.935, MAE from 0.210 to 0.410, and PCC from 0.904 to 0.944. Gemini 2.5 performed best on evidence-based argument items, while GPT-4o performed best on text-analysis items. Confusion matrix analysis showed that most errors fell within ±1 point, although some limitations remained, including confusion between adjacent score levels and overscoring of responses that merited a score of 0. Unlike prior research, this study used a detailed scoring rubric and anchor papers for each score level to calibrate distinctions between levels, and applied system prompts designed to mitigate large language model bias. The findings suggest that generative AI can serve as a reliable tool for scoring constructed responses, and the study proposes a collaborative human-AI scoring framework.
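For readers unfamiliar with the three agreement metrics named in the abstract, the following is a minimal sketch of how QWK, MAE, and PCC are conventionally computed from paired scores. It is not the study's own code: the function names, the 0 to 4 score range, and the sample scores are all hypothetical and chosen only for illustration.

```python
from math import sqrt


def quadratic_weighted_kappa(human, model, min_score, max_score):
    """Standard QWK between two lists of integer scores on the same scale."""
    k = max_score - min_score + 1
    n = len(human)
    # Observed confusion matrix: rows = human score, columns = model score.
    observed = [[0] * k for _ in range(k)]
    for h, m in zip(human, model):
        observed[h - min_score][m - min_score] += 1
    hist_h = [sum(row) for row in observed]
    hist_m = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    num = 0.0  # weighted observed disagreement
    den = 0.0  # weighted disagreement expected by chance
    for i in range(k):
        for j in range(k):
            w = (i - j) ** 2 / (k - 1) ** 2  # quadratic penalty
            num += w * observed[i][j]
            den += w * hist_h[i] * hist_m[j] / n
    return 1.0 - num / den


def mean_absolute_error(human, model):
    """Average absolute gap between human and model scores."""
    return sum(abs(h - m) for h, m in zip(human, model)) / len(human)


def pearson_correlation(human, model):
    """Pearson correlation coefficient between the two score lists."""
    n = len(human)
    mean_h = sum(human) / n
    mean_m = sum(model) / n
    cov = sum((h - mean_h) * (m - mean_m) for h, m in zip(human, model))
    var_h = sum((h - mean_h) ** 2 for h in human)
    var_m = sum((m - mean_m) ** 2 for m in model)
    return cov / sqrt(var_h * var_m)


if __name__ == "__main__":
    # Hypothetical scores on an assumed 0-4 scale (not data from the study).
    human_scores = [0, 1, 2, 3, 4, 2, 3, 1]
    model_scores = [1, 1, 2, 3, 3, 2, 4, 1]
    print("QWK:", quadratic_weighted_kappa(human_scores, model_scores, 0, 4))
    print("MAE:", mean_absolute_error(human_scores, model_scores))
    print("PCC:", pearson_correlation(human_scores, model_scores))
```

QWK penalizes a disagreement by the square of its distance, so a model that is off by two score levels is penalized four times as heavily as one that is off by one. MAE, by contrast, reports the average gap in raw score points, which is why the abstract reports both.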
