A Meta-Analysis of the Utility of Automated Scoring for Essay and Constructed-Response Assessments in South Korea

신윤범; 김하정; 원효헌

doi:10.31158/JEEV.2025.38.4.999

This study used a multilevel meta-analysis to estimate the overall performance of automated essay scoring (AES) tools reported in Korean research on constructed-response and essay assessments, and to identify the sources of variability in that performance. The analytic corpus comprised K = 22 independent studies, drawn exclusively from peer-reviewed domestic journals and dissertations, that reported the association between human and automated scores as a Pearson correlation (r); k = 161 effect sizes were extracted from them. To stabilize the effect-size estimates, all correlation coefficients were transformed with Fisher’s z. To account for dependence among multiple effect sizes within a study and to make inference more robust, we fitted a three-level random-effects model with CR2 cluster-robust standard errors and Satterthwaite-adjusted degrees of freedom. The pooled mean effect was r = 0.711 (95% CI [0.620, 0.783]), indicating a high level of agreement between human raters and AES in the Korean domestic research context. Leave-one-out sensitivity analyses and checks for publication (small-study) bias supported the robustness of this estimate. Tests of heterogeneity attributed approximately 81.7% of the total variance to true heterogeneity; notably, between-study (Level 3) variability accounted for about 66%, implying that differences in AES performance stem more from study-level characteristics than from within-study factors. In moderator meta-regression, subject area (English essay, mathematics, liberal arts) showed statistically significant tendencies under conventional model-based z-tests, but all of these became non-significant once CR2-Satterthwaite robust inference was applied.
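As a rough illustration of the Fisher z step described above, the sketch below transforms the reported pooled correlation and its confidence limits to the z scale and back. It is not the authors' analysis code: the function and variable names are hypothetical, and only the three reported figures (0.711, 0.620, 0.783) come from the abstract. Because the interval is presumably constructed on the z scale, its limits should sit roughly symmetrically around the pooled estimate there, even though they are asymmetric on the r scale.

```python
import math


def fisher_z(r: float) -> float:
    """Fisher's z transform of a correlation: z = atanh(r)."""
    return math.atanh(r)


def inverse_fisher_z(z: float) -> float:
    """Back-transform from the z scale to a correlation: r = tanh(z)."""
    return math.tanh(z)


# Values reported in the abstract (pooled r and its 95% CI).
r_pooled, ci_low, ci_high = 0.711, 0.620, 0.783

# Move everything to the z scale, where the interval is assumed to be built.
z_pooled = fisher_z(r_pooled)
z_low, z_high = fisher_z(ci_low), fisher_z(ci_high)

# Midpoint of the z-scale interval; it should be close to z_pooled.
z_mid = (z_low + z_high) / 2

print(f"z (pooled)        = {z_pooled:.4f}")
print(f"z interval        = [{z_low:.4f}, {z_high:.4f}]")
print(f"z midpoint        = {z_mid:.4f}")
print(f"r from z midpoint = {inverse_fisher_z(z_mid):.3f}")
```

The z-scale midpoint back-transforms to approximately the reported pooled r, which is consistent with estimation on the z scale followed by conversion of the results to correlations for reporting.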

