Korean Unethical Text Filtering Model for Enhancing the Ethicality of Large Language Models (LLMs)

정유남; 김종찬; 신광성

doi:10.13067/JKIECS.2025.20.5.1061

As ethical concerns surrounding large language models (LLMs) grow, this study presents a filtering model for detecting unethical expressions in Korean. Using approximately 360,000 Korean conversational samples, we trained KoBERT and KoELECTRA models as binary and multi-label classifiers over seven categories of unethical content: discrimination, hate, censure, violence, crime, sexually explicit content, and profanity. Applying Low-Rank Adaptation (LoRA) made training parameter-efficient, updating only 0.1% of the total parameters while still improving performance. The KoELECTRA + LoRA model achieved the best binary classification results, with an accuracy of 93.1% and an F1-score of 0.930. In multi-label classification, the KoELECTRA model reached a Micro F1-score of 0.858 and a Macro F1-score of 0.816. The model can contribute to the safety of Korean LLMs and to healthier online communication; future work may address class imbalance and real-time processing.
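The abstract reports both Micro and Macro F1 for the multi-label task and names class imbalance as an open issue. The sketch below is only an illustration of how the two averages differ, not the authors' evaluation code: the function names and the two-label toy data are assumptions made for this example. Micro F1 pools true positives, false positives, and false negatives across all labels, while Macro F1 averages the per-label F1 scores, so a rare label that the classifier misses pulls Macro F1 down much more than Micro F1.

```python
from typing import List, Tuple


def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(
    y_true: List[List[int]], y_pred: List[List[int]]
) -> Tuple[float, float]:
    """Return (micro_f1, macro_f1) for multi-label 0/1 indicator rows."""
    n_labels = len(y_true[0])
    # Per-label counts stored as [tp, fp, fn].
    counts = [[0, 0, 0] for _ in range(n_labels)]
    for true_row, pred_row in zip(y_true, y_pred):
        for j, (t, p) in enumerate(zip(true_row, pred_row)):
            if t and p:
                counts[j][0] += 1  # true positive
            elif p and not t:
                counts[j][1] += 1  # false positive
            elif t and not p:
                counts[j][2] += 1  # false negative

    # Micro: pool the counts over all labels, then compute one F1.
    micro = f1_from_counts(
        sum(c[0] for c in counts),
        sum(c[1] for c in counts),
        sum(c[2] for c in counts),
    )
    # Macro: compute F1 per label, then take the unweighted mean.
    macro = sum(f1_from_counts(*c) for c in counts) / n_labels
    return micro, macro


# Hypothetical toy data: label 0 is frequent, label 1 is rare and is missed.
toy_true = [[1, 0], [1, 0], [1, 0], [0, 1]]
toy_pred = [[1, 0], [1, 0], [1, 0], [0, 0]]
toy_micro, toy_macro = micro_macro_f1(toy_true, toy_pred)
print(f"micro={toy_micro:.3f} macro={toy_macro:.3f}")
```

In the toy data the frequent label is predicted perfectly and the rare label is never predicted, so the pooled Micro F1 stays high while Macro F1 drops to the mean of a perfect and a zero per-label score. A gap in the same direction, with Macro F1 below Micro F1, appears in the reported multi-label results.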
