Implementation of KoBART-Based Real-Time Long-News Summarization System Using Text Segmentation

김명권; 이상록

doi:10.29279/jitr.k.2024.29.3.27

This paper implements a real-time summarization system for long news articles based on the KoBART model. Owing to its characteristics, the KoBART model cannot summarize news with a token length of 1024 or more. To compensate, the system divides a long article into paragraphs, summarizes each paragraph, and then re-summarizes the resulting summary sentences. We evaluated performance on the accredited AI Hub dataset and thereby validated the implemented two-stage summarization method. However, because most of the news in the AI Hub dataset has a token length of 1024 or less, we also analyzed summarization performance on long news using a dataset provided by Hugging Face with token lengths of 1024 or more. When long news with a token length of 1024 or more is divided into paragraphs of 512 tokens and summarized, the average ROUGE score is 33.99% and the runtime required for summarization is 0.8492 s. We therefore confirmed that the implemented long-news summarization system can provide real-time service even for long news with a token length of 1024 or more.
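The two-stage procedure described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `split_into_chunks` and `two_stage_summarize` are hypothetical, whitespace tokenization stands in for the model's subword tokenizer, and the `summarize` callable is a placeholder for a call to a KoBART summarization model. Only the chunk size of 512 and the 1024-token limit are taken from the abstract.

```python
from typing import Callable, List


def split_into_chunks(tokens: List[str], chunk_size: int = 512) -> List[List[str]]:
    """Split a token list into consecutive chunks of at most chunk_size tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


def two_stage_summarize(
    text: str,
    summarize: Callable[[str], str],
    chunk_size: int = 512,
    max_tokens: int = 1024,
) -> str:
    """Summarize text directly if it is short, otherwise in two stages."""
    # Assumption: whitespace tokens approximate the model's subword tokens.
    tokens = text.split()

    # Short input: a single pass through the summarizer is enough.
    if len(tokens) < max_tokens:
        return summarize(text)

    # Stage 1: summarize each chunk independently.
    chunks = split_into_chunks(tokens, chunk_size)
    partial_summaries = [summarize(" ".join(chunk)) for chunk in chunks]

    # Stage 2: re-summarize the concatenated partial summaries.
    # This sketch assumes the concatenation fits within the model limit.
    return summarize(" ".join(partial_summaries))
```

In practice `summarize` would wrap tokenization, generation, and decoding for the chosen model. Passing it as a callable keeps the splitting logic independent of any particular library.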
