A Study on Application of the Artificial Intelligence-Based Pre-trained Language Model

배재권

doi:10.38115/asgba.2024.21.2.64

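A pre-trained language model (PLM) is a natural language processing model that has been trained in advance (pre-training) on large amounts of text data. PLMs are used in many areas, but in areas where training data containing technical terms is scarce they fail to understand domain-specific terminology. Consequently, the need for domain-specific language models, adapted from BERT (Bidirectional Encoder Representations from Transformers) and GPT (Generative Pretrained Transformer) through additional pre-training, has recently been emphasized. This study analyzes BERT's pre-training method and BERT-based variants (ALBERT, RoBERTa, ELECTRA), and proposes pre-trained language models that can be used in three representative specialized domains: biomedicine, finance, and law. A biomedical pre-trained model is designed to learn domain-specific language characteristics such as biomedical terminology, medical sentence structure, and medical named entity recognition. It is mainly adapted to biomedical tasks through transfer learning based on BERT's pre-training method and architecture. Such a model can be used for a variety of natural language processing tasks, including medical document classification, medical named entity recognition, medical question answering, and biomedical information retrieval. A finance-specific pre-trained model can understand and process financial terminology, financial market trends, and sentence structures related to financial products and services. It can be used to automatically generate news articles on financial market trends and to extract key information by concisely summarizing long texts such as financial reports and press releases. It also helps financial analysts generate investment recommendations on a company's financial condition, performance, and outlook. Finally, a legal pre-trained model is a language model suited to legal documents and is used for legal document classification and summarization and for legal document similarity evaluation. It pre-trains the BERT model on specialized texts from the legal field and thereby learns characteristics specific to legal documents, including the field's specialized terminology, context, and grammar. Its performance on legal tasks can be improved through pre-training from scratch and additional pre-training on legal corpora.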

A pre-trained language model (PLM) is a natural language processing (NLP) model that has been pre-trained on large amounts of text data. PLMs are limited in that they cannot understand domain-specific terminology in areas where training data containing that terminology is scarce. Therefore, the need for domain-specific language models, adapted from BERT or GPT through additional pre-training, has recently been emphasized. In this study, we analyze BERT's pre-training method and BERT-based variants (ALBERT, RoBERTa, ELECTRA) and propose PLMs that can be used in the biomedical, financial, and legal domains. The biomedical pre-trained model is designed to learn domain-specific language characteristics of the biomedical field, such as technical terminology, medical sentence structure, and medical named entity recognition. It is mainly adapted to biomedical tasks through transfer learning based on BERT's pre-training method and architecture. For this purpose, it is pre-trained on biomedical text data, and this pre-training transfers domain-specific knowledge to the model by learning representations of biomedical texts. The finance-specific pre-trained model can understand and process financial terminology, financial market trends, and the sentence structures and vocabulary related to financial products and services. It can be used to generate news articles about financial market trends and to extract key information by concisely summarizing long texts such as financial reports and corporate press releases. Additionally, finance-specific pre-trained models help financial analysts generate investment recommendations based on a company's financial condition, performance, and prospects. The legal-specific pre-trained model is a language model suited to legal documents and is used for legal document classification, summarization, and similarity evaluation. It is created by pre-training the BERT model on specialized texts from the legal field, through which it learns characteristics specific to legal documents. Its performance on legal tasks can be improved through pre-training from scratch and additional pre-training on legal corpora.
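
The abstract refers to BERT's pre-training method without spelling it out. As a rough illustration only, the sketch below shows the token-corruption step of masked language modeling as described in the original BERT paper (about 15% of tokens selected; of those, 80% replaced by a mask symbol, 10% by a random token, 10% left unchanged). It is not taken from this study: the function name `mask_tokens`, the toy vocabulary, and the example sentence are hypothetical, and real implementations also skip special tokens and operate on subword IDs.

```python
import random

MASK = "[MASK]"  # mask symbol used by BERT-style tokenizers


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Corrupt a token list for masked language modeling (simplified sketch).

    Returns (corrupted, labels). labels[i] is the original token at the
    positions the model must predict, and None everywhere else.
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # selected position: target is the original token
            roll = rng.random()
            if roll < 0.8:
                corrupted.append(MASK)               # 80%: mask symbol
            elif roll < 0.9:
                corrupted.append(rng.choice(vocab))  # 10%: random vocabulary token
            else:
                corrupted.append(tok)                # 10%: left unchanged
        else:
            labels.append(None)  # not selected: no prediction target
            corrupted.append(tok)
    return corrupted, labels


# Hypothetical domain sentence, already split into tokens.
sentence = ["the", "patient", "was", "given", "aspirin", "for", "chest", "pain"]
toy_vocab = ["the", "patient", "drug", "dose", "pain", "aspirin", "court", "bond"]
corrupted, labels = mask_tokens(sentence, toy_vocab, mask_prob=0.15, seed=0)
print(corrupted)
print(labels)
```

Under this reading, additional (continued) pre-training for a domain would apply the same objective to domain text such as biomedical, financial, or legal corpora, starting from already pre-trained weights rather than from random initialization.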
