Tokenization and Pretraining Strategies in Genomic Language Models: A Representation Learning Perspective

정원식; 고동현

doi:10.13067/JKIECS.2026.21.2.701

Recent advances in large language models (LLMs) have extended the representation learning paradigm that treats string-based data as language, accelerating research on genomic language models (gLMs) that operate directly on DNA sequences. Unlike natural language, however, genomic sequences lack clear semantic unit boundaries and demand both ultra-long-range dependency modeling and sensitivity at the level of single-nucleotide variants, so a naive transfer of standard natural language processing models faces structural and computational limitations. This paper reviews tokenization as the key design variable of gLMs, systematically organizing fixed-length k-mer, single-nucleotide, and data-driven subword tokenization strategies and analyzing how each affects genomic representations. We further summarize the major pretraining objectives, including masked language modeling, autoregressive modeling, domain-specific objectives, and contrastive learning, and discuss how the combination of tokenization, objective, and model architecture shapes what is learned, for example whether representations center on local motifs or capture long-range interactions. Based on this analysis, we suggest that improving both the performance and the interpretability of gLMs calls for jointly designing genome-aware tokenization that reflects the multi-scale structure of the genome, domain-specific pretraining objectives, and efficient sequence models that account for long-range properties.
