Explainable Crop Classification Using a BERT-Based Bidirectional Attention Multimodal Transformer

김명훈; 국선호; 이관형; 김지수

doi:10.37675/jat.2025.00759

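As climate change accelerates and the global food security crisis deepens, techniques that classify crop status accurately and reliably under diverse environmental conditions are becoming increasingly important. Existing crop classification models have mainly pursued higher accuracy by learning spectral features or time-series patterns from satellite imagery, but their black-box nature makes it hard to explain on what grounds a given crop was classified, which has limited their use in real agricultural decision-making. In addition, systems that require expensive sensors or large-scale infrastructure are difficult to deploy in developing countries and in regions with a weak agricultural base, and so risk widening the technology gap. This study therefore applies a BERT-based bidirectional attention mechanism to a multimodal Transformer architecture, aiming to maintain crop classification performance while providing explainability that allows the basis of each prediction to be interpreted clearly. The proposed BERT Hybrid model uses a PVT backbone to extract spatial patterns from Sentinel-2 satellite imagery, combines them with meteorological time-series embeddings, and then jointly learns cross-temporal and cross-modal interactions through bidirectional attention. Comparative experiments were also conducted under the same conditions as the existing MMST-ViT (Multi-Modal Spatial-Temporal Vision Transformer) model, quantitatively analyzing not only overall accuracy but also the temporal attention distribution across growth stages and the importance of weather variables. The results show that bidirectional attention adopts different learning strategies along the temporal axis and the variable axis. Along the temporal axis, it concentrates selectively on important growth stages (flowering and fruiting), clearly identifying the key periods; along the variable axis, it weighs diverse meteorological factors in a balanced way, avoiding excessive dependence on any single variable. This dual behavior means that the model is selective about 'when' to attend and comprehensive about 'what' to consider, a pattern that proved useful for interpretability. By characterizing the trade-off between accuracy and explainability in multimodal agricultural AI models, this study provides an important foundation for building reliable deep-learning-based agricultural analysis systems. In particular, by using only open data that is accessible anywhere in the world, such as Sentinel-2 satellite imagery and public meteorological data, it shows that a low-cost agricultural monitoring system that does not depend on expensive equipment or complex infrastructure can be implemented. This is an appropriate-technology approach usable even in regions with limited resources, technology, and infrastructure, with the potential to narrow the agricultural information gap and support sustainable decision-making.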

Accelerating climate change and the intensifying global food security crisis have increased the importance of reliable crop classification across diverse environmental conditions. Existing crop classification models have primarily focused on improving accuracy by learning spectral and temporal patterns from satellite imagery; however, their black-box nature makes it difficult to understand the rationale behind each prediction, limiting their applicability in real-world agricultural decision-making. To address this issue, this study introduces a multimodal Transformer model that incorporates a BERT-based bidirectional attention mechanism, aiming to retain classification performance while enhancing interpretability. The proposed BERT Hybrid model employs a PVT backbone to extract spatial features from Sentinel-2 satellite imagery and integrates them with meteorological time-series embeddings; bidirectional self-attention is then used to jointly model cross-temporal and cross-modal interactions. We further conduct comparative experiments under the same conditions as the MMST-ViT (Multi-Modal Spatial-Temporal Vision Transformer) baseline, evaluating not only overall accuracy but also temporal attention patterns across crop growth stages and the relative importance of different weather variables. Experimental results show that bidirectional attention alleviates excessive focus on specific timestamps or single variables, producing more consistent and interpretable attention distributions. This study highlights the performance-interpretability trade-off in multimodal agricultural AI models and provides a foundation for building trustworthy deep learning systems for crop monitoring. In addition, because the proposed approach relies solely on globally accessible Sentinel-2 satellite imagery and publicly available meteorological data, it demonstrates the potential for constructing large-scale crop monitoring systems at low cost, aligning with the principles of appropriate technology.
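
The fusion step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes a single attention head, random stand-in features in place of real PVT and weather embeddings, and hypothetical sizes (`T`, `d`) and names (`bidirectional_attention`, `temporal_importance`). "Bidirectional" here only means that no causal mask is applied, so every token can attend to every other token across time steps and modalities.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def bidirectional_attention(tokens, w_q, w_k, w_v):
    """Single-head self-attention with no causal mask (hypothetical helper)."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
T, d = 6, 8  # assumed: 6 time steps, 8-dimensional embeddings

# Random stand-ins for per-time-step image features and weather embeddings.
image_tokens = rng.normal(size=(T, d))
weather_tokens = rng.normal(size=(T, d))
tokens = np.concatenate([image_tokens, weather_tokens], axis=0)  # shape (2T, d)

w_q, w_k, w_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
fused, weights = bidirectional_attention(tokens, w_q, w_k, w_v)

# Attention received by each token, averaged over all query tokens,
# then summed over the two modalities to get one score per time step.
received = weights.mean(axis=0)                    # shape (2T,), sums to 1
temporal_importance = received[:T] + received[T:]  # shape (T,), sums to 1

# Entropy of the temporal distribution: lower values mean attention is
# concentrated on fewer time steps, higher values mean it is spread out.
entropy = -(temporal_importance * np.log(temporal_importance)).sum()
print(temporal_importance.round(3), round(float(entropy), 3))
```

In a trained model, a temporal distribution of this kind could be compared against growth stages, and the same aggregation applied along the weather-variable axis would give a per-variable score. The entropy is one possible way to quantify "selective" versus "balanced" attention and is an assumption of this sketch, not a metric stated in the abstract.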


(0)

(0)

(0)

(0)

BERT 기반 양방향 어텐션 멀티모달 트랜스포머를 활용한 설명가능한 작물 분류 모델 연구 Explainable Crop Classification Using a BERT-Based Bidirectional Attention Multimodal Transformer

(0)

(0) 팝업 열기 팝업 닫기

(0)

(0)

BERT 기반 양방향 어텐션 멀티모달 트랜스포머를 활용한 설명가능한 작물 분류 모델 연구
Explainable Crop Classification Using a BERT-Based Bidirectional Attention Multimodal Transformer

(0)