Deep Learning Models Based on Subword Tokenization for Vulnerability Detection of Source Code

김재경

doi:10.38115/asgba.2022.19.3.47

Because web applications are openly accessible and therefore exposed to external attacks, source code vulnerability detection has drawn attention from both industry and academia. This study builds deep learning models for source code vulnerability detection and evaluates their performance. The proposed models address three problems that make vulnerability detection difficult: class imbalance, long-term dependencies, and out-of-vocabulary tokens. In the experiments, the one-dimensional convolution model based on subword tokenization achieved a precision of 39%, about 20 times the 1.92% precision expected from a model that predicts by chance. The Conv1d+BT model, which uses the BERT tokenizer, achieved the highest AUC of 0.9116, but its precision and recall were only 0.39 and 0.35, so further improvement is needed before practical application.
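The comparison against chance can be checked with a few lines of arithmetic. The sketch below assumes that the 1.92% chance precision equals the share of vulnerable samples in the data, which is the expected precision of a classifier that flags samples at random. The sample counts and function names are hypothetical, and the F1 value is derived here from the reported precision and recall rather than taken from the paper.

```python
def chance_precision(num_positive: int, num_total: int) -> float:
    """Expected precision of a classifier that flags samples at random.

    Random flagging is independent of the true label, so the expected
    fraction of flagged samples that are truly positive equals the
    positive-class prevalence.
    """
    return num_positive / num_total


def precision_lift(model_precision: float, baseline_precision: float) -> float:
    """How many times more precise the model is than the baseline."""
    return model_precision / baseline_precision


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, chosen only so that prevalence is 1.92%.
baseline = chance_precision(num_positive=192, num_total=10_000)

# Precision and recall as reported in the abstract.
lift = precision_lift(0.39, baseline)
f1 = f1_score(0.39, 0.35)

print(f"chance precision: {baseline:.4f}")  # 0.0192
print(f"lift over chance: {lift:.1f}x")     # 20.3x
print(f"derived F1:       {f1:.3f}")        # 0.369
```

Under this assumption the reported "about 20 times" corresponds to 0.39 / 0.0192, and it shows why precision has to be read against class prevalence when the data are this imbalanced.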
