Performance Factor of Distributed Processing of Machine Learning using Spark

류우석

doi:10.13067/JKIECS.2021.16.1.19

This paper analyzes the factors that affect performance when machine learning is processed in a distributed manner with Apache Spark, and presents, through experiments, an execution environment for efficient distributed processing. First, the performance factors to consider when running machine learning on a distributed cluster are classified into cluster performance, data size, and the configuration of the Spark engine, and each is analyzed. Then, the performance of regression analysis with Spark MLlib running on a Hadoop cluster is measured while varying the node configuration and the Spark Executor settings. The experiments show that the optimal number of Executors is affected by the number of data blocks, but that, depending on the cluster size, its maximum and minimum are limited by the number of cores and the number of worker nodes, respectively.
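
The bound on the Executor count stated in the abstract can be illustrated with a small sketch. This is not the paper's procedure: the function name, the parameter names, and the choice of the data block count as the starting guess are assumptions made for illustration, and the example numbers are hypothetical. The only part taken from the abstract is that the count stays between the number of worker nodes and the number of cores.

```python
def clamp_executor_count(data_blocks: int, worker_nodes: int, total_cores: int) -> int:
    """Return a candidate Executor count bounded as the abstract describes.

    Assumption (not from the paper): the number of data blocks is used as
    the initial guess. The abstract only says the optimum is "affected by"
    the block count, so treat the result as a starting point for tuning.
    """
    if worker_nodes < 1:
        raise ValueError("worker_nodes must be at least 1")
    if total_cores < worker_nodes:
        raise ValueError("total_cores must be at least worker_nodes")
    # Lower bound: number of worker nodes. Upper bound: number of cores.
    return max(worker_nodes, min(data_blocks, total_cores))


# Hypothetical cluster: 3 worker nodes with 12 cores in total.
for blocks in (2, 8, 40):
    print(blocks, clamp_executor_count(blocks, worker_nodes=3, total_cores=12))
```

A value obtained this way could be tried as the `spark.executor.instances` setting and then adjusted by measurement, which is how the paper arrives at its own conclusion.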
