In this study, we compared the degree of semantic relatedness between words as judged by humans with the semantic similarity scores computed by three popular Korean embedding models (LSA, Word2Vec, and FastText) and the ETRI Open API. The results are as follows. First, when the embedding models were evaluated against NIKLex as a performance benchmark, there was a significant difference overall between human judgment and machine judgment. In other words, no current embedding model shows a significant relationship with this Korean evaluation dataset, which is the only human-judgment dataset for the lexical relations data. When the Korean semantic relation data were divided into subgroups by beta coefficient, frequency, and semantic relation type and each subgroup was compared against the embedding models, antonyms, high-frequency words, and word pairs in the fourth quartile of beta coefficients showed relatively high agreement. Thus, although existing embedding models cannot account for human perception of semantic relationships as a whole, they can correlate strongly with human judgment for particular relation types or word characteristics. The FastText model captures low-frequency and out-of-vocabulary words better, responds well to antonyms and to word pairs in the fourth quartile of beta coefficients, and, most importantly, correlates highly with the WordSim353 dataset. This is because, whereas the other embedding models treat the word (or morpheme) as the minimum unit, FastText treats a word as an object that can be decomposed into subwords. We therefore expect that this approach can further improve embedding performance for Korean, whose writing system is simultaneously syllabic and phonemic.
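The subword mechanism credited to FastText above can be illustrated with a minimal sketch of its standard character n-gram extraction (lengths 3 to 6, with `<` and `>` as word-boundary markers, following the published FastText design); the function name and parameters here are illustrative, not part of any model evaluated in this study:

```python
def char_ngrams(word, n_min=3, n_max=6):
    """FastText-style subword extraction: wrap the word in boundary
    markers, then slide windows of length n_min..n_max over it."""
    token = f"<{word}>"
    grams = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(token) - n + 1):
            grams.add(token[i:i + n])
    grams.add(token)  # the full word itself is also kept as a unit
    return sorted(grams)

# Korean "school"; the word's vector is the sum of its subword vectors,
# so even an unseen word still receives a representation.
print(char_ngrams("학교"))  # → ['<학교', '<학교>', '학교>']
```

Because every word vector is composed from these shared n-gram vectors, rare and out-of-vocabulary Korean words can borrow information from morphologically related forms. Some Korean adaptations of FastText reportedly go further and decompose each syllable block into its constituent jamo before extracting n-grams, which would exploit the phonemic layer of Hangul mentioned above.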