Academic journal

The use of context vectors in determining Thai compounds

In this paper we first discuss the problem of identifying compound words in Thai. We show that the structure of a compound word is often identical to that of a phrase or sentence, so whether a sequence of words is a compound, a phrase, or a sentence can only be decided from the context in which it occurs. There is therefore no clear-cut way to determine the boundary of a compound, although the longer a word sequence is, the less likely it is to be considered a compound. In this study, we focus on extracting two-word compounds from a large corpus using a vector space model. The basic assumption is that the context in which a compound occurs should differ from the contexts in which its parts occur. Two experiments were conducted, one on known compounds and one on general bigrams. The test on known compounds was designed to verify that the cosine similarity between the context vector of a compound and the context vectors of its parts clearly reflects this difference in context. The test on general bigrams was designed to further verify that the cosine similarity between the context vectors of a non-compound bigram and its parts differs from that found for known compounds. When the cosine similarity of context vectors is applied to compound candidates that have been extracted from a large corpus and ranked by collocation statistics, compounds are identified correctly with an F-measure of 0.81. The results indicate that the cosine similarity of context vectors is useful for determining compounds in Thai. (Chulalongkorn University)
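The comparison described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the toy English corpus, the window size of 2, the use of raw co-occurrence counts as vector weights, and the hypothetical helper names `context_vector` and `cosine`. The sketch only shows the general idea of comparing the context vector of a bigram with the context vectors of its parts.

```python
from collections import Counter
from math import sqrt


def context_vector(sentences, target, window=2):
    """Count the words found within `window` tokens of each occurrence of `target`.

    `target` is a tuple of one or more consecutive tokens, so the same
    function serves for a bigram and for each of its parts.
    """
    n = len(target)
    counts = Counter()
    for tokens in sentences:
        for i in range(len(tokens) - n + 1):
            if tuple(tokens[i:i + n]) == target:
                left = tokens[max(0, i - window):i]
                right = tokens[i + n:i + n + window]
                counts.update(left + right)
    return counts


def cosine(u, v):
    """Cosine similarity of two sparse count vectors (0.0 if either is empty)."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy, already-tokenized corpus (assumption: real Thai text would first
# need word segmentation, which is outside this sketch).
corpus = [
    ["i", "ate", "a", "hot", "dog", "today"],
    ["she", "ate", "a", "hot", "dog", "yesterday"],
    ["the", "dog", "barked", "at", "the", "cat"],
    ["the", "soup", "is", "very", "hot"],
]

bigram = ("hot", "dog")
bigram_vec = context_vector(corpus, bigram)
for part in bigram:
    part_vec = context_vector(corpus, (part,))
    print(part, round(cosine(bigram_vec, part_vec), 3))
```

Under the abstract's assumption, a low similarity between the bigram and its parts would count as evidence that the bigram behaves as a compound. Note that in this simplified version the part vectors also include occurrences of the part inside the bigram itself; whether to exclude those, and where to set any decision threshold, are design choices the abstract does not specify.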

Abstract

1. Introduction

2. Thai Compounds

3. Framework of Analysis

4. Compound Extraction

5. Findings

6. Analysis of Context Vector on Compound Types

7. Discussion

References
