Topic-Field Computation Using Field-Associated Terms and Passage Retrieval

Korea Information Processing Society (KIPS)
The KIPS Transactions: Part B
Vol. 12, No. 1
2005.02

57 - 68 (12 pages)


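Passage retrieval, which uses auxiliary information embedded in the text to separate a document into its actual units of meaning, is an important technique. This paper describes a technique that isolates only the passages matching a document's field and extracts the passages that fit the user's request. We extract field-associated terms from the document and show experimentally how to measure, sentence by sentence, how the topic field grows, shrinks, and shifts. For long documents, the method identifies which topics appear, measures the points where a topic continues or changes, and computes passage boundaries by field. In experiments on 12,500 Korean newspaper articles, the method achieved 88% precision and 78% recall.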

It is important to segment a text in a way that is independent of any text-embedded auxiliary information. This paper presents a technique for dividing the text into field-coherent passages. The presented method is based upon extracting field-associated terms from the text and measuring how the topics grow, shrink, and shift from sentence to sentence. We propose measures of topic continuity and of topic transition and suggest how they could be used to find the boundaries between passages. After collecting 12,500 documents, we obtain 88% average precision and 78% recall on a Korean training set.
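The abstract does not give the paper's actual continuity and transition measures, so the following is only a minimal sketch of the general idea: track a per-field score from sentence to sentence and place a passage boundary where the dominant field changes. The term dictionary `FIELD_TERMS`, the `decay` parameter, and the function names are all hypothetical and chosen for illustration.

```python
# Hypothetical field-associated term dictionary (term -> field).
# Illustrative only; the paper's real term lists are not shown in the abstract.
FIELD_TERMS = {
    "pitcher": "sports", "inning": "sports", "homer": "sports",
    "election": "politics", "parliament": "politics", "vote": "politics",
}


def sentence_fields(sentence):
    """Count the field-associated terms of each field found in one sentence."""
    counts = {}
    for word in sentence.lower().split():
        field = FIELD_TERMS.get(word.strip(".,!?"))
        if field is not None:
            counts[field] = counts.get(field, 0) + 1
    return counts


def segment(sentences, decay=0.5):
    """Group sentences into (field, sentences) passages.

    Assumption: each field keeps a running score that shrinks by `decay`
    every sentence and grows with the terms seen in the current sentence.
    A boundary is placed where the highest-scoring field changes.
    """
    scores = {}
    passages = []
    current, current_field = [], None
    for sentence in sentences:
        for field in scores:  # older evidence fades
            scores[field] *= decay
        for field, n in sentence_fields(sentence).items():
            scores[field] = scores.get(field, 0.0) + n
        dominant = max(scores, key=scores.get) if scores else None
        if current_field is not None and dominant != current_field:
            passages.append((current_field, current))  # topic transition
            current = []
        if dominant is not None:
            current_field = dominant
        current.append(sentence)
    if current:
        passages.append((current_field, current))
    return passages


sentences = [
    "The pitcher threw well in the first inning.",
    "It was a quiet evening.",
    "The election vote was held on Monday.",
    "Parliament met after the vote.",
]
passages = segment(sentences)
```

In this sketch the second sentence contains no field-associated terms, yet it stays in the first passage because the decayed sports score is still the highest; that is one simple way to model topic continuity. A single stray term from another field would not necessarily trigger a boundary, since it must outweigh the accumulated score of the current field.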
