A Method for Extracting Compound-Word Field-Associated Terms for Use in Korean Document Classification

Korean Institute of Information Scientists and Engineers (KIISE)
Journal of KIISE: Software and Applications
Vol. 32, No. 2
2005.02

636 - 649 (14 pages)


Field-associated terms carry field information in the words themselves, so they allow the field of a document to be determined in much the same way a human reader recognizes it. For Korean, we collected a document bank of approximately 15,000 documents classified into 180 fields and used it to build and evaluate a set of field-associated terms. The 88,782 single field-associated terms were reduced to 8,405, about 9% of the original total, with high extraction accuracy: recall of at least 0.77 (average 0.85) and precision of at least 0.90 (average 0.94). Applying the resulting field-associated terms to the initial field decision in document classification, and comparing the outcome with field decisions made by humans, gave a correct-answer rate of about 90% or higher. The results can serve as basic research on the initial stage of document classification and, when applied to document retrieval across multiple languages, as basic research on multilingual information retrieval.
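The reported reduction can be checked with simple arithmetic, and the evaluation measures follow the standard definitions of precision and recall. The sketch below is illustrative only: the function names are hypothetical, and the true-positive, false-positive, and false-negative counts are made-up values chosen to show the calculation, not data from the paper.

```python
def compression_ratio(original_count: int, reduced_count: int) -> float:
    """Fraction of the original term set that remains after reduction."""
    return reduced_count / original_count


def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Standard precision and recall from raw counts.

    tp: extracted terms that are correct
    fp: extracted terms that are incorrect
    fn: correct terms that were not extracted
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall


# Term counts stated in the abstract: 88,782 terms reduced to 8,405.
ratio = compression_ratio(88_782, 8_405)
print(f"remaining share of terms: {ratio:.1%}")

# Hypothetical counts (an assumption for illustration, not from the paper).
p, r = precision_recall(tp=85, fp=5, fn=15)
print(f"precision={p:.2f}, recall={r:.2f}")
```

With the term counts given in the abstract, the remaining share works out to roughly 9.5% of the original set, consistent with the figure of about 9%.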
