This paper describes the construction of a Korean diachronic corpus based on articles published in Chosun Ilbo and Donga Ilbo from 1920 to 2019. Newspapers reflect not only the social but also the linguistic reality of their time, as they convey a wide range of information and ideas in the language of ordinary people. To understand this linguistic reality effectively, such data must be processed into a form that can be analyzed quantitatively. To this end, the spacing and notation of some vocabulary items were modified to meet current norms, and vocabulary listed in various dictionaries was added to the dictionary referenced by the morphological analyzer to improve the detection of lexical units. After this pre-processing, changes in linguistic form were investigated to demonstrate an application of the corpus. The mean number of syllables per word decreased, and sentence length declined continuously. In addition, the proportion of Chinese characters in articles dropped, while the use of Hangul and the Latin alphabet increased.
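The form-level measures mentioned above (script proportions and mean syllables per word) can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names are hypothetical, whitespace-delimited tokens stand in for the units a morphological analyzer would produce, and only precomposed Hangul syllables, CJK ideographs, and ASCII letters are counted.

```python
from collections import Counter

def script_of(ch):
    """Classify one character as 'hangul', 'hanja', 'latin', or None.

    Assumption: only precomposed Hangul syllables, the main CJK ideograph
    blocks, and ASCII letters are counted; everything else is ignored.
    """
    cp = ord(ch)
    if 0xAC00 <= cp <= 0xD7A3:  # Hangul Syllables block
        return "hangul"
    if (0x4E00 <= cp <= 0x9FFF       # CJK Unified Ideographs
            or 0x3400 <= cp <= 0x4DBF   # CJK Extension A
            or 0xF900 <= cp <= 0xFAFF): # CJK Compatibility Ideographs
        return "hanja"
    if "A" <= ch <= "Z" or "a" <= ch <= "z":
        return "latin"
    return None

def script_proportions(text):
    """Return the share of each script among the counted characters."""
    counts = Counter(s for s in map(script_of, text) if s is not None)
    total = sum(counts.values())
    names = ("hangul", "hanja", "latin")
    if total == 0:
        return {name: 0.0 for name in names}
    return {name: counts[name] / total for name in names}

def mean_syllables_per_word(text):
    """Mean number of syllables per whitespace-delimited token.

    Assumption: each Hangul syllable block and each Chinese character
    counts as one syllable; tokens containing neither are skipped.
    """
    lengths = []
    for token in text.split():
        n = sum(1 for ch in token if script_of(ch) in ("hangul", "hanja"))
        if n > 0:
            lengths.append(n)
    return sum(lengths) / len(lengths) if lengths else 0.0
```

Applied per year or per decade to the article text, these functions would give the kind of time series the abstract summarizes. A real analysis would replace the whitespace tokenization with the output of the morphological analyzer.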