The advantages and challenges of “big data”: Insights from the 14 billion word iWeb corpus

The iWeb corpus contains nearly 14 billion words from 22 million web pages, and it has been designed so that users can quickly and easily create “Virtual Corpora” that focus on websites related to their areas of interest. The data from this very large corpus provide detailed information on syntactic, morphological, lexical, and semantic phenomena, in ways that would never be possible with a smaller corpus of 100 million or 500 million words. In addition, the corpus offers a number of features that are not available with other large corpora, such as the ability to perform advanced searches of the top 60,000 words in the corpus, and to see a wealth of information on each of these words: definitions, links to images and audio, translations, detailed frequency information, related topics, collocates, word clusters, re-sortable concordance lines, and much more. Finally, we discuss the challenges of large corpora, and how the corpus architecture used for iWeb has been uniquely designed to address these challenges.

1. Introduction

2. Creating the iWeb corpus

3. The advantages of very large corpora for syntax and morphology

4. The advantages of very large corpora for lexis and meaning

5. The challenges of very large corpora

7. Conclusion
