Derestricting web corpus building

Language scientists and developers need corpora, but licensing them (the corpora, not the linguists) can be a problem. Either the corpora are too expensive, or they come with heavy restrictions on making derived versions or using them for commercial purposes, or gaining permission from all the parties involved adds a heavy administrative overhead. In an effort to make it easier to build corpora from existing web resources, Björn Lindström in Uppsala has come up with ‘Creative Commons for Corpus Construction’, a project that collects web pages and parses their metadata to check whether they carry an appropriate Creative Commons license. He has found that the amount of material on the web under Creative Commons licenses is “more than enough to build a large corpus”. The next step is doing something interesting with the corpus.
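To give a rough idea of what checking a page's license metadata can look like, here is a minimal sketch in Python. It is not Lindström's implementation, whose details are not described here. It assumes that pages declare their license with the common rel="license" link convention and that the license URL points at creativecommons.org. The names LicenseLinkParser, find_cc_licenses, allows_commercial_use and allows_derivatives are made up for this illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LicenseLinkParser(HTMLParser):
    """Collect href values from <a> and <link> tags marked rel="license"."""

    def __init__(self):
        super().__init__()
        self.license_urls = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):
            return
        attr_map = dict(attrs)
        # rel may hold several space-separated tokens, or be missing
        rel_tokens = (attr_map.get("rel") or "").lower().split()
        href = attr_map.get("href")
        if "license" in rel_tokens and href:
            self.license_urls.append(href)


def find_cc_licenses(html):
    """Return the Creative Commons license URLs declared in an HTML page."""
    parser = LicenseLinkParser()
    parser.feed(html)
    found = []
    for url in parser.license_urls:
        parsed = urlparse(url)
        host = (parsed.hostname or "").lower()
        # Assumption: CC license URLs look like creativecommons.org/licenses/<code>/<version>/
        if host in ("creativecommons.org", "www.creativecommons.org") \
                and parsed.path.startswith("/licenses/"):
            found.append(url)
    return found


def _license_code(license_url):
    """Extract the license code, e.g. 'by-nc-sa', from a CC license URL."""
    parts = urlparse(license_url).path.strip("/").split("/")
    return parts[1].lower() if len(parts) > 1 else ""


def allows_commercial_use(license_url):
    """True unless the license code carries the NonCommercial (nc) element."""
    return "nc" not in _license_code(license_url).split("-")


def allows_derivatives(license_url):
    """True unless the license code carries the NoDerivatives (nd) element."""
    return "nd" not in _license_code(license_url).split("-")
```

A corpus builder along these lines would fetch each candidate page, pass its HTML to find_cc_licenses, and keep only the pages whose license permits the intended use, for example those for which both allows_commercial_use and allows_derivatives return True.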

Andrew Joscelyne
European, a language technology industry watcher since Electric Word was first published, sometime journalist, consultant, market analyst and animateur of projects. Interested in technologies for augmenting human intellectual endeavour, multilingual méssage, the history of language machines, the future of translation, and the life of the digital mindset.
