Derestricting web corpus building

March 31, 2005

Language scientists and developers need corpora, but licensing them (no, not the linguists) can be costly. Either the corpora are too expensive, or they come with heavy restrictions on making versions or using them for commercial purposes, or they add a heavy administrative overhead for gaining permission from all parties involved. In an effort to make it easier to build corpora from existing web resources, Björn Lindström in Uppsala has come up with ‘Creative Commons for Corpus Construction’, which collects and parses web pages and checks their metadata to see whether they carry an appropriate Creative Commons license. He’s found that the amount of material on the web licensed under Creative Commons licenses is “more than enough to build a large corpus”. The next step is doing something interesting with the corpus.
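
To give a feel for what such a metadata check might involve, here is a minimal Python sketch. It is not Lindström's implementation: it assumes a page declares its license through an `<a>` or `<link>` tag marked `rel="license"` that points at creativecommons.org, which is only one of the ways Creative Commons metadata can be embedded. The names `LicenseLinkParser`, `creative_commons_licenses` and `allows_commercial_use` are made up for this illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LicenseLinkParser(HTMLParser):
    """Collect href values of <a> and <link> tags marked rel="license"."""

    def __init__(self):
        super().__init__()
        self.license_urls = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):
            return
        attr = dict(attrs)
        # rel may hold several space-separated values, or be missing entirely
        rel_values = (attr.get("rel") or "").lower().split()
        href = attr.get("href")
        if "license" in rel_values and href:
            self.license_urls.append(href)


def creative_commons_licenses(html_text):
    """Return the Creative Commons license URLs declared in a page (assumed convention)."""
    parser = LicenseLinkParser()
    parser.feed(html_text)
    parser.close()
    found = []
    for url in parser.license_urls:
        parsed = urlparse(url)
        host = (parsed.hostname or "").lower()
        if host == "creativecommons.org" and parsed.path.startswith("/licenses/"):
            found.append(url)
    return found


def allows_commercial_use(license_url):
    """True unless the license code (e.g. "by-nc-sa") contains the "nc" element."""
    parts = urlparse(license_url).path.strip("/").split("/")
    code = parts[1] if len(parts) > 1 else ""
    return "nc" not in code.split("-")
```

A crawler built along these lines would keep only pages for which `creative_commons_licenses` returns something, and could then filter further with `allows_commercial_use` if the corpus is meant for commercial work.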

Andrew Joscelyne

European, a language technology industry watcher since Electric Word was first published, sometime journalist, consultant, market analyst and animateur of projects. Interested in technologies for augmenting human intellectual endeavour, multilingual méssage, the history of language machines, the future of translation, and the life of the digital mindset.


MultiLingual Media LLC
