Seems to me there is a knowledge hole in the language technology communication space. There are lots of R&D sites and community events, a few dedicated news sites that relay press releases, and one notable publication that is gracious enough to host this blog. What's missing is any sense of what gets language technologists going. So here's the first of an occasional series of quickie chats with some lang tech rockers and rollers to find out what's cooking – and why.
Lingster is an open source initiative to collect and share dictionaries for innovative translation automation systems. It was started by Olga Beregovaya, who works by day as a computational linguist for a large software developer in California, and whose night job is building a multilingual content management system with computer-assisted translation capabilities. Olga is a graduate of St. Petersburg State University and UC Berkeley.
"I realized I needed lots of lexicons and glossary data. What was available consisted of very plain monolingual word lists, which makes multilingual alignment hard. And they weren't marked up for morphology or part of speech.
"You can find material on the web, but if it's free and available it's crap. If it's decent, you have to pay US$15,000 or so, which is too much for me. I am planning to include 15 language pairs in my application, and I need up-to-date terms and expressions. So I started Lingster as an open source dictionary portal so others could help me and I in turn could help them.
"The idea is to identify and contribute linguistic data in the form of lexicons, glossaries, idiom lists, etc. The next step is to add part-of-speech and grammatical information to expressions, and then the R&D community can download the resulting resources for free. This way we all benefit.
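To make the idea concrete, here is a minimal sketch of what a part-of-speech-annotated bilingual glossary entry might look like. The field names and structure are purely illustrative assumptions, not Lingster's actual schema:

```python
# A hypothetical example of the kind of entry a dictionary portal might
# collect: a bilingual term annotated with part of speech and morphology.
# All field names here are illustrative, not Lingster's actual format.
entry = {
    "source": {"lang": "en", "term": "bank", "pos": "NOUN", "number": "sg"},
    "target": {"lang": "ru", "term": "банк", "pos": "NOUN", "gender": "m"},
    "domain": "finance",
}

# Part-of-speech markup is what makes alignment tractable: a key of
# (term, pos) lets an aligner keep "bank" (NOUN, finance) apart from
# "bank" (VERB), which a plain word list cannot do.
def entry_key(e):
    s = e["source"]
    return (s["term"], s["pos"])

print(entry_key(entry))  # ('bank', 'NOUN')
```

This is the gap the plain monolingual word lists mentioned above leave open: without the annotations, homographs collapse into a single entry and multilingual alignment breaks down.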
"A lot of other people seem to be suffering from the data problem, as the interest in the last few months has been terrific. Lots of linguists see the value of this sort of initiative, especially since web crawlers programmed to harvest dictionary data end up hitting intellectual property problems.
"Quality is obviously an issue. We have a filtering system to validate terms using community voting, with a volunteer moderator for each language who throws out unacceptable equivalents. We use a UTF-8 control mechanism to stop garbage characters getting into the files, and a conversion mechanism for file formats.
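The UTF-8 control mechanism described above could be sketched along these lines: reject any upload whose bytes do not decode as valid UTF-8, so garbage characters never reach the shared files. This is an assumption about how such a check might work, not Lingster's actual implementation:

```python
# A minimal sketch of a UTF-8 gatekeeper for contributed dictionary
# files (illustrative only; not Lingster's actual code).
def is_clean_utf8(raw: bytes) -> bool:
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Byte sequence is not valid UTF-8 at all.
        return False
    # Also reject the Unicode replacement character U+FFFD, which
    # usually signals an earlier lossy conversion from another encoding.
    return "\ufffd" not in text

print(is_clean_utf8("naïve café".encode("utf-8")))  # True
print(is_clean_utf8("caf\xe9".encode("latin-1")))   # False: stray 0xE9 byte
```

A check like this would sit in front of the format-conversion step, so that moderators and voters only ever see well-formed text.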
"There is no charge for the software, which is all open source, with no hidden dictionary data. We welcome the concept of a community of development, since contributors will be rewarded according to the open source licensing system. Later on, we shall set up a subscription system for those wanting truly premium data."
And what exactly is Olga's pet project? An e-learning facility that turns any web page into a potential language-learning resource, with instant word lookup. The beta's due out in a couple of months.