Words are great! We learn them at an early age so that we can communicate. We use them to tell our families what we need, to make new friends, and to learn a variety of subjects during our education. We spend our lives continually learning new words and how to use them, which makes our lives more successful and enjoyable. Who could not love words? My answer: anyone performing localization or translation of natural language!
The use of words creates problems that make these tasks more difficult.
The first problem with words is that there are so many of them!
The sheer number of words makes huge word glossaries necessary. Many of the words in these glossaries have only a small probability of being used, and the probability of a sentence composed of those individual words is smaller still.
These tiny probabilities make it difficult to process natural language text accurately.
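To make the point concrete, here is a tiny sketch of the arithmetic, assuming a simple unigram model in which a sentence's probability is just the product of its word probabilities (the vocabulary and the numbers are invented for illustration):

```python
# Toy unigram model: sentence probability as the product of word probabilities.
# Vocabulary and probability values are invented for illustration only.
word_prob = {
    "the": 0.05,
    "committee": 0.0002,
    "will": 0.01,
    "review": 0.0005,
    "bill": 0.0003,
}

sentence = ["the", "committee", "will", "review", "the", "bill"]

prob = 1.0
for word in sentence:
    prob *= word_prob[word]

print(f"sentence probability: {prob:.1e}")  # 7.5e-16
```

Even a six-word sentence built from modestly rare words ends up with a vanishingly small probability.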
Many words also serve different grammatical purposes at different times. Does the word “bill” act as a proper noun (a name), a verb (to request payment), or a common noun (a legislative proposal)? These multiple possible purposes produce a variety of possible interpretations of the grammatical structure and meaning of a sentence.
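Here is a small sketch of how that ambiguity multiplies across a sentence, assuming each word carries a hypothetical set of possible grammatical roles (the tag sets below are made up for illustration):

```python
from itertools import product

# Hypothetical sets of grammatical roles each word could play.
possible_tags = {
    "bill": {"proper noun", "verb", "noun"},
    "charges": {"verb", "noun"},
}

words = ["bill", "charges"]
readings = list(product(*(possible_tags[w] for w in words)))

# 3 roles for "bill" x 2 roles for "charges" = 6 candidate readings.
print(f"{len(readings)} possible readings")
for reading in readings:
    print(dict(zip(words, reading)))
```

Every ambiguous word multiplies the number of readings a parser has to consider.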
So why are words used as the basis for processing natural language?
The reason may simply be that we have always done it that way!
But, what if we adopted a different basis?
Could a higher-level abstraction of natural language create a different basis that provides an easier approach to localization and translation?
Instead of individual words, this abstraction might use sequences of words as its basis. These sequences would be different from the familiar phrases and clauses of a sentence. Each sequence of words would perform a new, high-level grammatical function, and sequences performing the same function would be grouped into a category.
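One way to picture this basis, as a rough sketch with invented category names and member sequences:

```python
# Hypothetical categories: each groups word sequences that perform the same
# high-level grammatical function. Names and sequences are invented.
categories = {
    "REQUEST_ACTION": [
        ("please", "send", "me"),
        ("could", "you", "provide"),
    ],
    "TIME_REFERENCE": [
        ("by", "the", "end", "of", "the", "week"),
        ("as", "soon", "as", "possible"),
    ],
}

def categorize(sequence):
    """Return the category label for a word sequence, or None if unknown."""
    seq = tuple(sequence)
    for label, members in categories.items():
        if seq in members:
            return label
    return None

print(categorize(["as", "soon", "as", "possible"]))  # TIME_REFERENCE
```

A parser built on this basis would match whole sequences and emit one category label per match, rather than tagging word by word.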
The first advantage of this approach would be a reduction in the number of separate language units that need to be manipulated. The very large number of individual words would be replaced by a small number of categories.
If chosen carefully, each category would perform one, and only one, high-level grammatical function. This one-to-one mapping would eliminate the many possible grammatical readings and meanings of the words in a sentence and replace them with a single grammatical structure. The variability of sentence probabilities would also decrease.
The use of grammatical categories would not increase the low probability of any specific text. However, each category would contain multiple sequences of words, each of which would contribute its probability to the category, increasing the overall probability of that category.
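Continuing the sketch above with invented numbers: a category's probability is the sum of the probabilities of the sequences it contains, so it is always at least as large as the probability of any single member:

```python
# Invented probabilities for the member sequences of one hypothetical category.
sequence_prob = {
    ("by", "the", "end", "of", "the", "week"): 1.2e-9,
    ("as", "soon", "as", "possible"): 4.0e-8,
    ("before", "friday"): 9.0e-10,
}

category_prob = sum(sequence_prob.values())
print(f"largest single sequence: {max(sequence_prob.values()):.1e}")  # 4.0e-08
print(f"whole category:          {category_prob:.1e}")                # 4.2e-08
```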
It would take a large effort to implement this new natural language basis. Some of the challenges that would need to be addressed include common agreement on the set of categories, education of personnel about the new basis, development of new text parsers and glossaries, and the measurement of the probabilities of each text category.
These efforts would yield real benefits: a reduction in the number of separate language units, a single possible grammatical structure for each sentence, less variability in probabilities, and higher probability values.
This change of basis might provide a simpler and easier approach to processing natural language.
What do you think?