Congree Authoring Server
Linguistic intelligence combined with corporate wording provides intriguing authoring possibilities
Nicole Keller, Ph.D., is a qualified translator for English and Spanish focusing on medical translation and software localization. For the last ten years, she has been in charge of the tool section of MDÜ magazine and the edition magazine, evaluating dozens of traditional and modern tools entering the translation market. She also works as a consultant and trainer for translation technology. Since 2007 she has been teaching at the Institute for Translation and Interpreting.
Congree Language Technology GmbH is a software manufacturer based in the south of Germany. The company specializes in computational linguistics and supports companies of any size in the process of content optimization. It concentrates mainly on the work of technical writers and corporate authors who generate source content. Because the Language Check relies on the Congree Linguistic Engine, which performs a morphological analysis, it is only available for a small selection of languages (English, German, French, and Spanish). For Italian and Japanese, Congree offers a basic Language Check that is not as sophisticated as for the main languages. The other components are language-independent and can be used for all languages.
In the newest version, Congree announced a comprehensive revision of the administration module and many new features for the English language.
Particularly in times in which the use of machine translation (MT) is increasing, the computer-based verification of text quality is becoming increasingly important. In this context, I want to focus on the quality of the source text, even though Congree can also be used for translated texts as long as the language is supported. However, one point is clear: the less ambiguous the source text, the better the translation (MT output or human translation).
The Congree Authoring Server consists of three components offering authoring assistance: linguistic language check, terminology component, and authoring memory. Prices vary depending on the number of concurrent users, selected modules and languages, and the editor environment.
Even without a great demand for translations, corporate language in general plays an important role for companies to individually describe their unique products and create a consistent brand image.
The Congree Authoring Server
The heart of the system is the Congree Authoring Server that is customizable and uses three different resources to optimize a company’s content:
1) The rule-based Congree Language Check ensures the correct use of language and style rules. This covers not only correct spelling, grammar, and terminology but also adherence to predefined Style Guides.
2) The terminology component uses the information of either an existing, third-party terminology system (such as SDL MultiTerm) or an internal terminology list to check for consistent and correct term use.
3) The Authoring Memory is a repository for sentences. It can be compared to the functionality of translation memories in CAT tools. Congree can detect when a new sentence is similar to a text segment that was written and approved in the past. More consistent and identical sentences offer great potential to lower costs in the subsequent translation process because more 100% matches are produced.
On the practical side, Congree offers plugins or add-ons for integration into existing authoring environments (DTP, CMS, or word processing software such as Microsoft Word). The main target tools are those used in technical documentation, marketing, and the translation process. But Congree also offers a neutral web interface, so anyone who wants to check a text, independent of an editor, can use the Congree technology.
Figure 1: Congree and Microsoft Word.
Figure 1 shows three Congree panels, one for each resource, grouped around Microsoft Word. The English sample text for this article is a Word document.
Rule-based Congree Language Check
The rule-based Congree Language Check is the most extensive and diverse check in Congree and covers five different categories: spelling, style, abbreviation, grammar, and terminology. An overview of the number of potential errors and their acceptability (green = safe, yellow = acceptable and red = unsafe) is displayed at the top of the Congree panel (see Figure 2).
Figure 2: Notification categories of the Congree Language Check.
In general, a detected error can be corrected, ignored, or ignored for all occurrences of the potential error at once. If a rule leads to too many false positives, the user can also deactivate it manually while working on a specific text.
Spelling: Congree does not rely on a third-party spellchecker such as Hunspell or the Microsoft spellchecker but has implemented its own, which offers more comprehensive functions. First, it recognizes not only the words in a predefined list but also their inflections, which means that nouns need only be listed in their singular form. Second, the underlying terminology database can be populated with company-specific terms like product names or the terminology of a certain subject area. And third, all terms stored in the terminology database are accepted by the spellchecker, even if they are not in its internal list.
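To make the three points concrete, here is a minimal sketch in Python. It is not Congree's implementation: the word lists are invented, and the suffix stripping is only a crude stand-in for the morphological analysis described above.

```python
# Hypothetical word lists; a real system would use a full lexicon.
LEXICON = {"the", "valve", "open", "slowly", "pump"}
TERMBASE = {"hydroflex"}  # invented company-specific product name


def base_forms(word):
    """Yield naive candidate base forms (rough stand-in for morphology)."""
    yield word
    if word.endswith("es"):
        yield word[:-2]
    if word.endswith("s"):
        yield word[:-1]


def unknown_words(text):
    """Return words found neither in the lexicon nor in the termbase."""
    known = LEXICON | TERMBASE
    words = [w.strip(".,;:!?").lower() for w in text.split()]
    return [w for w in words if w and not any(b in known for b in base_forms(w))]


print(unknown_words("The HydroFlex pumps leek."))
```

Only "leek" is reported here: "pumps" is reduced to the listed singular "pump", and "HydroFlex" is accepted because it is in the assumed term list.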
The spellchecker also checks the correct usage of “a” and “an” before vowels and consonants as well as consistent British or American spelling (analyse vs. analyze or instalment vs. installment). But the system cannot distinguish between typical British (e.g. football) and American (e.g. soccer) words unless they are stored in the terminology database. It actually makes more sense to store them there because in examples like flat vs. apartment, Congree would otherwise always suggest apartment, even if flat is used as an adjective.
For example, if the word “possible” was used, the system could state that the usage of “possible” is too colloquial for many texts, including technical documentation.
Grammar: A large number of rules underlie the integrated grammar check, so I just want to mention some examples. There are general rules, like “Infinitives take no s at the end,” but also more specific ones like “Don’t combine too with a comparative.” Most grammar rules are based on the correct identification of the part of speech. In the sample text, the noun phrase “safety informations” was detected because mass nouns (information) do not have a plural form (see Figure 3).
Figure 3: Wrong usage of plural form.
Style: This is the most comprehensive category in the Congree Language Check. It offers a huge selection of predefined style rules that can be used to create individual style sets. The German association for technical communication (tekom), for example, has published two guidebooks on rule-based writing for technical authors, one for English and one for German. They contain predefined guidelines that also cover a wide variety of rules for translation-oriented authoring.
Additionally, there are Style Guides covering the guidelines for Simplified Technical English (STE). An example rule for STE: Write instructions in the imperative form. A general example: Avoid the use of “should,” “could,” “would.”
On a term basis, Congree can also check for basic style criteria, including language registers (formal, informal, colloquial). This information is stored together with different synonyms.
Company-specific rules can also be added to the existing sets. These rules might come from existing guidelines, which often define the specific usage of terminology but also the maximum length of sentences or whether active or passive voice should be used in the documentation. Machine translation systems in particular benefit from these more standardized texts and generally produce better output.
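A company-specific rule such as a maximum sentence length is easy to picture in code. The following Python sketch is purely illustrative: the limit of 20 words and the naive sentence splitting are assumptions, not Congree settings.

```python
import re

MAX_WORDS = 20  # assumed company limit, not a Congree default


def long_sentences(text, max_words=MAX_WORDS):
    """Return the sentences that exceed the allowed number of words."""
    # Naive split at sentence-final punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in sentences if len(s.split()) > max_words]


sample = "Close the valve. " + " ".join(["word"] * 25) + "."
print(len(long_sentences(sample)))
```

A real style check would need proper sentence segmentation (abbreviations, decimals, list items), which is exactly where a linguistic engine has the advantage over a regular expression.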
Terminology: Terminology errors can only be detected when a terminology database is integrated or an internal terminology list is available. Congree only offers an integrated terminology component with basic terminology functionalities.
To detect the incorrect use of rejected synonyms, which is the case that matters most in practice, the system needs the “usage information,” which means that terms and synonyms are marked as “preferred,” “allowed,” or “deprecated.” Only with this information can Congree detect incorrect usage, even if variants or different spellings (e.g. with and without hyphens) are used. If the terminology database contains contradictory information, Congree suggests that the user check the specific term in context and displays all possible matches.
For example, imagine there are two entries containing the term “improvement.” One is marked as “allowed” and one as “deprecated.” Via direct access to the terminology database, the technical author can read the additional information for both entries and decide which one applies to the actual context.
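The role of the usage information can be sketched in a few lines of Python. This is an illustration under stated assumptions only: the term list and its entries are invented, and hyphen variants are handled by a simple normalization rather than by linguistic analysis.

```python
import re

# Hypothetical term list: term -> (usage status, preferred replacement or None)
TERMS = {
    "power supply": ("preferred", None),
    "power pack": ("deprecated", "power supply"),
    "e-mail": ("deprecated", "email"),
}


def normalize(text):
    """Lowercase and treat hyphens as spaces so spelling variants match."""
    return re.sub(r"[-\s]+", " ", text.lower())


def deprecated_terms(text):
    """Return (deprecated term, preferred term) pairs found in the text."""
    haystack = normalize(text)
    hits = []
    for term, (status, preferred) in TERMS.items():
        pattern = r"\b" + re.escape(normalize(term)) + r"\b"
        if status == "deprecated" and re.search(pattern, haystack):
            hits.append((term, preferred))
    return hits


print(deprecated_terms("Connect the power-pack and send an E-Mail."))
```

Without the status column, the checker could only say that a term is known, not whether it should be replaced.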
Another integrated resource in Congree is the Authoring Memory database. It is a monolingual, sentence-based database storing all sentences that were previously used in other documents or created in the same text. During text creation, Congree checks new sentences against existing ones in the Authoring Memory and suggests similar or identical sentences. The similar matches range from 70% to 99% similarity and are comparable to fuzzy matches in the translation management environment. The aim of this feature is to avoid writing the same idea or instruction in different words. This has two main advantages. First, the statements are clear and cannot be misinterpreted because of different ways of expressing the content. Second, with regard to the translation process, identical sentences save money since they need not be translated twice.
The recognition process works very well. By means of a tracked changes mode, the differences between the sentence in the text and the stored sentence in the Authoring Memory are highlighted in a separate window.
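The matching idea can be approximated with Python's standard difflib module. This is only a sketch: the stored sentences are invented, and the character-based similarity ratio used here is an assumption that will not correspond to Congree's own match percentages. Only the 70% lower bound is taken from the description above.

```python
from difflib import SequenceMatcher

# Invented examples of previously approved sentences.
AUTHORING_MEMORY = [
    "Switch off the device before cleaning.",
    "Wear protective gloves.",
]


def best_match(sentence, memory=AUTHORING_MEMORY, threshold=0.70):
    """Return (stored sentence, similarity) for the best match, or None."""
    scored = [
        (SequenceMatcher(None, sentence.lower(), stored.lower()).ratio(), stored)
        for stored in memory
    ]
    score, stored = max(scored)
    return (stored, score) if score >= threshold else None


print(best_match("Switch the device off before cleaning."))
print(best_match("The warranty expires after two years."))
```

The first call finds the stored instruction as a fuzzy match, so the author can reuse the approved wording. The second falls below the threshold and returns None.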
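A tracked-changes view of this kind boils down to a word-level comparison. The Python sketch below uses difflib again and is an assumption about how such a view could be computed, not a description of Congree's internals.

```python
from difflib import SequenceMatcher


def word_changes(stored, current):
    """List the word-level differences between two sentences."""
    a, b = stored.split(), current.split()
    changes = []
    for tag, i1, i2, j1, j2 in SequenceMatcher(None, a, b).get_opcodes():
        if tag != "equal":
            # tag is "replace", "delete", or "insert"
            changes.append((tag, " ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return changes


print(word_changes("Wear protective gloves.", "Wear safety gloves."))
```

Each tuple names the kind of change, the wording in the stored sentence, and the wording in the new sentence, which is all a display needs to highlight the difference.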
Besides the general text information, Congree can also use formatting information including headline, paragraph, list elements, and so on if a sentence is used several times but with different formatting settings.
On the positive side, there is also an interaction between the Authoring Memory and the Congree Language Check. If there is a 100% match, minor errors detected in the Language Check will be ignored for this specific sentence. On the other hand, there is also the option not to save sentences with major errors in the Authoring Memory.
The third main resource is the terminology database. Either the integrated terminology component or the connection to a third-party terminology database (e.g. SDL MultiTerm) is used as the basis for the terminology check described above. Besides the terminology check, the additional information of the terminological entry is displayed in a separate Congree panel. Here, users can individually select which information they want to see: concept information, synonyms, translations, and/or term information. By clicking on the link next to the term, the web interface of the connected terminology database opens in a browser window (e.g. SDL MultiTerm Online), and the users can see the complete terminological entry in the original terminology database. But this is not absolutely necessary since all the information can be displayed within the Congree panel (Figure 4).
Figure 4: Terminological information displayed in the Congree panel.
In addition to the terms already stored in the database, Congree also suggests new term candidates from the actual text that can optionally be stored as new entries. Since the system is based on a comprehensive linguistic system, the suggested term candidates are more reliable than term candidates from traditional term extraction tools. In particular, multi-word terms are recognized more accurately with linguistic information than with a statistical approach. This does not mean that all the term candidates are actually real terms that would be added to a terminology database, but both the noise (too many useless term candidates) and the silence (term candidates that are not recognized) of terminology extraction tools are reduced. In one sample text, Congree suggested “work step,” which is definitely a good term candidate, but also “normal review” or “individual passage,” which are not real collocations but looser word combinations that would usually not be added to a terminology database.
The Congree Control Center is the heart of the system. All the individual configurations for the different resources can be made here. In my opinion, the most interesting and important section is the “rules” section. Here, users can define all their specific configurations for the Congree Language Check including spelling, grammar, terminology, style, and abbreviations. Some sample rules have already been mentioned. A company, however, can define as many sets as needed, such as Best Practices for Technical Documentation and Simplified Technical English. Depending on your editor, working language, or text type, you can select the appropriate rule configuration. Furthermore, you can define terminology rules for the creation of multi-word terms, add company-specific terms to your user lexicon, define general synonyms, and create notifications and explanations for error types. The notifications contain not only the category (such as style or grammar) and a unique code, but also a keyword that explains what to check, e.g. “Review the word order,” an instruction to help correct the mistake, e.g. “This adverb is in a marked position. Check if it is better to put the adverb behind the auxiliary verb,” and an explanation why, e.g. “Adverbs do not precede the finite auxiliary verb unless they are heavily stressed.” Moreover, you can always add an example that demonstrates the mistake:
WRONG: They often are companies with a high turnover.
CORRECT: They are often companies with a high turnover.
All the text fields mentioned above can be individually created, so that company-specific examples are used that might be clearer than general examples.
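The parts of such a notification map naturally onto a small record. The Python sketch below is a hypothetical data model for illustration: the field names and the code “GR-0001” are invented, while the texts are the ones quoted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Notification:
    category: str      # e.g. "style" or "grammar"
    code: str          # unique rule code (format invented here)
    keyword: str       # what to check
    instruction: str   # how to correct the mistake
    explanation: str   # why the rule exists
    wrong: str = ""    # optional example of the mistake
    correct: str = ""  # optional corrected example


def render(n: Notification) -> str:
    """Format a notification as the plain text an author might see."""
    lines = [f"[{n.category} {n.code}] {n.keyword}", n.instruction, n.explanation]
    if n.wrong and n.correct:
        lines += [f"WRONG: {n.wrong}", f"CORRECT: {n.correct}"]
    return "\n".join(lines)


adverb_position = Notification(
    category="grammar",
    code="GR-0001",  # hypothetical code
    keyword="Review the word order",
    instruction=("This adverb is in a marked position. Check if it is "
                 "better to put the adverb behind the auxiliary verb."),
    explanation=("Adverbs do not precede the finite auxiliary verb "
                 "unless they are heavily stressed."),
    wrong="They often are companies with a high turnover.",
    correct="They are often companies with a high turnover.",
)

print(render(adverb_position))
```

Keeping the example sentences as separate fields is what allows a company to swap in its own, more familiar examples without touching the rule itself.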
In the end, all these individual configurations can be combined in a Style Guide that forms the basis for a document-specific check.
In the Control Center, the Reporting Operator can decide whether data are collected, which data, and how long they are stored. In this area you can also define the “release level,” which is Congree’s quality metric and indicates whether the number of errors in a specific category is safe, acceptable, or unsafe.
If the data collection is activated, the results of each check, independent of the editor, are stored anonymously and can be reviewed by the administrator. Besides general information, like how many grammar mistakes were detected in a document, there is also more detailed information available, including the concrete action each author performed (e.g. corrected the error or ignored the rule).
Finally, Congree creates diagrams with statistical information, including the number of notifications for each check, or how often a rule was disregarded, which could be a hint that the underlying rules might need to be reviewed. The statistics can be restricted to certain periods of time, languages, user groups, and so on (Figure 5).
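How a release level might be derived from error counts can be shown with a simple threshold table. The categories come from the Language Check, but the threshold values and the function below are assumptions for illustration. The actual limits are whatever the administrator configures.

```python
# Assumed thresholds per category:
# (highest count still "safe", highest count still "acceptable")
THRESHOLDS = {
    "spelling": (0, 2),
    "grammar": (0, 3),
    "style": (5, 15),
}


def release_level(category, error_count, thresholds=THRESHOLDS):
    """Classify an error count as 'safe', 'acceptable', or 'unsafe'."""
    safe_max, acceptable_max = thresholds[category]
    if error_count <= safe_max:
        return "safe"
    if error_count <= acceptable_max:
        return "acceptable"
    return "unsafe"


for category, count in [("spelling", 0), ("grammar", 3), ("style", 16)]:
    print(category, count, release_level(category, count))
```

The three results correspond to the green, yellow, and red indicators shown at the top of the Congree panel.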
Figure 5: Diagram on the notifications per category from June to August.
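The figures behind such a diagram are simple aggregations over the collected check results. The following Python sketch assumes a hypothetical log format with invented dates, rule codes, and actions. It shows only the counting, not Congree's reporting module.

```python
from collections import Counter
from datetime import date

# Invented log entries: (day, category, rule code, author action)
LOG = [
    (date(2020, 6, 3), "grammar", "GR-0001", "corrected"),
    (date(2020, 6, 20), "style", "ST-0007", "ignored"),
    (date(2020, 7, 11), "style", "ST-0007", "ignored"),
    (date(2020, 8, 2), "spelling", "SP-0003", "corrected"),
    (date(2020, 9, 1), "style", "ST-0007", "corrected"),
]


def notifications_per_category(log, start, end):
    """Count notifications per category within a date range (inclusive)."""
    return Counter(cat for day, cat, _, _ in log if start <= day <= end)


def ignored_per_rule(log):
    """Count how often each rule was ignored by an author."""
    return Counter(rule for _, _, rule, action in log if action == "ignored")


print(notifications_per_category(LOG, date(2020, 6, 1), date(2020, 8, 31)))
print(ignored_per_rule(LOG))
```

A rule that shows up prominently in the second count is a candidate for review, as noted above.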
Congree offers a lot of possibilities to check content in different environments. Depending on the preferences of the authors, these checks can either be made in real-time or in batch mode at the end. Certainly, they can also be performed for already-existing documents.
If you just want to check a text independent of a specific editor, you can also use a web interface and just copy and paste the text. The same Style Guides are available there as in the plugin versions.
Since Congree integrates into the existing authoring tool, there is no need to actually “learn” a new application. The few features for the authors are easy to understand. But to get the most out of the system, it has to be thoroughly configured by the system administrator, and that can be really challenging because companies have to define which linguistic aspects are important for their texts.
In the end, Congree can definitely help to improve content quality and make it more consistent. As the results of the tested text were promising, it is certainly worth a look.