Language technology in Saudi Arabia

By Mansour Alghamdi, Mohamed Alkanhal & Faisal Alshuwaier November 6, 2012

As communication and information technology has advanced and spread, Saudi Arabia has become a leading country in mobile penetration rates and in the use of Arabic applications, including the internet and social networks. Arabic is the first language of the 18 million Saudis, while English is the second.

Since Saudis tend to use Arabic in most mobile and PC applications, more research and development needs to be done on human language technology, with more focus on Arabic. The institution in charge of advancing science and technology in Saudi Arabia is King Abdulaziz City for Science and Technology (KACST). It is an independent scientific organization that runs the Saudi Arabian national science labs and is involved in science and technology policy making, data collection, funding of external research and providing services such as the patent office.

KACST was directed by its 1986 charter to propose a national policy for the development of science and technology, and to devise the strategy and plans necessary to implement it. In accordance with this charter, KACST launched a comprehensive effort in collaboration with the ministry of economy and planning, and with other public and private sector bodies, to develop a long-term national policy for science and technology. The policy was approved by the council of ministers in July 2002.

This policy sets out the broad future directions of science, technology and innovation in Saudi Arabia. It provides an integrated guidance framework that serves as a reference for continuing to develop the national system and for improving its performance in ways that achieve Saudi Arabia’s long-term objectives. As a result of this policy, the National Plan for Science, Technology and Innovation was developed. A crucial part of this national plan was to identify the technologies most relevant to Saudi Arabia’s immediate needs, one of which is information technology (IT).

IT has been a key driver of productivity and economic growth in many countries around the world, creating new services, aiding education and improving training. IT, especially computer modeling, language processing and data analysis, also enables advancement in almost all fields of science and technology. The competitiveness of Saudi Arabia as it moves into knowledge-based industries, such as finance, telecommunications, health care and education, relies heavily on IT. The subareas of IT that were identified as important to Saudi Arabia include speech and language, and the national plan has defined objectives and areas of interest for speech and language technology. These objectives include conducting research and developing software and databases that can be utilized for Arabic language processing. In addition to supporting many research projects on human language technology at Saudi universities, KACST has been working on several projects itself.

Arabic language technology

One of the research projects in language technology involves an Arabic corpus. KACST has gathered more than 700 million words from texts found in Arabic publications. These words, which form the core of the Arabic corpus and which KACST hopes to see reach billions of words in the future, cover a period of time commencing with the first usage of Arabic script in the pre-Islamic era (400 CE), and continue until modern times. This corpus takes into account religion, medicine, engineering and so on. The KACST Arabic corpus defines Arabic vocabulary and provides the meaning of words as well as the changes that have affected them structurally, linguistically and semantically with the passage of time. The function, structure and meanings of words are defined in the corpus, while content is indexed according to specialty, date of publication, authors and sources. In addition to the immediate benefit from the available information, the corpus will also be used to compose specialized dictionaries in different areas of knowledge and include them in various applications, including computer-aided translations, automated translations, automated analysis of Arabic texts, Arabic search engines and Arabic language processing in general. Moreover, the corpus can be used in the creation of curricula and courses that teach Arabic.

Another project in the language technology area is the machine translation lab, which was established with the help of IBM. The lab’s target is to build state-of-the-art translation systems between Arabic and regional languages such as Hebrew, Turkish, Urdu, Hausa and Persian. Early efforts in the lab have resulted in a very reliable Hebrew-Arabic translation system and a basic Farsi-Arabic translation system. Both systems were trained using the open-source statistical machine translation system Moses. The future plan includes developing several parallel corpora and language-specific processing tools for the target languages. This service is currently available online at http://translate.kacst.edu.sa/.

Morphological analyzers are considered an important component of language processing. Jointly with regional institutes, KACST has developed a morphological analyzer for the Arabic language. The algorithm is based on the morphological and grammatical characteristics extracted from a large linguistic corpus. To develop it, comprehensive morphological and grammatical properties were first defined to encode the characteristics of Arabic vocabularies. A morphological and grammatical database was built and derived from a corpus of carefully selected articles. Next, an expert system was developed to assist in encoding grammatical and morphological characteristics for Arabic vocabularies. Finally, characteristics of vocabularies’ records were encoded. This analyzer can be used by linguists, researchers in the area of language properties and by developers of automated systems.

In addition to this, KACST has developed an Arabic name Romanizer to standardize the method of Romanizing Arabic names and make it available to others. Today, people travel around the world more than they did at any other time. Thus, it is necessary that their names be written on their travel documents in the Roman alphabet. Proper names are written in the Roman alphabet on passports, airline tickets, credit cards, driving licenses and certificates. Transliteration of Arabic names has not been consistent because Arabic orthography is different from that of the Roman alphabet. The result is that the same name is Romanized in different ways, and such inconsistencies can have negative effects in terms of security and property rights.

Arabic speech technology

IT is increasingly incorporating speech applications in almost all of our daily activities. In light of this global trend, and given the importance of this technology to Arabic speakers, KACST has been working on several projects related to this concept. One of them is the Saudi Voice Bank on which Arabic speech recognition systems are trained. The Saudi Voice Bank contains speech data that is phonetically rich and balanced to empower speech recognition systems to recognize Arabic speech regardless of the speaker’s gender, dialect and age. The Saudi Voice Bank is available at KACST and can be licensed to research centers and companies that develop computer speech systems and to researchers interested in speech recognition and speaker verification. IBM signed an agreement with KACST in 2002 to use the Saudi Voice Bank to develop a telephony human machine communication system for Arabic.

The use of speech recognition systems over fixed-line and mobile networks has significantly increased lately. With speech technology, users can, for example, call airlines or travel agents and make reservations, or check flight details verbally. Although speech technology is being increasingly applied to a range of languages, little effort has been devoted to Arabic speech recognition. KACST has recently worked with IBM to develop a speech recognition system for Saudi speakers. This system is based on the Saudi Voice Bank. KACST has also built and evaluated a speaker verification system using the Saudi Voice Bank. Customer services and call centers can benefit from this product.

In the mid-1990s, KACST started to build an Arabic phonetic database. By the end of the 1990s, the KACST Arabic Phonetic Database became available for researchers and research centers. The most sophisticated equipment and tools were used to compile the database with data that can easily be utilized by researchers and parties interested in speech and phonetics. This technology could prove useful to speech therapists, as well as researchers studying speech synthesis, speech recognition and speaker identification.

Another component that is important in any speech solution is the text-to-speech (TTS) system. TTS is a complex system that includes programs and algorithms designed to produce sound that can be understood by humans and is as close as possible to the natural human voice. During the last four decades, significant progress has been made in TTS technology for European languages. Efforts on Arabic TTS have been initiated only recently, however, and the resulting systems remain closed software with limited usage and capabilities. Jointly with King Fahd University of Petroleum and Minerals, KACST has completed the first open-source Arabic TTS system. The system is called KACST Arabic Text-to-Speech, and is freely available on the SourceForge source code repository (sourceforge.net).

Modern Arabic writing normally omits the diacritic marks for short vowels and geminates, which means that these sounds are not represented in the daily writing of Arabic. The absence of the vocalic and geminate symbols prevents full use of other computational systems such as text-to-speech, automatic speech recognition and search engines. Therefore, KACST has started working on automatic Arabic diacritization to develop a system that can be integrated into other related computer systems. The result is the KACST Arabic diacritizer, which is available at the Computer Research Institute at KACST.
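The ambiguity that a diacritizer has to resolve can be illustrated with a small sketch. It is not the KACST diacritizer. It only performs the reverse and much easier operation, removing the marks, to show that differently vocalized words collapse into the same everyday spelling. The function name and the two sample words are chosen here for illustration.

```python
import re

# Arabic tanwin, short-vowel, shadda and sukun marks (U+064B through U+0652).
DIACRITICS = re.compile("[\u064B-\u0652]")


def strip_diacritics(text):
    """Remove vowel and gemination marks, leaving the everyday written form."""
    return DIACRITICS.sub("", text)


# Two fully diacritized words, written with escapes so each mark is visible.
KATABA = "\u0643\u064E\u062A\u064E\u0628\u064E"  # kataba, "he wrote"
KUTUB = "\u0643\u064F\u062A\u064F\u0628"          # kutub, "books"

if __name__ == "__main__":
    for word in (KATABA, KUTUB):
        print(word, "->", strip_diacritics(word))
```

Both words reduce to the same three consonants, so a diacritizer must rely on context to decide which reading to restore.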

Arabic document processing technology

Arabic document recognition is one of the projects currently supported by KACST as part of its work on different applications of the Arabic language. The main objective of the project is to recognize scanned, printed Arabic text and produce corresponding editable text. The project will deliver a cursive Arabic script recognition system. After acquiring the document, the system decomposes the document image into text line images and divides each text line image into smaller overlapping frames. The system extracts a set of simple statistical features from each frame and then feeds the sequence of feature vectors into the Hidden Markov Model Toolkit.
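The frame-and-feature step described above can be sketched roughly as follows. The article does not say which statistical features the KACST system uses, so the two features here (ink density and vertical center of mass), the frame width, the step size and all function names are assumptions made for illustration. The line image is taken to be a list of rows of 0s and 1s, where 1 marks ink.

```python
def split_into_frames(line_image, frame_width=4, step=2):
    """Slice a text-line image into overlapping vertical frames.

    A step smaller than frame_width makes neighboring frames overlap.
    """
    width = len(line_image[0])
    last_start = max(width - frame_width, 0)
    return [
        [row[start:start + frame_width] for row in line_image]
        for start in range(0, last_start + 1, step)
    ]


def frame_features(frame):
    """Return [ink density, normalized vertical center of mass] for a frame."""
    height = len(frame)
    width = len(frame[0])
    ink = sum(sum(row) for row in frame)
    density = ink / (height * width)
    if ink == 0:
        center = 0.5  # assumed neutral value for an empty frame
    else:
        weighted = sum(y * sum(row) for y, row in enumerate(frame))
        center = weighted / (ink * max(height - 1, 1))
    return [density, center]


def line_to_feature_sequence(line_image, frame_width=4, step=2):
    """Turn one text-line image into a sequence of feature vectors."""
    frames = split_into_frames(line_image, frame_width, step)
    return [frame_features(frame) for frame in frames]


if __name__ == "__main__":
    # A 3 x 8 toy line with a single horizontal stroke in the middle row.
    toy_line = [[0] * 8, [1] * 8, [0] * 8]
    for vector in line_to_feature_sequence(toy_line):
        print(vector)
```

In a real pipeline, each sequence would be written out in the format the Hidden Markov Model Toolkit expects and used to train and decode character models. Frames are produced left to right here for simplicity, and a system for Arabic script might order them right to left instead.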

In the document processing area, Braille recognition has drawn the attention of KACST. The Braille system has enabled the blind to read and write, something they were deprived of before its invention. With the Braille system, blind people are able to acquire knowledge and communicate with others. The growing use of the Braille system almost all over the world has created a need for automated Braille processing. Optical systems that recognize Braille documents have been produced for many languages. KACST has developed an Arabic Optical Braille Recognition system. The function of this system is to translate Braille codes into readable Arabic text.
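The final translation step, from recognized Braille cells to Arabic text, can be sketched as a table lookup. This is not the KACST system, which also has to locate the dots in a scanned image first. The table below covers only four letters and is meant as an illustration, and the function names are hypothetical.

```python
# Partial, illustrative table: raised dot numbers of a cell -> Arabic letter.
CELL_TO_LETTER = {
    (1,): "\u0627",       # alif
    (1, 2): "\u0628",     # ba
    (1, 2, 3): "\u0644",  # lam
    (1, 3, 4): "\u0645",  # mim
}


def decode_cells(cells, unknown="?"):
    """Translate a sequence of Braille cells into Arabic text.

    Each cell is an iterable of raised dot numbers (1 to 6). Cells that are
    missing from the table are replaced by the `unknown` marker.
    """
    return "".join(
        CELL_TO_LETTER.get(tuple(sorted(cell)), unknown) for cell in cells
    )


def cell_to_unicode(dots):
    """Render a cell as a Unicode Braille pattern (dot n sets bit n - 1)."""
    return chr(0x2800 + sum(1 << (dot - 1) for dot in dots))


if __name__ == "__main__":
    cells = [(1, 2), (1,), (1, 2)]  # ba, alif, ba
    print("".join(cell_to_unicode(c) for c in cells), "->", decode_cells(cells))
```

A complete system would need the full Arabic Braille table, including diacritic and punctuation cells, plus handling for contractions where they are used.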

Data mining technology

Since Arabic internet search engines are rare, and since existing ones do not process the Arabic language properly, KACST has developed an experimental open-source Arabic search engine called Naba Search Engine. The project first identified the techniques employed in constructing global search engines and developed various tools to identify Arabic websites. A large database of Arabic websites was then constructed so that the sites could be categorized automatically, and a database of Arabic synonyms was built. You can visit Naba at http://naba.kacst.edu.sa/.

KACST has utilized data mining techniques to develop an autoscoring system, called Abbir, for scoring Arabic essays. Manual evaluation of essay writing faces several obstacles including the time and effort it takes and the inconsistency among the human raters. Autoscoring is a method of educational assessment and an application of natural language processing. Its objective is to classify a large set of textual entities into a small number of discrete categories, corresponding to the possible grades.
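Treating essay scoring as classification into a few grades can be sketched with a very simple model. The article does not describe how Abbir works internally, so the technique used here (a bag-of-words profile per grade, compared by cosine similarity), the grade labels and the toy training texts are all assumptions for illustration.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Count whitespace-separated tokens, ignoring case."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train(graded_essays):
    """Build one summed word-count profile per grade from (text, grade) pairs."""
    profiles = {}
    for text, grade in graded_essays:
        profiles.setdefault(grade, Counter()).update(bag_of_words(text))
    return profiles


def score(essay, profiles):
    """Return the grade whose profile is most similar to the essay."""
    vector = bag_of_words(essay)
    return max(profiles, key=lambda grade: cosine(vector, profiles[grade]))


if __name__ == "__main__":
    # Hypothetical graded essays; a real system would use many per grade.
    training = [
        ("the corpus supports translation and search", "high"),
        ("good good nice", "low"),
    ]
    profiles = train(training)
    print(score("translation of the corpus", profiles))
```

A production scorer would need Arabic-aware tokenization and richer features than raw word counts, and its agreement with human raters would have to be measured before use.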

Enriching Arabic Wikipedia

Wikipedia, the free encyclopedia, is the most extensively used encyclopedia in the world and is ranked by alexa.com as the sixth most popular site on the internet. Its English version contains almost four million articles and is used by 15 million registered members. As for its Arabic version, it contains about 190,000 articles and is used by 500,000 registered members. In terms of the number of articles, it ranks twenty-fifth in comparison to other languages. This is relatively low when taking into account the number of Arabic speakers in the world. Some languages with a total number of speakers not exceeding the number of residents in an Arab capital rank higher. Catalan, for example, which is spoken by approximately 11 million people, is rated fifteenth.

Taking these statistics into consideration, the growing importance of Wikipedia as a vital source of knowledge in all fields is indisputable. So, too, is the necessity to enrich the Arabic version of this encyclopedia to enable the Arab reader to access the sources of knowledge with ease.

The first stage of this project includes translation from the English version of Wikipedia into Arabic, varying in topics from biotechnology and nanotechnology to public health and medicine. KACST has launched a website (www.wikiarabi.org) to encourage volunteers across the Arab world to contribute to this project (Figure 1). More than 2,100 articles have already been translated, with KACST contributing half of the translations, while the other half was carried out by Saudi universities. In addition to increasing the scientific Arabic content, the project also contributes to the development of a community of Arab Wikipedia volunteers and writers who will increase the opportunities of enriching the Arabic version of the encyclopedia. The translated articles have been welcomed by Arab readers, and more than one million users have consulted them. Some of these articles have been read by more than 100,000 readers within two months. KACST is currently working on the second phase of the project with the aim of translating articles from 12 foreign languages into Arabic. Universities from the Arab world are competing to take part in this task.