Wordscope: creating a kind of 'Google Health'

By Philippe Mercier August 18, 2014

Often, translation is limited to just that: translation. However, translation databases and memories don’t have to be used for translation purposes only. We overlook the fact that they contain an inexhaustible quantity of information on a plethora of subjects. But how do we make the most of them?

We all know that if we want quality translations, we have to work with translators specialized in particular areas. Is that enough to satisfy our clients? Surely not! The richness of language allows things to be expressed in different ways, and each client has his or her preferences, though even large organizations rarely have strict rules and a unique style.

So much the better! It would be sad if we all had to speak a language that was limited and highly structured. But how, then, can a translator and translation company make sure that the services they provide live up to their clients’ expectations? This is the challenge that we faced when providing translation services to a large international health organization.

When you’re dealing with massive international organizations, it is almost impossible to work from a centralized terminology system. These organizations regularly collaborate with others on reports, conferences and studies, so the terminology would have to be the same for every participant in these projects, which is a very complicated or even hopeless task. Who would be “right” if there was disagreement about a term? Is there a supreme authority in matters of terminology? Obviously not. Nor can we ask authors, who are high-level professionals in their field, to constantly consult style guides, as this would stifle creativity and would most certainly disrupt the writing process.

So in 2010, during meetings with our client on the subject of terminology, we asked, “How can our translators know your terminology preferences?” It was suggested that we take the documents previously published by the organization as our basis. We then asked if these documents were available and if someone could provide them to us.

This question is both simple and complex. Yet again, the size of the organization and the vast number of participants complicated the task. We would need not only previous reports but also the translations of the treaties and conventions referenced in the documents, since those texts have legal value. It was of course impossible for such a large volume of data to be sent to us on any medium. Moreover, these documents would have had to be organized and sorted by language, subject matter and so on. We were back to square one.

That was when one person at the meeting made a brilliant suggestion: “Why don’t you download our entire website and all of the documents on it?” The structure of the site allows users to identify the language of the documents, and the language buttons of one page link to the same page in different languages, so it was possible to locate the different language versions and match them up. These documents could then be divided into sentences, aligned and indexed.
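The matching and alignment steps can be illustrated with a minimal Python sketch. It is not the Wordscope implementation: the `pages` structure, the example.org URLs and the function names are hypothetical, the sentence splitter is deliberately naive, and pages are only aligned one-to-one when both versions have the same number of sentences.

```python
import re

# Hypothetical crawl result: URL -> language, text and language-button links.
pages = {
    "https://example.org/en/tobacco": {
        "lang": "en",
        "text": "Tobacco kills. Quitting helps at any age.",
        "links": {"fr": "https://example.org/fr/tabac"},
    },
    "https://example.org/fr/tabac": {
        "lang": "fr",
        "text": "Le tabac tue. Arrêter est utile à tout âge.",
        "links": {"en": "https://example.org/en/tobacco"},
    },
}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def align_pages(pages, src_lang, tgt_lang):
    """Pair up sentences of pages connected by a language link.

    Assumption: versions with different sentence counts are skipped
    rather than aligned with a more robust method.
    """
    pairs = []
    for url, page in pages.items():
        if page["lang"] != src_lang:
            continue
        target = pages.get(page["links"].get(tgt_lang))
        if target is None or target["lang"] != tgt_lang:
            continue  # missing link, or link pointing to the wrong language
        src = split_sentences(page["text"])
        tgt = split_sentences(target["text"])
        if len(src) != len(tgt):
            continue
        for s, t in zip(src, tgt):
            pairs.append({"source": s, "target": t, "url": url})
    return pairs


aligned = align_pages(pages, "en", "fr")
```

Each aligned pair keeps the URL of the page it came from, so the context of a sentence can still be shown later.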

We looked into this suggestion. Such a system could act as a search engine, allowing the user to research terms; it would list all sentences containing the desired terms and their equivalent sentences in the chosen language. Our team of engineers then developed a software application to carry out this task automatically, given the large quantity of documents to be analyzed. It was put into service for translation projects as an internal tool called Wordscope.
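As a rough illustration of such a bilingual search, the Python sketch below builds a small inverted index over aligned sentence pairs and returns each matching sentence with its equivalent and its source page. The sample pairs, URLs and function names are invented for the example and say nothing about how Wordscope itself is built.

```python
import re
from collections import defaultdict

# Hypothetical aligned sentence pairs, each keeping the page it came from.
pairs = [
    {"en": "Nausea is a common side effect.",
     "fr": "La nausée est un effet secondaire fréquent.",
     "url": "https://example.org/en/leaflet"},
    {"en": "Tobacco control saves lives.",
     "fr": "La lutte antitabac sauve des vies.",
     "url": "https://example.org/en/tobacco"},
    {"en": "Ginger may relieve nausea.",
     "fr": "Le gingembre peut soulager la nausée.",
     "url": "https://example.org/en/advice"},
]


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def build_index(pairs, lang):
    """Inverted index: word -> set of positions of the pairs containing it."""
    index = defaultdict(set)
    for i, pair in enumerate(pairs):
        for word in tokenize(pair[lang]):
            index[word].add(i)
    return index


def search(index, pairs, query, src, tgt):
    """Return (source sentence, target sentence, url) for pairs containing every query word."""
    words = tokenize(query)
    if not words:
        return []
    hits = set.intersection(*(index.get(w, set()) for w in words))
    return [(pairs[i][src], pairs[i][tgt], pairs[i]["url"]) for i in sorted(hits)]


index = build_index(pairs, "en")
results = search(index, pairs, "nausea", "en", "fr")
```

Here `results` holds the two English sentences that mention nausea, each with its French counterpart and the page where it was found.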

We found the idea to be quite interesting because, in addition to past translations, Wordscope showed translators on which pages the selected sentences had been found, giving them context. This provided a real added value, since the meanings of words are influenced by their context. The results were so positive that we thought it would be a good idea to index other websites in the health care field, and the database rapidly expanded to several hundred million words.

However, as we added more websites to Wordscope, our engineers’ lives became increasingly complicated. This seemingly simple idea conceals a daunting complexity because some websites are badly structured, meaning they don’t follow the same rules even within different pages of the same site. Additionally, the language links don’t always direct the reader to the right pages — or worse, they lead to the website’s home page.

Furthermore, experience has shown us that it is technically relatively easy to build efficient systems with data limited to a few thousand or a few million words, but once we try to scale up to gigantic databases, the technological challenges multiply. These insidious problems only appear as the databases grow, and they depend on the varying structures of the websites indexed, the errors encountered, the different character sets and so on. According to “The Anatomy of a Large-Scale Hypertextual Web Search Engine” by Sergey Brin and Lawrence Page of Stanford University, “Running a web crawler is a challenging task. There are tricky performance and reliability issues and even more importantly, there are social issues. Crawling is the most fragile application since it involves interacting with hundreds of thousands of web servers and various name servers which are all beyond the control of the system.” It should also be said that the indexing process is more complex than that of a classic search engine because the multilingual phase imposes additional, complex steps: first, matching up pages according to language, and second, dividing pages into sentences and aligning those sentences.

As a result, this small project, begun in 2010, has grown so much that it has required several thousand hours of engineering work as well as large hardware investments. Wordscope is now a mature software application used for a large number of translation projects at Locordia Communications.

An e-health search engine

But that’s not the whole story! We noticed that Wordscope was being used outside of the scope of translation projects. What were our users looking for? Health information, quite simply. We had effectively built a translation database, but in addition to that and perhaps above all, we had created a database offering quality health content and information (Figure 1).

Thus, we discovered that Wordscope is not just for translators. Almost everyone needs quality health information: individuals, families, health care professionals who can use Wordscope’s content to keep abreast of new techniques, researchers in one country who can quickly and easily consult research done in another, students, and the list goes on.

How many websites containing essential, quality information are unknown or underused? Does a health care professional have the time to skim through several sites to find the information he or she is seeking? Of course not, and that’s the great advantage of themed search engines such as Wordscope.

We all know that search engines such as Google and Bing, whose purpose is to index everything on the web, are very useful, but they also have many disadvantages. We see the evidence of this every day. We have to repeat or reword our queries to try to find what we seek within the jungle of high- and low-quality websites. These internet giants are aware of this problem, and many changes have been announced during the last few years to improve quality and, above all, to identify quality websites.

Now, is it really necessary to index all websites on the planet that mention health in order to build a health search engine? The same information is repeated on many websites, so broad coverage of the subject can certainly be achieved if the sites are chosen judiciously. The question of choice naturally arises: why choose one site over another?

In its current version, Wordscope only indexes multilingual content because this is also a qualitative selection criterion: websites translated into several languages generally have higher-quality content (Figure 2). This is certainly not true across the board, but on the other hand, it is a fact that websites with limited content or discussion groups are never translated. This selection criterion therefore prevents them from being indexed. This is a choice of quality over quantity.

It should be noted that specialized search engines providing health content do exist. However, these are mostly paid sites reserved for important customers, since the subscriptions are very expensive.

Google and Bing have both announced that they intend to redirect their focus to semantic content. But this will not solve the problems that come up in simple queries. Just as in translation searches, we think that subject indexing is the key to quality results. Indeed, when searching for a simple word having several meanings, such as cancer, no search engine with only one large database can ever determine with certainty if the user is looking for medical information, information on the Tropic of Cancer, or the horoscope for Cancer. In French, for example, the same term is used for all three meanings. And what about acronyms? Is a user who types FCTC looking for information on the Framework Convention on Tobacco Control, the First Coast Technical College, the Farmers Cooperative Telephone Company or something else altogether? The search engine can of course favor one option over the other, but is it the right one? Or it may list all of the results, but then it’s up to the user to select what he or she wants. Conversely, when users search for cancer or FCTC in the Wordscope Health database, we know by definition that they are looking for medical information (Figure 3).

Whatever the quality of its algorithm, in certain cases a search engine with only one database will be incapable of really knowing what the user is looking for. So which one should we choose? Should we even choose between them? Probably not. It all depends on the needs of users. Search engine giants can perhaps be compared to huge department stores where people can find everything, and targeted search engines are more like specialized boutiques offering focused, quality service. Surely there is room for both approaches.

Ranking and big data

To date, Wordscope has indexed about 600 million words, and even though we are adding websites every day and are only at the beginning, this is still very limited compared to Google or a similar site. However, even with this kind of limit, and a particular field classification (health), another question comes to mind: how do we calculate rankings? A ranking determines which results the search engine will display first. How can it guess what the user wants?

If we search for cancer and Wordscope returns 10 million results, but only displays the top 20 on the first page, is it choosing the right 20 results? Studies have shown that 90% of Google users never look at the second page of results.

To understand this, let’s look at another case that was pointed out to us. When a search for the term nausea was performed, Wordscope first displayed all of the leaflets of the many medicines for which this term appeared in the list of side effects. This is fine if you’re looking for the translation of the word, but it’s probably the wrong choice if the user is looking for information about nausea. We had thought at the beginning of the project that adding medicine leaflets would bring value and content — but this adversely affected the quality of our results, which again shows that quantity does not necessarily equal quality.

We found this case instructive and subsequently changed the ranking to penalize medicine leaflets, so that other documents with more relevant content appear first. But obviously, it’s a choice.
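A ranking penalty of this kind can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the base relevance scores, the `doc_type` labels and the 0.5 multiplier are invented and do not reflect the weights Wordscope actually uses.

```python
# Hypothetical hits for the query "nausea": base relevance score and document type.
hits = [
    {"title": "Drug A leaflet", "doc_type": "leaflet", "score": 0.90},
    {"title": "Nausea and vomiting: overview", "doc_type": "article", "score": 0.80},
    {"title": "Drug B leaflet", "doc_type": "leaflet", "score": 0.85},
    {"title": "Managing nausea during chemotherapy", "doc_type": "article", "score": 0.60},
]

# Assumed multipliers: leaflets keep half of their base score, other types keep all of it.
TYPE_WEIGHT = {"leaflet": 0.5}


def rank(hits, type_weight=TYPE_WEIGHT):
    """Sort hits by base score multiplied by the weight of their document type."""
    return sorted(
        hits,
        key=lambda h: h["score"] * type_weight.get(h["doc_type"], 1.0),
        reverse=True,
    )


ranked = [h["title"] for h in rank(hits)]
```

With these numbers the two articles move ahead of both leaflets, even though the leaflets have the higher base scores. Changing the multiplier changes the order, which is exactly why it is a choice.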

This example illustrates the many decisions that must be made for a system managing a large data set, or big data as it’s sometimes called, to serve its users well. However good it may be, a ranking based solely on single indexed words no longer makes sense today, given the quantities of data that are indexed and available.

Big data involves systems that bring together large quantities of information. These databases open new horizons for researchers, who can use them to create statistical and relational models with endless possibilities. In “Google, le nouvel Einstein,” an article published in the French Science & Vie in June 2012, it was noted that “All the evidence suggests that the greatest discoveries of the future will be made not from brilliant human intuition but from the scrutiny of data stored on obscure disk drives.”

Experiments have already been carried out in this field. The same article cites the example of two researchers, Jessica and Bradley Voytek of the University of California-San Diego, and their document analysis project. The results were published in the June 2012 issue of the Journal of Neuroscience Methods. This project revealed great statistical proximity between the words serotonin and migraine as well as between serotonin and striatum. And yet, migraine and striatum only appear together in 16 articles. This does not mean that there is a link between migraines and that part of the brain, but perhaps this type of conclusion may one day help define new hypotheses and direct researchers’ future work.
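To give a feel for what statistical proximity between words can mean, here is a small Python sketch that measures how often two terms appear in the same article, using the Jaccard index. This is a simple stand-in chosen for illustration, not the method of the cited study, and the seven-article corpus is invented toy data.

```python
from itertools import combinations

# Invented toy corpus: each set lists the terms found in one article.
articles = [
    {"serotonin", "migraine"},
    {"serotonin", "migraine"},
    {"serotonin", "striatum"},
    {"serotonin", "striatum"},
    {"serotonin", "migraine", "striatum"},
    {"migraine"},
    {"striatum"},
]


def jaccard(term_a, term_b, articles):
    """Articles containing both terms, divided by articles containing either term."""
    with_a = {i for i, terms in enumerate(articles) if term_a in terms}
    with_b = {i for i, terms in enumerate(articles) if term_b in terms}
    union = with_a | with_b
    return len(with_a & with_b) / len(union) if union else 0.0


# Proximity score for every pair of terms, keyed by the (alphabetically ordered) pair.
scores = {
    pair: jaccard(*pair, articles)
    for pair in combinations(["migraine", "serotonin", "striatum"], 2)
}
```

In this toy data, two terms are each close to a third one but rarely appear together. A system could flag that kind of gap as a candidate hypothesis for a researcher to examine, without claiming that any link exists.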

For several years, there has been fierce competition between Bing and Google to improve search algorithms. Examples include Google’s purchase of Metaweb (whose technology was incorporated into the Knowledge Graph) and Microsoft’s purchase of Powerset a few years ago.

In a March 15, 2012 article in The Wall Street Journal, Amit Singhal, head of Google Search, explained that semantic search would allow Google to answer web surfers’ questions more directly in its results pages. Between 10% and 20% of all queries would be affected, a sign that a major change is underway.

Ramez Naam of Microsoft also mentions this problem in the July 9, 2008 article “Le Net incollable” in Le Figaro: “Search engines force the user to think about the exact words likely to be found on the page containing the desired information. This makes things more complicated than they need to be.”

Semantic search or not, when the user types a single word, there’s no semantic algorithm in the world that allows the engine to know what he or she wants. This is all the more true if this word has several meanings and the information is not classified by subject. And even when a word only has one meaning (such as nausea), how can the engine guess what the user is looking for? Should the search return a medicine leaflet, information on what nausea is, the user’s symptoms or how to cure nausea?

Is a semantic search the answer? Shouldn’t we instead offer the user the choice of limiting the search field, through a sort of search-refining funnel? In our case, we’ve chosen to structure the data on the basis of filters, subject indexing being one of them. Users can then refine their searches by limiting them to a certain country, website or group of sites, such as government sites or commercial sites, all with a few simple clicks, and always in a specific field. The filters may vary depending on the needs of the users, companies or other international or governmental organizations. Other options will surely arise. The future will tell us whether they are the right strategies or whether there is another miracle method.
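The filter-based approach can be pictured with a short Python sketch. The `Doc` fields, the sample documents and the `search` function are hypothetical and only meant to show how subject, country and site-group filters narrow an ambiguous one-word query.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    # Hypothetical metadata attached to every indexed document.
    title: str
    subject: str
    country: str
    site: str
    site_group: str


docs = [
    Doc("Cancer screening guidelines", "health", "FR", "sante.example.fr", "government"),
    Doc("Tropic of Cancer travel notes", "travel", "MX", "travel.example.com", "commercial"),
    Doc("Cancer treatment options", "health", "CA", "clinic.example.ca", "commercial"),
    Doc("Cancer horoscope", "astrology", "FR", "astro.example.fr", "commercial"),
]


def search(docs, term, **filters):
    """Keep documents whose title contains the term and whose fields match every filter."""
    term = term.lower()
    return [
        d for d in docs
        if term in d.title.lower()
        and all(getattr(d, field) == value for field, value in filters.items())
    ]


everything = search(docs, "cancer")
health = search(docs, "cancer", subject="health")
government = search(docs, "cancer", subject="health", site_group="government")
```

Without filters the query matches all four documents. Restricting the subject to health leaves two, and adding the government filter leaves one, which is the funnel effect described above.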

Meanwhile, other players such as Facebook and Apple are jostling to dominate internet searches. For its part, Wordscope has expanded its reach from solely health care to the legal and finance fields, and will soon be available for free to everybody, with no content restriction.