Community Lives: Whose data is it anyway?

During the many years I’ve been involved in the language industry, I have encountered copyright and attribution issues in many diverse ways. When I decided to write this article, I needed some way to structure the many snippets of legalese and industry lore on copyright into a coherent whole.

Wishful thinking! Even looking at monolingual texts with a single author, there is no easy way to summarize the issues. Multiply original source texts by any number of target texts and the complications compound. Factor in the topographic undulations of our industry’s supply chain, then try to accommodate the minefield that the technologies we use bring into the picture, and any chance of a simple description that everyone is happy with dissolves into a morass. Yet each and every day, millions of translators, interpreters and computers somehow manage to do a great day’s work facilitating communication across the globe, perhaps with varying degrees of satisfaction, but with enough to live life and get up the next day to do the same thing all over again. How is this possible?

On the face of it, this is a simple question. But there is no simple answer. In fact, I believe, it would take a hefty volume to cover the many different issues wrapped up here. We can try to move the discussion forward with a few other questions.

Who owns what

First, a thought experiment adapted from the cyberspace law guru Lawrence Lessig: I have a friend who resides in California. He likes to gamble but online gambling there is illegal. I live in Nevada where online gambling is legal and I run an online gambling site. He opens an account, logs in and places a bet. But to complicate things, the server on which my site runs is located in New Jersey.

The question is, where does the bet take place? To move the thought along, let’s change the transaction from a wager to a translation. In this case, the site I run offers machine translation built on memories that combine input from human translators with word pairs formed by my amazing machine-learning software. So to restate the question: where does the translation take place? Who did the work? Who owns the copyright?

This might seem far-fetched, but things can easily get a lot more complicated. The server in New Jersey interacts with another server in Alicante, Spain, which houses the word-pair data. My server in Nevada runs project management software in Mumbai, India. My friend, the customer, is accessing his company’s head office computer in Silicon Valley, but he works remotely from Vancouver. And by the way, we all have legal representation to take care of copyright issues, but our attorneys are in New York City and San Francisco. Oh, I forgot to mention: there are 84 target languages, involving 84 translators and 84 reviewers, all in different locations and countries!

This isn’t just network complexity; it is a morass of dynamic connections that simply defies transaction recording and regulation. Perhaps it would be better just to avert our eyes from the tangle of tails and accept that we are dealing with a rats’ nest.

The Creative Commons answer

But ecommerce is still exploding across the globe. Billions of emails are exchanged daily. There are goodness knows how many websites online, buzzing with all kinds of cyber-activities. Are there other areas that have copyright issues and how have they been dealt with? The answer is yes and a perfect example to begin with is Creative Commons.

Creative Commons was founded in 2001 with the goal of providing a simple means of sharing ‘creative’ work at no cost to users, while its licenses still protect creators from misuse of their work. This replaced the cumbersome system of reuse licensing that was thought to hamper the distribution of creative works. Given that well in excess of one billion Creative Commons licenses have been issued since the scheme’s inception, it may be judged a resounding success; in particular, the widespread use of images across the web can be attributed to Creative Commons. Naturally, there has been criticism. To take one example, there has been concern over unclear distinctions between commercial and noncommercial use. While this is certainly an issue that would attract the attention of language industry professionals, we should note that it has not actually stopped successful reuse of creative works. The phrase “where there’s a will, there’s a way” comes to mind. The bottom line, in overly simple terms, comes down to whether you want your work to reach a wider audience or not.

The DMCA answer

Another notable initiative has been the Digital Millennium Copyright Act (DMCA) of 1998. This US copyright law was devised in particular to protect creators of online content by extending the scope of copyright and to deal with the threat of piracy and other misuses in the burgeoning world of global connectedness. Given the law’s reach, it is not surprising that it has provoked much debate, a number of high-profile lawsuits and several reviews in Congress. There is no question that the internet, which has massively changed the ways in which we in the language community work, has brought both good and bad opportunities. The advent of the DMCA should also forcefully emphasize that there is no magic bullet that will solve the copyright problems we face if we want to share our work with the world.

The EFF answer

At present, sharing content with the world means using a cloud-computing service. The content is likely to be some form of entertainment, such as movies or music, but what if multilingual content is included, and what about the people who created it? Not only is that unclear; how many of us are aware that, according to the US government, we can forfeit property rights to content stored this way? Let’s get this straight: your translations can cease to be yours based simply on the means of storing them. Fortunately, situations like this have kicked up storms of protest, and organizations like the Electronic Frontier Foundation (EFF) are there to try to resolve the issues.

The EFF was founded to protect the openness of the internet, and it has fought passionately to maintain the principles of open sharing that the internet was built upon. Preserving this openness is, of course, a battleground between advocates of rights such as privacy and freedom of expression and those who seek to control, legislate and monetize software and network use. With multilingualism now identified as a digital right, what exactly is at stake for us in the language community? It is quite ironic that for language professionals, computer technologies are both powerfully enabling and potentially restrictive at the same time. So what exactly is our place in the digital world?

Big Data

We are now immersed in oceans of data, and it is no misnomer that this is known as Big Data. Information scientists and global corporations have seized upon this lucrative resource with both hands, determined to wring whatever value they can from a true treasure trove. But where is the individual in this gargantuan mass of assets? More to the point for us in the language community, what ownership remains for the creators of this colossal corpus? Specifically, how can ownership be attributed, tracked and credited? In his 2013 book Who Owns the Future?, computer scientist, musicologist and pioneer of virtual reality Jaron Lanier proposed an intriguing solution for crediting content. He pointed out that whoever controls the data controls the wealth, because they own and manage the tech infrastructure. Lanier suggested a system that would link users of data with its originator, who would be compensated by means of micropayments. In fact, he specifically used the example of translators in a multilingual world to argue his case. As yet, his proposal has not found many proponents among the Silicon Valley behemoths who control the data. All is not lost, however: the merits of Lanier’s ideas may find a more feasible outlet in the emerging technology of blockchain.

Open and closed cases

Ownership issues are diverse and plentiful; trying to discuss them all at any length here is impossible. I feel as if I’m afloat in an endless sea, not knowing in which direction to swim. The truth is that, just as in every other business and cultural community, the digital era brings problems along with its benefits, and attributing ownership of translated material can seem intractable. In the interest of simplifying the issues, I want to suggest two main areas of concern that will, I hope, serve to illustrate our difficulties: translated materials in a closed environment and those in an open environment. The former are held by business entities that gather, manage and use translations strictly for their own purposes. The latter are characteristically large global corporations that provide some form of automated translation tool for general use, relying on Big Data repositories of material culled from many different sources. Both present their own set of copyright, ownership and attribution issues.

Let’s take the fictitious Acme Widget company as an example of a closed repository. They have built their corpora up from work with their own in-house and contracted translators. They use translations for their own purposes: sales and marketing, customer support and so on. Through their own commercial activity they own the copyright on their material, both sources and targets. But eventually their database will mature sufficiently that they are able to use existing material without adding much to it. If they only used old material, then the translators who worked for them would have to go elsewhere in search of another job. In real life, though, there is always new content to be translated or updated, thankfully keeping those translators busy for years to come.

Let’s now take the fictitious Binary Inc., a global corporation and household name that provides all manner of digital services. Their translation material is vast and is used by their own translation app, available free to anyone with a phone, tablet or desktop, along with the hundreds of mobile apps that use the technology. Trying to figure out who made a penny from translating such powerfully enabling material is, however, enough to have you reaching for the Pepto-Bismol.

The fact is that both Acme and Binary give 100% of their attention to the end product. Okay, not quite 100%, but you can see that the translators who create the resources are not the highest priority. Having been a part of the language industry for so many years, I have heard no end of complaints about this situation. Translators and interpreters need a better deal. Is there a direction in this vast sea of language that we can swim through to a safe haven of equitable ownership? This brings me back to blockchain, and to artificial intelligence (AI) researcher Alan Stewart’s suggestion that investigating blockchain further may benefit the language community, in particular the attribution of target texts.

Could blockchain be the answer?

Blockchain is a concept that defies simple explanation, but we are all familiar with Bitcoin and other digital currencies, which are instances of blockchain technology at work. Unfortunately, that familiarity may owe more to shortcomings that have received adverse publicity than to the technology’s merits. Nevertheless the technology, admittedly still in its infancy, is with us and promises, if its proponents are to be believed, to revolutionize our world.

A blockchain calculation creates a block: a uniquely identified record that can be tied to an identified owner (in our case, attributing a translator’s target text) and that can never be changed, thus conferring permanent ownership however often the work is reused. Successive blocks are filed in a ledger to form a chain. These are available to anybody who subscribes to the database, with the content (a target text) and its creator indelibly marked. Figure 1 illustrates how blocks are produced, with the internal workings of Block 11 shown as well. Each block is the product of intense mathematical calculations, and only when those calculations are complete can work proceed to the next block.

Essentially, blockchain technology rests on a constant process of generating blocks of records, each guaranteed to be unique by a rigorous, cryptographically driven algorithm. In effect, it is a massive distributed database available to all who subscribe to it, and it offers the capability of linking online work with its original creator in an uneditable and secure form.
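To make the mechanism concrete, here is a minimal sketch in Python of how a ledger of attributed target texts could be hash-chained. It is an illustration only, not a real blockchain: there is no proof-of-work and no distributed consensus, and the translator names and texts are invented for the example.

```python
import hashlib
import json
import time

def make_block(target_text, translator, prev_hash):
    """Create a block attributing a target text to its translator.

    The block's hash covers the content, the creator and the previous
    block's hash, so altering any earlier record invalidates the chain.
    """
    record = {
        "translator": translator,
        "target_text": target_text,
        "timestamp": time.time(),
        "prev_hash": prev_hash,
    }
    record["hash"] = hashlib.sha256(
        json.dumps(record, sort_keys=True).encode()
    ).hexdigest()
    return record

def verify_chain(chain):
    """Check that every block matches its stored hash and back-link."""
    for i, block in enumerate(chain):
        body = {k: v for k, v in block.items() if k != "hash"}
        expected = hashlib.sha256(
            json.dumps(body, sort_keys=True).encode()
        ).hexdigest()
        if block["hash"] != expected:
            return False
        if i > 0 and block["prev_hash"] != chain[i - 1]["hash"]:
            return False
    return True

# Build a tiny ledger of two attributed translations.
genesis = make_block("Hola, mundo", "translator_a", prev_hash="0" * 64)
chain = [genesis,
         make_block("Bonjour, le monde", "translator_b", genesis["hash"])]
```

The point of the chaining is that tampering with any earlier block, even by a single character, breaks verification of everything after it, which is what makes the attribution effectively indelible.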

This grossly simplified description is clearer with an example. The following is an amended version of a post on a tech website. The words in italics replace words in the original so that it makes a certain amount of sense to the language community:

Imagine that every digital translation sent updates about amendments, quality problems and ownership details to an open source, community-wide trusted ledger, so additions and subtractions to the translation were well understood and auditable across organizations. Instead of just displaying data from a single database, the text could display data from every database referenced in the ledger. The end result would be perfectly reconciled community-wide information about the text, with guaranteed integrity from the point of data generation to the point of use, without manual human intervention.

The original is a report on an article in the Harvard Business Review, written in conjunction with MIT, describing how blockchain might facilitate a real-time medical-record system. My question is, if the health care community can do it, why can’t the language community? Why shouldn’t we? If the much-hyped Internet of Things is any indication of things to come, we could see a massive proliferation of wired devices with some form of language component built in. Who is going to provide those language components? If we don’t yet have an answer, we had better start thinking of one.

A golden opportunity

Our community would most certainly benefit from at least investigating whether this digital technology could enable language professionals to literally leave their mark on their work when creating target texts. Payment for reuse of work could also be facilitated using the so-called salami-slicer approach, in which fractional payments are received on a per-use basis. Remember, with enough slices, you end up with a whole salami! Ownership is easily transferable. We now possess the computing infrastructure to handle vast outputs of material, and to do so in close to real time. Because the ledger is distributed, back-ups are effectively built in and data integrity is well protected. Finally, this truly is a community-driven technology, and that should exert strong appeal to us all.
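The salami-slicer arithmetic is simple enough to sketch in a few lines of Python. The per-use rate and translator names here are invented for illustration; a real scheme would read reuse events from a shared ledger rather than a hand-written list.

```python
from collections import Counter
from decimal import Decimal

def settle_payments(usage_log, rate=Decimal("0.001")):
    """Aggregate per-use micropayments owed to each translator.

    usage_log is a list of translator names, one entry per recorded
    reuse of a segment they are credited with. Decimal avoids the
    rounding drift that floats would introduce over millions of slices.
    """
    uses = Counter(usage_log)
    return {name: rate * count for name, count in uses.items()}

# Four reuse events: alice's work was reused three times, bob's once.
log = ["alice", "bob", "alice", "alice"]
payouts = settle_payments(log)
```

Each individual payout is tiny, but the totals scale linearly with reuse, which is exactly the salami-slicer point: enough slices add up to a whole salami.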

In Stewart’s words, “I have only provided a sketch on the back of an envelope of an idea that really needs to be taken up at large. We need the input of academics, business and technical workers, linguists, the whole throng that works to deliver multilingual solutions to the world. Dismiss it out of hand if you wish, but there is quite likely a golden opportunity to overcome a number of obstacles to giving language services their rightful place in the world.” Walking away from blockchain without investigating further, while blue-chip companies such as IBM, Microsoft and Intel, to name but three of many, invest heavily in this new technology, would be short-sighted, and it would leave us languishing where we always have been: Cinderellas missing out on the grand ball.