How Unicode enabled eBay to create a global platform

Pierre Omidyar’s first foray into e-commerce began less as a business venture than as an experiment. In 1995, curious about the effects of information technology on markets, Omidyar spent a long September weekend writing software for a new kind of online trading platform. He launched the software as part of his personal web page, and the first item sold was a broken laser pointer. After that, things started to get interesting.

Today eBay, the company that grew out of Omidyar’s experiment, is a global online marketplace where practically anyone can trade practically anything. In 1996, its first year of operation under the eBay brand, its users traded merchandise worth an estimated $95 million. Today, it maintains a presence in 39 markets worldwide, with 84 million active users trading in excess of $60 billion in merchandise each year. That’s roughly $1,900 every second.

But global expansion isn’t easy, even on the web. By 2002, eBay’s information systems were straining to keep up with its growth. eBay had established marketplaces in countries as diverse as Australia, Canada, France, Germany, Italy, Switzerland, and Taiwan, but while its business had grown, its infrastructure had not kept pace. Having begun life as a US-only site, eBay worked well with text written in English and other Western European languages. Asian languages, however, presented unanticipated challenges. When eBay began expanding into Asian markets, the only way it could store Asian textual data was to use software workarounds, the digital equivalent of duct tape, which were both cumbersome and difficult to maintain.

If eBay wanted to repeat its stateside successes on a global scale, it needed to globalize its systems. While you can find more than 250 books on how to run a business using eBay, there’s no definitive guide to reengineering a multibillion-dollar internet business for the global economy. Transforming eBay into a truly global marketplace would take careful planning, meticulous execution, and thousands of person-hours of labor. Today, nearly six years since eBay’s UTF-8 migration project began, its experience stands as a model of success for any large enterprise.

The characters of nations

Inside the computer, it didn’t really matter whether eBay’s users spoke English, French, Korean or Chinese. Computers are fundamentally logical machines; numbers are their only language. As such, before a computer can store or manipulate textual data, the text must first be encoded as a sequence of numbers, but not all character encodings are created equal. In the early days of eBay, the de facto character encoding for most computer software was ISO 8859-1. By default as much as by design, the eBay software output ISO 8859-1 text, accepted ISO 8859-1 input, and stored information in databases configured for ISO 8859-1. At the time, it was a perfectly suitable choice. Using one byte of storage per character, ISO 8859-1 specifies numeric values, called code points, for the letters A through Z, the numerals 0 through 9, and most common symbols and punctuation marks, including accented letters and diacriticals. It’s mostly sufficient to encode textual data in English or any other Western European language.
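To make the idea of code points concrete, here is a minimal Python sketch of ISO 8859-1 at work. The sample string is an illustrative assumption, not eBay data; the point is only that each character maps to exactly one stored byte.

```python
# Illustrative sample text; not taken from eBay's systems.
text = "Café"

# ISO 8859-1 (Python codec name "iso-8859-1") stores one byte per character.
encoded = text.encode("iso-8859-1")

code_points = [ord(ch) for ch in text]  # numeric value of each character
byte_values = list(encoded)             # the bytes actually stored

print(code_points)  # [67, 97, 102, 233]
print(byte_values)  # [67, 97, 102, 233]
```

For this encoding the code point and the stored byte are the same number, which is why software written for ISO 8859-1 can treat "character" and "byte" as interchangeable.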

When eBay expanded into Asia, however, ISO 8859-1 became a liability. Because the standard specifies just one byte per character, code points in ISO 8859-1 can only have values ranging from zero to 255. But unlike European languages, which are written using compact alphabets, Asian scripts can comprise tens of thousands of characters. No single-byte encoding could ever hope to do the job.
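The one-byte ceiling is easy to demonstrate. The sketch below uses a hypothetical helper, `fits_iso_8859_1`, and made-up sample strings to show that a single byte offers only 256 possible values and that text outside that repertoire simply cannot be encoded.

```python
# One byte has 8 bits, so a single-byte encoding has at most 2**8 values.
capacity = 2 ** 8
print(capacity)  # 256


def fits_iso_8859_1(text: str) -> bool:
    """Hypothetical helper: True if every character exists in ISO 8859-1."""
    try:
        text.encode("iso-8859-1")
        return True
    except UnicodeEncodeError:
        return False


print(fits_iso_8859_1("Größe"))   # True: German fits in one byte per character
print(fits_iso_8859_1("日本語"))   # False: Japanese characters are out of range
```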

The modern standard for handling multi-byte character sets is Unicode. Billed as the universal character-encoding scheme for written characters and text, the current version of the Unicode standard defines code points for over 100,000 characters from the world’s scripts (including the Western European alphabets, Chinese, Japanese, Korean, Hindi, and Arabic), and the list is steadily growing. In fact, Unicode’s total potential capacity is over a million code points, which is more than enough to encode the characters of every script known throughout history. It was only natural that eBay chose Unicode as the character encoding for its Asian marketplaces. Specifically, it adopted a form of Unicode called UTF-8, which provides maximum compatibility with single-byte-oriented software designed for ASCII.

But compatibility only goes so far. Back in eBay’s flagship US marketplace, ISO 8859-1 still ruled, as it did in Canada, the United Kingdom, Australia, and throughout Western Europe. To eBay’s databases and administrative systems, all of which were designed to work with ISO 8859-1 data, the UTF-8 data generated by eBay’s Asian marketplaces was indecipherable.
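UTF-8’s compatibility with ASCII, and its limits, can be seen in a few lines of Python. This is a sketch with arbitrary sample characters and a hypothetical helper name, `utf8_length`: ASCII text produces identical bytes in both encodings, while everything else takes two or more bytes per character.

```python
def utf8_length(ch: str) -> int:
    """Hypothetical helper: number of bytes UTF-8 uses for one character."""
    return len(ch.encode("utf-8"))


samples = ["A", "é", "€", "語"]
lengths = {ch: utf8_length(ch) for ch in samples}
print(lengths)  # {'A': 1, 'é': 2, '€': 3, '語': 3}

# Pure ASCII text is byte-for-byte identical in UTF-8.
ascii_text = "eBay"
same_bytes = ascii_text.encode("utf-8") == ascii_text.encode("ascii")
print(same_bytes)  # True
```

Note that "é" is one byte in ISO 8859-1 but two in UTF-8, so the two encodings agree only on the ASCII range.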

Round holes, square data

As a workaround for the encoding problem, eBay adopted a data storage policy for its Asian marketplaces that amounted to “don’t ask, don’t tell.” When eBay’s Asian sites sent textual data to the database, no attempt was made to verify that the data was valid ISO 8859-1. When the Asian sites received data, they interpreted it as UTF-8, even though the database was labeled to the contrary. They assumed that what the database didn’t know couldn’t hurt it.

Fooling a database is risky business. Storing both ISO 8859-1 and UTF-8 in the same database increased the complexity of eBay’s backend systems, making them more difficult and therefore more costly to design, develop and maintain. This didn’t just affect eBay. Third-party developers had to modify their own software to support both encodings before they could integrate with eBay’s systems. Worse, eBay was deliberately short-circuiting measures designed to preserve data integrity. Any mistakes could potentially lead to widespread data corruption.

Equally important, a mixed-encoding environment was an impediment to eBay’s business growth. As long as its Asian sites were generating textual data that its US and Western European sites couldn’t display properly, eBay would never be able to facilitate true cross-border trade. For example, item descriptions with Asian characters submitted by sellers in Asian marketplaces were stored in UTF-8, which was not supported by the character encoding of European marketplaces. Even if the sellers wrote in English, some Unicode characters, the euro symbol for example, cannot be represented in strict ISO 8859-1 and would appear garbled.
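The “don’t ask, don’t tell” arrangement can be imitated in Python, whose `iso-8859-1` codec happens to accept every byte value without complaint. The sketch below is an illustration with a made-up sample string, not a reconstruction of eBay’s code. It shows why the trick worked for readers who knew the secret and produced gibberish for everyone else.

```python
original = "日本語"

# The Asian site sends UTF-8 bytes to a database labeled ISO 8859-1.
stored = original.encode("utf-8")

# A reader that trusts the label sees one junk character per byte.
misread = stored.decode("iso-8859-1")
print(len(original), len(misread))  # 3 9

# A reader that knows the secret recovers the text intact.
recovered = stored.decode("utf-8")
print(recovered == original)  # True

# The euro sign has no code point in ISO 8859-1 at all.
try:
    "€".encode("iso-8859-1")
    euro_fits = True
except UnicodeEncodeError:
    euro_fits = False
print(euro_fits)  # False
```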

The obvious solution was to standardize on a single character encoding model. If eBay converted all of its systems to UTF-8, including those sites based on Western European languages, it would put all of its worldwide marketplaces on equal footing. For a company of eBay’s size, this was easier said than done. It meant converting hundreds of terabytes of data, scattered across thousands of columns, in hundreds of database tables, stored on more than a hundred database servers. For a long time, eBay debated the idea. Such a project would be both costly and incredibly invasive. Still, the longer eBay waited to begin the conversion to UTF-8, the more ISO 8859-1 data it would accumulate, and the longer the conversion would take.

Then something happened to force eBay’s hand. Without warning, Oracle issued a patch that changed the behavior of its database drivers. The new drivers mandated strict ISO 8859-1 compliance: anything sent to an ISO 8859-1 database that wasn’t valid ISO 8859-1 data would be thrown out. Because UTF-8 makes use of byte values that are undefined in the ISO 8859-1 standard, attempts to pass off UTF-8 data as ISO 8859-1 would corrupt the data. Although this was technically correct behavior, it undermined the “don’t ask, don’t tell” workaround. eBay’s Asian marketplaces could soon grind to a halt.
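How the patched drivers actually validated their input is not described here, so the following is only a sketch of what a strict check might look like. It assumes a hypothetical function, `is_strict_iso_8859_1`, that rejects the byte values 0x80 through 0x9F, a range to which the ISO 8859-1 standard assigns no graphic characters but which UTF-8 uses freely inside multi-byte sequences.

```python
def is_strict_iso_8859_1(data: bytes) -> bool:
    """Hypothetical strict check (an assumption, not Oracle's logic):
    reject bytes 0x80-0x9F, which ISO 8859-1 leaves without characters."""
    return not any(0x80 <= b <= 0x9F for b in data)


genuine = "Café".encode("iso-8859-1")   # real ISO 8859-1 data
disguised = "日本語".encode("utf-8")     # UTF-8 passed off as ISO 8859-1

print(is_strict_iso_8859_1(genuine))    # True
print(is_strict_iso_8859_1(disguised))  # False
```

A check like this is not airtight: some UTF-8 sequences, such as the two bytes for "é", consist entirely of values that are also legal in ISO 8859-1. That is part of why mixed-encoding data is so hard to clean up after the fact.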

Frantic eBay engineers convinced Oracle to issue a custom patch to fix the problem, but it was only a temporary solution. Future versions of the drivers would continue to enforce ISO 8859-1 compliance. If eBay wanted to store UTF-8 data, it would have to store it in UTF-8 databases. Effectively, the clock was ticking. eBay had no choice but to begin the process of migrating to an all-Unicode environment.

A conversion conundrum

Life would have been much simpler if eBay could have shut down its servers over a three-day weekend, upgraded them to support UTF-8, and then started them up again on Tuesday morning. Unfortunately, this was impossible; eBay had come a long way since Omidyar’s experiment. Converting a database to UTF-8 is simple enough, but converting a database that powers a 24/7 global marketplace is not. As a matter of operational policy, eBay’s systems must maintain a minimum of 99.94% uptime or less than 25 minutes of outage per month. Even a major upgrade like the UTF-8 migration would have to meet this requirement.

A more typical customer might have used Oracle’s standard conversion scripts to copy its data into new UTF-8 databases. But Oracle’s scripts assume that all of the data in the source database is ISO 8859-1, whereas eBay’s databases contained a mix of ISO 8859-1 and UTF-8. Moreover, this method requires a second, mirror-image database to copy the data into. Even if eBay wrote its own scripts, the cost of the equipment needed to duplicate its entire data store would have been stratospheric.

The only way to complete the transition to UTF-8 successfully was to migrate eBay’s databases in place, while they were still serving data to the site. There would be no downtime during the process of converting text from ISO 8859-1. Only after all of the data in a given database was converted would the database be taken offline, reconfigured to operate as a native UTF-8 database, and restarted. Throughout the entire migration process, this reboot cycle would be the only step that would affect the site’s availability. Total downtime for each database would be roughly 20 minutes.
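The uptime requirement translates into a concrete monthly allowance. As a rough sketch (the function name `downtime_budget_minutes` is made up for illustration), a 99.94% target works out to somewhere between 24 and 27 minutes of permitted outage depending on the length of the month, in the same range as the figure quoted above.

```python
def downtime_budget_minutes(days_in_month: int, uptime: float = 0.9994) -> float:
    """Minutes of outage allowed in a month at the given uptime fraction."""
    minutes_in_month = days_in_month * 24 * 60
    return minutes_in_month * (1.0 - uptime)


for days in (28, 30, 31):
    print(days, round(downtime_budget_minutes(days), 2))
```

Against a budget of that size, a reboot of roughly 20 minutes leaves very little slack, which is why the conversion itself had to run while the databases stayed online.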

The core team responsible for executing this plan would be relatively lean, led by the chief globalization architect and including a development manager, a project manager, a development lead, two or three developers, a database administration lead, and a quality assurance (QA) lead. If this core team was small, however, it would ultimately collaborate with hundreds of others across the development, QA and operations organizations throughout the course of the project. The transition to Unicode would impact virtually every aspect of eBay’s business. From here onward, the transition team would tread very carefully.

Planning for change

Even before the core transition team was assembled, the first six months were devoted entirely to planning and preparation. During this initial period, architects and domain experts conducted a thorough survey and analysis of eBay’s data. This was important for two reasons. First, not every database column would need to be converted. Numeric data and other non-textual data, such as dates, could be ignored, as could any data that was already encoded as UTF-8. Second, some code points that are defined as single bytes in ISO 8859-1 are represented by multiple bytes in UTF-8. This meant that some textual data would take up more space after conversion, which in turn could increase the total amount of physical storage required by the database. The migration team would need to coordinate with eBay’s operations division to ensure that sufficient storage resources would be available to complete the conversion.
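The storage question can be estimated directly. In the sketch below, `utf8_growth` is a hypothetical helper and the sample values are invented. Every ISO 8859-1 character above the ASCII range (byte values 0x80 to 0xFF) takes two bytes in UTF-8, so the growth of a field equals the number of such characters it contains.

```python
def utf8_growth(latin1_bytes: bytes) -> int:
    """Extra bytes needed once ISO 8859-1 data is re-encoded as UTF-8."""
    text = latin1_bytes.decode("iso-8859-1")
    return len(text.encode("utf-8")) - len(latin1_bytes)


plain = "Laser pointer".encode("iso-8859-1")
accented = "Müller Straße".encode("iso-8859-1")

print(utf8_growth(plain))     # 0: pure ASCII does not grow
print(utf8_growth(accented))  # 2: one extra byte each for "ü" and "ß"
```

Summing a figure like this over a sample of each text column would give a first estimate of the extra space to request, though the survey method eBay actually used is not described here.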

An even bigger challenge lay in knowing which data was ISO 8859-1 and which was UTF-8, once conversion had begun. Previously, eBay had determined the correct encoding on a site-by-site basis. Data for North American and Western European sites was treated as ISO 8859-1 while all other sites’ data was processed as UTF-8. For the duration of the conversion process, however, eBay’s North American and Western European sites would be serving a mixture of both encodings. If a user’s browser received data that it wasn’t expecting, the result would be garbled characters. The solution was to add a flag to every row in the database that was to be converted, indicating the row’s encoding method. The magnitude of this step cannot be overstated; it meant adding new columns to literally thousands of tables. So significant was the impact of this procedure, in fact, that eBay’s database administrators initially assumed the request was a joke.

In turn, every piece of software that accessed the modified tables had to be made aware of the new flag and how to interpret it. To minimize risk, the code changes were deployed in a methodical manner. First, a version of the new code was pushed out to the servers with all new features disabled, to ensure that it functioned as a perfect drop-in replacement for the old code. Only after everything checked out in the production environment for a period of time would the new capabilities be activated to support the data conversion. Once data conversion began, the application servers would be able to process the encoding of data using either the old or the new method, as appropriate.

The migration begins

Once preparations were complete, eBay was ready to begin data conversion. Timing was essential. The less impact the transition had on eBay’s operations, the better. It was decided that the process would begin during the week before Christmas in 2004, traditionally the week that’s as close to a lull period as eBay ever gets.

Database conversion would proceed in four phases. Phases 1 and 2 would transition all of eBay’s most essential databases, including item, user and feedback data. Phases 3 and 4 would repeat the process for all the remaining databases, including those used for back-end operations and administrative tasks.
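The per-row encoding flag introduced during planning is what let application code read a half-converted table safely. A minimal sketch of the idea follows. The `Row` type, its field names, and `read_text` are all hypothetical, since eBay’s actual schema is not described here.

```python
from dataclasses import dataclass


@dataclass
class Row:
    raw_text: bytes  # bytes exactly as stored in the text column
    is_utf8: bool    # hypothetical per-row encoding flag


def read_text(row: Row) -> str:
    """Decode a row according to its own flag, not a site-wide guess."""
    encoding = "utf-8" if row.is_utf8 else "iso-8859-1"
    return row.raw_text.decode(encoding)


old_row = Row(raw_text="Café".encode("iso-8859-1"), is_utf8=False)
new_row = Row(raw_text="Café".encode("utf-8"), is_utf8=True)

print(read_text(old_row) == read_text(new_row))  # True
```

Because each row carries its own label, converted and unconverted rows can sit side by side in the same table for as long as the migration takes.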

The first step of each phase was to get eBay’s data under control. A system-wide flag was added to indicate that no more records should be created using ISO 8859-1 encoding to “stop the bleeding.” Then, once new data was being generated in UTF-8 only, the process of converting the existing data could begin. One by one, each ISO 8859-1 field was translated into UTF-8, and its conversion-status flag was updated to reflect the change. After the last record was converted, the migration team examined the database one last time to ensure that nothing had been overlooked. If the database appeared to have been successfully scrubbed clean of ISO 8859-1, the team scheduled the reboot and reconfiguration cycle that would change the database’s encoding label from ISO 8859-1 to UTF-8. When the database came back online 20 minutes later, it would be Unicode from top to bottom.

There were challenges along the way. In practice, eBay’s data-conversion procedure worked almost too well. The data translation tools had been designed to finish the job as quickly as possible. But on an ordinary business day, eBay’s database servers might process some 462,000 SQL queries per second at peak volume. The additional load imposed by the data conversion process proved to be too much. The migration team was forced to make last-minute adjustments that allowed the tools to monitor the load of the database and dynamically adjust the speed of the conversion, to avoid crippling eBay’s marketplaces with database updates.

All in all, however, eBay’s meticulous planning paid off. Over the next seven months, eBay converted roughly 18 billion database records to UTF-8, spread across some 100 separate database hosts. Site uptime throughout the conversion period was greater than 99.94%. In other words, it exceeded eBay’s operational standards. The database migration process, although exhausting and highly invasive, had been a success.
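The two mechanics described in this section, converting a record while updating its flag and slowing down when the database is busy, can be sketched as follows. Everything here is an assumption made for illustration: the dictionary layout, the load threshold, and the halve-or-grow policy are not eBay’s actual tooling.

```python
def convert_row(row: dict) -> dict:
    """Translate one ISO 8859-1 row to UTF-8 and set its status flag."""
    if row["is_utf8"]:
        return row  # already converted; converting twice would garble it
    text = row["raw"].decode("iso-8859-1")
    return {**row, "raw": text.encode("utf-8"), "is_utf8": True}


def next_batch_size(current: int, load: float, max_load: float = 0.7,
                    smallest: int = 10, largest: int = 1000) -> int:
    """Hypothetical throttle: back off sharply under load, recover slowly."""
    if load > max_load:
        return max(smallest, current // 2)
    return min(largest, current + current // 4 + 1)


row = {"id": 1, "raw": "Café".encode("iso-8859-1"), "is_utf8": False}
once = convert_row(row)
twice = convert_row(once)
print(once == twice)  # True: the flag makes conversion safe to repeat

print(next_batch_size(100, load=0.9))  # 50
print(next_batch_size(100, load=0.3))  # 126
```

The flag check in `convert_row` matters as much as the throttle. Without it, a restarted job could re-encode rows that were already UTF-8 and corrupt them.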

Migrating the user interface

After completing Phases 1-4, eBay took a long-deserved break from the migration, but while its focus was elsewhere, eBay’s transformation into a Unicode-based platform was not yet complete. Phases 1-4 of the migration process had concerned the back-end databases only. The customer-facing user interface had remained untouched. This meant that, even after every one of eBay’s database tables had been converted to UTF-8, eBay’s North American, European and Australian sites were still serving pages in ISO 8859-1. It would not be until 2007 that the migration team would finally regroup to finish the job.

The problem with the sites whose HTML was still encoded in ISO 8859-1 was twofold. First, web browsers only know how to display one character encoding at a time. If two or more encodings are combined on the same page, some text will inevitably appear garbled to the user. Thus, for eBay’s ISO 8859-1 websites to render correctly, they needed to receive ISO 8859-1 data from the databases, which now contained nothing but UTF-8. Second, web-based user interfaces go in both directions. If a page is served using the ISO 8859-1 encoding, any textual data submitted via form fields or text areas on that page will also come back to the server as ISO 8859-1.

As a practical matter, these issues had little impact on the status quo of the site. As part of its pre-migration preparations, eBay had built a translation layer above the database that could convert from UTF-8 to ISO 8859-1 and back again. To the sites’ users, nothing seemed to have changed. In the bigger picture, however, the ISO 8859-1 sites were not living up to their full potential. Because their user interfaces were still limited to displaying Western European languages, they were not yet truly globalized.
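A translation layer of the kind described might look something like the sketch below. The function names are invented, and the handling of characters that ISO 8859-1 cannot represent is an assumption: here they are emitted as HTML numeric character references, which is one common approach but not necessarily the one eBay took.

```python
def to_page_encoding(db_bytes: bytes) -> bytes:
    """UTF-8 from the database -> ISO 8859-1 for an ISO 8859-1 page.
    Characters outside ISO 8859-1 become references such as &#8364;."""
    text = db_bytes.decode("utf-8")
    return text.encode("iso-8859-1", errors="xmlcharrefreplace")


def from_form_submission(form_bytes: bytes) -> bytes:
    """ISO 8859-1 form data from the browser -> UTF-8 for the database."""
    return form_bytes.decode("iso-8859-1").encode("utf-8")


page = to_page_encoding("Café €".encode("utf-8"))
print(page)  # b'Caf\xe9 &#8364;'

stored = from_form_submission(b"Caf\xe9")
print(stored == "Café".encode("utf-8"))  # True
```

A layer like this keeps existing pages working, but it also shows the limitation the article points out: anything outside the Western European repertoire survives only as an escape sequence, not as native text.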

Eliminating this final barrier to cross-border trade was the goal of Phase 5 of eBay’s globalization effort. Every user-facing aspect of every site was converted from ISO 8859-1 encoding to UTF-8 during this phase. As before, not everything was converted at once, with necessary code changes rolled out in measured steps throughout eBay’s infrastructure. Only in this case, the new features would be enabled on a site-by-site basis, beginning with eBay’s least-trafficked marketplaces and progressing to higher-volume ones. Austria would be converted before France, for example, which would be converted before Germany. The US site, eBay’s highest-volume marketplace, would go last.

The user-interface transition involved more than just converting page templates. The migration team also had to be on the lookout for application code that contained hardcoded ISO 8859-1 text or numeric values, in addition to code that assumed every character would be exactly one byte long.

Fortunately, the team had learned a valuable lesson from its experience with Phases 1-4. Much of that early work had seemed almost like driving at night with no headlights. Before any work on Phase 5 began, the team developed monitoring tools that would allow it to examine user requests and pages served in real time. This way, whenever inconsistencies were discovered, the team could react quickly to correct them before they snowballed into a larger problem.
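Code that assumes one byte per character tends to fail in a specific way under UTF-8: it cuts a string at a byte boundary that falls in the middle of a character. The sketch below illustrates the hazard and one way around it. `truncate_utf8` is a hypothetical helper and the byte limits are arbitrary.

```python
def truncate_utf8(text: str, max_bytes: int) -> str:
    """Trim text to at most max_bytes of UTF-8 without splitting a character."""
    clipped = text.encode("utf-8")[:max_bytes]
    # errors="ignore" drops a trailing partial character, if any.
    return clipped.decode("utf-8", errors="ignore")


title = "Café"
print(len(title))                  # 4 characters
print(len(title.encode("utf-8")))  # 5 bytes

# A naive 4-byte cut would slice "é" in half; the helper drops it instead.
print(truncate_utf8(title, 4))     # Caf
print(truncate_utf8("日本語", 4))   # 日
```

Field-length checks, column widths, and substring operations are the usual places where this assumption hides.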

The new, global eBay

Today, a page served by any of eBay’s sites worldwide will be UTF-8 from end to end. On the back end, user, item, and feedback data is stored as UTF-8 in the database. All user-interface elements are served as UTF-8 encoded HTML on the front end.

For eBay, the immediate benefits of the migration to UTF-8 were improved system stability and reduced development and maintenance costs. Before the migration, eBay’s Asian sites were forced to store their Unicode data in a database designed only for ISO 8859-1. That meant they risked data corruption and were vulnerable to vendor software changes, such as Oracle’s driver update. Moreover, eBay had to remember to make exceptions for UTF-8 whenever it published changes to its own software, as it does every two weeks. Now that all of its systems store and process their data in UTF-8, eBay’s software development cycles are much more streamlined and straightforward. What’s more, because Unicode includes enough code points to represent every known script, past and present, eBay will never need to undertake such an ambitious project again. The new system will be able to accommodate whatever new languages eBay chooses to support in the future.

eBay’s globalization efforts are far from over. Now that the data-encoding hurdles have been overcome, the next challenge is the language barrier. eBay’s long-term goal is to become a truly multilingual marketplace, where buyers and sellers can select the language of their choice for their user interface and content to be presented, no matter which site they access. eBay’s globalization engineers are currently exploring various means of facilitating this, now that the transition to UTF-8 has opened the door.

Even more exciting, however, are the new opportunities for commerce that have been made possible now that eBay’s UTF-8 migration effort is complete. The door has been opened for one of eBay’s most important long-term goals: cross-border trade, in which sellers in any country can offer goods for trade via any of eBay’s marketplaces worldwide. Thanks to the benefits of the migration, eBay is now poised to evolve from being a business that has a global trade presence to one that offers a global trading platform.