E-disclosure: Breaching the Language Barrier

August 25, 2008

English may have become the dominant language of international business, but it is far from the only one. There are more than 2,200 written languages in use around the world today – the European Union alone has 23 official languages – and as globalisation brings business cultures closer together, the problems inherent in our modern-day Towers of Babel become more apparent. One area where linguistic differences create an acute problem is the review of electronic documents as part of a search process during litigation or a regulatory investigation.

Lawyers dealing with cases involving multi-national organisations may find themselves reviewing anything from 50,000 to a million documents. The complexity of this review process increases dramatically when the documents are in more than one language. In these situations, which are becoming increasingly common, dealing with multi-language files can have a serious impact on review time. As a result, many reviews now begin with an automated search to establish the language in which each document is authored.

Language searches use software that quickly scans and identifies the language of unstructured text within individual documents based on linguistic analysis of its word stems. By identifying the language in which a document is written at the outset, an appropriate review strategy can be deployed to route these documents to the correct location with the required expertise. The alternative is for reviewers to remove each foreign language document they come across from the main body of documents, collect them together and then re-sort them by language for review by native speakers. Depending on the size of the overall document population, this can be a time-consuming strategy for the review team.
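As a rough illustration of identify-then-route, the sketch below scores each document against small lists of very common words and sends it to a per-language review queue. This is a deliberately simpler technique than the word-stem analysis described above, and the word lists, function names and language codes are all assumptions made for the example rather than features of any particular product.

```python
import re
from collections import defaultdict

# Tiny illustrative word lists (assumption: commercial tools use far richer
# linguistic models than a handful of common words per language).
STOPWORDS = {
    "en": {"the", "and", "of", "to", "is", "in", "that", "for"},
    "fr": {"le", "la", "et", "les", "des", "est", "que", "pour"},
    "de": {"der", "die", "und", "das", "ist", "nicht", "mit", "für"},
}


def identify_language(text):
    """Return the language whose common words occur most often, or 'unknown'."""
    # [^\W\d_]+ matches runs of letters, including accented ones.
    words = re.findall(r"[^\W\d_]+", text.lower())
    scores = {
        lang: sum(word in common for word in words)
        for lang, common in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


def route_documents(docs):
    """Group document ids into one review queue per detected language.

    `docs` is a hypothetical mapping of document id -> extracted text.
    """
    queues = defaultdict(list)
    for doc_id, text in docs.items():
        queues[identify_language(text)].append(doc_id)
    return dict(queues)
```

Documents that land in the "unknown" queue would still need a human to look at them, but everything else can go straight to a reviewer with the right language skills.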

Here, there and everywhere…

The problem can be compounded in multi-national cases by the diverse distribution of documents in a particular language. Just because a server is located in a particular country, it does not automatically follow that all the documents contained on it will be written in that country’s mother tongue.

Given the inter-connectivity of multi-national companies’ IT systems, the repository of an organisation’s documents can be, in theory, anywhere in the world. Documents held in one country can often originate from many different places and be written in a variety of languages. When a dataset turns out to contain documents in unexpected languages, large amounts of time can be lost during a review. Running a language search at an early stage can quickly flag up this issue before valuable time is wasted by reviewers ploughing through documents they cannot understand.
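One way to surface this early is to tabulate detected languages by source server and flag anything other than the language the team expected. The sketch below assumes each document has already been labelled with a language by some upstream tool; the server names, language codes and function names are hypothetical.

```python
from collections import Counter, defaultdict


def language_profile(records):
    """Count documents per language for each server.

    `records` is an iterable of (server, language) pairs, one per document.
    """
    profile = defaultdict(Counter)
    for server, language in records:
        profile[server][language] += 1
    return dict(profile)


def unexpected_languages(profile, expected):
    """Return {server: {language: count}} for languages the team did not expect.

    `expected` maps each server to the single language assumed for it.
    """
    flagged = {}
    for server, counts in profile.items():
        extra = {
            lang: n for lang, n in counts.items()
            if lang != expected.get(server)
        }
        if extra:
            flagged[server] = extra
    return flagged


# Example with made-up data: one English document on a German server.
records = [("frankfurt", "de"), ("frankfurt", "en"),
           ("frankfurt", "de"), ("tokyo", "ja")]
profile = language_profile(records)
flagged = unexpected_languages(profile, {"frankfurt": "de", "tokyo": "ja"})
```

A report of this kind gives the review manager a quick view of which language specialists are needed before the first document is opened.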

Another common challenge is that documents (typically e-mails) can be written in more than one language when businesses operate in more than one region. It is not uncommon for e-mail correspondents to change language midway through a chain of messages or even to reply in alternating languages. The consequence is that a single document may be 30% in English and 20% in French, with the remaining 50% written in Japanese.

In these instances, language search technology can identify the predominant language in a document by scanning different parts of the text, and classify it accordingly, allowing an appropriate language or regional expert to come on board in the early stages of review.
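A minimal sketch of how a predominant language might be derived: split the document into segments (for example, the individual messages in an e-mail chain), classify each one, and weight the result by segment length. The per-segment classifier is passed in as a parameter, and `toy_classify` is a hypothetical stand-in included only so the example runs; it is not how a production tool would work.

```python
from collections import Counter


def language_shares(segments, classify):
    """Return {language: share of total characters} across the segments."""
    totals = Counter()
    for segment in segments:
        segment = segment.strip()
        if segment:
            totals[classify(segment)] += len(segment)
    grand_total = sum(totals.values())
    if not grand_total:
        return {}
    return {lang: n / grand_total for lang, n in totals.items()}


def predominant_language(segments, classify):
    """Return (predominant language, shares); 'unknown' if there is no text."""
    shares = language_shares(segments, classify)
    if not shares:
        return "unknown", shares
    return max(shares, key=shares.get), shares


def toy_classify(segment):
    """Hypothetical stand-in for a real per-segment language identifier."""
    # Hiragana and katakana occupy U+3040 to U+30FF; kanji-only text is missed.
    if any("\u3040" <= ch <= "\u30ff" for ch in segment):
        return "ja"
    if any(w in ("le", "la", "est", "et") for w in segment.lower().split()):
        return "fr"
    return "en"
```

Weighting by character count is a simplification, since scripts such as Japanese convey more per character than English; a real tool might weight by words or sentences instead. Returning the full set of shares alongside the winner also makes it possible to flag a document as multi-language.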

While the savings from running a language search may be marginal for smaller projects, when there are thousands of documents involved, the benefits can be substantial. Being able to identify and extract documents in the same language can save significant amounts of review time and produce substantial cost savings. Last year, Epiq Systems processed a large investigation for a major international law firm. The data was harvested from a variety of servers across Asia and Europe. It was estimated that the ability to identify the predominant language at the point of extraction saved in excess of 100 hours of legal review time.

A question of character

Language searching is a relatively new tool in the armoury of the international litigator. The ability to undertake comprehensive language searches has been enabled by the support of the Unicode set of characters, which provides a uniform way for software to encode and understand data irrespective of the language in which it is written. This is important because, in addition to a proliferation of languages, there is also a proliferation of character sets, such as Arabic, Chinese, Japanese, Cyrillic and Devanagari (used for Hindi, amongst others).

In the past, identifying the language of documents using non-European characters was difficult because software tended to rely on ASCII codes to recognise characters. Standard ASCII defines only 128 characters, and even its 8-bit extensions reach only 256, which is enough for Romanised alphabets but little else. The Unicode standard, by contrast, has room for more than a million code points, enabling it to represent all of the major non-Roman scripts. The implementation of Unicode character sets in software and documents is not yet universal; when selecting language search software, it is important to ensure that it is not only Unicode compliant, but also that it is compatible with the other code sets used for non-Romanised languages. Where this is the case, by combining Unicode with algorithms that compare strings of characters, almost all languages can now be identified very quickly, regardless of the character set that they use. Not only does the processing software require Unicode compliance, but the document review tool must enable the review and search of documents in their native language. For example, searching for an English word within the text of a Chinese document will not return that document as part of the result set.
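To illustrate why Unicode matters here, the sketch below uses the character names in Python's standard `unicodedata` module to tally which scripts appear in a piece of text, and shows a simple check for text that stays within the 128-character ASCII range. The list of script prefixes and the function names are assumptions for the example; identifying a script is only a first step, since many languages share one script.

```python
import unicodedata
from collections import Counter

# Prefixes of Unicode character names for a few common scripts (illustrative,
# not exhaustive).
SCRIPT_PREFIXES = (
    "LATIN", "CYRILLIC", "ARABIC", "DEVANAGARI", "HIRAGANA",
    "KATAKANA", "CJK", "HANGUL", "GREEK", "HEBREW",
)


def script_of(ch):
    """Return the script prefix of a character's Unicode name, or 'OTHER'."""
    name = unicodedata.name(ch, "")
    for prefix in SCRIPT_PREFIXES:
        if name.startswith(prefix):
            return prefix
    return "OTHER"


def script_counts(text):
    """Count alphabetic characters per script, ignoring digits and punctuation."""
    return Counter(script_of(ch) for ch in text if ch.isalpha())


def is_ascii_only(text):
    """True if every character falls within the 128 code points of ASCII."""
    return all(ord(ch) < 128 for ch in text)
```

A document whose characters are overwhelmingly Cyrillic or Devanagari can be routed to the right regional team on that basis alone, while Latin-script documents still need a language-level check to tell English from French or German.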


Language identification is a new technology which has, until recently, been relatively expensive to deploy. This is changing, however, especially as the number of regulatory investigations has been growing strongly. Litigation cases often involve a smaller set of documents and lawyers usually know what they are looking for. Investigations, on the other hand, are generally much more wide-ranging – important documents can be held at numerous locations worldwide and investigators need to examine as broad a range of documents as possible.

In these situations, the benefits of language search become very apparent, and this is a technology that looks set to become an indispensable part of search and review workflows in the future.

How to avoid the typical pitfalls

1. Run a language scan before embarking on the document review. For large-scale reviews, this can save hundreds of hours in review time.
2. Ensure that the language search technology can handle Unicode characters and characters from other non-ASCII code sets so that documents with all encodings can be read.
3. Make sure that the search technology you use can also flag documents that contain more than one language and identify which language is predominant.
4. Don’t assume that all the documents on a server will be in its host country’s language. Run a language search across foreign servers – you may be surprised by the results.

Julian Uebergang is an Executive Director at Epiq Systems, a specialist E-Disclosure company and a leading global provider of technology solutions for the legal profession: www.epiqsystems.co.uk/home.php