Why you need to care about data quality during discovery

March 24, 2015

Litigators tend not to worry about the quality of data in discovery. After all, discovery is just a duty courts make you perform for the other side. As a result, many regard the pre-review phases of discovery as low-value, straightforward operations with little bearing on the outcome of the case. But while you may be performing these steps out of obligation, they are also how you come to know your own case. That’s why any data quality issues that go unresolved can have serious impacts down the line.

Data quality problems can increase the cost, time, and effort of eDiscovery, especially when data must be reprocessed and re-reviewed. But it’s the risk of missed information that can be the most costly. If the other side finds something you failed to produce, you’ll be blindsided. It may look like you’re hiding evidence. Moreover, you might overlook helpful witnesses and crucial evidence for your own case. Come before a judge who understands data quality issues, and he or she could demand a do-over or impose costly curative measures. 

Even with the best intentions, quality issues can be hard to detect, especially in eDiscovery, where lawyers may lack the technical knowledge to spot or correct errors. Problems usually come to light only when the opposition finds something you’ve missed. If you’re lucky, your opponent will never know what you’ve failed to tell them. But do you want to bet the company on luck alone?

Mistakes aren’t limited to small ‘mom and pop’ operations either; even the most sophisticated and prestigious providers can make them if they fail to focus on the quality and integrity of their work.

Understanding the nuances of data

Data quality issues don’t have to be intimidating. When problems crop up during processing, there is usually a logical reason. Here are some of the most common issues that can occur during processing:

Source data problems

Certain types of source data pose a challenge for some eDiscovery processing tools, generally because of the difficulties of complex files such as email databases. Uncommon file formats, corrupt files and complex file types are just some of the areas where problems can occur.

Let’s take uncommon file formats as an example. Organisations use a variety of email applications, from Lotus Notes to Apple Mail, but often eDiscovery tools can process only Microsoft Outlook PST files. Ideally, an eDiscovery tool should be able to directly read all common email file formats (and some uncommon ones). It should also have built-in knowledge of those formats and handle everyday problems that can crop up.
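One practical way a tool can build in that knowledge is to identify email stores by their file signatures rather than trusting file extensions. The Python sketch below is illustrative only: the PST and OLE2 signatures are well documented, but the mapping is nowhere near exhaustive, and the file name is hypothetical.

```python
# Known file signatures ("magic bytes") for common email stores.
# Illustrative, not exhaustive.
SIGNATURES = {
    b"!BDN": "Microsoft Outlook PST/OST",
    b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1": "OLE2 container (e.g. Outlook MSG)",
    b"From ": "mbox mailbox (used by many Unix mail clients)",
}

def sniff_email_format(path: str) -> str:
    """Identify an email store by its leading bytes, not its extension."""
    with open(path, "rb") as f:
        header = f.read(8)
    for magic, label in SIGNATURES.items():
        if header.startswith(magic):
            return label
    return "unknown - route to manual triage"

if __name__ == "__main__":
    print(sniff_email_format("archive.pst"))  # hypothetical file
```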

Forensic images can also cause processing issues. Some organisations make forensic copies of evidence sources in the belief this is the only ‘forensically sound’ method acceptable to courts. However, very few eDiscovery tools can directly interpret these forensic disk images. In this case, the discovery team, or the tool itself, must take the additional step of exporting data from the forensic image into a readable format prior to processing.
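For teams that must unpack a forensic image themselves, the open-source libewf and Sleuth Kit projects offer one route. The sketch below assumes their Python bindings (pyewf and pytsk3) are installed and follows the widely used adapter recipe for reading an EnCase E01 image so its contents can be listed, and then exported, for tools that cannot read E01 directly. The image name and the filesystem offset are assumptions that depend on the actual evidence.

```python
import pyewf
import pytsk3

class EwfImgInfo(pytsk3.Img_Info):
    """Present an EWF (E01) handle to The Sleuth Kit as a raw image."""

    def __init__(self, ewf_handle):
        self._handle = ewf_handle
        super().__init__(url="", type=pytsk3.TSK_IMG_TYPE_EXTERNAL)

    def close(self):
        self._handle.close()

    def read(self, offset, size):
        self._handle.seek(offset)
        return self._handle.read(size)

    def get_size(self):
        return self._handle.get_media_size()

segment_files = pyewf.glob("evidence.E01")  # hypothetical image name
ewf_handle = pyewf.handle()
ewf_handle.open(segment_files)

# The filesystem offset depends on the image's partition layout;
# 0 works for a bare filesystem image.
fs = pytsk3.FS_Info(EwfImgInfo(ewf_handle), offset=0)
for entry in fs.open_dir(path="/"):
    print(entry.info.name.name.decode("utf-8", errors="replace"))
```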

It’s also important to consider how mobile, social and online data is processed. For example, conversations started on company email often move to SMS or personal email if the content becomes sensitive. Most eDiscovery tools can’t handle data from mobile devices or cloud services, but without this data, legal review teams are likely to have an incomplete picture of communications. Advanced discovery tools can directly process these formats or import the data from mobile device forensic imaging software. Combining all communications into a single timeline gives legal teams the full picture.
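The ‘single timeline’ idea is simple to express in code: normalise messages from every channel to one record shape, then sort by timestamp. The Python sketch below is a toy illustration; the field names and sample messages are invented.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Message:
    sent: datetime
    channel: str   # "email", "sms", "chat", ...
    sender: str
    text: str

def unified_timeline(*sources):
    """Merge message lists from any number of channels into one timeline."""
    return sorted((m for src in sources for m in src), key=lambda m: m.sent)

emails = [Message(datetime(2015, 3, 2, 9, 15), "email",
                  "alice@corp.example", "Let's discuss offline.")]
texts = [Message(datetime(2015, 3, 2, 9, 22), "sms",
                 "+15551230000", "Call me - not on email.")]

for m in unified_timeline(emails, texts):
    print(m.sent.isoformat(), m.channel, m.sender, m.text)
```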

Hidden risks

The complexity of electronically stored information may also conceal risks to the litigant and its legal service providers.

For example, documents and email attachments may contain malware, which can be activated when the files are opened in a native viewer or through an application programming interface (API). Email and documents may also contain high-risk data such as private information or health information that is subject to data protection legislation. Often employees make ‘convenience copies’ of this data, for instance so that they can work on it from home. As a general rule, lawyers only look for and redact privileged information, not sensitive data. As a result, the discovery process may inadvertently breach privacy laws.
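One mitigation is to scan text after it has been extracted, so no native application ever opens the file, for patterns that suggest sensitive data. The Python sketch below is illustrative only: two regular expressions are nowhere near real PII/PHI detection, and the sample text is invented.

```python
import re

PATTERNS = {
    "US SSN-like number": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email address": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def flag_sensitive(text: str):
    """Return (label, match) pairs for every hit in already-extracted text."""
    return [(label, m.group()) for label, rx in PATTERNS.items()
            for m in rx.finditer(text)]

sample = "Patient 123-45-6789 emailed jdoe@example.com about her results."
for label, hit in flag_sensitive(sample):
    print(f"{label}: {hit}")
```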

Missing data or metadata

General Michael Hayden, former director of the NSA and the CIA, said at a debate at Johns Hopkins University in May 2014, “We kill people based on metadata.” In a legal discovery context, metadata can be just as important as it is for law enforcement and intelligence agencies.

Embedded metadata in image files can show ownership, demonstrate a custodian was at a certain place at a particular time, or help to confirm a witness’s recollection of facts. Document management systems record valuable information about who created, viewed, modified and deleted files — as well as corporate filing protocols such as internal project names.
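Much of that image metadata is stored as EXIF tags, which are straightforward to read programmatically. The sketch below assumes the Pillow imaging library and its getexif() call; the file name is hypothetical, and tag 34853 (GPSInfo) is the one that can place a device at a particular location.

```python
from PIL import Image, ExifTags

img = Image.open("photo.jpg")  # hypothetical file
exif = img.getexif()

# Map numeric tag IDs to human-readable names where known.
for tag_id, value in exif.items():
    name = ExifTags.TAGS.get(tag_id, tag_id)
    print(f"{name}: {value}")

if 34853 in exif:  # GPSInfo IFD pointer
    print("Image contains GPS coordinates - potentially probative.")
```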

Discovery tools don’t always capture all data or metadata. This is especially the case when they use native applications to extract data or require it to be converted from one format to another before being processed.

Extracting data using an API can also be challenging, as it limits you to the data and metadata that the application’s maker believed to be important, or had the time to implement. It is highly likely that some important information will be left behind in this process.

Processing pitfalls

Seemingly minor decisions during eDiscovery processing, often the choice of a single setting, can cause major issues downstream.

A frequent processing error occurs when legal teams decide to handle documents containing tracked changes or embedded commentary by running them through an optical character recognition (OCR) engine. OCR engines cannot distinguish the main body text of a document from the annotations or comments; the result is confused or garbled text that is hard for reviewers to read. Another challenge for OCR engines is recognising text that is upside-down or sideways in scanned documents. The resulting ‘martian’ style text will defeat any attempt to find key terms.
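This is a case where native extraction beats OCR outright. A .docx file is a ZIP archive in which insertions and deletions are explicit <w:ins> and <w:del> elements in word/document.xml, so tracked changes can be read separately from the body text rather than OCR’d into a jumble. The Python sketch below uses only the standard library; the file name is hypothetical.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used throughout word/document.xml.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

with zipfile.ZipFile("draft.docx") as docx:  # hypothetical file
    root = ET.fromstring(docx.read("word/document.xml"))

# Tracked insertions, with the reviewer who made them.
for ins in root.iter(f"{W}ins"):
    inserted = "".join(t.text or "" for t in ins.iter(f"{W}t"))
    print("inserted:", inserted, "| by:", ins.get(f"{W}author"))

# Tracked deletions keep their text in <w:delText> elements.
for dele in root.iter(f"{W}del"):
    deleted = "".join(t.text or "" for t in dele.iter(f"{W}delText"))
    print("deleted:", deleted, "| by:", dele.get(f"{W}author"))
```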

Some eDiscovery tools by default do not index ‘stop words’, namely very small words like ‘it’ and ‘the’ that are almost never important to understanding the matter at hand. But ‘almost never’ is not the same as ‘never’, and this difference in indexing methodology can produce very different results from those expected. A copyright dispute involving the Mel Brooks movie To Be or Not to Be, for example, would require the parties to index stop words; otherwise they could not find the name of the film in searches.
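The effect is easy to demonstrate with a toy inverted index. In the Python sketch below (an illustration, not any particular product’s indexing engine), the same phrase search succeeds or fails depending on whether stop words were indexed.

```python
STOP_WORDS = {"to", "be", "or", "not", "the", "it", "a", "of"}

def build_index(docs, drop_stop_words):
    """Build a word -> {doc ids} inverted index."""
    index = {}
    for doc_id, text in docs.items():
        for word in text.lower().split():
            if drop_stop_words and word in STOP_WORDS:
                continue
            index.setdefault(word, set()).add(doc_id)
    return index

def search(index, phrase):
    """Naive AND search: every query term must be in the index."""
    hits = [index.get(w, set()) for w in phrase.lower().split()]
    return set.intersection(*hits) if hits else set()

docs = {1: "memo about the film To Be or Not to Be",
        2: "quarterly budget memo"}

with_stops = build_index(docs, drop_stop_words=False)
without = build_index(docs, drop_stop_words=True)

print(search(with_stops, "to be or not to be"))  # {1}
print(search(without, "to be or not to be"))     # set() - the title vanishes
```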

Four steps for maintaining data quality

Maintaining data quality is as much about process and education as it is about technology capabilities. For discovery professionals, this means:

1. Know your data

Familiarise yourself with common data formats and the most frequent processing errors and issues associated with each. When you do encounter anomalies, the answers can often be found in technical forums discussing systems implementation or programming issues, rather than legal discovery.

2. Figure out your tools

Learn about the capabilities and shortcomings of your processing tools and the file types they can or can’t handle. Keep a close eye on:

· What records the application stores when it can’t process a file for any reason, such as encryption, corruption, or an unknown or mismatched format

· Whether the processing tool makes any behind-the-scenes conversions to handle particular file types

· What processes you have in place to handle each item that can’t be processed – do you reprocess it, convert it to another format, or ignore it and hope it isn’t important? (A simple exceptions-log sketch follows this list.)
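One lightweight way to keep track of all this is an exceptions log: every item the tool could not process gets a row recording the reason and the disposition chosen, so nothing is silently ignored. The Python sketch below is illustrative; the field names and example entries are invented.

```python
import csv
from datetime import datetime, timezone

LOG_FIELDS = ["logged_at", "item_path", "reason", "disposition"]

def log_exception(log_path, item_path, reason, disposition):
    """Append one unprocessable item to a CSV exceptions log."""
    with open(log_path, "a", newline="") as f:
        writer = csv.writer(f)
        if f.tell() == 0:  # new file: write the header first
            writer.writerow(LOG_FIELDS)
        writer.writerow([datetime.now(timezone.utc).isoformat(),
                         item_path, reason, disposition])

log_exception("exceptions.csv", "mail/archive.nsf",
              "unknown or mismatched format", "reprocess with alternate tool")
log_exception("exceptions.csv", "docs/budget.xlsx",
              "encrypted - password unknown", "escalate to custodian")
```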

3. Find out what happens when you tick the box

Understand how the application’s processing settings work and what effect they might have on which items are included or excluded from the final document collection.

4. Implement quality controls

Put in place checkpoints and processes to reconcile the number of items you feed into any process against the number of items that come out of it. Use sampling to check that the outcome of any process is what you expect. If you’re concerned about the outcome, process the same data twice with the same tool, or use two different tools, and compare the results.
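The first two of those controls, count reconciliation and sampling, take only a few lines to prototype. The Python sketch below is illustrative; the manifests and document IDs are invented.

```python
import random

def reconcile(input_ids, output_ids):
    """Report items that went in but never came out, and vice versa."""
    missing = set(input_ids) - set(output_ids)
    unexpected = set(output_ids) - set(input_ids)
    return missing, unexpected

def sample_for_qc(items, k=5, seed=42):
    """Draw a reproducible random sample for manual quality checking."""
    rng = random.Random(seed)
    return rng.sample(sorted(items), min(k, len(items)))

fed_in = {"DOC-001", "DOC-002", "DOC-003", "DOC-004"}
came_out = {"DOC-001", "DOC-003", "DOC-004", "DOC-999"}

missing, unexpected = reconcile(fed_in, came_out)
print("missing from output:", missing)      # {'DOC-002'}
print("unexpected in output:", unexpected)  # {'DOC-999'}
print("QC sample:", sample_for_qc(came_out, k=2))
```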

When it comes to data quality during processing, ‘near enough’ can lead to costly repercussions down the track. Using the above approach, you can make all relevant information accessible, confirm you haven’t missed anything, and more accurately decide what’s responsive to the matter at hand, while keeping eDiscovery costs reasonable and proportionate.

Lee Meyrick is Director of Information Management at Nuix: http://www.nuix.com/