What is Predictive Coding and How it Can be Used in eDiscovery

What is Machine Learning?

Artificial intelligence is so deeply woven into our lives that we use it without even thinking about it, whether we are getting recommendations on Amazon, using GPS to find the quickest way home, or simply applying Snapchat filters. In fact, we have grown so accustomed to it that we no longer refer to it as AI, just software.

In legal discovery, reviewing every document by hand has become an impractical method, especially now that modern data formats have significantly increased the volume of discoverable data.

Machine learning is one of the many applications of AI. It enables systems to learn from data and experience without being explicitly programmed for each task. Instead, machine learning focuses on accessing data and using it to learn.

Much as the human brain develops understanding and gains knowledge, machine learning uses input such as training data to learn patterns and connections. The process begins with analyzing data and looking for patterns that help the system make inferences from the provided examples. The main goal of machine learning is to develop models that can learn with minimal human intervention and adjust their determinations as circumstances change.

What is Predictive Coding?

Also known as Computer-Assisted Review (CAR) or Technology-Assisted Review (TAR), predictive coding is used to identify critical documents among electronically stored information (ESI) during the review phase of a legal case. It uses AI to build a process that learns and makes better decisions over time, which expedites document review and saves time and money. Predictive coding can involve keyword searching, sampling, and filtering to automate eDiscovery document review. By incorporating predictive coding, law firms can reduce the number of non-responsive and irrelevant documents that have to be manually reviewed.

Predictive coding software uses artificial intelligence and mathematical models to analyze electronic documents and locate data relevant to the legal case. The software first learns from a sample of documents that a legal team has reviewed and tagged manually. It then takes in a new set of documents and identifies the ones that are likely to be relevant and require manual review. The legal team reviews the software’s categorizations and corrects its mistakes, and as this training continues, the software makes faster, better determinations. The cycle repeats until a certain confidence level is achieved.

How Predictive Coding Works

In eDiscovery, the goal of predictive coding is to locate relevant documents accurately and quickly during the review phase. This expedites the review process and saves the firm time and money. Different predictive algorithms follow different methodologies. At the most basic level, however, the process looks like this:

Training Seed Sets

A group of subject matter experts (SMEs) should identify a cross-section of documents to be used as a seed set or sample from the document set that requires review. Using the predictive coding software, every document in the set should be coded as relevant or non-relevant by an SME.

Prediction Process

Once the software has analyzed the seed set, it builds an internal model capable of predicting the relevancy of future documents. The results from this analysis can be applied to additional documents. More sample documents can also be added, and the model is refined continually until the desired results are achieved.

Review

The predictive model will then be applied to the entire document set, so that the remaining documents can be predicted as relevant or non-relevant.
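
The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses a small naive Bayes text classifier, which is one of many possible algorithms and not necessarily what any particular predictive coding product uses. The function names (`tokenize`, `train`, `relevance_score`) and the sample documents are hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase a document and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def train(seed_set):
    """Build per-class word counts from SME-coded (text, is_relevant) pairs."""
    word_counts = {True: Counter(), False: Counter()}
    doc_counts = Counter()
    for text, is_relevant in seed_set:
        doc_counts[is_relevant] += 1
        word_counts[is_relevant].update(tokenize(text))
    vocab = set(word_counts[True]) | set(word_counts[False])
    return {"words": word_counts, "docs": doc_counts, "vocab": vocab}

def relevance_score(model, text):
    """Return the model's estimated probability (0 to 1) that a document is relevant."""
    total_docs = sum(model["docs"].values())
    log_prob = {}
    for label in (True, False):
        counts = model["words"][label]
        denom = sum(counts.values()) + len(model["vocab"])
        # Smoothed class prior, so an empty class never produces log(0).
        lp = math.log((model["docs"][label] + 1) / (total_docs + 2))
        for tok in tokenize(text):
            if tok in model["vocab"]:  # words never seen in training are ignored
                lp += math.log((counts[tok] + 1) / denom)  # Laplace smoothing
        log_prob[label] = lp
    # Turn the log-odds into a probability; clamp to avoid overflow in exp().
    diff = max(min(log_prob[False] - log_prob[True], 700.0), -700.0)
    return 1.0 / (1.0 + math.exp(diff))

# Step 1: a seed set coded by subject matter experts (made-up examples).
seed_set = [
    ("merger pricing agreement with competitor", True),
    ("confidential pricing discussion before merger", True),
    ("lunch menu for friday", False),
    ("office party parking reminder", False),
]

# Step 2: the software analyzes the seed set and builds a model.
model = train(seed_set)

# Step 3: the model is applied to unreviewed documents, ranked by score.
unreviewed = ["draft pricing agreement for the merger", "friday lunch order"]
ranked = sorted(unreviewed, key=lambda d: relevance_score(model, d), reverse=True)
```

In a real matter the seed set would contain far more documents, and adding newly coded documents to `seed_set` and calling `train` again corresponds to the refinement described in the prediction step.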

Now, behind the scenes of the predictive coding software, there is a lot of technological sophistication, which will be almost impossible to cover in this article. However, there are two important concepts that you should know:

1. Active Learning

This is an iterative method where the training set is augmented by documents that are chosen by the algorithm and then coded manually by a human. Casepoint’s CaseAssist AI works this way: it learns from your input to rank unreviewed documents, which makes your job easier.

2. Passive Learning

This iterative method involves using multiple random sample sets for training the algorithm until it achieves the desired result.
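
The difference between the two approaches comes down to how the next training batch is chosen. The sketch below is a simplified illustration, not a description of any vendor’s implementation. It assumes that each unreviewed document already has a relevance score between 0 and 1, and it uses uncertainty sampling (picking scores closest to 0.5) as one common active learning strategy; other strategies exist. The function names and document IDs are hypothetical.

```python
import random

def select_active(scores, batch_size):
    """Active learning via uncertainty sampling: pick the documents
    whose scores are closest to 0.5, where the model is least sure."""
    by_uncertainty = sorted(scores, key=lambda doc_id: abs(scores[doc_id] - 0.5))
    return by_uncertainty[:batch_size]

def select_passive(scores, batch_size, seed=0):
    """Passive learning: pick a random batch, ignoring the model's scores."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return rng.sample(sorted(scores), batch_size)

# Made-up relevance scores for four unreviewed documents.
scores = {"doc-a": 0.95, "doc-b": 0.52, "doc-c": 0.10, "doc-d": 0.45}

active_batch = select_active(scores, 2)    # the two most uncertain documents
passive_batch = select_passive(scores, 2)  # any two documents, chosen at random
```

In both cases a human reviewer codes the selected batch, the coded documents are added to the training set, and the model is retrained before the next round.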

Predictive Coding in the Courts

When predictive coding was first used in the legal industry, practitioners were divided on how the courts would respond. The first official judicial endorsement of the technology as an accepted way of reviewing documents is believed to have come in 2012 in the Southern District of New York, when Federal Magistrate Judge Andrew Peck ruled in Da Silva Moore v. Publicis Groupe. Today, predictive coding in eDiscovery is well established.

However, there are still disputes about how transparent parties need to be about their use of the technology. This includes how they select their seed sets as well as how they code the documents used to train the algorithm.

Judge Peck addressed this issue in his ruling in Rio Tinto Plc v. Vale S.A. (306 F.R.D. 125). He stated that even though he encourages parties to be open about their seed set development, there are other ways to evaluate how well the technology has performed, such as manual sampling of coded documents.

Some experts believe that predictive coding was overhyped and has so far not lived up to expectations. Here are a few reasons why the legal industry has been slow to adopt it:

The Myth of Human Review

Many legal practitioners still believe that manual human review offers the most accurate and thorough review of documents for relevancy. In truth, this myth was disproved as early as 1985. In a landmark study, skilled paralegals who used search terms and iterative searching believed that they had found at least 75% of the relevant documents when they had found only 20%. Since then, further studies have confirmed that manual human review is not as infallible as once thought. However, the perception that manual human review is the gold standard has slowed the adoption of predictive coding.

Technical Unfamiliarity

Any time a new technology is introduced, there is some skepticism around it, and the legal industry has traditionally been slow to adopt new technology. Predictive coding is quite complex, and understanding it requires a basic knowledge of statistical sampling and data science. Even though predictive coding promises more efficient and cheaper eDiscovery, for many lawyers that promise has not yet outweighed their skepticism of the unknown.

Upfront Expense

While predictive coding can decrease the time and expense spent on document review, it requires the deployment of a new tool, and this upfront expense constitutes a capital expenditure. Some legal teams still stick to the status quo because of this.

Predictive Coding Best Practices

There are certain practices that can help you make the most out of your predictive coding workflow:

Understand the Technology

Before you start using the predictive coding software, get familiar with how the tool works. You don’t have to be a technical expert or a statistician to understand the basics. You should know, for example, whether the tool employs active or passive learning, whether it requires a seed set, and whether it produces a relevancy score. As a lawyer, you must know the answers to these questions because an opposing party or a judge might ask them. It is better to be proactive than uninformed.

Use Experts to Train the Model

For the predictive coding system to work effectively, you will need a strong seed set. For this, you will need your strongest reviewers, who have deep knowledge of the case and can pay close attention to detail. If the algorithm learns from an inaccurate or inconsistent sample, the entire review might be compromised.

Develop a Relevancy Threshold

The relevancy score represents the software’s confidence in the relevance of a document. There should be a systematic approach to determine which documents have to be reviewed manually after the predictive coding process. You have to create a cutoff point that will determine the documents that can be discarded as non-relevant and the ones which should proceed further with human review.

Validate Results

Predictive coding shouldn’t be considered a complete replacement for the manual review process. It should be considered a solution that augments the whole process. After the predictive coding process has been completed, you should sample documents from both the relevant and the non-relevant sets to look for inconsistencies. If the number of incorrectly predicted documents is large, you might have to retrain the algorithm and re-run the entire process.
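
The cutoff and validation steps can be sketched as follows. This is a minimal illustration under stated assumptions, not a prescribed workflow: the cutoff value of 0.5, the function names, the document IDs, and the `human_label` callback (which stands in for a reviewer’s decision on a sampled document) are all hypothetical.

```python
import random

def split_by_threshold(scores, cutoff):
    """Divide scored documents into a set for human review (score >= cutoff)
    and a set predicted non-relevant (score < cutoff)."""
    review = [doc_id for doc_id, s in scores.items() if s >= cutoff]
    discard = [doc_id for doc_id, s in scores.items() if s < cutoff]
    return review, discard

def sample_error_rate(doc_ids, predicted_relevant, human_label, sample_size, seed=0):
    """Estimate how often the prediction disagrees with a human reviewer
    by checking a random sample drawn from one of the two sets."""
    rng = random.Random(seed)
    sample = rng.sample(doc_ids, min(sample_size, len(doc_ids)))
    if not sample:
        return 0.0
    errors = sum(1 for doc_id in sample if human_label(doc_id) != predicted_relevant)
    return errors / len(sample)

# Made-up scores, plus made-up reviewer decisions for the sampled documents.
scores = {"d1": 0.91, "d2": 0.78, "d3": 0.40, "d4": 0.12, "d5": 0.05}
reviewer_decisions = {"d1": True, "d2": True, "d3": True, "d4": False, "d5": False}

review_set, discard_set = split_by_threshold(scores, cutoff=0.5)

# Share of sampled "non-relevant" documents that a reviewer found relevant.
missed_rate = sample_error_rate(discard_set, False, reviewer_decisions.get, 3)
# Share of sampled "relevant" documents that a reviewer found non-relevant.
false_hit_rate = sample_error_rate(review_set, True, reviewer_decisions.get, 2)
```

A high `missed_rate` suggests that the cutoff is too strict or that the model needs more training. What counts as too high is a judgment call for the legal team and is not something this sketch decides.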

Predictive Coding Tools and AI

Different predictive coding tools have different capabilities. When evaluating eDiscovery review software, there are certain features to look for:

Document Insights

Apart from providing a relevancy score, the predictive coding software should also allow you to review the relevant content and metadata in the document. This will help validate the results and access specific documents.

Ongoing Predictive Capabilities

Every legal practitioner knows that cases rarely follow a linear path. As new information is brought to light, the document set will evolve as well. The software should let you apply the existing models to individual documents that are brought in after the initial review process.

Reporting

Even though the courts accept predictive coding, some have yet to make clear how much transparency they require. This is why your predictive coding tool must offer a robust reporting feature that logs all actions taken and the users involved.

Portable Prediction Models

After training the predictive model, it is helpful if the software also supports using that model for other data sets. You can use the same model for different matters revolving around the same data, people, and issues.
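
Conceptually, a portable model is a trained model that has been saved in a form another matter can load and apply. The sketch below shows the idea with a deliberately simple model, a set of per-word weights stored as JSON. The model format, file name, function names, and weight values are all assumptions made for illustration; real products use their own model formats.

```python
import json
import math
import os
import tempfile

def export_model(term_weights, bias, path):
    """Save a trained model (hypothetical per-word weights) to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"bias": bias, "term_weights": term_weights}, f)

def import_model(path):
    """Load a previously exported model so it can be used on another data set."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    return data["term_weights"], data["bias"]

def score(term_weights, bias, text):
    """Score a document with the loaded weights (logistic function of the sum)."""
    raw = bias + sum(term_weights.get(tok, 0.0) for tok in text.lower().split())
    return 1.0 / (1.0 + math.exp(-raw))

# Made-up weights standing in for a model trained on an earlier matter.
trained_weights = {"pricing": 1.2, "merger": 0.9, "lunch": -1.5}
trained_bias = -0.3

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "pricing_model.json")
    export_model(trained_weights, trained_bias, path)
    weights, bias = import_model(path)  # e.g. loaded in a new, related matter

new_matter_score = score(weights, bias, "Revised merger pricing schedule")
```

A reused model is only as good as its fit to the new matter, which is why the text above limits reuse to matters that revolve around the same data, people, and issues.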

Role of Predictive Coding in eDiscovery

In eDiscovery, predictive coding automates much of the document review process. This machine learning technology identifies relevant documents in review workflows, which significantly reduces the set of documents that needs human attention. By leveraging it, legal professionals can find relevant ESI earlier in the review phase, and this can shape and alter the discovery process.

The predictions offered by predictive coding technology can be powerful. You might find a smoking gun that could change the course of your case. It will also help you avoid extensive and time-consuming manual review. Given these benefits, it shouldn’t come as a surprise that many corporate counsel have started using predictive coding. In fact, courts increasingly support the use of predictive coding for document review.

Casepoint offers built-in artificial intelligence and an advanced analytics system called CaseAssist for faster and better legal discovery. For example, to make the process even more efficient, Casepoint’s CaseAssist AI can prioritize documents for relevance review using dynamic batching. Active learning allows the technology to work in the background and highlight what could be key documents, dates, and people.

In Conclusion

Many people have been skeptical of the technology because they fear that it will take over their jobs. However, these technologies only work because of the training they receive from human experts. Even if you have a predictive coding tool in your law firm, you will need the expertise of a skilled attorney to make the right decisions needed to train the technology. AI can augment your abilities by detecting patterns that may not be obvious to humans. Casepoint’s CaseAssist AI can help you with workflow automation, review automation, and review prioritization. However, it is important to remember that the people who train the technology and review relevant documents manually will remain vital to the success of the process. Essentially, predictive coding is a tool that can help you save time and money during the eDiscovery process.
