AI and Machine Learning in eDiscovery

Any lawyer who has been involved in eDiscovery knows that document review can be a slow-moving process. The difficulty is compounded by the constant pressure to speed the review up. That is why many legal professionals have turned to AI and machine learning. The first major use of AI in the legal field was Technology-Assisted Review (TAR). Since then, the application of AI has expanded significantly across various types of legal matters.

Before we get into the application of AI and machine learning in eDiscovery, let’s get back to the basics. For starters, what is eDiscovery?


eDiscovery is a procedure in which the parties in a legal case collect and review electronically stored information (ESI) so that it can be used as evidence. The traditional types of ESI are emails and documents. Newer types include instant messages, smartphone application data, social media, and more. The eDiscovery process can be time-consuming, but with the help of AI and machine learning, eDiscovery software can make it more manageable.

Two forms of AI have made a significant impact on eDiscovery: Machine Learning and Natural Language Processing (NLP). Machine Learning assesses large datasets using mathematical models and learns from exposure to, and feedback on, additional information. This allows the models to discover hidden patterns and make predictions about the targeted data. NLP advances machines’ ability to understand spoken and written human language, so that software can interact with users in their own language. It also enables the software to approximate human cognitive patterns more closely.

The use of AI in eDiscovery also involves increasingly powerful advanced analytics. This technology makes it possible to automate certain tasks and to widen the range of available search filters and data analyses. Many eDiscovery and compliance tools, such as Casepoint, are capable of finding hidden patterns across thousands of unconnected communications, electronic files, and text fragments. You can cluster and categorize these files by topic, content, and concept.

Take the example of AI-fueled “sentiment analysis.” Beyond basic term searches, it looks for relevant behavior indicators, such as concern, panic, deceit, or concealment. Modern AI can also identify voice patterns and facial expressions in recordings and videos and point to different sentiments. When you combine this with analysis of the subject’s transactional data and writings, you get a fuller picture of the motives of the individual and the group.

Machine learning can also be used to look for anomalies such as irregular occurrences and omissions. People today are more careful when communicating via email. They might use a different channel or different terminology, or avoid communicating about a sensitive subject altogether. Through analytics, you can look for code language, patterns pointing to an underlying meaning, or out-of-character communications. For instance, if someone who normally talks a lot on chat sends a text saying, “Give me a call on my phone,” this is considered an irregular event, and the system will flag it. Suspicious gaps in the communication can also raise a flag for inquiry. The system might also signal the destruction of evidence or failures of production.
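To make the two flags above concrete, here is a minimal Python sketch. It is an illustration only, not how any particular eDiscovery product works: the function names, the three-day gap threshold, and the list of channel-switch phrases are all assumptions chosen for the example.

```python
from datetime import datetime, timedelta

# Hypothetical phrases that suggest a move to another channel.
CHANNEL_SWITCH_PHRASES = ("give me a call", "call me", "talk offline")


def find_gaps(timestamps, max_gap=timedelta(days=3)):
    """Return (start, end) pairs where consecutive messages are more than max_gap apart."""
    ordered = sorted(timestamps)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > max_gap]


def flags_channel_switch(message_text):
    """Return True when the message contains one of the assumed channel-switch phrases."""
    text = message_text.lower()
    return any(phrase in text for phrase in CHANNEL_SWITCH_PHRASES)


if __name__ == "__main__":
    stamps = [datetime(2024, 1, 1), datetime(2024, 1, 2),
              datetime(2024, 1, 10), datetime(2024, 1, 11)]
    print(find_gaps(stamps))
    print(flags_channel_switch("Give me a call on my phone"))
```

A real system would learn what counts as out of character for each custodian instead of relying on a fixed phrase list and a fixed threshold; the sketch only shows the shape of the checks.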

AI is still in its relatively early stages, especially in the eDiscovery field. But even though we are not in danger of sentient robots replacing everyone, AI has a clear role in our world. In the legal field, AI uses large volumes of data to give you leverage. With its help, you will be able to extend your reach and work faster.

Modern Machine Learning Approaches


The main goal of using machine learning for eDiscovery is to maximize the effectiveness of the review process while minimizing the need for human intervention. It drastically reduces the number of documents that must be reviewed to find the relevant ones, which helps cut costs as well.

As mentioned above, technology-assisted review (TAR) is among the most efficient uses of AI in eDiscovery. When the machine keeps learning from the human reviewers’ decisions as the review proceeds, the approach is known as continuous active learning.

During the early days of TAR, the legal and IT community debated how the TAR process should be seeded. The debate turned out to be moot: even if you start with a single positive seed document, the continuous learning approach will surface more positive seeds. This means that regardless of the starting point, the review will be just as effective. The continuous approach overcomes the initial conditions.

The newer TAR model uses a different seeding approach, one that does not need a human assessment of the review collection. Instead, the training sample is drawn from documents outside the collection. This approach to TAR is called the “Portable AI model”. It is also known as the reusable model because the system is pre-trained on data from a prior matter or from related datasets. In this way, Portable AI models reuse existing human knowledge in the seeding process. That’s why they play an increasingly important role in the machine learning techniques used for eDiscovery, and why they are worth considering as an alternative to the TAR model that uses continuous active learning.
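The idea of a portable model can be sketched in a few lines of Python. This is a deliberately simple illustration under stated assumptions: the "model" is just a table of term weights learned from a prior matter, and the function names and sample documents are hypothetical. Commercial Portable AI models are far more sophisticated.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for real text processing)."""
    return text.lower().split()


def train_portable_model(prior_docs):
    """Learn term weights from (text, is_relevant) pairs coded in an earlier matter."""
    weights = Counter()
    for text, is_relevant in prior_docs:
        for term in set(tokenize(text)):
            weights[term] += 1 if is_relevant else -1
    return weights


def rank_collection(weights, docs):
    """Rank a new collection from most to least likely relevant, with no coding on it."""
    return sorted(docs,
                  key=lambda d: sum(weights[t] for t in set(tokenize(d))),
                  reverse=True)


if __name__ == "__main__":
    prior_matter = [
        ("delete the invoice records", True),
        ("shred invoice files tonight", True),
        ("lunch menu for friday", False),
        ("team lunch friday", False),
    ]
    new_collection = ["friday lunch plans",
                      "please delete invoice copies",
                      "quarterly report"]
    model = train_portable_model(prior_matter)
    for doc in rank_collection(model, new_collection):
        print(doc)
```

The point of the sketch is the data flow: all training happens on the prior matter, and the new collection is ranked before any reviewer has coded a single document in it.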

Most Efficient Approach to Document Review

Compared to the linear review or the randomly-seeded review workflow, the portable model offers a significant advantage in terms of efficiency. This is especially true for low-richness environments where it is rare to encounter a positive document. Some consider random seeding and linear review to be weak, improper baselines.
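A rough back-of-the-envelope calculation shows why low richness hurts linear and randomly seeded review. The Python sketch below assumes documents are reviewed in random order, so that on average about one in every 1/richness documents is relevant. The 1% richness and the 50-document batch are illustrative values, not figures from any study.

```python
def expected_docs_per_hit(richness):
    """Approximate number of randomly ordered documents reviewed per relevant one."""
    if not 0 < richness <= 1:
        raise ValueError("richness must be in the interval (0, 1]")
    return 1 / richness


def expected_hits(batch_size, richness):
    """Expected number of relevant documents in a randomly drawn batch."""
    return batch_size * richness


if __name__ == "__main__":
    richness = 0.01  # assumed: 1 relevant document per 100
    print(expected_docs_per_hit(richness))
    print(expected_hits(50, richness))
```

Under these assumptions, a reviewer working through random batches would often finish a whole batch without seeing a single positive document, which is why a method that ranks likely positives first has so much room to help.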

First, let’s compare the Portable AI model with Boolean keyword searching. It has been found that Portable AI models can find more relevant documents than a human searcher using the Boolean method in the first 50 documents reviewed. However, the gain is limited to only three or four more documents, so the difference is not statistically significant.

However, when we compare the one-shot Portable AI model with the human-seeded continuous process, the latter beats the former by a large margin. The continuous iteration approach combined with human seeding is a powerful combination. In the long run, it significantly outpaces the static model that was trained on a different dataset, even when that dataset is similar to the target collection.

Now, one would expect that using a Portable AI model to select the seeds for the active learning process would reach high recall faster. But this isn’t the case. The combination of continuousness and portability doesn’t overtake the alternative. Research has shown that the active learning process, initiated with the portably selected seeds, doesn’t offer a sustained, significant improvement over the human-seeding process.

TAR: Lightening Your Load

For law firms and legal professionals who haven’t switched to TAR yet, what happens when they are midway through a review and time constraints push them to switch? Would TAR help at that point?

The answer is yes, although it depends on the size of the collection. If you have coded half of the documents, these coding decisions can be used to jumpstart the TAR process. The coded documents serve as a ready-made set of seeds for training the TAR algorithm. The TAR engine uses continuous active learning, a machine learning protocol that starts the process by using the previously coded documents as judgmental seeds.

After the coded documents have been fed into the system, they will be used as input to analyze and rank the entire population. From that point, the reviewers will be fed batches of 50 documents each by the TAR engine. Each of these batches will be the documents most likely to be responsive. Along with these documents, there will also be some contextually diverse documents to ensure that no concepts or topics are left unexplored.

Once the reviewers have finished the batches, the system will continuously use their judgments to rerank the population. With every new batch, the rankings will improve. The system will proceed along this track until there are no more batches with relevant documents. Tests have shown that at the end, the review achieves high recall, meaning that the majority of the relevant documents were found. Your firm will be able to save your clients thousands of dollars by adopting a process that involves TAR.
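The loop described above (seed, rank, review a batch, rerank) can be sketched in Python as follows. This is a simplified illustration, not a real TAR engine: the term-weight scoring, the function names, and the rule of stopping when a batch contains no relevant documents are assumptions made for the example, document texts are assumed to be unique, and the contextually diverse documents mentioned earlier are left out.

```python
from collections import Counter


def tokenize(text):
    return text.lower().split()


def train(coded):
    """Learn term weights from the documents coded so far (text -> is_relevant)."""
    weights = Counter()
    for text, relevant in coded.items():
        for term in set(tokenize(text)):
            weights[term] += 1 if relevant else -1
    return weights


def score(weights, text):
    return sum(weights[t] for t in set(tokenize(text)))


def continuous_active_learning(collection, seeds, review, batch_size=50):
    """Rerank after every batch; stop when a batch has no relevant documents."""
    coded = dict(seeds)  # previously coded documents act as the seeds
    while True:
        weights = train(coded)
        uncoded = [d for d in collection if d not in coded]
        if not uncoded:
            break
        batch = sorted(uncoded, key=lambda d: score(weights, d),
                       reverse=True)[:batch_size]
        found = 0
        for doc in batch:
            coded[doc] = bool(review(doc))  # the reviewer's judgment
            found += int(coded[doc])
        if found == 0:  # assumed stopping rule
            break
    return coded


def recall(coded, truth):
    """Share of all truly relevant documents that the review found."""
    total = sum(1 for rel in truth.values() if rel)
    found = sum(1 for rel in coded.values() if rel)
    return found / total if total else 0.0
```

In practice `review` would be a human reviewer's decision, and recall would have to be estimated by sampling, because the true labels of unreviewed documents are not known.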



When it comes to using AI in any setting, a healthy dose of caution has always been warranted. AI still has a long way to go, but it is making steady progress, and the benefits of AI and machine learning in eDiscovery justify much of the hype. A wide range of powerful eDiscovery software can help you achieve these benefits. For cases that need behavioral analysis and pattern detection, and that involve repeat patterns with similar content, the best way to go is to employ a portable AI model. It’s a low-risk, high-return approach that is definitely worth your consideration.