What is Data Culling?

The first measure law firms can take is to stop over-collecting data for advanced eDiscovery. If you collect more than is required, you will spend more money storing and transferring data and more hours processing, analyzing, and reviewing it. Proportional collection can save a lot of time and resources. To achieve it, you have to improve your data culling practices.

In this article, we’ll discuss what data culling is, and how it can be used to streamline your eDiscovery process.

In simple terms, culling is reducing a sample by selective methods. Data culling is a process that allows you to identify documents on the basis of certain criteria, such as date ranges or keywords. This is what sets it apart from the document review process. By culling your data, you can hide certain information from the user’s view and prevent it from being returned in searches. It’s important to note that culling a document doesn’t mean that the document is deleted. It means that the document has been sequestered in a separate collection that you can access again.

Importance of Data Culling

In the information-driven society that we live in today, data culling has become increasingly important. Law firms are now dealing with a large volume of data that they have to gather, process, analyze, and review, which means increased eDiscovery costs. Not every organization has the bandwidth to shoulder these expenses.

According to the IMF’s October 2022 World Economic Outlook, global economic growth was expected to slow to 2.7% in 2023. The IMF described this as the weakest growth profile since 2001, apart from the global financial crisis of 2008 and the acute phase of the COVID-19 pandemic. At times like these, it is important for organizations to keep their expenses low.

The key to keeping the cost of eDiscovery down is avoiding over-collection. This is done by weeding out duplicative and irrelevant data. Once you have reduced the amount of data you are dealing with, your eDiscovery expenses will drop as well.

Types of Data Culling

As mentioned, the process of data culling involves searching and isolating data on the basis of specific criteria, such as keywords and date ranges. There are multiple techniques you can use that will remove documents from the collection before they can be processed and reviewed. Let’s take a look at three of the most common data culling methods:

DeNISTing

DeNISTing removes junk data that could clog up the review, namely system and program files that contain no user-generated data of evidentiary value. It works by comparing file hashes against the list of known files that NIST maintains in its National Software Reference Library, which makes it a simple way of reducing irrelevant documents. The ‘NIST’ in DeNISTing stands for the National Institute of Standards and Technology, a federal agency under the U.S. Department of Commerce that promotes innovation and industrial competitiveness.
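The hash comparison behind DeNISTing can be sketched in a few lines. This is a minimal illustration, not a production tool: the `denist` function, the in-memory `collection`, and the one-entry `known_hashes` set are all hypothetical stand-ins, and a real workflow would load a published reference hash list instead of computing one from sample bytes.

```python
import hashlib


def sha1_hex(data: bytes) -> str:
    """Return the hex SHA-1 digest of a file's contents."""
    return hashlib.sha1(data).hexdigest()


def denist(files: dict, known_hashes: set):
    """Split file names into (kept, culled) by hash membership.

    Culled files are set aside, not deleted, so they can be restored.
    """
    kept, culled = [], []
    for name, data in files.items():
        (culled if sha1_hex(data) in known_hashes else kept).append(name)
    return kept, culled


# Hypothetical sample data: one system file and one user document.
system_file = b"\x7fELF placeholder system binary"
known_hashes = {sha1_hex(system_file)}  # stand-in for a reference hash list
collection = {
    "kernel32.dll": system_file,
    "memo.docx": b"Q3 pricing memo",
}

kept, culled = denist(collection, known_hashes)
print("kept:", kept)
print("culled:", culled)
```

Matching on content hashes rather than file names means a renamed system file is still recognized, while a user document that happens to share a system file's name is kept.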

Dedupe

Deduplication (dedupe) identifies and separates duplicate documents and emails. It can be performed at the custodial or the global level. Custodial deduping compares each custodian’s documents only against that custodian’s own collection, whereas global deduping removes duplicates across all custodians. The former keeps each custodian’s collection intact, while the latter maximizes the number of duplicate documents removed from the collection.
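The difference between the two levels comes down to the key used to decide whether a document has been seen before. The sketch below assumes exact duplicates detected by a content hash; the `Doc` record, the `dedupe` function, and the sample documents are hypothetical and only illustrate the idea.

```python
import hashlib
from collections import namedtuple

# Hypothetical document record used only for this illustration.
Doc = namedtuple("Doc", ["doc_id", "custodian", "content"])


def dedupe(docs, scope="global"):
    """Return (unique, duplicates); scope is "global" or "custodial"."""
    if scope not in ("global", "custodial"):
        raise ValueError("scope must be 'global' or 'custodial'")
    seen = set()
    unique, duplicates = [], []
    for doc in docs:
        digest = hashlib.sha256(doc.content.encode("utf-8")).hexdigest()
        # Custodial scope only compares a custodian against themselves.
        key = digest if scope == "global" else (doc.custodian, digest)
        if key in seen:
            duplicates.append(doc)
        else:
            seen.add(key)
            unique.append(doc)
    return unique, duplicates


docs = [
    Doc(1, "alice", "Budget attached."),
    Doc(2, "alice", "Budget attached."),
    Doc(3, "bob", "Budget attached."),
    Doc(4, "bob", "See you Monday."),
]

custodial_unique, _ = dedupe(docs, scope="custodial")
global_unique, _ = dedupe(docs, scope="global")
print([d.doc_id for d in custodial_unique])  # [1, 3, 4]
print([d.doc_id for d in global_unique])     # [1, 4]
```

Custodial scope keeps Bob's copy of the budget email because it is only a duplicate of Alice's, not of anything in Bob's own collection; global scope removes it.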

Search Terms

With this method, you search the collection for relevant documents using terms and date ranges. Identifying the right search terms requires in-depth knowledge of the case. If you choose date ranges or terms that are too broad, you might end up with a larger document set than necessary, while a date range that is too narrow will return insufficient results and force you to run another search. To maximize the number of relevant documents, use terms that are unique to the matter.

Initial Culling Strategies

Filtering by Date

It is important to identify the relevant date range for the case. This allows you to cull data that is not in scope.
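A date filter is simple to express once each document carries a date. The following sketch assumes every document is a dictionary with a `sent` field; that field name, the `filter_by_date` function, and the sample range are assumptions made for illustration.

```python
from datetime import date


def filter_by_date(docs, start, end):
    """Split docs into (in_scope, out_of_scope) using an inclusive range."""
    in_scope, out_of_scope = [], []
    for doc in docs:
        (in_scope if start <= doc["sent"] <= end else out_of_scope).append(doc)
    return in_scope, out_of_scope


# Hypothetical documents around a first-half-of-2022 date range.
docs = [
    {"id": "A", "sent": date(2021, 12, 31)},
    {"id": "B", "sent": date(2022, 1, 1)},
    {"id": "C", "sent": date(2022, 6, 30)},
    {"id": "D", "sent": date(2022, 7, 1)},
]

in_scope, out_of_scope = filter_by_date(docs, date(2022, 1, 1), date(2022, 6, 30))
print([d["id"] for d in in_scope])      # ['B', 'C']
print([d["id"] for d in out_of_scope])  # ['A', 'D']
```

Returning the out-of-scope set instead of discarding it matches the point made earlier that culled documents are sequestered, not deleted.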

Keyword Filtering

This strategy can be used to eliminate certain types of emails that bear no relevance to the case, such as mass company email blasts or generic notifications.

Domain Filtering

This method works like a spam filter. It will search the data using known domains in order to eliminate newsletters, junk mail, and other irrelevant items.

Advantages of In-House Data Culling

In-house data culling can offer a multitude of benefits. By performing these tasks in your organization, you can:

  • Define and limit the scope of your project
  • Isolate relevant subsets of data by performing multiple rounds of data culling
  • Control and monitor your budget and reduce costs

However, in order to make the most of these benefits, you have to implement the best data culling strategies. These will improve the way your team handles the data culling process.

Best Practices for In-House Data Culling

You need best practices that will help you avoid over-collecting evidence and implement the data culling process in the best way possible. Let’s get into quick process tips for lowering eDiscovery costs:

Use Data Clustering

This popular data classification technique uses an algorithm to group similar data in a dataset. It reveals similarities in the data and gives you an overall picture of the dataset. You can review datasets by topic and find the key terms related to each topic. This information can be used to create a list of search terms that identify irrelevant data.

Generate Lists of Search Items

Once you know the common terms used in the dataset, you can generate a list of search terms. These search terms can be used to separate irrelevant documents from relevant ones.
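One simple way to surface common terms is to count how many documents each term appears in. The sketch below is an assumption-laden illustration: the `document_frequency` function, the tiny `STOPWORDS` set, the minimum term length, and the sample texts are all hypothetical, and a review platform would typically provide this kind of term report itself.

```python
import re
from collections import Counter

# Illustrative stop list only; a real one would be much longer.
STOPWORDS = {"this", "that", "with", "from", "have"}


def document_frequency(texts, min_length=4):
    """Count the number of documents in which each term appears."""
    counts = Counter()
    for text in texts:
        terms = {
            term
            for term in re.findall(r"[a-z]+", text.lower())
            if len(term) >= min_length and term not in STOPWORDS
        }
        counts.update(terms)  # each term counted once per document
    return counts


texts = [
    "Weekly newsletter: unsubscribe at any time",
    "Newsletter digest, click unsubscribe to stop",
    "Contract pricing for the merger is attached",
]

df = document_frequency(texts)
# Terms that recur across documents are candidates for exclusionary searches.
candidates = sorted(term for term, n in df.items() if n >= 2)
print(candidates)  # ['newsletter', 'unsubscribe']
```

The resulting candidates still need a human decision: a frequent term may mark junk mail, or it may be the central topic of the case.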

Isolate Email Domains

The email domain is the part of an email address that follows the “@”, such as “gmail.com,” “yahoo.com,” or “outlook.com.” When culling the data, you can identify the email domains in each dataset and exclude recipients and senders with irrelevant email domains. This email culling process will help you remove spam emails and newsletters from your dataset. The same process can be used to identify email exchanges that may contain privileged information, for example by isolating messages to or from a law firm’s domain.
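As a rough sketch of this step, the code below tallies sender domains and then sets aside messages from domains you choose to exclude. The message layout (a dictionary with a `from` field), the function names, and the `.example` domains are hypothetical, and for brevity only senders are checked, not recipients.

```python
from collections import Counter
from email.utils import parseaddr


def domain_of(address):
    """Return the lower-cased domain of an email address, or "" if none."""
    _, addr = parseaddr(address)
    return addr.rpartition("@")[2].lower() if "@" in addr else ""


def tally_domains(messages):
    """Count messages per sender domain to see what the dataset contains."""
    return Counter(domain_of(msg["from"]) for msg in messages)


def cull_by_domain(messages, excluded_domains):
    """Split messages into (kept, culled) by sender domain."""
    kept, culled = [], []
    for msg in messages:
        (culled if domain_of(msg["from"]) in excluded_domains else kept).append(msg)
    return kept, culled


messages = [
    {"id": 1, "from": "Deals <deals@promo.example>"},
    {"id": 2, "from": "alice@client.example"},
    {"id": 3, "from": "News <news@PROMO.example>"},
]

tally = tally_domains(messages)
kept, culled = cull_by_domain(messages, {"promo.example"})
print(tally.most_common())
print([m["id"] for m in kept], [m["id"] for m in culled])
```

Running the tally first is the useful part: it shows which domains dominate the dataset before you decide which ones to exclude or to route for privilege review.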

Perform Quality Control on Search Terms

To understand your search results and decide how to cull the dataset further, run statistics on random samples of the documents your searches return. Sampling also helps you identify additional exclusionary search terms that can eliminate irrelevant documents.
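A basic version of this check draws a random sample of a search's hits, has a reviewer mark each sampled document, and estimates what share of the hits is relevant. The sketch below is illustrative only: the function names and the synthetic `hits` list are assumptions, the `relevant` flag stands in for a human reviewer's call, and the margin of error uses the standard normal approximation for a proportion.

```python
import math
import random


def sample_hits(hits, sample_size, seed=0):
    """Draw a reproducible random sample from a search's hits."""
    rng = random.Random(seed)
    return rng.sample(hits, min(sample_size, len(hits)))


def precision_estimate(sample, is_relevant):
    """Share of sampled documents judged relevant."""
    if not sample:
        return 0.0
    return sum(1 for doc in sample if is_relevant(doc)) / len(sample)


def margin_of_error(p, n, z=1.96):
    """Normal-approximation margin of error (z=1.96 for about 95%)."""
    return z * math.sqrt(p * (1 - p) / n)


# Synthetic hits: the "relevant" flag stands in for a reviewer's decision.
hits = [{"id": i, "relevant": i % 4 == 0} for i in range(200)]

sample = sample_hits(hits, 50)
estimate = precision_estimate(sample, lambda doc: doc["relevant"])
print(f"estimated precision: {estimate:.2f} "
      f"(+/- {margin_of_error(estimate, len(sample)):.2f})")
```

A low estimate suggests the term is too broad and needs narrowing, and the irrelevant documents in the sample are a good place to look for exclusionary terms.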

Analytical Tools for More Efficient Data Culling

Apart from implementing the best practices for data culling, you also need the right tools to efficiently cull your data and help you avoid any unnecessary costs. Let’s take a look at some of these analytical tools:

Clustering

Clustering is a form of unsupervised machine learning that groups similar items together. It enables users to recognize the topics or characteristics that make the items in a group similar. This way, you can optimize your workflow by taking action on a whole group of items at once.

Near Duplication

With a near duplication tool, you can identify documents whose text is at least 50% similar. These files can be repetitive email threads or multiple versions of the same file with small modifications.
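To make the idea of a similarity threshold concrete, the sketch below scores document pairs with Jaccard similarity over three-word shingles and flags pairs at or above 0.5. This is one common way to measure textual similarity, not necessarily how any particular product does it; the function names and sample texts are hypothetical.

```python
def shingles(text, size=3):
    """Return the set of overlapping word sequences of the given size."""
    words = text.lower().split()
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a, b):
    """Jaccard similarity: shared shingles divided by all shingles."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicate_pairs(docs, threshold=0.5):
    """Return (id, id, score) for every pair at or above the threshold."""
    ids = sorted(docs)
    sets = {doc_id: shingles(docs[doc_id]) for doc_id in ids}
    pairs = []
    for index, left in enumerate(ids):
        for right in ids[index + 1:]:
            score = jaccard(sets[left], sets[right])
            if score >= threshold:
                pairs.append((left, right, score))
    return pairs


docs = {
    "v1": "please review the attached draft agreement before friday",
    "v2": "please review the attached draft agreement before monday",
    "other": "lunch is in the break room today",
}

pairs = near_duplicate_pairs(docs)
for left, right, score in pairs:
    print(left, right, round(score, 2))  # v1 v2 0.71
```

Comparing every pair is fine for a small example but grows quadratically with the number of documents, so large collections generally rely on approximate techniques to find candidate pairs first.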

Technology Assisted Review

Technology Assisted Review (TAR) handles the eDiscovery review process using algorithms that learn to classify documents from the decisions of expert reviewers. It can be more accurate, more thorough, and faster than human review alone. Using TAR through eDiscovery solutions such as Casepoint, you can expedite the document review process.

Sentiment Analysis

A sentiment analysis tool automatically extracts the positive or negative viewpoints and emotions expressed in a text. This enables you to prioritize relevant data for review and identify it faster.

Conclusion

With data volumes exploding and the pressure to reduce costs increasing at the same time, law firms need to focus on culling techniques. Avoiding over-collection and streamlining the eDiscovery process have become more important than ever. Fortunately, there are eDiscovery solutions that can help you manage your data culling process more efficiently.