What is Data Culling?

The first measure law firms can take to control eDiscovery costs is to stop over-collecting data. If you collect more than what is required, you will spend more money storing and transferring data, and more hours processing, analyzing, and reviewing it. Proportional collection can save a lot of time and resources. To achieve it, you have to improve your data culling practices.

In this article, we’ll discuss what data culling is, and how it can be used to streamline your eDiscovery process.

In simple terms, culling is reducing a sample by selective methods. Data culling is a process that allows you to identify documents on the basis of certain criteria, such as date ranges or keywords. This is what sets it apart from the document review process. By culling your data, you can hide certain information from the user’s view and prevent it from being returned in searches. It’s important to note that culling a document doesn’t mean that the document is deleted. It means that the document has been sequestered in a separate collection that you can access again.

Importance of Data Culling

In the information-driven society that we live in today, data culling has become increasingly important. Law firms are now dealing with a large volume of data that they have to gather, process, analyze, and review, which means increased eDiscovery costs. Not every organization has the bandwidth to shoulder these expenses.

According to a 2023 report by the IMF, global economic growth is expected to slow down to 2.7%. This might be one of the biggest economic slowdowns in modern times, behind only the global financial crisis of 2008 and the COVID-19 pandemic. At times like these, it is important for organizations to keep their expenses low.

The key to keeping the cost of eDiscovery down is avoiding over-collection. This is done by weeding out duplicative and irrelevant data. Once you have reduced the amount of data you are dealing with, your eDiscovery expenses will drop as well.

Types of Data Culling

As mentioned, the process of data culling involves searching and isolating data on the basis of specific criteria, such as keywords and date ranges. There are multiple techniques you can use that will remove documents from the collection before they can be processed and reviewed. Let’s take a look at three of the most common data culling methods:

  • DeNISTing

    In the DeNISTing method, junk data that could clog up the review is removed. This includes system and program files that contain no user-generated data of evidentiary value. It is a simple way of reducing irrelevant documents. The ‘NIST’ in DeNISTing stands for the National Institute of Standards and Technology, a federal agency under the U.S. Department of Commerce that promotes innovation and industrial competitiveness. Its National Software Reference Library publishes the list of known files that this filtering relies on.

  • Dedupe

    The Dedupe method involves identifying and separating duplicate documents and emails. It can be performed at the custodial or global level. Custodial deduping compares a custodian's documents only against that custodian's own collection, whereas global deduping removes duplicates across all custodians. The former keeps each custodian's collection intact, while the latter maximizes the number of duplicate documents removed from the collection.

  • Search Terms

    With this method, you search for relevant documents directly. To identify the right search terms and maximize results, you must have in-depth knowledge of the case. For instance, if you choose date ranges or terms that are too broad, you might end up with a larger document set than necessary. Conversely, a date range that is too narrow will return too few results and require you to perform another search. To maximize the number of relevant documents, you need to use terms that are unique to the case.
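
The first two methods above both come down to comparing file hashes, which can be sketched in a few lines of Python. This is a minimal illustration, not how any particular eDiscovery product works: the `docs` records, the `known_hashes` set (a stand-in for a real known-file list), and the function names are all hypothetical.

```python
import hashlib

def file_hash(content: bytes) -> str:
    """Fingerprint a file by its content (SHA-256 chosen for illustration)."""
    return hashlib.sha256(content).hexdigest()

def denist(docs, known_hashes):
    """Drop documents whose hash appears in a known system-file list."""
    return [d for d in docs if file_hash(d["content"]) not in known_hashes]

def dedupe(docs, scope="global"):
    """Keep the first copy of each document.

    scope="custodial": a document is a duplicate only within one custodian.
    scope="global":    a document is a duplicate across all custodians.
    """
    seen = set()
    kept = []
    for d in docs:
        h = file_hash(d["content"])
        key = h if scope == "global" else (d["custodian"], h)
        if key not in seen:
            seen.add(key)
            kept.append(d)
    return kept

# Hypothetical sample collection.
docs = [
    {"id": 1, "custodian": "alice", "content": b"Q3 forecast"},
    {"id": 2, "custodian": "alice", "content": b"Q3 forecast"},
    {"id": 3, "custodian": "bob", "content": b"Q3 forecast"},
    {"id": 4, "custodian": "bob", "content": b"system library bytes"},
]
# Stand-in for a known-file hash list.
known_hashes = {file_hash(b"system library bytes")}

after_denist = denist(docs, known_hashes)
custodial = dedupe(after_denist, scope="custodial")
global_ = dedupe(after_denist, scope="global")
```

In this sample, custodial deduping keeps one copy of the forecast for each custodian, while global deduping keeps a single copy overall, which matches the trade-off described above.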

Initial Culling Strategies

  • Filtering by File Type

    This determines which types of files are not required for the case at hand. You can put some files aside for further analysis, such as graphic or audio files. Other files, such as container files, are immaterial in themselves and matter only for their contents. A few examples of container files include mailbox files (MBOX, Outlook PST, etc.) and ZIP files.

  • Filtering by Date

    It is important to identify the relevant date range for the case. This allows you to cull data that is not in scope.

  • Keyword Filtering

    This strategy can be used to eliminate certain types of emails that bear no relevance to the case, such as mass company email blasts or generic notifications.

  • Domain Filtering

    This method works like a spam filter. It will search the data using known domains in order to eliminate newsletters, junk mail, and other irrelevant items.
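
The four filters above can be combined into a single pass over the collection. The sketch below is illustrative only: the exclusion lists, the date range, and the document fields (`name`, `date`, `sender`, `text`) are assumptions made for the example. Culled documents are set aside with a reason rather than deleted.

```python
from datetime import date

# Hypothetical culling criteria for one matter.
EXCLUDED_EXTENSIONS = (".exe", ".dll")
DATE_RANGE = (date(2022, 1, 1), date(2022, 12, 31))
EXCLUDED_DOMAINS = {"newsletter.example.com"}
EXCLUDED_KEYWORDS = ("unsubscribe", "company-wide notice")

def cull_reason(doc):
    """Return why a document is culled, or None if it stays in scope."""
    if doc["name"].lower().endswith(EXCLUDED_EXTENSIONS):
        return "file type"
    if not (DATE_RANGE[0] <= doc["date"] <= DATE_RANGE[1]):
        return "date"
    # Non-email files have no sender; the domain is then an empty string.
    domain = doc.get("sender", "").rsplit("@", 1)[-1].lower()
    if domain in EXCLUDED_DOMAINS:
        return "domain"
    text = doc.get("text", "").lower()
    if any(keyword in text for keyword in EXCLUDED_KEYWORDS):
        return "keyword"
    return None

def cull(docs):
    """Split docs into kept documents and sequestered (doc, reason) pairs."""
    kept, sequestered = [], []
    for doc in docs:
        reason = cull_reason(doc)
        if reason is None:
            kept.append(doc)
        else:
            sequestered.append((doc, reason))
    return kept, sequestered

docs = [
    {"id": 1, "name": "budget.xlsx", "date": date(2022, 3, 14),
     "sender": "cfo@client.example", "text": "Q1 budget figures"},
    {"id": 2, "name": "setup.exe", "date": date(2022, 3, 14)},
    {"id": 3, "name": "memo.docx", "date": date(2019, 6, 1), "text": "Old memo"},
    {"id": 4, "name": "digest.eml", "date": date(2022, 5, 2),
     "sender": "news@newsletter.example.com", "text": "Weekly digest"},
    {"id": 5, "name": "notice.eml", "date": date(2022, 5, 3),
     "sender": "hr@client.example", "text": "Click unsubscribe to opt out"},
]
kept, sequestered = cull(docs)
```

Recording the reason alongside each sequestered document makes it easier to revisit a filter later if it turns out to be too aggressive.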

Advantages of In-House Data Culling

In-house data culling can offer a multitude of benefits. By performing these tasks in your organization, you can:

  • Define and limit the scope of your project

  • Isolate relevant subsets of data by performing multiple rounds of data culling

  • Control and monitor your budget and reduce costs

However, in order to make the most of these benefits, you have to implement the best data culling strategies. These will improve the way your team handles the data culling process.

Best Practices for In-House Data Culling

You need best practices that will help you avoid over-collecting evidence and implement the data culling process in the best way possible. Let’s get into quick process tips for lowering eDiscovery costs:

  • Use Data Clustering

    This popular data classification technique uses an algorithm that groups similar data in a dataset. It not only reveals similarities in the data but also gives you an overall picture of it. You can review datasets by topic and find the key terms related to each topic. This information can be used to create a list of search terms that help you identify irrelevant data.

  • Generate Lists of Search Terms

    Once you know the common terms used in the data set, you can generate a list of search terms. These search terms can be used to separate irrelevant documents from relevant ones.


  • Isolate Custodian Data

    It is important to have a list of all the key players involved in the case. Each of them must be assigned a custodian ID. This way, you can isolate information pertaining to the custodian and use this information to generate topic clusters and search term lists for culling data. This information can also be used for performing searches on other custodians’ data.

  • Isolate Email Domains

    The email domain is the web address following the “@” in an email address. This can be “gmail.com,” “yahoo.com,” or “outlook.com.” When culling the data, you can identify the email domains in each data set and exclude recipients and senders with irrelevant email domains. This email culling process will help you remove spam emails and newsletters from your dataset. The same email eDiscovery process can be used for identifying email exchanges with privileged information.

  • Perform Quality Control on Search Terms

    In order to get a better understanding of your search results and figure out what to do to cull the dataset, you need to run statistics on random samples of data when performing searches. This will also help you identify additional exclusionary search terms that can eliminate irrelevant documents.

  • Use In-Place Search Technology

    In-place search technology, such as an eDiscovery solution, can help you reduce your dataset more effectively. eDiscovery solutions such as Casepoint enable you to perform comprehensive searches across data sources. The search results will be compiled into a content index that contains valuable information about the data. This information can include where it is stored, how long it has been stored, who has access to this information, and more. With this index, you can determine the relevance of the data for the case.
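
The quality-control step above, running statistics on random samples of search results, could look something like the following. This is a rough sketch under stated assumptions: the `reviewer_labels` mapping stands in for relevance calls made by a human reviewer on the sampled documents, and all names and data are hypothetical.

```python
import random

def search_hits(docs, term):
    """Return documents whose text contains the search term."""
    term = term.lower()
    return [d for d in docs if term in d["text"].lower()]

def sampled_relevance_rate(hits, reviewer_labels, sample_size, seed=0):
    """Estimate the share of relevant hits from a random sample.

    reviewer_labels maps document id -> True/False as judged by a reviewer.
    Returns None when the term produced no hits.
    """
    if not hits:
        return None
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    sample = rng.sample(hits, min(sample_size, len(hits)))
    relevant = sum(1 for d in sample if reviewer_labels[d["id"]])
    return relevant / len(sample)

docs = [
    {"id": 1, "text": "Invoice 1042 for the Acme contract"},
    {"id": 2, "text": "Revised invoice attached, Acme contract"},
    {"id": 3, "text": "Team lunch on Friday"},
    {"id": 4, "text": "Lunch menu for the offsite"},
    {"id": 5, "text": "Acme contract draft with comments"},
]
# Hypothetical reviewer decisions for the sampled documents.
reviewer_labels = {1: True, 2: True, 3: False, 4: False, 5: True}

invoice_rate = sampled_relevance_rate(search_hits(docs, "invoice"), reviewer_labels, 10)
lunch_rate = sampled_relevance_rate(search_hits(docs, "lunch"), reviewer_labels, 10)
```

A term whose sampled hits are mostly irrelevant, like "lunch" in this toy data, is a candidate for an exclusionary search term.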

Analytical Tools for More Efficient Data Culling

Apart from implementing the best practices for data culling, you also need the right tools to efficiently cull your data and help you avoid any unnecessary costs. Let’s take a look at some of these analytical tools:

  • Email Threading

    An email thread is the sequence of emails sent in reply to an original email. Depending on the context, these threads can be very long. In the email threading process, related emails are grouped together for a more efficient review and analysis process.

  • Clustering

    Clustering is a form of unsupervised machine learning that groups similar items together. It enables users to recognize the topics or characteristics that make the data similar. This way, you can optimize your workflow by taking action on a whole group of items at once.

  • Near Duplication

    With a near duplication tool, you can identify data that is at least 50% similar in the contents of its text. These files can be repetitive email threads or multiple versions of the same file with small modifications.

  • Technology Assisted Review

    Technology Assisted Review (TAR) handles the eDiscovery review process using algorithms that classify documents with the help of expert reviewers. It can be more accurate, thorough, and faster than human review alone. Using TAR through eDiscovery solutions such as Casepoint, you can expedite the document review process.

  • Sentiment Analysis

    A sentiment analysis tool automatically extracts the positive or negative viewpoints and emotions expressed in a text. This enables you to prioritize relevant data for review and identify it faster.
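
To make the near duplication idea concrete, the sketch below flags pairs of documents whose text is at least 50% similar. It assumes Jaccard similarity over word sets as the measure, which is a simplification chosen for readability; commercial tools may use different and more scalable similarity methods. The sample documents and function names are hypothetical.

```python
import re
from itertools import combinations

def words(text):
    """Lowercase a text and return its set of words."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))

def jaccard(a, b):
    """Share of words two texts have in common (0.0 to 1.0)."""
    wa, wb = words(a), words(b)
    union = wa | wb
    if not union:
        return 0.0
    return len(wa & wb) / len(union)

def near_duplicates(docs, threshold=0.5):
    """Return id pairs whose text similarity meets the threshold."""
    pairs = []
    for left, right in combinations(docs, 2):
        if jaccard(left["text"], right["text"]) >= threshold:
            pairs.append((left["id"], right["id"]))
    return pairs

docs = [
    {"id": 1, "text": "Please review the attached contract draft today."},
    {"id": 2, "text": "Please review the attached contract draft tomorrow."},
    {"id": 3, "text": "Lunch menu for Friday"},
]
pairs = near_duplicates(docs)  # documents 1 and 2 share 6 of 8 distinct words
```

Comparing every pair is fine for a small sample but grows quadratically, so a real collection would need an indexing step before pairwise comparison.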

Conclusion

Given the explosion in data volume and the simultaneous pressure to reduce costs, law firms need to focus on culling techniques. Avoiding over-collection and streamlining the eDiscovery process have become more important than ever. Fortunately, there are eDiscovery solutions that can help you manage your data culling process more efficiently.