What is Data Culling?
The first measure law firms can take is to stop over-collecting data for advanced eDiscovery. If you collect more than is required, you will spend more money storing and transferring data, and more hours processing, analyzing, and reviewing it. Proportional collection can save a lot of time and resources. To achieve it, you have to improve your data culling practices.
In this article, we’ll discuss what data culling is, and how it can be used to streamline your eDiscovery process.
In simple terms, culling is reducing a sample by selective methods. Data culling is a process that allows you to identify documents on the basis of certain criteria, such as date ranges or keywords. This is what sets it apart from the document review process. By culling your data, you can hide certain information from the user’s view and prevent it from being returned in searches. It’s important to note that culling a document doesn’t mean that the document is deleted. It means that the document has been sequestered in a separate collection that you can access again.
Importance of Data Culling
In the information-driven society that we live in today, data culling has become increasingly important. Law firms are now dealing with a large volume of data that they have to gather, process, analyze, and review, which means increased eDiscovery costs. Not every organization has the bandwidth to shoulder these expenses.
According to a 2023 report by the IMF, global economic growth is expected to slow down to 2.7%. Setting aside the global financial crisis of 2008 and the COVID-19 pandemic, this would be among the sharpest slowdowns in recent decades. At times like these, it is important for organizations to keep their expenses low.
The key to keeping the cost of eDiscovery down is avoiding over-collection. This is done by weeding out duplicative and irrelevant data. Once you have reduced the amount of data you are dealing with, your eDiscovery expenses will drop as well.
Types of Data Culling
As mentioned, the process of data culling involves searching and isolating data on the basis of specific criteria, such as keywords and date ranges. There are multiple techniques you can use that will remove documents from the collection before they can be processed and reviewed. Let’s take a look at three of the most common data culling methods:
In the DeNISTing method, junk data that could clog up the review is removed. This includes system and program files that contain no user-generated data with evidentiary value. It is a simple way of reducing irrelevant documents. The ‘NIST’ in DeNISTing stands for the National Institute of Standards and Technology, a federal agency under the U.S. Department of Commerce that promotes innovation and industrial competitiveness. NIST maintains a reference list of known software files, and DeNISTing removes the files that match it.
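DeNISTing is commonly implemented by hashing each file and discarding those whose hash appears on a list of known system and application files. The sketch below illustrates that idea only. `KNOWN_SYSTEM_HASHES` and `denist` are hypothetical names, and the single hard-coded hash stands in for a real reference list.

```python
import hashlib

# Hypothetical stand-in for a reference list of known system/program file hashes.
KNOWN_SYSTEM_HASHES = {hashlib.sha1(b"system file bytes").hexdigest()}


def sha1_of(data):
    """Return the SHA-1 hex digest of a file's raw bytes."""
    return hashlib.sha1(data).hexdigest()


def denist(files, known_hashes):
    """Keep only the files whose hash is NOT on the known-file list.

    `files` maps a file name to its raw bytes.
    """
    return {
        name: data
        for name, data in files.items()
        if sha1_of(data) not in known_hashes
    }


collection = {
    "kernel32.dll": b"system file bytes",  # matches the known-file list
    "memo.docx": b"user memo",             # user-generated, so it is kept
}
remaining = denist(collection, KNOWN_SYSTEM_HASHES)
```

Because the match is on content hashes rather than file names, a renamed system file would still be removed.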
The Dedupe method involves identifying and separating duplicate documents and emails. It can be performed at the custodial or global level. Custodial deduping removes duplicates within a single custodian's data, whereas global deduping removes duplicates across all custodians. The former keeps each custodian's collection intact, while the latter maximizes the number of duplicate documents removed from the collection.
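The difference between custodial and global deduping can be sketched as a change in the key used to detect duplicates. This is a minimal illustration that treats two documents as duplicates when their content hashes match. `Doc` and `dedupe` are hypothetical names, not part of any eDiscovery product.

```python
import hashlib
from typing import NamedTuple


class Doc(NamedTuple):
    custodian: str
    name: str
    content: bytes


def dedupe(docs, scope="global"):
    """Return the documents with duplicates removed, keeping the first copy seen.

    scope="custodial": a duplicate is the same content within one custodian.
    scope="global":    a duplicate is the same content across all custodians.
    """
    if scope not in ("custodial", "global"):
        raise ValueError("scope must be 'custodial' or 'global'")
    seen = set()
    kept = []
    for doc in docs:
        digest = hashlib.sha256(doc.content).hexdigest()
        key = digest if scope == "global" else (doc.custodian, digest)
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept


docs = [
    Doc("alice", "a.txt", b"x"),
    Doc("alice", "b.txt", b"x"),  # duplicate within alice
    Doc("bob", "c.txt", b"x"),    # duplicate of alice's file, but a different custodian
    Doc("bob", "d.txt", b"y"),
]
custodial = dedupe(docs, scope="custodial")
global_ = dedupe(docs, scope="global")
```

Here the custodial pass keeps three documents (bob retains his own copy), while the global pass keeps two.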
With keyword and date-range searching, you can look for relevant documents. To identify the right search terms and maximize results, you must have in-depth knowledge of the case. For instance, if you choose date ranges or terms that are too broad, you might end up with a larger document set than necessary. Conversely, a date range that is too narrow will return insufficient results and require you to perform another search. To maximize the number of relevant documents, you need to use terms that are specific to the case.
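A combined keyword and date-range filter might look like the sketch below. The document layout (dictionaries with `id`, `date`, and `text` keys) and the function name `search` are assumptions made for illustration.

```python
import re
from datetime import date


def search(docs, terms, start, end):
    """Return documents dated within [start, end] that contain any search term.

    Terms are matched as whole words, case-insensitively.
    """
    patterns = [
        re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        for term in terms
    ]
    results = []
    for doc in docs:
        in_range = start <= doc["date"] <= end
        has_term = any(p.search(doc["text"]) for p in patterns)
        if in_range and has_term:
            results.append(doc)
    return results


docs = [
    {"id": 1, "date": date(2021, 3, 1), "text": "Draft merger agreement attached"},
    {"id": 2, "date": date(2019, 1, 5), "text": "Old merger notes"},
    {"id": 3, "date": date(2021, 4, 2), "text": "Lunch on Friday"},
]
hits = search(docs, ["merger"], date(2021, 1, 1), date(2021, 12, 31))
```

Widening the date range to include 2019 would also return document 2, which shows how a broad range inflates the result set.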
Initial Culling Strategies
Filtering by File Type
Filtering by file type identifies the types of files that are not required for the case at hand. You can put some files aside for further analysis, such as graphic or audio files. Other files, such as container files, have no relevance apart from their contents. Examples of container files include mailbox files (MBOX, Outlook PST, etc.) and ZIP files.
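File-type filtering can be sketched as a simple triage on file extensions. The extension sets and the `triage_by_type` function are illustrative assumptions. Extensions can be wrong or missing, so a production workflow would likely inspect file contents as well.

```python
from pathlib import PurePath

# Illustrative extension lists; adjust them to the needs of the case.
SET_ASIDE_EXTENSIONS = {".jpg", ".png", ".gif", ".mp3", ".wav"}
CONTAINER_EXTENSIONS = {".zip", ".pst", ".mbox"}


def triage_by_type(filenames):
    """Sort file names into review, set-aside, and container buckets."""
    buckets = {"review": [], "set_aside": [], "container": []}
    for name in filenames:
        ext = PurePath(name).suffix.lower()
        if ext in CONTAINER_EXTENSIONS:
            buckets["container"].append(name)   # only the contents matter
        elif ext in SET_ASIDE_EXTENSIONS:
            buckets["set_aside"].append(name)   # held for further analysis
        else:
            buckets["review"].append(name)
    return buckets


buckets = triage_by_type(["contract.docx", "Archive.ZIP", "logo.png", "mail.pst"])
```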
Filtering by Date
It is important to identify the relevant date range for the case. This allows you to cull data that is not in scope.
Filtering by email type can be used to eliminate certain types of emails that bear no relevance to the case, such as mass company email blasts or generic notifications.
Filtering by domain works like a spam filter. It searches the data using known domains in order to eliminate newsletters, junk mail, and other irrelevant items.
Advantages of In-House Data Culling
In-house data culling can offer a multitude of benefits. By performing these tasks in your organization, you can:
- Define and limit the scope of your project
- Isolate relevant subsets of data by performing multiple rounds of data culling
- Control and monitor your budget and reduce costs
However, in order to make the most of these benefits, you have to implement the best data culling strategies. These will improve the way your team handles the data culling process.
Best Practices for In-House Data Culling
The following best practices will help you avoid over-collecting evidence and carry out data culling effectively. Let’s get into some quick process tips for lowering eDiscovery costs:
Use Data Clustering
This popular data classification technique uses an algorithm to group similar data in a dataset. It reveals similarities in the data and gives you an overall picture of the dataset. You can review datasets by topic and find the key terms related to each topic. This information can be used to create a list of search terms that helps you identify irrelevant data.
Generate Lists of Search Items
Once you know the common terms used in the data set, you can generate a list of search terms. These search terms can be used to separate irrelevant documents from relevant ones.
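One simple way to surface common terms as candidates for a search term list is to count word frequencies across the documents. The sketch below assumes plain-text documents and a small, illustrative stop-word list. `top_terms` is a hypothetical helper.

```python
import re
from collections import Counter

# A deliberately small, illustrative stop-word list.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to"}


def top_terms(texts, n=10):
    """Return the n most frequent non-stop-word terms as (term, count) pairs."""
    counts = Counter()
    for text in texts:
        words = re.findall(r"[a-z]+", text.lower())
        counts.update(w for w in words if w not in STOP_WORDS)
    return counts.most_common(n)


texts = [
    "Merger agreement draft attached",
    "Revised merger agreement",
    "Lunch on Friday?",
    "merger timeline",
]
candidates = top_terms(texts, n=2)
```

The resulting candidates still need review by someone who knows the case, since a frequent term is not necessarily a relevant one.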
Isolate Custodian Data
It is important to have a list of all the key players involved in the case. Each of them must be assigned a custodian ID. This way, you can isolate information pertaining to the custodian and use this information to generate topic clusters and search term lists for culling data. This information can also be used for performing searches on other custodians’ data.
Isolate Email Domains
The email domain is the part of an email address that follows the “@”, for example “gmail.com,” “yahoo.com,” or “outlook.com.” When culling the data, you can identify the email domains in each data set and exclude recipients and senders with irrelevant email domains. This email culling process will help you remove spam emails and newsletters from your dataset. The same process can be used to identify email exchanges that may contain privileged information.
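Domain-based culling can be sketched with the standard library's `email.utils.parseaddr`. The message layout (a dictionary with a `from` key) and the excluded domains are assumptions chosen for illustration.

```python
from email.utils import parseaddr


def domain_of(address):
    """Return the lowercase domain of an email address, or '' if there is none."""
    _, addr = parseaddr(address)
    if "@" not in addr:
        return ""
    return addr.rpartition("@")[2].lower()


def cull_by_domain(messages, excluded_domains):
    """Split messages into (kept, culled) based on the sender's domain."""
    kept, culled = [], []
    for msg in messages:
        if domain_of(msg["from"]) in excluded_domains:
            culled.append(msg)
        else:
            kept.append(msg)
    return kept, culled


messages = [
    {"id": 1, "from": "Deals Weekly <news@promo.example.com>"},
    {"id": 2, "from": "jane.doe@clientco.example"},
]
kept, culled = cull_by_domain(messages, {"promo.example.com"})
```

Culled messages are returned rather than discarded, which matches the idea that culled data is set aside and remains accessible.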
Perform Quality Control on Search Terms
To better understand your search results and decide how to cull the dataset further, run statistics on random samples of data when performing searches. This will also help you identify additional exclusionary search terms that can eliminate irrelevant documents.
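Sampling statistics on search results can be as simple as estimating what share of the hits is actually relevant. The sketch below draws a random sample and reports the relevant proportion with an approximate 95% interval (normal approximation). `is_relevant` is a hypothetical stand-in for a reviewer's coding decision.

```python
import math
import random


def estimate_hit_rate(hits, is_relevant, sample_size, seed=0):
    """Estimate the share of relevant documents among search hits.

    Returns (proportion, lower_bound, upper_bound) using a normal
    approximation for a 95% confidence interval.
    """
    if not hits:
        raise ValueError("no search hits to sample from")
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    sample = rng.sample(hits, min(sample_size, len(hits)))
    n = len(sample)
    p = sum(1 for doc in sample if is_relevant(doc)) / n
    margin = 1.96 * math.sqrt(p * (1 - p) / n)
    return p, max(0.0, p - margin), min(1.0, p + margin)


# Toy data: document IDs, where even IDs count as "relevant".
hits = list(range(100))
rate, low, high = estimate_hit_rate(hits, lambda doc: doc % 2 == 0, sample_size=100)
```

A low estimated rate suggests the search term is too broad and may need an exclusionary term added.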
Use In-Place Search Technology
In-place search technology, such as an eDiscovery solution, can help you reduce your dataset more effectively. eDiscovery solutions such as Casepoint enable you to perform comprehensive searches across data sources. The search results will be compiled into a content index that contains valuable information about the data. This information can include where it is stored, how long it has been stored, who has access to this information, and more. With this index, you can determine the relevance of the data for the case.
Analytical Tools for More Efficient Data Culling
Apart from implementing the best practices for data culling, you also need the right tools to efficiently cull your data and help you avoid any unnecessary costs. Let’s take a look at some of these analytical tools:
An email thread is the sequence of emails sent in reply to an original email. Depending on the context, these threads can be very long. In the email threading process, related emails are grouped together for more efficient review and analysis.
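Email threading can be sketched by following each message's reply reference back to the message that started the conversation. The sketch assumes each message carries an `id` and an optional `in_reply_to` value, mirroring the standard Message-ID and In-Reply-To headers. Real tools typically use additional signals such as subject lines and quoted text.

```python
from collections import defaultdict


def find_root(msg_id, parent):
    """Follow reply links upward until reaching a message with no known parent."""
    seen = set()
    while msg_id not in seen:  # guard against malformed, circular references
        seen.add(msg_id)
        p = parent.get(msg_id)
        if p is None or p not in parent:
            break
        msg_id = p
    return msg_id


def group_threads(messages):
    """Group message IDs by the ID of the message that started the thread."""
    parent = {m["id"]: m.get("in_reply_to") for m in messages}
    threads = defaultdict(list)
    for m in messages:
        threads[find_root(m["id"], parent)].append(m["id"])
    return dict(threads)


messages = [
    {"id": "m1", "in_reply_to": None},
    {"id": "m2", "in_reply_to": "m1"},
    {"id": "m3", "in_reply_to": "m2"},
    {"id": "m4", "in_reply_to": None},
]
threads = group_threads(messages)
```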
Clustering is a form of unsupervised machine learning that groups similar items together. It lets users recognize the topics or characteristics that make the items in a group similar. This way, you can optimize your workflow by taking action on a whole group of items at once.
With a near-duplicate detection tool, you can identify documents whose text is at least 50% similar. These files can be repetitive email threads or multiple versions of the same file with small modifications.
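One common way to measure textual similarity is Jaccard similarity over word shingles (overlapping runs of words). The sketch below applies a 0.5 threshold to echo the 50% figure. This is an illustrative measure, and commercial tools may define similarity differently.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles in the text (case-insensitive)."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b, k=3):
    """Jaccard similarity of two texts: shared shingles / all distinct shingles."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0  # two empty texts are treated as identical
    return len(sa & sb) / len(sa | sb)


def is_near_duplicate(a, b, threshold=0.5):
    """True when the two texts meet or exceed the similarity threshold."""
    return jaccard(a, b) >= threshold


v1 = "the quarterly report is attached for your review today"
v2 = "the quarterly report is attached for your review now"
other = "lunch menu for friday"
score = jaccard(v1, v2)
```

The two versions above differ only in their final word, so they share six of eight distinct shingles.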
Technology Assisted Review
Technology Assisted Review (TAR) supports the eDiscovery review process with algorithms that learn to classify documents from the coding decisions of expert reviewers. It can be more accurate, more thorough, and faster than human review alone. Using TAR through eDiscovery solutions such as Casepoint, you can expedite the document review process.
A sentiment analysis tool automatically extracts the positive or negative viewpoints and emotions expressed in a text. This enables you to prioritize relevant data for review and identify it faster.
With data volumes exploding and the pressure to reduce costs increasing, law firms need to focus on culling techniques. Avoiding over-collection and streamlining the eDiscovery process have become more important than ever. Fortunately, there are eDiscovery solutions that can help you manage your data culling process more efficiently.