What is Data Culling?
The first measure law firms can take for advanced eDiscovery is to stop over-collecting data. If you collect more than what is required, you will spend more money on storing and transferring data, and more hours processing, analyzing, and reviewing it. Proportional collection can save a lot of time and resources. To achieve it, you have to improve your data culling practices.
In this article, we’ll discuss what data culling is, and how it can be used to streamline your eDiscovery process.
In simple terms, culling is reducing a sample by selective methods. Data culling is a process that allows you to identify and set aside documents on the basis of objective criteria, such as date ranges or keywords. This is what sets it apart from the document review process, in which each document is assessed individually. By culling your data, you can hide certain information from the user’s view and prevent it from being returned in searches. It’s important to note that culling a document does not delete it. The document is sequestered in a separate collection that you can access again.
Importance of Data Culling
In the information-driven society that we live in today, data culling has become increasingly important. Law firms are now dealing with a large volume of data that they have to gather, process, analyze, and review, which means increased eDiscovery costs. Not every organization has the bandwidth to shoulder these expenses.
According to a 2023 report by the IMF, global economic growth is expected to slow down to 2.7%. This might be one of the biggest economic slowdowns in the modern world, behind only the global financial crisis of 2008 and the COVID-19 pandemic. At times like these, it is important for organizations to keep their expenses low.
The key to keeping the cost of eDiscovery down is avoiding over-collection. This is done by weeding out duplicative and irrelevant data. Once you have reduced the amount of data you are dealing with, your eDiscovery expenses will drop as well.
Types of Data Culling
As mentioned, the process of data culling involves searching and isolating data on the basis of specific criteria, such as keywords and date ranges. There are multiple techniques you can use that will remove documents from the collection before they can be processed and reviewed. Let’s take a look at three of the most common data culling methods:
- DeNISTing
In the DeNISTing method, all the junk data that could potentially clog up the review is removed. This includes system and program files that contain no user-generated data and have no evidentiary value. These files are typically identified by comparing their hash values against the list of known software files in the National Software Reference Library. It is a simple way of reducing irrelevant documents. The ‘NIST’ in DeNISTing stands for the National Institute of Standards and Technology, the federal agency under the U.S. Department of Commerce that maintains this library and promotes innovation and industrial competitiveness.
- Dedupe
The Dedupe method involves identifying and separating duplicate documents and emails. It can be performed at the custodial or global level. Custodial deduping compares a custodian’s documents only against that custodian’s own collection, whereas global deduping removes duplicates across all custodians. The former ensures that each custodian’s collection is kept intact, while the latter maximizes the number of duplicate documents removed from the collection.
- Search Terms
With this method, you search for relevant documents using keywords and date ranges. To identify the right search terms and maximize results, you must have in-depth knowledge of the case. For instance, if you choose date ranges or terms that are too broad, you might end up with a larger document set than necessary. Conversely, a date range that is too narrow will return insufficient results and require you to perform another search. To maximize the proportion of relevant documents, use terms that are unique to the case.
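The DeNISTing and dedupe methods above both come down to comparing file fingerprints. The following is a minimal Python sketch of that idea, not a description of how any particular eDiscovery product works. The `Doc` class, the sample documents, and the `known` hash set are hypothetical; in practice the known-file list would come from a reference hash set, and email deduplication usually hashes selected metadata fields rather than raw bytes.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    custodian: str  # hypothetical custodian ID
    name: str
    content: bytes


def digest(doc):
    """Fingerprint a document so identical content yields the same hash."""
    return hashlib.sha256(doc.content).hexdigest()


def denist(docs, known_hashes):
    """Drop files whose hash appears on a list of known system files."""
    return [d for d in docs if digest(d) not in known_hashes]


def dedupe(docs, scope="global"):
    """Keep the first copy of each document.

    scope="custodial" compares a custodian only against themselves;
    scope="global" compares across every custodian.
    """
    if scope not in ("custodial", "global"):
        raise ValueError("scope must be 'custodial' or 'global'")
    seen = set()
    kept = []
    for d in docs:
        key = (d.custodian, digest(d)) if scope == "custodial" else digest(d)
        if key not in seen:
            seen.add(key)
            kept.append(d)
    return kept


# Hypothetical collection: one system file and three copies of the same memo.
docs = [
    Doc("alice", "driver.dll", b"system file bytes"),
    Doc("alice", "memo.txt", b"Q3 pricing memo"),
    Doc("alice", "memo_copy.txt", b"Q3 pricing memo"),
    Doc("bob", "memo.txt", b"Q3 pricing memo"),
    Doc("bob", "notes.txt", b"meeting notes"),
]
# Stand-in for a real known-file hash list (an assumption for this sketch).
known = {hashlib.sha256(b"system file bytes").hexdigest()}

remaining = denist(docs, known)
custodial = dedupe(remaining, "custodial")
global_ = dedupe(remaining, "global")
print(len(remaining), len(custodial), len(global_))  # prints: 4 3 2
```

Custodial deduping keeps Bob’s copy of the memo because it only compares Bob against himself, while global deduping removes it. That is the trade-off between keeping each custodian’s collection intact and removing the most duplicates.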
Initial Culling Strategies
- Filtering by File Type
This strategy identifies the types of files that are not required for the case at hand. You can put some files aside for further analysis, such as graphic or audio files. Other files, such as container files, have no relevance apart from their contents. A few examples of container files include mailbox files (MBOX, Outlook PST, etc.) and ZIP files.
- Filtering by Date
It is important to identify the relevant date range for the case. This allows you to cull data that is not in scope.
- Keyword Filtering
This strategy can be used to eliminate certain types of emails that bear no relevance to the case, such as mass company email blasts or generic notifications.
- Domain Filtering
This method works like a spam filter. It will search the data using known domains in order to eliminate newsletters, junk mail, and other irrelevant items.
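The four initial strategies above can be combined into a single pass over the collection. Here is a rough Python sketch under stated assumptions: the `Item` class, the extension list, the date range, the noise keywords, and the `news.example.com` domain are all made up for illustration and would need to be set from the facts of the case. Culled items are grouped by reason rather than deleted, so they can be accessed again later.

```python
from dataclasses import dataclass
from datetime import date
from pathlib import PurePath


@dataclass
class Item:
    name: str
    sent: date
    sender: str = ""
    subject: str = ""


# Illustrative settings only; real values depend on the case.
SET_ASIDE_EXTS = {".png", ".jpg", ".wav", ".mp3"}
DATE_RANGE = (date(2021, 1, 1), date(2021, 12, 31))
NOISE_KEYWORDS = ("newsletter", "out of office")
NOISE_DOMAINS = {"news.example.com"}


def cull_reason(item):
    """Return why an item is culled, or None if it stays in the review set."""
    if PurePath(item.name).suffix.lower() in SET_ASIDE_EXTS:
        return "file type"
    if not (DATE_RANGE[0] <= item.sent <= DATE_RANGE[1]):
        return "date"
    if any(k in item.subject.lower() for k in NOISE_KEYWORDS):
        return "keyword"
    # The domain is whatever follows the last "@" in the sender address.
    if item.sender.rpartition("@")[2].lower() in NOISE_DOMAINS:
        return "domain"
    return None


def cull(items):
    """Split items into a review set and sequestered groups keyed by reason."""
    kept, sequestered = [], {}
    for item in items:
        reason = cull_reason(item)
        if reason is None:
            kept.append(item)
        else:
            sequestered.setdefault(reason, []).append(item)
    return kept, sequestered


items = [
    Item("logo.png", date(2021, 3, 1)),
    Item("old.msg", date(2019, 5, 2), "ceo@client.example", "Pricing"),
    Item("blast.msg", date(2021, 4, 9), "hr@client.example", "Monthly Newsletter"),
    Item("promo.msg", date(2021, 6, 1), "deals@news.example.com", "Save 20%"),
    Item("deal.msg", date(2021, 7, 15), "ceo@client.example", "Pricing terms"),
]
kept, sequestered = cull(items)
print([i.name for i in kept])  # prints: ['deal.msg']
print({reason: len(group) for reason, group in sequestered.items()})
```

Recording the reason for each culled item makes it easier to explain and, if needed, reverse a culling decision.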
Advantages of In-House Data Culling
In-house data culling can offer a multitude of benefits. By performing these tasks in your organization, you can:
- Define and limit the scope of your project
- Isolate relevant subsets of data by performing multiple rounds of data culling
- Control and monitor your budget and reduce costs
However, in order to make the most of these benefits, you have to implement the best data culling strategies. These will improve the way your team handles the data culling process.
Best Practices for In-House Data Culling
The following best practices will help you avoid over-collecting evidence and cull your data as effectively as possible. Let’s get into some quick process tips for lowering eDiscovery costs:
- Use Data Clustering
This popular data classification technique uses an algorithm to group similar data in a dataset. It reveals similarities in the data and also gives you an overall picture of the dataset. You can review datasets by topic and find the key terms related to each topic. This information can be used to create a list of search terms that help you separate relevant data from irrelevant data.
- Generate Lists of Search Terms
Once you know the common terms used in the data set, you can generate a list of search terms. These search terms can be used to separate irrelevant documents from relevant ones.
- Isolate Custodian Data
It is important to have a list of all the key players involved in the case. Each of them must be assigned a custodian ID. This way, you can isolate information pertaining to the custodian and use this information to generate topic clusters and search term lists for culling data. This information can also be used for performing searches on other custodians’ data.
- Isolate Email Domains
The email domain is the part of an email address that follows the “@,” such as “gmail.com,” “yahoo.com,” or “outlook.com.” When culling the data, you can identify the email domains in each data set and exclude recipients and senders with irrelevant email domains. This email culling process will help you remove spam emails and newsletters from your dataset. The same email eDiscovery process can be used for identifying email exchanges with privileged information.
- Perform Quality Control on Search Terms
In order to get a better understanding of your search results and figure out what to do to cull the dataset, you need to run statistics on random samples of data when performing searches. This will also help you identify additional exclusionary search terms that can eliminate irrelevant documents.
- Use In-Place Search Technology
In-place search technology, such as an eDiscovery solution, can help you reduce your dataset more effectively. eDiscovery solutions such as Casepoint enable you to perform comprehensive searches across data sources. The search results will be compiled into a content index that contains valuable information about the data. This information can include where it is stored, how long it has been stored, who has access to this information, and more. With this index, you can determine the relevance of the data for the case.
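The quality-control practice above, running statistics on random samples of search results, can be illustrated with a short Python sketch. It draws a random sample of search hits, counts how many a reviewer marks as relevant, and reports the estimated share with a 95% margin of error. The function name, the hit IDs, and the simulated reviewer are hypothetical, and the margin uses a simple normal approximation, which is only a rough guide for small samples or shares close to 0% or 100%.

```python
import math
import random


def estimate_precision(hit_ids, is_relevant, sample_size, seed=0):
    """Estimate the share of relevant documents among search hits.

    Returns (estimate, margin), where margin is a 95% margin of error
    based on the normal approximation.
    """
    if not hit_ids:
        raise ValueError("no search hits to sample")
    if sample_size < 1:
        raise ValueError("sample_size must be at least 1")
    n = min(sample_size, len(hit_ids))
    # A fixed seed keeps the sample reproducible for later audit.
    sample = random.Random(seed).sample(hit_ids, n)
    relevant = sum(1 for doc_id in sample if is_relevant(doc_id))
    p = relevant / n
    margin = 1.96 * math.sqrt(p * (1 - p) / n)
    return p, margin


# Hypothetical data: 1,000 hits, of which a reviewer would mark 3 in 4 relevant.
hits = list(range(1000))


def reviewer_says_relevant(doc_id):
    """Stand-in for a human reviewer's decision (an assumption of this sketch)."""
    return doc_id % 4 != 0


p, margin = estimate_precision(hits, reviewer_says_relevant, sample_size=200)
print(f"about {p:.0%} of hits are relevant, +/- {margin:.0%}")
```

A low estimate suggests the search term is pulling in too many irrelevant documents, and the irrelevant documents in the sample are a good place to look for exclusionary terms.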
Apart from implementing the best practices for data culling, you also need the right tools to efficiently cull your data and help you avoid any unnecessary costs.
Conclusion
With data volumes exploding and the pressure to reduce costs increasing at the same time, law firms need to focus on culling techniques. Avoiding over-collection and streamlining the eDiscovery process have become more important than ever. Fortunately, there are eDiscovery solutions that can help you manage your data culling process more efficiently.