Text Analysis in the eDiscovery process

What is eDiscovery?

To understand how Text Analysis technology can help in the eDiscovery process, it is important first to understand what eDiscovery is and why it matters in the legal profession. Wikipedia describes legal discovery as “the pre-trial phase in a lawsuit in which each party…can obtain evidence from the opposing party.” eDiscovery is an umbrella term for the discovery process as applied to electronic documents.


Given that the vast majority of information is stored electronically in one form or another, the discovery process requires law firm associates to review text documents, email trails and similar records to determine whether each one is relevant to a particular case (responsive) or not (non-responsive). It is essentially a data reduction and analysis task, which makes it time-consuming and therefore extremely costly.

Given the proliferation of electronic documents in a corporate environment and the sheer mass of e-documents in an organization’s data warehouse, a discovery process may have to consider documents numbering in the millions or tens of millions. It is almost impossible for a human being to go through that many documents with a fine-tooth comb without technological assistance. Natural Language Processing and Machine Learning technologies are therefore well placed to add some smarts and automation to the process in order to save time, reduce human error and lower overall costs.

Text Analysis used in the process

Text Analysis practices can be used as part of an overall eDiscovery process to reduce time, increase accuracy and lower costs. Unsupervised and supervised methods can be used to achieve this goal.

Unsupervised Methods:

Machine Learning practices and the application of Text Analysis as part of the discovery process can help by allowing certain tasks, such as language detection, entity extraction, concept extraction, summarization and classification, to be carried out automatically. The metadata created for individual documents can also be considered across the whole document repository to cluster documents by concept and to uncover duplicate and near-duplicate documents quickly, with little or no heavy lifting.
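To make near-duplicate detection a little more concrete, here is a minimal sketch in Python. It assumes one common approach (word shingles compared with Jaccard similarity), and the function names, the sample memos and the 0.5 threshold are all invented for illustration. It is not a description of how any particular eDiscovery product works.

```python
import re
from itertools import combinations


def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicates(docs, threshold=0.5, k=3):
    """Return (id_a, id_b, similarity) for every pair at or above threshold."""
    sets = {doc_id: shingles(text, k) for doc_id, text in docs.items()}
    pairs = []
    for a, b in combinations(sorted(sets), 2):
        score = jaccard(sets[a], sets[b])
        if score >= threshold:
            pairs.append((a, b, score))
    return pairs


# Hypothetical documents: two memos that differ by one word, one unrelated note.
docs = {
    "memo_1": "Please review the attached supply contract before Friday.",
    "memo_2": "Please review the attached supply contract before Monday.",
    "memo_3": "Lunch is moved to noon tomorrow in the main cafeteria.",
}
pairs = near_duplicates(docs)
print(pairs)
```

Here only memo_1 and memo_2 are flagged, since they share five of their seven distinct three-word shingles. Comparing every pair of documents does not scale to millions of documents; at that size, approximate techniques such as MinHash with locality-sensitive hashing are a typical substitute.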

Additionally, the metadata created can allow topics in documents to be discovered automatically, a process known as topic modelling, and adding a temporal dimension shows how a topic evolves over time. Email threading is a related example: emails that would otherwise be disparate are linked together into a thread so that reviewers can see how the conversation evolved over time.
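As a rough illustration of email threading, the sketch below groups messages by following reply links back to the first message in a conversation. It assumes each email has already been reduced to a dictionary with hypothetical `message_id`, `in_reply_to` and `sent` keys (mirroring the standard Message-ID and In-Reply-To mail headers), and that `sent` is an ISO 8601 timestamp string. Real collections are messier, so treat this as a starting point only.

```python
from collections import defaultdict


def build_threads(emails):
    """Group emails into threads keyed by the id of each thread's root message."""
    by_id = {e["message_id"]: e for e in emails}

    def root_of(message_id):
        seen = set()
        # Walk up the reply chain until a message has no known parent.
        while True:
            parent = by_id[message_id].get("in_reply_to")
            if parent is None or parent not in by_id or parent in seen:
                return message_id
            seen.add(message_id)  # guards against malformed reply loops
            message_id = parent

    threads = defaultdict(list)
    for e in emails:
        threads[root_of(e["message_id"])].append(e)
    # Order each thread chronologically so the conversation reads top to bottom.
    for msgs in threads.values():
        msgs.sort(key=lambda e: e["sent"])
    return dict(threads)


# Hypothetical mailbox, deliberately out of order.
emails = [
    {"message_id": "<c@example.com>", "in_reply_to": "<b@example.com>",
     "sent": "2015-03-03T10:15", "subject": "RE: RE: Supplier pricing"},
    {"message_id": "<a@example.com>", "in_reply_to": None,
     "sent": "2015-03-02T09:00", "subject": "Supplier pricing"},
    {"message_id": "<d@example.com>", "in_reply_to": None,
     "sent": "2015-03-02T12:30", "subject": "Team lunch"},
    {"message_id": "<b@example.com>", "in_reply_to": "<a@example.com>",
     "sent": "2015-03-02T16:45", "subject": "RE: Supplier pricing"},
]
threads = build_threads(emails)
for root, msgs in threads.items():
    print(root, [m["subject"] for m in msgs])
```

A reply whose parent is missing from the collection simply becomes the root of its own thread, which is a deliberate simplification in this sketch.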

Supervised Methods:

While unsupervised methods are useful in the eDiscovery process, they will most likely never entirely replace the human aspect of discovery, and for the most part they don’t aim to. They are better seen as a very smart and efficient aid to the process.

Major benefits are realized when predictive coding is combined with human review, a process known as Technology Assisted Review, or TAR. A sample set of documents is analyzed, usually by a senior attorney, and scored for responsiveness to the discovery requests in the case. eDiscovery software then applies mathematical algorithms and machine learning techniques to analyze the rest of the documents automatically and score them for relevance based on what it “learns” from the attorney-scored sample.
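The idea of learning from an attorney-scored sample can be sketched with a very small text classifier. The example below assumes a multinomial Naive Bayes model with add-one smoothing, chosen only because it fits in a few lines of standard-library Python. The class name, the sample documents and the labels are all made up, and production TAR tools may use quite different algorithms.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayesScorer:
    """Two-class multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        # labels are booleans: True = responsive, False = non-responsive.
        self.word_counts = {True: Counter(), False: Counter()}
        self.doc_counts = Counter(labels)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = set(self.word_counts[True]) | set(self.word_counts[False])
        return self

    def _log_likelihood(self, tokens, label):
        counts = self.word_counts[label]
        total = sum(counts.values()) + len(self.vocab)
        prior = math.log(self.doc_counts[label] / sum(self.doc_counts.values()))
        # Words never seen in the reviewed sample are ignored.
        return prior + sum(
            math.log((counts[t] + 1) / total) for t in tokens if t in self.vocab
        )

    def score(self, text):
        """Return the estimated probability that text is responsive."""
        tokens = tokenize(text)
        pos = self._log_likelihood(tokens, True)
        neg = self._log_likelihood(tokens, False)
        # Clamp the exponent so very long documents cannot overflow exp().
        return 1 / (1 + math.exp(min(neg - pos, 700)))


# Hypothetical sample scored by a reviewer.
sample_texts = [
    "pricing agreement with the supplier for the merger",
    "supplier contract pricing terms attached",
    "team lunch on friday at noon",
    "office printer is out of toner again",
]
sample_labels = [True, True, False, False]

scorer = NaiveBayesScorer().fit(sample_texts, sample_labels)
for doc in ["revised supplier pricing for the contract", "friday lunch menu"]:
    print(round(scorer.score(doc), 3), doc)
```

The first unreviewed document scores well above 0.5 and the second well below it. With only four training documents the scores mean very little, of course; the point is the shape of the workflow: a human labels a sample, the model scores everything else.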

Scores generated through predictive coding can be used to automatically cull large numbers of documents from consideration without the need for human review.
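Culling by score is simple to express, but the cut-off deserves checking against human judgments. The sketch below assumes relevance scores between 0 and 1 and a small, hypothetical validation sample labelled by a reviewer. The function names and the 0.5 threshold are illustrative choices, not recommended settings.

```python
def split_by_score(scores, threshold):
    """Partition document ids into (for_review, culled) by relevance score."""
    for_review = sorted(d for d, s in scores.items() if s >= threshold)
    culled = sorted(d for d, s in scores.items() if s < threshold)
    return for_review, culled


def recall_on_sample(scores, human_labels, threshold):
    """Share of human-confirmed responsive sample documents that survive the cut."""
    responsive = [d for d, is_resp in human_labels.items() if is_resp]
    if not responsive:
        return None  # nothing to measure against
    kept = sum(1 for d in responsive if scores[d] >= threshold)
    return kept / len(responsive)


# Hypothetical scores from a predictive coding model.
scores = {"doc1": 0.92, "doc2": 0.08, "doc3": 0.55, "doc4": 0.03, "doc5": 0.30}
# Hypothetical reviewer decisions on a validation sample (True = responsive).
human_labels = {"doc1": True, "doc3": True, "doc4": False, "doc5": True}

for_review, culled = split_by_score(scores, threshold=0.5)
recall = recall_on_sample(scores, human_labels, threshold=0.5)
print(for_review, culled, recall)
```

In this toy data the 0.5 cut-off culls doc5, which the reviewer marked responsive, so only two of the three responsive sample documents survive. Measuring recall on a sample like this is one way to decide whether a threshold is too aggressive before documents are dropped from review.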


In recent years, the adoption of natural language processing and machine learning technologies in the eDiscovery process has been on the rise, mainly because they aid knowledge discovery, save time and reduce costs.

Knowledge Discovery:

The sheer volume of documents and data to review in an eDiscovery process is overwhelming for a team of legal professionals who might be searching for a specific line of text among millions of documents. Sometimes they may not even know what they are looking for. Incorporating advanced and specialized technology into the search and discovery process helps ensure no page is left unturned.


Time Saving:

In most cases eDiscovery projects are time-bound, and teams work day and night to meet important deadlines. With limited time and huge volumes of data and text to get through, eDiscovery teams are often fighting an uphill battle. Technology can process large amounts of data in a fraction of the time it would take a team of legal professionals.


Cost Reduction:

The number of data sources to be analyzed and the size of the legal teams involved mean an eDiscovery project can often prove quite costly. Introducing Text Analysis into the eDiscovery process means that both the time it takes and the number of professionals needed on an eDiscovery team can be greatly reduced, which in turn reduces the cost of the overall project.


It seems technology will never fully replace the role of a legal expert in the eDiscovery process, but as machines and software get smarter, the role of technology in the entire process is only going to grow.
