When you upload a case that contains email data, the same content often appears several times because each reply quotes the earlier messages in the conversation. Phaselaw’s “exclude redundant emails” feature identifies messages whose content is fully contained in another email and removes them from your review set, so you review each piece of content once without missing any information.

What problem is being solved?

Email threads build on previous messages, so the same text appears again and again. If these duplicates are not filtered out, document counts are inflated and reviewers spend time rereading the same information, which slows down the review.

Redundancy analysis helps answer the question: “Is the content of one email fully contained within another?” If so, only the more complete email needs to be reviewed.

How it works

The analysis runs in three steps that identify and exclude redundant messages while preserving all unique content.

Step 1: Prepare the email content

Each email is cleaned and standardized by removing signatures, headers, and unnecessary formatting. This ensures that only meaningful content is compared.

  1. Signature extraction — Signatures are identified and removed. Small variations in how signatures are presented across different email providers are a common source of false positives.

  2. Header removal — Forwarding and reply headers are stripped out to focus on the actual message content.

  3. Whitespace collapse and lowercasing — Text is simplified to ignore case and spacing differences.
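
The three preparation steps above can be sketched in a few lines of Python. This is a minimal illustration, not Phaselaw’s implementation: the function name normalize_email and both regular expressions are hypothetical, and real signature and header detection needs to handle far more variants than the handful of patterns assumed here.

```python
import re

# Hypothetical patterns for illustration only; production rules would be much broader.
SIGNATURE_START = re.compile(
    r"^(--|(best |kind )?regards,?|sent from my .+)$", re.IGNORECASE
)
HEADER_LINE = re.compile(
    r"^(on .+ wrote:|(from|sent|to|cc|subject|date):.*"
    r"|-+ ?(original|forwarded) message ?-+)$",
    re.IGNORECASE,
)


def normalize_email(body: str) -> str:
    """Strip signatures and reply/forward headers, then collapse whitespace and lowercase."""
    kept = []
    in_signature = False
    for raw in body.splitlines():
        # Drop quote markers ("> ") so quoted text compares equal to the original.
        line = raw.lstrip("> \t").strip()
        if HEADER_LINE.match(line):
            in_signature = False
            continue
        if SIGNATURE_START.match(line):
            in_signature = True
            continue
        if in_signature:
            # Assumption: a signature block ends at the next blank line.
            if not line:
                in_signature = False
            continue
        kept.append(line)
    # Whitespace collapse and lowercasing.
    return re.sub(r"\s+", " ", " ".join(kept)).strip().lower()
```

With this normalization, a quoted copy of a message inside a reply reduces to the same text as the original message, which is what makes the comparison in the next step meaningful.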

Step 2: Identify overlapping content

The system breaks down each email into small, overlapping segments of text so that content can be compared between messages efficiently. When one email’s content is fully included in another, the contained email is marked as redundant and excluded from review.
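
One common way to build such overlapping segments is word-level shingling, where each segment is a run of k consecutive words. The sketch below assumes that technique, along with hypothetical names (shingles, containment, is_redundant), a segment size of three words, and a threshold of 1.0 for “fully included”. The article does not state which segmentation or threshold Phaselaw actually uses.

```python
def shingles(text: str, k: int = 3) -> set:
    """Return the set of overlapping k-word segments in already-normalized text."""
    words = text.split()
    if len(words) < k:
        # Very short texts form a single segment (or none if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def containment(candidate: str, container: str, k: int = 3) -> float:
    """Fraction of the candidate's segments that also appear in the container."""
    candidate_parts = shingles(candidate, k)
    if not candidate_parts:
        # Assumption: empty emails are never flagged as redundant in this sketch.
        return 0.0
    return len(candidate_parts & shingles(container, k)) / len(candidate_parts)


def is_redundant(candidate: str, container: str, threshold: float = 1.0, k: int = 3) -> bool:
    """True when the candidate's content is (at the given threshold) covered by the container."""
    return containment(candidate, container, k) >= threshold
```

Note that containment is asymmetric: an original message can be fully contained in a reply that quotes it, while the reply is only partly contained in the original because it adds new text.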

Step 3: Thread-level analysis

The analysis is performed within each email thread, starting from the most recent message and moving backward. Because later messages usually quote earlier ones, this order quickly identifies which emails contain others, so only the most comprehensive messages are included in your review set. The result is a set of “root” emails that together represent all unique content in the thread, while redundant messages are automatically excluded.
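
The backward pass over a thread can be illustrated as follows. This is a sketch under stated assumptions rather than the actual algorithm: the Email class, its order field, the find_roots function, and the use of three-word segments with a strict subset test are all hypothetical choices made for the example.

```python
from dataclasses import dataclass


@dataclass
class Email:
    id: str
    order: int  # position in the thread; a higher value means more recent
    text: str   # already-normalized content


def shingles(text: str, k: int = 3) -> set:
    """Overlapping k-word segments of the text."""
    words = text.split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def find_roots(thread: list, k: int = 3):
    """Walk the thread from newest to oldest, keeping only emails not contained in a root."""
    roots = []      # (email, segments) pairs kept for review
    redundant = {}  # id of excluded email -> id of the root that contains it
    for email in sorted(thread, key=lambda e: e.order, reverse=True):
        parts = shingles(email.text, k)
        # An email is redundant if every one of its segments appears in some root.
        # Empty emails have no segments and are kept as roots in this sketch.
        container = next(
            (root for root, root_parts in roots if parts and parts <= root_parts),
            None,
        )
        if container is None:
            roots.append((email, parts))
        else:
            redundant[email.id] = container.id
    return [root.id for root, _ in roots], redundant
```

When a thread branches, for example when two people reply separately to the same message, both replies become roots because neither contains the other, while the original message is excluded because each reply quotes it in full.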

Although the analysis is highly accurate, like any statistical system it can occasionally make errors that cause a redundant email to be classified as non-redundant. If you encounter such a case, please let us know so we can investigate further.

What the algorithm does not do