How Near-Duplicate Detection Works

Last updated: April 27, 2026

When working with large document sets, you may encounter files that are almost identical but not exact copies — for example, the same document saved with minor edits, or the same email with slight differences in metadata. Standard deduplication won't catch these. Near-duplicate detection is designed to clean them up.

This is an experimental feature. To enable it, click your name or avatar at the bottom of the sidebar → Experiments → toggle on Near-Duplicate Detection → refresh the page.

What problem is being solved?

Standard deduplication identifies exact copies of files. But in large exports, you often encounter files that are functionally the same — the same content with minor differences in formatting, metadata, or headers — that wouldn't be caught by a bit-by-bit comparison.

Near-duplicate detection identifies these files and gives you the option to exclude them from review, reducing your document set further before you begin detailed review and redaction.

How it works

Standard deduplication uses hashing (for documents) and Message IDs (for emails) to identify exact copies. Near-duplicate detection goes a step further — it analyses the body content of your files to identify those that are functionally the same but wouldn't be caught by exact matching. This includes:

Loose files with the same content but minor differences in formatting or metadata
Unthreaded emails with the same body content but different headers or timestamps
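To illustrate the difference between exact matching and body-content matching, here is a minimal Python sketch. It is not Phaselaw's implementation, which this article does not describe. The normalisation rule (lowercasing and collapsing whitespace) and the function names `exact_hash`, `normalise_body`, and `body_hash` are assumptions chosen only for illustration.

```python
import hashlib
import re


def exact_hash(raw: bytes) -> str:
    """Hash the raw bytes: any difference at all changes the result."""
    return hashlib.sha256(raw).hexdigest()


def normalise_body(text: str) -> str:
    """Hypothetical normalisation: collapse whitespace and lowercase."""
    return re.sub(r"\s+", " ", text).strip().lower()


def body_hash(text: str) -> str:
    """Hash the normalised body so formatting-only differences disappear."""
    return hashlib.sha256(normalise_body(text).encode("utf-8")).hexdigest()


# Two files with the same content but different spacing and line breaks.
doc_a = "Quarterly report\n\nRevenue rose 4% in Q3."
doc_b = "Quarterly  report\nRevenue rose 4%  in Q3.  "

exact_match = exact_hash(doc_a.encode("utf-8")) == exact_hash(doc_b.encode("utf-8"))
near_match = body_hash(doc_a) == body_hash(doc_b)

print(exact_match)  # False: the raw bytes differ
print(near_match)   # True: the normalised bodies are identical
```

A real near-duplicate detector would likely tolerate small wording changes as well, which a hash of the normalised text cannot do. The sketch only shows why exact matching misses files that differ in formatting alone.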

When Phaselaw identifies a set of near-duplicates, it designates one file as the authoritative version and marks the others as out of scope, which excludes them from your review set.
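The idea of keeping one authoritative file and excluding the rest can be sketched as follows. This is an illustration under stated assumptions, not Phaselaw's actual logic: the similarity measure (Python's `difflib.SequenceMatcher`), the `SIMILARITY_THRESHOLD` value, and the rule that the first file seen becomes the authoritative one are all hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical cut-off: bodies at least 90% similar count as near-duplicates.
SIMILARITY_THRESHOLD = 0.9


def similarity(a: str, b: str) -> float:
    """Return a similarity ratio between 0.0 and 1.0 for two bodies."""
    return SequenceMatcher(None, a, b).ratio()


def exclude_near_duplicates(bodies: dict[str, str]) -> dict[str, str]:
    """Map each excluded file name to the authoritative file it duplicates.

    Assumption: the first file seen in a group is treated as authoritative.
    """
    authoritative: list[str] = []
    excluded: dict[str, str] = {}
    for name, body in bodies.items():
        for kept in authoritative:
            if similarity(body, bodies[kept]) >= SIMILARITY_THRESHOLD:
                excluded[name] = kept  # near-duplicate of a kept file
                break
        else:
            authoritative.append(name)  # no match, so keep it in scope
    return excluded


bodies = {
    "memo_v1.txt": "Please review the attached contract before Friday.",
    "memo_v2.txt": "Please review the attached contract before Friday!",
    "invoice.txt": "Invoice 1042: payment due within 30 days.",
}

result = exclude_near_duplicates(bodies)
print(result)  # {'memo_v2.txt': 'memo_v1.txt'}
```

Comparing every file against every kept file is quadratic, so this approach would be too slow for a large export. It is meant only to show the outcome: one file per group stays in scope and the rest are excluded.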

How to run it

From inside your case, click the Case Tasks icon in the top toolbar
Under New Task, click Exclude Near Duplicates
Phaselaw will analyse your document set and mark any near-duplicate files as out of scope

💡 You can see completed tasks and undo them from the Recent Tasks section of the same panel.