Remove Duplicate Files

Duplicate File Detective

Saturday, November 24, 2007

Finding Duplicate Files with Content Hashing

Duplicate File Detective lets you specify precisely how duplicate file matches are identified. Each combination of file matching options has its own trade-offs. For example, matching files solely by name and size is very quick, but it misses files that have identical contents under different names, and it may incorrectly flag files that share a name and size but differ in content.

For this reason, extending duplicate file searches beyond a quick exploratory run will usually require comparison of file contents. In Duplicate File Detective, file contents can be matched with or without regard to other file attributes such as name or modification date.

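Duplicate File Detective compares file contents through a process known as file hashing. File hashing is the generation of a much smaller, fixed-size key value (a "digital fingerprint") from the contents of a given file. Identical contents always produce the same fingerprint, while different contents produce different fingerprints with high probability, though not with certainty.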
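As a rough illustration of the idea (not Duplicate File Detective's own code), the following Python sketch computes a fingerprint for a file by reading it in chunks. The function name file_digest and its parameters are hypothetical, chosen only for this example.

```python
import hashlib


def file_digest(path, algorithm="md5", chunk_size=1 << 20):
    """Return the hex digest of a file's contents.

    Hypothetical helper for illustration: the file is read in
    fixed-size chunks so large files never sit in memory at once.
    """
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()
```

Two files whose digests differ cannot have the same contents. Two files whose digests match almost certainly do, subject to the hash size.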

Duplicate File Detective can hash file contents using a variety of algorithms, including the CRC32 and ADLER32 checksums and the MD5 and SHA1 cryptographic hash functions. The first two generate 32-bit file hashes, while the latter two generate 128-bit and 160-bit file hashes, respectively. Generally speaking, the larger the generated hash value, the less likely it is that two different files share a hash, and so the more accurate the file content comparison. In fact, the chance that two different files could accidentally produce the same 128-bit or 160-bit digital fingerprint is vanishingly small.
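To put numbers on "incredibly small", the sketch below reports the hash widths mentioned above and estimates the chance of at least one accidental collision among a set of files. It uses the standard birthday approximation and assumes hash values are spread uniformly over non-adversarial input. That is an assumption of this example, not a claim about the product. The function names are hypothetical.

```python
import hashlib
import math


def hash_widths():
    """Bit widths of the four algorithms named in the text."""
    return {
        "CRC32": 32,    # zlib.crc32 returns an unsigned 32-bit integer
        "ADLER32": 32,  # zlib.adler32 returns an unsigned 32-bit integer
        "MD5": hashlib.md5().digest_size * 8,
        "SHA1": hashlib.sha1().digest_size * 8,
    }


def collision_probability(file_count, bits):
    """Approximate chance that any two of file_count distinct files
    share a hash, assuming uniformly distributed bits-wide hashes."""
    pairs = file_count * (file_count - 1) / 2
    # 1 - exp(-pairs / 2**bits), computed stably for tiny results.
    return -math.expm1(-pairs / 2 ** bits)


if __name__ == "__main__":
    for name, bits in hash_widths().items():
        print(name, bits, collision_probability(1_000_000, bits))
```

Under these assumptions, a collection of one million distinct files is all but certain to contain some 32-bit collision, while an accidental 128-bit collision remains negligible. Deliberately constructed collisions are a separate matter: practical attacks exist for both MD5 and SHA1, which is another reason a final byte-level check can be worthwhile.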

Duplicate File Detective allows you to mix and match file content matching techniques to suit your specific requirements, but the built-in Project Wizard creates a reasonable set of defaults based upon your general objectives:

  • Quick duplicate file scan - This project type matches duplicate files by name and size alone. It executes very quickly because no file contents are analyzed, and provides a fair degree of accuracy.
  • Checksum duplicate file scan - Matches files by name, size, and CRC content hash. Also fairly quick, but works to ensure that files with the same name and size also have the same contents.
  • Strong checksum duplicate file scan - Matches files by contents alone (via strong MD5 hashing). Isolates duplicate files regardless of other attributes (such as name or modification date).
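The three project types above differ only in which attributes make up the match key. The following Python sketch groups files by such a key. It is an illustration under that assumption, not the Project Wizard's actual logic, and the mode names "quick", "checksum", and "strong" are labels invented for this example.

```python
import hashlib
import os
import zlib
from collections import defaultdict


def _chunks(path, chunk_size=1 << 20):
    """Yield a file's contents in fixed-size chunks."""
    with open(path, "rb") as handle:
        yield from iter(lambda: handle.read(chunk_size), b"")


def _crc32(path):
    value = 0
    for chunk in _chunks(path):
        value = zlib.crc32(chunk, value)  # running checksum
    return value


def _md5(path):
    hasher = hashlib.md5()
    for chunk in _chunks(path):
        hasher.update(chunk)
    return hasher.hexdigest()


def _name_size(path):
    return (os.path.basename(path), os.path.getsize(path))


# Hypothetical mode names mapped to match-key functions.
KEY_FUNCTIONS = {
    "quick": _name_size,                                # name + size
    "checksum": lambda p: _name_size(p) + (_crc32(p),),  # + CRC32
    "strong": lambda p: (_md5(p),),                     # contents only
}


def find_duplicates(paths, mode="quick"):
    """Return lists of paths that share the same match key."""
    key_for = KEY_FUNCTIONS[mode]
    groups = defaultdict(list)
    for path in paths:
        groups[key_for(path)].append(path)
    return [group for group in groups.values() if len(group) > 1]
```

A practical scanner would likely group by size first and hash only files that share a size, since files of different sizes cannot be duplicates. The sketch skips that optimization to keep the three modes easy to compare.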

In Duplicate File Detective v2.x, file content hashing can be further reinforced with byte-for-byte content matching. Byte-for-byte matching provides the final degree of confidence that two candidate duplicates are truly identical at the byte level, because it rules out the rare case of two different files sharing a hash.
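A byte-for-byte check can be sketched in a few lines of Python. The function files_identical is a hypothetical name used for illustration and does not reflect how Duplicate File Detective implements the feature.

```python
import os


def files_identical(path_a, path_b, chunk_size=1 << 20):
    """Return True only if both files hold exactly the same bytes."""
    # Files of different sizes cannot match, so skip reading them.
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    with open(path_a, "rb") as file_a, open(path_b, "rb") as file_b:
        while True:
            chunk_a = file_a.read(chunk_size)
            chunk_b = file_b.read(chunk_size)
            if chunk_a != chunk_b:
                return False  # first differing chunk ends the comparison
            if not chunk_a:
                return True   # both files ended without a difference
```

Python's standard library offers the same kind of check as filecmp.cmp(path_a, path_b, shallow=False). Because the comparison reads both files in full, it makes sense to run it only on pairs that already share a hash.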