Find duplicates

By comparing the metadata and file properties of documents, this tool groups similar documents together. PDFs with the same number of pages, the same creation date and the same file size are obvious duplicate candidates. Other file and metadata properties may apply in different scenarios.
This narrows down the selection. Combined with the option to fine-tune the grouping with a CRC (Cyclic Redundancy Check) comparison, and with manual side-by-side viewing, it makes it easy to determine whether two documents are duplicates.
To handle each group of duplicates and select the files to exclude, there is also a dynamic select function, which selects all the files that match custom-defined criteria, except for the file we want to keep.

Find duplicates tool screenshot

The tool has options to delete, copy or move files, which make it easy to manage the duplicates found.

The operation begins by starting the tool with the list of PDFs we want to compare. The list can include folders, which are scanned automatically, sub-folders included, for PDF documents.
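
The way a mixed list of files and folders expands into a flat list of PDFs can be pictured with a small Python sketch. It is only an illustration, not the tool's own code: the function name `collect_pdfs` and the case-insensitive `.pdf` extension check are assumptions.

```python
from pathlib import Path


def collect_pdfs(inputs):
    """Expand a mixed list of files and folders into a sorted list of PDF paths.

    Hypothetical helper: folders are scanned recursively, sub-folders included.
    """
    found = set()
    for item in inputs:
        path = Path(item)
        if path.is_dir():
            # rglob("*") walks the folder and every sub-folder.
            for child in path.rglob("*"):
                if child.is_file() and child.suffix.lower() == ".pdf":
                    found.add(child.resolve())
        elif path.is_file() and path.suffix.lower() == ".pdf":
            found.add(path.resolve())
    return sorted(found)
```

Using a set means a PDF passed both directly and through its parent folder is only listed once.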

After the metadata and file properties of all the submitted documents are collected, the tool uses the default, or last used, list of properties to group the documents by equality of those properties. No groups are shown if no equality is found, but the list of files remains loaded internally. We can always change the list of properties used in the comparison to try other possibilities.
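
Grouping by equality of properties can be sketched in a few lines of Python. This is a conceptual illustration under stated assumptions, not the tool's implementation: the function `group_by_properties`, the dictionary layout and the property names (`pages`, `size`, `created`) are all hypothetical.

```python
from collections import defaultdict


def group_by_properties(records, properties):
    """Group records whose values are equal for every property in `properties`.

    `records` is a list of dicts (one per document); only groups with two or
    more members are returned, since a single file has no duplicate.
    """
    buckets = defaultdict(list)
    for record in records:
        key = tuple(record.get(name) for name in properties)
        buckets[key].append(record)
    return [group for group in buckets.values() if len(group) > 1]


# Hypothetical sample data; the property names are illustrative only.
documents = [
    {"path": "a.pdf", "pages": 10, "size": 2048, "created": "2020-01-05"},
    {"path": "b.pdf", "pages": 10, "size": 2048, "created": "2020-01-05"},
    {"path": "c.pdf", "pages": 10, "size": 4096, "created": "2020-01-05"},
]

strict = group_by_properties(documents, ["pages", "size", "created"])
loose = group_by_properties(documents, ["pages", "created"])
```

With the strict property list only `a.pdf` and `b.pdf` fall into one group; dropping `size` from the list also pulls in `c.pdf`. The full list of documents stays available either way, which is why changing the property list and comparing again is cheap.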

The list of properties to compare is composed using the related buttons in the top toolbar. The left button of the toolbar collects the last used comparisons for easy reuse; the plus (+) button adds more items; each item button has options to change it to another property or to remove it from the comparison. Under this menu there is an item named scripts that provides access to custom-defined script functions. Script functions are created using the built-in script editor, started from the manage scripts item. A script function should be created following the same rules as those used by the rename tool scripting functionality, and should return a string that will be compared against the corresponding values of all the other files. This can be, for example, the checksum of the PDF text, a value representing the page sizes, etc.
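
The idea of a script function that returns a comparison string can be illustrated as follows. This is a Python sketch of the concept only; real script functions are written in the tool's built-in script editor and follow its own rules. The function names, the whitespace normalisation and the choice of SHA-1 are assumptions made for the example.

```python
import hashlib


def text_checksum_key(text):
    """Return a string key: SHA-1 of the text with whitespace normalised.

    Assumption: collapsing whitespace makes the key tolerant of layout-only
    differences in the extracted text.
    """
    normalised = " ".join(text.split())
    return hashlib.sha1(normalised.encode("utf-8")).hexdigest()


def page_sizes_key(page_sizes):
    """Return a string key describing page sizes, e.g. '595x842;612x792'.

    `page_sizes` is a hypothetical list of (width, height) pairs.
    """
    return ";".join(f"{round(w)}x{round(h)}" for w, h in page_sizes)
```

Two documents land in the same group when their keys are equal strings, so the key should capture exactly what "the same" means for the scenario at hand.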

The Apply button runs a new comparison using the current list of comparison items.

The left toolbar provides access to the manage duplicates functionality:
- The CRC button compares the CRC32 checksums of the files in each of the selected groups and excludes from each group all the files that don't share the same checksum. Useful when searching for exact file duplicates. Excluded files are not removed from the internal list of files, so further compare operations will show them again if they match once more.

- The dynamic select function selects all the files, in each of the selected groups, that match the selector criteria, e.g. select all the files with the newest modification date, select all the files with a short file name, etc. This makes it easy to select the duplicates we want to delete. The selectors are scripts, and we can create new custom selector scripts at any time to best fit our duplicate exclusion scenario.
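
The two operations in the list above can be sketched in Python. This is a rough illustration rather than the tool's code: the function names are hypothetical, and two behaviours are assumptions, namely that a file whose checksum matches no other file in its group is dropped, and that a selector keeps exactly one file per group.

```python
import zlib
from collections import defaultdict


def crc32_of_file(path, chunk_size=65536):
    """Compute the CRC32 checksum of a file, reading it in chunks."""
    crc = 0
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            crc = zlib.crc32(chunk, crc)
    return crc & 0xFFFFFFFF


def refine_group_by_crc(paths):
    """Split one group into sub-groups of files with the same CRC32.

    Assumption: files whose checksum is shared with no other file are left out.
    """
    by_crc = defaultdict(list)
    for path in paths:
        by_crc[crc32_of_file(path)].append(path)
    return [members for members in by_crc.values() if len(members) > 1]


def select_all_but_one(group, keep_key):
    """Select every file in the group except the one ranked highest by keep_key."""
    keeper = max(group, key=keep_key)
    return [path for path in group if path != keeper]
```

For example, `select_all_but_one(group, keep_key=os.path.getmtime)` (after `import os`) would keep the most recently modified file and select the rest for deletion. An equal CRC32 is strong evidence of identical content but not proof, so a byte-by-byte check is a reasonable extra step before deleting anything important.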

The delete and move operations remove the affected files from the internal list of documents, so further compare operations will not include them.

The export to CSV button exports the displayed duplicate groups to a CSV file, so the results can be processed by external applications.
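
As an illustration of what such an export could look like, the Python sketch below writes one row per file, tagged with its group number. The column layout (`group`, `path`) and the function name are assumptions; the tool's actual CSV columns may differ.

```python
import csv
import io


def export_groups_to_csv(groups, stream):
    """Write one row per file, tagged with the number of its group."""
    writer = csv.writer(stream)
    writer.writerow(["group", "path"])
    for index, group in enumerate(groups, start=1):
        for path in group:
            writer.writerow([index, path])


# Hypothetical groups, written to an in-memory buffer instead of a file.
buffer = io.StringIO()
export_groups_to_csv([["a.pdf", "b.pdf"], ["c.pdf", "d.pdf"]], buffer)
csv_text = buffer.getvalue()
```

A flat layout with an explicit group column is easy to filter or pivot in a spreadsheet or another script.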

(c) 2006-2021 RTT