How to find similar images?

Introduction
Anyone who downloads images from the internet, or who works a lot with digital photos, will know that after a while there are probably many different versions of the same image around.

As long as these images are exact duplicates, they can be found very quickly by checking the file length first and, where the lengths are identical, comparing the contents of the files. This is what ABC-View does when searching for duplicate files.
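
The two-stage check described above can be sketched in a few lines of Python. This is only an illustration of the idea (file length first, contents second), not ABC-View's actual code; the function name find_exact_duplicates is made up for this example.

```python
import filecmp
import os
from collections import defaultdict


def find_exact_duplicates(paths):
    """Return groups (lists) of paths whose contents are byte-for-byte equal."""
    # Stage 1: group files by length; this is cheap and needs no file reads.
    by_size = defaultdict(list)
    for path in paths:
        by_size[os.path.getsize(path)].append(path)

    groups = []
    for same_size in by_size.values():
        # A file with a unique length cannot have an exact duplicate.
        if len(same_size) < 2:
            continue
        # Stage 2: compare contents, but only among files of equal length.
        buckets = []  # each bucket holds files with identical contents
        for path in same_size:
            for bucket in buckets:
                # shallow=False forces a real comparison of the file contents.
                if filecmp.cmp(bucket[0], path, shallow=False):
                    bucket.append(path)
                    break
            else:
                buckets.append([path])
        groups.extend(b for b in buckets if len(b) > 1)
    return groups
```

Because most files differ in length, the expensive content comparison is only needed for a small fraction of the collection.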

It gets more complicated if the images are not exact duplicates. Sometimes, images get resized or perhaps saved with a different amount of compression. Some images even get altered slightly, get frames around them or text or logos added to them.

For automated software, finding these similar images suddenly becomes much more difficult. The files have to be compared by actually comparing the pixels in the images.

How ABC-View Manager (ABCVM) does it

In order to be able to compare one image to another, ABCVM first creates a property containing the image characteristics, called image metrics. The image metrics are created for each image in the collection that needs to be checked. This process is called indexing.

The image metrics are constructed so that:

  • Similar but resized images can be found
  • The difference between two images can be expressed as a percentage
  • The user can define the amount of detail that should be present in the comparison
  • The user can opt to just compare intensity, or compare each of the color channels red, green and blue
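
To make these four properties concrete, here is a minimal Python sketch of one way such metrics could be built. It is an assumption for illustration only, not ABCVM's actual algorithm: the image is reduced to a small grid of averaged cells (so resized copies give nearly the same values), the grid size plays the role of the detail setting, and the difference is reported as a percentage of the full 0..255 range. The names image_metrics and difference_percent are hypothetical, and images are plain lists of (r, g, b) tuples to keep the example free of external libraries.

```python
def image_metrics(pixels, detail=4, use_color=False):
    """Reduce an image to detail x detail averaged cells.

    pixels: list of rows, each row a list of (r, g, b) tuples in 0..255.
    The image is assumed to be at least detail x detail pixels.
    Returns one intensity value per cell, or three values (r, g, b)
    per cell when use_color is True.
    """
    height, width = len(pixels), len(pixels[0])
    metrics = []
    for cy in range(detail):
        y0, y1 = cy * height // detail, (cy + 1) * height // detail
        for cx in range(detail):
            x0, x1 = cx * width // detail, (cx + 1) * width // detail
            cell = [pixels[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            n = len(cell)
            sums = [sum(p[ch] for p in cell) for ch in range(3)]
            if use_color:
                metrics.extend(s / n for s in sums)   # average per channel
            else:
                metrics.append(sum(sums) / (3 * n))   # average intensity
    return metrics


def difference_percent(a, b):
    """Mean absolute difference of two metric lists, as % of full scale."""
    return 100.0 * sum(abs(x - y) for x, y in zip(a, b)) / (255.0 * len(a))
```

In this sketch, an image and an enlarged copy of it produce the same cell averages, so their difference is 0%, while an all-black and an all-white image differ by 100%.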

The image metrics are kept in the database. ABCVM then uses a special process to find images that are similar: the collection can be sorted by similarity, or it can be filtered so that only similar images are shown, grouped in color bands.

Do the work at the right time

The user can decide when the computer is going to do the “dirty work”: the indexing process can be lengthy for large collections. ABCVM offers two approaches:

  1. Start the indexing after adding new files to the collection. The indexing process is done in the background and once completed, it allows the user to do a similarity search quickly.
    Note that you can save a collection and when you load it again, you don’t need to re-index! All image metrics are stored in the database.
    Options -> Browser -> Background processes -> [x] Precalculate image metrics for new files
  2. If you just want to do a quick check of some files or want to only calculate the image metrics for the files you select, you can do the indexing “on the fly”. Whenever a similarity search is started, the files that are not indexed will be indexed on the spot.
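
Both approaches come down to the same rule: metrics are computed once per file and reused afterwards. The Python sketch below illustrates that rule under simple assumptions: a dictionary stands in for the database, and ensure_indexed and compute_metrics are hypothetical names, not part of ABCVM.

```python
def ensure_indexed(files, store, compute_metrics):
    """Index only the files that have no stored metrics yet.

    files: iterable of file names
    store: dict mapping file name -> metrics (stand-in for the database)
    compute_metrics: function that takes a file name and returns its metrics
    Returns the number of files that had to be indexed.
    """
    indexed = 0
    for name in files:
        if name not in store:        # already indexed files are skipped
            store[name] = compute_metrics(name)
            indexed += 1
    return indexed
```

Calling such a function in the background after adding files corresponds to approach 1; calling it at the start of a similarity search, for just the selected files, corresponds to approach 2.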

Add the files you want to check
Before you can use ABC-View for similarity checking you will need to add a list of files to it.

Create a list of similar images
Click on the list in which you want to find similar images, for instance, All Items. Right-click on it and select Add Filter -> Find similar images. You will get this dialog:

Tolerance

The tolerance setting specifies the maximum allowed difference between two pictures that are still considered similar. A higher tolerance will yield more pictures in the list of similar images. The smallest tolerance setting (0.3%) will only yield images that are virtually identical.

Note that a higher tolerance setting will slow down the process of finding similar images (more images must be compared).
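
The effect of the tolerance can be illustrated with a small Python sketch. It assumes each image has already been reduced to a list of numbers in the range 0..255 (the image metrics) and that the difference is the mean absolute difference expressed as a percentage; both are assumptions for this example, and group_similar is a hypothetical name. The sketch simply compares every pair, so it does not reproduce the speed behavior of ABCVM's own search.

```python
from itertools import combinations


def group_similar(metrics, tolerance_percent):
    """Group images whose difference is within the tolerance.

    metrics: dict mapping image name -> list of numbers in 0..255
    Returns a sorted list of groups; images without a match are left out.
    """
    parent = {name: name for name in metrics}

    def find(x):
        # Follow parent links to the representative of x's group.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def diff(a, b):
        return 100.0 * sum(abs(x - y) for x, y in zip(a, b)) / (255.0 * len(a))

    for a, b in combinations(metrics, 2):
        if diff(metrics[a], metrics[b]) <= tolerance_percent:
            parent[find(a)] = find(b)    # merge the two groups

    groups = {}
    for name in metrics:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(g) for g in groups.values() if len(g) > 1)
```

With a tolerance of 0.3% only near-identical images end up in one group; raising the tolerance merges more images into the same group.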

The detail matching setting influences the number of false positives and false negatives that are found. False positives are images that are not identical but show up in the list; false negatives are images that are identical but do not show up in the list. The following table gives an overview:

Detail Matching   False positives & negatives at tolerance
                  0.3%    0.5%    0.8%    1.5%    2.5%    5.0%    7.5%    10.0%
Low               0%      0.8%    0.5%    1.0%    2.6%    13.5%   32.0%   43.5%
Medium            0%      0.5%    0.1%    0.5%    0.5%    3.7%    8.5%    13.8%
High              0%      0%      0%      0%      0%      0.1%    0.4%    1.6%
Super-High        reference

Similarity search
Click on OK and the similarity search starts immediately. You can see the progress in the status bar. This is a background process so you can keep on working and viewing files.

When completely finished, ABC-View will sort the list of similar images so that they are grouped together in recognizable color bands.

Is it possible to delete all inferior images at once?
The list contains groups of similar images. In each group, there is usually one image that is worth keeping, while perhaps all others can or should be deleted. Deleting them all at once is possible but requires some care.

First of all, it is best to work in Details Mode (F12). As an example, suppose you want to keep the biggest images (largest dimensions) and, if the dimensions are equal, the image with the biggest file size. Here’s how to proceed:

  1. Click twice on the Size column. The first click sorts by size, the second click reverses the order so that the largest files come first.
  2. Click twice on the Dimensions column. The first click sorts by dimensions, the second click reverses the order so that the largest dimensions come first.
  3. Right-click on any item in the list and select Sort List -> By Duplicate Group. Note that you now have a list that is sorted first by duplicate group, then by dimensions, and then by size.
  4. Now click Items -> Special Selection -> Duplicates (w/o original). This selects the 2nd, 3rd and any further items of each duplicate group. This means that the top item of each group, the one with the largest dimensions and, if these are equal, the largest size, remains unselected.
  5. This is the reward: click the “delete” icon and delete or archive the selected items. The top items (your important images) are not selected and are therefore not deleted.
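
The steps above rely on the fact that each new sort keeps the previous order for items that compare as equal. The same logic can be sketched in Python, whose sorted() function is stable in the same way. This is only an illustration: the field names are hypothetical, and comparing dimensions by pixel area (width times height) is an assumption, since the source does not say how ABCVM orders dimensions.

```python
def select_inferior(items):
    """Return the names of all items except the best one in each group.

    items: list of dicts with keys "name", "group", "width", "height", "size".
    """
    # Step 1: largest file size first.
    ordered = sorted(items, key=lambda it: it["size"], reverse=True)
    # Step 2: largest dimensions first (assumption: compared by pixel area).
    ordered = sorted(ordered, key=lambda it: it["width"] * it["height"], reverse=True)
    # Step 3: group the duplicates; ties keep the order from steps 1 and 2.
    ordered = sorted(ordered, key=lambda it: it["group"])

    # Step 4: select everything except the first item of each group.
    selected, seen_groups = [], set()
    for it in ordered:
        if it["group"] in seen_groups:
            selected.append(it["name"])
        else:
            seen_groups.add(it["group"])
    return selected
```

For a group with images of 800x600, 1600x1200 (300 KB) and 1600x1200 (250 KB), the 1600x1200 image of 300 KB stays unselected and the other two are returned for deletion.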

It is crucial to realize that not all images in one duplicate group are exactly identical. Especially with the larger tolerance settings, you will find images that look alike but are still definitely different. So please be careful when removing many images automatically with the above method!

Sorting on similarity
When you sort on similarity, you will get a list in which each item is displayed next to its two most similar neighbors. This process will work best for lists with a detail matching of at least “Medium”.

The image metrics must be precalculated in order for sorting on similarity to work.

A correct sorting on similarity cannot be done with the quicksort algorithm. Therefore, for sets smaller than 5,000 images, another method (slower but more precise) is used to sort the list on similarity.
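
A key-based sort such as quicksort needs a single value per item to order by, whereas similarity is a relation between two images. The Python sketch below shows one simple alternative, a greedy nearest-neighbor ordering. It is an assumption chosen for illustration, not necessarily the method ABCVM uses, but it shows why a more precise method is slower: each step scans all remaining images, so the work grows with the square of the collection size. The name sort_by_similarity is hypothetical.

```python
def sort_by_similarity(metrics):
    """Order images so that each one follows its nearest remaining neighbor.

    metrics: dict mapping image name -> list of numbers (the image metrics)
    Returns a list of names, starting with the first name in the dict.
    """
    names = list(metrics)
    if not names:
        return []

    def diff(a, b):
        return sum(abs(x - y) for x, y in zip(metrics[a], metrics[b]))

    order = [names[0]]
    remaining = names[1:]
    while remaining:
        last = order[-1]
        # Scan every remaining image for the one closest to the last placed.
        nearest = min(remaining, key=lambda name: diff(last, name))
        order.append(nearest)
        remaining.remove(nearest)
    return order
```

In this sketch, similar images end up next to each other even when no single number could rank them.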

You can change the default behavior of ABCVM:

The fast method for sorting on similarity gives correct results only for exactly similar images. Slightly similar images will often not end up next to each other.