Tutorial: Detecting Duplicate Images

Learning Objectives

In this tutorial, you’ll learn:

  • How to detect duplicate and near-duplicate images in your collection
  • How to control similarity thresholds for duplicate detection
  • The difference between exact duplicates and visually similar images
  • Best practices for image deduplication workflows
  • How to interpret and use duplicate detection results

Prerequisites

Before starting this tutorial, please ensure you have:

  • Completed the Finding Similar Images tutorial
  • A valid search context ID from previous tutorials
  • Your Aspose Cloud credentials (Client ID and Client Secret)
  • A valid access token for the Aspose Cloud API
  • Multiple images already added to your search context, including some duplicates or near-duplicates

Understanding Image Duplication

In digital image collections, duplicates can exist in various forms:

  1. Exact Duplicates: Pixel-by-pixel identical images
  2. Near-Duplicates: Same image with minor variations:
    • Different compression or file format
    • Slight cropping or resizing
    • Minor color adjustments or filters
    • Small edits or overlays
  3. Visually Similar: Not duplicates but very similar content
    • Same subject from different angles
    • Sequential frames from video
    • Multiple shots from same photoshoot

Aspose.Imaging Cloud can detect both exact and near-duplicates using visual feature matching.

How Duplicate Detection Works

Unlike the “find similar” operation, which finds images similar to a single reference image, duplicate detection compares each image in your search context with every other image to identify pairs or groups that appear to be duplicates of each other.

The process is based on:

  1. Comparing the visual features of all images
  2. Applying a similarity threshold to determine what counts as a “duplicate”
  3. Grouping images into duplicate sets based on their similarity

Step 1: Obtain an Access Token

First, authenticate with the Aspose Cloud API:

curl -v "https://api.aspose.cloud/v3/imaging/ai/imageSearch/YOUR_SEARCH_CONTEXT_ID/findDuplicates?similarityThreshold=90.0" \
-X GET \
-H "Content-Type: application/json" \
-H "Accept: application/json" \
-H "Authorization: Bearer YOUR_ACCESS_TOKEN"

Replace the following placeholders:

  • YOUR_SEARCH_CONTEXT_ID with your actual search context ID
  • YOUR_ACCESS_TOKEN with the token from Step 1

The similarityThreshold query parameter controls how strict the duplicate detection is:

  • 90-100: Only very close or exact duplicates
  • 80-89: Near duplicates with minor variations
  • 70-79: Visually similar images that might be considered duplicates
  • Below 70: Not recommended for duplicate detection (too many false positives)

Step 2: Interpret the Results

You’ll receive a response like this:

{
  "Duplicates": [
    {
      "duplicatesInGroup": [
        {
          "ImageId": "image1.jpg",
          "Similarity": 100.0
        },
        {
          "ImageId": "image1_copy.jpg",
          "Similarity": 98.7
        }
      ]
    },
    {
      "duplicatesInGroup": [
        {
          "ImageId": "cat_photo.jpg",
          "Similarity": 100.0
        },
        {
          "ImageId": "cat_photo_resized.jpg",
          "Similarity": 95.2
        },
        {
          "ImageId": "cat_photo_filtered.jpg",
          "Similarity": 92.1
        }
      ]
    }
  ],
  "Code": 200,
  "Status": "OK"
}

The response includes:

  • An array of duplicate groups, each representing a set of duplicate images
  • Within each group, the first image is considered the “original”
  • Each image has a similarity score relative to the first image in its group
  • Only images meeting the similarity threshold are included in groups

Try It Yourself

Now, let’s practice finding duplicate images:

  1. Obtain an access token using the curl command
  2. Add several similar or duplicate images to your search context if you haven’t already
  3. Run the duplicate detection with different similarity thresholds (try 95, 90, and 80)
  4. Compare the results and see how the groups change with different thresholds
  5. Identify which threshold works best for your specific use case

Using SDK Libraries

Java SDK Example

import com.aspose.imaging.cloud.sdk.api.ImagingApi;
import com.aspose.imaging.cloud.sdk.model.ImageDuplicatesSet;
import com.aspose.imaging.cloud.sdk.model.ImageDuplicates;
import com.aspose.imaging.cloud.sdk.model.SearchResult;

public class FindDuplicateImagesExample {
    
    public static void main(String[] args) {
        // Get your clientId and clientSecret from https://dashboard.aspose.cloud/
        String clientId = "YOUR_CLIENT_ID";
        String clientSecret = "YOUR_CLIENT_SECRET";
        
        // Create instance of the API
        ImagingApi api = new ImagingApi(clientId, clientSecret);
        
        // Your search context ID from previous operations
        String searchContextId = "YOUR_SEARCH_CONTEXT_ID";
        
        // Duplicate detection parameter
        Double similarityThreshold = 90.0; // Minimum similarity score (0-100)
        
        try {
            // Find duplicate images
            ImageDuplicatesSet duplicatesSet = api.findImageDuplicates(searchContextId, similarityThreshold, null, null);
            
            // Print the results
            System.out.println("Found " + duplicatesSet.getDuplicates().size() + " duplicate groups:");
            
            int groupIndex = 1;
            for (ImageDuplicates group : duplicatesSet.getDuplicates()) {
                System.out.println("\nDuplicate Group " + groupIndex + ":");
                
                for (SearchResult image : group.getDuplicatesInGroup()) {
                    System.out.println("  Image ID: " + image.getImageId() + ", Similarity: " + image.getSimilarity());
                }
                
                groupIndex++;
            }
            
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

.NET SDK Example

using Aspose.Imaging.Cloud.Sdk.Api;
using Aspose.Imaging.Cloud.Sdk.Model;
using System;

namespace AsposeImagingCloudTutorials
{
    class FindDuplicateImagesExample
    {
        static void Main(string[] args)
        {
            // Get your clientId and clientSecret from https://dashboard.aspose.cloud/
            string clientId = "YOUR_CLIENT_ID";
            string clientSecret = "YOUR_CLIENT_SECRET";
            
            // Create instance of the API
            ImagingApi api = new ImagingApi(clientId, clientSecret);
            
            // Your search context ID from previous operations
            string searchContextId = "YOUR_SEARCH_CONTEXT_ID";
            
            // Duplicate detection parameter
            double similarityThreshold = 90.0; // Minimum similarity score (0-100)
            
            try
            {
                // Find duplicate images
                ImageDuplicatesSet duplicatesSet = api.FindImageDuplicates(searchContextId, similarityThreshold);
                
                // Print the results
                Console.WriteLine("Found " + duplicatesSet.Duplicates.Count + " duplicate groups:");
                
                int groupIndex = 1;
                foreach (ImageDuplicates group in duplicatesSet.Duplicates)
                {
                    Console.WriteLine("\nDuplicate Group " + groupIndex + ":");
                    
                    foreach (SearchResult image in group.DuplicatesInGroup)
                    {
                        Console.WriteLine("  Image ID: " + image.ImageId + ", Similarity: " + image.Similarity);
                    }
                    
                    groupIndex++;
                }
            }
            catch (Exception ex)
            {
                Console.WriteLine("Error: " + ex.Message);
            }
        }
    }
}

Real-World Deduplication Workflow

In real applications, duplicate detection is often part of a larger workflow:

  1. Detection: Find duplicate groups using the API
  2. Review: Allow users to review the suggested duplicates
  3. Selection: For each group, select which version to keep
  4. Action: Remove or archive the unwanted duplicates
  5. Documentation: Log which images were removed and why

A typical deduplication strategy is to keep the highest quality version of each image and remove the rest.

Best Practices for Duplicate Detection

  1. Choose the Right Threshold: Start with a high threshold (90-95) to identify only very close duplicates.
  2. Batch Processing: For large collections, process in manageable batches.
  3. Quality Assessment: When choosing which duplicate to keep, consider:
    • Image resolution (keep the highest)
    • File format (prefer lossless formats)
    • Addition of metadata (keep the version with more useful metadata)
  4. Human Verification: For critical collections, have a human verify the duplicates before removal.
  5. Incremental Deduplication: Run duplicate detection periodically as new images are added.

What You’ve Learned

In this tutorial, you’ve learned:

  • How to detect duplicate and near-duplicate images in your collection
  • How to control the duplicate detection sensitivity using thresholds
  • How to interpret the results of duplicate detection
  • Best practices for image deduplication workflows
  • How to implement duplicate detection using the API and SDKs

Troubleshooting Tips

  • No Duplicates Found: If you expect duplicates but none are found, try lowering the similarity threshold.
  • Too Many False Positives: If unrelated images are grouped as duplicates, increase the similarity threshold.
  • Performance Issues: Duplicate detection compares each image with every other image, which can be resource-intensive for large collections.
  • Unexpected Grouping: Remember that grouping is based on visual similarity, not semantic meaning or file properties.

Next Steps

Now that you know how to detect duplicate images, you’re ready to move on to the next tutorial in our series:

Tutorial: Comparing Two Images

In the next tutorial, you’ll learn how to directly compare two specific images and determine their similarity score.

Further Practice

To reinforce what you’ve learned:

  • Create a script that finds duplicates and organizes them in folders
  • Experiment with different similarity thresholds to find the best balance for your images
  • Implement a simple duplicate detection and cleanup workflow
  • Try adding transformed versions of images (rotated, resized, filtered) and see if they’re detected as duplicates

Helpful Resources