/duplicate endpoint
Use the /duplicate endpoint to scan the collection for duplicate images using external images or the images already in the collection.
Detect exact duplicates only or near-duplicates as well.
The examples below require the example dataset.
Single scan
Check whether image 115 has duplicates in the collection using the /duplicate endpoint.
http://localhost:8983/api/cores/my-collection/duplicate?rank.by.id=115
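If you prefer calling the endpoint from Python, a minimal sketch using the requests library (response fields as shown in the example responses on this page) could look like this:

import requests

FLOW_URL = "http://localhost:8983/api/cores/my-collection"

# Scan for duplicates of image 115
rsp = requests.get(f"{FLOW_URL}/duplicate", params={"rank.by.id": "115"}).json()
print(f"Found {rsp['response']['numFound']} duplicate candidates")
for doc in rsp["response"]["docs"]:
    print(doc["id"], doc.get("score"))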
Full scan
Scanning the complete collection by comparing all images against each other is more complex than a single scan. The process for a full scan is:
- Retrieve all image ids in the collection
- For each id, scan for duplicates in the collection
- If duplicate images are found, add them to a graph as nodes and connect them via edges
- Once all images have been processed, identify isolated groups of duplicates in the graph
- For each duplicate group, save the image ids as a single CSV line
Once the duplicates are detected, you can implement action policies such as deleting or merging images.
Since these are quite a few steps, we provide a Python snippet find-duplicates.py that you can use.
Executing the find-duplicates.py script scans your complete collection:
$ python3 find-duplicates.py
Init scan_mode=balanced based on 1565 docs in collection.
Set rank.approximate=False and rank.smartfilter=off
Process 1565 images. Start scanning (this may take a while)...
100%|█████████████████████████████████████| 1565/1565 [00:02<00:00, 619.61scans/s]
Found 50 duplicates
Save graph to file graph.json
Save csv to file duplicates.csv
Paste this snippet into a file find-duplicates.py and run the scan.
import requests
import sys
from concurrent.futures import ThreadPoolExecutor, as_completed
from tqdm import tqdm
import networkx as nx
import json
import csv


class DupeScanner:

    def __init__(self, flow_url="http://localhost:8983/api/cores/my-collection",
                 scan_mode="balanced", threshold=0.7):
        self.FLOW_URL = flow_url
        self.G = nx.Graph()
        self.threshold = threshold
        self.approx, self.filter = self.init_scan_mode(scan_mode)
        print(f"Set rank.approximate={self.approx} and rank.smartfilter={self.filter}")

    def init_scan_mode(self, scan_mode):
        # Choose rank.approximate and rank.smartfilter based on collection size
        num = self.num_docs()
        print(f"Init scan_mode={scan_mode} based on {num} docs in collection.")
        if scan_mode == "balanced":
            if num <= 10_000:
                return (False, "off")
            if num <= 100_000:
                return (False, "low")
            if num <= 1_000_000:
                return (True, "medium")
            return (True, "high")
        elif scan_mode == "speed":
            if num <= 100_000:
                return (False, "high")
            return (True, "high")
        else:
            raise ValueError("scan_mode must be 'balanced' or 'speed'.")

    def ids(self, limit=sys.maxsize):
        # Page through all doc ids in the collection using cursorMark
        ids = []
        cursor_mark = "*"
        done = False
        bulk = 1000
        count = 0
        while not done:
            rsp = requests.get(f"{self.FLOW_URL}/select?rows={bulk}&fl=id&sort=id asc&cursorMark={cursor_mark}").json()
            docs = rsp["response"]["docs"]
            ids.extend(list(map(lambda doc: doc["id"], docs)))
            count += bulk
            if cursor_mark == rsp["nextCursorMark"] or count >= limit:
                # no further docs
                done = True
            cursor_mark = rsp["nextCursorMark"]
        return ids if count <= limit else ids[:limit]

    def detect(self, id):
        # Query the /duplicate endpoint for one image and add matches to the graph
        rsp = requests.get(f"{self.FLOW_URL}/duplicate?rank.by.id={id}&fq=-id:{id}&fl=id,score,image&rows=50&rank.threshold={self.threshold}&rank.approximate={self.approx}&rank.smartfilter={self.filter}")
        dups = rsp.json()["response"]["docs"]
        # Filter irrelevant results - necessary when scanning approximately
        dups = self.remove_irrelevant_matches(dups)
        if len(dups) > 0:
            # Add query doc, since it is not part of the response
            self.G.add_node(id, image=self.get_image_url(id))
            for dup in dups:
                self.G.add_node(dup["id"], image=dup["image"])
                self.G.add_edge(id, dup["id"], weight=dup["score"])

    def num_docs(self):
        rsp = requests.get(f"{self.FLOW_URL}/select?q=*:*&rows=0")
        return rsp.json()["response"]["numFound"]

    def remove_irrelevant_matches(self, docs):
        new_list = []
        for doc in docs:
            if doc["score"] >= self.threshold:
                new_list.append(doc)
        return new_list

    def get_image_url(self, id):
        rsp = requests.get(f"{self.FLOW_URL}/select?q=id:{id}&fl=image")
        return rsp.json()["response"]["docs"][0]["image"]

    def scan(self, max=sys.maxsize, threads=4):
        # Build new graph
        self.G = nx.Graph()
        pool = ThreadPoolExecutor(threads)
        futures = []
        for id in self.ids(max):
            futures.append(pool.submit(self.detect, id))
        if not futures:
            print("Collection is empty.")
            return
        print(f"Process {len(futures)} images. Start scanning (this may take a while)...")
        errors = 0
        try:
            # Await completion and display progress
            progress = tqdm(as_completed(futures), total=len(futures), unit="scans", colour="green", smoothing=0)
            for f in progress:
                if f.exception() is not None:
                    # Silently count errors
                    errors += 1
        except KeyboardInterrupt:
            print("Abort scanning...")
            self.close_threadpool(futures, pool)
        if errors > 0:
            print(f"{errors}/{len(futures)} images could not be processed.")
        print(f"Found {self.G.number_of_nodes()} duplicates")

    def close_threadpool(self, futures, pool):
        for future in futures:
            future.cancel()
        pool.shutdown(wait=True)

    def save_graph(self, filename="graph.json"):
        # Convert NetworkX graph to Cytoscape JSON graph
        cyto_graph = nx.cytoscape_data(self.G)
        with open(filename, 'w') as f:
            print(f"Save graph to file {filename}")
            f.write(json.dumps(cyto_graph, ensure_ascii=False))

    def save_csv(self, filename="duplicates.csv"):
        # Each connected component in the graph is one group of duplicates
        with open(filename, 'w', encoding='UTF8') as f:
            print(f"Save csv to file {filename}")
            writer = csv.writer(f)
            for c in nx.connected_components(self.G):
                writer.writerow(c)


scanner = DupeScanner()
scanner.scan()
scanner.save_graph()
scanner.save_csv()
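Once duplicates.csv is written, you can apply an action policy to each duplicate group. The sketch below is a hypothetical example, assuming a Solr-style JSON delete via the /update endpoint (adapt it to the delete API of your deployment): it keeps the first id of every group and deletes the rest.

import csv
import requests

FLOW_URL = "http://localhost:8983/api/cores/my-collection"

with open("duplicates.csv", newline="") as f:
    for group in csv.reader(f):
        keep, remove = group[0], group[1:]
        if not remove:
            continue
        # Assumption: Solr-style JSON delete command accepted by /update
        rsp = requests.post(f"{FLOW_URL}/update?commit=true", json={"delete": remove})
        print(f"Kept {keep}, deleted {len(remove)} duplicates: HTTP {rsp.status_code}")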
Visualizing scan results
$ python3 show-graph.py
Load graph from file graph.json
Download and convert 50 images to data URIs...
Finished.
Dash is running on http://127.0.0.1:8050/
* Serving Flask app 'show_graph'
* Debug mode: on
Open your browser and inspect the detected duplicate clusters of the scan.
Paste this snippet into a file show-graph.py and visualize the results.
from dash import Dash, html
import json
import dash_cytoscape as cyto
from PIL import Image
import requests
from io import BytesIO
import base64


class GraphPreparation:

    def __init__(self, graph_file="graph.json"):
        cyto_graph = None
        print(f"Load graph from file {graph_file}")
        with open(graph_file, 'r') as f:
            cyto_graph = json.load(f)
        self.edges = cyto_graph["elements"]["edges"]
        self.nodes = cyto_graph["elements"]["nodes"]
        # Avoid CORS issues by embedding thumbnail images
        self.embed_imgs_as_data_uris()
        self.style_edge_width()

    def style_edge_width(self):
        # Map the duplicate score to edge opacity and width
        for edge in self.edges:
            weighted = max(0.1, edge['data']['weight'] - 0.5)
            edge['data']['opacity'] = weighted
            edge['data']['weight'] = weighted * 20

    def embed_imgs_as_data_uris(self):
        print(f"Download and convert {len(self.nodes)} images to data URIs...")
        for node in self.nodes:
            url = node['data']['image']
            # Download the image, create a thumbnail and convert it to a data URI
            img = Image.open(BytesIO(requests.get(url).content))
            img.thumbnail((150, 150))
            data = BytesIO()
            img.save(data, "JPEG")
            data64 = base64.b64encode(data.getvalue())
            node['data']['image'] = 'data:image/jpeg;base64,' + data64.decode('utf-8')
        print("Finished.")

    def get_cytoscape_ui(self):
        return cyto.Cytoscape(
            id='Detected duplicates',
            responsive=True,
            layout={'name': 'cose',
                    'animate': False,
                    'fit': True,
                    'nodeOverlap': 80,
                    'numIter': 500
                    },
            style={'width': '100%', 'height': '100%', 'position': 'absolute'},
            stylesheet=[
                {
                    'selector': 'node',
                    'style': {
                        'content': 'data(name)',
                        'background-image': 'data(image)',
                        'background-fit': 'contain',
                        'background-opacity': '0.1',
                        'width': '150px',
                        'height': '150px',
                        'shape': 'rectangle'
                    }
                },
                {
                    'selector': 'edge',
                    'style': {
                        'opacity': 'data(opacity)',
                        'width': 'data(weight)',
                    }
                }],
            elements=self.nodes + self.edges
        )


if __name__ == '__main__':
    app = Dash(__name__)
    graph = GraphPreparation()
    app.layout = html.Div([graph.get_cytoscape_ui()])
    app.run_server(debug=True)
Partial scanning
Sometimes you want to search for duplicates only in a subset of the collection.
To restrict the scan, just add one or more filter queries (fq) to your query.
You may include only docs that are tagged with a certain term (e.g. fq=category:project-a) or exclude them and search in all projects but project-a by setting fq=-category:project-a. You can also search in specific date ranges.
This may be useful either to speed up the scan by reducing the number of docs that have to be scored, or to narrow down where to search.
The query below scans for door images that are duplicates of image id 1041.
Furthermore, the query image itself is excluded from search.
http://localhost:8983/api/cores/my-collection/duplicate?
rank.by.id=1041
&fq=labels:door
&fq=-id:1041
One duplicate image is found.
{
"responseHeader":{
"status":0,
"QTime":6},
"response":{"numFound":1,"start":0,"maxScore":0.7436813,"numFoundExact":true,"docs":[
{
"id":"1051",
"image":"https://docs.pixolution.org/assets/imgs/example-dataset/door/photo-1581613856477-f65208436a48.jpeg",
"score":0.7436813}]
}
}
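The same query can be built in Python by passing the repeated fq parameters as a list; a minimal sketch using the requests library:

import requests

FLOW_URL = "http://localhost:8983/api/cores/my-collection"

params = {
    "rank.by.id": "1041",
    "fq": ["labels:door", "-id:1041"],  # repeated filter queries
}
docs = requests.get(f"{FLOW_URL}/duplicate", params=params).json()["response"]["docs"]
for doc in docs:
    print(doc["id"], doc["score"], doc["image"])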
Detection sensitivity
Depending on your use case you may have various definitions of what a duplicate actually is. With the rank.threshold parameter you can set the detection sensitivity and control whether to retrieve only exact duplicates or near-duplicates as well.
Exact duplicates may have a different scale, file format or compression artifacts but basically encode the same image content.
Near-duplicates may additionally include manipulations of the image content, such as cropping, a different aspect ratio, changes to brightness, gamma and saturation, or added decorations like text, logos or icons.
The threshold depends on your image content as well as your definition of a duplicate. As a guideline you may set the following rank.threshold values based on your use case:
- Exact duplicates only: 0.95
- Near-duplicates: 0.7
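Applied to the single scan from above, the sensitivity is controlled by adding rank.threshold to the query. A minimal sketch comparing both guideline values for image 115:

import requests

FLOW_URL = "http://localhost:8983/api/cores/my-collection"

for threshold in (0.95, 0.7):  # exact duplicates only vs. near-duplicates
    rsp = requests.get(f"{FLOW_URL}/duplicate",
                       params={"rank.by.id": "115", "rank.threshold": threshold}).json()
    print(f"rank.threshold={threshold}: {rsp['response']['numFound']} matches")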
Scanning huge collections
Scanning a complete collection is a time-consuming process which has to be done offline. Depending on the hardware and collection size it can take hours to process. Most often, this is only necessary once to initially de-dupe an existing dataset. Subsequent operations should then scan before adding new images to the collection to keep the dataset clean.
Reducing scan time is always a trade-off between time and detection quality.
We propose two scan modes, balanced and speed, which both suggest different parameter values for rank.approximate and rank.smartfilter based on the collection size.
The Python snippet find-duplicates.py implements both scanning modes (see full scan).
Balanced mode:
- up to 10,000 docs use rank.approximate=false&rank.smartfilter=off
- up to 100,000 docs use rank.approximate=false&rank.smartfilter=low
- up to 1,000,000 docs use rank.approximate=true&rank.smartfilter=medium
- from 1,000,000 docs use rank.approximate=true&rank.smartfilter=high

Speed mode:
- up to 100,000 docs use rank.approximate=false&rank.smartfilter=high
- from 100,000 docs use rank.approximate=true&rank.smartfilter=high
When using rank.approximate=true you have to manually remove irrelevant matches from the result set that do not meet the required score threshold. The provided find-duplicates.py snippet does this for you automatically.
Maintain a clean image collection
To keep your collection free of duplicates, scan new images to check if they already exist before you index them. Based on the response, you can take various actions, such as rejecting the new document, adding a link to the existing document IDs, forcing user input, or overwriting the existing document.
The graph below shows how to efficiently scan before indexing a new doc. The idea is to analyze an image only once and to reuse the preprocessed JSON data throughout the workflow, avoiding repeated analysis steps.
graph LR
A[Client]
B[Flow /analyze endpoint]
C[Flow /duplicate endpoint]
D[Flow /update endpoint]
A <==>|1. Analyze image| B
A <==>|2. Scan for duplicates | C
A ==>|3. Index Json| D
- Analyze the image
- Use the preprocessed JSON to scan for duplicates (rank.by.preprocessed)
- Use the preprocessed JSON to index the doc if it does not exist (see the sketch below)
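A rough sketch of this scan-before-index workflow is shown below. The request and response formats of the /analyze and /update endpoints are assumed here for illustration only (see the documentation of those endpoints for the exact parameters); the duplicate check reuses the preprocessed JSON via rank.by.preprocessed.

import requests

FLOW_URL = "http://localhost:8983/api/cores/my-collection"

def index_if_new(doc_id, image_url):
    # 1. Analyze the image once (assumption: /analyze takes an image URL and returns preprocessed JSON)
    preprocessed = requests.get(f"{FLOW_URL}/analyze", params={"image": image_url}).json()

    # 2. Scan for duplicates, reusing the preprocessed JSON instead of analyzing again
    #    (assumption: the JSON is sent in the request body together with rank.by.preprocessed)
    rsp = requests.post(f"{FLOW_URL}/duplicate",
                        params={"rank.by.preprocessed": True, "rank.threshold": 0.95},
                        json=preprocessed).json()
    if rsp["response"]["numFound"] > 0:
        print(f"Rejecting {doc_id}, duplicates found:",
              [d["id"] for d in rsp["response"]["docs"]])
        return False

    # 3. Index the doc with the already preprocessed data
    #    (assumption: Solr-style JSON add via the /update endpoint)
    requests.post(f"{FLOW_URL}/update?commit=true",
                  json=[{"id": doc_id, "image": image_url, **preprocessed}])
    return True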
Use Cases
E-commerce
Finding near-duplicate images can help identify products that are being sold by multiple vendors, making it easier to manage inventory and ensure product authenticity.
Media and entertainment
Finding near-duplicate images can help identify copyright violations and protect intellectual property rights for content creators.
Stock photography
Automatically identify and reject uploads of images that have previously been flagged and rejected.
Real estate
Identify fraudulent image uploads (fake listings) on the marketplace by matching them with verified users' images, and identify duplicate listings for the same home or property when imported from multiple sources.
Digital asset management
Prevent multiple uploads by different users or associate different versions of an image. De-duplicate an existing image collection.