/duplicate endpoint

Use the /duplicate endpoint to scan the collection for duplicates of an external image or of an image already in the collection. It can detect exact duplicates only, or near-duplicates as well.

The examples below require the example dataset.

Single scan

Example result of correctly detected duplicates.

Check whether image 115 has duplicates in the collection using the /duplicate endpoint.
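
Once a response arrives, the matches can be read from it. The sketch below assumes the response has the "response" -> "docs" shape that the scan snippet on this page reads, with an id and a score per match. The sample ids and scores are made up, and list_duplicates is a hypothetical helper, not part of the API.

```python
# Sample payload shaped like the "response" -> "docs" structure that the
# scan snippet on this page reads; the ids and scores are made up.
sample_response = {
    "response": {
        "docs": [
            {"id": "731", "score": 0.74},
            {"id": "2087", "score": 0.98},
        ]
    }
}


def list_duplicates(rsp):
    """Return (id, score) pairs of all matches, highest score first."""
    docs = rsp["response"]["docs"]
    pairs = [(doc["id"], doc["score"]) for doc in docs]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


for dup_id, score in list_duplicates(sample_response):
    print(f"image {dup_id} matches with score {score}")
```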



Full scan

Scanning the complete collection by comparing all images against each other is more complex than a single scan. The process for a full scan is:

  1. Retrieve all image ids in the collection.
  2. For each id, scan for duplicates in the collection.
  3. If duplicates are found, add them to a graph as nodes and connect them via edges.
  4. Once all images have been processed, identify isolated groups of duplicates in the graph.
  5. For each duplicate group, save the image ids as a single CSV line.
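
Steps 3 to 5 can be sketched in isolation, without a running server. The example below is a minimal illustration: it groups hypothetical match pairs into connected components and writes one group per CSV line. It uses a small union-find from the standard library rather than a graph library, and the function names and sample ids are assumptions made for this sketch.

```python
import csv
import io


def group_duplicates(pairs):
    """Group ids into duplicate clusters (connected components).

    `pairs` is an iterable of (id_a, id_b) tuples, one per detected match.
    Uses a small union-find instead of a graph library.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for node in parent:
        groups.setdefault(find(node), []).append(node)
    return sorted(sorted(g) for g in groups.values())


def groups_to_csv(groups):
    """Write one duplicate group per CSV line and return the text."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerows(groups)
    return buf.getvalue()


# Hypothetical matches: 115~116 and 116~117 form one group, 900~901 another.
matches = [("115", "116"), ("116", "117"), ("900", "901")]
groups = group_duplicates(matches)
print(groups_to_csv(groups), end="")
```

Images 115 and 117 end up in the same group even though they were never matched directly, because both are connected to 116.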

Once the duplicates are detected, you can implement action policies such as deleting or merging images.
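
As one possible policy, the sketch below reads CSV text with one duplicate group per line and plans which ids to remove. Keeping the first id of each group is an arbitrary choice made for illustration, and plan_deletions is a hypothetical helper. The function only computes a plan and does not delete anything.

```python
import csv
import io


def plan_deletions(csv_text):
    """Map each kept id to the ids that would be removed from its group.

    Policy (an assumption for illustration): keep the first id of every
    line and mark all other ids in that line for deletion.
    """
    plan = {}
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 2:
            continue  # a single id has no duplicates to remove
        keep, *remove = row
        plan[keep] = remove
    return plan


# Hypothetical content of a duplicates file, one group per line.
example = "115,116,117\n900,901\n"
print(plan_deletions(example))
```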

Since these are quite a few steps, we provide a Python snippet that you can use.

Executing the script scans your complete collection:

$ python3
Init scan_mode=balanced based on 1565 docs in collection.
Set rank.approximate=False and rank.smartfilter=off
Process 1565 images. Start scanning (this may take a while)...
100%|█████████████████████████████████████| 1565/1565 [00:02<00:00, 619.61scans/s]
Found 50 duplicates
Save graph to file graph.json
Save csv to file duplicates.csv

Paste this snippet in a file and run the scan.

import requests
import sys
from concurrent.futures import ThreadPoolExecutor, as_completed
from tqdm import tqdm
import networkx as nx
import json
import csv

class DupeScanner:

    def __init__(self, flow_url="http://localhost:8983/api/cores/my-collection",
                 scan_mode="balanced", threshold=0.7):
        self.FLOW_URL = flow_url
        self.G = nx.Graph()
        self.threshold = threshold
        self.approx, self.filter = self.init_scan_mode(scan_mode)
        print(f"Set rank.approximate={self.approx} and rank.smartfilter={self.filter}")

    def init_scan_mode(self, scan_mode):
        num = self.num_docs()
        print(f"Init scan_mode={scan_mode} based on {num} docs in collection.")
        if scan_mode == "balanced":
            if num <= 10_000:
                return (False, "off")
            if num <= 100_000:
                return (False, "low")
            if num <= 1_000_000:
                return (True, "medium")
            return (True, "high")
        elif scan_mode == "speed":
            if num <= 100_000:
                return (False, "high")
            return (True, "high")
        # Reached only for an unknown scan_mode
        raise ValueError("scan_mode must be 'balanced' or 'speed'.")

    def ids(self, limit=sys.maxsize):
        ids = []
        cursor_mark = "*"
        done = False
        bulk = 1000
        count = 0
        while not done:
            rsp = requests.get(f"{self.FLOW_URL}/select?rows={bulk}&fl=id&sort=id asc&cursorMark={cursor_mark}").json()
            docs = rsp["response"]["docs"]
            ids.extend(list(map(lambda doc: doc["id"], docs)))
            count += bulk
            if cursor_mark == rsp["nextCursorMark"] or count >= limit:
                # no further docs
                done = True
            cursor_mark = rsp["nextCursorMark"]
        return ids if count <= limit else ids[:limit]

    def detect(self, id):
        rsp = requests.get(f"{self.FLOW_URL}/duplicate?{id}&fq=-id:{id}&fl=id,score,image&rows=50&rank.threshold={self.threshold}&rank.approximate={self.approx}&rank.smartfilter={self.filter}")
        dups = rsp.json()["response"]["docs"]
        # Filter irrelevant results - necessary when scanning approximately
        dups = self.remove_irrelevant_matches(dups)
        if dups:
            # Add query doc, since it is not part of response
            self.G.add_node(id, image=self.get_image_url(id))
            for dup in dups:
                self.G.add_node(dup["id"], image=dup["image"])
                self.G.add_edge(id, dup["id"], weight=dup["score"])

    def num_docs(self):
        rsp = requests.get(f"{self.FLOW_URL}/select?q=*:*&rows=0")
        return rsp.json()["response"]["numFound"]

    def remove_irrelevant_matches(self, docs):
        # Keep only matches whose score meets the threshold
        new_list = []
        for doc in docs:
            if doc["score"] >= self.threshold:
                new_list.append(doc)
        return new_list

    def get_image_url(self, id):
        rsp = requests.get(f"{self.FLOW_URL}/select?q=id:{id}&fl=image")
        return rsp.json()["response"]["docs"][0]["image"]

    def scan(self, max=sys.maxsize, threads=4):
        # Build new graph
        self.G = nx.Graph()
        pool = ThreadPoolExecutor(threads)
        futures = []
        for id in self.ids(max):
            futures.append(pool.submit(self.detect, id))
        if not futures:
            print("Collection is empty.")
            pool.shutdown()
            return
        print(f"Process {len(futures)} images. Start scanning (this may take a while)...")
        errors = 0
        try:
            # Await completion and display progress
            progress = tqdm(as_completed(futures), total=len(futures), unit="scans", colour="green", smoothing=0)
            for f in progress:
                if f.exception() is not None:
                    # Silently count errors
                    errors += 1
        except KeyboardInterrupt:
            print("Abort scanning...")
            self.close_threadpool(futures, pool)
        if errors > 0:
            print(f"{errors}/{len(futures)} images could not be processed.")
        print(f"Found {self.G.number_of_nodes()} duplicates")

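    def close_threadpool(self, futures, pool):
        # Cancel scans that have not started yet, then wait for running ones
        for future in futures:
            future.cancel()
        pool.shutdown(wait=True)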

    def save_graph(self, filename="graph.json"):
        # Convert NetworkX graph to Cytoscape JSON graph
        cyto_graph = nx.cytoscape_data(self.G)
        with open(filename, 'w') as f:
            print(f"Save graph to file {filename}")
            f.write(json.dumps(cyto_graph, ensure_ascii=False))

    def save_csv(self, filename="duplicates.csv"):
        with open(filename, 'w', encoding='UTF8') as f:
            print(f"Save csv to file {filename}")
            writer = csv.writer(f)
            for c in nx.connected_components(self.G):
                # Write the ids of one duplicate group as a single line
                writer.writerow(sorted(c))

scanner = DupeScanner()
scanner.scan()
scanner.save_graph()
scanner.save_csv()

Visualizing scan results

$ python3
Load graph from file graph.json
Download and convert 50 images to data URIs...
Dash is running on

 * Serving Flask app 'show_graph'
 * Debug mode: on

Open your browser and inspect the detected duplicate clusters of the scan.

Example visualization of found duplicate groups.

Paste this snippet in a file and visualize the results.

from dash import Dash, html
import json
import dash_cytoscape as cyto
from PIL import Image
import requests
from io import BytesIO
import base64

class GraphPreparation:

    def __init__(self, graph_file="graph.json"):
        cyto_graph = None
        print(f"Load graph from file {graph_file}")
        with open(graph_file, 'r') as f:
            cyto_graph = json.load(f)
        self.edges = cyto_graph["elements"]["edges"]
        self.nodes = cyto_graph["elements"]["nodes"]
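        # Avoid CORS issues by embedding thumbnail images
        self.embedd_imgs_as_data_uris()
        # Derive edge opacity and width from the match score
        self.style_edge_width()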

    def style_edge_width(self):
        for edge in self.edges:
            weighted = max(0.1, edge['data']['weight'] - 0.5)
            edge['data']['opacity'] = weighted
            edge['data']['weight'] = weighted * 20

    def embedd_imgs_as_data_uris(self):
        print(f"Download and convert {len(self.nodes)} images to data URIs...")
        for node in self.nodes:
            url = node['data']['image']
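            # Download the image and convert the PIL image to a data URI
            img = Image.open(BytesIO(requests.get(url).content))
            data = BytesIO()
            # JPEG has no alpha channel, so convert to RGB before saving
            img.convert("RGB").save(data, "JPEG")
            data64 = base64.b64encode(data.getvalue())
            node['data']['image'] = 'data:image/jpeg;base64,' + data64.decode('utf-8')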

    def get_cytoscape_ui(self):
        return cyto.Cytoscape(
            id='Detected duplicates',
            layout={'name': 'cose'},
            style={'width': '100%', 'height': '100%', 'position': 'absolute'},
            stylesheet=[
                {
                    'selector': 'node',
                    'style': {
                        'content': 'data(name)',
                        'background-image': 'data(image)',
                        'background-fit': 'contain',
                        'background-opacity': '0.1',
                        'width': '150px',
                        'height': '150px',
                        'shape': 'rectangle'
                    }
                },
                {
                    'selector': 'edge',
                    'style': {
                        'opacity': 'data(opacity)',
                        'width': 'data(weight)',
                    }
                }
            ],
            elements=self.nodes + self.edges
        )

if __name__ == '__main__':
    app = Dash(__name__)
    graph = GraphPreparation()
    app.layout = html.Div([ graph.get_cytoscape_ui() ])
    # Start the local web server (debug mode, as in the output above)
    app.run(debug=True)

Partial scanning

Sometimes you want to search for duplicates only in a subset of the collection. To restrict the scan, just add one or more filter queries (fq) to your query.

You may include only docs that are tagged with a certain term (e.g. fq=category:project-a) or exclude them (search in all projects except project-a by setting fq=-category:project-a). You can also search in specific date ranges.

This can be useful either to speed up the scan by reducing the number of docs that have to be scored, or to narrow down where to search.
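
Several filters can be combined by repeating the fq parameter. The sketch below builds such a query string with the standard library. build_query and its arguments are hypothetical names chosen for this example; only the fq syntax with the leading minus for exclusion is taken from this page.

```python
from urllib.parse import urlencode


def build_query(include=None, exclude=None, extra=None):
    """Build a query string with one fq parameter per filter.

    `include` and `exclude` are dicts of field -> value. Excluded filters
    get a leading minus, as in fq=-category:project-a.
    """
    params = list((extra or {}).items())
    for field, value in (include or {}).items():
        params.append(("fq", f"{field}:{value}"))
    for field, value in (exclude or {}).items():
        params.append(("fq", f"-{field}:{value}"))
    return urlencode(params)


# Restrict to project-a but skip the query image itself (id 1041).
print(build_query(include={"category": "project-a"}, exclude={"id": "1041"}))
```

urlencode percent-encodes the colon in each filter, which keeps the query string valid when it is appended to the endpoint URL.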

The query below scans for door images that are duplicates of image id 1041. Furthermore, the query image itself is excluded from search.


One duplicate image is found.


Detection sensitivity

Depending on your use case, you may have different definitions of what a duplicate actually is. With the rank.threshold parameter you can set the detection sensitivity and control whether to retrieve only exact duplicates or near-duplicates as well.

Exact duplicates may have a different scale, file format or compression artifacts but basically encode the same image content.

Near-duplicates additionally may include manipulations to the image content, like cropping, different aspect ratio, changes to brightness, gamma and saturation, added decorations like text, logos or icons.

The threshold depends on your image content as well as your definition of a duplicate. As a guideline, you may set the following rank.threshold values based on your use case:

  • Exact duplicates only: 0.95
  • Near-duplicates: 0.7
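
If you scan with the lower threshold, you can still separate the two kinds of match on the client side by their score. The sketch below does this with the guideline values above. Treating the score bands this way is an assumption for illustration, and the right cut-offs depend on your image content.

```python
EXACT_THRESHOLD = 0.95  # guideline value for exact duplicates
NEAR_THRESHOLD = 0.7    # guideline value for near-duplicates


def classify(score):
    """Sort a match score into 'exact', 'near' or 'none'."""
    if score >= EXACT_THRESHOLD:
        return "exact"
    if score >= NEAR_THRESHOLD:
        return "near"
    return "none"


# Made-up scores for demonstration.
for score in (0.98, 0.8, 0.4):
    print(score, classify(score))
```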

Scanning huge collections

Scanning a complete collection is a time-consuming process that has to be done offline. Depending on the hardware and collection size, it can take hours. Most often, this is only necessary once, to initially de-dupe an existing dataset. Afterwards, scan new images before adding them to the collection to keep the dataset clean.

Reducing scan time is always a trade-off between time and detection quality. We propose two scan modes, balanced and speed, each of which suggests parameter values for rank.approximate and rank.smartfilter based on the collection size.

The Python snippet implements both scan modes (see Full scan).

Balanced

  • up to 10,000 docs use rank.approximate=false & rank.smartfilter=off
  • up to 100,000 docs use rank.approximate=false & rank.smartfilter=low
  • from 100,000 docs use rank.approximate=true & rank.smartfilter=medium

Speed

  • up to 100,000 docs use rank.approximate=false & rank.smartfilter=high
  • from 100,000 docs use rank.approximate=true & rank.smartfilter=high

When using rank.approximate=true you have to manually remove irrelevant matches from the result set that do not meet a required score threshold. The provided snippet automatically does this for you.

Maintain a clean image collection

To keep your collection free of duplicates, scan new images to check if they already exist before you index them. Based on the response, you can take various actions, such as rejecting the new document, adding a link to the existing document IDs, forcing user input, or overwriting the existing document.

The graph below shows how to efficiently scan before indexing a new doc. The idea is to analyze each image only once and to reuse the preprocessed JSON data throughout the workflow, which avoids repeated analysis steps.

graph LR
  B[Flow /analyze endpoint]
  C[Flow /duplicate endpoint]
  D[Flow /update endpoint]

  A <==>|1. Analyze image| B
  A <==>|2. Scan for duplicates | C
  A ==>|3. Index Json| D
  1. Analyze image
  2. Use preprocessed json to scan for duplicates (
  3. Use preprocessed json to index doc, if it does not exist
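
The control flow of these three steps can be sketched as follows. The three callables are hypothetical stand-ins for requests to the /analyze, /duplicate and /update endpoints. Their request formats are not shown here, so the example uses in-memory fakes and only demonstrates that the analysis result is produced once and reused.

```python
def index_if_new(image, analyze, find_duplicates, index):
    """Analyze once, scan for duplicates, and index only if none exist.

    The three callables are hypothetical stand-ins for requests to the
    /analyze, /duplicate and /update endpoints.
    Returns the list of duplicate ids (empty if the doc was indexed).
    """
    doc = analyze(image)               # 1. analyze the image once
    duplicates = find_duplicates(doc)  # 2. reuse the JSON for the scan
    if not duplicates:
        index(doc)                     # 3. reuse the JSON for indexing
    return duplicates


# In-memory fakes to show the control flow without a running server.
collection = {}


def fake_analyze(image):
    return {"id": image["id"], "fingerprint": image["pixels"]}


def fake_find_duplicates(doc):
    return [i for i, d in collection.items()
            if d["fingerprint"] == doc["fingerprint"]]


def fake_index(doc):
    collection[doc["id"]] = doc


first = index_if_new({"id": "a", "pixels": "xyz"},
                     fake_analyze, fake_find_duplicates, fake_index)
second = index_if_new({"id": "b", "pixels": "xyz"},
                      fake_analyze, fake_find_duplicates, fake_index)
print(first, second)
```

Returning the duplicate ids instead of raising an error leaves the choice of action (reject, link, ask the user, overwrite) to the caller.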

Use Cases


E-commerce

Finding near-duplicate images can help identify products that are being sold by multiple vendors, making it easier to manage inventory and ensure product authenticity.

Media and entertainment

Finding near-duplicate images can help identify copyright violations and protect intellectual property rights for content creators.

Stock photography

Automatically identify and reject uploads of images from content creators that were previously flagged and rejected.

Real estate

Identify fraudulent image uploads (fake listings) on the marketplace by matching them with verified users' images, and identify duplicate listings for the same home or property when they are imported from multiple sources.

Digital asset management

Prevent multiple uploads of the same image by different users, or associate different versions of an image. Dedupe existing image collections.