/duplicate endpoint
Use the /duplicate endpoint to scan the collection for duplicate images using external images or the images already in the collection.
Detect exact duplicates only or near-duplicates as well.
The examples below require the example dataset.
Single scan
Check whether image 115 has duplicates in the collection using the /duplicate endpoint.
http://localhost:8983/api/cores/my-collection/duplicate?rank.by.id=115
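If you prefer calling the endpoint from Python, a minimal sketch using the requests library (response fields as shown in the example responses on this page) could look like this:

import requests

FLOW_URL = "http://localhost:8983/api/cores/my-collection"

# Scan for duplicates of image 115
rsp = requests.get(f"{FLOW_URL}/duplicate", params={"rank.by.id": "115"}).json()
print(f"Found {rsp['response']['numFound']} duplicate candidates")
for doc in rsp["response"]["docs"]:
    print(doc["id"], doc.get("score"))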
Full scan
Scanning the complete collection by comparing all images against each other is more complex than a single scan. The process for a full scan is:
- Retrieve all image ids in the collection
- For each id, scan for duplicates in the collection
- If duplicate images are found, add them to a graph as nodes and connect them via edges
- Once all images have been processed, identify isolated groups of duplicates in the graph
- For each duplicate group, save the image ids as a single CSV line
Once the duplicates are detected, you can implement action policies such as deleting or merging images.
Since these are quite a few steps, we provide a Python snippet find-duplicates.py that you can use.
Executing the find-duplicates.py script scans your complete collection:
$ python3 find-duplicates.py
Init scan_mode=balanced based on 1565 docs in collection.
Set rank.approximate=False and rank.smartfilter=off
Process 1565 images. Start scanning (this may take a while)...
100%|█████████████████████████████████████| 1565/1565 [00:02<00:00, 619.61scans/s]
Found 50 duplicates
Save graph to file graph.json
Save csv to file duplicates.csv
Paste this snippet into a file find-duplicates.py and run the scan.
import requests
import sys
from concurrent.futures import ThreadPoolExecutor, as_completed
from tqdm import tqdm
import networkx as nx
import json
import csv


class DupeScanner:

    def __init__(self, flow_url="http://localhost:8983/api/cores/my-collection",
                 scan_mode="balanced", threshold=0.7):
        self.FLOW_URL = flow_url
        self.G = nx.Graph()
        self.threshold = threshold
        self.approx, self.filter = self.init_scan_mode(scan_mode)
        print(f"Set rank.approximate={self.approx} and rank.smartfilter={self.filter}")

    def init_scan_mode(self, scan_mode):
        # Choose rank.approximate and rank.smartfilter based on collection size
        num = self.num_docs()
        print(f"Init scan_mode={scan_mode} based on {num} docs in collection.")
        if scan_mode == "balanced":
            if num <= 10_000:
                return (False, "off")
            if num <= 100_000:
                return (False, "low")
            if num <= 1_000_000:
                return (True, "medium")
            return (True, "high")
        elif scan_mode == "speed":
            if num <= 100_000:
                return (False, "high")
            return (True, "high")
        else:
            raise ValueError("scan_mode must be 'balanced' or 'speed'.")

    def ids(self, limit=sys.maxsize):
        # Page through all doc ids in the collection using cursorMark
        ids = []
        cursor_mark = "*"
        done = False
        bulk = 1000
        count = 0
        while not done:
            rsp = requests.get(f"{self.FLOW_URL}/select?rows={bulk}&fl=id&sort=id asc&cursorMark={cursor_mark}").json()
            docs = rsp["response"]["docs"]
            ids.extend(list(map(lambda doc: doc["id"], docs)))
            count += bulk
            if cursor_mark == rsp["nextCursorMark"] or count >= limit:
                # no further docs
                done = True
            cursor_mark = rsp["nextCursorMark"]
        return ids if count <= limit else ids[:limit]

    def detect(self, id):
        # Query the /duplicate endpoint for one image and add matches to the graph
        rsp = requests.get(f"{self.FLOW_URL}/duplicate?rank.by.id={id}&fq=-id:{id}&fl=id,score,image&rows=50&rank.threshold={self.threshold}&rank.approximate={self.approx}&rank.smartfilter={self.filter}")
        dups = rsp.json()["response"]["docs"]
        # Filter irrelevant results - necessary when scanning approximately
        dups = self.remove_irrelevant_matches(dups)
        if len(dups) > 0:
            # Add query doc, since it is not part of the response
            self.G.add_node(id, image=self.get_image_url(id))
            for dup in dups:
                self.G.add_node(dup["id"], image=dup["image"])
                self.G.add_edge(id, dup["id"], weight=dup["score"])

    def num_docs(self):
        rsp = requests.get(f"{self.FLOW_URL}/select?q=*:*&rows=0")
        return rsp.json()["response"]["numFound"]

    def remove_irrelevant_matches(self, docs):
        new_list = []
        for doc in docs:
            if doc["score"] >= self.threshold:
                new_list.append(doc)
        return new_list

    def get_image_url(self, id):
        rsp = requests.get(f"{self.FLOW_URL}/select?q=id:{id}&fl=image")
        return rsp.json()["response"]["docs"][0]["image"]

    def scan(self, max=sys.maxsize, threads=4):
        # Build new graph
        self.G = nx.Graph()
        pool = ThreadPoolExecutor(threads)
        futures = []
        for id in self.ids(max):
            futures.append(pool.submit(self.detect, id))
        if not futures:
            print("Collection is empty.")
            return
        print(f"Process {len(futures)} images. Start scanning (this may take a while)...")
        errors = 0
        try:
            # Await completion and display progress
            progress = tqdm(as_completed(futures), total=len(futures), unit="scans", colour="green", smoothing=0)
            for f in progress:
                if f.exception() is not None:
                    # Silently count errors
                    errors += 1
        except KeyboardInterrupt:
            print("Abort scanning...")
            self.close_threadpool(futures, pool)
        if errors > 0:
            print(f"{errors}/{len(futures)} images could not be processed.")
        print(f"Found {self.G.number_of_nodes()} duplicates")

    def close_threadpool(self, futures, pool):
        for future in futures:
            future.cancel()
        pool.shutdown(wait=True)

    def save_graph(self, filename="graph.json"):
        # Convert NetworkX graph to Cytoscape JSON graph
        cyto_graph = nx.cytoscape_data(self.G)
        with open(filename, 'w') as f:
            print(f"Save graph to file {filename}")
            f.write(json.dumps(cyto_graph, ensure_ascii=False))

    def save_csv(self, filename="duplicates.csv"):
        # Each connected component in the graph is one group of duplicates
        with open(filename, 'w', encoding='UTF8') as f:
            print(f"Save csv to file {filename}")
            writer = csv.writer(f)
            for c in nx.connected_components(self.G):
                writer.writerow(c)


scanner = DupeScanner()
scanner.scan()
scanner.save_graph()
scanner.save_csv()
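Once duplicates.csv is written, you can apply an action policy to each duplicate group. The sketch below is a hypothetical example, assuming a Solr-style JSON delete via the /update endpoint (adapt it to the delete API of your deployment): it keeps the first id of every group and deletes the rest.

import csv
import requests

FLOW_URL = "http://localhost:8983/api/cores/my-collection"

with open("duplicates.csv", newline="") as f:
    for group in csv.reader(f):
        keep, remove = group[0], group[1:]
        if not remove:
            continue
        # Assumption: Solr-style JSON delete command accepted by /update
        rsp = requests.post(f"{FLOW_URL}/update?commit=true", json={"delete": remove})
        print(f"Kept {keep}, deleted {len(remove)} duplicates: HTTP {rsp.status_code}")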
Visualizing scan results
$ python3 show-graph.py
Load graph from file graph.json
Download and convert 50 images to data URIs...
Finished.
Dash is running on http://127.0.0.1:8050/
* Serving Flask app 'show_graph'
* Debug mode: on
Open your browser and inspect the detected duplicate clusters of the scan.
Paste this snippet into a file show-graph.py and visualize the results.
from dash import Dash, html
import json
import dash_cytoscape as cyto
from PIL import Image
import requests
from io import BytesIO
import base64


class GraphPreparation:

    def __init__(self, graph_file="graph.json"):
        cyto_graph = None
        print(f"Load graph from file {graph_file}")
        with open(graph_file, 'r') as f:
            cyto_graph = json.load(f)
        self.edges = cyto_graph["elements"]["edges"]
        self.nodes = cyto_graph["elements"]["nodes"]
        # Avoid CORS issues by embedding thumbnail images
        self.embed_imgs_as_data_uris()
        self.style_edge_width()

    def style_edge_width(self):
        # Map the duplicate score to edge opacity and width
        for edge in self.edges:
            weighted = max(0.1, edge['data']['weight'] - 0.5)
            edge['data']['opacity'] = weighted
            edge['data']['weight'] = weighted * 20

    def embed_imgs_as_data_uris(self):
        print(f"Download and convert {len(self.nodes)} images to data URIs...")
        for node in self.nodes:
            url = node['data']['image']
            # Download the image, create a thumbnail and convert it to a data URI
            img = Image.open(BytesIO(requests.get(url).content))
            img.thumbnail((150, 150))
            data = BytesIO()
            img.save(data, "JPEG")
            data64 = base64.b64encode(data.getvalue())
            node['data']['image'] = 'data:image/jpeg;base64,' + data64.decode('utf-8')
        print("Finished.")

    def get_cytoscape_ui(self):
        return cyto.Cytoscape(
            id='Detected duplicates',
            responsive=True,
            layout={'name': 'cose',
                    'animate': False,
                    'fit': True,
                    'nodeOverlap': 80,
                    'numIter': 500
                    },
            style={'width': '100%', 'height': '100%', 'position': 'absolute'},
            stylesheet=[
                {
                    'selector': 'node',
                    'style': {
                        'content': 'data(name)',
                        'background-image': 'data(image)',
                        'background-fit': 'contain',
                        'background-opacity': '0.1',
                        'width': '150px',
                        'height': '150px',
                        'shape': 'rectangle'
                    }
                },
                {
                    'selector': 'edge',
                    'style': {
                        'opacity': 'data(opacity)',
                        'width': 'data(weight)',
                    }
                }],
            elements=self.nodes + self.edges
        )


if __name__ == '__main__':
    app = Dash(__name__)
    graph = GraphPreparation()
    app.layout = html.Div([graph.get_cytoscape_ui()])
    app.run_server(debug=True)
Partial scanning
Sometimes you want to search for duplicates only in a subset of the collection.
To restrict the scan, just add one or more filter queries (fq) to your query.
You may include only docs that are tagged with a certain term (e.g. fq=category:project-a) or exclude them and search in all projects but project-a by setting fq=-category:project-a. You can also search in specific date ranges.
This may be useful either to speed up the scan by reducing the number of docs that have to be scored, or to narrow down where to search.
The query below scans for door images that are duplicates of image id 1041.
Furthermore, the query image itself is excluded from search.
http://localhost:8983/api/cores/my-collection/duplicate?
rank.by.id=1041
&fq=labels:door
&fq=-id:1041
One duplicate image is found.
{
"responseHeader":{
"status":0,
"QTime":6},
"response":{"numFound":1,"start":0,"maxScore":0.7436813,"numFoundExact":true,"docs":[
{
"id":"1051",
"image":"https://docs.pixolution.org/assets/imgs/example-dataset/door/photo-1581613856477-f65208436a48.jpeg",
"score":0.7436813}]
}
}
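The same query can be built in Python by passing the repeated fq parameters as a list; a minimal sketch using the requests library:

import requests

FLOW_URL = "http://localhost:8983/api/cores/my-collection"

params = {
    "rank.by.id": "1041",
    "fq": ["labels:door", "-id:1041"],  # repeated filter queries
}
docs = requests.get(f"{FLOW_URL}/duplicate", params=params).json()["response"]["docs"]
for doc in docs:
    print(doc["id"], doc["score"], doc["image"])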
Detection sensitivity
Depending on your use case you may have various definitions of what a duplicate actually is. With the rank.threshold parameter you can set the detection sensitivity and control whether to retrieve only exact duplicates or near-duplicates as well.
Exact duplicates may have a different scale, file format or compression artifacts but basically encode the same image content.
Near-duplicates may additionally include manipulations of the image content, such as cropping, a different aspect ratio, changes to brightness, gamma and saturation, or added decorations like text, logos or icons.
The threshold depends on your image content as well as your definition of a duplicate. As a guideline you may set the following rank.threshold values based on your use case:
- Exact duplicates only: 0.95
- Near-duplicates: 0.7
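Applied to the single scan from above, the sensitivity is controlled by adding rank.threshold to the query. A minimal sketch comparing both guideline values for image 115:

import requests

FLOW_URL = "http://localhost:8983/api/cores/my-collection"

for threshold in (0.95, 0.7):  # exact duplicates only vs. near-duplicates
    rsp = requests.get(f"{FLOW_URL}/duplicate",
                       params={"rank.by.id": "115", "rank.threshold": threshold}).json()
    print(f"rank.threshold={threshold}: {rsp['response']['numFound']} matches")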
Scanning huge collections
Scanning a complete collection is a time-consuming process which has to be done offline. Depending on the hardware and collection size it can take hours to process. Most often, this is only necessary once to initially de-dupe an existing dataset. Subsequent operations should then scan before adding new images to the collection to keep the dataset clean.
Reducing scan time is always a trade-off between time and detection quality.
We propose two scan modes, balanced and speed, which both suggest different parameter values for rank.approximate and rank.smartfilter based on the collection size.
The Python snippet find-duplicates.py implements both scanning modes (see full scan).
Balanced mode:
- up to 10,000 docs use rank.approximate=false&rank.smartfilter=off
- up to 100,000 docs use rank.approximate=false&rank.smartfilter=low
- up to 1,000,000 docs use rank.approximate=true&rank.smartfilter=medium
- from 1,000,000 docs use rank.approximate=true&rank.smartfilter=high

Speed mode:
- up to 100,000 docs use rank.approximate=false&rank.smartfilter=high
- from 100,000 docs use rank.approximate=true&rank.smartfilter=high
When using rank.approximate=true you have to manually remove irrelevant matches from the result set that do not meet the required score threshold. The provided find-duplicates.py snippet does this for you automatically.
Maintain a clean image collection
To keep your collection free of duplicates, scan new images to check if they already exist before you index them. Based on the response, you can take various actions, such as rejecting the new document, adding a link to the existing document IDs, forcing user input, or overwriting the existing document.
The graph below shows how to efficiently scan before indexing a new doc. The idea is to analyze an image only once and to reuse the preprocessed JSON data throughout the workflow, avoiding repeated analysis steps.
graph LR
A[Client]
B[Flow /analyze endpoint]
C[Flow /duplicate endpoint]
D[Flow /update endpoint]
A <==>|1. Analyze image| B
A <==>|2. Scan for duplicates | C
A ==>|3. Index Json| D
- Analyze the image
- Use the preprocessed JSON to scan for duplicates (rank.by.preprocessed)
- Use the preprocessed JSON to index the doc if it does not exist (see the sketch below)
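A rough sketch of this scan-before-index workflow is shown below. The request and response formats of the /analyze and /update endpoints are assumed here for illustration only (see the documentation of those endpoints for the exact parameters); the duplicate check reuses the preprocessed JSON via rank.by.preprocessed.

import requests

FLOW_URL = "http://localhost:8983/api/cores/my-collection"

def index_if_new(doc_id, image_url):
    # 1. Analyze the image once (assumption: /analyze takes an image URL and returns preprocessed JSON)
    preprocessed = requests.get(f"{FLOW_URL}/analyze", params={"image": image_url}).json()

    # 2. Scan for duplicates, reusing the preprocessed JSON instead of analyzing again
    #    (assumption: the JSON is sent in the request body together with rank.by.preprocessed)
    rsp = requests.post(f"{FLOW_URL}/duplicate",
                        params={"rank.by.preprocessed": True, "rank.threshold": 0.95},
                        json=preprocessed).json()
    if rsp["response"]["numFound"] > 0:
        print(f"Rejecting {doc_id}, duplicates found:",
              [d["id"] for d in rsp["response"]["docs"]])
        return False

    # 3. Index the doc with the already preprocessed data
    #    (assumption: Solr-style JSON add via the /update endpoint)
    requests.post(f"{FLOW_URL}/update?commit=true",
                  json=[{"id": doc_id, "image": image_url, **preprocessed}])
    return True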
Use Cases
E-commerce
Finding near-duplicate images can help identify products that are being sold by multiple vendors, making it easier to manage inventory and ensure product authenticity.
Media and entertainment
Finding near-duplicate images can help identify copyright violations and protect intellectual property rights for content creators.
Stock photography
Automatically identify and reject uploads of images that have previously been flagged and rejected.
Real estate
Identify fraudulent image uploads (fake listings) on the marketplace by matching them with verified users' images, and identify duplicate listings for the same home or property when imported from multiple sources.
Digital asset management
Prevent multiple uploads by different users or associate different versions of an image. De-duplicate an existing image collection.