Duplicate Detection

pixolution flow provides a dedicated request handler, the DuplicateHandler, for detecting duplicates of a given sample image. The DuplicateHandler ships with presets optimized for duplicate image detection.

Configuration

The DuplicateHandler uses the following components which have to be configured in the solrconfig.xml:

The DuplicateHandler has an auto-lookup mechanism that looks up and registers the above components automatically. The configuration of the DuplicateHandler request handler is therefore a one-liner:

<requestHandler name="/duplicate" class="de.pixolution.solr.handler.component.DuplicateHandler" />

dup.find

Necessity: mandatory
Value: id:123 | url:ENCODED_URL | descriptor:PIXOLUTION_DESCRIPTOR

You can search for duplicates of an image that is already indexed, referenced by its id. Alternatively, you can use an external image as search input, either referenced by URL or given as an image descriptor. A URL must be URL-encoded.

An example to check whether the given image duplicate.jpg is already indexed:

/duplicate?dup.find=url:http%3A%2F%2Fwebsite.org%2Fduplicate.jpg
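The URL passed to dup.find has to be percent-encoded, as in the request above. A minimal sketch of how a client might build that request path in Python, using only the standard library; the function name build_duplicate_query and the handler_path parameter are illustrative and not part of pixolution flow:

```python
from urllib.parse import quote


def build_duplicate_query(image_url: str, handler_path: str = "/duplicate") -> str:
    """Build the request path for a dup.find lookup by image URL.

    build_duplicate_query is a hypothetical helper for illustration only.
    """
    # safe="" makes quote() percent-encode ':' and '/' as well,
    # which matches the encoded URL shown in the example request.
    encoded = quote(image_url, safe="")
    return f"{handler_path}?dup.find=url:{encoded}"


query = build_duplicate_query("http://website.org/duplicate.jpg")
print(query)
```

The resulting path is appended to the base URL of your Solr core or collection.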

Response format:

The response is in the standard Solr format. By default, only the id field and the calculated score are returned. To return different fields, set the fl parameter accordingly. The score is a floating-point number between 0.0 and 1.0 that describes how similar each found duplicate is to the input image.

Example response with two found duplicate images:

{
  "responseHeader": {
    "status": 0,
    "QTime": 0
  },
  "response": {"numFound": 2, "start": 0, "maxScore": 1, "docs": [
    {
      "id": "431",
      "score": 1},
    {
      "id": "62",
      "score": 0.9357}]
  }
}
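A short sketch of how a client could read such a response in Python. It assumes the JSON layout shown above with the default fields id and score; the helper name extract_duplicates is illustrative:

```python
import json

# The example response from above, embedded as a string for illustration.
SAMPLE_RESPONSE = """
{
  "responseHeader": {"status": 0, "QTime": 0},
  "response": {"numFound": 2, "start": 0, "maxScore": 1, "docs": [
    {"id": "431", "score": 1},
    {"id": "62", "score": 0.9357}
  ]}
}
"""


def extract_duplicates(raw_json: str):
    """Return a list of (id, score) tuples from a duplicate response.

    Hypothetical helper; assumes the default fields id and score are present.
    """
    body = json.loads(raw_json)
    docs = body["response"]["docs"]
    return [(doc["id"], float(doc["score"])) for doc in docs]


duplicates = extract_duplicates(SAMPLE_RESPONSE)
for doc_id, score in duplicates:
    print(doc_id, score)
```

If you request other fields via fl, the per-document dictionaries contain those fields instead, and the helper would need to be adapted.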

dup.threshold

Necessity: optional
Value: 0.0 to 1.0
Default: 0.9

With dup.threshold you can set the sensitivity of the duplicate detection and control how strict or lax the detection should be.

With a value close to 1 you will only retrieve exact duplicates (if the same physical file was indexed more than once). With lower values, variants of the image, for example with a different encoding quality, are returned as well.

Example to check whether the given image was indexed more than once (the same file):

/duplicate?dup.find=id:123&dup.threshold=0.99
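A sketch of building this request on the client side, including a check that the threshold lies in the documented range of 0.0 to 1.0. The function name duplicate_query_by_id is illustrative, and the client-side validation is an assumption for convenience, not a description of how the handler itself treats out-of-range values:

```python
from urllib.parse import urlencode


def duplicate_query_by_id(doc_id: str, threshold: float = 0.9) -> str:
    """Build a duplicate lookup path for an already indexed image.

    Hypothetical helper; the default of 0.9 mirrors the documented default.
    """
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("dup.threshold must be between 0.0 and 1.0")
    params = {"dup.find": f"id:{doc_id}", "dup.threshold": threshold}
    # safe=":" leaves the colon in "id:123" unencoded, as in the example above.
    return "/duplicate?" + urlencode(params, safe=":")


strict_query = duplicate_query_by_id("123", threshold=0.99)
print(strict_query)
```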