URL encoding

pixolution flow processes URLs as references to image resources when indexing new images (add) and doing visual similarity searches (query by URL).

A URL may consist of reserved, unreserved and other characters (for further information, read Percent-encoding and RFC 3986). Reserved characters have a special meaning (e.g. they separate the different parts of a URL) and must be percent-encoded if they should not carry that special meaning (e.g. when they are part of a file or folder name):

! # $ % & ' ( ) * + , / : ; = ? @ [ ]

Unreserved characters with no special meaning (no need to percent-encode):

a b c d e f g h i j k l m n o p q r s t u v w x y z
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
0 1 2 3 4 5 6 7 8 9 - _ . ~

All other characters must always be percent-encoded. These are all characters that are neither reserved nor unreserved (see above).
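The three categories can be sketched as a small lookup. This is only an illustration: the two sets are copied from the lists above, and the helper name `classify` is hypothetical, not part of pixolution flow.

```python
import string

# Character sets copied from the lists above.
RESERVED = set("!#$%&'()*+,/:;=?@[]")
UNRESERVED = set(string.ascii_letters + string.digits + "-_.~")


def classify(ch: str) -> str:
    """Return 'reserved', 'unreserved' or 'other' for a single character."""
    if ch in RESERVED:
        return "reserved"
    if ch in UNRESERVED:
        return "unreserved"
    return "other"


# A letter, a path separator, a space and a non-ASCII character.
categories = {ch: classify(ch) for ch in "a/ ä"}
```

Characters that land in "other", such as the space and the ä, are the ones that always need percent-encoding.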

Percent-encoding must use UTF-8, since pixolution flow decodes the URL with UTF-8 internally.
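Why the charset matters can be illustrated with Python's standard `urllib.parse` module. The snippet is a sketch that uses Python's `unquote` as a stand-in for a UTF-8 decoder; it does not show pixolution flow's actual implementation.

```python
from urllib.parse import quote, unquote

# The same character encoded with two different charsets.
utf8_form = quote("ä", encoding="utf-8")      # two bytes: C3 A4
latin1_form = quote("ä", encoding="latin-1")  # one byte: E4

# A UTF-8 decoder only recovers the character from the UTF-8 form.
decoded_utf8 = unquote(utf8_form, encoding="utf-8")
# The lone byte E4 is invalid UTF-8 and becomes U+FFFD (replacement character).
decoded_latin1 = unquote(latin1_form, encoding="utf-8")
```

A URL encoded with a charset other than UTF-8 would therefore no longer point to the intended file name once it is decoded as UTF-8.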

For example, nasty URLs like:

http://localhost:1234/a b/a folder/=ä@>d0>8p?! (pk..

can be used as long as they are properly UTF-8 percent-encoded. Below is the same URL in percent-encoded form (note that the / character is not encoded, since it indicates the folder hierarchy and therefore has a special meaning in the URL):

http://localhost:1234/a%20b/a%20folder/%3D%C3%A4%40%3Ed0%3E8p%3F%21%20%28pk..
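One way to produce this form is to leave the scheme and host untouched and percent-encode only the path, keeping `/` as the separator. The following is a minimal sketch using Python's standard `urllib.parse.quote`; the helper name `encode_image_url` and the split into `origin` and `raw_path` are assumptions made for illustration.

```python
from urllib.parse import quote


def encode_image_url(origin: str, raw_path: str) -> str:
    """Percent-encode the path of an image URL with UTF-8.

    origin   -- scheme, host and port, e.g. "http://localhost:1234" (left as-is)
    raw_path -- unencoded path; "/" is kept because it separates folders
    """
    return origin + quote(raw_path, safe="/", encoding="utf-8")


encoded = encode_image_url(
    "http://localhost:1234",
    "/a b/a folder/=ä@>d0>8p?! (pk..",
)
```

The path is passed separately instead of parsing the full raw URL, because a parser would treat the unencoded `?` in the file name as the start of a query string.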

To use a URL as the value of rank.by=url: within a GET request, you need to percent-encode the complete URL a second time, after you have percent-encoded the individual characters as described above. This is necessary because the GET request is itself a URL, so the special meaning of the characters in your rank.by=url: parameter must be escaped. The parameter should only be interpreted as a URL once it has reached Solr. If you are using a client API library such as SolrJ, you do not need to encode the complete URL yourself; SolrJ takes care of the correct encoding when sending the parameter. Test your implementation with some nasty URLs to make sure you percent-encode neither too often nor too few times. The three lines below show a raw URL, the same URL with its characters percent-encoded, and the fully encoded form to use as a GET parameter value:

http://localhost:1234/folder/my image.jpg
http://localhost:1234/folder/my%20image.jpg
http%3A%2F%2Flocalhost%3A1234%2Ffolder%2Fmy%2520image.jpg
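The two encoding passes for a GET parameter can be reproduced with Python's standard library. This is a sketch under the assumption that the parameter value is assembled by hand; the variable names are illustrative, and a client library may already perform the second pass for you.

```python
from urllib.parse import quote, unquote

raw_url = "http://localhost:1234/folder/my image.jpg"

# Pass 1: encode the characters of the path, keeping "/" as separator.
step1 = "http://localhost:1234" + quote("/folder/my image.jpg", safe="/")

# Pass 2: encode the complete URL, including ":", "/" and the "%" from pass 1.
step2 = quote(step1, safe="")

# Round-trip check: decoding once yields the pass-1 URL,
# decoding twice yields the raw URL again.
once = unquote(step2)
twice = unquote(once)
```

A round-trip check like this is a cheap way to detect a missing or an extra encoding pass when testing with nasty URLs.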

Add requests using the update handler are sent via the POST method, so you do not need to percent-encode the complete URL in those cases.

In short:

  1. Always UTF-8 percent-encode all chars that are either in
    • category other or
    • category reserved but should not have a special meaning, e.g. because they are part of a filename.
  2. When using URLs as the rank.by parameter via a GET request, you additionally have to percent-encode the complete URL. Client API libraries (e.g. SolrJ) may do this for you.