Configuration

Keep the Solr index in sync with your search system so that every image ID exists in both systems and refers to the same image (ID 1 in your system references the same image as ID 1 in the Solr index). Whenever you add or delete images in your search system, add or delete them in the Solr index as well.

No distributed support

The Universal Connector does not support a distributed Solr environment and works only with a single-shard index.

The Universal Connector consists of the following parts:

  • UniversalConnectorHandler is the endpoint. It parses the image IDs provided by your existing search system.
  • UniversalConnectorComponent converts the given image ID list to an internal Solr filter, so that Solr returns only images that are part of the given list.
  • Optional cache for fast internal docId lookup. A DocIdCacheWarmer auto-warms the cache at Solr startup, and a DocIdCacheRegenerator regenerates the cache when commits to the index are made.

UniversalConnectorHandler

The UniversalConnectorHandler supports content streams in queries. It parses the IDs from the POST request body as a stream and builds an internal data structure that is used by the UniversalConnectorComponent.

The example below shows the configuration of the UniversalConnectorHandler as a request handler. Additionally, the UniversalConnectorComponent is referenced by its name to enable the processing of the parsed IDs.

Configure Universal Connector endpoint with the name /subset:

<requestHandler name="/subset"
class="de.pixolution.solr.handler.component.UniversalConnectorHandler">
  <arr name="first-components">
    <str>pixolutionComponent</str>
    <str>universalConnectorComponent</str>
  </arr>
</requestHandler>

UniversalConnectorComponent

The UniversalConnectorComponent creates a filter from the parsed image IDs so that the response contains only images whose IDs are part of the given ID set.

The next example shows a configuration of the UniversalConnectorComponent within the query section of solrconfig.xml.

Configure UniversalConnectorComponent with an associated cache:

<searchComponent name="universalConnectorComponent"
class="de.pixolution.solr.handler.component.UniversalConnectorComponent">
  <str name="cache.name">universalConnectorCache</str>
</searchComponent>

cache.name

Optional parameter. Name of the cache that is used to store and look up image ID to docID mappings. See the next section for how to configure a cache. Using a cache is highly recommended.

DocIdCacheRegenerator

The configured cache is used by the UniversalConnectorComponent to look up the docIds that Solr uses internally for the image IDs that were sent in a request. The cache is therefore filled with mappings of image IDs to internal docIds.

Since the index lookup of docIds is fairly slow, the cache is vital to speed up this process.

Cache configuration with DocIdCacheRegenerator as cache regenerator:

<cache name="universalConnectorCache"
  class="solr.LRUCache"
  size="100000"
  initialSize="100000"
  autowarmCount="100%"
  regenerator="de.pixolution.solr.search.DocIdCacheRegenerator"/>

All cache parameters are standard Solr parameters. They are explained below with regard to their use with the Universal Connector.

name

Name of the cache. UniversalConnectorComponent and DocIdCacheWarmer use it to identify the correct cache.

class

The actual type of cache. All Solr cache implementations may be used as cache.

size

Maximum number of cached image ID to docID mappings. The higher this value, the better the performance, at the expense of RAM consumption. For best performance it is recommended to cache the docIds of your complete collection. If you have 100 000 images in your collection, set the size to 100 000, or higher if your collection will grow over time.

Boost search performance

We recommend caching all image ID to docID mappings. This consumes a lot of RAM but speeds up searches considerably. Looking up internal docIDs can account for up to 95% of the total query processing time, so caching all image ID to docID mappings can speed up queries by a factor of 20. To do so, set the size param to the number of documents in your index, or higher if you will add documents to the index in the future.

RAM usage

The RAM consumed by the cache depends on the number of cached elements and the number of concurrently opened searchers (see Performance hints when using Universal Connector). As a rough guide, reserve the following RAM: 1 million elements = 500 MB, 5 million elements = 2.5 GB, 10 million elements = 5 GB, and so on. When setting the JVM heap size with the -Xmx param, add further headroom for the remaining configuration in solrconfig.xml.
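
As a rough sizing sketch based on the figures above, a cache for a collection of about 1 million images might look like the following. The numbers are illustrative assumptions, not measured values. The RAM estimates in the comments simply apply the rule of thumb of 500 MB per million elements.

<!-- Sketch only: sized for roughly 1 million images, with headroom for growth -->
<cache name="universalConnectorCache"
  class="solr.LRUCache"
  size="1200000"
  initialSize="1200000"
  autowarmCount="100%"
  regenerator="de.pixolution.solr.search.DocIdCacheRegenerator"/>
<!-- Estimated RAM when full: 1.2 million elements x ~500 MB per million = ~600 MB.
     While a new cache is being regenerated, the old and the new cache coexist,
     so plan for roughly twice that amount. -->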

initialSize

The initial cache size. The cache can grow up to the maximum given by size. This value does not mean that the cache is initially filled. If you fill the complete cache when auto warming, set initialSize to the same value as size.

autowarmCount

When an index change is made visible (commit or optimize), the old cache is invalidated and a new one is created. With autowarmCount you define how many items of the old cache are regenerated into the new cache. If set to 100%, all elements of the old cache are regenerated. This parameter accepts absolute numbers as well as percentages. The higher this value, the longer the regeneration takes and the later index changes become visible.

Changes to the index become visible only after regeneration has finished. During regeneration two caches exist for a short time and together consume twice as much RAM: the old cache, which still serves requests, and the new cache, which is being filled.
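
Because autowarmCount accepts either an absolute number or a percentage, the amount of regeneration work per commit can be capped. The following sketch uses illustrative values: it keeps a large cache but regenerates only 50 000 entries after each commit. Mappings that are not regenerated would presumably have to be looked up in the index again when they are next requested.

<!-- Sketch only: absolute autowarmCount instead of a percentage -->
<cache name="universalConnectorCache"
  class="solr.LRUCache"
  size="1000000"
  initialSize="1000000"
  autowarmCount="50000"
  regenerator="de.pixolution.solr.search.DocIdCacheRegenerator"/>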

Commit strategy

Keep in mind that every commit or optimize invalidates the old cache and triggers the regeneration of a new one. See Performance hints when using Universal Connector for a suitable commit strategy.

regenerator

The class DocIdCacheRegenerator must be used in order to fill the new cache with the data expected by UniversalConnectorComponent. The DocIdCacheRegenerator looks up the new internal docIds, which might have changed due to an index change, and renews the image ID to docID mappings, up to autowarmCount elements. The DocIdCacheRegenerator is called internally by the cache implementation.

DocIdCacheWarmer

To auto-warm the cache when Solr starts, you can configure a DocIdCacheWarmer that fills the associated cache with the current image ID to docID mappings without user interaction.

In the example below the DocIdCacheWarmer is configured as an event listener that is triggered when the event firstSearcher is fired. This event is fired once, when Solr starts up.

DocIdCacheWarmer as event listener for the firstSearcher event:

<listener event="firstSearcher"
class="de.pixolution.solr.schema.DocIdCacheWarmer">
  <str name="cache.name">universalConnectorCache</str>
  <int name="cache.size">100000</int>
</listener>

The DocIdCacheRegenerator warms caches after index changes, whereas the DocIdCacheWarmer warms a cache once at startup.

Avoid double warming

If the event type is set to event="newSearcher", the DocIdCacheWarmer warms the cache after every index change. If you have also configured a DocIdCacheRegenerator, the cache is then warmed and regenerated twice, causing higher RAM consumption and longer warming times. Avoid this behaviour by using the DocIdCacheWarmer only as a startup cache warmer with event="firstSearcher".

cache.name

Mandatory parameter. The name of the associated cache that should be warmed. This must be the cache name configured above.

cache.size

Optional parameter. Number of elements to warm. If not set, the complete index is iterated and all IDs are cached. If the cache is not big enough, already cached elements are overwritten. Therefore this value should not be greater than the size of the cache.

It is recommended to fill the complete cache. Although this will slow down Solr startup, even first queries can benefit from cache lookup.
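
The cache and the warmer have to agree on the cache name, and cache.size should not exceed the size of the cache. A minimal sketch of both declarations side by side, assuming they are placed in the <query> section of solrconfig.xml as in the stock Solr example configuration (the values are illustrative):

<query>
  <!-- User cache holding the image ID to docID mappings -->
  <cache name="universalConnectorCache"
    class="solr.LRUCache"
    size="100000"
    initialSize="100000"
    autowarmCount="100%"
    regenerator="de.pixolution.solr.search.DocIdCacheRegenerator"/>

  <!-- Warms the cache once at startup. cache.name matches the cache above,
       and cache.size is not greater than the size attribute of the cache. -->
  <listener event="firstSearcher"
    class="de.pixolution.solr.schema.DocIdCacheWarmer">
    <str name="cache.name">universalConnectorCache</str>
    <int name="cache.size">100000</int>
  </listener>
</query>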