Published August 2022
OpenSearch includes a plugin for vector search. In this post, we introduce vector search and compare the different methods available.
We will also point you in the right direction for example code. For personalized help, contact us to learn more about our OpenSearch support services.
What is vector search?
Here’s the ELI5 version:
Vector search compares data points to identify the most similar results.
For a more textbook definition:
Vector search finds results quickly and accurately by using the k-nearest neighbor (k-NN) algorithm to identify the nearest neighbors to a query point in a vector space. We will define vector space below.
Vector search is also sometimes referred to as vector similarity search.
What is vector search used for?
The classic use case for vector search is identifying similar products for a customer.
For instance, when a customer searches for an item in an online store, similar items appear alongside it.
Another example is video streaming services, which show users shows similar to the one(s) they're already watching.
What is k-NN?
Vector search for OpenSearch uses k-Nearest Neighbor (k-NN). k-NN is an algorithm that searches for the most similar entries to a query across a database.
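To make the idea concrete, here is a minimal brute-force k-NN sketch in plain Python. The item names and 2-D vectors are toy data, not anything from OpenSearch; a real engine works over high-dimensional vectors and much larger collections.

```python
import math

def knn(query, vectors, k):
    """Return the names of the k entries nearest to `query` by Euclidean distance."""
    scored = sorted(
        vectors.items(),
        key=lambda item: math.dist(query, item[1]),
    )
    return [name for name, _ in scored[:k]]

# Toy 2-D "database" of item vectors.
items = {
    "a": (0.0, 0.0),
    "b": (1.0, 1.0),
    "c": (5.0, 5.0),
}

print(knn((0.9, 0.9), items, 2))  # the two items closest to the query
```

Exhaustively sorting every entry like this is exact but scales linearly with the database size, which is why the plugin also offers an approximate method.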
What is vector space?
Each item in a database can have many parameters. For instance, text-based documents will likely contain the following parameters:
- List of words they contain
- Date collected
- Name of sender and receiver
Each of these parameters is a dimension within the vector space.
Each dimension is a point of classification for the stored data. And vector search uses these dimensions/parameters to identify nearest neighbors.
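As a simple sketch of how one text parameter can become vector dimensions, here is a bag-of-words encoding: each vocabulary word is one dimension, and a document's coordinate along that dimension is how often the word appears. The tiny vocabulary is an assumption for illustration; real systems use learned embeddings or much larger vocabularies.

```python
vocabulary = ["search", "vector", "opensearch"]  # one dimension per word

def to_vector(text):
    """Map a document to a point in a 3-dimensional vector space."""
    words = text.lower().split()
    return [words.count(term) for term in vocabulary]

doc = "OpenSearch vector search uses vector math"
print(to_vector(doc))  # [1, 2, 1]
```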
OpenSearch Plugin for Vector Search
OpenSearch has a vector search plugin included with the installation. The plugin provides three methods for obtaining nearest neighbors within your data set, which lets you balance speed against accuracy for your workload.
Approximate k-NN. This method is ideal for large indices (100k vectors or more). It is low-latency and scalable.
However, that speed comes at the cost of some search accuracy, and this method does not support pre-filtering.
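The request bodies below sketch what an Approximate k-NN setup looks like, following the shapes documented for the OpenSearch k-NN plugin. The field name "my_vector", the dimension, and the query vector are placeholders; check the plugin docs for your OpenSearch version before using them.

```python
import json

# Index mapping: enable k-NN on the index and declare a vector field.
index_body = {
    "settings": {"index": {"knn": True}},
    "mappings": {
        "properties": {
            "my_vector": {"type": "knn_vector", "dimension": 2}
        }
    },
}

# Search body: ask for the 3 approximate nearest neighbors to a query vector.
query_body = {
    "size": 3,
    "query": {
        "knn": {
            "my_vector": {"vector": [0.5, 0.5], "k": 3}
        }
    },
}

print(json.dumps(query_body))
```

You would send these bodies to the cluster with any HTTP client (index creation via PUT, search via the `_search` endpoint).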
Script Score k-NN. Script Score is a brute-force, exact k-NN search: the algorithm exhaustively compares the query against every candidate vector. This approach is more accurate than Approximate k-NN, but it results in high latencies on large datasets.
Script Score allows you to search on a subset of your vectors, also known as pre-filter search. And it can also support binary fields.
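Here is a sketch of a Script Score k-NN request body with a pre-filter, based on the plugin's documented script score with `lang` set to `knn`. The field name "my_vector" and the `term` filter on "category" are placeholder examples, not values from this post.

```python
script_score_query = {
    "size": 3,
    "query": {
        "script_score": {
            # Pre-filter: only documents matching this query are scored,
            # so the exact k-NN search runs on a subset of the vectors.
            "query": {"term": {"category": "shoes"}},
            "script": {
                "source": "knn_score",
                "lang": "knn",
                "params": {
                    "field": "my_vector",
                    "query_value": [0.5, 0.5],
                    "space_type": "l2",  # l2 = Euclidean distance
                },
            },
        }
    },
}

print(script_score_query["query"]["script_score"]["script"]["params"]["space_type"])
```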
Painless Extensions. This is a custom method for complex use cases. As with Script Score, Painless Extensions is a brute force, exact k-NN search. And it allows for pre-filtering.
The Painless Extensions method also supports the use of distance functions. There are many different methods for calculating the distance between two points. The choice of distance function can improve classification accuracy.
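As an illustration, the query body below uses one of the plugin's documented Painless distance functions, `l2Squared` (squared Euclidean distance); `l1Norm` (Manhattan) and `cosineSimilarity` are also documented. "my_vector" is a placeholder field name, and the score formula is just one common choice.

```python
painless_query = {
    "size": 3,
    "query": {
        "script_score": {
            "query": {"match_all": {}},
            "script": {
                # Invert the distance so that closer vectors score higher;
                # adding 1 avoids division by zero on an exact match.
                "source": "1 / (1 + l2Squared(params.query_value, doc['my_vector']))",
                "params": {"query_value": [0.5, 0.5]},
            },
        }
    },
}
```

Swapping `l2Squared` for `l1Norm` in the script source changes the distance function, which (as shown next) can change which neighbors rank nearest.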
Let’s illustrate how a distance function can impact nearest neighbor determination. We will consider the Euclidean distance and Manhattan distance for this demonstration.
The figure below shows two methods for measuring distance between points A and B.
Euclidean distance is the length of the straight (diagonal) line between two points. Manhattan (or taxicab) distance is the sum of the absolute differences along each axis, as if traveling along a city grid.
Running k-NN with Euclidean versus Manhattan distance can return different results, because the two measures can rank neighbors differently.
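A small sketch with made-up points makes this concrete: for the same query, the Euclidean nearest neighbor is one candidate while the Manhattan nearest neighbor is another.

```python
import math

def euclidean(a, b):
    """Straight-line distance."""
    return math.dist(a, b)

def manhattan(a, b):
    """Sum of absolute per-axis differences (taxicab distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))

query = (0, 0)
candidates = {"p1": (2, 2), "p2": (0, 3)}

def nearest(metric):
    return min(candidates, key=lambda name: metric(query, candidates[name]))

print(nearest(euclidean))  # p1: sqrt(8) ~= 2.83 beats 3.0
print(nearest(manhattan))  # p2: 3 beats 4
```

Here p1 is closer in the Euclidean sense (2.83 vs 3.0) but farther in the Manhattan sense (4 vs 3), so the two metrics disagree on the nearest neighbor.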