How to Index OpenSearch

An OpenSearch Index is used to both organize and distribute data within a cluster.

In this post we define both roles of an Index and outline how to create, add to, delete, and reindex Indices in OpenSearch. We will also touch on querying, which is covered in more depth in our article How to Query OpenSearch.

The role of the OpenSearch Index can be described in two ways. First, an Index is a mechanism for organizing data in a user-prescribed way. Second, an Index is a way to allocate data within an OpenSearch cluster.

Let’s first dive into how Indices are used to organize data.

Using Indices to Organize Data in OpenSearch

When we discuss how Indices are used to organize data in OpenSearch, we can think of an Index as being akin to a table in a relational database like MySQL. In the same way, OpenSearch stores Documents* (rows) that each have Properties (columns). Each Property can also be thought of as a key in key-value pair terminology.

Properties fall into one of several types, each with its own capabilities. For instance, you can sum or average a Property that is a float but not a Property that holds text.

*A quick note on capitalization. In this article we capitalize Index, Document, and Property throughout. We felt it was helpful to readers to see these terms capitalized to distinguish them from their everyday meanings.

Let’s run through a short example. A retail company would likely have a purchases Index. Then within the Index there is a Document (row) for each individual purchase. And then within each purchase Document there are Properties (columns) such as date/time of purchase, item purchased, cost, etc.

We have more posts in the works that cover OpenSearch querying in depth, but for now let’s quickly look at how we would query this purchases Index. In the OpenSearch Dashboards Dev Console you can use the following code:

GET /purchases/_search
{
  "query": {
    "match": {
      "item_name": "New Balance 574"
    }
  }
}

The top line of the query starts with “GET”, which is the HTTP method used to retrieve data. Next, the Index is specified, in this case “purchases”. Finally, the operation is included. Here the operation is “_search” because we are searching for all the entries (Documents) in which a particular pair of shoes, New Balance 574, was purchased.

If instead of “match” we used “match_all”, the query would match every Document in the Index (by default, a search response returns only the first 10 hits). Since Indices typically hold thousands of Documents, “match_all” isn’t a common search.
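To make the difference concrete, here is a minimal Python sketch that only builds the two request bodies and prints them in Dev Console form. It does not contact a cluster. The helper name `console_request` and the `size` value are illustrative assumptions, and the field name `item_name` follows the mapping defined later in this article.

```python
import json


def console_request(method, path, body=None):
    """Render a request the way it would be typed into the Dev Console."""
    lines = [f"{method} {path}"]
    if body is not None:
        lines.append(json.dumps(body, indent=2))
    return "\n".join(lines)


# Full-text match on a single field.
match_body = {"query": {"match": {"item_name": "New Balance 574"}}}

# match_all matches every Document; "size" caps how many hits are returned.
match_all_body = {"query": {"match_all": {}}, "size": 5}

print(console_request("GET", "/purchases/_search", match_body))
print(console_request("GET", "/purchases/_search", match_all_body))
```

The printed text can be pasted into the Dev Console as-is.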

Added Performance and Flexibility With Indices

While it is a helpful exercise to show how Indices and databases are similar, that is not the end of the story. Indices are also lightweight, so OpenSearch makes it easy to create an individual Index for each day of logs or for each user.
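As a rough illustration of per-day Indices, the sketch below derives an Index name from a date. The `logs-YYYY.MM.DD` pattern and the function name `daily_index_name` are assumptions chosen for the example, not something OpenSearch requires.

```python
from datetime import date


def daily_index_name(prefix, day):
    """Build a per-day Index name such as 'logs-2023.01.09'."""
    return f"{prefix}-{day:%Y.%m.%d}"


print(daily_index_name("logs", date(2023, 1, 9)))
```

Zero-padded dates keep the names sorted chronologically, which makes older Indices easy to find and delete.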

Individual Indices can aid in performance and improve data organization within a cluster. Separate Indices can be used for products, logs, IoT devices, transactions, or any other group of items or events that are of interest to your use case. However, you likely wouldn’t want an Index for a single item or event because of sharding. We go over why in the next section.

Distributing Data With Indices in OpenSearch

Shards are sub-elements of Indices and allow the information to be distributed over multiple nodes in a cluster. You can read more about OpenSearch shards here.

Replicas are copies of primary shards. They enhance search performance and provide redundancy for the data. Because replicas help with scaling query processing, they can be added or removed as needed at any time.
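Changing the replica count on an existing Index is done through the Index settings endpoint. The sketch below only builds that request as text. The function name `replica_settings_request` is hypothetical, and the replica count of 2 is an arbitrary example value.

```python
import json


def replica_settings_request(index, replicas):
    """Return the Dev Console text for changing an Index's replica count."""
    if replicas < 0:
        raise ValueError("replicas must be zero or greater")
    body = {"index": {"number_of_replicas": replicas}}
    return f"PUT /{index}/_settings\n{json.dumps(body, indent=2)}"


print(replica_settings_request("purchases", 2))
```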

In the example above, creating the purchases Index would also create a set of primary and replica shards for that Index. By default, OpenSearch creates one (1) primary shard and one (1) replica for each Index.

As a general guideline, a single-shard Index should hold at least 20-50 GB of data. This lower limit applies to every Index because each Index has at least one shard.

Each individual shard requires resources from OpenSearch. More shards require more work for OpenSearch to maintain where data is located, detracting from OpenSearch’s ability to run queries.

Multiple shards allow OpenSearch to spread search load across multiple instances. There can be situations where you want more shards for greater parallelization of work across multiple instances.
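The trade-off above can be turned into a back-of-the-envelope calculation. This is only a sketch: the 50 GB target is an assumption taken from the upper end of the guideline range, and the function name `suggested_primary_shards` is hypothetical.

```python
import math


def suggested_primary_shards(total_gb, target_shard_gb=50.0):
    """Estimate a primary shard count so each shard stays near the target size."""
    if total_gb < 0 or target_shard_gb <= 0:
        raise ValueError("sizes must be positive")
    # Every Index has at least one primary shard.
    return max(1, math.ceil(total_gb / target_shard_gb))


print(suggested_primary_shards(30))   # small Index: a single shard
print(suggested_primary_shards(400))  # larger Index: several shards
```

Treat the result as a starting point and adjust it for your query load and node count.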

If you are interested in more information on shards and shard optimization check out our article on OpenSearch Shard Optimization.

How to Create and Delete an Index in OpenSearch

Now that we have a basic understanding of Indices, let’s get into the ins and outs of using them.

Firstly, to retrieve a full list of all Indices in your OpenSearch cluster use the following code in the OpenSearch Dashboards Dev Console:

GET /_cat/indices
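The list is easier to read with a header row and a chosen set of columns. The sketch below assembles such a request path using the `v`, `h`, and `s` query parameters of the cat API. The function name `cat_indices_path` and the selected columns are illustrative assumptions.

```python
from urllib.parse import urlencode


def cat_indices_path(columns=None, sort_by=None):
    """Build a /_cat/indices path with optional column and sort parameters."""
    params = {"v": "true"}  # include a header row in the response
    if columns:
        params["h"] = ",".join(columns)  # columns to display
    if sort_by:
        params["s"] = sort_by  # column to sort by
    return "/_cat/indices?" + urlencode(params, safe=",")


print("GET " + cat_indices_path(["index", "docs.count", "store.size"], "index"))
```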

Creating an Index in OpenSearch

The PUT method is used to create an Index. Below is our purchases example again. In this sample code we are creating the purchases Index and specifying its settings and mappings. The Properties included here are “@timestamp”, “price”, and “item_name”, with the types date, float, and text, respectively. Note that “mappings” sits at the top level of the request body, alongside “settings”.

PUT /purchases
{
  "settings": {
    "index": {
      "number_of_shards": 1,
      "number_of_replicas": 1
    },
    "analysis": {
      "analyzer": {
        "analyzer-name": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "@timestamp": {
        "type": "date"
      },
      "price": {
        "type": "float"
      },
      "item_name": {
        "type": "text",
        "analyzer": "analyzer-name"
      }
    }
  }
}

It is important to note that Documents cannot be added to a data stream using PUT.

After submitting the above code, the following confirmation will be returned.

{
  "acknowledged": true
}
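Before sending a create request, it can help to check that every analyzer referenced by a Property is actually defined. The sketch below is one way to do that offline. It assumes “settings” and “mappings” are both top-level keys of the request body, the helper name `undefined_analyzers` is hypothetical, and the set of built-in analyzer names is deliberately partial.

```python
import json

# A few built-in analyzer names; this set is illustrative, not exhaustive.
BUILT_IN_ANALYZERS = {"standard", "simple", "whitespace", "stop", "keyword"}


def undefined_analyzers(index_body):
    """Return analyzer names used by Properties but not defined in settings."""
    settings = index_body.get("settings", {})
    defined = set(settings.get("analysis", {}).get("analyzer", {}))
    properties = index_body.get("mappings", {}).get("properties", {})
    used = {
        field["analyzer"] for field in properties.values() if "analyzer" in field
    }
    return sorted(used - defined - BUILT_IN_ANALYZERS)


purchases_body = {
    "settings": {
        "index": {"number_of_shards": 1, "number_of_replicas": 1},
        "analysis": {
            "analyzer": {
                "analyzer-name": {
                    "type": "custom",
                    "tokenizer": "keyword",
                    "filter": ["lowercase"],
                }
            }
        },
    },
    "mappings": {
        "properties": {
            "@timestamp": {"type": "date"},
            "price": {"type": "float"},
            "item_name": {"type": "text", "analyzer": "analyzer-name"},
        }
    },
}

print(undefined_analyzers(purchases_body))
print("PUT /purchases")
print(json.dumps(purchases_body, indent=2))
```

An empty list means every referenced analyzer was found. The check only looks at top-level Properties, not nested objects.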

Adding to an Index in OpenSearch

Now that the purchases Index is created, we can add a Document. In this case the Document is a single transaction. To add a Document to an Index we use the POST method.

We also want to note that a Document can be created without an Index being created beforehand. If a Document is being created without the associated Index, OpenSearch will create an Index with the default settings (not usually what we want).

POST /purchases/_doc/
{
  "@timestamp": "2023-01-09T14:59:00",
  "item_name": "New Balance 574",
  "price": 119.99
}

A response will then be provided that assigns a unique “_id” for the Document. This identifier can be useful for querying a Document. Further down in this article we will show how the identifier can be used to select a Document for deletion.
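A date Property with default settings expects ISO 8601 style strings such as 2023-01-09T14:59:00 (or epoch milliseconds), so it is safer to generate the “@timestamp” value than to type it by hand. The sketch below builds the Document body that way. The function name `purchase_document` is hypothetical and nothing is sent to a cluster.

```python
import json
from datetime import datetime


def purchase_document(item_name, price, purchased_at):
    """Build a purchase Document with an ISO 8601 '@timestamp'."""
    return {
        "@timestamp": purchased_at.isoformat(timespec="seconds"),
        "item_name": item_name,
        "price": price,
    }


doc = purchase_document("New Balance 574", 119.99, datetime(2023, 1, 9, 14, 59))
print("POST /purchases/_doc/")
print(json.dumps(doc, indent=2))
```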

If you want to specify the “_id” rather than be provided a random identifier, then instead of

POST /my_index/_doc/

include the “_id” like this:

POST /my_index/_doc/<_id>

While you can specify your own “_id”, this is not recommended because OpenSearch works fastest with its own Document ID generation. If you want to use an “ID”, then we recommend using your own unique field, such as “purchase_id”.
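One way to follow this recommendation is to attach your own identifier as an ordinary field and leave “_id” to OpenSearch. The sketch below assumes a random UUID is an acceptable “purchase_id”. In practice the value would more likely come from your order system, and the function name `with_purchase_id` is hypothetical.

```python
import json
import uuid


def with_purchase_id(document):
    """Return a copy of the Document with a unique 'purchase_id' added."""
    tagged = dict(document)
    tagged["purchase_id"] = str(uuid.uuid4())
    return tagged


purchase = with_purchase_id({"item_name": "New Balance 574", "price": 119.99})

# No "_id" in the path, so OpenSearch still generates its own identifier.
print("POST /purchases/_doc/")
print(json.dumps(purchase, indent=2))
```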

As a general rule, OpenSearch users should not add field names that start with an underscore, since that prefix is used for metadata fields such as “_id”. It is also best to avoid names that are easily confused with metadata fields. For example, we wouldn’t want to add a field named “id” when there is already an “_id” field, or a field named “type” that could be mistaken for the “_type” metadata field.
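A small pre-flight check can catch underscore-prefixed names before a Document is sent. This is a minimal sketch: the function name `underscore_fields` is hypothetical and only top-level field names are inspected.

```python
def underscore_fields(document):
    """Return top-level field names that start with an underscore."""
    return sorted(name for name in document if name.startswith("_"))


candidate = {"@timestamp": "2023-01-09T14:59:00", "_internal": True, "price": 119.99}
print(underscore_fields(candidate))
```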

Retrieving Data From an Index in OpenSearch

To retrieve data from an Index we will use the GET method.

The code below is a repeat of the example from above where we are querying all transactions that include the purchase of “New Balance 574” shoes within the “purchases” Index.

GET /purchases/_search
{
  "query": {
    "match": {
      "item_name": "New Balance 574"
    }
  }
}

Deleting Data From an Index in OpenSearch

Documents can be deleted using either the POST or the DELETE method. We will go through a few examples below.

Delete a Document by Query Using Match

Here we are deleting all of the Documents for which the “item_name” field value is “New Balance 574”. This deletion would include the entry for the pair of shoes that we added to our Index in the section above.

POST /purchases/_delete_by_query
{
  "query": {
    "match": {
      "item_name": "New Balance 574"
    }
  }
}

You’ll notice that this is similar to the query code above, with two differences: the method changes from “GET” to “POST”, and the action changes from “_search” to “_delete_by_query”.

Delete all Documents in an Index Using Match_all

If instead of using “match” we used “match_all” this would delete all Documents from the Index.

POST /purchases/_delete_by_query
{
  "query": {
    "match_all": {}
  }
}

Delete Documents with Range

In the following example we are deleting all Documents with a price greater than or equal to (“gte”) $100.00.

POST /purchases/_delete_by_query
{
  "query": {
    "range": {
      "price": {
        "gte": 100.00
      }
    }
  }
}
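Because a delete by query cannot be undone, it can be worth checking how many Documents a query matches first. The sketch below reuses one query body for a “_count” request and for the delete request. The function name `preview_and_delete` is hypothetical and the requests are only printed, not sent.

```python
import json


def preview_and_delete(index, query):
    """Return (count request, delete request) that share one query body."""
    body = json.dumps({"query": query}, indent=2)
    count_request = f"GET /{index}/_count\n{body}"
    delete_request = f"POST /{index}/_delete_by_query\n{body}"
    return count_request, delete_request


count_req, delete_req = preview_and_delete(
    "purchases", {"range": {"price": {"gte": 100.00}}}
)
print(count_req)    # run this first to see how many Documents match
print(delete_req)   # then run the delete once the count looks right
```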

DELETE API for Deleting a Single Document in OpenSearch

The DELETE API is less cumbersome but does require the unique “_id” mentioned above.

Specifying the unique “_id” is the only way to ensure you are deleting a single Document in OpenSearch. A unique “_id” can also be specified using the Delete by query approach above. In that case, the field name would be “_id”.

DELETE /my_index/_doc/<_id>
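The two approaches can be compared side by side. The sketch below builds both requests for the same identifier, using a term query on “_id” for the delete by query variant. The identifier “abc123” is made up for illustration and the function name `delete_by_id_requests` is hypothetical.

```python
import json
from urllib.parse import quote


def delete_by_id_requests(index, doc_id):
    """Return the DELETE request and an equivalent delete-by-query request."""
    # Percent-encode the identifier so characters like "/" stay in one path segment.
    direct = f"DELETE /{index}/_doc/{quote(doc_id, safe='')}"
    body = {"query": {"term": {"_id": doc_id}}}
    by_query = f"POST /{index}/_delete_by_query\n{json.dumps(body, indent=2)}"
    return direct, by_query


# "abc123" is a made-up identifier used only for illustration.
direct, by_query = delete_by_id_requests("purchases", "abc123")
print(direct)
print(by_query)
```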

Deleting an Index in OpenSearch

To delete an entire Index in OpenSearch, use the DELETE method as follows:

DELETE /my_index

Reindex in OpenSearch

Reindexing is used to copy Documents from a source into a destination. To reindex in OpenSearch use the POST API as follows:

POST /_reindex
{
  "source": {
    "index": "my-index"
  },
  "dest": {
    "index": "my-new-index"
  }
}
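A reindex request can also copy only part of the source by placing a “query” inside “source”. The sketch below builds such a request. The function name `reindex_request` is hypothetical, and the price filter is just an example that reuses the range query from earlier in this article.

```python
import json


def reindex_request(source_index, dest_index, query=None):
    """Return Dev Console text for a reindex, optionally limited by a query."""
    source = {"index": source_index}
    if query is not None:
        source["query"] = query  # copy only Documents matching this query
    body = {"source": source, "dest": {"index": dest_index}}
    return "POST /_reindex\n" + json.dumps(body, indent=2)


print(reindex_request("my-index", "my-new-index"))
print(reindex_request("my-index", "my-new-index", {"range": {"price": {"gte": 100.00}}}))
```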

