How to Index Elasticsearch

Updated October 2020

An Index in Elasticsearch is used to both organize and distribute data within a cluster.  In this post we will define both components of an Index and then outline how to create, add to, delete, and reindex Indices in Elasticsearch.  We will also touch on querying, but querying will be covered in more depth in subsequent posts.

Before we get started though, let’s pause and look at how Elastic* breaks up the definition of an Index into its two parts.  Firstly, “An index is like a ‘database’ in a relational database.  It has a mapping which defines multiple types.”  This first definition describes an Index as a mechanism to organize data in a user-prescribed way.  Secondly, Elastic says, “An index is a logical namespace which maps to one or more primary shards and can have zero or more replica shards.”  In this second definition, the Index is described as a way to allocate data within an Elasticsearch cluster.

*A note on the above linked Elastic article.  That article is several years old and talks about Types in a way we don’t here.  Over the last few years Elastic has decreased its reliance on Types.  Currently, the most common Type is “_doc”, and we won’t be getting more into them in this post.  Despite the changing role of Types, we liked the old definition of Index, included in the article, which still holds true today.

A quick note on capitalization.  In this post we’ve chosen to capitalize Index and Document throughout.  We felt it was helpful to readers to see these terms capitalized to distinguish them from their everyday meanings.

Let’s first dive into how Indices are used to organize data.

Using Indices to Organize Data in Elasticsearch

When discussing how Indices are used to organize data in Elasticsearch, we can think of an Index as being akin to a database in a relational system like MySQL.  Elasticsearch stores Documents (comparable to rows) that each have Properties (comparable to columns).  Each Property can also be thought of as a key in key-value pair terminology.  Properties fall into one of several different types, each with its own capabilities.  For instance, you can perform arithmetic on Properties that are floats but not on a string Property.

Let’s run through a short example. A retail company would likely have a purchases Index.  Then within the Index there is a Document (row) for each individual purchase.  And then within each purchase Document there are Properties (columns) such as date/time of purchase, item purchased, cost, etc.

We have more posts in the works that will cover querying in depth, but for now let’s quickly look at how we would query this purchases Index.  In the Kibana Dev Console you can use the following code:

GET /purchases/_search
{
  "query": {
    "match": {
      "item_name": "Viper Edge Pro"
    }
  }
}

The top line of the query begins with “GET”, the HTTP method used to retrieve data.  Then the Index is specified, in this case “purchases”.  Finally, the operation is included.  Here the operation is “_search” because we are searching for all the entries (Documents) in which a particular pair of cleats, Viper Edge Pro, was purchased.

If instead of “match” we used “match_all”, the query would match every Document in the Index (by default Elasticsearch returns only the first 10 hits).  Since Indices typically hold thousands of Documents, “match_all” isn’t a common search.
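To make the difference between “match” and “match_all” concrete, here is a minimal Python sketch that only builds the two request bodies as dictionaries.  The function names are hypothetical, and the field name “item_name” is an assumption taken from the mapping used later in this post.  Nothing here contacts a cluster.

```python
import json


def match_query(field: str, text: str) -> dict:
    """Build a search body that matches `text` against a single field."""
    return {"query": {"match": {field: text}}}


def match_all_query(size: int = 10) -> dict:
    """Build a search body that matches every Document.

    `size` caps how many hits are returned in one response.
    """
    return {"size": size, "query": {"match_all": {}}}


if __name__ == "__main__":
    # Print both bodies so they can be pasted under GET /purchases/_search.
    print(json.dumps(match_query("item_name", "Viper Edge Pro"), indent=2))
    print(json.dumps(match_all_query(size=25), indent=2))
```

The printed JSON can be pasted into the Kibana Dev Console directly beneath the GET line.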

Added Performance and Flexibility With Indices

While it is a helpful exercise to show how Indices and databases are similar, that is not the end of the story.  Indices have added performance compared to databases.  For instance, because Indices are lightweight, Elasticsearch makes it easy to create an individual Index for each daily log or for each user.

Individual Indices can aid in performance and improve data organization within a cluster.  Separate Indices can be used for products, logs, IoT devices, transactions, or any other group of items or events that is of interest to your use case.  However, you likely wouldn’t want to have an Index for a singular item or event because of sharding.  We go over why in the next section.
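One common way to get an Index per day is to put the date in the Index name.  The sketch below assumes a hypothetical “prefix-YYYY.MM.DD” naming pattern; the pattern and the function names are illustrative choices, not something Elasticsearch requires.

```python
from datetime import date, timedelta


def daily_index_name(prefix: str, day: date) -> str:
    """Return an Index name such as 'logs-2020.10.09' for the given day."""
    return f"{prefix}-{day:%Y.%m.%d}"


def index_names_for_range(prefix: str, start: date, days: int) -> list:
    """Return the daily Index names for `days` consecutive days from `start`."""
    return [daily_index_name(prefix, start + timedelta(days=i)) for i in range(days)]


if __name__ == "__main__":
    for name in index_names_for_range("logs", date(2020, 10, 30), 3):
        print(name)
```

Keeping the date in the name makes it straightforward to query or delete a specific day’s data by Index name.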

Distributing Data With Indices in Elasticsearch

Let’s now circle back to the second half of the definition of an Index as defined by the Elastic team, “An index is a logical namespace which maps to one or more primary shards and can have zero or more replica shards.”


Shards are sub-elements of Indices and allow the information to be distributed over multiple nodes in a cluster.


Replicas are used to enhance search performance in addition to providing data backup.  Because replicas help with scaling query processing, replicas can be added/removed as needed at any time.


In the example above, creating the purchases Index also creates a set of primary and replica shards for that Index.  By default, Elasticsearch creates one (1) primary shard and one (1) replica for each Index.


As a general guideline, set up each Index to hold at least 50 GB.  This recommended lower limit exists because each Index has at least one shard.  Since each individual shard requires resources from Elasticsearch, more shards mean more work for Elasticsearch to track where data is located, which detracts from its ability to run queries.  If you are interested in more information on shards and shard optimization, check out our article on Elasticsearch Shards.
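The guideline above turns into simple arithmetic when deciding how much time one Index should cover.  The following sketch assumes a steady ingest rate and uses the 50 GB figure from this post as its default; the function name is hypothetical and the result is a rough planning number, not a rule.

```python
import math


def days_per_index(daily_gb: float, min_index_gb: float = 50.0) -> int:
    """Estimate how many days of data one Index should cover.

    Assumes a constant ingest rate of `daily_gb` gigabytes per day and
    returns the smallest whole number of days that reaches `min_index_gb`.
    """
    if daily_gb <= 0:
        raise ValueError("daily_gb must be positive")
    return max(1, math.ceil(min_index_gb / daily_gb))


if __name__ == "__main__":
    for rate in (2, 7, 60):
        print(f"{rate} GB/day -> one Index per {days_per_index(rate)} day(s)")
```

Under these assumptions, a source producing only a couple of gigabytes per day is better served by a monthly Index than a daily one.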

How to Create and Delete an Index in Elasticsearch

Now that we have a basic understanding of Indices, let’s get into the ins and outs of using them.

Firstly, to retrieve a full list of all Indices in your Elasticsearch cluster use the following code in the Kibana Dev Console:

GET /_cat/indices
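Outside of Kibana, the same listing can be requested over HTTP.  The sketch below uses only the Python standard library; the base URL is an assumption (a local, unsecured cluster), the function names are hypothetical, and the sample response is made-up data for illustration.  It relies on the “format=json” parameter of the cat API to get JSON rows instead of plain text.

```python
import json
import urllib.request


def cat_indices_request(base_url: str) -> urllib.request.Request:
    """Build (but do not send) a GET request for the list of Indices as JSON."""
    url = base_url.rstrip("/") + "/_cat/indices?format=json"
    return urllib.request.Request(url, method="GET")


def index_names(response_text: str) -> list:
    """Extract the sorted Index names from a JSON cat-indices response."""
    rows = json.loads(response_text)
    return sorted(row["index"] for row in rows)


# Made-up sample response, trimmed to two columns for illustration only.
SAMPLE = '[{"index": "purchases", "docs.count": "1"}, {"index": "logs-2020.10.09", "docs.count": "42"}]'

if __name__ == "__main__":
    request = cat_indices_request("http://localhost:9200")  # assumed address
    print(request.get_method(), request.full_url)
    print(index_names(SAMPLE))
    # To actually send it: urllib.request.urlopen(request).read().decode()
```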

Creating an Index in Elasticsearch

The PUT method is used to create an Index.  Below is our purchases example again.  In this sample code we are creating the purchases Index and specifying its settings and Properties.  The Properties included here are “@timestamp”, “price”, and “item_name”.  The type of each Property is also specified as date, float, and text, respectively.

PUT /purchases
{
  "settings": {
    "index": {
      "number_of_shards": 1,
      "number_of_replicas": 1
    },
    "analysis": {
      "analyzer": {
        "analyzer-name": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "@timestamp": {
        "type": "date"
      },
      "price": {
        "type": "float"
      },
      "item_name": {
        "type": "text",
        "analyzer": "analyzer-name"
      }
    }
  }
}

It is important to note that Documents cannot be added to a data stream using PUT.  See this post for instructions on how to add a Document to a data stream.

After submitting the above code, the following confirmation will be returned.

{
  "acknowledged": true
}
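When Indices are created from a script rather than by hand, it can help to build the request body in code.  This is a minimal sketch that assumes a 7.x-style cluster, where “settings” and “mappings” are top-level siblings and analyzed strings use the “text” type.  The function name is hypothetical, and the body is only constructed and printed, not sent.

```python
import json


def purchases_index_body(shards: int = 1, replicas: int = 1) -> dict:
    """Build the body for PUT /purchases with settings and mappings side by side."""
    return {
        "settings": {
            "index": {"number_of_shards": shards, "number_of_replicas": replicas},
            "analysis": {
                "analyzer": {
                    # Custom analyzer: keep the whole value as one token, lowercased.
                    "analyzer-name": {
                        "type": "custom",
                        "tokenizer": "keyword",
                        "filter": ["lowercase"],
                    }
                }
            },
        },
        "mappings": {
            "properties": {
                "@timestamp": {"type": "date"},
                "price": {"type": "float"},
                "item_name": {"type": "text", "analyzer": "analyzer-name"},
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(purchases_index_body(), indent=2))
```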

Adding to an Index in Elasticsearch

Now that the purchases Index is created, we can add a Document.  In this case the Document is a single transaction.  To add a Document to an Index we use the POST API.

We also want to note that a Document can be created without an Index being created beforehand. If a Document is being created without the associated Index, Elasticsearch will create an Index with the default settings (not usually what we want).

POST /purchases/_doc/
{
  "@timestamp": "2020-10-09T14:59:00",
  "item_name": "Viper Edge Pro",
  "price": 119.99
}

A response will then be provided that assigns a unique “_id” for the Document.  This identifier can be useful for querying a Document, which will be covered in a subsequent post.  Further down in this article we will show how the identifier can be used to select a Document for deletion.
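The timestamp format is easy to get wrong when typed by hand.  As a small illustration, the sketch below builds the same purchase Document in Python and lets the standard library produce the ISO 8601 timestamp (date, then “T”, then time).  The function name is hypothetical.

```python
import json
from datetime import datetime


def purchase_document(item_name: str, price: float, when: datetime) -> dict:
    """Build a purchase Document with an ISO 8601 '@timestamp'."""
    return {
        "@timestamp": when.isoformat(timespec="seconds"),
        "item_name": item_name,
        "price": price,
    }


if __name__ == "__main__":
    doc = purchase_document("Viper Edge Pro", 119.99, datetime(2020, 10, 9, 14, 59))
    # This JSON is the body that would follow POST /purchases/_doc/
    print(json.dumps(doc, indent=2))
```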

If you want to specify the “_id” rather than be given a random identifier, then instead of

POST /my_index/_doc/

include the “_id” like this:

POST /my_index/_doc/<_id>

While you can specify your own “_id”, this is not recommended because Elasticsearch works fastest with its own Document ID generation.  If you want to use an “ID”, then we recommend using your own unique field, such as “purchase_id”.
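A sketch of the “purchase_id” approach: the application attaches its own identifier as an ordinary field and leaves “_id” for Elasticsearch to generate.  Using a random UUID here is an assumption made for illustration; an order number from your own system would serve the same purpose.  The function name is hypothetical.

```python
import uuid


def with_purchase_id(document: dict) -> dict:
    """Return a copy of `document` with a unique 'purchase_id' field added."""
    tagged = dict(document)  # copy so the caller's dictionary is not modified
    tagged["purchase_id"] = str(uuid.uuid4())
    return tagged


if __name__ == "__main__":
    original = {"item_name": "Viper Edge Pro", "price": 119.99}
    print(with_purchase_id(original))
```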

As a general rule, Elasticsearch users should not add field names that start with an underscore, because Elasticsearch uses names such as “_id” for its own metadata fields.  It is also best to avoid names that are easily confused with those metadata fields.  For example, we wouldn’t want to add a field named “id” when there is already an “_id” field, or a “type” field when the “_type” field exists by default.

Retrieving Data From an Index in Elasticsearch

To retrieve data from an Index we will use the GET API.

The code below is a repeat of the example from above where we are querying all transactions that include the purchase of “Viper Edge Pro” cleats within the “purchases” Index.

GET /purchases/_search
{
  "query": {
    "match": {
      "item_name": "Viper Edge Pro"
    }
  }
}

Deleting Data From an Index in Elasticsearch

Documents can be deleted using either the POST or DELETE APIs.  We will go through a few examples below.

Delete a Document by Query Using Match

Here we are deleting all of the Documents for which the “item_name” field value is “Viper Edge Pro”.  This deletion would include the entry for the pair of cleats that we added to our Index in the section above.

POST /purchases/_delete_by_query
{
  "query": {
    "match": {
      "item_name": "Viper Edge Pro"
    }
  }
}

You’ll notice that this is similar to the query code above with the two differences being an API swap from “GET” to “POST”, and the action being changed from “_search” to “_delete_by_query”.

Delete all Documents in an Index Using Match_all

If instead of using “match” we used “match_all” this would delete all Documents from the Index.

POST /purchases/_delete_by_query
{
  "query": {
    "match_all": {}
  }
}

Delete Documents with Range

In the following example we are deleting all Documents with a price greater than or equal to (“gte”) $100.00. 

POST /purchases/_delete_by_query
{
  "query": {
    "range": {
      "price": {
        "gte": 100.00
      }
    }
  }
}
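A range query can also carry an upper bound (“lte”) alongside “gte”.  The sketch below builds a delete-by-query body with either or both bounds; the function name is hypothetical, and it refuses to build a body with no bounds at all so that a mistake does not turn into an unbounded delete.

```python
import json


def range_delete_body(field: str = "price", gte=None, lte=None) -> dict:
    """Build a _delete_by_query body for Documents whose `field` is in range."""
    bounds = {}
    if gte is not None:
        bounds["gte"] = gte  # greater than or equal to
    if lte is not None:
        bounds["lte"] = lte  # less than or equal to
    if not bounds:
        raise ValueError("provide at least one of gte or lte")
    return {"query": {"range": {field: bounds}}}


if __name__ == "__main__":
    # Body for POST /purchases/_delete_by_query, prices from 100 to 200 inclusive.
    print(json.dumps(range_delete_body(gte=100.0, lte=200.0), indent=2))
```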

DELETE API for Deleting a Single Document in Elasticsearch

The DELETE API is less cumbersome but does require the unique “_id” mentioned above.

Specifying the unique “_id” is the only way to ensure you are deleting a single Document in Elasticsearch.  A unique “_id” can also be specified using the Delete by query approach above.  In that case, the field name would be “_id”.

DELETE /my_index/_doc/<_id>
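For the delete-by-query route mentioned above, one option is the “ids” query, which selects Documents by their “_id” values.  This sketch only builds the request body; the function name and the placeholder identifier in the usage line are hypothetical.

```python
import json


def delete_by_ids_body(doc_ids: list) -> dict:
    """Build a _delete_by_query body that targets specific '_id' values."""
    if not doc_ids:
        raise ValueError("doc_ids must contain at least one _id")
    return {"query": {"ids": {"values": list(doc_ids)}}}


if __name__ == "__main__":
    # "example-id-1" is a placeholder, not a real Document identifier.
    print(json.dumps(delete_by_ids_body(["example-id-1"]), indent=2))
```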

Deleting an Index in Elasticsearch

To delete an entire Index in Elasticsearch, use the DELETE API as follows:

DELETE /my_index

Reindex in Elasticsearch

Reindexing is used to copy Documents from a source into a destination.  To reindex in Elasticsearch use the POST API as follows:

POST /_reindex
{
  "source": {
    "index": "my-index"
  },
  "dest": {
    "index": "my-new-index"
  }
}
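Reindexing does not have to copy everything: the “source” section also accepts a “query” that limits which Documents are copied.  The sketch below builds a reindex body with an optional filter.  The function name is hypothetical, and the Index and field names in the usage lines are assumptions carried over from the purchases example.

```python
import json


def reindex_body(source_index: str, dest_index: str, query: dict = None) -> dict:
    """Build a _reindex body, optionally copying only Documents matching `query`."""
    source = {"index": source_index}
    if query is not None:
        source["query"] = query
    return {"source": source, "dest": {"index": dest_index}}


if __name__ == "__main__":
    # Copy everything.
    print(json.dumps(reindex_body("my-index", "my-new-index"), indent=2))
    # Copy only purchases of 100.00 or more (assumed "price" field).
    only_expensive = {"range": {"price": {"gte": 100.0}}}
    print(json.dumps(reindex_body("purchases", "purchases-expensive", only_expensive), indent=2))
```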

Have more questions about Elasticsearch Indices?

Stay tuned for our follow-on posts about querying in Elasticsearch.  And as always if you have questions about what you read or could use help with optimizing your Elasticsearch implementation drop us a line below.  Our Elastic Certified Engineers are here to help.

Elasticsearch Support with Elastic Certified Engineers

Dattell’s Elastic Certified Engineers work one-on-one with companies to design, implement, manage, and improve their Elasticsearch deployments.  Pricing for Elasticsearch support services starts at $2,400.
