
Lessons Learned From A Year Of Elasticsearch In Production

We ❤ Elasticsearch at Scrunch and are constantly finding new use cases for it. We heavily utilise aggregations, write complex function score queries and index mass quantities of documents every day. After a year of running it in production, we’ve learned a few lessons that we’d like to share with you.

So without further ado, let's start with...

1. Watch Your Thread Pools

Elasticsearch has a collection of thread pools for handling tasks like searching, indexing and fetching. Most of the thread pools have a fixed size based on the number of processors (specifics available here), and each of these thread pools includes a queue for managing pending requests. We have found it paramount to keep an eye on the thread pool queue sizes: consistently large queues often translate directly into poor query performance for customers. If you aren’t using something like Marvel, you can keep an eye on these metrics using the Node Stats API.
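For example, you can pull the current queue and rejection counts for every thread pool with a request like this (assuming a node listening on localhost:9200; point it at your own cluster):

```shell
# Query the Node Stats API, restricted to thread pool metrics.
# Look at the "queue" and "rejected" counters for each pool.
curl -s "http://localhost:9200/_nodes/stats/thread_pool?pretty"
```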

Here’s an example of a relatively healthy search thread pool, with some slight spikes that may warrant attention:

Once your queue sizes are maxed out, you might start to see these scary looking exceptions in your logs:

EsRejectedExecutionException[rejected execution (queue capacity 1000) on org.elasticsearch.search.action.SearchServiceTransportAction

But you shouldn’t let it get to that point. Consider sending out an alert when your queues sit above a threshold for too long, and take action: add more capacity or tune your queries. It can be tempting to increase your queue sizes when you see these exceptions; however, that just results in more queued requests eating memory and frustrated customers (been there, done that).

If you are seeing queued index requests, consider using the Bulk API. If you are already using the Bulk API, try lowering your bulk sizes.
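As a sketch, a bulk request batches several index operations into one call (index, type and documents here are hypothetical; the body must be newline-delimited JSON with a trailing newline):

```shell
# Index two documents in a single bulk request.
# Each action line is followed by its document source (NDJSON format).
curl -s -XPOST "http://localhost:9200/_bulk" --data-binary '
{"index": {"_index": "test-index-1", "_type": "doc", "_id": "1"}}
{"some_field": 1.5}
{"index": {"_index": "test-index-1", "_type": "doc", "_id": "2"}}
{"some_field": 2.5}
'
```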

Speaking of adding more capacity, you might just want to...

2. Ensure You Have An Adequate Heap Size But Leave Plenty Of Off-heap Memory Too

You should keep a close eye on your heap utilisation in Elasticsearch. Again, if you aren’t using something like Marvel for monitoring your cluster, you can collect heap stats from the jvm section of the Node Stats API. 

If you haven’t explicitly set your heap size, you'll be using the default of 1G, which is sufficient for testing on your laptop but not much else. The recommended way to increase the heap size is by setting the ES_HEAP_SIZE environment variable (usually set in the init.d script or in /etc/default/elasticsearch).
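For example, on a Debian-style install you might set it in /etc/default/elasticsearch (the 8g value is purely illustrative; size it for your own hardware and workload):

```shell
# /etc/default/elasticsearch
# Illustrative heap size -- tune for your machine, keeping roughly half
# of physical memory free for Lucene and the OS page cache (see Lesson #2).
ES_HEAP_SIZE=8g
```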

Healthy heap patterns usually look like deep sawtooths with consistent but not overzealous garbage collection:

Flat-lined heap usage with a lot of garbage collection operations may indicate that the heap size is too small for your needs. A lot of garbage collection operations also tend to eat CPU, and you risk large pause-the-world collections halting your cluster completely. I’m not going to claim to be a JVM expert, so I’ll defer to a few other articles if you want to know more.

You also want to keep an eye on filter evictions. Consistent filter evictions are usually a sign of either insufficient heap sizes or using filter caches poorly (for example, caching extremely specific geo distance filters or date range filters). Upgrading to Elasticsearch 2.x will help a lot here, as filter caching has gotten a lot smarter. 

Also, Lucene (the search library on which Elasticsearch is built) utilises off-heap memory for its operations, so you should ensure you have plenty set aside. The oft-cited rule of thumb is that you want roughly 50% of available memory for your heap (not exceeding 32G) and 50% for Lucene, which we’ve found to be about right. You also want to ensure that your operating system is doing plenty of disk caching. More info about that here.

Lastly, if you are doing a lot of sorting or aggregating then you can take a lot of pressure off your heap if you... 

3. Use Doc Values Liberally

When Elasticsearch performs aggregations or sorts, it loads a bunch of field data into memory. This can quickly exhaust your heap or result in Circuit Breaker exceptions, which are effectively a defence against all-out OutOfMemory exceptions. However, there is a way to have a delicious bit of cake and eat it too: doc values. Doc values let you store field data on disk, taking pressure off the heap. As of 2.x they are the default, but if you are stuck on an old version, you’ll need to set them manually in your mappings and reindex your data. Here's an example:

curl -XPOST "http://localhost:9200/test-index-1" -d '{
  "mappings": {
    "my_type": {
      "properties": {
        "some_field": {
          "type": "float",
          "doc_values": true
        }
      }
    }
  }
}'
Doc values are available for most field types; for string fields, they must be not_analyzed.

You can keep an eye on how much memory your field data is taking up using the Indices Stats API. We usually look at our top field data memory consumers and turn them into doc values on a periodic basis. 
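For instance, you can break down field data memory usage per field across all indices with a request like this (again assuming localhost:9200):

```shell
# Indices Stats API: report field data memory usage, per field.
# The biggest consumers are good candidates for conversion to doc values.
curl -s "http://localhost:9200/_stats/fielddata?fields=*&pretty"
```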

Also, be aware that using doc values can be slightly slower than the alternative since reading from disk is slower than memory. In our experience, the heap freed using doc values mitigates this downside. Again, you should ensure that your operating system has plenty of memory free for caching disk segments and that your disks are fast enough. Which brings me to my next point...

4. Use SSDs

If you can afford SSDs, then buy them. Elasticsearch does a lot of reading from disk and fast disks equal fast queries. If they're outside of your budget, you should keep a close eye on your disk IO using something like iostat. If you are seeing high await times, then your disks might be a bottleneck. Really, we've found that SSDs are essential for any reasonable sort of traffic demands. Here you can see the impacts of swapping out a spinning disk for SSD in one of our environments: 

(The blue line on the latter graph is the thread pool queue size; see Lesson #1.)
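If you want to watch those await times yourself, iostat's extended stats flag is the one to reach for (a sketch; column names vary slightly between sysstat versions):

```shell
# Print extended per-device I/O stats every 5 seconds.
# Watch the "await" column: the average time (ms) for requests to be
# served, including time spent waiting in the queue.
iostat -x 5
```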

A high load average could also be an indication of a high disk wait time since it's factored into the load calculation in Linux.

Slow disks and large I/O wait times can manifest in a lot of ways but slow queries are a good indication that you may have an issue here, which is why you should...

5. Turn On Slow Query Logging and Regularly Review It

You should make sure that you have your slow search log turned on and that you review it regularly. We have found the settings defined in Elasticsearch's docs are quite reasonable for our use cases, but obviously, these should be set to whatever makes sense for you. It can be quite an eye-opening experience to see how poorly certain queries perform, particularly the notoriously bad ones like Wildcard queries or Phrase queries. If you are seeing exceptionally slow queries, then you may want to look at index-time optimisations available to you. For example, n-grams instead of wildcard queries and “shingles” instead of phrase queries.
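As a starting point, slow log thresholds can be set per index; the values below are the examples from Elasticsearch's docs (the index name is hypothetical, and the thresholds should be tuned to your own latency targets):

```shell
# Enable slow search logging on an existing index with per-phase thresholds.
# Queries slower than these values are written to the slow log at the
# corresponding level (warn/info).
curl -XPUT "http://localhost:9200/test-index-1/_settings" -d '{
  "index.search.slowlog.threshold.query.warn": "10s",
  "index.search.slowlog.threshold.query.info": "5s",
  "index.search.slowlog.threshold.fetch.warn": "1s"
}'
```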

It can be useful to graph your query times using the Indices Stats API and correlate any spikes with entries in the slow log.