Elasticsearch is awesome. Based on the blazing fast Lucene engine, Elasticsearch not only provides smart, full-text search, but it also adds killer features like aggregations, which allow for real-time analytics of huge data sets. It’s used in frontend and backend business applications, at universities, by NASA, in medical research projects, and much more. The list of applications is growing by the day.
One of Elasticsearch’s biggest strengths is the low barrier to entry: it’s open source and freely available, and the learning curve is so gentle that anyone can dive in and get a single node indexing documents in just a few minutes. However, this also happens to be one of Elasticsearch’s biggest problems. Users can dive in and start using and developing with Elasticsearch without really understanding what they’re doing (think of it like another Eternal September).
This leads to some pretty common problems. For some users with relatively basic needs, the solutions are simple. But for projects with a larger and more complex scope, the available solutions can be untenable. The purpose of this post is to lay out some of those common pitfalls and address how to avoid them.
1. Avoid oversharding
The basic premise of sharding is that requests to an index can be processed simultaneously by different CPUs, which is generally faster than having one CPU process an entire index. In other words, it’s faster for 5 CPUs to each search over 200,000 documents and collate the results, compared to a single CPU processing 1,000,000 documents. New users often take this to mean that more shards = more speed, which is not actually the case.
The Elasticsearch documentation has a good article on why users should avoid oversharding, but it boils down to a few reasons:
- Shards consume system resources; you’ll need an insane number of nodes to keep the cluster from breaking. In other words, it will cost a lot of money without really giving any benefits.
- Elasticsearch calculates term statistics at the shard level. If you don’t have enough documents in each shard, then those stats will be unreliable, and you’ll end up with lots of irrelevant results.
- When Elasticsearch receives a search request, the coordinating node (the node that happened to receive the request) forwards it to the nodes holding the index’s shards, then handles the work of collating and sorting the results. More shards means more work for that coordinating node. And, if the shards are spread across only a few nodes, they end up competing for resources and slowing everything down.
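Since the number of primary shards is fixed at index creation (changing it later means reindexing), it pays to choose a modest count deliberately rather than accept an oversized default. Here’s a minimal sketch over the REST API using Python’s requests library; the index name, shard and replica counts, and the localhost endpoint are all assumptions for illustration.

```python
# Hedged sketch: create an index with an explicit, modest shard count.
# "products_v1", the counts, and localhost:9200 are hypothetical.
import requests

ES = "http://localhost:9200"

requests.put(f"{ES}/products_v1", json={
    "settings": {
        "number_of_shards": 3,     # a few primaries, not dozens
        "number_of_replicas": 1,   # one copy of each primary for resilience
    }
})
```

A few primaries on a small cluster keeps every node busy on a search without drowning the coordinating node in tiny result sets to collate.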
2. Aliases are your friend
Elasticsearch offers the ability to alias indices. This is a really cool feature designed to let you switch between indices atomically for zero downtime.
Imagine that you have some index, products, that has a particular mapping. You decide that you don’t really like the analyzers being used, or you want to add a new field. Mapping changes generally require a reindex to take full effect (for example, a newly added field only applies to documents indexed after the field was created, so searches relying on it will return partial results). Reindexing in place means purging the existing data and recreating it from scratch, which can mean anywhere from a couple of minutes to many hours of downtime or incomplete search results in production.
Elasticsearch works around this kind of problem with aliases. Essentially, you create a new index (say, products_v2) with the new mapping and populate it with your data. Once it has caught up with the original index, you point an alias called products at products_v2. (In practice this means your application should have been querying through the alias from day one, or the original products index has to be dropped first, since an alias can’t share a name with a live index.) The switch happens atomically, so it’s instantaneous as far as your users are concerned.
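To make that concrete, here’s a minimal sketch of the pattern over the REST API using Python’s requests library. The endpoint, the products_v1 source index, and the field names are assumptions for illustration, and the exact request bodies vary by Elasticsearch version (older releases expect a type name in the mapping and lack the _reindex API, so you’d repopulate from your source of truth instead).

```python
# Hedged sketch of the reindex-and-swap pattern; names and endpoint are hypothetical.
import requests

ES = "http://localhost:9200"

# 1. Create the new index with the revised mapping.
requests.put(f"{ES}/products_v2", json={
    "mappings": {
        "properties": {
            "name": {"type": "text"},    # hypothetical existing field
            "sku": {"type": "keyword"},  # hypothetical new field
        }
    }
})

# 2. Copy the existing documents into the new index.
requests.post(f"{ES}/_reindex", json={
    "source": {"index": "products_v1"},
    "dest": {"index": "products_v2"},
})

# 3. Swap the alias in one atomic call: clients querying "products" never
#    see an empty or half-populated index.
requests.post(f"{ES}/_aliases", json={
    "actions": [
        {"remove": {"index": "products_v1", "alias": "products"}},
        {"add": {"index": "products_v2", "alias": "products"}},
    ]
})
```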
3. Custom routing is a two-edged sword
By default, Elasticsearch will distribute documents randomly around the cluster. It does this for a few reasons, but the bottom line is that it ensures an even distribution of load across all shards. When a search request comes in, Elasticsearch sends it to every shard in the index. The shards process the request and return their results to the coordinating node, where they’re collated, sorted and returned.
This process is fine when there are only a couple of shards, but what about larger indices? The larger the cluster in terms of nodes and shards, the more overhead this process creates. Elasticsearch overcomes this by letting you supply a routing value when a document is indexed; every document with the same routing value lands on the same shard. This means you can apply business logic within your app to reduce the overhead of querying many shards.
For example, imagine that you have some index called media with three types: photos, videos and music. Let’s also say for simplicity that the media index is made up of 3 primary shards. You could configure your app to use the document type as the routing value, so that all photo documents land on one shard, all video documents on another, and all music documents on a third. Then when a user wants to scope their search to, say, documents of the music type, you can bypass the broadcast and collation steps entirely by passing the same routing value and searching just that one shard.
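Here’s a minimal sketch of what that looks like over the REST API with Python’s requests library, using the media index and music type from the example above; the endpoint, document fields, and pre-7.x-style type URLs are assumptions for illustration.

```python
# Hedged sketch of custom routing; endpoint and field values are hypothetical.
import requests

ES = "http://localhost:9200"

# Index a music document, routing it by its type so that every "music"
# document hashes to the same shard.
requests.put(
    f"{ES}/media/music/1",             # pre-7.x URL style: /index/type/id
    params={"routing": "music"},
    json={"title": "Example Track", "artist": "Example Artist"},
)

# Search with the same routing value; Elasticsearch queries only the shard
# that holds music documents instead of broadcasting to all of them.
resp = requests.post(
    f"{ES}/media/music/_search",
    params={"routing": "music"},
    json={"query": {"match": {"artist": "Example Artist"}}},
)
print(resp.json()["hits"]["total"])    # an integer on the older versions assumed here
```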
The caveat here is that custom routing voids the guarantee of balanced load. Let’s say you have 3 nodes with 100GB of disk capacity each and 90GB of documents of the music type. With the aforementioned routing scheme, one node would be sitting at 90% disk capacity while the other two are overprovisioned. The default scheme would have spread that 90GB over 3 nodes instead of just 1, in which case all nodes would have ample resources.
This leads me to tip 4…
4. Watch your data footprint
I mentioned in tip 1 why users should avoid adding too many shards, but even with a small number of shards, there can be some gotchas. Many users fail to take into account the impact of replication on system resources and disk availability in particular.
Basically, each replica increases the footprint of your data by the size of the corresponding primary shard. For example, if you have an index with one 10GB primary shard, then each replica of that index will require another 10GB of capacity. Many new users don’t take this multiplicative effect into account when capacity planning. “Our cluster has a total of 120GB and we only have 30GB of primary data, so we’re overprovisioned.”
Well, if that primary data has a replication factor of 2, the actual disk usage will be 90GB (30GB of primaries plus two 30GB sets of replicas). Further, this hypothetical user only has room for about 10GB of additional primary data before completely exhausting the cluster’s disk space, since every new gigabyte costs three once replicated. If disk capacity is reached in production, users will start to see some really crazy errors.
For one, write operations will fail without recovering gracefully, leaving critical data files corrupted. Lucene has some tools for repairing corrupted segments, but repair comes at the cost of data loss, and how much data is lost depends entirely on how lucky you are.
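To make the arithmetic concrete, here’s a tiny back-of-the-envelope calculator that mirrors the hypothetical cluster above; the numbers come from the text, not from a real deployment.

```python
# Hedged sketch: replication multiplies your data footprint.

def total_footprint_gb(primary_gb, replicas):
    """Disk consumed once every primary shard is copied `replicas` times."""
    return primary_gb * (1 + replicas)

cluster_capacity_gb = 120   # total disk across the cluster
primary_gb = 30             # size of the primary data
replicas = 2                # replication factor

used = total_footprint_gb(primary_gb, replicas)            # 90 GB on disk
headroom = (cluster_capacity_gb - used) / (1 + replicas)   # ~10 GB of new primary data

print(f"disk used: {used} GB; room for about {headroom:.0f} GB more primary data")
```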
5. Deleted documents aren’t deleted
Elasticsearch is based on Lucene, and Lucene doesn’t delete documents the way most users expect. Typically, when a file is deleted from a computer, the operating system simply unlinks it and marks the space as available, and that space is ceded back to the system almost immediately. Not so with Lucene.
Without going into the nitty-gritty, data in Lucene is stored in a series of segment files, which are occasionally merged into larger files. When a document is deleted, the segment file is updated to mark it as deleted, but the original data remains part of the segment (and thus continues to occupy disk space that can’t yet be reclaimed). When that segment is eventually merged with another one, the deleted documents are not copied into the new segment. So deleted documents do work their way out of the system over time, but it’s not instant.
This is a really common gotcha for new users running Elasticsearch inside a VM. They notice that an index is growing very large and they’re nearly out of disk space, so they delete a ton of documents expecting to free up space, only to find that, nope, disk usage remains unchanged even though the deleted documents are no longer accessible.
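If you want to see this for yourself, the index stats API reports how many deleted documents are still sitting in segments waiting to be merged away. A minimal sketch with Python’s requests library; the endpoint and index name are assumptions.

```python
# Hedged sketch: check how many deleted docs are still occupying segments.
import requests

ES = "http://localhost:9200"

stats = requests.get(f"{ES}/media/_stats/docs").json()
docs = stats["indices"]["media"]["primaries"]["docs"]
print(f'live docs: {docs["count"]}, deleted but not yet merged away: {docs["deleted"]}')
```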
While Elasticsearch does have internal triggers for merging segment files, they’re not always enough. Fortunately, Elasticsearch offers a workaround via the _optimize API (renamed _forcemerge in later releases). This API lets you force a merge that cleans up deleted documents and frees space. There are a couple of important things to consider:
- It’s a really bad idea to run this against all indices at once. Run it against one index at a time, preferably from smallest to largest.
- The only_expunge_deletes parameter is a conservative optimization: by default it will only merge away a segment if 10% or more of the documents in it are marked as deleted.
- The max_num_segments parameter lets you specify how many segment files should remain after the optimization is complete. The docs say to set this to 1 for a full optimization. That’s true, but it’s terrible advice: merges are terribly resource-intensive and need a lot of spare capacity to complete.
- A more conservative strategy for a full optimization (especially relevant when resources are constrained) is to merge to half the current number of segment files. So if your index has 20 segment files, optimize to 10, then repeat for 5, then 2, then 1 (see the sketch after this list).
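Here’s a minimal sketch of that stepped approach over the REST API with Python’s requests library. The endpoint and index name are assumptions, and on Elasticsearch 2.1 and later you’d hit _forcemerge instead of _optimize.

```python
# Hedged sketch: halve the per-shard segment count on each pass instead of
# merging straight down to 1, so each merge's disk/CPU spike stays manageable.
import requests

ES = "http://localhost:9200"
INDEX = "media"  # hypothetical index name

def max_segments_per_shard(index):
    """Largest segment count across the index's shard copies
    (max_num_segments applies per shard)."""
    resp = requests.get(f"{ES}/{index}/_segments").json()
    shards = resp["indices"][index]["shards"]
    return max(
        len(copy["segments"])
        for copies in shards.values()
        for copy in copies
    )

target = max_segments_per_shard(INDEX) // 2
while target >= 1:
    requests.post(f"{ES}/{INDEX}/_optimize", params={"max_num_segments": target})
    target //= 2
```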
That’s it for now!
Hope you enjoyed these tips. They were cultivated from a long career working with Elasticsearch and teaching it to harried developers. Down the road I’d like to follow up with some more advanced tips and strategies, so stay tuned!