Asquera GmbH Blog

what we blog

ElasticSearch pre-flight checklist

I attended the ElasticSearch base training this week and took away lots of very interesting details. I can definitely recommend taking them, the training made a lot of things clearer to me. One of my most immediate benefits was that it allowed me to compile a little "pre-flight" checklist of things that should never be skipped when bringing an ElasticSearch cluster to production. To ensure that I don't forget it and you don't start into your ElasticSearch life with huge amounts of pain, I'll document it here. Thanks to bascht for proofreading and fixes!

Basic Configuration

It is essential that you change a few configuration parameters before deploying your ElasticSearch setup. The most important are:

  • path.data: This initially points to $ES_HOME/data. Thats very convenient for development, but very brittle in production. It means that you either have to carefully overwrite your old $ES_HOME on updates or take great care that you migrate the data directory between your old and new installations. Suggestion: /var/data/elasticsearch.
  • path.logs: While we're at it, also change your log path for similar reasons. While log loss is not exactly data loss, its nothing you want to bother with. Suggestion: /var/log/elasticsearch.
  • cluster.name: Do not call your cluster elasticsearch. This means that any node in default configuration will happily join your cluster and start pulling in shards if it happens to discover one of your nodes by accident. Use a telling name.
  • node.name: Change all the node names! Marvel heroes are awesome, but which one of your nodes was "Thor" again? Which rack was it in? Use a telling name as well. For example: rack1.vm1.node1.

All those options can be configured using an environment variable. Especially node names shouldn't be configured in the config file, as it means that the config file diverges on every machine. Consider configuring the cluster name through the environment as well.

File descriptors

ElasticSearch uses quite a few of file descriptors, both for Lucene indexes and Netty. Raise the number of available file descriptors to the user running ElasticSearch, for example by adding the following to /etc/security/limits.d/10-elasticsearch.conf on RedHat Linux:

elasticsearch                soft    nofile          65535
elasticsearch                hard    nofile          65535

Failing to do so leads to weird errors. For example, ElasticSearch might throw FileNotFound exceptions although the file is there. Be aware that the default limit of 1024 is just enough to allow ElasticSearch not to exhibit strange behaviour until you apply moderate pressure.

Heap size and memory

ElasticSearch allows you to easily set the java heap size by setting the $ES_HEAP_SIZE environment variable. The rule of thumb is that the ElasticSearch heap should have around 50% of the available memory on the machine. ElasticSearch uses system caches heavily, so you should leave enough memory for them.

Discovery

If you are not running on a platform that has special discovery options (e.g. EC2), you will end up using the built-in Zen discovery mechanism. It comes in two flavours: multicast and unicast. Multicast is convenient, but only works in settings where it is actually allowed. Also, it potentially leads to a lot of chatter with no use when you have a lot of nodes in the same network and the potential for accidental joins is present. Only switch it on if you really know what you're doing!

  • discovery.zen.ping.multicast.enabled: false

And then switch on unicast by configuring a list of unicast hosts. Be aware that forgetting to switch off multicast means that both are enabled and you might make things worse!

  • discovery.zen.ping.unicast.hosts: ["host1", "host2:port", "host3[portX-portY]"]

Prefer host names over IP addresses here, as it allows you to replace those nodes later. This is the list of nodes that handle discovery: you don't have to include your whole cluster here or redeploy the configuration for every added node. Just make sure that at least one of those nodes is reachable when a new node starts up.

Avoiding split-brain

Split-brain is a catastrophic event for ElasticSearch clusters. Once a cluster is split into two and and each side has elected a new master, they will diverge and cannot be rejoined without killing one half of the cluster. Also, as catastrophes go, you will not run into this at your first day of operations. Luckily, this scenario can and should be avoided.

  • discovery.zen.minimum_master_nodes: set this to at least N/2+1 on clusters with N > 2, where N is the number of master nodes in the cluster.

minimum_master_nodes ensures that a node has to see that number of master eligible nodes (nodes that could potentially take over as masters) to be operational. Otherwise, the node goes into an error state and refuses to accept requests, as it considers itself to be split off from the cluster.

In the simple example where all nodes are potential masters, the suggested setting means that a node considers itself cut off if it doesn't reach more than half of the cluster.

In a two-node-environment, leave the setting as is, as every node can act as a backup for every other node, so the node only has to see itself. Otherwise, the cluster would become inoperable if one node fails.

This setting can (and should) be set at runtime, to avoid the number dropping below the recommended threshold.

Avoiding shard shuffle on recovery

Another problem is constant rebalancing when a cluster comes up. As ElasticSearch does not know how many nodes will finally be in the cluster, it tries to balance all known primary shards and replicas to the known machines. When another node suddenly joins the cluster, this whole strategy will change. To avoid the traffic and the overhead of this, ElasticSearch can be configured with a minimum number of nodes it should expect before starting the recovery phase. There are 3 settings to consider:

  • gateway.recover_after_nodes: Recover only after the given number of nodes have joined the cluster. Can be seen as "minimum number of nodes to attempt recovery at all".
  • gateway.recover_after_time: Time to wait for additional nodes after recover_after_nodes is met.
  • gateway.expected_nodes: Inform ElasticSearch how many nodes form a full cluster. If this number is met, start up immediately.

While these options are basically just hints for ElasticSearch, how its environment looks, but will make you cluster run smoother and avoid unpredictability during startup time.

Tagging nodes for uptime and profit

For every piece of the outside world that you explain to ElasticSearch, there are immediate gains in reliability. ElasticSearch does a good job at using its available resources in a good way, but cannot cover you in all cases. But consider the following setup: You have two physical machines A and B and 4 virtual machines on them (A1,A2,B1 and B2). Each of them runs one ElasticSearch node. ElasticSearch decides to put one primary shard A1 and its replica on A2. This is completely fine from the perspective of ElasticSearch: it doesn't know about the physical environment. Now, imagine machine A failing or restarting: this wipes the whole shard from the cluster, as it ran both virtual machines that hosted the shards. Luckily, with a bit of explaining, this can be avoided. First of all, assign tags to nodes. Tags can have any name, but are attached to the node configuration:

node.machine: machine1

And tell ElasticSearch that it should be aware of this attribute:

cluster.routing.allocation.awareness.attributes: machine

This tells ElasticSearch to never put 2 shards on nodes with the same value for the tag machine.

The same goes for situations where you have multiple physical environment parameters affecting your cluster - like having 2 racks that could lose the connection to each other. Just expand the set of tags:

node.machine: machine1
node.rack: rack1
cluster.routing.allocation.awareness.attributes: machine,rack

Consider setting those variables using environment variables for easier config management:

node.rack: ${RACK_LOCATION}

Linux only: mlockall and OOM-Killer

Sharp drops in performance are usually associated with swapping. You can configure ElasticSearch to instruct the operating system to never swap.

  • bootstrap.mlockall: Set to true to instruct the operating system to never swap the ElasticSearch process

This configuration option might need some legwork, like giving the running user the right to actually lock memory. Have a look at the log, ElasticSearch will log an error if it cannot lock memory although being instructed to do so.

Assuming that the only thing your machine does is running ElasticSearch and you are paranoid like me, also consider tuning the OOM-Killer (Out Of Memory-Killer) to never kill ElasticSearch, even under the worst memory pressure. See this article about details.

Setup monitoring and things

I won't give any recommendations here. I assume that you have some kind of monitoring service like nagios running and some kind of logging policy. Make sure that ElasticSearch plays with them. Use one of the many Dashboards to make things visible. ElasticSearch has an incredibly detailed stats API that allows you to monitor all the details that I just described. Also, most of the settings can be changed at runtime, changing cluster behaviour without restarts.

Discuss this post on Hacker News