ZooKeeper in Production - Hadoop

In production, you should run ZooKeeper in replicated mode. Here we will cover some of the considerations for running an ensemble of ZooKeeper servers. However, this section is not exhaustive, so you should consult the ZooKeeper Administrator’s Guide for detailed up-to-date instructions, including supported platforms, recommended hardware, maintenance procedures, and configuration properties.

Resilience and Performance

ZooKeeper machines should be located to minimize the impact of machine and network failure. In practice, this means that servers should be spread across racks, power supplies, and switches, so that the failure of any one of these does not cause the ensemble to lose a majority of its servers. ZooKeeper replies on having low-latency connections between all of the servers in the ensemble, so for that reason an ensemble should beconfined to a single data center.

ZooKeeper has the concept of an observer node, which is like a nonvoting follower. Since they do not participate in the vote for consensus during write requests, observers allow a ZooKeeper cluster to improveread performance without hurting write performance. In addition, a ZooKeeper cluster can span data centers by placing the voting members in one data center and observers in the other.

ZooKeeper is a highly available system, and it is critical that it can perform its functions in a timely manner. Therefore, ZooKeeper should run on machines that are dedicated to ZooKeeper alone. Having other applications contend for resources can cause Zoo- Keeper’s performance to degrade significantly.

Configure ZooKeeper to keep its transaction log on a different disk drive from its snapshots. By default, both go in the directory specified by the dataDir property, but by specifying a location for dataLogDir, the transaction log will be written there. By having its own dedicated device (not just a partition), a ZooKeeper server can maximize the rate at which it writes log entries to disk, which is does sequentially, without seeking.Since all writes go through the leader, write throughput does not scale by adding servers, so it is crucial that writes are as fast as possible.

If the process swaps to disk, performance will suffer adversely. This can be avoided by setting the Java heap size to less than the amount of physical memory available on the machine. The ZooKeeper scripts will source a file called java.env from its configuration directory, and this can be used to set the JVMFLAGS environment variable to set the heap size (and any other desired JVM arguments).


Each server in the ensemble of ZooKeeper servers has a numeric identifier that is unique within the ensemble, and must fall between 1 and 255. The server number is specified in plain text in a file named myid in the directory specified by the dataDir property. Setting each server number is only half of the job. We also need to give all the servers all the identities and network locations of the others in the ensemble. The ZooKeeperconfiguration file must include a line for each server, of the form:

The value of n is replaced by the server number. There are two port settings: the first is the port that followers use to connect to the leader, and the second is used for leader election. Here is a sample configuration for a three-machine replicated ZooKeeper ensemble:

Servers listen on three ports: 2181 for client connections; 2888 for follower connections, if they are the leader; and 3888 for other server connections during the leader election phase. When a ZooKeeper server starts up, it reads the myid file to determine which server it is, then reads the configuration file to determine the ports it should listen on, as well as the network addresses of the other servers in the ensemble.

Clients connecting to this ZooKeeper ensemble should use zookeeper1:2181,zoo keeper2:2181,zookeeper3:2181 as the host string in the constructor for the ZooKeeper object.

In replicated mode, there are two extra mandatory properties: initLimit and syncLimit, both measured in multiples of tickTime. initLimit is the amount of time to allow for followers to connect to and sync with theleader. If a majority of followers fail to sync within this period, then the leader renounces its leadership status and another leader election takes place. If this happens often (and you can discover if this is the case because it is logged), it is a sign that the setting is too low.

syncLimit is the amount of time to allow a follower to sync with the leader. If a follower fails to sync within this period, it will restart itself. Clients that were attached to this follower will connect to another one.These are the minimum settings needed to get up and running with a cluster of Zoo- Keeper servers. There are, however, more configuration options, particularly for tuning performance, documented in the ZooKeeper Administrator’s Guide.

All rights reserved © 2018 Wisdom IT Services India Pvt. Ltd DMCA.com Protection Status

Hadoop Topics