Zookeeper Error Guide: Part 1

The last few weeks with Zookeeper/Curator have been a good experience. I am going to maintain a running list of errors that come up with Zookeeper and how I fixed them or stepped around them.

Running out of connections
WARN [NIOServerCxn.Factory: 0.0.0.0/0.0.0.0:2181:NIOServerCnxn$Factory@352] - Too many connections from /ab.cd.ef.ghi - max is 60

This means the server is refusing new connections because a single client IP has hit the per-client connection limit (maxClientCnxns, 60 by default). In your ZooKeeper config file (zoo.cfg), raise the limit:

maxClientCnxns=500
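
If the limit keeps getting hit, the real culprit is usually the client opening a new connection per operation instead of reusing one. Since I am on Curator, the pattern is to share a single CuratorFramework (and therefore a single ZooKeeper session) across the process. A minimal sketch, assuming Apache Curator; the ZkClientHolder class name, connection string, and retry settings are placeholders:

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class ZkClientHolder {
    // One client (and therefore one ZooKeeper session) shared by the whole process.
    private static final CuratorFramework CLIENT = CuratorFrameworkFactory.newClient(
            "zk1:2181,zk2:2181,zk3:2181",            // placeholder connection string
            new ExponentialBackoffRetry(1000, 3));   // base sleep 1s, up to 3 retries

    static {
        CLIENT.start();
    }

    public static CuratorFramework get() {
        return CLIENT;
    }
}
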
Unable to load database – disk corruption
FATAL Unable to load database on disk java.io.IOException: Failed to process transaction type: 2 error: KeeperErrorCode = NoNode for  at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:152)

This typically means either disk corruption on the server or that the process was restarted while snapshotting; there are some ZooKeeper bugs filed in this area. If the other nodes in your ensemble are up, the easiest fix is to wipe out the failing node's version-2/ directory and restart it. The node then rebuilds its state from the other nodes.

Unable to load database – Unreasonable length
FATAL Unable to load database on disk java.io.IOException: Unreasonable length = 1048583 at org.apache.jute.BinaryInputArchive.readBuffer(BinaryInputArchive.java:100)

Some versions of ZooKeeper let a client write znode data larger than the server's maximum readable size (jute.maxbuffer, just under 1 MB by default). Raising the buffer size via a JVM system property lets the database load again; set it consistently on the servers and clients:

-Djute.maxbuffer=xxx
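
It also helps to catch oversized writes on the client before they ever reach the server. A rough guard like the sketch below, assuming a Curator client (safeSetData and MAX_ZNODE_BYTES are hypothetical names), rejects payloads close to the ~1 MB default limit:

import org.apache.curator.framework.CuratorFramework;

// The default jute.maxbuffer is just under 1 MB, so use a conservative client-side cap.
private static final int MAX_ZNODE_BYTES = 1000 * 1024;

public void safeSetData(CuratorFramework client, String path, byte[] data) throws Exception {
    if (data.length > MAX_ZNODE_BYTES) {
        throw new IllegalArgumentException("znode payload of " + data.length
                + " bytes is too close to the jute.maxbuffer limit for " + path);
    }
    client.setData().forPath(path, data); // a plain ZooKeeper setData call hits the same limit
}
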
Failure to follow the leader
WARN org.apache.zookeeper.server.quorum.Learner: Exception when following the leader java.net.SocketTimeoutException: Read timed out

This shows up when the system is under stress, typically from disk contention or network delays. If you cannot reduce the load on the system, consider beefier hardware. On EC2, I switched over to High I/O instances and the response was much better.

Java Garbage Collection Statistics

Anyone who has done even a moderate-sized project in Java knows about the GC hell that comes with it. It's simple enough to collect GC statistics from the beginning rather than bolting them on later, after the process has already run into GC issues. Here's the piece I typically add to the process; when the process shows signs of GC trouble, I just turn up the log4j level to see the output.


import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Assumes a log4j Logger named "log" already exists in the enclosing class.

private final ScheduledThreadPoolExecutor executor = new ScheduledThreadPoolExecutor(1); // single thread for logging

// In your initialization method - schedule the logger to run every minute.
executor.scheduleWithFixedDelay(new GCStatLogger(), 60000, 60000, TimeUnit.MILLISECONDS);

private class GCStatLogger implements Runnable {

    @Override
    public void run() {
        logGCStats();
    }

    private void logGCStats() {
        long gcCount = 0;
        long gcTime = 0;
        // Sum counts and times across all collectors (young and old generation).
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount();
            if (count >= 0) {   // -1 means the collector does not report a count
                gcCount += count;
            }

            long time = gc.getCollectionTime();
            if (time >= 0) {    // -1 means the collector does not report a time
                gcTime += time;
            }
        }

        log.debug("Total Garbage Collections: " + gcCount);
        log.debug("Total Garbage Collection Time (ms): " + gcTime);
    }
}
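
One small detail: the executor above uses non-daemon threads by default, so stop it when the process shuts down (for example in your existing shutdown hook) or the JVM will not exit cleanly.

// In the process shutdown path, stop the GC-stat thread.
executor.shutdown();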