Zookeeper Leader election and timeouts

My cluster of 3 nodes had been running fine for a while until one of the nodes died. This node was the LEADER. I assumed the cluster would still be fine since 2/3 nodes were still healthy. However, it looked like it was unable to elect a leader and set up a quorum properly.

Here’s what I was getting:

2014-11-11 12:09:36,101 [myid:1] - WARN  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception when following the leader
java.net.ConnectException: Connection refused
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:382)
        at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:241)
        at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:228)
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:365)
        at java.net.Socket.connect(Socket.java:527)
        at org.apache.zookeeper.server.quorum.Learner.connectToLeader(Learner.java:225)
        at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:71)
        at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:786)

and

2014-11-11 12:09:36,102 [myid:1] - INFO  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:Follower@166] - shutdown called
java.lang.Exception: shutdown Follower
        at org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166)
        at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:790)

There’s a configuration parameter, initLimit, which defines the amount of time (in ticks) that the initial synchronization phase can take. This value defaults to 10 in Zookeeper. It turns out that my cluster had enough data that the initial sync phase took longer than the initLimit specified. Increasing initLimit to about 50 fixed the issue, though I do wonder about the side effects of a much higher initLimit value on the cluster.

More details after searching the net:

What happened here was that the server being elected as leader did go through the leader election process successfully. It then started to send a snapshot of its state to the follower; however, before that transfer could complete and the follower could finish syncing to the leader's state, the initLimit timeout was reached and the leader thread decided it had to give up. So increasing initLimit to a value that allowed the snapshot transfer to complete fixed the problem.
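
For reference, here is a minimal sketch of the relevant zoo.cfg settings, assuming the common tickTime of 2000 ms; the exact values are illustrative rather than a recommendation:

# zoo.cfg (sketch)
# length of a single tick, in milliseconds
tickTime=2000
# initial sync (including the snapshot transfer) may take up to 50 x 2000 ms = 100 seconds
initLimit=50
# once synced, a follower may lag the leader by at most 5 x 2000 ms = 10 seconds
syncLimit=5

The flip side of a very large initLimit is that a follower that is genuinely stuck takes that much longer to be given up on during startup, so it probably makes sense to raise it only as far as the snapshot transfer actually needs.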

Zookeeper Error Guide: Part 1

The last few weeks with Zookeeper/Curator have been a good experience. I am going to maintain a running list of errors that come up with Zookeeper and how I fixed or stepped around them.

Running out of connections
WARN [NIOServerCxn.Factory: 0.0.0.0/0.0.0.0:2181:NIOServerCnxn$Factory@352] - Too many connections from /ab.cd.ef.ghi - max is 60

This indicates that a single client host has exceeded the per-IP connection limit enforced by the server (the default is 60). In your zookeeper.cfg, set the following:

maxClientCnxns = 500
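
Note that maxClientCnxns is a per-client-IP limit rather than a global cap, and setting it to 0 disables the limit entirely. To see which hosts are actually holding the connections before raising the limit, the cons four-letter command is handy; a sketch, assuming nc is available and the server is listening on the default port 2181:

# lists every client connection held by this server, including the client IP for each
echo cons | nc localhost 2181
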
Unable to load database – disk corruption
FATAL Unable to load database on disk !  java.io.IOException: Failed to process transaction type: 2 error: KeeperErrorCode = NoNode for  at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:152)!

This typically implies either disk corruption on your server or that the process was restarted while it was snapshotting. There are some bugs filed against Zookeeper in this area. The easiest fix, provided the other nodes in your cluster are running, is to wipe out the version-2/ directory on the affected node. That node will then rebuild its state from the other nodes.
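A rough sketch of that recovery, assuming dataDir points at /var/lib/zookeeper; substitute whatever your zoo.cfg actually uses, and treat dataLogDir the same way if you keep it separate:

# stop the broken node first, then move the corrupted snapshot/txnlog directory aside
mv /var/lib/zookeeper/version-2 /var/lib/zookeeper/version-2.corrupt
# leave the myid file in place, then restart the node;
# it will pull a fresh snapshot from the current leader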

Unable to load database – Unreasonable length
FATAL Unable to load database on disk java.io.IOException: Unreasonable length = 1048583 at org.apache.jute.BinaryInputArchive.readBuffer(BinaryInputArchive.java:100)

Some versions of Zookeeper allowed a client to write data larger than the maximum size the server is willing to read back (the default limit is just under 1 MB). Increasing the max buffer size via a JVM property fixes the issue:

-Djute.maxbuffer=xxx
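
The property has to be raised consistently on the servers and on the clients, otherwise one side will still reject the payload. A sketch of how it might be passed to the stock scripts, assuming zkEnv.sh sources conf/java.env and honors JVMFLAGS, with 4 MB as an illustrative size:

# conf/java.env (picked up by the stock zkEnv.sh/zkServer.sh scripts)
# raise the buffer limit to 4 MB; the same value must be set on every server and client
export JVMFLAGS="-Djute.maxbuffer=4194304 $JVMFLAGS"
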
Failure to follow the leader
WARN org.apache.zookeeper.server.quorum.Learner: Exception when following the leader java.net.SocketTimeoutException: Read timed out

This is observed when the system is under stress, which might be caused by disk contention, network delays, etc. If you cannot reduce the load on the system, try increasing your hardware spec. On EC2, I switched over to High I/O instances and the response was much better.
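
If neither the load nor the hardware can change, the follower-to-leader read timeout can also be loosened, since as far as I can tell it is derived from tickTime * syncLimit; a sketch with illustrative values, the trade-off being that a genuinely dead peer takes longer to be detected:

# zoo.cfg (sketch)
tickTime=2000
# a follower that cannot hear from the leader within 10 x 2000 ms = 20 seconds is dropped
syncLimit=10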