A note on an error that is not very common in 2.x. It only showed up on a test cluster; in production hardly anyone restarts the NameNodes, especially once HA is in place.
The scenario: NameNode HA had been set up on 2.x. Accessing HDFS through the HA nameservice URI failed, both NameNodes appeared to be in standby, the JobHistoryServer would not start, and neither would the HBase Master. The fault is actually a simple one. The JobHistoryServer log:
2014-06-05 17:20:09,548 FATAL hs.JobHistoryServer (JobHistoryServer.java:launchJobHistoryServer(158)) - Error starting JobHistoryServer
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Error creating done directory: [hdfs://HAtest:8020/history/done]
    at org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.serviceInit(HistoryFileManager.java:503)
    at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
    at org.apache.hadoop.mapreduce.v2.hs.JobHistory.serviceInit(JobHistory.java:88)
    at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
    at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108)
    at org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer.serviceInit(JobHistoryServer.java:93)
    at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
    at org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer.launchJobHistoryServer(JobHistoryServer.java:155)
    at org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer.main(JobHistoryServer.java:165)
Caused by: java.net.ConnectException: Call From slave.hadoop/192.168.118.134 to slave.hadoop:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
    at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:783)
    at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:730)
    at org.apache.hadoop.ipc.Client.call(Client.java:1351)
    at org.apache.hadoop.ipc.Client.call(Client.java:1300)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
    at com.sun.proxy.$Proxy9.getFileInfo(Unknown Source)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:651)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
    at com.sun.proxy.$Proxy10.getFileInfo(Unknown Source)
    at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1679)
    at org.apache.hadoop.fs.Hdfs.getFileStatus(Hdfs.java:124)
    at org.apache.hadoop.fs.FileContext$14.next(FileContext.java:1116)
    at org.apache.hadoop.fs.FileContext$14.next(FileContext.java:1112)
    at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
    at org.apache.hadoop.fs.FileContext.getFileStatus(FileContext.java:1112)
    at org.apache.hadoop.fs.FileContext$Util.exists(FileContext.java:1528)
    at org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.mkdir(HistoryFileManager.java:556)
    at org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.serviceInit(HistoryFileManager.java:500)
    ... 8 more
Caused by: java.net.ConnectException: Connection refused
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:735)
    at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:529)
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:493)
    at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:547)
    at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:642)
    at org.apache.hadoop.ipc.Client$Connection.access$2600(Client.java:314)
    at org.apache.hadoop.ipc.Client.getConnection(Client.java:1399)
    at org.apache.hadoop.ipc.Client.call(Client.java:1318)
    ... 28 more
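To confirm the picture, check which daemons are actually running and ask each NameNode for its HA state. A minimal check, assuming the two NameNodes are registered as nn1 and nn2 (the IDs come from dfs.ha.namenodes.<nameservice> and may well differ on your cluster):

# On each NameNode host: in this failure mode jps shows NameNode and
# JournalNode, but no QuorumPeerMain (ZooKeeper) and no
# DFSZKFailoverController (ZKFC).
jps

# Ask each NameNode for its HA state; here both answer "standby".
hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2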
An ls against HDFS failed too. On inspection, the cause was that ZooKeeper had never been started. HDFS HA depends on ZooKeeper: besides the JournalNodes, a ZKFC (ZKFailoverController) has to run on each NameNode host. The ZKFC is what detects a dead active NameNode and fails over to the hot standby. Without it, the NameNodes and JournalNodes start normally, but nothing elects which NameNode should actually serve the namespace, so both sit in standby and the entire HDFS is inaccessible. (The same election and fencing machinery is also what guards against split-brain, i.e. both NameNodes going active at once.)
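So the fix is simply to bring up the ZooKeeper quorum and then the ZKFC daemons. A minimal recovery sketch, assuming automatic failover is already configured (dfs.ha.automatic-failover.enabled=true and ha.zookeeper.quorum set) and only the processes are missing; script paths depend on your installation:

# 1. Start ZooKeeper on every quorum host.
zkServer.sh start

# 2. First-time setup only: create the failover znode in ZooKeeper.
#    Skip this if the cluster has been initialized before.
hdfs zkfc -formatZK

# 3. Start a ZKFC on each NameNode host; the two ZKFCs then elect
#    one NameNode active and hold the other in standby.
hadoop-daemon.sh start zkfc

# 4. Verify HDFS is reachable again through the nameservice URI.
hdfs dfs -ls /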
That also makes the HBase Master failure easy to explain: HBase stores everything in HDFS, so with no active NameNode it cannot start either.
One more note: on a large cluster, the tickTime of the ZooKeeper ensemble backing the ZKFCs should be set higher, to avoid the network congestion that too-short sync/heartbeat intervals can cause.
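Strictly speaking, tickTime is a ZooKeeper server setting in zoo.cfg rather than a ZKFC one; the ZKFC's own session timeout is ha.zookeeper.session-timeout.ms in core-site.xml (default 5000 ms) and must fall inside the window ZooKeeper allows, namely 2*tickTime to 20*tickTime, so the two are usually tuned together. A sketch of the relevant knobs; the values are illustrative assumptions, not recommendations:

# zoo.cfg -- tickTime is ZooKeeper's basic time unit in milliseconds;
# heartbeats run on it, and session timeouts are bounded by
# 2*tickTime .. 20*tickTime. Raising it cuts heartbeat traffic on a
# big cluster at the cost of slower failure detection.
tickTime=3000
initLimit=10
syncLimit=5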