记录一下在2.x里面不会很常见的报错。只是在测试集群中发生,生产集群大概很少有人会去重启Namenode吧,特别是做了HA的。

场景是在2.x里做好了Namenode HA,以Namespace URI方式访问HDFS时,报错,然后两个Namenode貌似都是standby,然后历史任务服务器无法启动,HBase的Master也无法启动。其实这个故障很简单。

2014-06-05 17:20:09,548 FATAL hs.JobHistoryServer (JobHistoryServer.java:launchJobHistoryServer(158)) - Error starting JobHistoryServer
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Error creating done directory: [hdfs://HAtest:8020/history/done]
    at org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.serviceInit(HistoryFileManager.java:503)
    at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
    at org.apache.hadoop.mapreduce.v2.hs.JobHistory.serviceInit(JobHistory.java:88)
    at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
    at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108)
    at org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer.serviceInit(JobHistoryServer.java:93)
    at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
    at org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer.launchJobHistoryServer(JobHistoryServer.java:155)
    at org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer.main(JobHistoryServer.java:165)
Caused by: java.net.ConnectException: Call From slave.hadoop/192.168.118.134 to slave.hadoop:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
    at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:783)
    at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:730)
    at org.apache.hadoop.ipc.Client.call(Client.java:1351)
    at org.apache.hadoop.ipc.Client.call(Client.java:1300)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
    at com.sun.proxy.$Proxy9.getFileInfo(Unknown Source)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:651)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
    at com.sun.proxy.$Proxy10.getFileInfo(Unknown Source)
    at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1679)
    at org.apache.hadoop.fs.Hdfs.getFileStatus(Hdfs.java:124)
    at org.apache.hadoop.fs.FileContext$14.next(FileContext.java:1116)
    at org.apache.hadoop.fs.FileContext$14.next(FileContext.java:1112)
    at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
    at org.apache.hadoop.fs.FileContext.getFileStatus(FileContext.java:1112)
    at org.apache.hadoop.fs.FileContext$Util.exists(FileContext.java:1528)
    at org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.mkdir(HistoryFileManager.java:556)
    at org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.serviceInit(HistoryFileManager.java:500)
    ... 8 more
Caused by: java.net.ConnectException: Connection refused
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:735)
    at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:529)
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:493)
    at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:547)
    at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:642)
    at org.apache.hadoop.ipc.Client$Connection.access$2600(Client.java:314)
    at org.apache.hadoop.ipc.Client.getConnection(Client.java:1399)
    at org.apache.hadoop.ipc.Client.call(Client.java:1318)
    ... 28 more

 

用ls命令无法访问HDFS,经检查,是没有启动ZooKeeper导致的,HDFS HA需要用Zookeeper做支持,除了需要JournalNode之外,还需要在两个Namenode上部署ZKFC,也就是Zoo Keeper Failed Controller。Namenode热备的失效恢复由ZKFC完成,如果没启动ZKFC,Namenode和JournalNode可以启动,但无法区分哪个才是主要提供Namespace服务的服务器,应该是产生了Brain split,也就是脑裂,于是整个HDFS都无法访问。

至于HBase Master无法启动,也就很好解释了。

另外,大集群,ZKFC的tickTime需要设置的更大一些,以防出现同步时间过短导致的网络拥塞。

发表评论

*
*

Required fields are marked *