Both Hadoop NameNodes are standby

Posted by 张映 on 2019-03-13

Category: hadoop/spark/scala


A spark-submit job failed with an error. Checking the HA state showed that both NameNode nodes were in standby, and the JournalNode on one of the machines had died.
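To see the HA state for yourself, you can query each NameNode's role directly (nn1/nn2 are the service IDs used later in this post; substitute your own):

# hdfs haadmin -getServiceState nn1
standby
# hdfs haadmin -getServiceState nn2
standby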

1. Troubleshooting

jps showed that the JournalNode and DFSZKFailoverController processes were missing.
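For comparison, jps on a healthy NameNode host in this kind of setup should show something like the following (the PIDs are illustrative, and the exact set depends on which daemons are co-located on the host):

# jps
2817 NameNode
3042 JournalNode
3210 DFSZKFailoverController
3305 Jps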

The JournalNode log contained the following; note the SIGTERM at the end, which means the process was killed externally rather than crashing on its own:

2019-03-12 09:35:27,306 INFO org.apache.hadoop.ipc.Server: IPC Server handler 0 on 8485 caught an exception
java.nio.channels.ClosedChannelException
at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:270)
at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:461)
at org.apache.hadoop.ipc.Server.channelWrite(Server.java:2767)
at org.apache.hadoop.ipc.Server.access$2200(Server.java:139)
at org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:1121)
at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:1193)
at org.apache.hadoop.ipc.Server$Connection.sendResponse(Server.java:2134)
at org.apache.hadoop.ipc.Server$Connection.access$400(Server.java:1261)
at org.apache.hadoop.ipc.Server$Call.sendResponse(Server.java:644)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2268)
2019-03-12 16:42:25,616 ERROR org.apache.hadoop.hdfs.qjournal.server.JournalNode: RECEIVED SIGNAL 15: SIGTERM
2019-03-12 16:42:25,619 INFO org.apache.hadoop.hdfs.qjournal.server.JournalNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down JournalNode at testing/10.0.0.149
************************************************************/
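The log shows the JournalNode serving RPC on port 8485, so a quick way to confirm it is really gone is to look for a listener on that port (netstat options may differ by distro; ss -tlnp works too):

# jps | grep JournalNode
# netstat -tlnp | grep 8485

No output from either command means the JournalNode process is down.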

The ZKFC log contained the following:

java.lang.RuntimeException: ZK Failover Controller failed: Received create error from Zookeeper. code:NONODE for path /hadoop-ha/bigdata1/ActiveStandbyElectorLock
at org.apache.hadoop.ha.ZKFailoverController.mainLoop(ZKFailoverController.java:369)
at org.apache.hadoop.ha.ZKFailoverController.doRun(ZKFailoverController.java:238)
at org.apache.hadoop.ha.ZKFailoverController.access$000(ZKFailoverController.java:61)
at org.apache.hadoop.ha.ZKFailoverController$1.run(ZKFailoverController.java:172)
at org.apache.hadoop.ha.ZKFailoverController$1.run(ZKFailoverController.java:168)
at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:415)
at org.apache.hadoop.ha.ZKFailoverController.run(ZKFailoverController.java:168)
at org.apache.hadoop.hdfs.tools.DFSZKFailoverController.main(DFSZKFailoverController.java:181)
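The NONODE error means ZKFC could not find the expected znode under /hadoop-ha/bigdata1 in ZooKeeper. You can inspect it with the ZooKeeper CLI (the ZooKeeper address here is an assumption; point it at one of your quorum hosts):

# zkCli.sh -server localhost:2181
[zk: localhost:2181(CONNECTED) 0] ls /hadoop-ha
[bigdata1]

If /hadoop-ha/bigdata1 is missing entirely, running hdfs zkfc -formatZK on a NameNode host recreates it; this only initializes the HA znode in ZooKeeper and does not touch any HDFS data.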

2. Solutions

2.1 Start the JournalNode and ZKFC

Run the first command on the host whose JournalNode died, and the second on each NameNode host:

# sbin/hadoop-daemon.sh start journalnode
# sbin/hadoop-daemon.sh start zkfc

Restarting these daemons usually resolves the problem. If it doesn't, read on.

2.2 Trigger a failover to elect an active NameNode

# hdfs haadmin -failover --forceactive nn1 nn2

With -failover, the second serviceId is the target, so the command above asks the failover controllers to make nn2 active. If this doesn't work, manually force one NameNode to become active.

2.3 Force one NameNode to be active

# hdfs haadmin -transitionToActive --forcemanual nn1

Don't forget the --forcemanual flag. After a manual forced transition, ZKFC stops working and you no longer have the protection of automatic failover; just start ZKFC again afterwards (see the sketch after the output below) and you're covered. Running the command without the flag is simply refused:

# hdfs haadmin -transitionToActive nn1
Automatic failover is enabled for NameNode at testing/10.0.0.149:9000
Refusing to manually manage HA state, since it may cause
a split-brain scenario or other incorrect state.
If you are very sure you know what you are doing, please
specify the --forcemanual flag.
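To get automatic failover back after a --forcemanual transition, start ZKFC again on both NameNode hosts (the same command as in 2.1), then confirm DFSZKFailoverController shows up in jps:

# sbin/hadoop-daemon.sh start zkfc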

If even that doesn't work, restart the whole Hadoop cluster.
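A minimal sketch of a full HDFS restart, assuming a Hadoop 2.x layout where stop-dfs.sh/start-dfs.sh manage the NameNodes, DataNodes, JournalNodes, and ZKFCs:

# sbin/stop-dfs.sh
# sbin/start-dfs.sh

Once everything is back up, check hdfs haadmin -getServiceState again to confirm that one NameNode is active.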



Please credit the source when reposting
Author: 海底苍鹰
URL: http://blog.51yip.com/hadoop/2097.html