Troubleshooting Zeppelin with a Kerberized cluster

We've upgraded Zeppelin from 0.7.0 to 0.7.1, still working against a Kerberized Hadoop cluster. We use some of the interpreters in Zeppelin, not all of them, and I want to write down some troubleshooting notes for this awesome web tool. BTW: I could write a web tool 1000 times better than this, such as phpHiveAdmin, where at least I can see the map/reduce progress bar.

Use Kerberized Hive in Zeppelin

We deployed Apache Zeppelin 0.7.0 for the Kerberos-secured Hadoop cluster, and my dear colleague could not use it correctly, so I had to find out why he couldn't use anything in Zeppelin except shell commands.

I'll start with Kerberized Hive.

Troubleshooting Kerberized Hive issues

Today my colleagues wanted to use Hive in Zeppelin. It was the first time anyone had used Hive in this new Kerberized cluster, and unfortunately Hive hit an authentication issue, so I had to debug it.

The Hive client machine had hadoop-client and hive installed, and all the needed keytabs were placed in the config directories with the right permissions, but it still could not connect to the cluster. The log always showed that authentication had failed.

Enable Kerberos secured Hadoop cluster with Cloudera Manager

I created a secured Hadoop cluster for P&G with Cloudera Manager, and this document records how to enable Kerberos security on a cluster with Cloudera Manager. First, we need a cluster that contains a Kerberos KDC and Kerberos clients.

Dr. Elephant MySQL connection error

Because the multi-threaded Dr. Elephant drove the JobHistoryServer's load very high, I stopped it for a stretch of time. Then last week a merge request on GitHub was released with a patch that pulls from the JHS periodically. I recompiled Dr. Elephant and deployed the new build on the cluster. It seemed stable, but this Monday morning my leader told me that there were no more counters or any other information about cluster jobs in Dr. Elephant. So I logged in to the server, checked the log, and found the message below.



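Influx is written in Go and was developed specifically for persisting time-series data. Because it is written in Go, it is supported on essentially every platform. Similar time-series databases include OpenTSDB, Prometheus, and others.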


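Our company's infrastructure team wants to pick out slow jobs and learn where resources are being wasted, so we installed Dr. Elephant to take a look. It is a system open-sourced by LinkedIn that provides performance analysis and tuning suggestions for YARN-based MR and Spark jobs.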

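Most of DRE is developed in Java, with the Spark monitoring part written in Scala, and it uses the Play framework stack. This is a framework similar to Django in Python. Is it based on Java? Scala? I didn't look into it closely. You can download it and use it directly, and it requires Java 1.8 or later.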

Apache Bigtop and Selling Books to Survive

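It has been almost a year since I last wrote a blog post, and I'm finally back. Recently, for company business needs, I had to build RPMs with custom patches based on the CDH distribution, so I picked up Bigtop again, the tool that compiles Hadoop and packages it as RPMs and debs. Since there are basically no related materials or documentation in China, I think it is worth sharing my approach to reading and modifying the Bigtop source code.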



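First, the other party had already set up Hive access to HBase, so in principle spark-sql can access HBase by using Hive's metadata. But execution was extremely slow, and the logs showed no errors. All communication went through email. I started by asking a few questions: whether Kerberos was enabled, whether Hive could access HBase normally, whether data access from the HBase shell was normal, and so on. The reply was that Kerberos was not in use, Hive access to HBase was normal, spark-sql could read the Hive metadata normally, and the HBase shell was normal too. Only spark-sql would not run.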



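  1. The disk holding the active NameNode's metadata was full, full, full… The very first sentence hit like a thunderclap.
  2. After finding the disk full, the operations staff ran echo "" > edit_xxxx-xxxx… against the active NameNode's metadata edit log. The second sentence hit like five thunderbolts at once.
  3. Then they found that the standby could not take over, and failing over would not have helped anyway, because the standby's metadata and logs were from May… This outcome was too painful to look at.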

Continue reading Hadoop Operations Notes Series (16)

