Page 1 of 3
1 2 3

jupyterlab and pyspark2 integration in 1 minute

As we use CDH 5.14.0 on our hadoop cluster, the highest spark version to be support is 2.1.3, so this blog is to record the procedure of how I install pyspark-2.1.3 and integrate it with jupyter-lab.

Evironment:
spark 2.1.3
CDH 5.14.0 – hive 1.1.0
Anaconda3 – python 3.6.8

  1. Add export to spark-env.sh
    export PYSPARK_PYTHON=/opt/anaconda3/bin/python
    export PYSPARK_DRIVER_PYTHON=/opt/anaconda3/bin/jupyter-lab
    export PYSPARK_DRIVER_PYTHON_OPTS='  --ip=172.16.191.30 --port=8890'
  2. install sparkmagic
    pip install sparkmagic
  3. Use conda or pip command to downgrade ipykernel to 4.9.0, cause ipykernel 5.x doesn’t support sparkmagic, it will throw a Future exception.
    https://github.com/jupyter-incubator/sparkmagic/issues/492
  4. /opt/spark-2.1.3/bin/pyspark –master yarn

If you need to run with backgrand , use nohup.

Another problem, in pyspark, sqlContext cannot access remote hivemetastore and without any exceptions, when i run show databases in pyspark, it always return me default. And then i found out, in spark2’s jars dir, there was a hive-exec-1.1.0-cdh5.14.0.jar, delete this jar file, everythings ok.

How to use cloudera parcels manually

Cloudera Parcel is actually a compressed file format, it just a tgz file with some meta info, so we can simply untar it with command tar zxf xxx.parcel. So we have the capability to  extract multi version of hadoop in a single machine. It’s easy to make hadoop upgrade or  downgrade, only ln -s CDH symbol link to a specific version directory.

With understanding that, I can package a self-distributed parcel package with my patches, and use cloudera-manager to manage the cluster… That sounds good

Integrate pyspark and sklearn with distributed parallel running on YARN

Python is useful for data scientists, especially with pyspark, but it’s a big problem to sysadmins, they will install python 2.7+ and spark and numpy,scipy,sklearn,pandas on each node, well, because Cloudera said that. Wow, imaging this, You have a cluster with 1000+ nodes or even 5000+ nodes, although you are good at DevOPS tools such as puppet, fabric, this work still cost lot of time. Continue reading Integrate pyspark and sklearn with distributed parallel running on YARN

Spark read LZO file error in Zeppelin

Due to our dear stingy Party A  said they will add not any nodes to the cluster, so we must compress the data to reduce disk consumption. Actually  I like LZ4, it’s natively supported by hadoop, and the compress/decompress speed is good enough,  compress ratio is better than LZO. But, I must choose LZO finally, no reason.

Well, since we use Cloudera Manager to  install Hadoop and Spark, so it’s no error when read lzo file in command line, simply use as text file, Ex:

val data = sc.textFile("/user/dmp/miaozhen/ott/MZN_OTT_20170101131042_0000_ott.lzo")
data.take(3)

But in zeppelin, it will told me: native-lzo library not available, WTF?

Well, Zeppelin is a self-run environment, it will read its configuration only, do not read any other configs, Ex: it will not try to read /etc/spark/conf/spark-defaults.conf . So I must wrote all spark config such as you wrote them in spark-deafults.conf.

In our cluster, the Zeppelin conf looks like this:

Troubleshooting on Zeppelin with keberized cluster

We’ve updated Zeppelin from 0.7.0 to 0.7.1, still work with kerberized hadoop cluster, we use some interpreters in zeppelin, not all. And I wanna write some troubleshooting records with this awesome webtool. BTW: I can write a webtool better than this 1000 times, such as phpHiveAdmin, basically I can see the map/reduce prograss bar Continue reading Troubleshooting on Zeppelin with keberized cluster

Troubleshooting kerberized hive issues

Today, my colleagues want to use hive in zeppelin, it’s the first time to use hive in this new kerberized cluster, and unfortunately there was an authenticate issue of using hive. So I have to debug on it.

The hive client was installed hadoop-client and hive and put all the needed keytabs in config dirs and set the right permission of their all, but still could not connect to the cluster. The log always shows authentication failed. Continue reading Troubleshooting kerberized hive issues

Enable Kerberos secured Hadoop cluster with Cloudera Manager

I created a secured Hadoop cluster for P&G with cloudera manager, and this document is to record how to enable kerberos secured cluster with cloudera manager. Firstly we should have a cluster that contains kerberos KDC and kerberos clients Continue reading Enable Kerberos secured Hadoop cluster with Cloudera Manager

Dr.Elephant mysql connection error

This is the first time I try to use english to write my blog, so don’t jeer at the mistake of my grammar and spelling.

Because of multi threaded drelephant will cause JobHistoryServer’s Loads very high, so I stopped it for a strench of time. Until last week, a period pull from JHS patch merge request from github was released. I re-compiled dr. elephant and deploy the new dr. elephant on the cluster. It seems stable, but on this Monday morning, my leader told me that there were no more counters and any information about cluster jobs in dr. elephant.  So I logged in to the server, and check log, then I found this message below. Continue reading Dr.Elephant mysql connection error

试用时间序列数据库InfluxDB

Hadoop集群监控需要使用时间序列数据库,今天花了半天时间调研使用了一下最近比较火的InfluxDB,发现还真是不错,记录一下学习心得。

Influx是用Go语言写的,专为时间序列数据持久化所开发的,由于使用Go语言,所以各平台基本都支持。类似的时间序列数据库还有OpenTSDB,Prometheus等。 Continue reading 试用时间序列数据库InfluxDB

Page 1 of 3
1 2 3