As we use CDH 5.14.0 on our hadoop cluster, the highest spark version to be support is 2.1.3, so this blog is to record the procedure of how I install pyspark-2.1.3 and integrate it with jupyter-lab.
CDH 5.14.0 – hive 1.1.0
Anaconda3 – python 3.6.8
- Add export to spark-env.sh
export PYSPARK_DRIVER_PYTHON_OPTS=' --ip=172.16.191.30 --port=8890'
- install sparkmagic
pip install sparkmagic
- Use conda or pip command to downgrade ipykernel to 4.9.0, cause ipykernel 5.x doesn’t support sparkmagic, it will throw a Future exception.
- /opt/spark-2.1.3/bin/pyspark –master yarn
If you need to run with backgrand , use nohup.
if nessasery, add a kernel json at /usr/share/jupyter/kernels/pyspark2 or /usr/local/share/jupyter/kernels/pyspark2, with the content as
"HADOOP_CLIENT_OPTS": "-Xmx2147483648 -XX:MaxPermSize=512M -Djava.net.preferIPv4Stack=true",
"PYSPARK_SUBMIT_ARGS": " --jars /opt/spark-2.1.3-bin-hadoop2.6/jars/greenplum-spark_2.11-1.6.2.jar --master yarn --deploy-mode client --name JuPysparkHub pyspark-shell",
Another problem, in pyspark, sqlContext cannot access remote hivemetastore and without any exceptions, when i run show databases in pyspark, it always return me default. And then i found out, in spark2’s jars dir, there was a hive-exec-1.1.0-cdh5.14.0.jar, delete this jar file, everythings ok.
Python is useful for data scientists, especially with pyspark, but it’s a big problem to sysadmins, they will install python 2.7+ and spark and numpy,scipy,sklearn,pandas on each node, well, because Cloudera said that. Wow, imaging this, You have a cluster with 1000+ nodes or even 5000+ nodes, although you are good at DevOPS tools such as puppet, fabric, this work still cost lot of time. Continue reading Integrate pyspark and sklearn with distributed parallel running on YARN
Due to our dear stingy Party A said they will add not any nodes to the cluster, so we must compress the data to reduce disk consumption. Actually I like LZ4, it’s natively supported by hadoop, and the compress/decompress speed is good enough, compress ratio is better than LZO. But, I must choose LZO finally, no reason.
Well, since we use Cloudera Manager to install Hadoop and Spark, so it’s no error when read lzo file in command line, simply use as text file, Ex:
val data = sc.textFile("/user/dmp/miaozhen/ott/MZN_OTT_20170101131042_0000_ott.lzo")
But in zeppelin, it will told me: native-lzo library not available, WTF?
Well, Zeppelin is a self-run environment, it will read its configuration only, do not read any other configs, Ex: it will not try to read /etc/spark/conf/spark-defaults.conf . So I must wrote all spark config such as you wrote them in spark-deafults.conf.
In our cluster, the Zeppelin conf looks like this:
We’ve updated Zeppelin from 0.7.0 to 0.7.1, still work with kerberized hadoop cluster, we use some interpreters in zeppelin, not all. And I wanna write some troubleshooting records with this awesome webtool. BTW: I can write a webtool better than this 1000 times, such as phpHiveAdmin, basically I can see the map/reduce prograss bar Continue reading Troubleshooting on Zeppelin with keberized cluster
We deployed Apache Zeppelin 0.7.0 for the Kerberos secured Hadoop cluster, and my dear colleague cannot use it correctly, so I have to find out why he can’t use anything in Zeppelin, except shell command.
I start with Kerberized Hive Continue reading Use kerberized Hive in Zeppelin