As we use CDH 5.14.0 on our Hadoop cluster, the highest Spark version it supports is 2.1.3, so this blog records the procedure I followed to install pyspark-2.1.3 and integrate it with jupyter-lab.
CDH 5.14.0 – Hive 1.1.0
Anaconda3 – Python 3.6.8
- Add the following exports to spark-env.sh
export PYSPARK_PYTHON=/opt/anaconda3/bin/python
export PYSPARK_DRIVER_PYTHON=/opt/anaconda3/bin/jupyter-lab
export PYSPARK_DRIVER_PYTHON_OPTS=' --ip=172.16.191.30 --port=8890'
- Install sparkmagic
pip install sparkmagic
- Use conda or pip to downgrade ipykernel to 4.9.0, because ipykernel 5.x doesn't work with sparkmagic and will throw a Future exception.
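For example, either of these should do it (assuming pip/conda point at the Anaconda3 environment listed above):

pip install ipykernel==4.9.0
conda install ipykernel=4.9.0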
- /opt/spark-2.1.3/bin/pyspark --master yarn
If you need to run it in the background, use nohup.
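A rough sketch of what that looks like (the log file name is just an example):

nohup /opt/spark-2.1.3/bin/pyspark --master yarn > pyspark-jupyter.log 2>&1 &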
Another problem: in pyspark, sqlContext could not access the remote Hive metastore, and it did so without throwing any exception; when I ran show databases in pyspark, it always returned only default. It turned out that in Spark 2's jars directory there was a hive-exec-1.1.0-cdh5.14.0.jar; after deleting that jar file, everything worked.
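Roughly what I did (the jars path assumes the install location above; moving the jar aside is safer than deleting it outright):

mv /opt/spark-2.1.3/jars/hive-exec-1.1.0-cdh5.14.0.jar /tmp/

Then relaunch pyspark and run sqlContext.sql("show databases").show() to confirm the metastore databases now appear.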