JupyterHub integration with a customized Kerberos authenticator

Since I no longer spend much time on Hadoop and Spark maintenance, and our first-party client wanted to use JupyterHub/Lab on Hadoop with a cluster secured by Kerberos and HDFS encryption, I had to modify JupyterHub/Lab to adapt it for cluster job submission.

The client first wanted SSO login, but someday they may want to log in with Linux PAM users, so I wrote a custom authenticator that takes the Linux username from the login flow and authenticates to Kerberos in JupyterHub using that user's keytab. This way customers can submit their Hive or Spark jobs without running kinit.

First, the keytab of each user follows a fixed filename convention: for example, a principal on the dmp-python1 host looks like xianglei/dmp-python1@GC.COM, and its keytab file is xianglei.py1.keytab.
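
The convention can be expressed as a small helper (a sketch; the host-to-suffix mapping, e.g. dmp-python1 → py1, is my guess from the single example):

# Hypothetical helper mirroring the keytab naming convention described above.
REALM = 'GC.COM'

def keytab_and_principal(username, host, suffix):
    # e.g. keytab_and_principal('xianglei', 'dmp-python1', 'py1') returns
    # ('/home/xianglei/xianglei.py1.keytab', 'xianglei/dmp-python1@GC.COM')
    keytab = '/home/%s/%s.%s.keytab' % (username, username, suffix)
    principal = '%s/%s@%s' % (username, host, REALM)
    return keytab, principal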

The login handler in JupyterHub lives in jupyterhub/handlers/login.py. Change login.py so that it logs in and requests the username from the customer's SSO system:

class LoginHandler(BaseHandler):
    """Render the login page."""

    # commented for matrix ai
    args = dict()
    args['contenttype'] = 'application/json'
    args['app'] = 'dmp-jupyterhub'
    args['subkey'] = 'xxxxxxxxxxxxxxxxxxx'

    def _render(self, login_error=None, username=None):
        return self.render_template(
            'login.html',
            next=url_escape(self.get_argument('next', default='')),
            username=username,
            login_error=login_error,
            custom_html=self.authenticator.custom_html,
            login_url=self.settings['login_url'],
            authenticator_login_url=url_concat(
                self.authenticator.login_url(self.hub.base_url),
                {'next': self.get_argument('next', '')},
            ),
        )

    async def get(self):
        """
        Modified to fit pg's customized OAuth2 (SSO) system. If this method ships with matrix ai,
        comment it all out and write code that only takes the username from the matrix ai login form.
        """
        self.statsd.incr('login.request')
        user = self.current_user

        if user:
            # set new login cookie
            # because single-user cookie may have been cleared or incorrect
            # if user exists, set a cookie and jump to next url
            self.set_login_cookie(user)
            self.redirect(self.get_next_url(user), permanent=False)
        else:
            # if the user doesn't exist, go to the login page
            '''
            # below is original jupyterhub login, commented
            if self.authenticator.auto_login:
                auto_login_url = self.authenticator.login_url(self.hub.base_url)
                if auto_login_url == self.settings['login_url']:
                    # auto_login without a custom login handler
                    # means that auth info is already in the request
                    # (e.g. REMOTE_USER header)
                    user = await self.login_user()
                    if user is None:
                        # auto_login failed, just 403
                        raise web.HTTPError(403)
                    else:
                        self.redirect(self.get_next_url(user))
                else:
                    if self.get_argument('next', default=False):
                        auto_login_url = url_concat(
                            auto_login_url, {'next': self.get_next_url()}
                        )
                    self.redirect(auto_login_url)
                return
            username = self.get_argument('username', default='')
            self.finish(self._render(username=username))
            '''
            # below is the customized login GET
            import json
            import requests
            # the oauth2 access_token from the SSO cookie; default to '' when absent
            access_token = self.get_cookie('access_token', '')
            token_type = self.get_argument('token_type', 'Bearer')
            if access_token:
                self.log.info("access_token: " + access_token)
                # use the token to query the SSO userinfo endpoint; ShortName is the name used for kerberos
                userinfo_url = 'https://xxxx.com/paas-ssofed/v3/token/userinfo'
                headers = {
                    'Content-Type': 'application/x-www-form-urlencoded',
                    'Ocp-Apim-Subscription-Key': self.args['subkey'],
                    'Auth-Type': 'ssofed',
                    'Authorization': token_type + ' ' + access_token
                }
                body = {
                    'token': access_token
                }
                resp = json.loads(requests.post(userinfo_url, headers=headers, data=body).text)
                # put ShortName into a python dict and call jupyterhub's login_user method
                data = dict()
                data['username'] = resp['ShortName']
                data['password'] = ''
                user = await self.login_user(data)
                self.set_cookie('username', resp['ShortName'])
                self.set_login_cookie(user)
                self.redirect(self.get_next_url(user))
            else:
                self.redirect('http://xxxx.cn:3000/')

Do not change the post method of LoginHandler.

Then I wrote a custom authenticator class, named GCAuthenticator. It does nothing except return the username handed over from LoginHandler. Put the file at jupyterhub/garbage_customer/gcauthenticator.py:

#!/usr/bin/env python

from tornado import gen
from jupyterhub.auth import Authenticator

class GCAuthenticator(Authenticator):
    """
    usage:
    1. generate hub config file with command
        jupyterhub --generate-config /path/to/store/jupyterhub_config.py

    2. edit config file, comment
    # c.JupyterHub.authenticator_class = 'jupyterhub.auth.PAMAuthenticator'
    and write a new line
    c.JupyterHub.authenticator_class = 'jupyterhub.garbage_customer.gcauthenticator.GCAuthenticator'
    """

    @gen.coroutine
    # The signature of authenticate() is fixed. Do nothing and return the username,
    # which counts as success. The data passed in is the data['username'] and
    # data['password'] defined earlier in the login handler. Authentication was
    # already done by the client's SSO, so the password is unused, but the expected
    # format still has to be followed.
    # Following the same idea, it would be easy to rewrite this method to validate
    # users against MySQL, Postgres, or a plain text file instead.
    def authenticate(self, handler, data):
        user = data['username']
        return user

Next, the local Kerberos authentication step.

When login succeeds, JupyterHub calls the spawner to fork a subprocess that runs the single-user notebook server (it shows up as singleuser in the process list). Since that subprocess is created by the spawner, I can extend the spawner to perform the Kerberos authentication and inject the needed environment variables. This step doesn't require changing spawner.py itself; just write a new file as a plugin, e.g. gckrbspawner.py:

# Save this file in your site-packages directory as gckrbspawner.py
#
# then in /etc/jupyterhub/config.py, set:
#
#    c.JupyterHub.spawner_class = 'garbage_customer.gckrbspawner.KerberosSpawner'


from jupyterhub.spawner import LocalProcessSpawner
from jupyterhub.utils import random_port
from subprocess import Popen,PIPE
from tornado import gen
import pipes

REALM = 'GC.COM'

# KerberosSpawner extends the LocalProcessSpawner class from spawner.py
class KerberosSpawner(LocalProcessSpawner):
    @gen.coroutine
    def start(self):
        """启动子进程的方法"""
        if self.ip:
            self.user.server.ip = self.ip # 用户服务的ip为jupyterhub启动设置的ip
        else:
            self.user.server.ip = '127.0.0.1' # 或者是 127.0.0.1
        self.user.server.port = random_port() # singleruser server, 也就是notebook子进程, 启动时使用随机端口
        self.log.info('Spawner ip: %s' % self.user.server.ip)
        self.log.info('Spawner port: %s' % self.user.server.port)
        cmd = []
        env = self.get_env() # 获取jupyterhub的环境变量
        # self.log.info(env)
        
        """ Get user uid and gid from linux"""
        uid_args = ['id', '-u', self.user.name] # look up the linux uid of the logged-in user
        uid = Popen(uid_args, stdin=PIPE, stdout=PIPE, stderr=PIPE)
        uid = uid.communicate()[0].decode().strip()
        gid_args = ['id', '-g', self.user.name] # look up the linux gid of the logged-in user
        gid = Popen(gid_args, stdin=PIPE, stdout=PIPE, stderr=PIPE)
        gid = gid.communicate()[0].decode().strip()
        self.log.info('UID: ' + uid + ' GID: ' + gid)
        self.log.info('Authenticating: ' + self.user.name)

        cmd.extend(self.cmd)
        cmd.extend(self.get_args())

        self.log.info("Spawning %s", ' '.join(pipes.quote(s) for s in cmd))
        # authenticate to kerberos as the linux user. jupyterhub uses /home/username as each
        # user's directory by default, so I put each user's keytab under /home/username.
        # e.g. xianglei.wb1.keytab belongs to linux user xianglei, whose krb principal is
        # xianglei/gc-dmp-workbench1@GC.COM.
        kinit = ['/usr/bin/kinit', '-kt',
                 '/home/%s/%s.wb1.keytab' % (self.user.name, self.user.name,),
                 '-c', '/tmp/krb5cc_%s' % (uid,),
                 '%s/gc-dmp-workbench1@%s' % (self.user.name, REALM)]
        self.log.info("KRB5 initializing with command %s", ' '.join(kinit))
        # run kinit as a child process of the spawner via subprocess.Popen
        Popen(kinit, preexec_fn=self.make_preexec_fn(self.user.name)).wait()

        popen_kwargs = dict(
            preexec_fn=self.make_preexec_fn(self.user.name),
            start_new_session=True,  # don't forward signals
        )
        popen_kwargs.update(self.popen_kwargs)
        popen_kwargs['env'] = env
        self.proc = Popen(cmd, **popen_kwargs)
        self.pid = self.proc.pid
        # hand the ip and port back to the jupyterhub server so it can register the subprocess
        return (self.user.server.ip, self.user.server.port)

Then add this to the generated JupyterHub config file:

c.JupyterHub.spawner_class = 'garbage_customer.gckrbspawner.KerberosSpawner'

In the next post, I will describe how to auto-refresh the Kerberos credentials.
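
The core of it is just renewing the ticket cache periodically; a minimal sketch of the idea (the hourly interval and kinit -R renewal are my assumptions, details in that post):

import subprocess
import threading

def schedule_krb_refresh(uid, interval=3600):
    # Renew the renewable TGT in this user's ticket cache once per interval.
    def _renew():
        subprocess.call(['/usr/bin/kinit', '-R', '-c', '/tmp/krb5cc_%s' % uid])
        threading.Timer(interval, _renew).start()
    _renew()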

jupyterlab and pyspark2 integration in 1 minute

As we use CDH 5.14.0 on our Hadoop cluster, the highest Spark version supported is 2.1.3, so this post records how I installed pyspark-2.1.3 and integrated it with jupyter-lab.

Environment:
spark 2.1.3
CDH 5.14.0 – hive 1.1.0
Anaconda3 – python 3.6.8

  1. Add the following exports to spark-env.sh
    export PYSPARK_PYTHON=/opt/anaconda3/bin/python
    export PYSPARK_DRIVER_PYTHON=/opt/anaconda3/bin/jupyter-lab
    export PYSPARK_DRIVER_PYTHON_OPTS='  --ip=172.16.191.30 --port=8890'
  2. install sparkmagic
    pip install sparkmagic
  3. Use conda or pip to downgrade ipykernel to 4.9.0 (e.g. pip install ipykernel==4.9.0),
     because ipykernel 5.x doesn't work with sparkmagic and throws a Future exception:
     https://github.com/jupyter-incubator/sparkmagic/issues/492
  4. /opt/spark-2.1.3/bin/pyspark --master yarn

If you need to run it in the background, use nohup.

If necessary, add a kernel JSON at /usr/share/jupyter/kernels/pyspark2 or /usr/local/share/jupyter/kernels/pyspark2 (you can verify registration with jupyter kernelspec list), with the content:
{
  "argv": [
    "python3.6",
    "-m",
    "ipykernel_launcher",
    "-f",
    "{connection_file}"
  ],
  "display_name": "Python3.6+PySpark2.1",
  "language": "python",
  "env": {
    "PYSPARK_PYTHON": "/opt/anaconda3/bin/python",
    "SPARK_HOME": "/opt/spark-2.1.3-bin-hadoop2.6",
    "HADOOP_CONF_DIR": "/etc/hadoop/conf",
    "HADOOP_CLIENT_OPTS": "-Xmx2147483648 -XX:MaxPermSize=512M -Djava.net.preferIPv4Stack=true",
    "PYTHONPATH": "/opt/spark-2.1.3-bin-hadoop2.6/python/lib/py4j-0.10.7-src.zip:/opt/spark-2.1.3-bin-hadoop2.6/python/",
    "PYTHONSTARTUP": "/opt/spark-2.1.3-bin-hadoop2.6/python/pyspark/shell.py",
    "PYSPARK_SUBMIT_ARGS": " --jars /opt/spark-2.1.3-bin-hadoop2.6/jars/greenplum-spark_2.11-1.6.2.jar --master yarn --deploy-mode client --name JuPysparkHub pyspark-shell",
    "JAVA_HOME": "/opt/jdk1.8.0_141"
  }
}

Another problem: in pyspark, sqlContext could not access the remote Hive metastore, without raising any exceptions; when I ran show databases in pyspark, it always returned only default. I eventually found that spark2's jars dir contained a hive-exec-1.1.0-cdh5.14.0.jar; after deleting that jar file, everything was OK.
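
A quick way to verify the metastore connection from pyspark (a hedged check, assuming Spark 2.x with Hive support enabled):

from pyspark.sql import SparkSession

# Build a Hive-enabled session and list databases; if only 'default' comes back
# with no errors, suspect a conflicting hive-exec jar on the classpath.
spark = SparkSession.builder \
    .appName('metastore-check') \
    .enableHiveSupport() \
    .getOrCreate()
spark.sql('show databases').show()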

Building your own IPv6 ladder

The following assumes some experience with networking and proxy (ladder) setup.

Preparation:
One Vultr US host, hereafter HAA. Create it with both IPv6 and IPv4 enabled; the lowest price as of 2018-12-12 is $3.50. Pick a location near the West Coast, Seattle or LA both work, for better speed; testing from my home network, Silicon Valley felt mediocre.

Vultr EU hosts in Paris, Amsterdam, or Frankfurt all work, in whatever quantity you like; choose IPv6 only, at $2.50 each. Or Scaleway in France, from €1.99. Hereafter the SSs.

Optionally, one Alibaba Cloud (or other) host, mainly to take advantage of its backbone connectivity; create it with pay-as-you-go billing, lowest spec.

Getting to work:
Deploy HAProxy on the US HAA host, with the following configuration:

global
        log /dev/log    local0
        log /dev/log    local1 notice
        chroot /var/lib/haproxy
        stats socket /run/haproxy/admin.sock mode 660 level admin expose-fd listeners
        stats timeout 30s
        user haproxy
        group haproxy
        daemon

        # Default SSL material locations
        ca-base /etc/ssl/certs
        crt-base /etc/ssl/private
        ssl-default-bind-ciphers ECDH+AESGCM:DH+AESGCM:ECDH+AES256:DH+AES256:ECDH+AES128:DH+AES:RSA+AESGCM:RSA+AES:!aNULL:!MD5:!DSS
        ssl-default-bind-options no-sslv3

defaults
        log     global
        mode    tcp # tcp = layer-4 proxying (the default http mode is layer 7)
        option  redispatch
        option  abortonclose
        timeout connect 6000
        timeout client  60000
        timeout server  60000
        timeout check   2000

listen  admin_stats
        bind    *:1090 # stats page, binds IPv6 [::] and IPv4 0.0.0.0 on port 1090
        mode    http
        maxconn 10
        stats   refresh 30s
        stats   uri     /
        stats   realm   HAProxy
        stats   auth    haproxy:haproxy

listen ss
        bind    *:8388 # binds both IPv6 [::] and IPv4 0.0.0.0 on port 8388
        mode    tcp
        balance leastconn
        maxconn 8192
        server  Amsterdam-vultr                                         2001:19f0:5001:27ce:xxxx:xxx:xxxx:xxxx:8388      check   inter   180000  rise    1       fall    2 # reverse-proxies port 8388 on the Amsterdam host's IPv6 address
        server  Amsterdam-scaleway                                      2001:bc8:xxxx:xxxx::1:8388                       check   inter   180000  rise    1       fall    2
        server  Seattle-vultr                                           2001:19f0:8001:1396:xxxx:xxx:xxxx:xxxx:8388      check   inter   180000  rise    1       fall    2

Then deploy the ss server on each of the SSs cloud hosts; there are plenty of tutorials for that. When configuring the SSs, bind the IP to ::, for example:

{
  "server":"::",
  "port_password":{
    "995":"xxxxxxxxxx"
  },
  "timeout":300,
  "method":"aes-256-cfb",
  "workers": 10
}

One thing to note: as far as I've seen, the Python version of SS supports IPv6 binding while the C version apparently does not. I didn't test more deeply, so I used the Python version of SS here.
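
For reference, binding to :: on Linux usually accepts IPv4 clients too when IPV6_V6ONLY is off; a minimal Python sketch of such a dual-stack listener (an illustration, not part of the SS code):

import socket

# Dual-stack TCP listener, analogous to the "server": "::" setting above.
sock = socket.socket(socket.AF_INET6, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
# 0 = also accept IPv4 clients as IPv4-mapped IPv6 addresses
sock.setsockopt(socket.IPPROTO_IPV6, socket.IPV6_V6ONLY, 0)
sock.bind(('::', 8388))
sock.listen(128)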

Start SS, start HAProxy, and it's ready to use. The benefits: IPv6-only machines are cheaper, less likely to be blocked by the GFW, fast, and globally distributed; connection speed mostly depends on your home link to the US server. If budget allows, add a domestic Alibaba Cloud host, lowest spec, running HAProxy to reverse-proxy the US HAA once more, taking advantage of its backbone connectivity. On my 100 Mbps home broadband, going through the Aliyun relay to the US, video sites like YouTube peak at 2-3 MB/s, and 4K plays with no pressure.

All the servers are lowest-spec, running Ubuntu 18.04 or Debian stretch.

Enable HTTPS access in Zeppelin

I was using a CA-certified key file to enable HTTPS; if you use a self-signed key, see the second part.

First part:
I had two files: the private key, named server.key, and the certificate file, named server.crt.
Use the following commands to create a JKS keystore file:

openssl pkcs12 -export -in xxx.com.crt -inkey xxx.com.key -out xxx.com.pkcs12
keytool -importkeystore -srckeystore xxx.com.pkcs12 -destkeystore xxx.com.jks -srcstoretype pkcs12

Second part:
Using a self-signed key

# Generate the root key and cert files; the key file can be named .key or .pem, it's the same.
openssl genrsa -out root.key(pem) 2048 # Generate root key file
openssl req -x509 -new -key root.key(pem) -out root.crt # Generate root cert file

# Generate client key and cert and csr file
openssl genrsa -out client.key(pem) 2048 # Generate client key file
openssl req -new -key client.key(pem) -out client.csr # Generate client cert request file
openssl x509 -req -in client.csr -CA root.crt -CAkey root.key(pem) -CAcreateserial -days 3650 -out client.crt # Use root cert to generate client cert file

# Generate server key and cert and csr file
openssl genrsa -out server.key(pem) 2048 # Generate server key file, use in Zeppelin
openssl req -new -key server.key(pem) -out server.csr # Generate server cert request file
openssl x509 -req -in server.csr -CA root.crt -CAkey root.key(pem) -CAcreateserial -days 3650 -out server.crt # Use root cert to generate server cert file

# Generate client jks file
openssl pkcs12 -export -in client.crt -inkey client.key(pem) -out client.pkcs12 # Package into pkcs12 format; you must enter an export password, remember it
keytool -importkeystore -srckeystore client.pkcs12 -destkeystore client.jks -srcstoretype pkcs12 # Enter the client password from the previous step

# Generate server jks file
openssl pkcs12 -export -in server.crt -inkey server.key(pem) -out server.pkcs12 # Package into pkcs12 format; you must enter an export password, remember it
keytool -importkeystore -srckeystore server.pkcs12 -destkeystore server.jks -srcstoretype pkcs12 # Enter the server password from the previous step

The server key, cert, and jks are used to configure Zeppelin; the client key, cert, and jks are installed into your browser or used by your client code.
Then make a directory for the server files, for example:

mkdir -p /etc/zeppelin/conf/ssl
cp server.crt server.jks /etc/zeppelin/conf/ssl

And then modify zeppelin-site.xml to enable https access

<property>
  <name>zeppelin.server.ssl.port</name>
  <value>8443</value>
  <description>Server ssl port. (used when ssl property is set to true)</description>
</property>
<property>
  <name>zeppelin.ssl</name>
  <value>true</value>
  <description>Should SSL be used by the servers?</description>
</property>
<property>
  <name>zeppelin.ssl.client.auth</name>
  <value>false</value>
  <description>Should client authentication be used for SSL connections?</description>
</property>
<property>
  <name>zeppelin.ssl.keystore.path</name>
  <value>/etc/zeppelin/conf/ssl/xxx.com.jks</value>
  <description>Path to keystore relative to Zeppelin configuration directory</description>
</property>
<property>
  <name>zeppelin.ssl.keystore.type</name>
  <value>JKS</value>
  <description>The format of the given keystore (e.g. JKS or PKCS12)</description>
</property>
<property>
  <name>zeppelin.ssl.keystore.password</name>
  <value>the password you entered when generating the server jks</value>
  <description>Keystore password. Can be obfuscated by the Jetty Password tool</description>
</property>

Then everything is complete; you can redirect port 443 to 8443 using iptables or another reverse-proxy tool.
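
To confirm HTTPS is up, a quick check from Python (the hostname is a placeholder; for the self-signed case, point verify= at your root cert):

import requests

# Expect a 200, or a redirect to the login page, once SSL is configured correctly.
resp = requests.get('https://zeppelin.example.com:8443/', verify='root.crt')
print(resp.status_code)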

Trying out the time-series database InfluxDB

Monitoring a Hadoop cluster requires a time-series database, so today I spent half a day evaluating the recently popular InfluxDB. It turned out to be quite good; here are my study notes.

InfluxDB is written in Go and was built specifically for persisting time-series data; since it's Go, basically every platform is supported. Similar time-series databases include OpenTSDB and Prometheus.

Tornado study notes (4)

1. Internationalizing Tornado

Internationalizing Tornado took some fiddling; the official docs on this part are too poor, so I'm writing down how to combine Tornado with gettext for i18n.

Step 1: create the ./locales/zh_CN/LC_MESSAGES directory under the project path.

Step 2: in the directory from step 1, use xgettext or poedit to create a po file, e.g. messages.po. I used poedit, which is more convenient than xgettext.
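
The loading step (not covered in this excerpt) is roughly the following, assuming the messages.po above has been compiled to messages.mo:

import tornado.locale

# Load the compiled .mo translations; 'messages' matches messages.po/.mo above.
tornado.locale.load_gettext_translations('./locales', 'messages')

# Inside a RequestHandler you can then translate with:
#   self.locale.translate('some string')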

Tornado study notes (1)

I've recently started developing with Tornado. The reasons: first, Tornado is Python-based, so there's less code and faster development. Second, it uses epoll and can sustain high concurrency; benchmarking with ab on my i5 desktop, without touching a database and just generating a page from get, it averaged around 7,900 concurrent requests, far more than PHP or Java can carry. Third, Python code is much more maintainable than PHP, with a clear syntax structure. Fourth, Tornado's framework design is delightfully blunt: HTTP request methods serve as handler method names, so normally a page only needs get and post method definitions.
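
As an illustration of that design (a minimal sketch, not from the original post):

import tornado.ioloop
import tornado.web

class MainHandler(tornado.web.RequestHandler):
    # The HTTP verb is the method name; a page usually only needs get and post.
    def get(self):
        self.write('Hello, Tornado')

if __name__ == '__main__':
    app = tornado.web.Application([(r'/', MainHandler)])
    app.listen(8888)
    tornado.ioloop.IOLoop.current().start()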