lambda – Slaytanic

Integrate pyspark and sklearn with distributed parallel running on YARN

Python is useful for data scientists, especially with pyspark, but it is a big problem for sysadmins: they have to install Python 2.7+, Spark, and numpy, scipy, sklearn, and pandas on each node, because Cloudera said so. Imagine a cluster with 1000+ or even 5000+ nodes: even if you are good with DevOps tools such as Puppet or Fabric, this work still costs a lot of time.

Apache Bigtop and Selling Books to Survive

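It has been almost a year since I last wrote a blog post, and I am finally back. Recently, for work, I needed to build RPMs with custom patches on top of the CDH distribution, so I picked up Bigtop again, the tool that builds Hadoop and packages it as RPMs and DEBs. Since there is basically no material or documentation on it available in China, I felt it was worth sharing how I read and modified the Bigtop source code.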
