Make the most out of EMR with PySpark and pyenv

Fernando Cisneros
Datank.ai Blog · Oct 16, 2017

Elastic MapReduce (EMR), one of Amazon’s many Big Data analytics products, has been out for a while now. It was first released in 2009 with automated Hadoop+Hive provisioning, basic job management, and S3 transfer optimizations. Since then, it has been actively developed to support more and more Big Data technologies. As of this writing, the latest EMR release is 5.8.0.

Our work with EMR

Our workloads consist of several dependent tasks composed into Data Pipelines. These Pipelines are represented as DAGs in order to control execution order, inputs, outputs, resources, etc.

Most of those tasks are written with the Spark Python API (PySpark), and their dependencies vary greatly from task to task. That said, our fondness for Spark and the Hadoop ecosystem is no secret.

Therefore, the decision whether to use EMR, Hadoop on EC2, or bare-metal Hadoop on premises came down to the characteristics of our workloads:

  • Workloads vary in number, size, and duration for each client, and therefore the size of the cluster also varies.
  • Workloads are not run continuously, therefore clusters are created and destroyed on demand.
  • Each job has its own specific set of dependencies (python requirements mainly).

Python on EMR

With the goal of optimizing resources, EMR’s Hadoop is a modified version of plain-vanilla Hadoop, tightly integrated with the AMI’s OS and relying on Amazon’s own Linux repositories for dependencies such as Python.

For example, EMR 5.7.0 runs on Amazon Linux, ships with Amazon’s own default yum repositories, and comes with Amazon’s builds of Python 2.6, 2.7, and 3.4 installed by default.
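
If you SSH into the master node, a few quick checks along these lines confirm the environment (the commands are generic; exact output varies by release):

    # Amazon Linux release the EMR AMI is based on
    cat /etc/system-release

    # yum repositories the AMI is configured to use
    yum repolist

    # Python interpreters installed by default
    ls /usr/bin/python*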

So, even though this works for most cases, it was an important issue for us: we rely heavily on specific Python versions and dependencies that may not be compatible if installed at a global scope.

One option to work around this issue was simply to create and bootstrap each cluster differently for each workflow’s specific requirements. This didn’t sound like a bad idea, given how effortless it is to provision clusters on demand. But a closer look revealed a big increase in complexity when chaining several tasks into a single pipeline, not to mention the cost of running multiple clusters for the same work. That approach would look like this:

Multiple EMR clusters with specific python version plus dependencies

Pyenv on EMR

A better solution, and the one we ended up implementing, was to isolate each task’s environment using pyenv. This made it possible to have multiple Python versions and dependency sets that do not conflict with each other.

Now each pipeline can have its own EMR cluster or even share a single one (with enough resources for all the tasks). The tricky part was installing pyenv on each node of the cluster and correctly setting the Spark options to use it:

Single EMR cluster with multiple python environments plus dependencies

After some trial and error, we managed to install pyenv using the bootstrap actions that EMR exposes for cluster customization, with a bootstrap script that installs and configures pyenv with Python 3.5 and some dependencies.
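
The following is a minimal sketch of such a bootstrap action; the build packages, the Python patch version (3.5.4), and the example pip requirements are illustrative rather than our exact script:

    #!/bin/bash
    # Sketch of a pyenv bootstrap action for EMR nodes.
    # Versions, packages, and requirements below are example values.
    set -ex

    # Build dependencies needed to compile CPython on Amazon Linux
    sudo yum install -y git gcc make zlib-devel bzip2 bzip2-devel \
        readline-devel sqlite sqlite-devel openssl-devel

    # Install pyenv for the hadoop user
    git clone https://github.com/pyenv/pyenv.git /home/hadoop/.pyenv
    export PYENV_ROOT="/home/hadoop/.pyenv"
    export PATH="$PYENV_ROOT/bin:$PATH"
    eval "$(pyenv init -)"

    # Build an isolated Python 3.5 and make it the default for this user
    pyenv install 3.5.4
    pyenv global 3.5.4

    # Install the task's dependencies into the pyenv-managed interpreter
    pip install --upgrade pip
    pip install numpy pandas boto3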

Now, the final step is to add Spark options to the spark-submit command when launching our PySpark tasks:
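
Assuming the pyenv-managed interpreter from the sketch above lives at /home/hadoop/.pyenv/versions/3.5.4/bin/python, and with task.py standing in for an actual PySpark job, the relevant options look roughly like this:

    # Point the YARN application master and the executors at the
    # pyenv-managed interpreter instead of the system Python.
    PYENV_PYTHON=/home/hadoop/.pyenv/versions/3.5.4/bin/python

    spark-submit \
      --master yarn \
      --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=$PYENV_PYTHON \
      --conf spark.executorEnv.PYSPARK_PYTHON=$PYENV_PYTHON \
      task.py

Setting PYSPARK_PYTHON for both the application master and the executors keeps the driver and every worker on the same pyenv-managed interpreter instead of the system Python.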

Conclusion

EMR has proven to be a cost-effective, easy, yet powerful solution for most Big Data analytics tasks. The simplicity of provisioning clusters, combined with the flexibility pyenv gives us to isolate Python environments and dependencies, has been invaluable to the development of our products.

The solution shown above may not fit every use case, but it can certainly serve as a reference for what can be done with EMR.
