Channel: Working notes by Matthew Rocklin - SciPy

Dask EC2 Startup Script

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project

A screencast version of this post is available here: https://youtu.be/KGlhU9kSfVk

Summary

Copy-pasting the following commands gives you a Dask cluster on EC2.

pip install dec2
# Provision nine nodes, with eight separate worker processes per node
dec2 up --keyname YOUR-AWS-KEY-NAME \
        --keypair ~/.ssh/YOUR-AWS-KEY-FILE.pem \
        --count 9 \
        --nprocs 8

dec2 ssh            # SSH into head node
ipython             # Start IPython console on head node
from distributed import Executor, s3, progress
e = Executor('127.0.0.1:8786')
df = s3.read_csv('dask-data/nyc-taxi/2015', lazy=False)
progress(df)
df.head()

You will have to use your own AWS credentials, but you’ll get fast distributed Pandas access on the NYCTaxi data across a cluster, loaded from S3.

Motivation

Reducing barriers to entry enables curious play.

Curiosity drives us to play with new tools. We love the idea that previously difficult tasks will suddenly become easy, expanding our abilities and opening up a range of newly solvable problems.

However, as our problems grow more complex our tools grow more cumbersome and setup costs increase. This cost stops us from playing around, which is a shame, because playing is good both for the education of the user and for the development of the tool. Tool makers who want feedback are strongly incentivized to decrease setup costs, especially for the play case.

In February we introduced dask.distributed, a lightweight distributed computing framework for Python. We focused on processing data with high level abstractions like dataframes and arrays in the following blogposts:

  1. Analyze GitHub JSON record data in S3
  2. Use Dask DataFrames on CSV data in HDFS
  3. Process NetCDF data with Dask arrays on a traditional cluster

Today we present a simple setup script to launch dask.distributed on EC2, enabling any user with AWS credentials to repeat these experiments easily.

dec2

Devops tooling and EC2 to the rescue

DEC2 does the following:

  1. Provisions nodes on EC2 using your AWS credentials
  2. Installs Anaconda on those nodes
  3. Deploys a dask.distributed Scheduler on the head node and Workers on the rest of the nodes
  4. Helps you to SSH into the head node or connect from your local machine

$ pip install dec2
$ dec2 up --help
Usage: dec2 up [OPTIONS]

Options:
  --keyname TEXT                Keyname on EC2 console  [required]
  --keypair PATH                Path to the keypair that matches the keyname [required]
  --name TEXT                   Tag name on EC2
  --region-name TEXT            AWS region  [default: us-east-1]
  --ami TEXT                    EC2 AMI  [default: ami-d05e75b8]
  --username TEXT               User to SSH to the AMI  [default: ubuntu]
  --type TEXT                   EC2 Instance Type  [default: m3.2xlarge]
  --count INTEGER               Number of nodes  [default: 4]
  --security-group TEXT         Security Group Name  [default: dec2-default]
  --volume-type TEXT            Root volume type  [default: gp2]
  --volume-size INTEGER         Root volume size (GB)  [default: 500]
  --file PATH                   File to save the metadata  [default: cluster.yaml]
  --provision / --no-provision  Provision salt on the nodes  [default: True]
  --dask / --no-dask            Install Dask.Distributed in the cluster [default: True]
  --nprocs INTEGER              Number of processes per worker  [default: 1]
  -h, --help                    Show this message and exit.

Note: dec2 was largely built by Daniel Rodriguez

Run

As an example, we use dec2 to create a new cluster of nine nodes: one head node for the scheduler and eight worker nodes. Each worker node runs eight separate worker processes rather than using threads.

# Provision nine nodes, with eight separate worker processes per node
dec2 up --keyname my-key-name \
        --keypair ~/.ssh/my-key-file.pem \
        --count 9 \
        --nprocs 8
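
If the launch is scripted from Python rather than typed by hand, the same command can be assembled as an argument list, which avoids shell line-continuation problems. The sketch below is only an illustration: `build_dec2_up` is a hypothetical helper, not part of dec2, and it uses only flags that appear in the `dec2 up --help` output. It builds the command without running it.

```python
import os
import shlex


def build_dec2_up(keyname, keypair, count=4, nprocs=1):
    """Return the argument list for `dec2 up` (hypothetical helper).

    Defaults mirror the --help output: four nodes, one process per worker.
    """
    # Expand "~" here because no shell will do it for an argument list.
    keypair = os.path.expanduser(keypair)
    return [
        "dec2", "up",
        "--keyname", keyname,
        "--keypair", keypair,
        "--count", str(count),    # total number of nodes to provision
        "--nprocs", str(nprocs),  # worker processes per node
    ]


cmd = build_dec2_up("my-key-name", "~/.ssh/my-key-file.pem", count=9, nprocs=8)
print(shlex.join(cmd))  # the equivalent one-line shell command
```

Passing `cmd` to `subprocess.run(cmd, check=True)` would actually launch the cluster, which provisions billable EC2 instances, so the sketch stops at printing.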

Connect

We ssh into the head node and start playing in an IPython terminal:

localmachine:~$ dec2 ssh                          # SSH into head node
ec2-machine:~$ ipython                            # Start IPython console
In [1]: from distributed import Executor, s3, progress
In [2]: e = Executor('127.0.0.1:8786')
In [3]: e
Out[3]: <Executor: scheduler=127.0.0.1:8786 workers=64 threads=64>
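
The worker count of 64 follows from the launch options, assuming one of the nine nodes is the head node running only the scheduler (as described in the dec2 section) and every remaining node starts `--nprocs` worker processes. A minimal sketch of that arithmetic, using a hypothetical helper name:

```python
def expected_workers(count, nprocs):
    """Worker processes expected for a cluster of `count` nodes.

    Assumes one node is the head node (scheduler only) and every other
    node runs `nprocs` worker processes.
    """
    if count < 2:
        raise ValueError("need at least one head node and one worker node")
    return (count - 1) * nprocs


print(expected_workers(count=9, nprocs=8))  # → 64
```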

Notebooks

Alternatively we set up a globally visible Jupyter notebook server:

localmachine:~$ dec2 dask-distributed address    # Get location of head node
Scheduler Address: XXX.XXX.XXX.XXX:8786

localmachine:~$ dec2 ssh                          # SSH into head node
ec2-machine:~$ jupyter notebook --ip="*"          # Start Jupyter Server

Then navigate to http://XXX.XXX.XXX.XXX:8888 from your local browser, using the head node's address in place of the placeholder. Note that this method is not secure; see the Jupyter docs for a better solution.
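
As a small convenience, the scheduler address line can be turned into the notebook URL programmatically. This is a sketch under stated assumptions: `notebook_url` is a hypothetical helper, the line format is taken from the output shown above, and the sample address comes from a range reserved for documentation.

```python
def notebook_url(address_line, notebook_port=8888):
    """Turn a 'Scheduler Address: HOST:PORT' line into (host, port, url).

    8888 is the Jupyter port used in this post; pass another value if the
    server was started on a different port.
    """
    # Drop the label, then split the port off the right-hand side.
    host_port = address_line.split("Scheduler Address:", 1)[1].strip()
    host, scheduler_port = host_port.rsplit(":", 1)
    return host, int(scheduler_port), f"http://{host}:{notebook_port}"


# 203.0.113.10 is a documentation-only example address, not a real cluster.
host, port, url = notebook_url("Scheduler Address: 203.0.113.10:8786")
print(host, port, url)
```

The same host and scheduler port are what a local session would pass to `Executor`, provided the local environment matches the one on the cluster.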

Public datasets

We repeat the experiments from our earlier blogpost on NYCTaxi data.

df = s3.read_csv('dask-data/nyc-taxi/2015', lazy=False)
progress(df)
df.head()
   VendorID  tpep_pickup_datetime  tpep_dropoff_datetime  passenger_count  trip_distance
0         2   2015-01-15 19:05:39    2015-01-15 19:23:42                1           1.59
1         1   2015-01-10 20:33:38    2015-01-10 20:53:28                1           3.30
2         1   2015-01-10 20:33:38    2015-01-10 20:43:41                1           1.80
3         1   2015-01-10 20:33:39    2015-01-10 20:35:31                1           0.50
4         1   2015-01-10 20:33:39    2015-01-10 20:52:58                1           3.00

   pickup_longitude  pickup_latitude  RateCodeID  store_and_fwd_flag  dropoff_longitude  dropoff_latitude
0        -73.993896        40.750111           1                   N         -73.974785         40.750618
1        -74.001648        40.724243           1                   N         -73.994415         40.759109
2        -73.963341        40.802788           1                   N         -73.951820         40.824413
3        -74.009087        40.713818           1                   N         -74.004326         40.719986
4        -73.971176        40.762428           1                   N         -74.004181         40.742653

   payment_type  fare_amount  extra  mta_tax  tip_amount  tolls_amount  improvement_surcharge  total_amount
0             1         12.0    1.0      0.5        3.25             0                    0.3         17.05
1             1         14.5    0.5      0.5        2.00             0                    0.3         17.80
2             2          9.5    0.5      0.5        0.00             0                    0.3         10.80
3             2          3.5    0.5      0.5        0.00             0                    0.3          4.80
4             2         15.0    0.5      0.5        0.00             0                    0.3         16.30
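
As a quick sanity check on the five rows shown above, the fare-related columns appear to add up to `total_amount`. The sketch below verifies this in plain Python. It is an observation about these five rows only, not a documented property of the dataset, and `components_match_total` is a hypothetical helper name.

```python
import math

# Fare-related values from the five rows above, in column order:
# fare_amount, extra, mta_tax, tip_amount, tolls_amount,
# improvement_surcharge, total_amount
rows = [
    (12.0, 1.0, 0.5, 3.25, 0.0, 0.3, 17.05),
    (14.5, 0.5, 0.5, 2.00, 0.0, 0.3, 17.80),
    (9.5, 0.5, 0.5, 0.00, 0.0, 0.3, 10.80),
    (3.5, 0.5, 0.5, 0.00, 0.0, 0.3, 4.80),
    (15.0, 0.5, 0.5, 0.00, 0.0, 0.3, 16.30),
]


def components_match_total(row, tol=1e-9):
    """True if every value before the last sums to the last value."""
    *components, total = row
    # Compare with a tolerance because these are floating point sums.
    return math.isclose(sum(components), total, abs_tol=tol)


checks = [components_match_total(row) for row in rows]
print(checks)  # → [True, True, True, True, True]
```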

Acknowledgments

The dec2 startup script is largely the work of Daniel Rodriguez. Daniel usually works on Anaconda for cluster management, which is normally part of an Anaconda subscription but is also free for moderate use (4 nodes). It does things similar to dec2, but much more maturely.

DEC2 was inspired by the excellent spark-ec2 setup script, which is how most Spark users, myself included, were first able to try out the library. The spark-ec2 script empowered many new users to try out a distributed system for the first time.

The S3 work was largely done by Hussain Sultan (Capital One) and Martin Durant (Continuum).

What didn’t work

  • Originally we automatically started an IPython console on the local machine and connected it to the remote scheduler. This felt slick, but was error-prone due to mismatches between the user’s environment and the remote cluster’s environment.
  • It’s tricky to replicate functionality that’s part of a proprietary product (Anaconda for cluster management). Fortunately, Continuum management has been quite supportive.
  • There aren’t many people in the data science community who know Salt, the system that backs dec2. I expect maintenance to be a bit tricky moving forward, especially during periods when Daniel and other developers are occupied.
