
Fast Message Serialization


This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project

Very high performance isn’t about doing one thing well, it’s about doing nothing poorly.

This week I optimized the inter-node communication protocol used by dask.distributed. It was a fun exercise in optimization that involved several different and unexpected components. I separately had to deal with Pickle, NumPy, Tornado, MsgPack, and compression libraries.

This blogpost is not advertising any particular functionality; rather, it’s a story of the problems I ran into when designing and optimizing a protocol to quickly send both very small and very large numeric data between machines on the Python stack.

We care very strongly about both the many small messages case (thousands of 100 byte messages per second) and the very large messages case (100-1000 MB). This spans an interesting range of performance space. We end up with a protocol that costs around 5 microseconds in the small case and operates at 1-1.5 GB/s in the large case.

Identify a Problem

This came about as I was preparing a demo using dask.array on a distributed cluster for a Continuum webinar. I noticed that my computations were taking much longer than expected. The Web UI quickly pointed me to the fact that my machines were spending 10-20 seconds moving 30 MB chunks of numpy array data between them. This was very strange, because I was on a 100 MB/s network and so expected these transfers to take more like 0.3 seconds, not 15.

The Web UI made this glaringly apparent, so my first lesson was how valuable visual profiling tools can be at surfacing performance issues. Thanks here goes to the Bokeh developers who helped the development of the Dask real-time Web UI.

Problem 1: Tornado’s sentinels

Dask’s networking is built off of Tornado’s TCP IOStreams.

There are two common ways to delineate messages on a socket: sentinel values that signal the end of a message, or prefixing every message with its length. Early on we tried both in Dask but found that prefixing a length before every message was slow. It turns out that this was because TCP sockets try to batch small messages to increase bandwidth. Turning this batching off ended up being an effective and easy solution; see the TCP_NODELAY socket option.

However, before we figured that out we used sentinels for a long time. Unfortunately Tornado does not handle sentinels well for large messages. At the receipt of every new message it reads through all buffered data to see if it can find the sentinel. This makes lots and lots of copies and reads through lots and lots of bytes. This isn’t a problem if your messages are a few kilobytes, as is common in web development, but it’s terrible if your messages are millions or billions of bytes long.

Switching back to prefixing messages with lengths and disabling the small-message batching (that is, enabling TCP_NODELAY) moved our bandwidth up from 3 MB/s to 20 MB/s per node. Thanks goes to Ben Darnell (main Tornado developer) for helping us to track this down.
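
As a rough sketch of the idea in plain Python sockets (not Dask’s actual networking code), disabling Nagle’s small-message batching and framing each message with a length prefix might look like this:

import socket
import struct

def send_msg(sock, payload: bytes):
    # Disable Nagle's algorithm so small length-prefixed messages go out immediately
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    sock.sendall(struct.pack('!Q', len(payload)))  # 8-byte length prefix
    sock.sendall(payload)

def recv_msg(sock) -> bytes:
    header = _recv_exact(sock, 8)
    (length,) = struct.unpack('!Q', header)
    return _recv_exact(sock, length)

def _recv_exact(sock, n):
    # Read exactly n bytes; no scanning of the buffer for sentinels required
    chunks = []
    while n:
        chunk = sock.recv(n)
        if not chunk:
            raise ConnectionError("socket closed mid-message")
        chunks.append(chunk)
        n -= len(chunk)
    return b''.join(chunks)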

Problem 2: Memory Copies

A nice machine can copy memory at 5 GB/s. If your network is only 100 MB/s then you can easily suffer several memory copies in your system without caring. This leads to code that looks like the following:

socket.send(header + payload)

This code concatenates two bytestrings, header and payload, before sending the result down a socket. If we cared deeply about avoiding memory copies then we might instead send these two separately:

socket.send(header)
socket.send(payload)

But who cares, right? At 5 GB/s copying memory is cheap!

Unfortunately this breaks down under either of the following conditions:

  1. You are sloppy enough to do this multiple times
  2. You find yourself on a machine with surprisingly low memory bandwidth, like 10 times slower, as is the case on some EC2 machines.

Both of these were true for me, but fortunately it’s usually straightforward to reduce the number of copies to a small number (we got down to three) with moderate effort.
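
If you do care, the standard library also offers scatter/gather sends on POSIX systems, which avoid the concatenation entirely (a sketch only; this is not necessarily what Dask or Tornado do internally):

# send several buffers in one system call without building header + payload
socket.sendmsg([header, payload])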

Problem 3: Unwanted Compression

Dask compresses all large messages with LZ4 or Snappy if they’re available. Unfortunately, if your data isn’t very compressible then this is mostly lost time. Doubly unfortunate is that the recipient then has to decompress the data anyway. Decompressing not-very-compressible data was surprisingly slow.

Now we compress with the following policy:

  1. If the message is less than 10kB, don’t bother
  2. Pick out five 10kB samples of the data and compress those. If the result isn’t well compressed then don’t bother compressing the full payload.
  3. Compress the full payload. If it doesn’t compress well then just send along the original to spare the receiver from decompressing.

In this case we use cheap checks to guard against unwanted compression. We also avoid any cost at all for small messages, which we care about deeply.
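
A sketch of this policy in plain Python might look like the following. The compressor is passed in as a function (e.g. lz4 or snappy), and the 10 kB sample size and 90% thresholds here are illustrative, not necessarily the exact numbers Dask uses:

import random

def maybe_compress(payload: bytes, compress, sample_size=10_000, n_samples=5):
    """Return (compressed_flag, data), compressing only when it seems worthwhile."""
    if len(payload) < 10_000:                      # 1. small messages: don't bother
        return False, payload

    # 2. compress a few small samples as a cheap test
    samples = [payload[random.randint(0, len(payload) - sample_size):][:sample_size]
               for _ in range(n_samples)]
    if sum(len(compress(s)) for s in samples) > 0.9 * n_samples * sample_size:
        return False, payload                      # samples barely compress: skip

    # 3. compress the full payload, but only keep it if it actually helped
    compressed = compress(payload)
    if len(compressed) > 0.9 * len(payload):
        return False, payload
    return True, compressed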

Problem 4: Cloudpickle is not as fast as Pickle

This was surprising, because cloudpickle mostly defers to Pickle for the easy stuff, like NumPy arrays.

In [1]: import numpy as np

In [2]: data = np.random.randint(0, 255, dtype='u1', size=10000000)

In [3]: import pickle, cloudpickle

In [4]: %time len(pickle.dumps(data, protocol=-1))
CPU times: user 8.65 ms, sys: 8.42 ms, total: 17.1 ms
Wall time: 16.9 ms
Out[4]: 10000161

In [5]: %time len(cloudpickle.dumps(data, protocol=-1))
CPU times: user 20.6 ms, sys: 24.5 ms, total: 45.1 ms
Wall time: 44.4 ms
Out[5]: 10000161

But it turns out that cloudpickle is using the Python implementation, while pickle itself (or cPickle in Python 2) is using the compiled C implementation. Fortunately this is easy to correct, and a quick typecheck on common large data formats in Python (NumPy and Pandas) gets us this speed boost.
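
The fix is roughly a type check before serializing. The dumps helper below is a simplified sketch; the real dispatch in Dask handles more cases:

import pickle
import cloudpickle
import numpy as np
import pandas as pd

def dumps(obj):
    # NumPy arrays and Pandas objects are plain data: the compiled pickle handles
    # them quickly, while cloudpickle would fall back to its Python implementation.
    if isinstance(obj, (np.ndarray, pd.DataFrame, pd.Series)):
        return pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    # Everything else (lambdas, closures, ...) goes through cloudpickle
    return cloudpickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)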

Problem 5: Pickle is still slower than you’d expect

Pickle runs at about half the speed of memcpy, even though the protocol is mostly just “serialize the dtype and strides, then tack on the data bytes”. There must be an extraneous memory copy in there.

See issue 7544

Problem 6: MsgPack is bad at large bytestrings

Dask serializes most messages with MsgPack, which is ordinarily very fast. Unfortunately the MsgPack spec doesn’t support bytestrings greater than 4GB (which do come up for us) and the Python implementations don’t pass through large bytestrings very efficiently. So we had to handle large bytestrings separately. Any message that contains bytestrings over 1MB in size will have them stripped out and sent along in a separate frame. This both avoids the MsgPack overhead and avoids a memory copy (we can send the bytes directly to the socket).
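
Conceptually the trick looks like this. The helper names and the '__frame__' placeholder below are illustrative, not Dask’s actual wire format:

import msgpack

BIG = 2 ** 20  # 1 MB

def dumps_frames(msg: dict):
    """Split a message into a msgpack-encoded header frame plus raw big-bytes frames."""
    frames = []
    small = {}
    for key, value in msg.items():
        if isinstance(value, bytes) and len(value) > BIG:
            small[key] = {'__frame__': len(frames)}   # placeholder pointing at a frame
            frames.append(value)                      # send the big bytes as-is, no copy
        else:
            small[key] = value
    return [msgpack.dumps(small, use_bin_type=True)] + frames

def loads_frames(frames):
    msg = msgpack.loads(frames[0], raw=False)
    big = frames[1:]
    for key, value in msg.items():
        if isinstance(value, dict) and '__frame__' in value:
            msg[key] = big[value['__frame__']]
    return msg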

Problem 7: Tornado makes a copy

Sockets on Windows don’t accept payloads greater than 128kB in size. As a result Tornado chops up large messages into many small ones. On Linux this memory copy is extraneous. It can be removed with a bit of logic within Tornado. I might do this in the moderate future.

Results

We serialize small messages in about 5 microseconds (thanks msgpack!) and move large bytes around at the cost of three memory copies (about 1-1.5 GB/s), which is generally faster than most networks in use.

Here is a profile of sending and receiving a gigabyte-sized NumPy array of random values through to the same process over localhost (500 MB/s on my machine.)

         381360 function calls (381323 primitive calls) in 1.451 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.366    0.366    0.366    0.366 {built-in method dumps}
        8    0.289    0.036    0.291    0.036 iostream.py:360(write)
    15353    0.228    0.000    0.228    0.000 {method 'join' of 'bytes' objects}
    15355    0.166    0.000    0.166    0.000 {method 'recv' of '_socket.socket' objects}
    15362    0.156    0.000    0.398    0.000 iostream.py:1510(_merge_prefix)
     7759    0.101    0.000    0.101    0.000 {method 'send' of '_socket.socket' objects}
    17/14    0.026    0.002    0.686    0.049 gen.py:990(run)
    15355    0.021    0.000    0.198    0.000 iostream.py:721(_read_to_buffer)
        8    0.018    0.002    0.203    0.025 iostream.py:876(_consume)
       91    0.017    0.000    0.335    0.004 iostream.py:827(_handle_write)
       89    0.015    0.000    0.217    0.002 iostream.py:585(_read_to_buffer_loop)
   122567    0.009    0.000    0.009    0.000 {built-in method len}
    15355    0.008    0.000    0.173    0.000 iostream.py:1010(read_from_fd)
    38369    0.004    0.000    0.004    0.000 {method 'append' of 'list' objects}
     7759    0.004    0.000    0.104    0.000 iostream.py:1023(write_to_fd)
        1    0.003    0.003    1.451    1.451 ioloop.py:746(start)

Dominant unwanted costs include the following:

  1. 400ms: Pickling the NumPy array
  2. 400ms: Bytestring handling within Tornado

After this we’re just bound by pushing bytes down a wire.

Conclusion

Writing fast code isn’t about writing any one thing particularly well, it’s about mitigating everything that can get in your way. As you approach peak performance, previously minor flaws suddenly become your dominant bottleneck. Success here depends on frequent profiling and keeping your mind open to unexpected and surprising costs.


Ad Hoc Distributed Random Forests


This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project

A screencast version of this post is available here: https://www.youtube.com/watch?v=FkPlEqB8AnE

TL;DR.

Dask.distributed lets you submit individual tasks to the cluster. We use this ability combined with Scikit Learn to train and run a distributed random forest on distributed tabular NYC Taxi data.

Our machine learning model does not perform well, but we do learn how to execute ad-hoc computations easily.

Motivation

In the past few posts we analyzed data on a cluster with Dask collections:

  1. Dask.bag on JSON records
  2. Dask.dataframe on CSV data
  3. Dask.array on HDF5 data

Often our computations don’t fit neatly into the bag, dataframe, or array abstractions. In these cases we want the flexibility of normal code with for loops, but still with the computational power of a cluster. With the dask.distributed task interface, we achieve something close to this.

Application: Naive Distributed Random Forest Algorithm

As a motivating application we build a random forest algorithm from the ground up using the single-machine Scikit Learn library, and dask.distributed’s ability to quickly submit individual tasks to run on the cluster. Our algorithm will look like the following:

  1. Pull data from some external source (S3) into several dataframes on the cluster
  2. For each dataframe, create and train one RandomForestClassifier
  3. Scatter single testing dataframe to all machines
  4. For each RandomForestClassifier predict output on test dataframe
  5. Aggregate independent predictions from each classifier together by a majority vote. To avoid bringing too much data to any one machine, perform this majority vote as a tree reduction.

Data: NYC Taxi 2015

As in our blogpost on distributed dataframes we use the data on all NYC Taxi rides in 2015. This is around 20GB on disk and 60GB in RAM.

We predict the number of passengers in each cab given the other numeric columns like pickup and destination location, fare breakdown, distance, etc.

We do this first on a small bit of data on a single machine and then on the entire dataset on the cluster. Our cluster is composed of twelve m4.xlarges (4 cores, 15GB RAM each).

Disclaimer and Spoiler Alert: I am not an expert in machine learning. Our algorithm will perform very poorly. If you’re excited about machine learning you can stop reading here. However, if you’re interested in how to build distributed algorithms with Dask then you may want to read on, especially if you happen to know enough machine learning to improve upon my naive solution.

API: submit, map, gather

We use a small number of dask.distributed functions to build our computation:

futures = executor.scatter(data)                       # scatter data
future = executor.submit(function, *args, **kwargs)    # submit single task
futures = executor.map(function, sequence)             # submit many tasks
results = executor.gather(futures)                     # gather results
executor.replicate(futures, n=number_of_replications)

In particular, functions like executor.submit(function, *args) let us send individual functions out to our cluster thousands of times a second. Because later tasks can consume the results of earlier ones, we can create complex workflows that stay entirely on the cluster and trust the distributed scheduler to move data around intelligently.
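
For example, futures returned by one submit call can be passed directly as arguments to another, so intermediate results never leave the cluster. A small illustrative sketch, where load, clean, and analyze are made-up functions:

raw = executor.submit(load, filename)        # runs on some worker
cleaned = executor.submit(clean, raw)        # consumes the remote result in place
summary = executor.submit(analyze, cleaned)  # still no data has come back to us
result = summary.result()                    # only the final summary is transferred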

Load Pandas from S3

First we load data from Amazon S3. We use the s3.read_csv(..., collection=False) function to load 178 Pandas DataFrames on our cluster from CSV data on S3. We get back a list of Future objects that refer to these remote dataframes. The use of collection=False gives us this list of futures rather than a single cohesive Dask.dataframe object.

from distributed import Executor, s3

e = Executor('52.91.1.177:8786')

dfs = s3.read_csv('dask-data/nyc-taxi/2015',
                  parse_dates=['tpep_pickup_datetime', 'tpep_dropoff_datetime'],
                  collection=False)
dfs = e.compute(dfs)

Each of these is a lightweight Future pointing to a pandas.DataFrame on the cluster.

>>> dfs[:5]
[<Future: status: finished, type: DataFrame, key: finalize-a06c3dd25769f434978fa27d5a4cf24b>,
 <Future: status: finished, type: DataFrame, key: finalize-7dcb27364a8701f45cb02d2fe034728a>,
 <Future: status: finished, type: DataFrame, key: finalize-b0dfe075000bd59c3a90bfdf89a990da>,
 <Future: status: finished, type: DataFrame, key: finalize-1c9bb25cefa1b892fac9b48c0aef7e04>,
 <Future: status: finished, type: DataFrame, key: finalize-c8254256b09ae287badca3cf6d9e3142>]

If we’re willing to wait a bit then we can pull data from any future back to our local process using the .result() method. We don’t want to do this too much though, data transfer can be expensive and we can’t hold the entire dataset in the memory of a single machine. Here we just bring back one of the dataframes:

>>> df = dfs[0].result()
>>> df.head()

   | VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | pickup_longitude | pickup_latitude | RateCodeID | store_and_fwd_flag | dropoff_longitude | dropoff_latitude | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount
 0 | 2 | 2015-01-15 19:05:39 | 2015-01-15 19:23:42 | 1 | 1.59 | -73.993896 | 40.750111 | 1 | N | -73.974785 | 40.750618 | 1 | 12.0 | 1.0 | 0.5 | 3.25 | 0 | 0.3 | 17.05
 1 | 1 | 2015-01-10 20:33:38 | 2015-01-10 20:53:28 | 1 | 3.30 | -74.001648 | 40.724243 | 1 | N | -73.994415 | 40.759109 | 1 | 14.5 | 0.5 | 0.5 | 2.00 | 0 | 0.3 | 17.80
 2 | 1 | 2015-01-10 20:33:38 | 2015-01-10 20:43:41 | 1 | 1.80 | -73.963341 | 40.802788 | 1 | N | -73.951820 | 40.824413 | 2 | 9.5 | 0.5 | 0.5 | 0.00 | 0 | 0.3 | 10.80
 3 | 1 | 2015-01-10 20:33:39 | 2015-01-10 20:35:31 | 1 | 0.50 | -74.009087 | 40.713818 | 1 | N | -74.004326 | 40.719986 | 2 | 3.5 | 0.5 | 0.5 | 0.00 | 0 | 0.3 | 4.80
 4 | 1 | 2015-01-10 20:33:39 | 2015-01-10 20:52:58 | 1 | 3.00 | -73.971176 | 40.762428 | 1 | N | -74.004181 | 40.742653 | 2 | 15.0 | 0.5 | 0.5 | 0.00 | 0 | 0.3 | 16.30

Train on a single machine

To start let’s go through the standard Scikit Learn fit/predict/score cycle with this small bit of data on a single machine.

from sklearn.ensemble import RandomForestClassifier
from sklearn.cross_validation import train_test_split

df_train, df_test = train_test_split(df)

columns = ['trip_distance', 'pickup_longitude', 'pickup_latitude',
           'dropoff_longitude', 'dropoff_latitude', 'payment_type',
           'fare_amount', 'mta_tax', 'tip_amount', 'tolls_amount']

est = RandomForestClassifier(n_estimators=4)
est.fit(df_train[columns], df_train.passenger_count)

This builds a RandomForestClassifier with four decision trees and then trains it against the numeric columns in the data, trying to predict the passenger_count column. It takes around 10 seconds to train on a single core. We now see how well we do on the holdout testing data:

>>> est.score(df_test[columns], df_test.passenger_count)
0.65808188654721012

This 65% accuracy is actually pretty poor. About 70% of the rides in NYC have a single passenger, so the model of “always guess one” would out-perform our fancy random forest.

>>> from sklearn.metrics import accuracy_score
>>> import numpy as np
>>> accuracy_score(df_test.passenger_count,
...                np.ones_like(df_test.passenger_count))
0.70669390028780987

This is where my ignorance in machine learning really kills us. There is likely a simple way to improve this. However, because I’m more interested in showing how to build distributed computations with Dask than in actually doing machine learning I’m going to go ahead with this naive approach. Spoiler alert: we’re going to do a lot of computation and still not beat the “always guess one” strategy.

Fit across the cluster with executor.map

First we build a function that does just what we did before: build a random forest and then train it on a dataframe.

def fit(df):
    est = RandomForestClassifier(n_estimators=4)
    est.fit(df[columns], df.passenger_count)
    return est

Second we call this function on all of our training dataframes on the cluster using the standard e.map(function, sequence) function. This sends out many small tasks for the cluster to run. We use all but the last dataframe for training data and hold out the last dataframe for testing. There are more principled ways to do this, but again we’re going to charge ahead here.

train = dfs[:-1]
test = dfs[-1]

estimators = e.map(fit, train)

This takes around two minutes to train on all of the 177 dataframes and now we have 177 independent estimators, each capable of guessing how many passengers a particular ride had. There is relatively little overhead in this computation.

Predict on testing data

Recall that we kept separate a future, test, that points to a Pandas dataframe on the cluster that was not used to train any of our 177 estimators. We’re going to replicate this dataframe across all workers on the cluster and then ask each estimator to predict the number of passengers for each ride in this dataset.

e.replicate([test], n=48)

def predict(est, X):
    return est.predict(X[columns])

predictions = [e.submit(predict, est, test) for est in estimators]

Here we used the executor.submit(function, *args, **kwargs) function in a list comprehension to individually launch many tasks. The scheduler determines when and where to run these tasks for optimal computation time and minimal data transfer. As before, this returns futures that we can use to collect the data later if we want to.

Developers note: we explicitly replicate here in order to take advantage of efficient tree-broadcasting algorithms. This is purely a performance consideration; everything would have worked fine without it, but the explicit broadcast turns a 30s communication+computation into a 2s communication+computation.

Aggregate predictions by majority vote

For each estimator we now have an independent prediction of the passenger counts for all of the rides in our test data. In other words for each ride we have 177 different opinions on how many passengers were in the cab. By pooling these opinions together we hope to achieve a more accurate consensus opinion.

For example, consider the first four prediction arrays:

>>> a_few_predictions = e.gather(predictions[:4])  # remote futures -> local arrays
>>> a_few_predictions
[array([1, 2, 1, ..., 2, 2, 1]),
 array([1, 1, 1, ..., 1, 1, 1]),
 array([2, 1, 1, ..., 1, 1, 1]),
 array([1, 1, 1, ..., 1, 1, 1])]

For the first ride/column we see that three of the four predictions are for a single passenger while one prediction disagrees and is for two passengers. We create a consensus opinion by taking the mode of the stacked arrays:

from scipy.stats import mode
import numpy as np

def mymode(*arrays):
    array = np.stack(arrays, axis=0)
    return mode(array)[0][0]

>>> mymode(*a_few_predictions)
array([1, 1, 1, ..., 1, 1, 1])

And so when we take the mode of these four prediction arrays we see that the majority opinion of one passenger dominates for all of the six rides visible here.

Tree Reduction

We could call our mymode function on all of our predictions like this:

>>> mode_prediction = e.submit(mymode, *predictions)  # this doesn't scale well

Unfortunately this would move all of our results to a single machine to compute the mode there. This might swamp that single machine.

Instead we batch our predictions into groups of size 10, take the mode of each group, and then repeat the process with the smaller set of predictions until we have only one left. This sort of multi-step reduction is called a tree reduction. We can write it up with a short loop and executor.submit. This is only an approximation of the mode, but it’s a much more scalable computation. This finishes in about 1.5 seconds.

from toolz import partition_all

while len(predictions) > 1:
    predictions = [e.submit(mymode, *chunk)
                   for chunk in partition_all(10, predictions)]

result = e.gather(predictions)[0]

>>> result
array([1, 1, 1, ..., 1, 1, 1])

Final Score

Finally, after completing all of our work on our cluster we can see how well our distributed random forest algorithm does.

>>> accuracy_score(result, test.result().passenger_count)
0.67061974451423045

Still worse than the naive “always guess one” strategy. This just goes to show that, no matter how sophisticated your Big Data solution is, there is no substitute for common sense and a little bit of domain expertise.

What didn’t work

As always I’ll have a section like this that honestly says what doesn’t work well and what I would have done with more time.

  • Clearly this would have benefited from more machine learning knowledge. What would have been a good approach for this problem?
  • I’ve been thinking a bit about memory management of replicated data on the cluster. In this exercise we specifically replicated out the test data. Everything would have worked fine without this step but it would have been much slower as every worker gathered data from the single worker that originally had the test dataframe. Replicating data is great until you start filling up distributed RAM. It will be interesting to think of policies about when to start cleaning up redundant data and when to keep it around.
  • Several people from both open source users and Continuum customers have asked about a general Dask library for machine learning, something akin to Spark’s MLlib. Ideally a future Dask.learn module would leverage Scikit-Learn in the same way that Dask.dataframe leverages Pandas. It’s not clear how to cleanly break up and parallelize Scikit-Learn algorithms.

Conclusion

This blogpost gives a concrete example using basic task submission with executor.map and executor.submit to build a non-trivial computation. This approach is straightforward and not restrictive. Personally this interface excites me more than collections like Dask.dataframe; there is a lot of freedom in arbitrary task submission.

Data Bandwidth


This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project

tl;dr: We list and combine common bandwidths relevant in data science

Understanding data bandwidths helps us to identify bottlenecks and write efficient code. Both hardware and software can be characterized by how quickly they churn through data. We present a rough list of relevant data bandwidths and discuss how to use this list when optimizing a data pipeline.

Name                                  Bandwidth (MB/s)
Memory copy                                       3000
Basic filtering in C/NumPy/Pandas                 3000
Fast decompression                                1000
SSD Large Sequential Read                          500
Interprocess communication (IPC)                   300
msgpack deserialization                            125
Gigabit Ethernet                                   100
Pandas read_csv                                    100
JSON Deserialization                                50
Slow decompression (e.g. gzip/bz2)                  50
SSD Small Random Read                               20
Wireless network                                     1

Disclaimer: all numbers in this post are rule-of-thumb and vary by situation

Understanding these scales can help you to identify how to speed up your program. For example, there is no need to use a faster network or disk if you store your data as JSON.

Combining bandwidths

Complex data pipelines involve many stages. The rule to combine bandwidths is to add up the inverses of the bandwidths, then take the inverse again:

    total bandwidth = 1 / (1/bandwidth_1 + 1/bandwidth_2 + ... + 1/bandwidth_n)

This is the same rule used to combine conductances in series within electrical circuits. One quickly learns to optimize the slowest link in the chain first.

Example

When we read data from disk (500 MB/s) and then deserialize it from JSON (50 MB/s) our full bandwidth is about 45 MB/s:

    1 / (1/500 + 1/50) ≈ 45 MB/s

If we invest in a faster hard drive system that reads at 2 GB/s then we get only a marginal performance improvement:

    1 / (1/2000 + 1/50) ≈ 49 MB/s

However if we invest in a faster serialization technology, like msgpack (125 MB/s), then we roughly double our effective bandwidth:

    1 / (1/500 + 1/125) = 100 MB/s

This example demonstrates that we should focus on the weakest bandwidth first. Cheap changes like switching from JSON to msgpack can be more effective than expensive changes, like purchasing fast storage hardware.
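
The combination rule is small enough to write as a helper; here is a minimal sketch using the rule-of-thumb numbers from the table above:

def combined_bandwidth(*bandwidths):
    """Combine serial stage bandwidths (MB/s): invert, sum, invert again."""
    return 1 / sum(1 / b for b in bandwidths)

combined_bandwidth(500, 50)    # disk read + JSON parse   -> ~45 MB/s
combined_bandwidth(2000, 50)   # faster disk, same JSON   -> ~49 MB/s
combined_bandwidth(500, 125)   # same disk, msgpack       -> 100 MB/s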

Overlapping Bandwidths

We can overlap certain classes of bandwidths. In particular we can often overlap communication bandwidths with computation bandwidths. In our disk+JSON example above we can probably hide the disk reading time completely. The same would go for network applications if we handle sockets correctly.

Parallel Bandwidths

We can parallelize some computational bandwidths. For example we can parallelize JSON deserialization across, say, four cores to quadruple the effective bandwidth: 50 MB/s * 4 = 200 MB/s. Typically communication bandwidths are not parallelizable per core.

Disk Bandwidth


This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project

tl;dr: Disk read and write bandwidths depend strongly on block size.

Disk read/write bandwidths on commodity hardware vary between 10 MB/s (or slower) to 500 MB/s (or faster on fancy hardware). This variance can be characterized by the following rules:

  1. Reading/writing large blocks of data is faster than reading/writing small blocks of data
  2. Reading is faster than writing
  3. Solid state drives are faster than spinning disk especially for many small reads/writes

In this notebook we experiment with the dependence of disk bandwidth on file size. The result of this experiment is the following image, which depicts the read and write bandwidths of a commercial laptop SSD as we vary block size:

Analysis

We see that this particular hard drive wants to read/write data in chunksizes of 1-100 MB. If we can arrange our data so that we consistently pull off larger blocks of data at a time then we can read through data quite quickly at 500 MB/s. We can churn through a 30 GB dataset in one minute. Sophisticated file formats take advantage of this by storing similar data consecutively. For example column stores store all data within a single column in single large blocks.

Difficulties when measuring disk I/O

Your file system is sophisticated. It will buffer both reads and writes in RAM even as you think you’re writing to disk. In particular, this guards your disk somewhat against the “many small writes” regime of terrible performance. This is great, and your file system does a fantastic job (bringing write numbers up from 0.1 MB/s to 20 MB/s or so), but it makes it a bit tricky to benchmark properly. In the experiment above we fsync each file after writing to flush write buffers and explicitly clear all buffers before entering the read section.
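
A minimal version of such a benchmark might look like the following. This is a sketch only; the actual notebook differs, and the details of flushing and cache-dropping vary by operating system:

import os
import time

def write_read_bandwidth(path, block_size, n_blocks=100):
    """Write then read n_blocks blocks of block_size bytes; return (write, read) MB/s."""
    data = os.urandom(block_size)

    start = time.time()
    with open(path, 'wb') as f:
        for _ in range(n_blocks):
            f.write(data)
        f.flush()
        os.fsync(f.fileno())          # flush OS write buffers before stopping the clock
    write_bw = block_size * n_blocks / (time.time() - start) / 1e6

    # For honest read numbers the OS page cache should be dropped here,
    # e.g. `echo 3 > /proc/sys/vm/drop_caches` on Linux (requires root).
    start = time.time()
    with open(path, 'rb') as f:
        while f.read(block_size):
            pass
    read_bw = block_size * n_blocks / (time.time() - start) / 1e6

    return write_bw, read_bw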

Anecdotally I also learned that my operating system caps write speeds at 30 MB/s when operating off of battery power. This anecdote demonstrates how particular your hard drive may be when controlled by a file system. It is worth remembering that your hard drive is a physical machine and not just a convenient abstraction.

Introducing Dask distributed


This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project

tl;dr: We analyze JSON data on a cluster using pure Python projects.

Dask, a Python library for parallel computing, now works on clusters. During the past few months I and others have extended dask with a new distributed memory scheduler. This enables dask’s existing parallel algorithms to scale across 10s to 100s of nodes, and extends a subset of PyData to distributed computing. Over the next few weeks I and others will write about this system. Please note that dask+distributed is developing quickly and so the API is likely to shift around a bit.

Today we start simple with the typical cluster computing problem, parsing JSON records, filtering, and counting events using dask.bag and the new distributed scheduler. We’ll dive into more advanced problems in future posts.

A video version of this blogpost is available here.

GitHub Archive Data on S3

GitHub releases data dumps of their public event stream as gzip-compressed, line-delimited JSON. This data is too large to fit comfortably into memory, even on a sizable workstation. We could stream it from disk but, due to the compression and JSON encoding, this takes a while and bogs down interactive use. For an interactive experience with data like this we need a distributed cluster.

Setup and Data

We provision nine m3.2xlarge nodes on EC2. These have eight cores and 30GB of RAM each. On this cluster we provision one scheduler and nine workers (see setup docs). (More on launching in later posts.) We have five months of data, from 2015-01-01 to 2015-05-31, on the githubarchive-data bucket in S3. This data is publicly available if you want to play with it on EC2. You can download the full dataset at https://www.githubarchive.org/ .

The first record looks like the following:

{'actor': {'avatar_url': 'https://avatars.githubusercontent.com/u/9152315?',
           'gravatar_id': '',
           'id': 9152315,
           'login': 'davidjhulse',
           'url': 'https://api.github.com/users/davidjhulse'},
 'created_at': '2015-01-01T00:00:00Z',
 'id': '2489368070',
 'payload': {'before': '86ffa724b4d70fce46e760f8cc080f5ec3d7d85f',
             'commits': [{'author': {'email': 'david.hulse@live.com',
                                     'name': 'davidjhulse'},
                          'distinct': True,
                          'message': 'Altered BingBot.jar\n\nFixed issue with multiple account support',
                          'sha': 'a9b22a6d80c1e0bb49c1cf75a3c075b642c28f81',
                          'url': 'https://api.github.com/repos/davidjhulse/davesbingrewardsbot/commits/a9b22a6d80c1e0bb49c1cf75a3c075b642c28f81'}],
             'distinct_size': 1,
             'head': 'a9b22a6d80c1e0bb49c1cf75a3c075b642c28f81',
             'push_id': 536740396,
             'ref': 'refs/heads/master',
             'size': 1},
 'public': True,
 'repo': {'id': 28635890,
          'name': 'davidjhulse/davesbingrewardsbot',
          'url': 'https://api.github.com/repos/davidjhulse/davesbingrewardsbot'},
 'type': 'PushEvent'}

So we have a large dataset on S3 and a moderate sized play cluster on EC2, which has access to S3 data at about 100MB/s per node. We’re ready to play.

Play

We start an ipython interpreter on our local laptop and connect to the dask scheduler running on the cluster. For the purposes of timing, the cluster is on the East Coast while the local machine is in California on commercial broadband internet.

>>> from distributed import Executor, s3
>>> e = Executor('54.173.84.107:8786')
>>> e
<Executor: scheduler=54.173.84.107:8786 workers=72 threads=72>

Our seventy-two worker processes come from nine workers with eight processes each. We chose processes rather than threads for this task because computations will be bound by the GIL. We will change this to threads in later examples.

We start by loading a single month of data into distributed memory.

import json

text = s3.read_text('githubarchive-data', '2015-01', compression='gzip')
records = text.map(json.loads)
records = e.persist(records)

The data lives in S3 in hourly files as gzip-compressed, line-delimited JSON. The s3.read_text and text.map functions produce dask.bag objects which track our operations in a lazily built task graph. When we ask the executor to persist this collection we ship those tasks off to the scheduler to run on all of the workers in parallel. The persist function gives us back another dask.bag pointing to these remotely running results. This persist call returns immediately, and the computation happens on the cluster in the background asynchronously. We gain control of our interpreter immediately while the cluster hums along.

The cluster takes around 40 seconds to download, decompress, and parse this data. If you watch the video embedded above you’ll see fancy progress-bars.

We ask for a single record. This returns in around 200ms, which is fast enough that it feels instantaneous to a human.

>>> records.take(1)
({'actor': {'avatar_url': 'https://avatars.githubusercontent.com/u/9152315?',
            'gravatar_id': '',
            'id': 9152315,
            'login': 'davidjhulse',
            'url': 'https://api.github.com/users/davidjhulse'},
  'created_at': '2015-01-01T00:00:00Z',
  'id': '2489368070',
  'payload': {'before': '86ffa724b4d70fce46e760f8cc080f5ec3d7d85f',
              'commits': [{'author': {'email': 'david.hulse@live.com',
                                      'name': 'davidjhulse'},
                           'distinct': True,
                           'message': 'Altered BingBot.jar\n\nFixed issue with multiple account support',
                           'sha': 'a9b22a6d80c1e0bb49c1cf75a3c075b642c28f81',
                           'url': 'https://api.github.com/repos/davidjhulse/davesbingrewardsbot/commits/a9b22a6d80c1e0bb49c1cf75a3c075b642c28f81'}],
              'distinct_size': 1,
              'head': 'a9b22a6d80c1e0bb49c1cf75a3c075b642c28f81',
              'push_id': 536740396,
              'ref': 'refs/heads/master',
              'size': 1},
  'public': True,
  'repo': {'id': 28635890,
           'name': 'davidjhulse/davesbingrewardsbot',
           'url': 'https://api.github.com/repos/davidjhulse/davesbingrewardsbot'},
  'type': 'PushEvent'},)

This particular event is a 'PushEvent'. Let’s quickly see all the kinds of events. For fun, we’ll also time the interaction:

>>> %time records.pluck('type').frequencies().compute()
CPU times: user 112 ms, sys: 0 ns, total: 112 ms
Wall time: 2.41 s
[('ReleaseEvent', 44312),
 ('MemberEvent', 69757),
 ('IssuesEvent', 693363),
 ('PublicEvent', 14614),
 ('CreateEvent', 1651300),
 ('PullRequestReviewCommentEvent', 214288),
 ('PullRequestEvent', 680879),
 ('ForkEvent', 491256),
 ('DeleteEvent', 256987),
 ('PushEvent', 7028566),
 ('IssueCommentEvent', 1322509),
 ('GollumEvent', 150861),
 ('CommitCommentEvent', 96468),
 ('WatchEvent', 1321546)]

And we compute the total count of all commits for this month.

>>> %time records.count().compute()
CPU times: user 134 ms, sys: 133 µs, total: 134 ms
Wall time: 1.49 s
14036706

We see that it takes a few seconds to walk through the data (and absorb all of the scheduling overhead). The scheduler adds about a millisecond of overhead per task, and there are about 1000 partitions/files here (the GitHub data is split by hour, and there are roughly 730 hours in a month), so most of the cost here is overhead.

Investigate Jupyter

We investigate the activities of Project Jupyter. We chose this project because it’s sizable and because we understand the players involved and so can check our accuracy. This will require us to filter our data to a much smaller subset, then find popular repositories and members.

>>> jupyter = (records.filter(lambda d: d['repo']['name'].startswith('jupyter/'))
...                   .repartition(10))
>>> jupyter = e.persist(jupyter)

All records, regardless of event type, have a repository which has a name like 'organization/repository' in typical GitHub fashion. We filter all records that start with 'jupyter/'. Additionally, because this dataset is likely much smaller, we push all of these records into just ten partitions. This dramatically reduces scheduling overhead. The persist call hands this computation off to the scheduler and then gives us back our collection that points to that computing result. Filtering this month for Jupyter events takes about 7.5 seconds. Afterwards computations on this subset feel snappy.

>>> %time jupyter.count().compute()
CPU times: user 5.19 ms, sys: 97 µs, total: 5.28 ms
Wall time: 199 ms
747

>>> %time jupyter.take(1)
CPU times: user 7.01 ms, sys: 259 µs, total: 7.27 ms
Wall time: 182 ms
({'actor': {'avatar_url': 'https://avatars.githubusercontent.com/u/26679?',
            'gravatar_id': '',
            'id': 26679,
            'login': 'marksteve',
            'url': 'https://api.github.com/users/marksteve'},
  'created_at': '2015-01-01T13:25:44Z',
  'id': '2489612400',
  'org': {'avatar_url': 'https://avatars.githubusercontent.com/u/7388996?',
          'gravatar_id': '',
          'id': 7388996,
          'login': 'jupyter',
          'url': 'https://api.github.com/orgs/jupyter'},
  'payload': {'action': 'started'},
  'public': True,
  'repo': {'id': 5303123,
           'name': 'jupyter/nbviewer',
           'url': 'https://api.github.com/repos/jupyter/nbviewer'},
  'type': 'WatchEvent'},)

So the first event of the year was by 'marksteve' who decided to watch the 'nbviewer' repository on new year’s day.

Notice that these computations take around 200ms. I can’t get below this from my local machine, so we’re likely bound by communicating to such a remote location. A 200ms latency is not great if you’re playing a video game, but it’s decent for interactive computing.

Here are all of the Jupyter repositories touched in the month of January:

>>> %time jupyter.pluck('repo').pluck('name').distinct().compute()
CPU times: user 2.84 ms, sys: 4.03 ms, total: 6.86 ms
Wall time: 204 ms
['jupyter/dockerspawner',
 'jupyter/design',
 'jupyter/docker-demo-images',
 'jupyter/jupyterhub',
 'jupyter/configurable-http-proxy',
 'jupyter/nbshot',
 'jupyter/sudospawner',
 'jupyter/colaboratory',
 'jupyter/strata-sv-2015-tutorial',
 'jupyter/tmpnb-deploy',
 'jupyter/nature-demo',
 'jupyter/nbcache',
 'jupyter/jupyter.github.io',
 'jupyter/try.jupyter.org',
 'jupyter/jupyter-drive',
 'jupyter/tmpnb',
 'jupyter/tmpnb-redirector',
 'jupyter/nbgrader',
 'jupyter/nbindex',
 'jupyter/nbviewer',
 'jupyter/oauthenticator']

And the top ten most active people on GitHub.

>>> %time (jupyter.pluck('actor')
...               .pluck('login')
...               .frequencies()
...               .topk(10, lambda kv: kv[1])
...               .compute())
CPU times: user 8.03 ms, sys: 90 µs, total: 8.12 ms
Wall time: 226 ms
[('rgbkrk', 156),
 ('minrk', 87),
 ('Carreau', 87),
 ('KesterTong', 74),
 ('jhamrick', 70),
 ('bollwyvl', 25),
 ('pkt', 18),
 ('ssanderson', 13),
 ('smashwilson', 13),
 ('ellisonbg', 13)]

Nothing too surprising here if you know these folks.

Full Dataset

The full five months of data is too large to fit in memory, even for this cluster. When we represent semi-structured data like this with dynamic data structures like lists and dictionaries there is quite a bit of memory bloat. Some careful attention to efficient semi-structured storage here could save us from having to switch to such a large cluster, but that will have to be the topic of another post.

Instead, we operate efficiently on this dataset by flowing it through memory, persisting only the records we care about. The distributed dask scheduler descends from the single-machine dask scheduler, which was quite good at flowing through a computation and intelligently removing intermediate results.

From a user API perspective, we call persist only on the jupyter dataset, and not the full records dataset.

>>> full = (s3.read_text('githubarchive-data', '2015', compression='gzip')
...           .map(json.loads))

>>> jupyter = (full.filter(lambda d: d['repo']['name'].startswith('jupyter/'))
...                .repartition(10))
>>> jupyter = e.persist(jupyter)

It takes 2m36s to download, decompress, parse, and filter the five months of publicly available GitHub events down to the Jupyter events on nine m3.2xlarges.

There were seven thousand such events.

>>> jupyter.count().compute()
7065

We find which repositories saw the most activity during that time:

>>> %time (jupyter.pluck('repo')
...               .pluck('name')
...               .frequencies()
...               .topk(20, lambda kv: kv[1])
...               .compute())
CPU times: user 6.98 ms, sys: 474 µs, total: 7.46 ms
Wall time: 219 ms
[('jupyter/jupyterhub', 1262),
 ('jupyter/nbgrader', 1235),
 ('jupyter/nbviewer', 846),
 ('jupyter/jupyter_notebook', 507),
 ('jupyter/jupyter-drive', 505),
 ('jupyter/notebook', 451),
 ('jupyter/docker-demo-images', 363),
 ('jupyter/tmpnb', 284),
 ('jupyter/jupyter_client', 162),
 ('jupyter/dockerspawner', 149),
 ('jupyter/colaboratory', 134),
 ('jupyter/jupyter_core', 127),
 ('jupyter/strata-sv-2015-tutorial', 108),
 ('jupyter/jupyter_nbconvert', 103),
 ('jupyter/configurable-http-proxy', 89),
 ('jupyter/hubpress.io', 85),
 ('jupyter/jupyter.github.io', 84),
 ('jupyter/tmpnb-deploy', 76),
 ('jupyter/nbconvert', 66),
 ('jupyter/jupyter_qtconsole', 59)]

We see that projects like jupyterhub were quite active during that time while, surprisingly, nbconvert saw relatively little action.

Local Data

The Jupyter data is quite small and easily fits in a single machine. Let’s bring the data to our local machine so that we can compare times:

>>> %time L = jupyter.compute()
CPU times: user 4.74 s, sys: 10.9 s, total: 15.7 s
Wall time: 30.2 s

It takes surprisingly long to download the data, but once it’s here, we can iterate far more quickly with basic Python.

>>> from toolz.curried import pluck, frequencies, topk, pipe
>>> %time pipe(L, pluck('repo'), pluck('name'), frequencies, dict.items,
...            topk(20, key=lambda kv: kv[1]), list)
CPU times: user 11.8 ms, sys: 0 ns, total: 11.8 ms
Wall time: 11.5 ms
[('jupyter/jupyterhub', 1262),
 ('jupyter/nbgrader', 1235),
 ('jupyter/nbviewer', 846),
 ('jupyter/jupyter_notebook', 507),
 ('jupyter/jupyter-drive', 505),
 ('jupyter/notebook', 451),
 ('jupyter/docker-demo-images', 363),
 ('jupyter/tmpnb', 284),
 ('jupyter/jupyter_client', 162),
 ('jupyter/dockerspawner', 149),
 ('jupyter/colaboratory', 134),
 ('jupyter/jupyter_core', 127),
 ('jupyter/strata-sv-2015-tutorial', 108),
 ('jupyter/jupyter_nbconvert', 103),
 ('jupyter/configurable-http-proxy', 89),
 ('jupyter/hubpress.io', 85),
 ('jupyter/jupyter.github.io', 84),
 ('jupyter/tmpnb-deploy', 76),
 ('jupyter/nbconvert', 66),
 ('jupyter/jupyter_qtconsole', 59)]

The difference here is 20x, which is a good reminder that, once you no longer have a large problem you should probably eschew distributed systems and act locally.

Conclusion

Downloading, decompressing, parsing, filtering, and counting JSON records is the new wordcount. It’s the first problem anyone sees. Fortunately it’s both easy to solve and the common case. Woo hoo!

Here we saw that dask+distributed handles the common case decently well with a pure Python stack. Typically Python users rely on a JVM technology like Hadoop/Spark/Storm to distribute their computations. Here we have Python distributing Python; there are some usability gains to be had here like nice stack traces, a bit less serialization overhead, and attention to other Pythonic style choices.

Over the next few posts I intend to deviate from this common case. Most “Big Data” technologies were designed to solve typical data munging problems found in web companies or with simple database operations in mind. Python users care about these things too, but they also reach out to a wide variety of fields. In dask+distributed development we care about the common case, but also support less traditional workflows that are commonly found in the life, physical, and algorithmic sciences.

By designing to support these more extreme cases we’ve nailed some common pain points in current distributed systems. Today we’ve seen low latency and remote control; in the future we’ll see others.

What doesn’t work

I’ll have an honest section like this at the end of each upcoming post describing what doesn’t work, what still feels broken, or what I would have done differently with more time.

  • The imports for dask and distributed are still strange. They’re two separate codebases that play very nicely together. Unfortunately the functionality you need is sometimes in one or in the other and it’s not immediately clear to the novice user where to go. For example dask.bag, the collection we’re using for records, jupyter, etc. is in dask but the s3 module is within the distributed library. We’ll have to merge things at some point in the near-to-moderate future. Ditto for the API: there are compute methods both on the dask collections (records.compute()) and on the distributed executor (e.compute(records)) that behave slightly differently.

  • We lack an efficient distributed shuffle algorithm. This is very important if you want to use operations like .groupby (which you should avoid anyway). The user API here doesn’t even cleanly warn users that this is missing in the distributed case, which is kind of a mess. (It works fine on a single machine.) Efficient alternatives like foldby are available.

  • I would have liked to run this experiment directly on the cluster to see how low we could have gone below the 200ms barrier we ran into here.

  • dask, the original project
  • dask.distributed, the distributed memory scheduler powering the cluster computing
  • dask.bag, the user API we’ve used in this post.
  • This post largely repeats work by Blake Griffith in a similar post last year with an older iteration of the dask distributed scheduler

Pandas on HDFS with Dask Dataframes


This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project

In this post we use Pandas in parallel across an HDFS cluster to read CSV data. We coordinate these computations with dask.dataframe. A screencast version of this blogpost is available here and the previous post in this series is available here.

To start, we connect to our scheduler, import the hdfs module from the distributed library, and read our CSV data from HDFS.

>>> from distributed import Executor, hdfs, progress
>>> e = Executor('127.0.0.1:8786')
>>> e
<Executor: scheduler=127.0.0.1:8786 workers=64 threads=64>

>>> nyc2014 = hdfs.read_csv('/nyctaxi/2014/*.csv',
...                         parse_dates=['pickup_datetime', 'dropoff_datetime'],
...                         skipinitialspace=True)

>>> nyc2015 = hdfs.read_csv('/nyctaxi/2015/*.csv',
...                         parse_dates=['tpep_pickup_datetime', 'tpep_dropoff_datetime'])

>>> nyc2014, nyc2015 = e.persist([nyc2014, nyc2015])
>>> progress(nyc2014, nyc2015)

Our data comes from the New York City Taxi and Limousine Commission which publishes all yellow cab taxi rides in NYC for various years. This is a nice model dataset for computational tabular data because it’s large enough to be annoying while also deep enough to be broadly appealing. Each year is about 25GB on disk and about 60GB in memory as a Pandas DataFrame.

HDFS breaks up our CSV files into 128MB chunks on various hard drives spread throughout the cluster. The dask.distributed workers each read the chunks of bytes local to them and call the pandas.read_csv function on these bytes, producing 391 separate Pandas DataFrame objects spread throughout the memory of our eight worker nodes. The returned objects, nyc2014 and nyc2015, are dask.dataframe objects which present a subset of the Pandas API to the user, but farm out all of the work to the many Pandas dataframes they control across the network.

Play with Distributed Data

If we wait for the data to load fully into memory then we can perform pandas-style analysis at interactive speeds.

>>> nyc2015.head()

   | VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | pickup_longitude | pickup_latitude | RateCodeID | store_and_fwd_flag | dropoff_longitude | dropoff_latitude | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount
 0 | 2 | 2015-01-15 19:05:39 | 2015-01-15 19:23:42 | 1 | 1.59 | -73.993896 | 40.750111 | 1 | N | -73.974785 | 40.750618 | 1 | 12.0 | 1.0 | 0.5 | 3.25 | 0 | 0.3 | 17.05
 1 | 1 | 2015-01-10 20:33:38 | 2015-01-10 20:53:28 | 1 | 3.30 | -74.001648 | 40.724243 | 1 | N | -73.994415 | 40.759109 | 1 | 14.5 | 0.5 | 0.5 | 2.00 | 0 | 0.3 | 17.80
 2 | 1 | 2015-01-10 20:33:38 | 2015-01-10 20:43:41 | 1 | 1.80 | -73.963341 | 40.802788 | 1 | N | -73.951820 | 40.824413 | 2 | 9.5 | 0.5 | 0.5 | 0.00 | 0 | 0.3 | 10.80
 3 | 1 | 2015-01-10 20:33:39 | 2015-01-10 20:35:31 | 1 | 0.50 | -74.009087 | 40.713818 | 1 | N | -74.004326 | 40.719986 | 2 | 3.5 | 0.5 | 0.5 | 0.00 | 0 | 0.3 | 4.80
 4 | 1 | 2015-01-10 20:33:39 | 2015-01-10 20:52:58 | 1 | 3.00 | -73.971176 | 40.762428 | 1 | N | -74.004181 | 40.742653 | 2 | 15.0 | 0.5 | 0.5 | 0.00 | 0 | 0.3 | 16.30

>>> len(nyc2014)
165114373
>>> len(nyc2015)
146112989

Interestingly it appears that the NYC cab industry has contracted a bit in the last year. There are fewer cab rides in 2015 than in 2014.

When we ask for something like the length of the full dask.dataframe we actually ask for the length of all of the hundreds of Pandas dataframes and then sum them up. This process of reaching out to all of the workers completes in around 200-300 ms, which is generally fast enough to feel snappy in an interactive session.
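
Conceptually this is just a per-partition map followed by a sum. With the futures-style API from earlier posts it could be sketched like this (a hypothetical sketch where partition_futures is a list of futures pointing at Pandas dataframes; dask.dataframe generates the equivalent task graph for you):

lengths = e.map(len, partition_futures)   # one tiny task per Pandas dataframe
total = e.submit(sum, lengths)            # combine the per-partition counts
total.result()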

The dask.dataframe API looks just like the Pandas API, except that we call .compute() when we want an actual result.

>>> nyc2014.passenger_count.sum().compute()
279997507.0

>>> nyc2015.passenger_count.sum().compute()
245566747

Dask.dataframes build a plan to get your result and the distributed scheduler coordinates that plan on all of the little Pandas dataframes on the workers that make up our dataset.

Pandas for Metadata

Let’s appreciate for a moment all the work we didn’t have to do around CSV handling because Pandas magically handled it for us.

>>> nyc2015.dtypes
VendorID                          int64
tpep_pickup_datetime     datetime64[ns]
tpep_dropoff_datetime    datetime64[ns]
passenger_count                   int64
trip_distance                   float64
pickup_longitude                float64
pickup_latitude                 float64
RateCodeID                        int64
store_and_fwd_flag               object
dropoff_longitude               float64
dropoff_latitude                float64
payment_type                      int64
fare_amount                     float64
extra                           float64
mta_tax                         float64
tip_amount                      float64
tolls_amount                    float64
improvement_surcharge           float64
total_amount\r                  float64
dtype: object

We didn’t have to find columns or specify data-types. We didn’t have to parse each value with an int or float function as appropriate. We didn’t have to parse the datetimes, but instead just specified a parse_dates= keyword. The CSV parsing happened about as quickly as can be expected for this format, clocking in at a network total of a bit under 1 GB/s.

Pandas is well loved because it removes all of these little hurdles from the life of the analyst. If we tried to reinvent a new “Big-Data-Frame” we would have to reimplement all of the work already well done inside of Pandas. Instead, dask.dataframe just coordinates and reuses the code within the Pandas library. It is successful largely due to work from core Pandas developers, notably Masaaki Horikoshi (@sinhrks), who have done tremendous work to align the API precisely with the Pandas core library.

Analyze Tips and Payment Types

In an effort to demonstrate the abilities of dask.dataframe we ask a simple question of our data, “how do New Yorkers tip?”. The 2015 NYCTaxi data is quite good about breaking down the total cost of each ride into the fare amount, tip amount, and various taxes and fees. In particular this lets us measure the percentage that each rider decided to pay in tip.

>>> nyc2015[['fare_amount', 'tip_amount', 'payment_type']].head()

   fare_amount  tip_amount  payment_type
0         12.0        3.25             1
1         14.5        2.00             1
2          9.5        0.00             2
3          3.5        0.00             2
4         15.0        0.00             2

In the first two lines we see evidence supporting the 15-20% tip standard common in the US. The following three lines interestingly show zero tip. Judging only by these first five lines (a very small sample) we see a strong correlation here with the payment type. We analyze this a bit more by counting occurrences in the payment_type column both for the full dataset, and filtered by zero tip:

>>> %time nyc2015.payment_type.value_counts().compute()
CPU times: user 132 ms, sys: 0 ns, total: 132 ms
Wall time: 558 ms
1    91574644
2    53864648
3      503070
4      170599
5          28
Name: payment_type, dtype: int64

>>> %time nyc2015[nyc2015.tip_amount == 0].payment_type.value_counts().compute()
CPU times: user 212 ms, sys: 4 ms, total: 216 ms
Wall time: 1.69 s
2    53862557
1     3365668
3      502025
4      170234
5          26
Name: payment_type, dtype: int64

We find that almost all zero-tip rides correspond to payment type 2, and that almost all payment type 2 rides don’t tip. My un-scientific hypothesis here is that payment type 2 corresponds to cash fares and that we’re observing a tendency of drivers not to record cash tips. However we would need more domain knowledge about our data to actually make this claim with any degree of authority.

Analyze Tips Fractions

Let’s make a new column, tip_fraction, and then look at the average of this column grouped by day of week and grouped by hour of day.

First, we need to filter out bad rows, both rows with this odd payment type, and rows with zero fare (there are a surprising number of free cab rides in NYC.) Second we create a new column equal to the ratio of tip_amount / fare_amount.

>>> df = nyc2015[(nyc2015.fare_amount > 0) & (nyc2015.payment_type != 2)]
>>> df = df.assign(tip_fraction=(df.tip_amount / df.fare_amount))

Next we choose to groupby the pickup datetime column in order to see how the average tip fraction changes by day of week and by hour. The groupby and datetime handling of Pandas makes these operations trivial.

>>> dayofweek = df.groupby(df.tpep_pickup_datetime.dt.dayofweek).tip_fraction.mean()
>>> hour = df.groupby(df.tpep_pickup_datetime.dt.hour).tip_fraction.mean()

>>> dayofweek, hour = e.persist([dayofweek, hour])
>>> progress(dayofweek, hour)

Grouping by day-of-week doesn’t show anything too striking to my eye. However I would like to note at how generous NYC cab riders seem to be. A 23-25% tip can be quite nice:

>>> dayofweek.compute()
tpep_pickup_datetime
0    0.237510
1    0.236494
2    0.236073
3    0.246007
4    0.242081
5    0.232415
6    0.259974
Name: tip_fraction, dtype: float64

But grouping by hour shows that late night and early morning riders are more likely to tip extravagantly:

>>> hour.compute()
tpep_pickup_datetime
0     0.263602
1     0.278828
2     0.293536
3     0.276784
4     0.348649
5     0.248618
6     0.233257
7     0.216003
8     0.221508
9     0.217018
10    0.225618
11    0.231396
12    0.225186
13    0.235662
14    0.237636
15    0.228832
16    0.234086
17    0.240635
18    0.237488
19    0.272792
20    0.235866
21    0.242157
22    0.243244
23    0.244586
Name: tip_fraction, dtype: float64

We plot this with matplotlib and see a nice trough during business hours and a surge in the early morning, with an astonishing peak of 34% at 4am:
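
The figure itself isn’t reproduced here, but a quick sketch of how such a plot can be generated locally, assuming the hour series computed above:

import matplotlib.pyplot as plt

tips_by_hour = hour.compute()          # small Pandas Series, safe to bring local
tips_by_hour.plot(kind='line', figsize=(10, 4))
plt.xlabel('Hour of day')
plt.ylabel('Mean tip fraction')
plt.title('NYC taxi tip fraction by pickup hour (2015)')
plt.show()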

Performance

Let’s dive into a few operations that run at different time scales. This gives a good understanding of the strengths and limits of the scheduler.

>>> %time nyc2015.head()
CPU times: user 4 ms, sys: 0 ns, total: 4 ms
Wall time: 20.9 ms

This head computation is about as fast as a film projector. You could perform this roundtrip computation between every consecutive frame of a movie; to a human eye this appears fluid. In the last post we asked about how low we could bring latency. In that post we were running computations from my laptop in California and so were bound by transcontinental latencies of 200ms. This time, because we’re operating from the cluster, we can get down to 20ms. We’re only able to be this fast because we touch only a single data element, the first partition. Things change when we need to touch the entire dataset.

>>> %time len(nyc2015)
CPU times: user 48 ms, sys: 0 ns, total: 48 ms
Wall time: 271 ms

The length computation takes 200-300 ms. This computation takes longer because we touch every individual partition of the data, of which there are 178. The scheduler incurs about 1ms of overhead per task, add a bit of latency and you get the ~200ms total. This means that the scheduler will likely be the bottleneck whenever computations are very fast, such as is the case for computing len. Really, this is good news; it means that by improving the scheduler we can reduce these durations even further.

If you look at the groupby computations above you can add the numbers in the progress bars to show that we computed around 3000 tasks in around 7s. It looks like this computation is about half scheduler overhead and about half bound by actual computation.

Conclusion

We used dask+distributed on a cluster to read CSV data from HDFS into a dask dataframe. We then used dask.dataframe, which looks identical to the Pandas dataframe, to manipulate our distributed dataset intuitively and efficiently.

We looked a bit at the performance characteristics of simple computations.

What doesn’t work

As always I’ll have a section like this that honestly says what doesn’t work well and what I would have done with more time.

  • Dask dataframe implements a commonly used subset of Pandas functionality, not all of it. It’s surprisingly hard to communicate the exact bounds of this subset to users. Notably, in the distributed setting we don’t have a shuffle algorithm, so groupby(...).apply(...) and some joins are not yet possible.

  • If you want to use threads, you’ll need Pandas 0.18.0 which, at the time of this writing, was still in release candidate stage. This Pandas release fixes some important GIL related issues.

  • The 1ms overhead per task limit is significant. While we can still scale out to clusters far larger than what we have here, we probably won’t be able to strongly accelerate very quick operations until we reduce this number.

  • We use the hdfs3 library to read data from HDFS. This library seems to work great but is new and could use more active users to flush out bug reports.

Setup and Data

You can obtain public data from the New York City Taxi and Limousine Commission here. I downloaded this onto the head node and dumped it into HDFS with commands like the following:

$ wget https://storage.googleapis.com/tlc-trip-data/2015/yellow_tripdata_2015-{01..12}.csv
$ hdfs dfs -mkdir /nyctaxi
$ hdfs dfs -mkdir /nyctaxi/2015
$ hdfs dfs -put yellow*.csv /nyctaxi/2015/

The cluster was hosted on EC2 and was comprised of nine m3.2xlarges with 8 cores and 30GB of RAM each. Eight of these nodes were used as workers; they used processes for parallelism, not threads.

Distributed Dask Arrays


This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project

In this post we analyze weather data across a cluster using NumPy in parallel with dask.array. We focus on the following:

  1. How to set up the distributed scheduler with a job scheduler like Sun GridEngine.
  2. How to load NetCDF data from a network file system (NFS) into distributed RAM
  3. How to manipulate data with dask.arrays
  4. How to interact with distributed data using IPython widgets

This blogpost has an accompanying screencast which might be a bit more fun than this text version.

This is the third in a sequence of blogposts about dask.distributed:

  1. Dask Bags on GitHub Data
  2. Dask DataFrames on HDFS
  3. Dask Arrays on NetCDF data

Setup

We wanted to emulate the typical academic cluster setup using a job scheduler like SunGridEngine (similar to SLURM, Torque, PBS scripts and other technologies), a shared network file system, and typical binary stored arrays in NetCDF files (similar to HDF5).

To this end we used Starcluster, a quick way to set up such a cluster on EC2 with SGE and NFS, and we downloaded data from the European Centre for Medium-Range Weather Forecasts (ECMWF).

To deploy dask’s distributed scheduler with SGE we made a scheduler on the master node:

sgeadmin@master:~$ dscheduler
distributed.scheduler - INFO - Start Scheduler at:  172.31.7.88:8786

And then used the qsub command to start four dask workers, pointing to the scheduler address:

sgeadmin@master:~$ qsub -b y -V dworker 172.31.7.88:8786
Your job 1 ("dworker") has been submitted
sgeadmin@master:~$ qsub -b y -V dworker 172.31.7.88:8786
Your job 2 ("dworker") has been submitted
sgeadmin@master:~$ qsub -b y -V dworker 172.31.7.88:8786
Your job 3 ("dworker") has been submitted
sgeadmin@master:~$ qsub -b y -V dworker 172.31.7.88:8786
Your job 4 ("dworker") has been submitted

After a few seconds these workers start on various nodes in the cluster and connect to the scheduler.

Load sample data on a single machine

On the shared NFS drive we’ve downloaded several NetCDF3 files, each holding the global temperature every six hours for a single day:

>>> from glob import glob
>>> filenames = sorted(glob('*.nc3'))
>>> filenames[:5]
['2014-01-01.nc3',
 '2014-01-02.nc3',
 '2014-01-03.nc3',
 '2014-01-04.nc3',
 '2014-01-05.nc3']

We use conda to install the netCDF4 library and make a small function to read the t2m variable for “temperature at two meters elevation” from a single filename:

$ conda install netcdf4
import netCDF4

def load_temperature(fn):
    with netCDF4.Dataset(fn) as f:
        return f.variables['t2m'][:]

This converts a single file into a single numpy array in memory. We could call this on an individual file locally as follows:

>>> load_temperature(filenames[0])
array([[[ 253.96238624,  253.96238624,  253.96238624, ...,
          253.96238624,  253.96238624,  253.96238624],
        [ 252.80590921,  252.81070124,  252.81389593, ...,
          252.79792249,  252.80111718,  252.80271452],
        ...

>>> load_temperature(filenames[0]).shape
(4, 721, 1440)

Our dataset has dimensions of (time, latitude, longitude). Note above that each day has four time entries (measurements every six hours).

The NFS set up by Starcluster is unfortunately quite small. We were only able to fit around five months of data (136 days) on the shared disk.

Load data across cluster

We want to call the load_temperature function on our list filenames on each of our four workers. We connect a dask Executor to our scheduler address and then map our function on our filenames:

>>> from distributed import Executor, progress
>>> e = Executor('172.31.7.88:8786')
>>> e
<Executor: scheduler=172.31.7.88:8786 workers=4 threads=32>

>>> futures = e.map(load_temperature, filenames)
>>> progress(futures)

After this completes we have several numpy arrays scattered about the memory of each of our four workers.

Coordinate with dask.array

We coordinate these many numpy arrays into a single logical dask array as follows:

>>> from distributed.collections import futures_to_dask_arrays
>>> xs = futures_to_dask_arrays(futures)  # many small dask arrays

>>> import dask.array as da
>>> x = da.concatenate(xs, axis=0)        # one large dask array, joined by time
>>> x
dask.array<concate..., shape=(544, 721, 1440), dtype=float64, chunksize=(4, 721, 1440)>

This single logical dask array is comprised of 136 numpy arrays spread across our cluster. Operations on the single dask array will trigger many operations on each of our numpy arrays.
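
As a rough sanity check on size, the shape and dtype above imply a dataset of about 4.5 GB in distributed RAM. This is back-of-the-envelope arithmetic on the numbers reported above, not a measurement:

>>> 544 * 721 * 1440 * 8 / 1e9   # time * latitude * longitude * 8 bytes per float64
4.51842048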

Interact with Distributed Data

We can now interact with our dataset using standard NumPy syntax and other PyData libraries. Below we pull out a single time slice and render it to the screen with matplotlib.

from matplotlib import pyplot as plt
plt.imshow(x[100, :, :].compute(), cmap='viridis')
plt.colorbar()

In the screencast version of this post we hook this up to an IPython slider widget and scroll around time, which is fun.

Speed

We benchmark a few representative operations to look at the strengths and weaknesses of the distributed system.

Single element

This single element computation accesses a single number from a single NumPy array of our dataset. It is bound by a network roundtrip from client to scheduler, to worker, and back.

>>> %time x[0, 0, 0].compute()
CPU times: user 4 ms, sys: 0 ns, total: 4 ms
Wall time: 9.72 ms

Single time slice

This time slice computation pulls around 8 MB from a single NumPy array on a single worker. It is likely bound by network bandwidth.

>>> %time x[0].compute()
CPU times: user 24 ms, sys: 24 ms, total: 48 ms
Wall time: 274 ms

Mean computation

This mean computation touches every number in every NumPy array across all of our workers. Computing means is quite fast, so this is likely bound by scheduler overhead.

>>> %time x.mean().compute()
CPU times: user 88 ms, sys: 0 ns, total: 88 ms
Wall time: 422 ms

Interactive Widgets

To make these times feel more visceral we hook up these computations to IPython Widgets.

This first example looks fairly fluid. This only touches a single worker and returns a small result. It is cheap because it indexes in a way that is well aligned with how our NumPy arrays are split up by time.

@interact(time=[0, x.shape[0] - 1])
def f(time):
    return x[time, :, :].mean().compute()

This second example is less fluid because we index across our NumPy chunks. Each computation touches all of our data. It’s still not bad though and quite acceptable by today’s standards of interactive distributed data science.

@interact(lat=[0, x.shape[1] - 1])
def f(lat):
    return x[:, lat, :].mean().compute()

Normalize Data

Until now we’ve only performed simple calculations on our data, usually grabbing out means. The image of the temperature above looks unsurprising. The image is dominated by the facts that land is warmer than oceans and that the equator is warmer than the poles. No surprises there.

To make things more interesting we subtract off the mean and divide by the standard deviation over time. This will tell us how unexpectedly hot or cold a particular point was, relative to all measurements of that point over time. This gives us something like a geo-located Z-Score.

z = (x - x.mean(axis=0)) / x.std(axis=0)
z = e.persist(z)
progress(z)

plt.imshow(z[slice].compute(), cmap='RdBu_r')
plt.colorbar()

We can now see much more fine structure of the currents of the day. In the screencast version we hook this dataset up to a slider as well and inspect various times.

I’ve avoided displaying GIFs of full images changing in this post to keep the size down, however we can easily render a plot of average temperature by latitude changing over time here:

import numpy as np

xrange = 90 - np.arange(z.shape[1]) / 4

@interact(time=[0, z.shape[0] - 1])
def f(time):
    plt.figure(figsize=(10, 4))
    plt.plot(xrange, z[time].mean(axis=1).compute())
    plt.ylabel("Normalized Temperature")
    plt.xlabel("Latitude (degrees)")

Conclusion

We showed how to use distributed dask.arrays on a typical academic cluster. I’ve had several conversations with different groups about this topic; it seems to be a common case. I hope that the instructions at the beginning of this post prove to be helpful to others.

It is really satisfying to me to couple interactive widgets with data on a cluster in an intuitive way. This sort of fluid interaction on larger datasets is a core problem in modern data science.

What didn’t work

As always I’ll include a section like this on what didn’t work well or what I would have done with more time:

  • No high-level read_netcdf function: We had to use the mid-level API of executor.map to construct our dask array. This is a bit of a pain for novice users. We should probably adapt existing high-level functions in dask.array to robustly handle the distributed data case.
  • Need a larger problem: Our dataset could have fit into a Macbook Pro. A larger dataset that could not have been efficiently investigated from a single machine would have really cemented the need for this technology.
  • Easier deployment: The solution above with qsub was straightforward but not always accessible to novice users. Additionally while SGE is common there are several other systems that are just as common. We need to think of nice ways to automate this for the user.
  • XArray integration: Many people use dask.array on single machines through XArray, an excellent library for the analysis of labeled nd-arrays especially common in climate science. It would be good to integrate this new distributed work into the XArray project. I suspect that doing this mostly involves handling the data ingest problem described above.
  • Reduction speed: The computation of normalized temperature, z, took a surprisingly long time. I’d like to look into what is holding up that computation.

Fast Message Serialization


This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project

Very high performance isn’t about doing one thing well, it’s about doing nothing poorly.

This week I optimized the inter-node communication protocol used by dask.distributed. It was a fun exercise in optimization that involved several different and unexpected components. I separately had to deal with Pickle, NumPy, Tornado, MsgPack, and compression libraries.

This blogpost is not advertising any particular functionality, rather it’s a story of the problems I ran into when designing and optimizing a protocol to quickly send both very small and very large numeric data between machines on the Python stack.

We care very strongly about both the many small messages case (thousands of 100 byte messages per second) and the very large messages case (100-1000 MB). This spans an interesting range of performance space. We end up with a protocol that costs around 5 microseconds in the small case and operates at 1-1.5 GB/s in the large case.

Identify a Problem

This came about as I was preparing a demo using dask.array on a distributed cluster for a Continuum webinar. I noticed that my computations were taking much longer than expected. The Web UI quickly pointed me to the fact that my machines were spending 10-20 seconds moving 30 MB chunks of numpy array data between them. This is very strange because I was on 100MB/s network, and so I expected these transfers to happen in more like 0.3s than 15s.

The Web UI made this glaringly apparent, so my first lesson was how valuable visual profiling tools can be when they make performance issues glaringly obvious. Thanks here goes to the Bokeh developers who helped the development of the Dask real-time Web UI.

Problem 1: Tornado’s sentinels

Dask’s networking is built off of Tornado’s TCP IOStreams.

There are two common ways to delineate messages on a socket, sentinel values that signal the end of a message, and prefixing a length before every message. Early on we tried both in Dask but found that prefixing a length before every message was slow. It turns out that this was because TCP sockets try to batch small messages to increase bandwidth. Turning this optimization off ended up being an effective and easy solution, see the TCP_NODELAY parameter.

However, before we figured that out we used sentinels for a long time. Unfortunately Tornado does not handle sentinels well for large messages. At the receipt of every new message it reads through all buffered data to see if it can find the sentinel. This makes lots and lots of copies and reads through lots and lots of bytes. This isn’t a problem if your messages are a few kilobytes, as is common in web development, but it’s terrible if your messages are millions or billions of bytes long.

Switching back to prefixing messages with lengths and turning off the no-delay optimization moved our bandwidth up from 3MB/s to 20MB/s per node. Thanks goes to Ben Darnell (main Tornado developer) for helping us to track this down.

Problem 2: Memory Copies

A nice machine can copy memory at 5 GB/s. If your network is only 100 MB/s then you can easily suffer several memory copies in your system without caring. This leads to code that looks like the following:

socket.send(header + payload)

This code concatenates two bytestrings, header and payload before sending the result down a socket. If we cared deeply about avoiding memory copies then we might instead send these two separately:

socket.send(header)
socket.send(payload)

But who cares, right? At 5 GB/s copying memory is cheap!

Unfortunately this breaks down under either of the following conditions:

  1. You are sloppy enough to do this multiple times
  2. You find yourself on a machine with surprisingly low memory bandwidth, like 10 times slower, as is the case on some EC2 machines.

Both of these were true for me, but fortunately it's usually straightforward, with moderate effort, to reduce the number of copies to a small number (we got down to three).
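
One way to avoid the concatenation copy on POSIX systems is to hand the kernel several buffers in a single call. This is a sketch of that idea using only the standard library; it is not what Dask actually does:

import socket

def send_frames(sock, header, payload):
    # sendmsg writes several buffers in one system call, so we never
    # materialize the concatenated header + payload bytestring
    sock.sendmsg([header, payload])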

Problem 3: Unwanted Compression

Dask compresses all large messages with LZ4 or Snappy if they're available. Unfortunately, if your data isn't very compressible then this is mostly lost time. Doubly unfortunate is that you also have to decompress the data on the recipient side. Decompressing not-very-compressible data was surprisingly slow.

Now we compress with the following policy:

  1. If the message is less than 10kB, don’t bother
  2. Pick out five 10kB samples of the data and compress those. If the result isn’t well compressed then don’t bother compressing the full payload.
  3. Compress the full payload, if it doesn’t compress well then just send along the original to spare the receiver’s side from compressing.

In this case we use cheap checks to guard against unwanted compression. We also avoid any cost at all for small messages, which we care about deeply.
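
Here is a minimal sketch of such a sampling heuristic. It uses zlib from the standard library to stay self-contained (Dask itself prefers LZ4 or Snappy), and the thresholds are illustrative rather than Dask's exact values:

import random
import zlib

def maybe_compress(payload, sample_size=10000, nsamples=5, min_ratio=0.9):
    # 1. Small messages: never bother
    if len(payload) < 10000:
        return False, payload

    # 2. Cheap check: compress a few small samples first
    starts = [random.randint(0, len(payload) - sample_size) for _ in range(nsamples)]
    samples = [payload[start:start + sample_size] for start in starts]
    if sum(len(zlib.compress(s)) for s in samples) > min_ratio * nsamples * sample_size:
        return False, payload          # samples didn't compress well; skip

    # 3. Compress the full payload, but fall back if the gain is marginal
    compressed = zlib.compress(payload)
    if len(compressed) > min_ratio * len(payload):
        return False, payload          # spare the receiver a pointless decompress
    return True, compressed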

Problem 4: Cloudpickle is not as fast as Pickle

This was surprising, because cloudpickle mostly defers to Pickle for the easy stuff, like NumPy arrays.

In [1]: import numpy as np

In [2]: data = np.random.randint(0, 255, dtype='u1', size=10000000)

In [3]: import pickle, cloudpickle

In [4]: %time len(pickle.dumps(data, protocol=-1))
CPU times: user 8.65 ms, sys: 8.42 ms, total: 17.1 ms
Wall time: 16.9 ms
Out[4]: 10000161

In [5]: %time len(cloudpickle.dumps(data, protocol=-1))
CPU times: user 20.6 ms, sys: 24.5 ms, total: 45.1 ms
Wall time: 44.4 ms
Out[5]: 10000161

But it turns out that cloudpickle is using the Python implementation, while pickle itself (or cPickle in Python 2) is using the compiled C implementation. Fortunately this is easy to correct, and a quick typecheck on common large data formats in Python (NumPy and Pandas) gets us this speed boost.
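
The fix amounts to a small dispatch function along the following lines. This is a sketch of the idea, not Dask's exact code:

import pickle
import cloudpickle
import numpy as np
import pandas as pd

def dumps(obj):
    # Route common large containers through the fast C pickle implementation;
    # fall back to cloudpickle for functions, closures, and other odd objects.
    if isinstance(obj, (np.ndarray, pd.DataFrame, pd.Series, pd.Index)):
        return pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    return cloudpickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)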

Problem 5: Pickle is still slower than you’d expect

Pickle runs at about half the speed of memcopy, which is what you’d expect from a protocol that is mostly just “serialize the dtype, strides, then tack on the data bytes”. There must be an extraneous memory copy in there.

See issue 7544

Problem 6: MsgPack is bad at large bytestrings

Dask serializes most messages with MsgPack, which is ordinarily very fast. Unfortunately the MsgPack spec doesn’t support bytestrings greater than 4GB (which do come up for us) and the Python implementations don’t pass through large bytestrings very efficiently. So we had to handle large bytestrings separately. Any message that contains bytestrings over 1MB in size will have them stripped out and sent along in a separate frame. This both avoids the MsgPack overhead and avoids a memory copy (we can send the bytes directly to the socket).
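
Conceptually the splitting looks something like the sketch below. The 1MB threshold matches the text above, but the header format is made up for illustration and is not the actual wire protocol:

import msgpack

BIG = 2 ** 20  # 1 MB

def dumps_with_frames(msg):
    # Pull large bytestrings out of the message and ship them as raw frames
    # so that msgpack only ever sees a small header.
    frames = []
    header = {}
    for key, value in msg.items():
        if isinstance(value, bytes) and len(value) > BIG:
            header[key] = {'frame': len(frames)}   # placeholder pointing at the frame
            frames.append(value)                   # sent as-is: no msgpack, no copy
        else:
            header[key] = value
    return [msgpack.dumps(header)] + frames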

Problem 7: Tornado makes a copy

Sockets on Windows don't accept payloads greater than 128kB in size. As a result Tornado chops up large messages into many small ones. On Linux this memory copy is extraneous. It can be removed with a bit of logic within Tornado. I might do this in the moderately near future.

Results

We serialize small messages in about 5 microseconds (thanks msgpack!) and move large bytes around in the cost of three memory copies (about 1-1.5 GB/s) which is generally faster than most networks in use.

Here is a profile of sending and receiving a gigabyte-sized NumPy array of random values through to the same process over localhost (500 MB/s on my machine.)

         381360 function calls (381323 primitive calls) in 1.451 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.366    0.366    0.366    0.366 {built-in method dumps}
        8    0.289    0.036    0.291    0.036 iostream.py:360(write)
    15353    0.228    0.000    0.228    0.000 {method 'join' of 'bytes' objects}
    15355    0.166    0.000    0.166    0.000 {method 'recv' of '_socket.socket' objects}
    15362    0.156    0.000    0.398    0.000 iostream.py:1510(_merge_prefix)
     7759    0.101    0.000    0.101    0.000 {method 'send' of '_socket.socket' objects}
    17/14    0.026    0.002    0.686    0.049 gen.py:990(run)
    15355    0.021    0.000    0.198    0.000 iostream.py:721(_read_to_buffer)
        8    0.018    0.002    0.203    0.025 iostream.py:876(_consume)
       91    0.017    0.000    0.335    0.004 iostream.py:827(_handle_write)
       89    0.015    0.000    0.217    0.002 iostream.py:585(_read_to_buffer_loop)
   122567    0.009    0.000    0.009    0.000 {built-in method len}
    15355    0.008    0.000    0.173    0.000 iostream.py:1010(read_from_fd)
    38369    0.004    0.000    0.004    0.000 {method 'append' of 'list' objects}
     7759    0.004    0.000    0.104    0.000 iostream.py:1023(write_to_fd)
        1    0.003    0.003    1.451    1.451 ioloop.py:746(start)

Dominant unwanted costs include the following:

  1. 400ms: Pickling the NumPy array
  2. 400ms: Bytestring handling within Tornado

After this we’re just bound by pushing bytes down a wire.

Conclusion

Writing fast code isn’t about writing any one thing particularly well, it’s about mitigating everything that can get in your way. As you approch peak performance, previously minor flaws suddenly become your dominant bottleneck. Success here depends on frequent profiling and keeping your mind open to unexpected and surprising costs.


Ad Hoc Distributed Random Forests


This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project

A screencast version of this post is available here: https://www.youtube.com/watch?v=FkPlEqB8AnE

TL;DR.

Dask.distributed lets you submit individual tasks to the cluster. We use this ability combined with Scikit Learn to train and run a distributed random forest on distributed tabular NYC Taxi data.

Our machine learning model does not perform well, but we do learn how to execute ad-hoc computations easily.

Motivation

In the past few posts we analyzed data on a cluster with Dask collections:

  1. Dask.bag on JSON records
  2. Dask.dataframe on CSV data
  3. Dask.array on HDF5 data

Often our computations don’t fit neatly into the bag, dataframe, or array abstractions. In these cases we want the flexibility of normal code with for loops, but still with the computational power of a cluster. With the dask.distributed task interface, we achieve something close to this.

Application: Naive Distributed Random Forest Algorithm

As a motivating application we build a random forest algorithm from the ground up using the single-machine Scikit Learn library, and dask.distributed’s ability to quickly submit individual tasks to run on the cluster. Our algorithm will look like the following:

  1. Pull data from some external source (S3) into several dataframes on the cluster
  2. For each dataframe, create and train one RandomForestClassifier
  3. Scatter single testing dataframe to all machines
  4. For each RandomForestClassifier predict output on test dataframe
  5. Aggregate independent predictions from each classifier together by a majority vote. To avoid bringing too much data to any one machine, perform this majority vote as a tree reduction.

Data: NYC Taxi 2015

As in our blogpost on distributed dataframes we use the data on all NYC Taxi rides in 2015. This is around 20GB on disk and 60GB in RAM.

We predict the number of passengers in each cab given the other numeric columns like pickup and destination location, fare breakdown, distance, etc..

We do this first on a small bit of data on a single machine and then on the entire dataset on the cluster. Our cluster is composed of twelve m4.xlarges (4 cores, 15GB RAM each).

Disclaimer and Spoiler Alert: I am not an expert in machine learning. Our algorithm will perform very poorly. If you’re excited about machine learning you can stop reading here. However, if you’re interested in how to build distributed algorithms with Dask then you may want to read on, especially if you happen to know enough machine learning to improve upon my naive solution.

API: submit, map, gather

We use a small number of dask.distributed functions to build our computation:

futures = executor.scatter(data)                      # scatter data
future = executor.submit(function, *args, **kwargs)   # submit single task
futures = executor.map(function, sequence)            # submit many tasks
results = executor.gather(futures)                    # gather results
executor.replicate(futures, n=number_of_replications)

In particular, functions like executor.submit(function, *args) let us send individual functions out to our cluster thousands of times a second. Because these functions consume their own results we can create complex workflows that stay entirely on the cluster and trust the distributed scheduler to move data around intelligently.

Load Pandas from S3

First we load data from Amazon S3. We use the s3.read_csv(..., collection=False) function to load 178 Pandas DataFrames on our cluster from CSV data on S3. We get back a list of Future objects that refer to these remote dataframes. The use of collection=False gives us this list of futures rather than a single cohesive Dask.dataframe object.

from distributed import Executor, s3

e = Executor('52.91.1.177:8786')

dfs = s3.read_csv('dask-data/nyc-taxi/2015',
                  parse_dates=['tpep_pickup_datetime', 'tpep_dropoff_datetime'],
                  collection=False)
dfs = e.compute(dfs)

Each of these is a lightweight Future pointing to a pandas.DataFrame on the cluster.

>>> dfs[:5]
[<Future: status: finished, type: DataFrame, key: finalize-a06c3dd25769f434978fa27d5a4cf24b>,
 <Future: status: finished, type: DataFrame, key: finalize-7dcb27364a8701f45cb02d2fe034728a>,
 <Future: status: finished, type: DataFrame, key: finalize-b0dfe075000bd59c3a90bfdf89a990da>,
 <Future: status: finished, type: DataFrame, key: finalize-1c9bb25cefa1b892fac9b48c0aef7e04>,
 <Future: status: finished, type: DataFrame, key: finalize-c8254256b09ae287badca3cf6d9e3142>]

If we’re willing to wait a bit then we can pull data from any future back to our local process using the .result() method. We don’t want to do this too much though, data transfer can be expensive and we can’t hold the entire dataset in the memory of a single machine. Here we just bring back one of the dataframes:

>>> df = dfs[0].result()
>>> df.head()
   VendorID tpep_pickup_datetime tpep_dropoff_datetime  passenger_count  trip_distance  pickup_longitude  pickup_latitude  RateCodeID store_and_fwd_flag  dropoff_longitude  dropoff_latitude  payment_type  fare_amount  extra  mta_tax  tip_amount  tolls_amount  improvement_surcharge  total_amount
0         2  2015-01-15 19:05:39   2015-01-15 19:23:42                1           1.59        -73.993896        40.750111           1                  N         -73.974785         40.750618             1         12.0    1.0      0.5        3.25             0                    0.3         17.05
1         1  2015-01-10 20:33:38   2015-01-10 20:53:28                1           3.30        -74.001648        40.724243           1                  N         -73.994415         40.759109             1         14.5    0.5      0.5        2.00             0                    0.3         17.80
2         1  2015-01-10 20:33:38   2015-01-10 20:43:41                1           1.80        -73.963341        40.802788           1                  N         -73.951820         40.824413             2          9.5    0.5      0.5        0.00             0                    0.3         10.80
3         1  2015-01-10 20:33:39   2015-01-10 20:35:31                1           0.50        -74.009087        40.713818           1                  N         -74.004326         40.719986             2          3.5    0.5      0.5        0.00             0                    0.3          4.80
4         1  2015-01-10 20:33:39   2015-01-10 20:52:58                1           3.00        -73.971176        40.762428           1                  N         -74.004181         40.742653             2         15.0    0.5      0.5        0.00             0                    0.3         16.30

Train on a single machine

To start lets go through the standard Scikit Learn fit/predict/score cycle with this small bit of data on a single machine.

from sklearn.ensemble import RandomForestClassifier
from sklearn.cross_validation import train_test_split

df_train, df_test = train_test_split(df)

columns = ['trip_distance', 'pickup_longitude', 'pickup_latitude',
           'dropoff_longitude', 'dropoff_latitude', 'payment_type',
           'fare_amount', 'mta_tax', 'tip_amount', 'tolls_amount']

est = RandomForestClassifier(n_estimators=4)
est.fit(df_train[columns], df_train.passenger_count)

This builds a RandomForestClassifier with four decision trees and then trains it against the numeric columns in the data, trying to predict the passenger_count column. It takes around 10 seconds to train on a single core. We now see how well we do on the holdout testing data:

>>> est.score(df_test[columns], df_test.passenger_count)
0.65808188654721012

This 65% accuracy is actually pretty poor. About 70% of the rides in NYC have a single passenger, so the model of “always guess one” would out-perform our fancy random forest.

>>> from sklearn.metrics import accuracy_score
>>> import numpy as np
>>> accuracy_score(df_test.passenger_count,
...                np.ones_like(df_test.passenger_count))
0.70669390028780987

This is where my ignorance in machine learning really kills us. There is likely a simple way to improve this. However, because I’m more interested in showing how to build distributed computations with Dask than in actually doing machine learning I’m going to go ahead with this naive approach. Spoiler alert: we’re going to do a lot of computation and still not beat the “always guess one” strategy.

Fit across the cluster with executor.map

First we build a function that does just what we did before, builds a random forest and then trains it on a dataframe.

def fit(df):
    est = RandomForestClassifier(n_estimators=4)
    est.fit(df[columns], df.passenger_count)
    return est

Second we call this function on all of our training dataframes on the cluster using the standard e.map(function, sequence) function. This sends out many small tasks for the cluster to run. We use all but the last dataframe for training data and hold out the last dataframe for testing. There are more principled ways to do this, but again we’re going to charge ahead here.

train = dfs[:-1]
test = dfs[-1]

estimators = e.map(fit, train)

This takes around two minutes to train on all of the 177 dataframes and now we have 177 independent estimators, each capable of guessing how many passengers a particular ride had. There is relatively little overhead in this computation.

Predict on testing data

Recall that we kept separate a future, test, that points to a Pandas dataframe on the cluster that was not used to train any of our 177 estimators. We’re going to replicate this dataframe across all workers on the cluster and then ask each estimator to predict the number of passengers for each ride in this dataset.

e.replicate([test], n=48)

def predict(est, X):
    return est.predict(X[columns])

predictions = [e.submit(predict, est, test) for est in estimators]

Here we used the executor.submit(function, *args, **kwargs) function in a list comprehension to individually launch many tasks. The scheduler determines when and where to run these tasks for optimal computation time and minimal data transfer. As with all functions, this returns futures that we can use to collect data if we want in the future.

Developers note: we explicitly replicate here in order to take advantage of efficient tree-broadcasting algorithms. This is purely a performance consideration, everything would have worked fine without this, but the explicit broadcast turns a 30s communication+computation into a 2s communication+computation.

Aggregate predictions by majority vote

For each estimator we now have an independent prediction of the passenger counts for all of the rides in our test data. In other words for each ride we have 177 different opinions on how many passengers were in the cab. By averaging these opinions together we hope to achieve a more accurate consensus opinion.

For example, consider the first four prediction arrays:

>>> a_few_predictions = e.gather(predictions[:4])  # remote futures -> local arrays
>>> a_few_predictions
[array([1, 2, 1, ..., 2, 2, 1]),
 array([1, 1, 1, ..., 1, 1, 1]),
 array([2, 1, 1, ..., 1, 1, 1]),
 array([1, 1, 1, ..., 1, 1, 1])]

For the first ride/column we see that three of the four predictions are for a single passenger while one prediction disagrees and is for two passengers. We create a consensus opinion by taking the mode of the stacked arrays:

from scipy.stats import mode
import numpy as np

def mymode(*arrays):
    array = np.stack(arrays, axis=0)
    return mode(array)[0][0]

>>> mymode(*a_few_predictions)
array([1, 1, 1, ..., 1, 1, 1])

And so when we combine these four prediction arrays by taking their mode we see that the majority opinion of one passenger dominates for all six of the rides visible here.

Tree Reduction

We could call our mymode function on all of our predictions like this:

>>> mode_prediction = e.submit(mymode, *predictions)  # this doesn't scale well

Unfortunately this would move all of our results to a single machine to compute the mode there. This might swamp that single machine.

Instead we batch our predictions into groups of size 10, average each group, and then repeat the process with the smaller set of predictions until we have only one left. This sort of multi-step reduction is called a tree reduction. We can write it up with a couple nested loops and executor.submit. This is only an approximation of the mode, but it’s a much more scalable computation. This finishes in about 1.5 seconds.

from toolz import partition_all

while len(predictions) > 1:
    predictions = [e.submit(mymode, *chunk)
                   for chunk in partition_all(10, predictions)]

result = e.gather(predictions)[0]

>>> result
array([1, 1, 1, ..., 1, 1, 1])

Final Score

Finally, after completing all of our work on our cluster we can see how well our distributed random forest algorithm does.

>>> accuracy_score(result, test.result().passenger_count)
0.67061974451423045

Still worse than the naive “always guess one” strategy. This just goes to show that, no matter how sophisticated your Big Data solution is, there is no substitute for common sense and a little bit of domain expertise.

What didn’t work

As always I’ll have a section like this that honestly says what doesn’t work well and what I would have done with more time.

  • Clearly this would have benefited from more machine learning knowledge. What would have been a good approach for this problem?
  • I’ve been thinking a bit about memory management of replicated data on the cluster. In this exercise we specifically replicated out the test data. Everything would have worked fine without this step but it would have been much slower as every worker gathered data from the single worker that originally had the test dataframe. Replicating data is great until you start filling up distributed RAM. It will be interesting to think of policies about when to start cleaning up redundant data and when to keep it around.
  • Several people from both open source users and Continuum customers have asked about a general Dask library for machine learning, something akin to Spark’s MLlib. Ideally a future Dask.learn module would leverage Scikit-Learn in the same way that Dask.dataframe leverages Pandas. It’s not clear how to cleanly break up and parallelize Scikit-Learn algorithms.

Conclusion

This blogpost gives a concrete example using basic task submission with executor.map and executor.submit to build a non-trivial computation. This approach is straightforward and not restrictive. Personally this interface excites me more than collections like Dask.dataframe; there is a lot of freedom in arbitrary task submission.

Dask for Institutions


This work is supported by Continuum Analytics

Introduction

Institutions use software differently than individuals. Over the last few months I’ve had dozens of conversations about using Dask within larger organizations like universities, research labs, private companies, and non-profit learning systems. This post provides a very coarse summary of those conversations and extracts common questions. I’ll then try to answer those questions.

Note: some of this post will be necessarily vague at points. Some companies prefer privacy. All details here are either in public Dask issues or have come up with enough institutions (say at least five) that I’m comfortable listing the problem here.

Common story

Institution X, a university/research lab/company/… has many scientists/analysts/modelers who develop models and analyze data with Python, the PyData stack like NumPy/Pandas/SKLearn, and a large amount of custom code. These models/data sometimes grow to be large enough to need a moderately large amount of parallel computing.

Fortunately, Institution X has an in-house cluster acquired for exactly this purpose of accelerating modeling and analysis of large computations and datasets. Users can submit jobs to the cluster using a job scheduler like SGE/LSF/Mesos/Other.

However the cluster is still under-utilized and the users are still asking for help with parallel computing. Either users aren’t comfortable using the SGE/LSF/Mesos/Other interface, it doesn’t support sufficiently complex/dynamic workloads, or the interaction times aren’t good enough for the interactive use that users appreciate.

There was an internal effort to build a more complex/interactive/Pythonic system on top of SGE/LSF/Mesos/Other but it’s not particularly mature and definitely isn’t something that Institution X wants to pursue. It turned out to be a harder problem than expected to design/build/maintain such a system in-house. They’d love to find an open source solution that was well featured and maintained by a community.

The Dask.distributed scheduler looks like it’s 90% of the system that Institution X needs. However there are a few open questions:

  • How do we integrate dask.distributed with the SGE/LSF/Mesos/Other job scheduler?
  • How can we grow and shrink the cluster dynamically based on use?
  • How do users manage software environments on the workers?
  • How secure is the distributed scheduler?
  • Dask is resilient to worker failure, how about scheduler failure?
  • What happens if dask-workers are in two different data centers? Can we scale in an asymmetric way?
  • How do we handle multiple concurrent users and priorities?
  • How does this compare with Spark?

So for the rest of this post I'm going to answer these questions. As usual, few of the answers will be of the form "Yes, Dask can solve all of your problems." These are open questions, not the ones that were easy to answer. We'll get into what's possible today and how we might solve these problems in the future.

How do we integrate dask.distributed with SGE/LSF/Mesos/Other?

It’s not difficult to deploy dask.distributed at scale within an existing cluster using a tool like SGE/LSF/Mesos/Other. In many cases there is already a researcher within the institution doing this manually by running dask-scheduler on some static node in the cluster and launching dask-worker a few hundred times with their job scheduler and a small job script.

The goal now is how to formalize this process for the individual version of SGE/LSF/Mesos/Other used within the institution while also developing and maintaining a standard Pythonic interface so that all of these tools can be maintained cheaply by Dask developers into the foreseeable future. In some cases Institution X is happy to pay for the development of a convenient “start dask on my job scheduler” tool, but they are less excited about paying to maintain it forever.

We want Python users to be able to say something like the following:

from dask.distributed import Executor, SGECluster

c = SGECluster(nworkers=200, **options)
e = Executor(c)

… and have this same interface be standardized across different job schedulers.

How can we grow and shrink the cluster dynamically based on use?

Alternatively, we could have a single dask.distributed deployment running 24/7 that scales itself up and down dynamically based on current load. Again, this is entirely possible today if you want to do it manually (you can add and remove workers on the fly) but we should add some signals to the scheduler like the following:

  • “I’m under duress, please add workers”
  • “I’ve been idling for a while, please reclaim workers”

and connect these signals to a manager that talks to the job scheduler. This removes an element of control from the users and places it in the hands of a policy that IT can tune to play more nicely with their other services on the same network.
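
Such a manager might look roughly like the following loop. Everything here is hypothetical: neither signal exists in the scheduler today, and the qsub call simply mirrors the deployment pattern shown earlier in this series:

import subprocess
import time

def manage(scheduler, min_workers=4, max_workers=200):
    while True:
        if scheduler.under_duress() and scheduler.n_workers < max_workers:
            # hypothetical signal: scale out through the job scheduler
            subprocess.call(['qsub', '-b', 'y', '-V', 'dworker', scheduler.address])
        elif scheduler.idle_for(minutes=10) and scheduler.n_workers > min_workers:
            # hypothetical signal: hand a worker back to the cluster
            scheduler.retire_worker()
        time.sleep(60)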

How do users manage software environments on the workers?

Today Dask assumes that all users and workers share the exact same software environment. There are some small tools to send updated .py and .egg files to the workers but that’s it.

Generally Dask trusts that the full software environment will be handled by something else. This might be a network file system (NFS) mount on traditional cluster setups, or it might be handled by moving docker or conda environments around by some other tool like knit for YARN deployments or something more custom. For example Continuum sells proprietary software that does this.

Getting the standard software environment setup generally isn’t such a big deal for institutions. They typically have some system in place to handle this already. Where things become interesting is when users want to use drastically different environments from the system environment, like using Python 2 vs Python 3 or installing a bleeding-edge scikit-learn version. They may also want to change the software environment many times in a single session.

The best solution I can think of here is to pass around fully downloaded conda environments using the dask.distributed network (it’s good at moving large binary blobs throughout the network) and then teaching the dask-workers to bootstrap themselves within this environment. We should be able to tear everything down and restart things within a small number of seconds. This requires some work; first to make relocatable conda binaries (which is usually fine but is not always fool-proof due to links) and then to help the dask-workers learn to bootstrap themselves.

Somewhat related, Hussain Sultan of Capital One recently contributed a dask-submit command to run scripts on the cluster: http://distributed.readthedocs.io/en/latest/submitting-applications.html

How secure is the distributed scheduler?

Dask.distributed is incredibly insecure. It allows anyone with network access to the scheduler to execute arbitrary code in an unprotected environment. Data is sent in the clear. Any malicious actor can both steal your secrets and then cripple your cluster.

This is entirely the norm however. Security is usually handled by other services that manage computational frameworks like Dask.

For example we might rely on Docker to isolate workers from destroying their surrounding environment and rely on network access controls to protect data access.

Because Dask runs on Tornado, a serious networking library and web framework, there are some things we can do easily like enabling SSL, authentication, etc.. However I hesitate to jump into providing “just a little bit of security” without going all the way for fear of providing a false sense of security. In short, I have no plans to work on this without a lot of encouragement. Even then I would strongly recommend that institutions couple Dask with tools intended for security. I believe that is common practice for distributed computational systems generally.

Dask is resilient to worker failure, how about scheduler failure?

Workers can come and go. Clients can come and go. The state in the scheduler is currently irreplaceable and no attempt is made to back it up. There are a few things you could imagine here:

  1. Backup state and recent events to some persistent storage so that state can be recovered in case of catastrophic loss
  2. Have a hot failover node that gets a copy of every action that the scheduler takes
  3. Have multiple peer schedulers operate simultaneously in a way that they can pick up slack from lost peers
  4. Have clients remember what they have submitted and resubmit when a scheduler comes back online

Option 4 is currently the most feasible and gets us most of the way there. However options 2 or 3 would probably be necessary if Dask were to ever run as critical infrastructure in a giant institution. We're not there yet.

As of recent work spurred on by Stefan van der Walt at UC Berkeley/BIDS the scheduler can now die and come back and everyone will reconnect. The state for computations in flight is entirely lost but the computational infrastructure remains intact so that people can resubmit jobs without significant loss of service.

Dask has a bit of a harder time with this topic because it offers a persistent stateful interface. This problem is much easier for distributed database projects that run ephemeral queries off of persistent storage, return the results, and then clear out state.

What happens if dask-workers are in two different data centers? Can we scale in an asymmetric way?

The short answer is no. Other than number of cores and available RAM all workers are considered equal to each other (except when the user explicitly specifies otherwise).

However this problem and problems like it have come up a lot lately. Here are a few examples of similar cases:

  1. Multiple data centers geographically distributed around the country
  2. Multiple racks within a single data center
  3. Multiple workers that have GPUs that can move data between each other easily
  4. Multiple processes on a single machine

Having some notion of hierarchical worker group membership or inter-worker preferred relationships is probably inevitable long term. As with all distributed scheduling questions the hard part isn’t deciding that this is useful, or even coming up with a sensible design, but rather figuring out how to make decisions on the sensible design that are foolproof and operate in constant time. I don’t personally see a good approach here yet but expect one to arise as more high priority use cases come in.

How do we handle multiple concurrent users and priorities?

There are several sub-questions here:

  • Can multiple users use Dask on my cluster at the same time?

Yes, either by spinning up separate scheduler/worker sets or by sharing the same set.

  • If they’re sharing the same workers then won’t they clobber each other’s data?

This is very unlikely. Dask is careful about naming tasks, so it’s very unlikely that the two users will submit conflicting computations that compute to different values but occupy the same key in memory. However if they both submit computations that overlap somewhat then the scheduler will nicely avoid recomputation. This can be very nice when you have many people doing slightly different computations on the same hardware. This works in the same way that Git works.
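
You can see the deterministic naming that makes this sharing possible from a single machine. Two separately built but identical expressions hash to the same task names (a small illustration; this relies on dask.array's content-derived names, which is how the scheduler recognizes shared work):

import dask.array as da

x = da.ones((1000, 1000), chunks=(100, 100))
y = (x + 1).sum()
z = (x + 1).sum()

# identical expressions get identical, content-derived names,
# so a shared scheduler only computes the work once
assert y.name == z.name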

  • If they’re sharing the same workers then won’t they clobber each other’s resources?

Yes, this is definitely possible. If you’re concerned about this then you should give everyone their own scheduler/workers (which is easy and standard practice). There is not currently much user management built into Dask.

How does this compare with Spark?

At an institutional level Spark seems to primarily target ETL + Database-like computations. While Dask modules like Dask.bag and Dask.dataframe can happily play in this space this doesn’t seem to be the focus of recent conversations.

Recent conversations are almost entirely around supporting interactive custom parallelism (lots of small tasks with complex dependencies between them) rather than the big Map->Filter->Groupby->Join abstractions you often find in a database or Spark. That’s not to say that these operations aren’t hugely important; there is a lot of selection bias here. The people I talk to are people for whom Spark/Databases are clearly not an appropriate fit. They are tackling problems that are way more complex, more heterogeneous, and with a broader variety of users.

I usually describe this situation with an analogy comparing “Big data” systems to human transportation mechanisms in a city. Here we go:

  • A Database is like a train: it goes between a set of well defined points with great efficiency, speed, and predictability. These are popular and profitable routes that many people travel between (e.g. business analytics). You do have to get from home to the train station on your own (ETL), but once you’re in the database/train you’re quite comfortable.
  • Spark is like an automobile: it takes you door-to-door from your home to your destination with a single tool. While this may not be as fast as the train for the long-distance portion, it can be extremely convenient to do ETL, Database work, and some machine learning all from the comfort of a single system.
  • Dask is like an all-terrain-vehicle: it takes you out of town on rough ground that hasn’t been properly explored before. This is a good match for the Python community, which typically does a lot of exploration into new approaches. You can also drive your ATV around town and you’ll be just fine, but if you want to do thousands of SQL queries then you should probably invest in a proper database or in Spark.

Again, there is a lot of selection bias here, if what you want is a database then you should probably get a database. Dask is not a database.

This is also wildly over-simplifying things. Databases like Oracle have lots of ETL and analytics tools, Spark is known to go off road, etc.. I obviously have a bias towards Dask. You really should never trust an author of a project to give a fair and unbiased view of the capabilities of the tools in the surrounding landscape.

Conclusion

That’s a rough sketch of current conversations and open problems for “How Dask might evolve to support institutional use cases.” It’s really quite surprising just how prevalent this story is among the full spectrum from universities to hedge funds.

The problems listed above are by no means halting adoption. I’m not listing the 100 or so questions that are answered with “yes, that’s already supported quite well”. Right now I’m seeing Dask being adopted by individuals and small groups within various institutions. Those individuals and small groups are pushing that interest up the stack. It’s still several months before any 1000+ person organization adopts Dask as infrastructure, but the speed at which momentum is building is quite encouraging.

I’d also like to thank the several nameless people who exercise Dask on various infrastructures at various scales on interesting problems and have reported serious bugs. These people don’t show up on the GitHub issue tracker but their utility in flushing out bugs is invaluable.

As interest in Dask grows it’s interesting to see how it will evolve. Culturally Dask has managed to simultaneously cater to both the open science crowd as well as the private-sector crowd. The project gets both financial support and open source contributions from each side. So far there hasn’t been any conflict of interest (everyone is pushing in roughly the same direction) which has been a really fruitful experience for all involved I think.

Supporting Users in Open Source


What are the social expectations of open source developers to help users understand their projects? What are the social expectations of users when asking for help?

As part of developing Dask, an open source library with growing adoption, I directly interact with users over GitHub issues for bug reports, StackOverflow for usage questions, a mailing list and live Gitter chat for community conversation. Dask is blessed with awesome users. These are researchers doing very cool work of high impact and with novel use cases. They report bugs and usage questions with such skill that it’s clear that they are Veteran Users of open source projects.

Veteran Users are Heroes

It’s not easy being a veteran user. It takes a lot of time to distill a bug down to a reproducible example, or a question into an MCVE, or to read all of the documentation to make sure that a conceptual question definitely isn’t answered in the docs. And yet this effort really shines through and it’s incredibly valuable to making open source software better. These distilled reports are arguably more important than fixing the actual bug or writing the actual documentation.

Bugs occur in the wild, in code that is half related to the developer’s library (like Pandas or Dask) and half related to the user’s application. The veteran user works hard to pull away all of their code and data, creating a gem of an example that is trivial to understand and run anywhere that still shows off the problem.

This way the veteran user can show up with their problem to the development team and say “here is something that you will quickly understand to be a problem.” On the developer side this is incredibly valuable. They learn of a relevant bug and immediately understand what’s going on, without having to download someone else’s data or understand their domain. This switches from merely convenient to strictly necessary when the developers deal with 10+ such reports a day.

Novice Users need help too

However there are a lot of novice users out there. We have all been novice users once, and even if we are veterans today we are probably still novices at something else. Knowing what to do and how to ask for help is hard. Having the guts to walk into a chat room where people will quickly see that you’re a novice is even harder. It’s like using public transit in a deeply foreign language. Respect is warranted here.

I categorize novice users into two groups:

  1. Experienced technical novices, who are very experienced in their field and technical things generally, but who don’t yet have a thorough understanding of open source culture and how to ask questions smoothly. They’re entirely capable of behaving like a veteran user if pointed in the right directions.
  2. Novice technical novices, who don’t yet have the ability to distill their problems into the digestible nuggets that open source developers expect.

In the first case of technically experienced novices, I’ve found that being direct works surprisingly well. I used to be apologetic in asking people to submit MCVEs. Today I’m more blunt but surprisingly I find that this group doesn’t seem to mind. I suspect that this group is accustomed to operating in situations where other people’s time is very costly.

The second case of novice novice users are more challenging for individual developers to handle one-by-one, both because novices are more common, and because solving their problems often requires more time commitment. Instead open source communities often depend on broadcast and crowd-sourced solutions, like documentation, StackOverflow, or meetups and user groups. For example in Dask we strongly point people towards StackOverflow in order to build up a knowledge-base of question-answer pairs. Pandas has done this well; almost every Pandas question you Google leads to a StackOverflow post, handling 90% of the traffic and improving the lives of thousands. Many projects simply don’t have the human capital to hand-hold individuals through using the library.

In a few projects there are enough generous and experienced users that they’re able to field questions from individual users. SymPy is a good example here. I learned open source programming within SymPy. Their community was broad enough that they were able to hold my hand as I learned Git, testing, communication practices and all of the other soft skills that we need to be effective in writing great software. The support structure of SymPy is something that I’ve never experienced anywhere else.

My Apologies

I’ve found myself becoming increasingly impolite when people ask me for certain kinds of extended help with their code. I’ve been trying to track down why this is and I think that it comes from a mismatch of social contracts.

Large parts of technical society have an (entirely reasonable) belief that open source developers are available to answer questions about how we use their project. This was probably true in popular culture, where our stereotypical image of an open source developer was working out of their basement long into the night on things that relatively few enthusiasts bothered with. They were happy to engage and had the free time in which to do it.

In some ways things have changed a lot. We now have paid professionals building software that is used by thousands or millions of users. These professionals easily charge consulting fees of hundreds of dollars per hour for exactly the kind of assistance that people show up expecting for free under the previous model. These developers have to answer for how they spend their time when they’re at work, and when they’re not at work they now have families and kids that deserve just as much attention as their open source users.

Both of these cultures, the creative do-it-yourself basement culture and the more corporate culture, are important to the wonderful surge we’ve seen in open source software. How do we balance them? Should developers, like doctors or lawyers perform pro-bono work as part of their profession? Should grants specifically include paid time for community engagement and outreach? Should users, as part of receiving help feel an obligation to improve documentation or stick around and help others?

Solutions?

I’m not sure what to do here. I feel an obligation to remain connected with users from a broad set of applications, even those that companies or grants haven’t decided to fund. However at the same time I don’t know how to say “I’m sorry, I simply don’t have the time to help you with your problem.” in a way that feels at all compassionate.

I think that people should still ask questions. I think that we need to foster an environment in which developers can say “Sorry. Busy.” more easily. I think that we as a community need better resources to teach novice users to become veteran users.

One positive approach is to honor veteran users, and through this public praise to encourage other users to “up their game”, much as developers do today with coding skills. There are thousands of blogposts about how to develop code well, and people strive tirelessly to improve themselves. My hope is that by attaching the language of skill, like the term “veteran”, to user behaviors we can create an environment where people are proud of how cleanly they can raise issues and how clearly they can describe questions for documentation. Doing this well is critical for a project’s success and requires substantial effort and personal investment.

Dask Distributed Release 1.13.0


I’m pleased to announce a release of Dask’s distributed scheduler, dask.distributed, version 1.13.0.

conda install dask distributed -c conda-forge
or
pip install dask distributed --upgrade

The last few months have seen a number of important user-facing features:

  • Executor is renamed to Client
  • Workers can spill excess data to disk when they run out of memory
  • The Client.compute and Client.persist methods for dealing with dask collections (like dask.dataframe or dask.delayed) gain the ability to restrict sub-components of the computation to different parts of the cluster with a workers= keyword argument.
  • IPython kernels can be deployed on the worker and schedulers for interactive debugging.
  • The Bokeh web interface has gained new plots and improved the visual styling of old ones.

Additionally there are beta features in current development. These features are available now, but may change without warning in future versions. Experimentation and feedback by users comfortable with living on the bleeding edge is most welcome:

  • Clients can publish named datasets on the scheduler to share between them
  • Tasks can launch other tasks
  • Workers can restart themselves in new software environments provided by the user

There have also been significant internal changes. Other than increased performance these changes should not be directly apparent.

  • The scheduler was refactored to a more state-machine like architecture. Doc page
  • Short-lived connections are now managed by a connection pool
  • Work stealing has changed and grown more responsive: Doc page
  • General resilience improvements

The rest of this post will contain very brief explanations of the topics above. Some of these topics may become blogposts of their own at some point. Until then I encourage people to look at the distributed scheduler's documentation, which is separate from dask's normal documentation and so may contain new information for some readers (Google Analytics reports about 5-10x as much readership on http://dask.readthedocs.org as on http://distributed.readthedocs.org).

Major Changes and Features

Rename Executor to Client

http://distributed.readthedocs.io/en/latest/api.html

The term Executor was originally chosen to coincide with the concurrent.futures Executor interface, which is what defines the behavior for the .submit, .map, .result methods and Future object used as the primary interface.

Unfortunately, this is the same term used by projects like Spark and Mesos for “the low-level thing that executes tasks on each of the workers” causing significant confusion when communicating with other communities or for transitioning users.

In response we have renamed Executor to the somewhat more generic term Client, to designate its role as the thing users interact with to control their computations.

>>> from distributed import Executor  # Old
>>> e = Executor()                    # Old

>>> from distributed import Client    # New
>>> c = Client()                      # New

Executor remains an alias for Client and will continue to be valid for some time, but there may be some backwards incompatible changes for internal use of executor= keywords within methods. Newer examples and materials will all use the term Client.

Workers Spill Excess Data to Disk

http://distributed.readthedocs.io/en/latest/worker.html#spill-excess-data-to-disk

When workers get close to running out of memory they can send excess data to disk. This is not on by default and instead requires adding the --memory-limit=auto option to dask-worker.

dask-worker scheduler:8786                      # Old
dask-worker scheduler:8786 --memory-limit=auto  # New

This will eventually become the default (and is now when using LocalCluster) but we’d like to see how things progress and phase it in slowly.

Generally this feature should improve robustness and allow the solution of larger problems on smaller clusters, although with a performance cost. Dask’s policies to reduce memory use through clever scheduling remain in place, so in the common case you should never need this feature, but it’s nice to have as a failsafe.

Enable restriction of valid workers for compute and persist methods

http://distributed.readthedocs.io/en/latest/locality.html#user-control

Expert users of the distributed scheduler will be aware of the ability to restrict certain tasks to run only on certain computers. This tends to be useful when dealing with GPUs or with special databases or instruments only available on some machines.

Previously this option was available only on the submit, map, and scatter methods, forcing people to use the more immediate interface. Now the dask collection interface functions compute and persist support this keyword as well.
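As a minimal sketch (the scheduler address, bucket, and worker addresses below are made up; the exact forms accepted by workers= are described in the locality documentation linked above):

from dask.distributed import Client
import dask.dataframe as dd

client = Client('scheduler-address:8786')   # hypothetical scheduler
df = dd.read_csv('s3://my-bucket/*.csv')    # hypothetical data

# Pin this collection to a subset of workers (addresses are made up);
# client.compute accepts the same workers= keyword.
df = client.persist(df, workers={df: ['192.168.0.1:8786', '192.168.0.2:8786']})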

IPython Integration

http://distributed.readthedocs.io/en/latest/ipython.html

You can start IPython kernels on the workers or scheduler and then access them directly using either IPython magics or the QTConsole. This tends to be valuable when things go wrong and you want to interactively debug on the worker nodes themselves.

Start IPython on the Scheduler

>>> client.start_ipython_scheduler()  # Start IPython kernel on the scheduler
>>> %scheduler scheduler.processing   # Use IPython magics to inspect scheduler
{'127.0.0.1:3595': ['inc-1', 'inc-2'],
 '127.0.0.1:53589': ['inc-2', 'add-5']}

Start IPython on the Workers

>>> info = e.start_ipython_workers()  # Start IPython kernels on all workers
>>> list(info)
['127.0.0.1:4595', '127.0.0.1:53589']
>>> %remote info['127.0.0.1:3595'] worker.active  # Use IPython magics
{'inc-1', 'inc-2'}

Bokeh Interface

http://distributed.readthedocs.io/en/latest/web.html

The Bokeh web interface to the cluster continues to evolve both by improving existing plots and by adding new plots and new pages.

dask progress bar

For example the progress bars have become more compact and shrink down dynamically to respond to additional bars.

And we’ve added in extra tables and plots to monitor workers, such as their memory use and current backlog of tasks.

Experimental Features

The features described below are experimental and may change without warning. Please do not depend on them in stable code.

Publish Datasets

http://distributed.readthedocs.io/en/latest/publish.html

You can now save collections on the scheduler, allowing you to come back to the same computations later or allow collaborators to see and work off of your results. This can be useful in the following cases:

  1. There is a dataset from which you frequently base all computations, and you want that dataset always in memory and easy to access without having to recompute it each time you start work, even if you disconnect.
  2. You want to send results to a colleague working on the same Dask cluster and have them get immediate access to your computations without having to send them a script and without them having to repeat the work on the cluster.

Example: Client One

from dask.distributed import Client
client = Client('scheduler-address:8786')

import dask.dataframe as dd
df = dd.read_csv('s3://my-bucket/*.csv')
df2 = df[df.balance < 0]
df2 = client.persist(df2)

>>> df2.head()
      name  balance
0    Alice     -100
1      Bob     -200
2  Charlie     -300
3   Dennis     -400
4    Edith     -500

client.publish_dataset(accounts=df2)

Example: Client Two

>>> from dask.distributed import Client
>>> client = Client('scheduler-address:8786')
>>> client.list_datasets()
['accounts']
>>> df = client.get_dataset('accounts')
>>> df.head()
      name  balance
0    Alice     -100
1      Bob     -200
2  Charlie     -300
3   Dennis     -400
4    Edith     -500

Launch Tasks from tasks

http://distributed.readthedocs.io/en/latest/task-launch.html

You can now submit tasks to the cluster that themselves submit more tasks. This allows the submission of highly dynamic workloads that can shape themselves depending on future computed values without ever checking back in with the original client.

This is accomplished by starting new local Clients within the task that can interact with the scheduler.

def func():
    from distributed import local_client

    with local_client() as c2:
        future = c2.submit(...)

c = Client(...)
future = c.submit(func)

There are a few straightforward use cases for this, like iterative algorithms with stopping criteria, but also many novel use cases including streaming and monitoring systems.

Restart Workers in Redeployable Python Environments

You can now zip up and distribute full Conda environments, and ask dask-workers to restart themselves, live, in that environment. This involves the following:

  1. Create a conda environment locally (or any redeployable directory including a python executable)
  2. Zip up that environment and use the existing dask.distributed network to copy it to all of the workers
  3. Shut down all of the workers and restart them within the new environment

This helps users to experiment with different software environments with a much faster turnaround time (typically tens of seconds) than asking IT to install libraries or building and deploying Docker containers (which is also a fine solution). Note that the typical solution of uploading individual Python scripts or egg files has been around for a while; see the API docs for upload_file.

Acknowledgements

Since version 1.12.0 on August 18th the following people have contributed commits to the dask/distributed repository

  • Dave Hirschfeld
  • dsidi
  • Jim Crist
  • Joseph Crail
  • Loïc Estève
  • Martin Durant
  • Matthew Rocklin
  • Min RK
  • Scott Sievert

Where to Write Prose?


Code is only as good as its prose.

Like many programmers I spend more time writing prose than code. This is great; writing clean prose focuses my thoughts during design and disseminates understanding so that people see how a project can benefit them.

However, I now question how and where I should write and publish prose. When communicating to users there are generally two options:

  1. Blogposts
  2. Documentation

Given that developer time is finite we need to strike some balance between these two activities. I used to blog frequently, then I switched to almost only documentation, and I think I’m probably about to swing back a bit. Here’s why:

Blogposts

Blogposts excel at generating interest, informing people of new functionality, and providing workable examples that people can copy and modify. I used to blog about Dask (my current software project) pretty regularly here on my blog and continuously got positive feedback from it. This felt great.

However, blogging about evolving software also generates debt. Such blogs grow stale and inaccurate, so when they’re the only source of information about a project, users grow confused when they try things that no longer work, and they’re stuck without a clear reference to turn to. Basing core understanding on blogs can be a frustrating experience.

Documentation

So I switched from writing blogposts to spending a lot of time writing technical documentation. This was a positive move. User comprehension seemed to increase, and the questions I was fielding were of a far higher level than before.

Documentation gets updated as features mature. New pages assimilate cleanly and obsolete pages get cleaned up. Documentation is generally more densely linked than linear blogs, and readers tend to explore more deeply within the website. Comparing the Google Analytics results for my blog and my documentation show significantly increased engagement, both with longer page views as well as longer chains of navigation throughout the site. Documentation seems to engage readers more strongly than do blogs (at least more strongly than my blog).

However, documentation doesn’t get in front of people the same way that blogs do. No one subscribes to receive documentation updates. Doc pages for new features rarely end up on Reddit or Hacker News. The way people pass around blog links encourages Google to point people their way far more often than to doc pages. There is no way for interested users to keep up with the latest news except by subscribing to fairly dry release e-mails.

Blogposts are way sexier. This feels a little shallow if you’re not into sales and marketing, but let’s remember that software dies without users and that users are busy people who have to be stimulated into taking the time to learn new things.

Current Plan

I still think it’s wise for core developers to focus 80% of their prose time on documentation, especially for new or in-flux features that haven’t had a decent amount of time for users to provide feedback.

However I personally hope to blog more about concepts or timely experiences that have to do with development, if not the features themselves. For example, right now I’m building a Mesos-powered scheduler for Dask.distributed. I’ll probably write about the experiences of a developer meeting Mesos for the first time, but I probably won’t include a how-to of using Dask with Mesos.

I also hope to find some way to polish existing doc pages into blogposts once they have proven to be fairly stable. This mostly involves finding a meaningful and reproducible example to work through.

Feedback

I would love to hear how other projects handle this tension between timely and timeless documentation.

Dask and Celery


This post compares two Python distributed task processing systems, Dask.distributed and Celery.

Disclaimer: technical comparisons are hard to do well. I am biased towards Dask and ignorant of correct Celery practices. Please keep this in mind. Critical feedback by Celery experts is welcome.

Celery is a distributed task queue built in Python and heavily used by the Python community for task-based workloads.

Dask is a parallel computing library popular within the PyData community that has grown a fairly sophisticated distributed task scheduler. This post explores if Dask.distributed can be useful for Celery-style problems.

Comparing technical projects is hard both because authors have bias, and also because the scope of each project can be quite large. This allows authors to gravitate towards the features that show off their strengths. Fortunately a Celery user asked how Dask compares on Github and they listed a few concrete features:

  1. Handling multiple queues
  2. Canvas (celery’s workflow)
  3. Rate limiting
  4. Retrying

These provide an opportunity to explore the Dask/Celery comparison from the bias of a Celery user rather than from the bias of a Dask developer.

In this post I’ll point out a couple of large differences, then go through the Celery hello world in both projects, and then address how these requested features are implemented or not within Dask. This anecdotal look at a few features should give us a general sense of how the two projects compare.

Biggest difference: Worker state and communication

First, the biggest difference (from my perspective) is that Dask workers hold onto intermediate results and communicate data between each other while in Celery all results flow back to a central authority. This difference was critical when building out large parallel arrays and dataframes (Dask’s original purpose) where we needed to engage our worker processes’ memory and inter-worker communication bandwidths. Computational systems like Dask do this, more data-engineering systems like Celery/Airflow/Luigi don’t. This is the main reason why Dask wasn’t built on top of Celery/Airflow/Luigi originally.

That’s not a knock against Celery/Airflow/Luigi by any means. Typically they’re used in settings where this doesn’t matter and they’ve focused their energies on several features that Dask similarly doesn’t care about or do well. Tasks usually read data from some globally accessible store like a database or S3 and either return very small results, or place larger results back in the global store.

The question now on my mind is: can Dask be a useful solution for more traditional loose task scheduling problems where projects like Celery are typically used? What are the benefits and drawbacks?

Hello World

To start we do the First steps with Celery walk-through both in Celery and Dask and compare the two:

Celery

I follow the Celery quickstart, using Redis instead of RabbitMQ because it’s what I happen to have handy.

# tasks.py
from celery import Celery

app = Celery('tasks', broker='redis://localhost', backend='redis')

@app.task
def add(x, y):
    return x + y
$ redis-server
$ celery -A tasks worker --loglevel=info
In [1]: from tasks import add

In [2]: %time add.delay(1, 1).get()  # submit and retrieve roundtrip
CPU times: user 60 ms, sys: 8 ms, total: 68 ms
Wall time: 567 ms
Out[2]: 2

In [3]: %%time
   ...: futures = [add.delay(i, i) for i in range(1000)]
   ...: results = [f.get() for f in futures]
   ...:
CPU times: user 888 ms, sys: 72 ms, total: 960 ms
Wall time: 1.7 s

Dask

We do the same workload with dask.distributed’s concurrent.futures interface, using the default single-machine deployment.

In [1]: from distributed import Client

In [2]: c = Client()

In [3]: from operator import add

In [4]: %time c.submit(add, 1, 1).result()
CPU times: user 20 ms, sys: 0 ns, total: 20 ms
Wall time: 20.7 ms
Out[4]: 2

In [5]: %%time
   ...: futures = [c.submit(add, i, i) for i in range(1000)]
   ...: results = c.gather(futures)
   ...:
CPU times: user 328 ms, sys: 12 ms, total: 340 ms
Wall time: 369 ms

Comparison

  • Functions: In Celery you register computations ahead of time on the server. This is good if you know what you want to run ahead of time (such as is often the case in data engineering workloads) and don’t want the security risk of allowing users to run arbitrary code on your cluster. It’s less pleasant for users who want to experiment. In Dask we choose the functions to run on the user side, not on the server side. This ends up being pretty critical in data exploration but may be a hindrance in more conservative/secure compute settings.
  • Setup: In Celery we depend on other widely deployed systems like RabbitMQ or Redis. Dask depends on lower-level Tornado TCP IOStreams and Dask’s own custom routing logic. This makes Dask trivial to set up, but also probably less durable. Redis and RabbitMQ have both solved lots of problems that come up in the wild and leaning on them inspires confidence.
  • Performance: They both operate with sub-second latencies and millisecond-ish overheads. Dask is marginally lower-overhead, but for data engineering workloads differences at this level are rarely significant. Dask is an order of magnitude lower-latency, which might be a big deal depending on your application. For example, if you’re firing off tasks when a user clicks a button on a website, 20ms is generally within the interactive budget while 500ms feels noticeably slower.

Simple Dependencies

The question asked about Canvas, Celery’s dependency management system.

Often tasks depend on the results of other tasks. Both systems have ways to help users express these dependencies.

Celery

The apply_async method has a link= parameter that can be used to call tasks after other tasks have run. For example we can compute (1 + 2) + 3 in Celery as follows:

add.apply_async((1, 2), link=add.s(3))

Dask.distributed

With the Dask concurrent.futures API, futures can be used within submit calls and dependencies are implicit.

x = c.submit(add, 1, 2)
y = c.submit(add, x, 3)

We could also use the dask.delayed decorator to annotate arbitrary functions and then use normal-ish Python.

@dask.delayed
def add(x, y):
    return x + y

x = add(1, 2)
y = add(x, 3)
y.compute()

Comparison

I prefer the Dask solution, but that’s subjective.

Complex Dependencies

Celery

Celery includes a rich vocabulary of terms to connect tasks in more complex ways including groups, chains, chords, maps, starmaps, etc.. More detail here in their docs for Canvas, the system they use to construct complex workflows: http://docs.celeryproject.org/en/master/userguide/canvas.html

For example here we chord many adds and then follow them with a sum.

In [1]: from tasks import add, tsum  # I had to add a sum method to tasks.py

In [2]: from celery import chord

In [3]: %time chord(add.s(i, i) for i in range(100))(tsum.s()).get()
CPU times: user 172 ms, sys: 12 ms, total: 184 ms
Wall time: 1.21 s
Out[3]: 9900

Dask

Dask’s trick of allowing futures in submit calls actually goes pretty far. Dask doesn’t really need any additional primitives. It can do all of the patterns expressed in Canvas fairly naturally with normal submit calls.

In [4]: %%time
   ...: futures = [c.submit(add, i, i) for i in range(100)]
   ...: total = c.submit(sum, futures)
   ...: total.result()
   ...:
CPU times: user 52 ms, sys: 0 ns, total: 52 ms
Wall time: 60.8 ms

Or with Dask.delayed

futures = [add(i, i) for i in range(100)]
total = dask.delayed(sum)(futures)
total.compute()

Multiple Queues

In Celery there is a notion of queues to which tasks can be submitted and that workers can subscribe. An example use case is having “high priority” workers that only process “high priority” tasks. Every worker can subscribe to the high-priority queue but certain workers will subscribe to that queue exclusively:

celery -A my-project worker -Q high-priority  # only subscribe to high priority
celery -A my-project worker -Q celery,high-priority  # subscribe to both
celery -A my-project worker -Q celery,high-priority
celery -A my-project worker -Q celery,high-priority

This is like the TSA pre-check line or the express lane in the grocery store.

Dask has a couple of topics that are similar or could fit this need in a pinch, but nothing that is strictly analogous.

First, for the common case above, tasks have priorities. These are typically set by the scheduler to minimize memory use but can be overridden directly by users to give certain tasks precedence over others.

Second, you can restrict tasks to run on subsets of workers. This was originally designed for data-local storage systems like the Hadoop FileSystem (HDFS) or clusters with special hardware like GPUs but can be used in the queues case as well. It’s not quite the same abstraction but could be used to achieve the same results in a pinch. For each task you can restrict the pool of workers on which it can run.

The relevant docs for this are here: http://distributed.readthedocs.io/en/latest/locality.html#user-control
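As a rough sketch of how worker restrictions might stand in for a high-priority queue (the addresses and the process function below are made up, and a running cluster is assumed):

from dask.distributed import Client

def process(x):
    return x + 1                            # placeholder for real work

client = Client('scheduler-address:8786')   # hypothetical scheduler

normal = client.submit(process, 1)          # may run anywhere
urgent = client.submit(process, 2,          # only on the "express lane" machines
                       workers=['hp-worker-1:8786', 'hp-worker-2:8786'])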

Retrying Tasks

Celery allows tasks to retry themselves on a failure.

# Example from http://docs.celeryproject.org/en/latest/userguide/tasks.html#retrying
@app.task(bind=True)
def send_twitter_status(self, oauth, tweet):
    try:
        twitter = Twitter(oauth)
        twitter.update_status(tweet)
    except (Twitter.FailWhaleError, Twitter.LoginError) as exc:
        raise self.retry(exc=exc)

Sadly Dask currently has no support for this (see open issue). All functions are considered pure and final. If a task errs the exception is considered to be the true result. This could change though; it has been requested a couple of times now.

Until then users need to implement retry logic within the function (which isn’t a terrible idea regardless).

@app.task(bind=True)
def send_twitter_status(self, oauth, tweet, n_retries=5):
    for i in range(n_retries):
        try:
            twitter = Twitter(oauth)
            twitter.update_status(tweet)
            return
        except (Twitter.FailWhaleError, Twitter.LoginError) as exc:
            pass

Rate Limiting

Celery lets you specify rate limits on tasks, presumably to help you avoid getting blocked from hammering external APIs

@app.task(rate_limit='1000/h')
def query_external_api(...):
    ...

Dask definitely has nothing built in for this, nor is it planned. However, this could be done externally to Dask fairly easily. For example, Dask supports mapping functions over arbitrary Python Queues. If you send in a queue then all current and future elements in that queue will be mapped over. You could easily handle rate limiting in Pure Python on the client side by rate limiting your input queues. The low latency and overhead of Dask makes it fairly easy to manage logic like this on the client-side. It’s not as convenient, but it’s still straightforward.

>>> from queue import Queue
>>> q = Queue()
>>> out = c.map(query_external_api, q)
>>> type(out)
Queue
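For example, a small client-side throttle might look like the following sketch; the query_external_api function and the URLs are placeholders, and nothing here is a built-in Dask feature:

import time
from queue import Queue
from dask.distributed import Client

def query_external_api(url):
    return len(url)                     # placeholder for a real API call

def feed(q, items, per_second=5.0):
    """Put work onto the input queue no faster than ``per_second`` items per second."""
    for item in items:
        q.put(item)
        time.sleep(1.0 / per_second)

c = Client()                            # local cluster for illustration
q = Queue()
out = c.map(query_external_api, q)      # maps over current and future elements

feed(q, ['https://example.com/%d' % i for i in range(10)])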

Final Thoughts

Based on this very shallow exploration of Celery, I’ll foolishly claim that Dask can handle Celery workloads, if you’re not diving into deep API. However all of that deep API is actually really important. Celery evolved in this domain and developed tons of features that solve problems that arise over and over again. This history saves users an enormous amount of time. Dask evolved in a very different space and has developed a very different set of tricks. Many of Dask’s tricks are general enough that they can solve Celery problems with a small bit of effort, but there’s still that extra step. I’m seeing people applying that effort to problems now and I think it’ll be interesting to see what comes out of it.

Going through the Celery API was a good experience for me personally. I think that there are some good concepts from Celery that can inform future Dask development.

Dask Cluster Deployments


This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project

All code in this post is experimental. It should not be relied upon. For people looking to deploy dask.distributed on a cluster please refer to the documentation instead.

Dask is deployed today on the following systems in the wild:

  • SGE
  • SLURM
  • Torque
  • Condor
  • LSF
  • Mesos
  • Marathon
  • Kubernetes
  • SSH and custom scripts
  • … there may be more. This is what I know of first-hand.

These systems provide users access to cluster resources and ensure that many distributed services / users play nicely together. They’re essential for any modern cluster deployment.

The people deploying Dask on these cluster resource managers are power-users; they know how their resource managers work and they read the documentation on how to setup Dask clusters. Generally these users are pretty happy; however we should reduce this barrier so that non-power-users with access to a cluster resource manager can use Dask on their cluster just as easily.

Unfortunately, there are a few challenges:

  1. Several cluster resource managers exist, each with significant adoption. Finite developer time stops us from supporting all of them.
  2. Policies for scaling out vary widely. For example we might want a fixed number of workers, or we might want workers that scale out based on current use. Different groups will want different solutions.
  3. Individual cluster deployments are highly configurable. Dask needs to get out of the way quickly and let existing technologies configure themselves.

This post talks about some of these issues. It does not contain a definitive solution.

Example: Kubernetes

For example, both Olivier Grisel (INRIA, scikit-learn) and Tim O’Donnell (Mount Sinai, Hammer lab) publish instructions on how to deploy Dask.distributed on Kubernetes.

These instructions are well organized. They include Dockerfiles, published images, Kubernetes config files, and instructions on how to interact with cloud providers’ infrastructure. Olivier and Tim both obviously know what they’re doing and care about helping others to do the same.

Tim (who came second) wasn’t aware of Olivier’s solution and wrote up his own. Tim was capable of doing this but many beginners wouldn’t be.

One solution would be to include a prominent registry of solutions like these within Dask documentation so that people can find quality references to use as starting points. I’ve started a list of resources here: dask/distributed #547. Comments pointing to other resources would be most welcome.

However, even if Tim did find Olivier’s solution I suspect he would still need to change it. Tim has different software and scalability needs than Olivier. This raises the question of “What should Dask provide and what should it leave to administrators?” It may be that the best we can do is to support copy-paste-edit workflows.

What is Dask-specific, resource-manager specific, and what needs to be configured by hand each time?

Adaptive Deployments

In order to explore this topic of separable solutions I built a small adaptive deployment system for Dask.distributed on Marathon, an orchestration platform on top of Mesos.

This solution does two things:

  1. It scales a Dask cluster dynamically based on the current use. If there are more tasks in the scheduler then it asks for more workers.
  2. It deploys those workers using Marathon.

To encourage replication, these two different aspects are solved in two different pieces of code with a clean API boundary.

  1. A backend-agnostic piece for adaptivity that says when to scale workers up and how to scale them down safely
  2. A Marathon-specific piece that deploys or destroys dask-workers using the Marathon HTTP API

This combines a policy, adaptive scaling, with a backend, Marathon, such that either can be replaced easily. For example we could replace the adaptive policy with a fixed one to always keep N workers online, or we could replace Marathon with Kubernetes or Yarn.

My hope is that this demonstration encourages others to develop third party packages. The rest of this post will be about diving into this particular solution.

Adaptivity

The distributed.deploy.Adaptive class wraps around a Scheduler and determines when we should scale up and by how many nodes, and when we should scale down specifying which idle workers to release.

The current policy is fairly straightforward:

  1. If there are unassigned tasks or any stealable tasks and no idle workers, or if the average memory use is over 50%, then increase the number of workers by a fixed factor (defaults to two).
  2. If there are idle workers and the average memory use is below 50% then reclaim the idle workers with the least data on them (after moving data to nearby workers) until we’re near 50%.

Think this policy could be improved or have other thoughts? Great. It was easy to implement and entirely separable from the main code so you should be able to edit it easily or create your own. The current implementation is about 80 lines (source).
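As a rough, hypothetical rendering of the policy described above (the real Adaptive class lives in distributed.deploy and differs in detail):

def should_scale_up(unassigned_tasks, stealable_tasks, idle_workers,
                    avg_memory_fraction):
    """Return True when the cluster looks saturated."""
    busy = (unassigned_tasks or stealable_tasks) and not idle_workers
    return busy or avg_memory_fraction > 0.50

def workers_to_release(idle_workers, avg_memory_fraction):
    """While memory use is low, release idle workers holding the least data first."""
    if avg_memory_fraction >= 0.50:
        return []
    return sorted(idle_workers, key=lambda worker: worker['nbytes'])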

However, this Adaptive class doesn’t actually know how to perform the scaling. Instead it depends on being handed a separate object, with two methods, scale_up and scale_down:

class MyCluster(object):
    def scale_up(self, n):
        """
        Bring the total count of workers up to ``n``

        This function/coroutine should bring the total number of workers up to
        the number ``n``.
        """
        raise NotImplementedError()

    def scale_down(self, workers):
        """
        Remove ``workers`` from the cluster

        Given a list of worker addresses this function should remove those
        workers from the cluster.
        """
        raise NotImplementedError()

This cluster object contains the backend-specific bits of how to scale up and down, but none of the adaptive logic of when to scale up and down. The single-machine LocalCluster object serves as reference implementation.

So we combine this adaptive scheme with a deployment scheme. We’ll use a tiny Dask-Marathon deployment library available here

from dask_marathon import MarathonCluster
from distributed import Scheduler
from distributed.deploy import Adaptive

s = Scheduler()
mc = MarathonCluster(s, cpus=1, mem=4000,
                     docker_image='mrocklin/dask-distributed')
ac = Adaptive(s, mc)

This combines a policy, Adaptive, with a deployment scheme, Marathon, in a composable way. The Adaptive cluster watches the scheduler and calls the scale_up/down methods on the MarathonCluster as necessary.

Marathon code

Because we’ve isolated all of the “when” logic to the Adaptive code, the Marathon specific code is blissfully short and specific. We include a slightly simplified version below. There is a fair amount of Marathon-specific setup in the constructor and then simple scale_up/down methods below:

import uuid

from marathon import MarathonClient, MarathonApp
from marathon.models.container import MarathonContainer


class MarathonCluster(object):
    def __init__(self, scheduler,
                 executable='dask-worker',
                 docker_image='mrocklin/dask-distributed',
                 marathon_address='http://localhost:8080',
                 name=None, cpus=1, mem=4000, **kwargs):
        self.scheduler = scheduler

        # Create Marathon App to run dask-worker
        args = [executable, scheduler.address,
                '--nthreads', str(cpus),
                '--name', '$MESOS_TASK_ID',  # use Mesos task ID as worker name
                '--worker-port', '$PORT_WORKER',
                '--nanny-port', '$PORT_NANNY',
                '--http-port', '$PORT_HTTP']

        ports = [{'port': 0,
                  'protocol': 'tcp',
                  'name': name}
                 for name in ['worker', 'nanny', 'http']]

        args.extend(['--memory-limit',
                     str(int(mem * 0.6 * 1e6))])

        kwargs['cmd'] = ' '.join(args)
        container = MarathonContainer({'image': docker_image})

        app = MarathonApp(instances=0,
                          container=container,
                          port_definitions=ports,
                          cpus=cpus, mem=mem, **kwargs)

        # Connect and register app
        self.client = MarathonClient(marathon_address)
        self.app = self.client.create_app(name or 'dask-%s' % uuid.uuid4(), app)

    def scale_up(self, instances):
        self.client.scale_app(self.app.id, instances=instances)

    def scale_down(self, workers):
        for w in workers:
            self.client.kill_task(self.app.id,
                                  self.scheduler.worker_info[w]['name'],
                                  scale=True)

This isn’t trivial; you need to know about Marathon for it to make sense, but fortunately you don’t need to know much else. My hope is that people familiar with other cluster resource managers will be able to write similar objects and will publish them as third party libraries as I have with this Marathon solution here: https://github.com/mrocklin/dask-marathon (thanks goes to Ben Zaitlen for setting up a great testing harness for this and getting everything started).

Adaptive Policies

Similarly, we can design new policies for deployment. You can read more about the policies for the Adaptive class in the documentation or the source (about eighty lines long). I encourage people to implement and use other policies and contribute back those policies that are useful in practice.

Final thoughts

We laid out a problem:

  • How does a distributed system support a variety of cluster resource managers and a variety of scheduling policies while remaining sensible?

We proposed two solutions:

  1. Maintain a registry of links to solutions, supporting copy-paste-edit practices
  2. Develop an API boundary that encourages separable development of third party libraries.

It’s not clear that either solution is sufficient, or that the current implementation of either solution is any good. This is an important problem though, as Dask.distributed is, today, still mostly used by super-users. I would like to engage community creativity here as we search for a good solution.


Dask Development Log


This work is supported by Continuum Analytics, the XDATA Program, and the Data Driven Discovery Initiative from the Moore Foundation

Dask has been active lately due to a combination of increased adoption and funded feature development by private companies. This increased activity is great, however an unintended side effect is that I have spent less time writing about development and engaging with the broader community. To address this I hope to write one blogpost a week about general development. These will not be particularly polished, nor will they announce ready-to-use features for users, however they should increase transparency and hopefully better engage the developer community.

Themes of last week:

  1. Embedded Bokeh servers for the Workers
  2. Smarter workers
  3. An overhauled scheduler that is slightly simpler overall (thanks to the smarter workers) but with more clever work stealing
  4. Fastparquet

Embedded Bokeh Servers in Dask Workers

The distributed scheduler’s web diagnostic page is one of Dask’s more flashy features. It shows the passage of every computation on the cluster in real time. These diagnostics are invaluable for understanding performance both for users and for core developers.

I intend to focus on worker performance soon, so I decided to attach a Bokeh server to every worker to serve web diagnostics about that worker. To make this easier, I also learned how to embed Bokeh servers inside of other Tornado applications. This has reduced the effort to create new visuals and expose real time information considerably and I can now create a full live visualization in around 30 minutes. It is now faster for me to build a new diagnostic than to grep through logs. It’s pretty useful.

Here are some screenshots. Nothing too flashy, but this information is highly valuable to me as I measure bandwidths, delays of various parts of the code, how workers send data between each other, etc..

Dask Bokeh Worker system page

To be clear, these diagnostic pages aren’t polished in any way. There’s lots missing, it’s just what I could get done in a day. Still, everyone running a Tornado application should have an embedded Bokeh server running. They’re great for rapidly pushing out visually rich diagnostics.

Smarter Workers and a Simpler Scheduler

Previously the scheduler knew everything and the workers were fairly simple-minded. Now we’ve moved some of the knowledge and responsibility over to the workers. Previously the scheduler would give just enough work to the workers to keep them occupied. This allowed the scheduler to make better decisions about the state of the entire cluster. By delaying committing a task to a worker until the last moment we made sure that we were making the right decision. However, this also means that the worker sometimes has idle resources, particularly network bandwidth, when it could be speculatively preparing for future work.

Now we commit all ready-to-run tasks to a worker immediately and that worker has the ability to pipeline those tasks as it sees fit. This is better locally but slightly worse globally. To counterbalance this we’re now being much more aggressive about work stealing and, because the workers have more information, they can manage some of the administrative costs of work stealing themselves. Because this isn’t bound to run only on the scheduler we can use more expensive algorithms than when we did everything on the scheduler.

There were a few motivations for this change:

  1. Dataframe performance was bound by keeping the worker hardware fully occupied, which we weren’t doing. I expect that these changes will eventually yield something like a 30% speedup.
  2. Users on traditional job scheduler machines (SGE, SLURM, TORQUE) and users who like GPUs both wanted the ability to tag tasks with specific resource constraints like “This consumes one GPU” or “This task requires 5GB of RAM while running” and ensure that workers would respect those constraints when running tasks. The old workers weren’t complex enough to reason about these constraints. With the new workers, adding this feature was trivial.
  3. By moving logic from the scheduler to the worker we’ve actually made them both easier to reason about. This should lower barriers for contributors to get into the core project.

Dataframe algorithms

Approximate nunique and multiple-output-partition groupbys landed in master last week. These arose because some power-users had very large dataframes that were running into scalability limits. Thanks to Mike Graham for the approximate nunique algorithm. This has also pushed hashing changes upstream to Pandas.

Fast Parquet

Martin Durant has been working on a Parquet reader/writer for Python using Numba. It’s pretty slick. He’s been using it on internal Continuum projects for a little while and has seen both good performance and a very Pythonic experience for what was previously a format that was pretty inaccessible.

He’s planning to write about this in the near future so I won’t steal his thunder. Here is a link to the documentation: fastparquet.readthedocs.io

Dask Development Log


This work is supported by Continuum Analytics, the XDATA Program, and the Data Driven Discovery Initiative from the Moore Foundation

To increase transparency I’m blogging weekly about the work done on Dask and related projects during the previous week. This log covers work done between 2016-12-05 and 2016-12-12. Nothing here is stable or ready for production. This blogpost is written in haste, so refined polish should not be expected.

Themes of last week:

  1. Dask.array without known chunk sizes
  2. Import time
  3. Fastparquet blogpost and feedback
  4. Scheduler improvements for 1000+ worker clusters
  5. Channels and inter-client communication
  6. New dependencies?

Dask array without known chunk sizes

Dask arrays can now work even in situations where we don’t know the exact chunk size. This is particularly important because it allows us to convert dask.dataframes to dask.arrays in a standard analysis cycle that includes both data preparation and statistical or machine learning algorithms.

x = df.values
x = df.to_records()

This work was motivated by the work of Christopher White on building scalable solvers for problems like logistic regression and generalized linear models over at dask-glm.

As a pleasant side effect we can now also index dask.arrays with dask.arrays (a previous limitation)

x[x>0]

and mutate dask.arrays in certain cases with setitem

x[x>0]=0

Both of these are frequently requested.

However, there are still holes in this implementation and many operations (like slicing) generally don’t work on arrays without known chunk sizes. We’re increasing capability here but blurring the lines of what is possible and what is not possible, which used to be very clear.

Import time

Import times had been steadily climbing for a while, rising above one second at times. These were reduced by Antoine Pitrou down to a more reasonable 300ms.

FastParquet blogpost and feedback

Martin Durant has built a nice Python Parquet library here: http://fastparquet.readthedocs.io/en/latest/ and released a blogpost about it last week here: https://www.continuum.io/blog/developer-blog/introducing-fastparquet

Since then we’ve gotten some good feedback and error reports (non-string column names, etc.). Martin has been optimizing performance and recently added append support.

Scheduler optimizations for 1000+ worker clusters

The recent refactoring of the scheduler and worker exposed new opportunities for performance and for measurement. One of the 1000+ worker deployments here in NYC was kind enough to volunteer some compute time to run some experiments. It was very fun having all of the Dask/Bokeh dashboards up at once (there are now half a dozen of these things) giving live monitoring information on a thousand-worker deployment. It’s stunning how clearly performance issues present themselves when you have the right monitoring system.

Anyway, this led to better sequentialization when handling messages, greatly reduced open file handle requirements, and the use of cytoolz over toolz in a few critical areas.

I intend to try this experiment again this week, now with new diagnostics. To aid in that we’ve made it very easy to turn timings and counters automatically into live Bokeh plots. It now takes literally one line of code to add a new plot to these pages (left: scheduler, right: worker).

Dask Bokeh counters page

Already we can see that the time it takes to connect between workers is absurdly high in the 10ms to 100ms range, highlighting an important performance flaw.

This depends on an experimental project, crick, by Jim Crist that provides a fast T-Digest implemented in C (see also Ted Dunning’s implementation).

Channels and inter-client communication

I’m starting to experiment with mechanisms for inter-client communication of futures. This enables both collaborative workflows (two researchers sharing the same cluster) and also complex workflows in which tasks start other tasks in a more streaming setting.

We added a simple mechanism to share a rolling buffer of futures between clients:

# Client 1
c = Client('scheduler:8786')
x = c.channel('x')
future = c.submit(inc, 1)
x.put(future)

# Client 2
c = Client('scheduler:8786')
x = c.channel('x')
future = next(iter(x))

Additionally, this relatively simple mechanism was built external to the scheduler and client, establishing a pattern we can repeat in the future for more complex inter-client communication systems. Generally I’m on the lookout for other ways to make the system more extensible. The range of extension requests for the scheduler is fairly large these days, and we’d like to find ways to keep these expansions maintainable going forward.

New dependency: Sorted collections

The scheduler is now using the sortedcollections module, which is based on sortedcontainers, a pure-Python library that offers sorted containers (SortedList, SortedSet, ValueSortedDict, etc.) at C-extension speeds.
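For a quick illustration (a doctest-style sketch; the keys and values are arbitrary), ValueSortedDict keeps keys ordered by their values:

>>> from sortedcollections import ValueSortedDict
>>> d = ValueSortedDict({'alice': 3, 'bob': 1, 'charlie': 2})
>>> list(d)          # keys iterate in order of their values
['bob', 'charlie', 'alice']
>>> d['dan'] = 0
>>> next(iter(d))    # cheap access to the key with the smallest value
'dan'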

So far I’m pretty sold on these libraries. I encourage other library maintainers to consider them.

Dask Development Log


This work is supported by Continuum Analytics, the XDATA Program, and the Data Driven Discovery Initiative from the Moore Foundation

To increase transparency I’m blogging weekly about the work done on Dask and related projects during the previous week. This log covers work done between 2016-12-11 and 2016-12-18. Nothing here is ready for production. This blogpost is written in haste, so refined polish should not be expected.

Themes of last week:

  1. Benchmarking new scheduler and worker on larger systems
  2. Kubernetes and Google Container Engine
  3. Fastparquet on S3

Rewriting Load Balancing

In the last two weeks we rewrote a significant fraction of the worker and scheduler. This enables future growth, but also resulted in a loss of our load balancing and work stealing algorithms (the old one no longer made sense in the context of the new system.) Careful dynamic load balancing is essential to running atypical workloads (which are surprisingly typical among Dask users) so rebuilding this has been all-consuming this week for me personally.

Briefly, Dask initially assigns tasks to workers taking into account the expected runtime of the task, the size and location of the data that the task needs, the duration of other tasks on every worker, and where each piece of data sits on all of the workers. Because the number of tasks can grow into the millions and the number of workers can grow into the thousands, Dask needs to figure out a near-optimal placement in near-constant time, which is hard. Furthermore, after the system runs for a while, uncertainties in our estimates build, and we need to rebalance work from saturated workers to idle workers relatively frequently. Load balancing intelligently and responsively is essential to a satisfying user experience.
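As a toy model of the kind of objective involved (this is illustrative only, not Dask’s actual scheduling code), one might score candidate workers like this:

def estimated_finish_time(worker, task, bandwidth=100e6):
    """Time already queued on the worker, plus time to gather missing
    dependencies, plus the expected runtime of the task itself."""
    missing = [dep for dep in task['deps'] if dep not in worker['has_what']]
    transfer_time = sum(task['nbytes'][dep] for dep in missing) / bandwidth
    return worker['occupancy'] + transfer_time + task['duration']

def choose_worker(workers, task):
    """Pick the worker expected to finish the task first."""
    return min(workers, key=lambda w: estimated_finish_time(w, task))

workers = [{'occupancy': 0.5, 'has_what': {'x'}},
           {'occupancy': 0.0, 'has_what': set()}]
task = {'deps': ['x'], 'nbytes': {'x': 30e6}, 'duration': 0.1}
best = choose_worker(workers, task)   # the idle worker wins despite moving the data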

We have a decently strong test suite around these behaviors, but it’s hard to be comprehensive on performance-based metrics like this, so there has also been a lot of benchmarking against real systems to identify new failure modes. We’re doing what we can to create isolated tests for every failure mode that we find to make future rewrites retain good behavior.

Generally working on the Dask distributed scheduler has taught me the brittleness of unit tests. As we have repeatedly rewritten internals while maintaining the same external API our testing strategy has evolved considerably away from fine-grained unit tests to a mixture of behavioral integration tests and a very strict runtime validation system.

Rebuilding the load balancing algorithms has been high priority for me personally because these performance issues inhibit current power-users from using the development version on their problems as effectively as with the latest release. I’m looking forward to seeing load-balancing humming nicely again so that users can return to git-master and so that I can return to handling a broader base of issues. (Sorry to everyone I’ve been ignoring the last couple of weeks).

Test deployments on Google Container Engine

I’ve personally started switching over my development cluster from Amazon’s EC2 to Google’s Container Engine. Here are some pro’s and con’s from my particular perspective. Many of these probably have more to do with how I use each particular tool rather than intrinsic limitations of the service itself.

In Google’s Favor

  1. Native and immediate support for Kubernetes and Docker, the combination of which allows me to more quickly and dynamically create and scale clusters for different experiments.
  2. Dynamic scaling from a single node to a hundred nodes and back ten minutes later allows me to more easily run a much larger range of scales.
  3. I like being charged by the minute rather than by the hour, especially given the ability to dynamically scale up
  4. Authentication and billing feel simpler

In Amazon’s Favor

  1. I already have tools to launch Dask on EC2
  2. All of my data is on Amazon’s S3
  3. I have nice data acquisition tools, s3fs, for S3 based on boto3. Google doesn’t seem to have a nice Python 3 library for accessing Google Cloud Storage :(

I’m working from Olivier Grisel’s repository docker-distributed although updating to newer versions and trying to use as few modifications from naive deployment as possible. My current branch is here. I hope to have something more stable for next week.

Fastparquet on S3

We gave fastparquet and Dask.dataframe a spin on some distributed S3 data on Friday. I was surprised that everything seemed to work out of the box. Martin Durant, who built both fastparquet and s3fs has done some nice work to make sure that all of the pieces play nicely together. We ran into some performance issues pulling bytes from S3 itself. I expect that there will be some tweaking over the next few weeks.

Dask Development Log


This work is supported by Continuum Analytics, the XDATA Program, and the Data Driven Discovery Initiative from the Moore Foundation

To increase transparency I’m blogging weekly about the work done on Dask and related projects during the previous week. This log covers work done between 2016-12-11 and 2016-12-18. Nothing here is ready for production. This blogpost is written in haste, so refined polish should not be expected.

Themes of last week:

  1. Cleanup of load balancing
  2. Found cause of worker lag
  3. Initial Spark/Dask Dataframe comparisons
  4. Benchmarks with asv

Load Balancing Cleanup

The last two weeks saw several disruptive changes to the scheduler and workers. This resulted in an overall performance degradation on messy workloads when compared to the most recent release, which stopped bleeding-edge users from using recent dev builds. This has been resolved, and bleeding-edge git-master is back up to the old speed and then some.

As a visual aid, this is what bad (or in this case random) load balancing looks like:

bad work stealing

Identified and removed worker lag

For a while there have been significant gaps of 100ms or more between successive tasks in workers, especially when using Pandas. This was particularly odd because the workers had lots of backed up work to keep them busy (thanks to the nice load balancing from before). The culprit here was the calculation of the size of intermediate results on object-dtype dataframes.

lag between tasks

Explaining this in greater depth, recall that to schedule intelligently, the workers calculate the size in bytes of every intermediate result they produce. Often this is quite fast, for example for numpy arrays we can just multiply the number of elements by the dtype itemsize. However for object dtype arrays or dataframes (which are commonly used for text) it can take a long while to calculate an accurate result here. Now we no longer calculate an accurate result, but instead take a fairly pessimistic guess. The gaps between tasks shrink considerably.
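To make the trade-off concrete, here is a rough sketch of the idea (Dask’s real sizeof machinery differs in detail; the 100-byte-per-object guess below is arbitrary):

import sys
import numpy as np
import pandas as pd

def estimate_nbytes(obj, guess_per_object=100):
    """Exact and cheap for numeric numpy arrays; a pessimistic guess for
    object-dtype dataframe columns rather than inspecting every element."""
    if isinstance(obj, np.ndarray) and obj.dtype != object:
        return obj.size * obj.dtype.itemsize
    if isinstance(obj, pd.DataFrame):
        base = int(obj.memory_usage(index=True, deep=False).sum())
        n_object_cells = int((obj.dtypes == object).sum()) * len(obj)
        return base + guess_per_object * n_object_cells
    return sys.getsizeof(obj)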

no lag between tasks / no lag between tasks (zoomed)

There is still a significant bit of lag, around 10ms long, between tasks on these workloads (see the zoomed version on the right). On other workloads we’re able to get inter-task lag down to the tens of microseconds scale. While 10ms may not sound like a long time, when we perform very many very short tasks this can quickly become a bottleneck.

Anyway, this change reduced shuffle overhead by a factor of two. Things are starting to look pretty snappy for many-small-task workloads.

Initial Spark/Dask Dataframe Comparisons

I would like to run a small benchmark comparing Dask and Spark DataFrames. I spent a bit of the last couple of days using Spark locally on the NYC Taxi data and futzing with cluster deployment tools to set up Spark clusters on EC2 for basic benchmarking. I ran across flintrock, which has been highly recommended to me a few times.

I’ve been thinking about how to do benchmarks in an unbiased way. Comparative benchmarks are useful to have around to motivate projects to grow and learn from each other. However in today’s climate where open source software developers have a vested interest, benchmarks often focus on a project’s strengths and hide its deficiencies. Even with the best of intentions and practices, a developer is likely to correct for deficiencies on the fly. They’re much more able to do this for their own project than for others’. Benchmarks end up looking more like sales documents than trustworthy research.

My tentative plan is to reach out to a few Spark devs and see if we can collaborate on a problem set and hardware before running computations and comparing results.

Benchmarks with airspeed velocity

Rich Postelnik is building on work from Tom Augspurger to build out benchmarks for Dask using airspeed velocity at dask-benchmarks. Building out benchmarks is a great way to get involved if anyone is interested.

Pre-pre-release

I intend to publish a pre-release for a 0.X.0 version bump of dask/dask and dask/distributed sometime next week.

Dask Release 0.13.0


This work is supported by Continuum Analytics, the XDATA Program, and the Data Driven Discovery Initiative from the Moore Foundation

Summary

Dask just grew to version 0.13.0. This is a significant release for arrays, dataframes, and the distributed scheduler. This blogpost outlines some of the major changes since the last release on November 4th.

  1. Python 3.6 support
  2. Algorithmic and API improvements for DataFrames
  3. Dataframe to Array conversions for Machine Learning
  4. Parquet support
  5. Scheduling Performance and Worker Rewrite
  6. Pervasive Visual Diagnostics with Embedded Bokeh Servers
  7. Windows continuous integration
  8. Custom serialization

You can install new versions using Conda or Pip

conda install -c conda-forge dask distributed

or

pip install dask[complete] distributed --upgrade

Python 3.6 Support

Dask and all necessary dependencies are now available on Conda Forge for Python 3.6.

Algorithmic and API Improvements for DataFrames

Thousand-core Dask deployments have become significantly more common in the last few months. This has highlighted scaling issues in some of the Dask.array and Dask.dataframe algorithms, which were originally designed for single workstations. Algorithmic and API changes can be grouped into the following two categories:

  1. Filling out the Pandas API
  2. Algorithms that needed to be changed or added due to scaling issues

Dask Dataframes now include a fuller set of the Pandas API, including the following:

  1. Inplace operations like df['x'] = df.y + df.z
  2. The full Groupby-aggregate syntax like df.groupby(...).aggregate({'x': 'sum', 'y': ['min', 'max']})
  3. Resample on dataframes as well as series
  4. Pandas’ new rolling syntax df.x.rolling(10).mean()
  5. And much more
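As a short, hedged sketch exercising the API surface listed above (the column names and data are made up):

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({'x': range(100), 'y': range(100), 'z': range(100)},
                   index=pd.date_range('2017-01-01', periods=100, freq='H'))
df = dd.from_pandas(pdf, npartitions=4)

df['x'] = df.y + df.z                                            # inplace column assignment
agg = df.groupby(df.y % 3).aggregate({'x': 'sum', 'y': ['min', 'max']})
weekly = df.resample('7D').mean()                                # resample on dataframes
rolled = df.x.rolling(10).mean()                                 # new rolling syntax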

Additionally, collaboration with some of the larger Dask deployments has highlighted scaling issues in some algorithms, resulting in the following improvements:

  1. Tree reductions for groupbys, aggregations, etc.
  2. Multi-output-partition aggregations for groupby-aggregations with millions of groups, drop_duplicates, etc..
  3. Approximate algorithms for nunique
  4. etc..

These same collaborations have also yielded better handling of open file descriptors, changes upstream to Tornado, and upstream changes to the conda-forge CPython recipe itself to increase the default file descriptor limit on Windows up from 512.

Dataframe to Array Conversions

You can now convert Dask dataframes into Dask arrays. This is mostly to support efforts of groups building statistics and machine learning applications, where this conversion is common. For example you can load a terabyte of CSV or Parquet data, do some basic filtering and manipulation, and then convert to a Dask array to do more numeric work like SVDs, regressions, etc..

import dask.dataframe as dd
import dask.array as da

df = dd.read_csv('s3://...')    # Read raw data
x = df.values                   # Convert to dask.array
u, s, v = da.linalg.svd(x)      # Perform serious numerics

This should help machine learning and statistics developers generally, as many of the more sophisticated algorithms can be more easily implemented with the Dask array model than can be done with distributed dataframes. This change was done specifically to support the nascent third-party dask-glm project by Chris White at Capital One.

Previously this was hard because Dask.array wanted to know the size of every chunk of data, which Dask dataframes can’t provide (because, for example, it is impossible to lazily tell how many rows are in a CSV file without actually looking through it). Now that Dask.arrays have relaxed this requirement they can also support other unknown shape operations, like indexing an array with another array.

y=x[x>0]

Parquet Support

Dask.dataframe now supports Parquet, a columnar binary store for tabular data commonly used in distributed clusters and the Hadoop ecosystem.

import dask.dataframe as dd

df = dd.read_parquet('myfile.parquet')                   # Read from Parquet
df.to_parquet('myfile.parquet', compression='snappy')    # Write to Parquet

This is done through the new fastparquet library, a Numba-accelerated version of the Pure Python parquet-python. Fastparquet was built and is maintained by Martin Durant. It’s also exciting to see the Parquet-cpp project gain Python support through Arrow and work by Wes McKinney and Uwe Korn. Parquet has gone from inaccessible in Python to having multiple competing implementations, which is a wonderful and exciting change for the “Big Data” Python ecosystem.

Scheduling Performance and Worker Rewrite

The internals of the distributed scheduler and workers are significantly modified. Users shouldn’t experience much change here except for general performance enhancement, more upcoming features, and much deeper visual diagnostics through Bokeh servers.

We’ve pushed some of the scheduling logic from the scheduler onto the workers. This lets us do two things:

  1. We keep a much larger backlog of tasks on the workers. This allows workers to optimize and saturate their hardware more effectively. As a result, complex computations end up being significantly faster.
  2. We can more easily deliver on a rising number of requests for complex scheduling features. For example, GPU users will be happy to learn that you can now specify abstract resource constraints like “this task requires a GPU” and “this worker has four GPUs” and the scheduler and workers will allocate tasks accordingly. This is just one example of a feature that was easy to implement after the scheduler/worker redesign and is now available.
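For example, a hedged sketch of those resource constraints (the scheduler address and the train function are placeholders; the flag and keyword follow the distributed documentation on worker resources):

# Start a worker that advertises its hardware:
#
#   dask-worker scheduler-address:8786 --resources "GPU=4"
#
# then ask the scheduler to respect those resources when submitting tasks:
from dask.distributed import Client

def train(x):
    return x                  # placeholder for GPU-hungry work

client = Client('scheduler-address:8786')
future = client.submit(train, 1, resources={'GPU': 1})   # runs only where a GPU is free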

Pervasive Visual Diagnostics with Embedded Bokeh Servers

While optimizing scheduler performance we built several new visual diagnostics using Bokeh. There is now a Bokeh Server running within the scheduler and within every worker.

Existing Dask.distributed users will be familiar with the current diagnostic dashboards:

Dask Bokeh Plots

These plots provide intuition about the state of the cluster and the computations currently in flight. These dashboards are generally well loved.

There are now many more of these, though focused more on internal state and timings that will be of interest to developers and power users than to typical users. Here are a couple of the new pages (of which there are seven) that show timings and counters for various parts of the worker and scheduler internals.

Dask Bokeh counters page

The previous Bokeh dashboards were served from a separate process that queried the scheduler periodically (every 100ms). Now there are new Bokeh servers within every worker and a new Bokeh server within the scheduler process itself rather than in a separate process. Because these servers are embedded, they have direct access to the state of the scheduler and workers, which significantly lowers the barrier to building new visuals. However, this also adds some load to the scheduler, which can often be compute bound. These pages are available at new ports: 8788 for the scheduler and 8789 for the worker by default.

Custom Serialization

This is actually a change that occurred in the last release, but I haven’t written about it and it’s important, so I’m including it here.

Previously inter-worker communication of data was accomplished with Pickle/Cloudpickle and optional generic compression like LZ4/Snappy. This was robust and worked mostly fine, but left out some exotic data types and did not provide optimal performance.

Now we can serialize different types with special consideration. This allows special types, like NumPy arrays, to pass through without unnecessary memory copies and also lets us use more exotic, data-type-specific compression techniques like Blosc.

It also allows Dask to serialize some previously unserializable types. In particular, this was intended to address the Dask.array climate science community's concern about HDF5 and NetCDF files, which (correctly) are unpicklable and so were restricted to single-machine use.
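
For custom Python types, the hook looks roughly like the following sketch: a pair of functions breaks an object into a small metadata header plus a list of byte frames and reassembles it on the other side. The Human class here is purely illustrative, and the exact import path of the registration function may differ between versions, so check the Dask.distributed serialization documentation.

# A rough sketch of per-type serialization hooks; Human is a made-up example class.
from distributed.protocol import register_serialization

class Human(object):
    def __init__(self, name):
        self.name = name

def serialize(human):
    header = {}                       # small, msgpack-friendly metadata
    frames = [human.name.encode()]    # large payloads travel as raw byte frames
    return header, frames

def deserialize(header, frames):
    return Human(frames[0].decode())

register_serialization(Human, serialize, deserialize)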

This is also the first step towards two frequently requested features (neither of these exist yet):

  1. Better support for GPU-to-GPU-specific serialization options. We are now a large step closer to generalizing away our assumption of TCP sockets as the universal communication mechanism.
  2. Passing data between workers of different runtime languages. By embracing protocols other than Pickle we begin to allow for the communication of data between workers of different software environments.

What’s Next

So what should we expect to see in the future for Dask?

  • Communication: Now that workers are more fully saturated, we've found that communication issues are arising more frequently as bottlenecks. This might be because everything else is nearing optimal, or it might be because of the increased contention in the workers now that they are idle less often. Many of our new diagnostics are intended to measure components of the communication pipeline.
  • Third Party Tools: We're seeing a nice growth of utilities like dask-drmaa for launching clusters on DRMAA job schedulers (SGE, SLURM, LSF) and dask-glm, which provides solvers for GLM-like machine-learning algorithms. I hope that external projects like these become the main focus of Dask development going forward as Dask penetrates new domains.
  • Blogging: I’ll be launching a few fun blog posts throughout the next couple of weeks. Stay tuned.

Learn More

You can install or upgrade using Conda or Pip:

conda install -c conda-forge dask distributed

or

pip install dask[complete] distributed --upgrade

You can learn more about Dask and its distributed scheduler in the Dask and Dask.distributed documentation.

Acknowledgements

Since the last main release the following developers have contributed to the core Dask repository (parallel algorithms, arrays, dataframes, etc.):

  • Alexander C. Booth
  • Antoine Pitrou
  • Christopher Prohm
  • Frederic Laliberte
  • Jim Crist
  • Martin Durant
  • Matthew Rocklin
  • Mike Graham
  • Rolando (Max) Espinoza
  • Sinhrks
  • Stuart Archibald

And the following developers have contributed to the Dask/distributed repository (distributed scheduling, network communication, etc.):

  • Antoine Pitrou
  • jakirkham
  • Jeff Reback
  • Jim Crist
  • Martin Durant
  • Matthew Rocklin
  • rbubley
  • Stephan Hoyer
  • strets123
  • Travis E. Oliphant