
Distributed Pandas on a Cluster with Dask DataFrames


This work is supported by Continuum Analytics, the XDATA Program, and the Data Driven Discovery Initiative from the Moore Foundation.

Summary

Dask Dataframe extends the popular Pandas library to operate on big datasets on a distributed cluster. We show its capabilities by running through common dataframe operations on a common dataset. We break up these computations into the following sections:

  1. Introduction: Pandas is intuitive and fast, but needs Dask to scale
  2. Read CSV and Basic operations
    1. Read CSV
    2. Basic Aggregations and Groupbys
    3. Joins and Correlations
  3. Shuffles and Time Series
  4. Parquet I/O
  5. Final thoughts
  6. What we could have done better

Accompanying Plots

Throughout this post we accompany computational examples with profiles of exactly what task ran where on our cluster and when. These profiles are interactive Bokeh plots that include every task that every worker in our cluster runs over time. For example, the following read_csv computation produces the following profile:

>>> df = dd.read_csv('s3://dask-data/nyc-taxi/2015/*.csv')

If you are reading this through a syndicated website like planet.python.org or through an RSS reader then these plots will not show up. You may want to visit http://matthewrocklin.com/blog/work/2017/01/12/dask-dataframes directly.

Dask.dataframe breaks up reading this data into many small tasks of different types, for example reading bytes and parsing those bytes into pandas dataframes. Each rectangle corresponds to one task. The y-axis enumerates each of the worker processes. We have 64 processes spread over 8 machines so there are 64 rows. You can hover over any rectangle to get more information about that task. You can also use the tools in the upper right to zoom around and focus on different regions in the computation. In this computation we can see that workers interleave reading bytes from S3 (light green) and parsing bytes to dataframes (dark green). The entire computation took about a minute and most of the workers were busy the entire time (little white space). Inter-worker communication is always depicted in red (and is absent in this relatively straightforward computation).

Introduction

Pandas provides an intuitive, powerful, and fast data analysis experience on tabular data. However, because Pandas uses only one thread of execution and requires all data to be in memory at once, it doesn’t scale well to datasets much beyond the gigabyte scale. Generally people move to Spark DataFrames on HDFS or a proper relational database to resolve this scaling issue. Dask is a Python library for parallel and distributed computing that aims to fill this need for parallelism among the PyData projects (NumPy, Pandas, Scikit-Learn, etc.). Dask dataframes combine Dask and Pandas to deliver a faithful “big data” version of Pandas operating in parallel over a cluster.

I’ve written about this topic before. This blogpost is newer and will focus on performance and newer features like fast shuffles and the Parquet format.

CSV Data and Basic Operations

I have an eight node cluster on EC2 of m4.2xlarges (eight cores, 30GB RAM each). Dask is running on each node with one process per core.

We have the 2015 Yellow Cab NYC Taxi data as 12 CSV files on S3. We look at that data briefly with s3fs:

>>> import s3fs
>>> s3 = s3fs.S3FileSystem()
>>> s3.ls('dask-data/nyc-taxi/2015/')
['dask-data/nyc-taxi/2015/yellow_tripdata_2015-01.csv',
 'dask-data/nyc-taxi/2015/yellow_tripdata_2015-02.csv',
 'dask-data/nyc-taxi/2015/yellow_tripdata_2015-03.csv',
 'dask-data/nyc-taxi/2015/yellow_tripdata_2015-04.csv',
 'dask-data/nyc-taxi/2015/yellow_tripdata_2015-05.csv',
 'dask-data/nyc-taxi/2015/yellow_tripdata_2015-06.csv',
 'dask-data/nyc-taxi/2015/yellow_tripdata_2015-07.csv',
 'dask-data/nyc-taxi/2015/yellow_tripdata_2015-08.csv',
 'dask-data/nyc-taxi/2015/yellow_tripdata_2015-09.csv',
 'dask-data/nyc-taxi/2015/yellow_tripdata_2015-10.csv',
 'dask-data/nyc-taxi/2015/yellow_tripdata_2015-11.csv',
 'dask-data/nyc-taxi/2015/yellow_tripdata_2015-12.csv']

This data is too large to fit into Pandas on a single computer. However, it can fit in memory if we break it up into many small pieces and load these pieces onto different computers across a cluster.

We connect a client to our Dask cluster, composed of one centralized dask-scheduler process and several dask-worker processes running on each of the machines in our cluster.

from dask.distributed import Client

client = Client('scheduler-address:8786')

And we load our CSV data using dask.dataframe which looks and feels just like Pandas, even though it’s actually coordinating hundreds of small Pandas dataframes. This takes about a minute to load and parse.

import dask.dataframe as dd

df = dd.read_csv('s3://dask-data/nyc-taxi/2015/*.csv',
                 parse_dates=['tpep_pickup_datetime', 'tpep_dropoff_datetime'],
                 storage_options={'anon': True})

df = client.persist(df)

This cuts up our 12 CSV files on S3 into a few hundred blocks of bytes, each 64MB large. On each of these 64MB blocks we then call pandas.read_csv to create a few hundred Pandas dataframes across our cluster, one for each block of bytes. Our single Dask Dataframe object, df, coordinates all of those Pandas dataframes. Because we’re just using Pandas calls it’s very easy for Dask dataframes to use all of the tricks from Pandas. For example we can use most of the keyword arguments from pd.read_csv in dd.read_csv without having to relearn anything.
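As a small hedged sketch (the column subset and dtype below are illustrative assumptions, not from the original post), ordinary pandas.read_csv keywords pass straight through:

subset = dd.read_csv('s3://dask-data/nyc-taxi/2015/*.csv',
                     usecols=['passenger_count', 'trip_distance', 'fare_amount'],
                     dtype={'passenger_count': 'uint8'},
                     storage_options={'anon': True})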

This data is about 20GB on disk or 60GB in RAM. It’s not huge, but is also larger than we’d like to manage on a laptop, especially if we value interactivity. The interactive image above is a trace over time of what each of our 64 cores was doing at any given moment. By hovering your mouse over the rectangles you can see that cores switched between downloading byte ranges from S3 and parsing those bytes with pandas.read_csv.

Our dataset includes every cab ride in the city of New York in the year of 2015, including when and where it started and stopped, a breakdown of the fare, etc.

>>> df.head()

|   | VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | pickup_longitude | pickup_latitude | RateCodeID | store_and_fwd_flag | dropoff_longitude | dropoff_latitude | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 2015-01-15 19:05:39 | 2015-01-15 19:23:42 | 1 | 1.59 | -73.993896 | 40.750111 | 1 | N | -73.974785 | 40.750618 | 1 | 12.0 | 1.0 | 0.5 | 3.25 | 0.0 | 0.3 | 17.05 |
| 1 | 1 | 2015-01-10 20:33:38 | 2015-01-10 20:53:28 | 1 | 3.30 | -74.001648 | 40.724243 | 1 | N | -73.994415 | 40.759109 | 1 | 14.5 | 0.5 | 0.5 | 2.00 | 0.0 | 0.3 | 17.80 |
| 2 | 1 | 2015-01-10 20:33:38 | 2015-01-10 20:43:41 | 1 | 1.80 | -73.963341 | 40.802788 | 1 | N | -73.951820 | 40.824413 | 2 | 9.5 | 0.5 | 0.5 | 0.00 | 0.0 | 0.3 | 10.80 |
| 3 | 1 | 2015-01-10 20:33:39 | 2015-01-10 20:35:31 | 1 | 0.50 | -74.009087 | 40.713818 | 1 | N | -74.004326 | 40.719986 | 2 | 3.5 | 0.5 | 0.5 | 0.00 | 0.0 | 0.3 | 4.80 |
| 4 | 1 | 2015-01-10 20:33:39 | 2015-01-10 20:52:58 | 1 | 3.00 | -73.971176 | 40.762428 | 1 | N | -74.004181 | 40.742653 | 2 | 15.0 | 0.5 | 0.5 | 0.00 | 0.0 | 0.3 | 16.30 |

Basic Aggregations and Groupbys

As a quick exercise, we compute the length of the dataframe. When we call len(df) Dask.dataframe translates this into many len calls on each of the constituent Pandas dataframes, followed by communication of the intermediate results to one node, followed by a sum of all of the intermediate lengths.

>>> len(df)
146112989

This takes around 400-500ms. You can see that a few hundred length computations happened quickly on the left, followed by some delay, then a bit of data transfer (the red bar in the plot), and a final summation call.

More complex operations like simple groupbys look similar, although sometimes with more communications. Throughout this post we’re going to do more and more complex computations and our profiles will similarly become more and more rich with information. Here we compute the average trip distance, grouped by number of passengers. We find that single and double person rides go far longer distances on average. We achieve this big-data groupby by performing many small Pandas groupbys and then cleverly combining their results.

>>> df.groupby(df.passenger_count).trip_distance.mean().compute()
passenger_count
0     2.279183
1    15.541413
2    11.815871
3     1.620052
4     7.481066
5     3.066019
6     2.977158
9     5.459763
7     3.303054
8     3.866298
Name: trip_distance, dtype: float64

As a more complex operation we see how well New Yorkers tip by hour of day and by day of week.

df2 = df[(df.tip_amount > 0) & (df.fare_amount > 0)]    # filter out bad rows
df2['tip_fraction'] = df2.tip_amount / df2.fare_amount  # make new column

dayofweek = (df2.groupby(df2.tpep_pickup_datetime.dt.dayofweek)
                .tip_fraction
                .mean())
hour      = (df2.groupby(df2.tpep_pickup_datetime.dt.hour)
                .tip_fraction
                .mean())
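The dayofweek and hour objects above are still lazy Dask series. A minimal sketch of one way to materialize and plot the hourly result (the plotting keywords here are illustrative assumptions):

hour.compute().plot(figsize=(10, 4))   # the aggregated result is tiny, so pandas plots it locally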

tip fraction by hour

We see that New Yorkers are generally pretty generous, tipping around 20%-25% on average. We also notice that they become very generous at 4am, tipping an average of 38%.

This more complex operation uses more of the Dask dataframe API (which mimics the Pandas API). Pandas users should find the code above fairly familiar. We remove rows with zero fare or zero tip (not every tip gets recorded), make a new column which is the ratio of the tip amount to the fare amount, and then groupby the day of week and hour of day, computing the average tip fraction for each hour/day.

Dask evaluates this computation with thousands of small Pandas calls across the cluster (try clicking the wheel zoom icon in the upper right of the image above and zooming in). The answer comes back in about 3 seconds.

Joins and Correlations

To show off more basic functionality we’ll join this Dask dataframe against a smaller Pandas dataframe that includes names of some of the more cryptic columns. Then we’ll correlate two derived columns to determine if there is a relationship between paying Cash and the recorded tip.

>>> payments = pd.Series({1: 'Credit Card',
                          2: 'Cash',
                          3: 'No Charge',
                          4: 'Dispute',
                          5: 'Unknown',
                          6: 'Voided trip'})

>>> df2 = df.merge(payments, left_on='payment_type', right_index=True)

>>> df2.groupby(df2.payment_name).tip_amount.mean().compute()
payment_name
Cash           0.000217
Credit Card    2.757708
Dispute       -0.011553
No Charge      0.003902
Unknown        0.428571
Name: tip_amount, dtype: float64

We see that while the average tip for a credit card transaction is $2.75, the average tip for a cash transaction is very close to zero. At first glance it seems like cash tips aren’t being reported. To investigate this a bit further let’s compute the Pearson correlation between paying cash and having zero tip. Again, this code should look very familiar to Pandas users.

zero_tip = df2.tip_amount == 0
cash = df2.payment_name == 'Cash'

dd.concat([zero_tip, cash], axis=1).corr().compute()

              tip_amount  payment_name
tip_amount      1.000000      0.943123
payment_name    0.943123      1.000000

So we see that standard operations like row filtering, column selection, groupby-aggregations, joining with a Pandas dataframe, correlations, etc. all look and feel like the Pandas interface. Additionally, we’ve seen through profile plots that most of the time is spent just running Pandas functions on our workers, so Dask.dataframe is, in most cases, adding relatively little overhead. These little functions represented by the rectangles in these plots are just pandas functions. For example the plot above has many rectangles labeled merge if you hover over them. This is just the standard pandas.merge function that we love and know to be very fast in memory.

Shuffles and Time Series

Distributed dataframe experts will know that none of the operations above require a shuffle. That is, we can do most of our work with relatively little inter-node communication. However, not all operations can avoid communication like this, and sometimes we need to exchange most of the data between different workers.

For example if our dataset is sorted by customer ID but we want to sort it by time then we need to collect all the rows for January over to one Pandas dataframe, all the rows for February over to another, etc.. This operation is called a shuffle and is the base of computations like groupby-apply, distributed joins on columns that are not the index, etc..

You can do a lot with dask.dataframe without performing shuffles, but sometimes it’s necessary. In the following example we sort our data by pickup datetime. This will allow fast lookups, fast joins, and fast time series operations, all common cases. We do one shuffle ahead of time to make all future computations fast.

We set the index as the pickup datetime column. This takes anywhere from 25-40s and is largely network bound (60GB, some text, eight machines with eight cores each on AWS non-enhanced network). This also requires running something like 16000 tiny tasks on the cluster. It’s worth zooming in on the plot below.

>>> df = client.persist(df.set_index('tpep_pickup_datetime'))

This operation is expensive, far more expensive than it was with Pandas when all of the data was in the same memory space on the same computer. This is a good time to point out that you should only use distributed tools like Dask.dataframe and Spark after tools like Pandas break down. We should only move to distributed systems when absolutely necessary. However, when it does become necessary, it’s nice knowing that Dask.dataframe can faithfully execute Pandas operations, even if some of them take a bit longer.

As a result of this shuffle our data is now nicely sorted by time, which will keep future operations close to optimal. We can see how the dataset is sorted by pickup time by quickly looking at the first entries, last entries, and entries for a particular day.

>>> df.head()  # has the first entries of 2015

| tpep_pickup_datetime | VendorID | tpep_dropoff_datetime | passenger_count | trip_distance | pickup_longitude | pickup_latitude | RateCodeID | store_and_fwd_flag | dropoff_longitude | dropoff_latitude | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2015-01-01 00:00:00 | 2 | 2015-01-01 00:00:00 | 3 | 1.56 | -74.001320 | 40.729057 | 1 | N | -74.010208 | 40.719662 | 1 | 7.5 | 0.5 | 0.5 | 0.0 | 0.0 | 0.3 | 8.8 |
| 2015-01-01 00:00:00 | 2 | 2015-01-01 00:00:00 | 1 | 1.68 | -73.991547 | 40.750069 | 1 | N | 0.000000 | 0.000000 | 2 | 10.0 | 0.0 | 0.5 | 0.0 | 0.0 | 0.3 | 10.8 |
| 2015-01-01 00:00:00 | 1 | 2015-01-01 00:11:26 | 5 | 4.00 | -73.971436 | 40.760201 | 1 | N | -73.921181 | 40.768269 | 2 | 13.5 | 0.5 | 0.5 | 0.0 | 0.0 | 0.0 | 14.5 |

>>> df.tail()  # has the last entries of 2015

| tpep_pickup_datetime | VendorID | tpep_dropoff_datetime | passenger_count | trip_distance | pickup_longitude | pickup_latitude | RateCodeID | store_and_fwd_flag | dropoff_longitude | dropoff_latitude | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2015-12-31 23:59:56 | 1 | 2016-01-01 00:09:25 | 1 | 1.00 | -73.973900 | 40.742893 | 1 | N | -73.989571 | 40.750549 | 1 | 8.0 | 0.5 | 0.5 | 1.85 | 0.0 | 0.3 | 11.15 |
| 2015-12-31 23:59:58 | 1 | 2016-01-01 00:05:19 | 2 | 2.00 | -73.965271 | 40.760281 | 1 | N | -73.939514 | 40.752388 | 2 | 7.5 | 0.5 | 0.5 | 0.00 | 0.0 | 0.3 | 8.80 |
| 2015-12-31 23:59:59 | 2 | 2016-01-01 00:10:26 | 1 | 1.96 | -73.997559 | 40.725693 | 1 | N | -74.017120 | 40.705322 | 2 | 8.5 | 0.5 | 0.5 | 0.00 | 0.0 | 0.3 | 9.80 |

>>> df.loc['2015-05-05'].head()  # has the entries for just May 5th

| tpep_pickup_datetime | VendorID | tpep_dropoff_datetime | passenger_count | trip_distance | pickup_longitude | pickup_latitude | RateCodeID | store_and_fwd_flag | dropoff_longitude | dropoff_latitude | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2015-05-05 | 2 | 2015-05-05 00:00:00 | 1 | 1.20 | -73.981941 | 40.766460 | 1 | N | -73.972771 | 40.758007 | 2 | 6.5 | 1.0 | 0.5 | 0.00 | 0.00 | 0.3 | 8.30 |
| 2015-05-05 | 1 | 2015-05-05 00:10:12 | 1 | 1.70 | -73.994675 | 40.750507 | 1 | N | -73.980247 | 40.738560 | 1 | 9.0 | 0.5 | 0.5 | 2.57 | 0.00 | 0.3 | 12.87 |
| 2015-05-05 | 1 | 2015-05-05 00:07:50 | 1 | 2.50 | -74.002930 | 40.733681 | 1 | N | -74.013603 | 40.702362 | 2 | 9.5 | 0.5 | 0.5 | 0.00 | 0.00 | 0.3 | 10.80 |

Because we know exactly which Pandas dataframe holds which data we can execute row-local queries like this very quickly. The total round trip from pressing enter in the interpreter or notebook is about 40ms. For reference, 40ms is the delay between two frames in a movie running at 25 Hz. This means that it’s fast enough that human users perceive this query to be entirely fluid.
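For readers who want to check this on their own cluster, a hedged sketch of timing such a lookup (in a notebook, %time would do the same job):

import time

start = time.time()
df.loc['2015-05-05'].head()                          # row-local lookup against the sorted index
print('round trip: %.0f ms' % ((time.time() - start) * 1000))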

Time Series

Additionally, once we have a nice datetime index all of Pandas’ time series functionality becomes available to us.

For example we can resample by day:

>>> (df.passenger_count
       .resample('1d')
       .mean()
       .compute()
       .plot())

resample by day

We observe a strong periodic signal here. The number of passengers is reliably higher on the weekends.

We can perform a rolling aggregation in about a second:

>>> s = client.persist(df.passenger_count.rolling(10).mean())

Because Dask.dataframe inherits the Pandas index all of these operations become very fast and intuitive.
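As a further hedged sketch (the column choice here is an arbitrary illustration), other index-based operations follow the same pattern:

# Hourly ride counts computed against the sorted datetime index
counts = df.trip_distance.resample('1h').count().compute()
counts.plot()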

Parquet

Pandas’ standard “fast” recommended storage solution has generally been the HDF5 data format. Unfortunately the HDF5 file format is not ideal for distributed computing, so most Dask dataframe users have had to switch down to CSV historically. This is unfortunate because CSV is slow, doesn’t support partial queries (you can’t read in just one column), and also isn’t supported well by the other standard distributed Dataframe solution, Spark. This makes it hard to move data back and forth.

Fortunately there are now two decent Python readers for Parquet, a fast columnar binary store that shards nicely on distributed data stores like the Hadoop File System (HDFS, not to be confused with HDF5) and Amazon’s S3. The already fast Parquet-cpp project has been growing Python and Pandas support through Arrow, and the Fastparquet project, which is an offshoot from the pure-python parquet library, has been growing speed through use of NumPy and Numba.

Using Fastparquet under the hood, Dask.dataframe users can now happily read and write to Parquet files. This increases speed, decreases storage costs, and provides a shared format that both Dask dataframes and Spark dataframes can understand, improving the ability to use both computational systems in the same workflow.

Writing our Dask dataframe to S3 can be as simple as the following:

df.to_parquet('s3://dask-data/nyc-taxi/tmp/parquet')

However there are also a variety of options we can use to store our data more compactly through compression, encodings, etc.. Expert users will probably recognize some of the terms below.

df = df.astype({'VendorID': 'uint8',
                'passenger_count': 'uint8',
                'RateCodeID': 'uint8',
                'payment_type': 'uint8'})

df.to_parquet('s3://dask-data/nyc-taxi/tmp/parquet',
              compression='snappy',
              has_nulls=False,
              object_encoding='utf8',
              fixed_text={'store_and_fwd_flag': 1})

We can then read our nicely indexed dataframe back with the dd.read_parquet function:

>>> df2 = dd.read_parquet('s3://dask-data/nyc-taxi/tmp/parquet')

The main benefit here is that we can quickly compute on single columns. The following computation runs in around 6 seconds, even though we don’t have any data in memory to start (recall that we started this blogpost with a minute-long call to read_csv and Client.persist).

>>> df2.passenger_count.value_counts().compute()
1    102991045
2     20901372
5      7939001
3      6135107
6      5123951
4      2981071
0        40853
7          239
8          181
9          169
Name: passenger_count, dtype: int64

Final Thoughts

With the recent addition of faster shuffles and Parquet support, Dask dataframes become significantly more attractive. This blogpost gave a few categories of common computations, along with precise profiles of their execution on a small cluster. Hopefully people find this combination of Pandas syntax and scalable computing useful.

Now would also be a good time to remind people that Dask dataframe is only one module among many within the Dask project. Dataframes are nice, certainly, but Dask’s main strength is its flexibility to move beyond just plain dataframe computations to handle even more complex problems.

Learn More

If you’d like to learn more about Dask dataframe, the Dask distributed system, or other components you should look at the following documentation:

  1. http://dask.pydata.org/en/latest/
  2. http://distributed.readthedocs.io/en/latest/

The workflows presented here are captured in the following notebooks (among other examples):

  1. NYC Taxi example, shuffling, others
  2. Parquet

What we could have done better

As always with computational posts we include a section on what went wrong, or what could have gone better.

  1. The 400ms computation of len(df) is a regression from previous versions where this was closer to 100ms. We’re getting bogged down somewhere in many small inter-worker communications.
  2. It would be nice to repeat this computation at a larger scale. Dask deployments in the wild are often closer to 1000 cores rather than the 64 core cluster we have here and datasets are often in the terabyte scale rather than our 60 GB NYC Taxi dataset. Unfortunately representative large open datasets are hard to find.
  3. The Parquet timings are nice, but there is still room for improvement. We seem to be making many small expensive queries of S3 when reading Thrift headers.
  4. It would be nice to support both Python Parquet readers: the Numba solution fastparquet and the C++ solution parquet-cpp.

Distributed NumPy on a Cluster with Dask Arrays


This work is supported by Continuum Analytics, the XDATA Program, and the Data Driven Discovery Initiative from the Moore Foundation.

This page includes embedded large profiles. It may look better on the actual site (rather than through syndicated pages like planet.python), and it may take a while to load on non-broadband connections (total size is around 20MB).

Summary

We analyze a stack of images in parallel with NumPy arrays distributed across a cluster of machines on Amazon’s EC2 with Dask array. This is a model application shared among many image analysis groups ranging from satellite imagery to bio-medical applications. We go through a series of common operations:

  1. Inspect a sample of images locally with Scikit Image
  2. Construct a distributed Dask.array around all of our images
  3. Process and re-center images with Numba
  4. Transpose data to get a time-series for every pixel, compute FFTs

This last step is quite fun. Even if you skim through the rest of this article I recommend checking out the last section.

Inspect Dataset

I asked a colleague at the US National Institutes of Health (NIH) for a biggish imaging dataset. He came back with the following message:

Electron microscopy may be generating the biggest ndarray datasets in the field - terabytes regularly. Neuroscience needs EM to see connections between neurons, because the critical features of neural synapses (connections) are below the diffraction limit of light microscopes. This type of research has been called “connectomics”. Many groups are looking at machine vision approaches to follow small neuron parts from one slice to the next.

This data is from drosophila: http://emdata.janelia.org/. Here is an example 2d slice of the data http://emdata.janelia.org/api/node/bf1/grayscale/raw/xy/2000_2000/1800_2300_5000.

import skimage.io
import matplotlib.pyplot as plt

sample = skimage.io.imread('http://emdata.janelia.org/api/node/bf1/grayscale/raw/xy/2000_2000/1800_2300_5000')
skimage.io.imshow(sample)

Sample electron microscopy image from stack

The last number in the URL is an index into a large stack of about 10000 images. We can change that number to get different slices through our 3D dataset.

samples = [skimage.io.imread(
               'http://emdata.janelia.org/api/node/bf1/grayscale/raw/xy/2000_2000/1800_2300_%d' % i)
           for i in [1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000]]

fig, axarr = plt.subplots(1, 9, sharex=True, sharey=True, figsize=(24, 2.5))
for i, sample in enumerate(samples):
    axarr[i].imshow(sample, cmap='gray')

Sample electron microscopy images over time

We see that our field of interest wanders across the frame over time and drops off in the beginning and at the end.

Create a Distributed Array

Even though our data is spread across many files, we still want to think of it as a single logical 3D array. We know how to get any particular 2D slice of that array using Scikit-image. Now we’re going to use Dask.array to stitch all of those Scikit-image calls into a single distributed array.

import dask.array as da
from dask import delayed

imread = delayed(skimage.io.imread, pure=True)  # Lazy version of imread

urls = ['http://emdata.janelia.org/api/node/bf1/grayscale/raw/xy/2000_2000/1800_2300_%d' % i
        for i in range(10000)]                  # A list of our URLs

lazy_values = [imread(url) for url in urls]     # Lazily evaluate imread on each url

arrays = [da.from_delayed(lazy_value,           # Construct a small Dask array
                          dtype=sample.dtype,   # for every lazy value
                          shape=sample.shape)
          for lazy_value in lazy_values]

stack = da.stack(arrays, axis=0)                # Stack all small Dask arrays into one

>>> stack
dask.array<shape=(10000, 2000, 2000), dtype=uint8, chunksize=(1, 2000, 2000)>

>>> stack = stack.rechunk((20, 2000, 2000))     # combine chunks to reduce overhead
>>> stack
dask.array<shape=(10000, 2000, 2000), dtype=uint8, chunksize=(20, 2000, 2000)>

So here we’ve constructed a lazy Dask.array from 10 000 delayed calls to skimage.io.imread. We haven’t done any actual work yet, we’ve just constructed a parallel array that knows how to get any particular slice of data by downloading the right image if necessary. This gives us a full NumPy-like abstraction on top of all of these remote images. For example we can now download a particular image just by slicing our Dask array.

>>> stack[5000, :, :].compute()
array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=uint8)

>>> stack[5000, :, :].mean().compute()
11.49902425

However we probably don’t want to operate too much further without connecting to a cluster. That way we can just download all of the images once into distributed RAM and start doing some real computations. I happen to have ten m4.2xlarges on Amazon’s EC2 (8 cores, 30GB RAM each) running Dask workers. So we’ll connect to those.

from dask.distributed import Client, progress

client = Client('scheduler-address:8786')

>>> client
<Client: scheduler="scheduler-address:8786" processes=10 cores=80>

I’ve replaced the actual address of my scheduler (something like 54.183.180.153) with scheduler-address. Let’s go ahead and bring in all of our images, persisting the array into concrete data in memory.

stack=client.persist(stack)
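The progress function imported alongside Client earlier is useful here; a small optional sketch for interactive sessions:

progress(stack)   # shows a progress bar while the persisted chunks load into distributed memory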

This starts downloads of our 10 000 images across our 10 workers. When this completes we have 10 000 NumPy arrays spread around on our cluster, coordinated by our single logical Dask array. This takes a while, about five minutes. We’re mostly network bound here (Janelia’s servers are not co-located with our compute nodes). Here is a parallel profile of the computation as an interactive Bokeh plot.

There will be a few of these profile plots throughout the blogpost, so you might want to familiarize yourself with them now. Every horizontal rectangle in this plot corresponds to a single Python function running somewhere in our cluster over time. Because we called skimage.io.imread 10 000 times there are 10 000 purple rectangles. Their position along the y-axis denotes which of the 80 cores in our cluster that they ran on and their position along the x-axis denotes their start and stop times. You can hover over each rectangle (function) for more information on what kind of task it was, how long it took, etc.. In the image below, purple rectangles are skimage.io.imread calls and red rectangles are data transfer between workers in our cluster. Click the magnifying glass icons in the upper right of the image to enable zooming tools.

Now that we have persisted our Dask array in memory our data is based on hundreds of concrete in-memory NumPy arrays across the cluster, rather than based on hundreds of lazy scikit-image calls. Now we can do all sorts of fun distributed array computations more quickly.

For example we can easily see our field of interest move across the frame by averaging across time:

skimage.io.imshow(stack.mean(axis=0).compute())

Average image over time

Or we can see when the field of interest is actually present within the frame by averaging across x and y:

plt.plot(stack.mean(axis=[1, 2]).compute())

Image brightness over time

By looking at the profile plots for each case we can see that averaging over time involves much more inter-node communication, which can be quite expensive in this case.

Recenter Images with Numba

In order to remove the spatial offset across time we’re going to compute a centroid for each slice and then crop the image around that center. I looked up centroids in the Scikit-Image docs and came across a function that did way more than what I was looking for, so I just quickly coded up a solution in Pure Python and then JIT-ed it with Numba (which makes this run at C-speeds).

from numba import jit

@jit(nogil=True)
def centroid(im):
    n, m = im.shape
    total_x = 0
    total_y = 0
    total = 0
    for i in range(n):
        for j in range(m):
            total += im[i, j]
            total_x += i * im[i, j]
            total_y += j * im[i, j]
    if total > 0:
        total_x /= total
        total_y /= total
    return total_x, total_y

>>> centroid(sample)  # this takes around 9ms
(748.7325324581344, 802.4893005160851)

def recenter(im):
    x, y = centroid(im.squeeze())
    x, y = int(x), int(y)
    if x < 500:
        x = 500
    if y < 500:
        y = 500
    if x > 1500:
        x = 1500
    if y > 1500:
        y = 1500

    return im[..., x - 500:x + 500, y - 500:y + 500]

plt.figure(figsize=(8, 8))
skimage.io.imshow(recenter(sample))

Recentered image

Now we map this function across our distributed array.

import numpy as np

def recenter_block(block):
    """ Recenter a short stack of images """
    return np.stack([recenter(block[i]) for i in range(block.shape[0])])

recentered = stack.map_blocks(recenter_block,
                              chunks=(20, 1000, 1000),  # chunk size changes
                              dtype=stack.dtype)

recentered = client.persist(recentered)

This profile provides a good opportunity to talk about a scheduling failure; things went a bit wrong here. Towards the beginning we quickly recenter several images (Numba is fast), taking around 300-400ms for each block of twenty images. However as some workers finish all of their allotted tasks, the scheduler erroneously starts to load balance, moving images from busy workers to idle workers. Unfortunately the network at this time appeared to be much slower than expected and so the move + compute elsewhere strategy ended up being much slower than just letting the busy workers finish their work. The scheduler keeps track of expected compute times and transfer times precisely to avoid mistakes like this one. These sorts of issues are rare, but do occur on occasion.

We check our work by averaging our re-centered images across time and displaying that to the screen. We see that our images are better centered with each other as expected.

skimage.io.imshow(recentered.mean(axis=0))

Recentered time average

This shows how easy it is to create fast in-memory code with Numba and then scale it out with Dask.array. The two projects complement each other nicely, giving us near-optimal performance with intuitive code across a cluster.

Rechunk to Time Series by Pixel

We’re now going to rearrange our data from being partitioned by time slice, to being partitioned by pixel. This will allow us to run computations like Fast Fourier Transforms (FFTs) on each time series efficiently. Switching the chunk pattern back and forth like this is generally a very difficult operation for distributed arrays because every slice of the array contributes to every time-series. We have N-squared communication.

This analysis may not be appropriate for this data (we won’t learn any useful science from doing this), but it represents a very frequently asked question, so I wanted to include it.

Currently our Dask array has a chunk shape of (20, 1000, 1000), meaning that our data is collected into 500 NumPy arrays across the cluster, each of size (20, 1000, 1000).

>>> recentered
dask.array<shape=(10000, 1000, 1000), dtype=uint8, chunksize=(20, 1000, 1000)>

But we want to change this shape so that the chunks cover the entire first axis. We want all data for any particular pixel to be in the same NumPy array, not spread across hundreds of different NumPy arrays. We could solve this by rechunking so that each pixel is its own block like the following:

>>> rechunked = recentered.rechunk((10000, 1, 1))

However this would result in one million chunks (there are one million pixels) which will result in a bit of scheduling overhead. Instead we’ll collect our time-series into 10 x 10 groups of one hundred pixels. This will help us to reduce overhead.

>>> # rechunked = recentered.rechunk((10000, 1, 1))  # Too many chunks
>>> rechunked = recentered.rechunk((10000, 10, 10))   # Use larger chunks

Now we compute the FFT of each pixel, take the absolute value and square to get the power spectrum. Finally to conserve space we’ll down-grade the dtype to float32 (our original data is only 8-bit anyway).

x = da.fft.fft(rechunked, axis=0)
power = abs(x ** 2).astype('float32')

power = client.persist(power, optimize_graph=False)

This is a fun profile to inspect; it includes both the rechunking and the subsequent FFTs. We’ve included a real-time trace during execution, the full profile, as well as some diagnostics plots from a single worker. These plots total up to around 20MB. I sincerely apologize to those without broadband access.

Here is a real time plot of the computation finishing over time:

Dask task stream of rechunk + fft

And here is a single interactive plot of the entire computation after it completes. Zoom with the tools in the upper right. Hover over rectangles to get more information. Remember that red is communication.

Screenshots of the diagnostic dashboard of a single worker during this computation.

Worker communications during FFT

This computation starts with a lot of communication while we rechunk and realign our data (recent optimizations here by Antoine Pitrou in dask #417). Then we transition into doing thousands of small FFTs and other arithmetic operations. All of the plots above show a nice transition from heavy communication to heavy processing with some overlap each way (once some complex blocks are available we get to start overlapping communication and computation). Inter-worker communication was around 100-300 MB/s (typical for Amazon’s EC2) and CPU load remained high. We’re using our hardware.

Finally we can inspect the results. We see that the power spectrum is very boring in the corner, and has typical activity towards the center of the image.

plt.semilogy(1 + power[:, 0, 0].compute())

Power spectrum near edge

plt.semilogy(1 + power[:, 500, 500].compute())

Power spectrum at center

Final Thoughts

This blogpost showed a non-trivial image processing workflow, emphasizing the following points:

  1. Construct a Dask array from lazy SKImage calls.
  2. Use NumPy syntax with Dask.array to aggregate distributed data across a cluster.
  3. Build a centroid function with Numba. Use Numba and Dask together to clean up an image stack.
  4. Rechunk to facilitate time-series operations. Perform FFTs.

Hopefully this example has components that look similar to what you want to do with your data on your hardware. We would love to see more applications like this out there in the wild.

What we could have done better

As always with computationally focused blogposts, we’ll include a section on what went wrong and what we could have done better with more time.

  1. Communication is too expensive: Inter-worker communications that should be taking 200ms are taking up to 10 or 20 seconds. We need to take a closer look at our communications pipeline (which normally performs just fine on other computations) to see if something is acting up. Discussion here: dask/distributed #776, and early work here: dask/distributed #810.
  2. Faulty load balancing: We discovered a case where our load-balancing heuristics misbehaved, incorrectly moving data between workers when it would have been better to leave everything alone. This is likely due to the oddly low bandwidth issues observed above.
  3. Loading from disk blocks network I/O: While doing this we discovered an issue where loading large amounts of data from disk can block workers from responding to network requests (dask/distributed #774)
  4. Larger datasets: It would be fun to try this on a much larger dataset to see how the solutions here scale.

Dask Development Log


This work is supported by Continuum Analytics, the XDATA Program, and the Data Driven Discovery Initiative from the Moore Foundation.

To increase transparency I’m blogging weekly about the work done on Dask and related projects during the previous week. This log covers work done between 2017-01-01 and 2017-01-17. Nothing here is ready for production. This blogpost is written in haste, so refined polish should not be expected.

Themes of the last couple of weeks:

  1. Stability enhancements for the distributed scheduler and micro-release
  2. NASA Grant writing
  3. Dask-EC2 script
  4. Dataframe categorical flexibility (work in progress)
  5. Communication refactor (work in progress)

Stability enhancements and micro-release

We’ve released dask.distributed version 1.15.1, which includes important bugfixes after the recent 1.15.0 release. There were a number of small issues that coordinated to remove tasks erroneously. This was generally OK because the Dask scheduler was able to heal the missing pieces (using the same machinery that makes Dask resilient) and so we didn’t notice the flaw until the system was deployed in some of the more serious Dask deployments in the wild. PR dask/distributed #804 contains a full writeup in case anyone is interested. The writeup ends with the following line:

This was a nice exercise in how coupling mostly-working components can easily yield a faulty system.

This release also includes other fixes, like one for a compatibility issue with the new Bokeh 0.12.4 release.

NASA Grant Writing

I’ve been writing a proposal to NASA to help fund distributed Dask+XArray work for atmospheric and oceanographic science at the 100TB scale. Many thanks to our scientific collaborators who are offering support here.

Dask-EC2 startup

The Dask-EC2 project deploys Anaconda, a Dask cluster, and Jupyter notebooks on Amazon’s Elastic Compute Cloud (EC2) with a small command line interface:

pip install dask-ec2 --upgrade
dask-ec2 up --keyname KEYNAME \
            --keypair /path/to/ssh-key \
            --type m4.2xlarge \
            --count 8

This project can be either very useful for people just getting started and for Dask developers when we run benchmarks, or it can be horribly broken if AWS or Dask interfaces change and we don’t keep this project maintained. Thanks to a great effort from Ben Zaitlen, dask-ec2 is again in a very useful state, where I’m hoping it will stay for some time.

If you’ve always wanted to try Dask on a real cluster and if you already have AWS credentials then this is probably the easiest way.

This already seems to be paying dividends. There have been a few unrelated pull requests from new developers this week.

Dataframe Categorical Flexibility

Categoricals can significantly improve performance on text-based data. Currently Dask’s dataframes support categoricals, but they expect to know all of the categories up-front. This is easy if this set is small, like the ["Healthy", "Sick"] categories that might arise in medical research, but requires a full dataset read if the categories are not known ahead of time, like the names of all of the patients.

Jim Crist is changing this so that Dask can operate on categorical columns with unknown categories at dask/dask #1877. The constituent pandas dataframes all have possibly different categories that are merged as necessary. This distinction may seem small, but it limits performance in a surprising number of real-world use cases.
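As a rough sketch of how categoricals are declared today (this is not the dask/dask #1877 work itself, and the column and category values below are made up):

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({'status': ['Healthy', 'Sick', 'Healthy', 'Sick'],
                    'patient': ['Alice', 'Bob', 'Charlie', 'Dan']})
ddf = dd.from_pandas(pdf, npartitions=2)

# categorize() currently makes a full pass over the data to learn the categories up front;
# the work described above relaxes this requirement for columns with unknown categories.
ddf = ddf.categorize(columns=['status', 'patient'])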

Communication Refactor

Since the recent worker refactor and optimizations it has become clear that inter-worker communication has become a dominant bottleneck in some intensive applications. Antoine Pitrou is currently refactoring Dask’s network communication layer, making room for more communication options in the future. This is an ambitious project. I for one am very happy to have someone like Antoine looking into this.

Custom Parallel Algorithms on a Cluster with Dask


This work is supported by Continuum Analytics, the XDATA Program, and the Data Driven Discovery Initiative from the Moore Foundation.

Summary

This post describes Dask as a computational task scheduler that fits somewhere on a spectrum between big data computing frameworks like Hadoop/Spark and task schedulers like Airflow/Celery/Luigi. We see how, by combining elements from both of these types of systems, Dask is able to handle complex data science problems particularly well.

This post is in contrast to two recent posts on structured parallel collections:

  1. Distributed DataFrames
  2. Distributed Arrays

Big Data Collections

Most distributed computing systems like Hadoop or Spark or SQL databases implement a small but powerful set of parallel operations like map, reduce, groupby, and join. As long as you write your programs using only those operations then the platforms understand your program and serve you well. Most of the time this is great because most big data problems are pretty simple.

However, as we explore new complex algorithms or messier data science problems, these large parallel operations start to become insufficiently flexible. For example, consider the following data loading and cleaning problem:

  1. Load data from 100 different files (this is a simple map operation)
  2. Also load a reference dataset from a SQL database (not parallel at all, but could run alongside the map above)
  3. Normalize each of the 100 datasets against the reference dataset (sort of like a map, but with another input)
  4. Consider a sliding window of every three normalized datasets (Might be able to hack this with a very clever join? Not sure.)
  5. Of all of the 98 outputs of the last stage, consider all pairs. (Join or cartesian product) However, because we don’t want to compute all ~10000 possibilities, let’s just evaluate a random sample of these pairs
  6. Find the best of all of these possibilities (reduction)

In sequential for-loopy code this might look like the following:

filenames = ['mydata-%d.dat' % i for i in range(100)]
data = [load(fn) for fn in filenames]

reference = load_from_sql('sql://mytable')
processed = [process(d, reference) for d in data]

rolled = []
for i in range(len(processed) - 2):
    a = processed[i]
    b = processed[i + 1]
    c = processed[i + 2]
    r = roll(a, b, c)
    rolled.append(r)

compared = []
for i in range(200):
    a = random.choice(rolled)
    b = random.choice(rolled)
    c = compare(a, b)
    compared.append(c)

best = reduction(compared)

This code is clearly parallelizable, but it’s not clear how to write it down as a MapReduce program, a Spark computation, or a SQL query. These tools generally fail when asked to express complex or messy problems. We can still use Hadoop/Spark to solve this problem, but we are often forced to change and simplify our objectives a bit. (This problem is not particularly complex, and I suspect that there are clever ways to do it, but it’s not trivial and often inefficient.)

Task Schedulers

So instead people use task schedulers like Celery, Luigi, or Airflow. These systems track hundreds of tasks, each of which is just a normal Python function that runs on some normal Python data. The task scheduler tracks dependencies between tasks and so runs as many as it can at once if they don’t depend on each other.

This is a far more granular approach than the Big-Bulk-Collection approach of MapReduce and Spark. However systems like Celery, Luigi, and Airflow are also generally less efficient. This is both because they know less about their computations (map is much easier to schedule than an arbitrary graph) and because they just don’t have machinery for inter-worker communication, efficient serialization of custom datatypes, etc..

Dask Mixes Task Scheduling with Efficient Computation

Dask is both a big data system like Hadoop/Spark that is aware of resilience, inter-worker communication, live state, etc. and also a general task scheduler like Celery, Luigi, or Airflow, capable of arbitrary task execution.

Many Dask users use something like Dask dataframe, which generates these graphs automatically, and so never really observe the task scheduler aspect of Dask. This is, however, the core of what distinguishes Dask from other systems like Hadoop and Spark. Dask is incredibly flexible in the kinds of algorithms it can run. This is because, at its core, it can run any graph of tasks and not just map, reduce, groupby, join, etc.. Users can do this natively, without having to subclass anything or extend Dask to get this extra power.

There are significant performance advantages to this. For example:

  1. Dask.dataframe can easily represent nearest neighbor computations for fast time-series algorithms
  2. Dask.array can implement complex linear algebra solvers or SVD algorithms from the latest research (see the short sketch after this list)
  3. Complex Machine Learning algorithms are often easier to implement in Dask, allowing it to be more efficient through smarter algorithms, as well as through scalable computing.
  4. Complex hierarchies from bespoke data storage solutions can be explicitly modeled and loaded in to other Dask systems
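As a hedged illustration of the second point above, here is an approximate SVD expressed directly against the Dask.array API (the shape, chunk size, and rank k are arbitrary choices):

import dask.array as da

x = da.random.random((100000, 1000), chunks=(10000, 1000))  # a large random matrix
u, s, v = da.linalg.svd_compressed(x, k=10)                 # approximate, fully parallel SVD
s.compute()                                                 # singular values as a small NumPy array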

This doesn’t come for free. Dask’s scheduler has to be very intelligent to smoothly schedule arbitrary graphs while still optimizing for data locality, worker failure, minimal communication, load balancing, scarce resources like GPUs and more. It’s a tough job.

Dask.delayed

So let’s go ahead and run the data ingestion job described with Dask.

We craft some fake functions to simulate actual work:

import random
from time import sleep

def load(address):
    sleep(random.random() / 2)

def load_from_sql(address):
    sleep(random.random() / 2 + 0.5)

def process(data, reference):
    sleep(random.random() / 2)

def roll(a, b, c):
    sleep(random.random() / 5)

def compare(a, b):
    sleep(random.random() / 10)

def reduction(seq):
    sleep(random.random() / 1)

We annotate these functions with dask.delayed, which changes a function so that instead of running immediately it captures its inputs and puts everything into a task graph for future execution.

from dask import delayed

load = delayed(load)
load_from_sql = delayed(load_from_sql)
process = delayed(process)
roll = delayed(roll)
compare = delayed(compare)
reduction = delayed(reduction)

Now we just call our normal Python for-loopy code from before. However now rather than run immediately our functions capture a computational graph that can be run elsewhere.

filenames = ['mydata-%d.dat' % i for i in range(100)]
data = [load(fn) for fn in filenames]

reference = load_from_sql('sql://mytable')
processed = [process(d, reference) for d in data]

rolled = []
for i in range(len(processed) - 2):
    a = processed[i]
    b = processed[i + 1]
    c = processed[i + 2]
    r = roll(a, b, c)
    rolled.append(r)

compared = []
for i in range(200):
    a = random.choice(rolled)
    b = random.choice(rolled)
    c = compare(a, b)
    compared.append(c)

best = reduction(compared)

Here is an image of that graph for a smaller input of only 10 files and 20 random pairs

Custom ETL Dask Graph

We can connect to a small cluster with 20 cores

from dask.distributed import Client

client = Client('scheduler-address:8786')

We compute the result and see the trace of the computation running in real time.

result = best.compute()

Custom ETL Task Stream

The completed Bokeh image below is interactive. You can pan and zoom by selecting the tools in the upper right. You can see every task, which worker it ran on and how long it took by hovering over the rectangles.

We see that we use all 20 cores well. Intermediate results are transferred between workers as necessary (these are the red rectangles). We can scale this up as necessary. Dask scales to thousands of cores.

Final Thoughts

Dask’s ability to write down arbitrary computational graphs Celery/Luigi/Airflow-style and yet run them with the scalability promises of Hadoop/Spark allows for a pleasant freedom to write comfortably and yet still compute scalably. This ability opens up new possibilities both to support more sophisticated algorithms and also to handle messy situations that arise in the real world (enterprise data systems are sometimes messy) while still remaining within the bounds of “normal and supported” Dask operation.

Dask Development Log


This work is supported by Continuum Analytics, the XDATA Program, and the Data Driven Discovery Initiative from the Moore Foundation.

To increase transparency I’m blogging weekly about the work done on Dask and related projects during the previous week. This log covers work done between 2017-01-17 and 2017-01-30. Nothing here is ready for production. This blogpost is written in haste, so refined polish should not be expected.

Themes of the last couple of weeks:

  1. Micro-release of distributed scheduler
  2. as_completed for asynchronous algorithms
  3. Testing ZeroMQ and communication refactor
  4. Persist and Dask-GLM

Stability enhancements and micro-release

We’ve released dask.distributed version 1.15.2, which includes some important performance improvements for communicating multi-dimensional arrays, cleaner scheduler shutdown of workers for adaptive deployments, an improved as_completed iterator that can accept new futures in flight, and a few other smaller features.

The feature here that excites me the most is improved communication of multi-dimensional arrays across the network. In a recent blogpost about image processing on a cluster we noticed that communication bandwidth was far lower than expected. This led us to uncover a flaw in our compression heuristics. Dask doesn’t compress all data, instead it takes a few 10kB samples of the data, compresses them, and if that goes well, decides to compress the entire thing. Unfortunately, due to our mishandling of memoryviews we ended up taking much larger samples than 1kB when dealing with multi-dimensional arrays.
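The idea behind that heuristic looks roughly like the following sketch (a simplification, not Dask's actual code; the function name and thresholds here are made up):

import random
import zlib

def maybe_compress(payload, sample_size=10000, nsamples=5, ratio=0.7):
    # Compress the full payload only if a few small samples compress well.
    starts = [random.randint(0, max(0, len(payload) - sample_size)) for _ in range(nsamples)]
    samples = [payload[start:start + sample_size] for start in starts]
    if sum(len(zlib.compress(s)) for s in samples) < ratio * sum(len(s) for s in samples):
        return 'zlib', zlib.compress(payload)  # samples shrank enough: compress everything
    return None, payload                       # otherwise send the data uncompressed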

XArray release

This improvement is particularly timely because a new release of XArray (a project that wraps around Dask.array for large labeled arrays) is now available with better data ingestion support for NetCDF data on distributed clusters. This opens up distributed array computing to Dask’s first (and possibly largest) scientific user community, atmospheric scientists.

as_completed accepts futures.

In addition to arrays, dataframes, and the delayed decorator, Dask.distributed also implements the concurrent.futures interface from the standard library (except that Dask’s version parallelizes across a cluster and has a few other benefits). Part of this interface is the as_completed function, which takes in a list of futures and yields those futures in the order in which they finish. This enables the construction of fairly responsive asynchronous computations. As soon as some work finishes you can look at the result and submit more work based on the current state.

That has been around in Dask for some time.

What’s new is that you can now push more futures into as_completed:

futures = client.map(my_function, sequence)

ac = as_completed(futures)

for future in ac:
    result = future.result()  # future is already finished, so this is fast
    if condition:
        new_future = client.submit(function, *args, **kwargs)
        ac.add(new_future)    # <<---- This is the new ability

So the as_completed iterator can keep going for quite a while with a set of futures always in flight. This relatively simple change allows for the easy expression of a broad set of asynchronous algorithms.

ZeroMQ and Communication Refactor

As part of a large effort, Antoine Pitrou has been refactoring Dask’s communication system. One sub-goal of this refactor was to allow us to explore other transport mechanisms than Tornado IOStreams in a pluggable way.

One such alternative is ZeroMQ sockets. We’ve gotten both incredibly positive and incredibly negative praise/warnings about using ZeroMQ. It’s not a great fit because Dask mostly just does point-to-point communication, so we don’t benefit from all of ZeroMQ’s patterns, which now become more of a hindrance than a benefit. However, we were interested in the performance impact of managing all of our network communication in a completely separately managed C++ thread unaffected by GIL issues.

Whether you hate or love ZeroMQ, you can now pick and choose. Antoine’s branch allows for easy swapping between transport mechanisms and opens the possibility for more in the future, like intra-process communication with Queues, MPI, etc.. This doesn’t affect most users, but some Dask deployments are on supercomputers with exotic networks capable of very fast speeds. The possibility that we might tap into Infiniband someday and have the ability to manage data locally without copies (Tornado does not achieve this) is very attractive to some user communities.

After very preliminary benchmarking we’ve found that ZeroMQ offers a small speedup, but results in a lack of stability in the cluster under complex workloads (likely our fault, not ZeroMQ’s, but still). ZeroMQ support is strictly experimental and will likely stay that way for a long time. Readers should not get excited about this.

Persist and Dask-GLM

In collaboration with researchers at Capital One we’ve been working on a set of parallel solvers for first and second order methods, such as are commonly used in a broad class of statistical and machine learning algorithms.

One challenge in this process has been building algorithms that are simultaneously optimal for both the single machine and distributed schedulers. The distributed scheduler requires that we think about where data is, on the client or on the cluster, whereas for the single machine scheduler this is less of a problem. The distributed scheduler appropriately has a new verb, persist, which keeps data as a Dask collection but triggers all of the internal computation:

compute(dask array)  ->  numpy array
persist(dask array)  ->  dask array
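A hedged sketch of the difference on a Dask array, using the collection-level dask.persist function (the array and its shape are arbitrary):

import dask
import dask.array as da

x = da.random.random((10000, 10000), chunks=(1000, 1000))
y = (x + x.T).mean(axis=0)

result = y.compute()     # a concrete NumPy array, pulled back to the client
(y,) = dask.persist(y)   # still a dask array, but now backed by finished results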

We have now mirrored this verb to the single machine scheduler in dask/dask #1927 and we get very nice performance on dask-glm’s algorithms in both cases now.

Working with the developers at Capital One has been very rewarding. I would like to find more machine learning groups that fit the following criteria:

  1. Are focused on performance
  2. Need parallel computing
  3. Are committed to building good open source software
  4. Are sufficiently expert in their field to understand correct algorithms

If you know such a person or such a group, either in industry or just a grad student in university, please encourage them to raise an issue at http://github.com/dask/dask/issues/new. I will likely write a larger blogpost on this topic in the near future.

Two Easy Ways to Use Scikit Learn and Dask


This work is supported by Continuum Analytics, the XDATA Program, and the Data Driven Discovery Initiative from the Moore Foundation.

Summary

This post describes two simple ways to use Dask to parallelize Scikit-Learn operations either on a single computer or across a cluster.

  1. Use the Dask Joblib backend
  2. Use the dklearn project’s drop-in replacements for Pipeline, GridSearchCV, and RandomSearchCV

For the impatient, these look like the following:

### Joblib
from joblib import parallel_backend

with parallel_backend('dask.distributed', scheduler_host='scheduler-address:8786'):
    # your now-cluster-ified sklearn code here
    ...


### Dask-learn pipeline and GridSearchCV drop-in replacements

# from sklearn.grid_search import GridSearchCV
from dklearn.grid_search import GridSearchCV

# from sklearn.pipeline import Pipeline
from dklearn.pipeline import Pipeline

However, neither of these techniques are perfect. These are the easiest things to try, but not always the best solutions. This blogpost focuses on low-hanging fruit.

Joblib

Scikit-Learn already parallelizes across a multi-core CPU using Joblib, a simple but powerful and mature library that provides an extensible map operation. Here is a simple example of using Joblib on its own without sklearn:

# Sequential code
from time import sleep

def slowinc(x):
    sleep(1)  # take a bit of time to simulate real work
    return x + 1

>>> [slowinc(i) for i in range(10)]  # this takes 10 seconds
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]


# Parallel code
from joblib import Parallel, delayed

>>> Parallel(n_jobs=4)(delayed(slowinc)(i) for i in range(10))  # this takes 3 seconds
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

Dask users will recognize the delayed function modifier. Dask stole the delayed decorator from Joblib.

Many of Scikit-learn’s parallel algorithms use Joblib internally. If we can extend Joblib to clusters then we get some added parallelism from joblib-enabled Scikit-learn functions immediately.

Distributed Joblib

Fortunately Joblib provides an interface for other parallel systems to step in and act as an execution engine. We can do this with the parallel_backend context manager to run with hundreds or thousands of cores in a nearby cluster:

import distributed.joblib
from joblib import parallel_backend

with parallel_backend('dask.distributed', scheduler_host='scheduler-address:8786'):
    print(Parallel()(delayed(slowinc)(i) for i in list(range(100))))

The main value for Scikit-learn users here is that Scikit-learn already uses joblib.Parallel within its code, so this trick works with the Scikit-learn code that you already have.

So we can use Joblib to parallelize normally on our multi-core processor:

estimator = GridSearchCV(n_jobs=4, ...)  # use joblib on local multi-core processor

or we can use Joblib together with Dask.distributed to parallelize across a multi-node cluster:

with parallel_backend('dask.distributed', scheduler_host='scheduler-address:8786'):
    estimator = GridSearchCV(...)  # use joblib with Dask cluster

(There will be a more thorough example towards the end)

Limitations

Joblib is used throughout many algorithms in Scikit-learn, but not all. Generally any operation that accepts an n_jobs= parameter is a possible choice.

From Dask’s perspective Joblib’s interface isn’t ideal. For example it will always collect intermediate results back to the main process, rather than leaving them on the cluster until necessary. For computationally intense operations this is fine but does add some unnecessary communication overhead. Also Joblib doesn’t allow for operations more complex than a parallel map, so the range of algorithms that this can parallelize is somewhat limited.

Still though, given the wide use of Joblib-accelerated workflows (particularly within Scikit-learn) this is a simple thing to try if you have a cluster nearby with a possible large payoff.

Dask-learn Pipeline and Gridsearch

In July 2016, Jim Crist built and wrote about a small project, dask-learn. This project was a collaboration with SKLearn developers and an attempt to see which parts of Scikit-learn were trivially and usefully parallelizable. By far the most productive thing to come out of this work were Dask variants of Scikit-learn’s Pipeline, GridsearchCV, and RandomSearchCV objects that better handle nested parallelism. Jim observed significant speedups over SKLearn code by using these drop-in replacements.

So if you replace the following imports you may get both better single-threaded performance and the ability to scale out to a cluster:

# from sklearn.grid_search import GridSearchCV
from dklearn.grid_search import GridSearchCV

# from sklearn.pipeline import Pipeline
from dklearn.pipeline import Pipeline

Here is a simple example from Jim’s more in-depth blogpost:

from sklearn.datasets import make_classification

X, y = make_classification(n_samples=10000,
                           n_features=500,
                           n_classes=2,
                           n_redundant=250,
                           random_state=42)

from sklearn import linear_model, decomposition
from sklearn.pipeline import Pipeline
from dklearn.pipeline import Pipeline

logistic = linear_model.LogisticRegression()
pca = decomposition.PCA()
pipe = Pipeline(steps=[('pca', pca),
                       ('logistic', logistic)])

# Parameters of pipelines can be set using '__' separated parameter names:
grid = dict(pca__n_components=[50, 100, 150, 250],
            logistic__C=[1e-4, 1.0, 10, 1e4],
            logistic__penalty=['l1', 'l2'])

# from sklearn.grid_search import GridSearchCV
from dklearn.grid_search import GridSearchCV

estimator = GridSearchCV(pipe, grid)
estimator.fit(X, y)

SKLearn performs this computation in around 40 seconds while the dask-learn drop-in replacements take around 10 seconds. Also, if you add the following lines to connect to a running cluster the whole thing scales out:

from dask.distributed import Client

c = Client('scheduler-address:8786')

Here is a live Bokeh plot of the computation on a tiny eight process “cluster” running on my own laptop. I’m using processes here to highlight the costs of communication between processes (red). It’s actually about 30% faster to run this computation within the same single process.

Conclusion

This post showed a couple of simple mechanisms for scikit-learn users to accelerate their existing workflows with Dask. These aren’t particularly sophisticated, nor are they performance-optimal, but they are easy to understand and easy to try out. In a future blogpost I plan to cover more complex ways in which Dask can accelerate sophisticated machine learning workflows.

What we could have done better

As always, I include a brief section on what went wrong or what we could have done better with more time.

  • See the bottom of Jim’s post for a more thorough explanation of “what we could have done better” for dask-learn’s pipeline and gridsearch
  • Joblib + Dask.distributed interaction is convenient, but leaves some performance on the table. It’s not clear how Dask can help the sklearn codebase without being too invasive.
  • It would have been nice to spin up an actual cluster on parallel hardware for this post. I wrote this quickly (in a few hours) so decided to skip this. If anyone wants to write a follow-on experiment I would be happy to publish it.

Experiment with Dask and TensorFlow

This work is supported by Continuum Analytics the XDATA Program and the Data Driven Discovery Initiative from the Moore Foundation

Summary

This post briefly describes potential interactions between Dask and TensorFlow and then goes through a concrete example using them together for distributed training with a moderately complex architecture.

This post was written in haste, see disclaimers below.

Introduction

Dask and TensorFlow both provide distributed computing in Python. TensorFlow excels at deep learning applications while Dask is more generic. We can combine both together in a few applications:

  1. Simple data parallelism: hyper-parameter searches during training and predicting already-trained models against large datasets are both trivial to distribute with Dask as they would be trivial to distribute with any distributed computing system (Hadoop/Spark/Flink/etc..) We won’t discuss this topic much. It should be straightforward.
  2. Deployment: A common pain point with TensorFlow is that setup isn’t well automated. This plagues all distributed systems, especially those that are run on a wide variety of cluster managers (see cluster deployment blogpost for more information). Fortunately, if you already have a Dask cluster running it’s trivial to stand up a distributed TensorFlow network on top of it running within the same processes.
  3. Pre-processing: We pre-process data with dask.dataframe or dask.array, and then hand that data off to TensorFlow for training. If Dask and TensorFlow are co-located on the same processes then this movement is efficient. Working together we can build efficient and general use deep learning pipelines.

In this blogpost we look very briefly at the first case of simple parallelism. Then we go into more depth on an experiment that uses Dask and TensorFlow in a more complex situation. We'll find that we can accomplish a fairly sophisticated workflow easily, both due to how sensible TensorFlow is to set up and how flexible Dask can be in advanced situations.

Motivation and Disclaimers

Distributed deep learning is fundamentally changing the way humanity solves some very hard computing problems like natural language translation, speech-to-text transcription, image recognition, etc.. However, distributed deep learning also suffers from public excitement, which may distort our image of its utility. Distributed deep learning is not always the correct choice for most problems. This is for two reasons:

  1. Focusing on single machine computation is often a better use of time. Model design, GPU hardware, etc. can have a more dramatic impact than scaling out. For newcomers to deep learning, watching online video lecture series may be a better use of time than reading this blogpost.
  2. Traditional machine learning techniques like logistic regression, and gradient boosted trees can be more effective than deep learning if you have finite data. They can also sometimes provide valuable interpretability results.

Regardless, there are some concrete take-aways, even if distributed deep learning is not relevant to your application:

  1. TensorFlow is straightforward to set up from Python
  2. Dask is sufficiently flexible out of the box to support complex settings and workflows
  3. We’ll see an example of a typical distributed learning approach that generalizes beyond deep learning.

Additionally the author does not claim expertise in deep learning and wrote this blogpost in haste.

Simple Parallelism

Most parallel computing is simple. We easily apply one function to lots of data, perhaps with slight variation. In the case of deep learning this can enable a couple of common workflows:

  1. Build many different models, train each on the same data, choose the best performing one. Using dask’s concurrent.futures interface, this looks something like the following:

    # Hyperparameter search
    client = Client('dask-scheduler-address:8786')
    scores = client.map(train_and_evaluate, hyper_param_list, data=data)
    best = client.submit(max, scores)
    best.result()
  2. Given an already-trained model, use it to predict outcomes on lots of data. Here we use a big data collection like dask.dataframe:

    # Distributed prediction
    df = dd.read_parquet('...')
    ...  # do some preprocessing here
    df['outcome'] = df.map_partitions(predict)

These techniques are relatively straightforward if you have modest exposure to Dask and TensorFlow (or any other machine learning library like scikit-learn), so I’m going to ignore them for now and focus on more complex situations.

Interested readers may find this blogpost on TensorFlow and Spark of interest. It is a nice writeup that goes over these two techniques in more detail.

A Distributed TensorFlow Application

We’re going to replicate this TensorFlow example which uses multiple machines to train a model that fits in memory using parameter servers for coordination. Our TensorFlow network will have three different kinds of servers:

distributed TensorFlow training graph

  1. Workers: which will get updated parameters, consume training data, and use that data to generate updates to send back to the parameter servers
  2. Parameter Servers: which will hold onto model parameters, synchronizing with the workers as necessary
  3. Scorer: which will periodically test the current parameters against validation/test data and emit a current cross_entropy score to see how well the system is running.

This is a fairly typical approach when the model can fit on one machine but we want to use multiple machines to accelerate training, or when data volumes are too large.

We’ll use TensorFlow to do all of the actual training and scoring. We’ll use Dask to do everything else. In particular, we’re about to do the following:

  1. Prepare data with dask.array
  2. Set up TensorFlow workers as long-running tasks
  3. Feed data from Dask to TensorFlow while scores remain poor
  4. Let TensorFlow handle training using its own network

Prepare Data with Dask.array

For this toy example we’re just going to use the mnist data that comes with TensorFlow. However, we’ll artificially inflate this data by concatenating it to itself many times across a cluster:

def get_mnist():
    from tensorflow.examples.tutorials.mnist import input_data
    mnist = input_data.read_data_sets('/tmp/mnist-data', one_hot=True)
    return mnist.train.images, mnist.train.labels

import dask.array as da
from dask import delayed

datasets = [delayed(get_mnist)() for i in range(20)]  # 20 versions of same dataset
images = [d[0] for d in datasets]
labels = [d[1] for d in datasets]

images = [da.from_delayed(im, shape=(55000, 784), dtype='float32') for im in images]
labels = [da.from_delayed(la, shape=(55000, 10), dtype='float32') for la in labels]

images = da.concatenate(images, axis=0)
labels = da.concatenate(labels, axis=0)

>>> images
dask.array<concate..., shape=(1100000, 784), dtype=float32, chunksize=(55000, 784)>

images, labels = c.persist([images, labels])  # persist data in memory

This gives us a moderately large distributed array of around a million tiny images. If we wanted to we could inspect or clean up this data using normal dask.array constructs:

im = images[1].compute().reshape((28, 28))
plt.imshow(im, cmap='gray')

mnist number 3

im = images.mean(axis=0).compute().reshape((28, 28))
plt.imshow(im, cmap='gray')

mnist mean

im = images.var(axis=0).compute().reshape((28, 28))
plt.imshow(im, cmap='gray')

mnist var

This shows off how one can use Dask collections to clean up and provide pre-processing and feature generation on data in parallel before sending it to TensorFlow. In our simple case we won’t actually do any of this, but it’s useful in more real-world situations.

Finally, after doing our preprocessing on the distributed array of all of our data we’re going to collect images and labels together and batch them into smaller chunks. Again we use some dask.array constructs and dask.delayed when things get messy.

images = images.rechunk((10000, 784))
labels = labels.rechunk((10000, 10))

images = images.to_delayed().flatten().tolist()
labels = labels.to_delayed().flatten().tolist()
batches = [delayed([im, la]) for im, la in zip(images, labels)]

batches = c.compute(batches)

Now we have a few hundred pairs of NumPy arrays in distributed memory waiting to be sent to a TensorFlow worker.

Setting up TensorFlow workers alongside Dask workers

Dask workers are just normal Python processes. TensorFlow can launch itself from a normal Python process. We’ve made a small function here that launches TensorFlow servers alongside Dask workers using Dask’s ability to run long-running tasks and maintain user-defined state. All together, this is about 80 lines of code (including comments and docstrings) and allows us to define our TensorFlow network on top of Dask as follows:

$ pip install git+https://github.com/mrocklin/dask-tensorflow
from dask.distributed import Client  # we already had this above
client = Client('dask-scheduler-address:8786')

from dask_tensorflow import start_tensorflow
tf_spec, dask_spec = start_tensorflow(client, ps=1, worker=4, scorer=1)

>>> tf_spec.as_dict()
{'ps': ['192.168.100.1:2227'],
 'scorer': ['192.168.100.2:2222'],
 'worker': ['192.168.100.3:2223',
            '192.168.100.4:2224',
            '192.168.100.5:2225',
            '192.168.100.6:2226']}

>>> dask_spec
{'ps': ['tcp://192.168.100.1:34471'],
 'scorer': ['tcp://192.168.100.2:40623'],
 'worker': ['tcp://192.168.100.3:33075',
            'tcp://192.168.100.4:37123',
            'tcp://192.168.100.5:32839',
            'tcp://192.168.100.6:36822']}

This starts three groups of TensorFlow servers in the Dask worker processes. TensorFlow will manage its own communication but co-exist right alongside Dask in the same machines and in the same shared memory spaces (note that in the specs above the IP addresses match but the ports differ).

This also sets up a normal Python queue along which Dask can safely send information to TensorFlow. This is how we’ll send those batches of training data between the two services.

Define TensorFlow Model and Distribute Roles

Now is the part of the blogpost where my expertise wanes. I’m just going to copy-paste-and-modify a canned example from the TensorFlow documentation. This is a simplistic model for this problem and it’s entirely possible that I’m making transcription errors. But still, it should get the point across. You can safely ignore most of this code. Dask stuff gets interesting again towards the bottom:

import math
import tempfile
import time
from queue import Empty

IMAGE_PIXELS = 28
hidden_units = 100
learning_rate = 0.01
sync_replicas = False
replicas_to_aggregate = len(dask_spec['worker'])

def model(server):
    worker_device = "/job:%s/task:%d" % (server.server_def.job_name,
                                         server.server_def.task_index)
    task_index = server.server_def.task_index
    is_chief = task_index == 0

    with tf.device(tf.train.replica_device_setter(
                       worker_device=worker_device,
                       ps_device="/job:ps/cpu:0",
                       cluster=tf_spec)):

        global_step = tf.Variable(0, name="global_step", trainable=False)

        # Variables of the hidden layer
        hid_w = tf.Variable(
            tf.truncated_normal([IMAGE_PIXELS * IMAGE_PIXELS, hidden_units],
                                stddev=1.0 / IMAGE_PIXELS), name="hid_w")
        hid_b = tf.Variable(tf.zeros([hidden_units]), name="hid_b")

        # Variables of the softmax layer
        sm_w = tf.Variable(
            tf.truncated_normal([hidden_units, 10],
                                stddev=1.0 / math.sqrt(hidden_units)), name="sm_w")
        sm_b = tf.Variable(tf.zeros([10]), name="sm_b")

        # Ops: located on the worker specified with task_index
        x = tf.placeholder(tf.float32, [None, IMAGE_PIXELS * IMAGE_PIXELS])
        y_ = tf.placeholder(tf.float32, [None, 10])

        hid_lin = tf.nn.xw_plus_b(x, hid_w, hid_b)
        hid = tf.nn.relu(hid_lin)

        y = tf.nn.softmax(tf.nn.xw_plus_b(hid, sm_w, sm_b))
        cross_entropy = -tf.reduce_sum(y_ * tf.log(tf.clip_by_value(y, 1e-10, 1.0)))

        opt = tf.train.AdamOptimizer(learning_rate)

        if sync_replicas:
            if replicas_to_aggregate is None:
                replicas_to_aggregate = num_workers
            else:
                replicas_to_aggregate = replicas_to_aggregate

            opt = tf.train.SyncReplicasOptimizer(
                      opt,
                      replicas_to_aggregate=replicas_to_aggregate,
                      total_num_replicas=num_workers,
                      name="mnist_sync_replicas")

        train_step = opt.minimize(cross_entropy, global_step=global_step)

        if sync_replicas:
            local_init_op = opt.local_step_init_op
            if is_chief:
                local_init_op = opt.chief_init_op

            ready_for_local_init_op = opt.ready_for_local_init_op

            # Initial token and chief queue runners required by the sync_replicas mode
            chief_queue_runner = opt.get_chief_queue_runner()
            sync_init_op = opt.get_init_tokens_op()

        init_op = tf.global_variables_initializer()
        train_dir = tempfile.mkdtemp()

        if sync_replicas:
            sv = tf.train.Supervisor(
                     is_chief=is_chief,
                     logdir=train_dir,
                     init_op=init_op,
                     local_init_op=local_init_op,
                     ready_for_local_init_op=ready_for_local_init_op,
                     recovery_wait_secs=1,
                     global_step=global_step)
        else:
            sv = tf.train.Supervisor(
                     is_chief=is_chief,
                     logdir=train_dir,
                     init_op=init_op,
                     recovery_wait_secs=1,
                     global_step=global_step)

        sess_config = tf.ConfigProto(
            allow_soft_placement=True,
            log_device_placement=False,
            device_filters=["/job:ps", "/job:worker/task:%d" % task_index])

        # The chief worker (task_index==0) session will prepare the session,
        # while the remaining workers will wait for the preparation to complete.
        if is_chief:
            print("Worker %d: Initializing session..." % task_index)
        else:
            print("Worker %d: Waiting for session to be initialized..." % task_index)

        sess = sv.prepare_or_wait_for_session(server.target, config=sess_config)

        if sync_replicas and is_chief:
            # Chief worker will start the chief queue runner and call the init op.
            sess.run(sync_init_op)
            sv.start_queue_runners(sess, [chief_queue_runner])

        return sess, x, y_, train_step, global_step, cross_entropy


def ps_task():
    with local_client() as c:
        c.worker.tensorflow_server.join()


def scoring_task():
    with local_client() as c:
        # Scores Channel
        scores = c.channel('scores', maxlen=10)

        # Make Model
        server = c.worker.tensorflow_server
        sess, _, _, _, _, cross_entropy = model(c.worker.tensorflow_server)

        # Testing Data
        from tensorflow.examples.tutorials.mnist import input_data
        mnist = input_data.read_data_sets('/tmp/mnist-data', one_hot=True)
        test_data = {x: mnist.validation.images,
                     y_: mnist.validation.labels}

        # Main Loop
        while True:
            score = sess.run(cross_entropy, feed_dict=test_data)
            scores.append(float(score))
            time.sleep(1)


def worker_task():
    with local_client() as c:
        scores = c.channel('scores')
        num_workers = replicas_to_aggregate = len(dask_spec['worker'])

        server = c.worker.tensorflow_server
        queue = c.worker.tensorflow_queue

        # Make model
        sess, x, y_, train_step, global_step, _ = model(c.worker.tensorflow_server)

        # Main loop
        while not scores or scores.data[-1] > 1000:
            try:
                batch = queue.get(timeout=0.5)
            except Empty:
                continue

            train_data = {x: batch[0],
                          y_: batch[1]}

            sess.run([train_step, global_step], feed_dict=train_data)

The last three functions defined here, ps_task, scoring_task and worker_task, are functions that we want to run on each of our three groups of TensorFlow server types. The parameter server task just starts a long-running task and passively joins the TensorFlow network:

def ps_task():
    with local_client() as c:
        c.worker.tensorflow_server.join()

The scorer task opens up an inter-worker channel of communication named “scores”, creates the TensorFlow model, then every second scores the current state of the model against validation data. It reports the score on the inter-worker channel:

def scoring_task():
    with local_client() as c:
        scores = c.channel('scores')  # inter-worker channel

        # Make Model
        sess, _, _, _, _, cross_entropy = model(c.worker.tensorflow_server)
        ...
        while True:
            score = sess.run(cross_entropy, feed_dict=test_data)
            scores.append(float(score))
            time.sleep(1)

The worker task makes the model, listens on the Dask-TensorFlow Queue for new training data, and continues training until the last reported score is good enough.

def worker_task():
    with local_client() as c:
        scores = c.channel('scores')
        queue = c.worker.tensorflow_queue

        # Make model
        sess, x, y_, train_step, global_step, _ = model(c.worker.tensorflow_server)

        while scores.data[-1] > 1000:
            batch = queue.get()
            train_data = {x: batch[0],
                          y_: batch[1]}
            sess.run([train_step, global_step], feed_dict=train_data)

We launch these tasks on the Dask workers that have the corresponding TensorFlow servers (see tf_spec and dask_spec above):

ps_tasks = [c.submit(ps_task, workers=worker)
            for worker in dask_spec['ps']]
worker_tasks = [c.submit(worker_task, workers=addr, pure=False)
                for addr in dask_spec['worker']]
scorer_task = c.submit(scoring_task, workers=dask_spec['scorer'][0])

This starts long-running tasks that just sit there, waiting for external stimulation:

long running TensorFlow tasks

Finally we construct a function to dump each of our batches of data from our Dask.array (from the very beginning of this post) into the Dask-TensorFlow queues on our workers. We make sure to only run these tasks where the Dask-worker has a corresponding TensorFlow training worker:

from distributed.worker_client import get_worker

def transfer_dask_to_tensorflow(batch):
    worker = get_worker()
    worker.tensorflow_queue.put(batch)

dump = c.map(transfer_dask_to_tensorflow, batches,
             workers=dask_spec['worker'], pure=False)

If we want to we can track progress in our local session by subscribing to the same inter-worker channel:

scores = c.channel('scores')

We can use this to repeatedly dump data into the workers until they converge.

while scores.data[-1] > 1000:
    dump = c.map(transfer_dask_to_tensorflow, batches,
                 workers=dask_spec['worker'], pure=False)
    wait(dump)

Conclusion

We discussed a non-trivial way to use TensorFlow to accomplish distributed machine learning. We used Dask to support TensorFlow in a few ways:

  1. Trivially setup the TensorFlow network
  2. Prepare and clean data
  3. Coordinate progress and stopping criteria

We found it convenient that Dask and TensorFlow could play nicely with each other. Dask supported TensorFlow without getting in the way. The fact that both libraries play nicely within Python and the greater PyData stack (NumPy/Pandas) makes it trivial to move data between them without costly or complex tricks.

Additionally, we didn’t have to work to integrate these two systems. There is no need for a separate collaborative effort to integrate Dask and TensorFlow at a core level. Instead, they are designed in such a way so as to foster this type of interaction without special attention or effort.

This is also the first blogpost that I’ve written that, from a Dask perspective, uses some more complex features like long running tasks or publishing state between workers with channels. These more advanced features are invaluable when creating more complex/bespoke parallel computing systems, such as are often found within companies.

What we could have done better

From a deep learning perspective this example is both elementary and incomplete. It would have been nice to train on a dataset that was larger and more complex than MNIST. Also it would be nice to see the effects of training over time and the performance of using different numbers of workers. In defense of this blogpost I can only claim that Dask shouldn’t affect any of these scaling results, because TensorFlow is entirely in control at these stages and TensorFlow already has plenty of published scaling information.

Generally speaking though, this experiment was done in a weekend afternoon and the blogpost was written in a few hours shortly afterwards. If anyone is interested in performing and publishing about a more serious distributed deep learning experiment with TensorFlow and Dask I would be happy to support them on the Dask side. I think that there is plenty to learn here about best practices.

Acknowledgements

The following individuals contributed to the construction of this blogpost:

  • Stephan Hoyer contributed with conversations about how TensorFlow is used in practice and with concrete experience on deployment.
  • Will Warner and Erik Welch both provided valuable editing and language recommendations

Dask Development Log

This work is supported by Continuum Analytics the XDATA Program and the Data Driven Discovery Initiative from the Moore Foundation

To increase transparency I’m blogging weekly(ish) about the work done on Dask and related projects during the previous week. This log covers work done between 2017-02-01 and 2017-02-20. Nothing here is ready for production. This blogpost is written in haste, so refined polish should not be expected.

Themes of the last couple of weeks:

  1. Profiling experiments with Dask-GLM
  2. Subsequent graph optimizations, both non-linear fusion and avoiding repeatedly creating new graphs
  3. Tensorflow and Keras experiments
  4. XGBoost experiments
  5. Dask tutorial refactor
  6. Google Cloud Storage support
  7. Cleanup of Dask + SKLearn project

Dask-GLM and iterative algorithms

Dask-GLM is currently just a bunch of solvers like Newton, Gradient Descent, BFGS, Proximal Gradient Descent, and ADMM. These are useful in solving problems like logistic regression, but also several others. The mathematical side of this work is mostly done by Chris White and Hussain Sultan at Capital One.

We’ve been using this project also to see how Dask can scale out machine learning algorithms. To this end we ran a few benchmarks here: https://github.com/dask/dask-glm/issues/26 . This just generates and solves some random problems, but at larger scales.

What we found is that some algorithms, like ADMM, perform beautifully, while for others, like gradient descent, scheduler overhead can become a substantial bottleneck at scale. This is mostly just because the actual in-memory NumPy operations are so fast; any sluggishness on Dask's part becomes very apparent. Here is a profile of gradient descent:

Notice all the white space. This is Dask figuring out what to do during different iterations. We’re now working to bring this down to make all of the colored parts of this graph squeeze together better. This will result in general overhead improvements throughout the project.

Graph Optimizations - Aggressive Fusion

We’re approaching this in two ways:

  1. More aggressively fuse tasks together so that there are fewer blocks for the scheduler to think about
  2. Avoid repeated work when generating very similar graphs

In the first case, Dask already does standard task fusion. For example, if you have the following tasks:

x = f(w)
y = g(x)
z = h(y)

Dask (along with every other compiler-like project since the 1980’s) already turns this into the following:

z = h(g(f(w)))

What’s tricky with a lot of these mathematical or optimization algorithms though is that they are mostly, but not entirely linear. Consider the following example:

y = exp(x) - 1/x

Visualized as a node-link diagram, this graph looks like a diamond like the following:

         o  exp(x) - 1/x
        / \
exp(x) o   o   1/x
        \ /
         o  x

Graphs like this generally don’t get fused together because we could compute both exp(x) and 1/x in parallel. However when we’re bound by scheduling overhead and when we have plenty of parallel work to do, we’d prefer to fuse these into a single task, even though we lose some potential parallelism. There is a tradeoff here and we’d like to be able to exchange some parallelism (of which we have a lot) for less overhead.

PR here dask/dask #1979 by Erik Welch (Erik has written and maintained most of Dask’s graph optimizations).

Graph Optimizations - Structural Sharing

Additionally, we no longer make copies of graphs in dask.array. Every collection like a dask.array or dask.dataframe holds onto a Python dictionary holding all of the tasks that are needed to construct that array. When we perform an operation on a dask.array we get a new dask.array with a new dictionary pointing to a new graph. The new graph generally has all of the tasks of the old graph, plus a few more. As a result, we frequently make copies of the underlying task graph.

y = (x + 1)
assert set(y.dask).issuperset(x.dask)

Normally this doesn’t matter (copying graphs is usually cheap) but it can become very expensive for large arrays when you’re doing many mathematical operations.

Now we keep dask graphs in a custom mapping (dict-like object) that shares subgraphs with other arrays. As a result, we rarely make unnecessary copies and some algorithms incur far less overhead. Work done in dask/dask #1985.
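
To illustrate the idea of structural sharing, here is a toy sketch using Python's ChainMap (this is not Dask's actual data structure, only an illustration of the layered-mapping idea):

from collections import ChainMap
from operator import add

# Graph for an existing collection: a plain dict mapping keys to task tuples
x_graph = {'a': 1,
           'x': (add, 'a', 10)}

# A derived collection only layers its new task on top; nothing is copied
y_graph = ChainMap({'y': (add, 'x', 1)}, x_graph)

assert y_graph['x'] is x_graph['x']     # shared with the parent graph
assert set(y_graph) == {'a', 'x', 'y'}  # but still behaves like one big graph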

TensorFlow and Keras experiments

Two weeks ago I gave a talk with Stan Seibert (Numba developer) on Deep Learning (Stan’s bit) and Dask (my bit). As part of that talk I decided to launch tensorflow from Dask and feed Tensorflow from a distributed Dask array. See this blogpost for more information.

That experiment was nice in that it showed how easy it is to deploy and interact with other distributed services from Dask. However from a deep learning perspective it was immature. Fortunately, it succeeded in attracting the attention of other potential developers (the true goal of all blogposts) and now Brett Naul is using Dask to manage his GPU workloads with Keras. Brett contributed code to help Dask move around Keras models. He seems to particularly value Dask's ability to manage resources to help him fully saturate the GPUs on his workstation.

XGBoost experiments

After deploying TensorFlow we asked what it would take to do the same for XGBoost, another very popular (though very different) machine learning library. The conversation for that is here: dmlc/xgboost #2032 with prototype code here mrocklin/dask-xgboost. As with TensorFlow, the integration is relatively straightforward (if perhaps a bit simpler in this case). The challenge for me is that I have little concrete experience with the applications that these libraries were designed to solve. Feedback and collaboration from open source developers who use these libraries in production is welcome.

Dask tutorial refactor

The dask/dask-tutorial project on Github was originally written for PyData Seattle in July 2015 (roughly 19 months ago). Dask has evolved substantially since then but this is still our only educational material. Fortunately Martin Durant is doing a pretty serious rewrite, both correcting parts that are no longer modern API, and also adding in new material around distributed computing and debugging.

Google Cloud Storage

Dask developers (mostly Martin) maintain libraries to help Python users connect to distributed file systems like HDFS (with hdfs3), S3 (with s3fs), and Azure Data Lake (with adlfs), which subsequently become usable from Dask. Martin has been working on support for Google Cloud Storage (with gcsfs) with another small project that uses the same API.

Cleanup of Dask+SKLearn project

Last year Jim Crist published three great blogposts about using Dask with SKLearn. The result was a small library dask-learn that had a variety of features, some incredibly useful, like a cluster-ready Pipeline and GridSearchCV, others less so. Because of the experimental nature of this work we had labeled the library "not ready for use", which drew some curious responses from potential users.

Jim is now busy dusting off the project, removing less-useful parts and generally reducing scope to strictly model-parallel algorithms.


Dask Release 0.14.0

This work is supported by Continuum Analytics the XDATA Program and the Data Driven Discovery Initiative from the Moore Foundation

Summary

Dask just released version 0.14.0. This release contains some significant internal changes as well as the usual set of increased API coverage and bug fixes. This blogpost outlines some of the major changes since the last release on January 27th, 2017.

  1. Structural sharing of graphs between collections
  2. Refactor communications system
  3. Many small dataframe improvements
  4. Top-level persist function

You can install new versions using Conda or Pip

conda install -c conda-forge dask distributed

or

pip install dask[complete] distributed --upgrade

Share Graphs between Collections

Dask collections (arrays, bags, dataframes, delayed) hold onto task graphs that have all of the tasks necessary to create the desired result. For larger datasets or complex calculations these graphs may have thousands, or sometimes even millions of tasks. In some cases the overhead of handling these graphs can become significant.

This is especially true because dask collections don’t modify their graphs in place, they make new graphs with updated computations. Copying graph data structures with millions of nodes can take seconds and interrupt interactive workflows.

To address this dask.arrays and dask.delayed collections now use special graph data structures with structural sharing. This significantly cuts down on the amount of overhead when building repetitive computations.

import dask.array as da
x = da.ones(1000000, chunks=(1000,))  # 1000 chunks of size 1000

Version 0.13.0

%time for i in range(100): x = x + 1
CPU times: user 2.69 s, sys: 96 ms, total: 2.78 s
Wall time: 2.78 s

Version 0.14.0

%time for i in range(100): x = x + 1
CPU times: user 756 ms, sys: 8 ms, total: 764 ms
Wall time: 763 ms

The difference in this toy problem is moderate, but for real-world cases this difference can grow fairly large. This was also one of the blockers identified by the climate science community that stopped them from handling petabyte-scale analyses.

We chose to roll this out for arrays and delayed first just because those are the two collections that typically produce large task graphs. Dataframes and bags remain as before for the time being.

Communications System

Dask communicates over TCP sockets. It uses Tornado’s IOStreams to handle non-blocking communication, framing, etc.. We’ve run into some performance issues with Tornado when moving large amounts of data. Some of this has been improved upstream in Tornado directly, but we still want the ability to optionally drop Tornado’s byte-handling communication stack in the future. This is especially important as dask gets used in institutions with faster and more exotic interconnects (supercomputers). We’ve been asked a few times to support other transport mechanisms like MPI.

The first step (and probably hardest step) was to make Dask's communication system pluggable so that we can use different communication options without significant source-code changes. We managed this a month ago and now it is possible to add other transports to Dask relatively easily. TCP remains the only real choice today though there is also an experimental ZeroMQ option (which provides little-to-no performance benefit over TCP) as well as a fully in-memory option in development.

For users the main difference you'll see is that tcp:// is now prepended in many places. For example:

$ dask-scheduler
distributed.scheduler - INFO - -----------------------------------------------
distributed.scheduler - INFO -   Scheduler at:  tcp://192.168.1.115:8786
...
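
As a small example, connecting a client then uses the same fully qualified address (the address below is just the one printed by the scheduler above):

from dask.distributed import Client

client = Client('tcp://192.168.1.115:8786')  # note the explicit tcp:// prefix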

Variety of Dataframe Changes

As usual the Pandas API has been more fully covered by community contributors. Some representative changes include the following:

  1. Support non-uniform categoricals: We no longer need to do a full pass through the data when categorizing a column. Instead we categorize each partition independently (even if they have different category values) and then unify these categories only when necessary

    df['x'] = df['x'].astype('category')  # this is now fast
  2. Groupby cumulative reductions

    df.groupby('x').cumsum()
  3. Support appending to Parquet collections

    df.to_parquet('/path/to/foo.parquet', append=True)
  4. A new string and HTML representation of dask.dataframes. Typically Pandas prints dataframes on the screen by rendering the first few rows of data. However, because Dask.dataframes are lazy we don't have this data and so typically render some metadata about the dataframe

    >>> df                                               # version 0.13.0
    dd.DataFrame<make-ti..., npartitions=366, divisions=(Timestamp('2000-01-01 00:00:00', freq='D'),
                                                         Timestamp('2000-01-02 00:00:00', freq='D'),
                                                         Timestamp('2000-01-03 00:00:00', freq='D'),
                                                         ...,
                                                         Timestamp('2000-12-31 00:00:00', freq='D'),
                                                         Timestamp('2001-01-01 00:00:00', freq='D'))>

    This rendering, while informative, can be improved. Now we render dataframes as a Pandas dataframe, but place metadata in the dataframe instead of the actual data.

    >>> df                                               # version 0.14.0
    Dask DataFrame Structure:
                           x        y      z
    npartitions=366
    2000-01-01       float64  float64  int64
    2000-01-02           ...      ...    ...
    ...                  ...      ...    ...
    2000-12-31           ...      ...    ...
    2001-01-01           ...      ...    ...
    Dask Name: make-timeseries, 366 tasks

    Additionally this renders nicely as an HTML table in a Jupyter notebook

Variety of Distributed System Changes

There have also been a wide variety of changes to the distributed system. I’ll include a representative sample here to give a flavor of what has been happening:

  1. Ensure first-come-first-served priorities when dealing with multiple clients
  2. Send small amounts of data through Channels. Channels are a way for multiple clients/users connected to the same scheduler to publish and exchange data between themselves. Previously they only transmitted Futures (which could in turn point to larger data living on the cluster). However we found that it was useful to communicate small bits of metadata as well, for example to signal progress or stopping criteria between clients collaborating on the same workloads. Now you can publish any msgpack serializable data on Channels.

    # Publishing Client
    scores = client.channel('scores')
    scores.append(123.456)

    # Subscribing Client
    scores = client.channel('scores')
    while scores.data[-1] < THRESHOLD:
        ... continue working ...
  3. We’re better at estimating the size in data of SciPy Sparse matrices and Keras models. This allows Dask to make smarter choices about when it should and should not move data around for load balancing. Additionally Dask can now also serialize Keras models.
  4. To help people deploying on clusters that have a shared network file system (as is often the case in scientific or academic institutions) the scheduler and workers can now communicate connection information using the --scheduler-file keyword

    dask-scheduler --scheduler-file /path/to/scheduler.json
    dask-worker --scheduler-file /path/to/scheduler.json
    dask-worker --scheduler-file /path/to/scheduler.json
    
    >>> client = Client(scheduler_file='/path/to/scheduler.json')
    

    Previously we needed to communicate the address of the scheduler, which could be challenging when we didn’t know on which node the scheduler would be run.

Other

There are a number of smaller details not mentioned in this blogpost. For more information visit the changelogs and documentation

Additionally a great deal of Dask work over the last month has happened outside of these core dask repositories.

You can install or upgrade using Conda or Pip

conda install -c conda-forge dask distributed

or

pip install dask[complete] distributed --upgrade

Acknowledgements

Since the last 0.13.0 release on January 27th the following developers have contributed to the dask/dask repository:

  • Antoine Pitrou
  • Chris Barber
  • Daniel Davis
  • Elmar Ritsch
  • Erik Welch
  • jakirkham
  • Jim Crist
  • John Crickett
  • jspreston
  • Juan Luis Cano Rodríguez
  • kayibal
  • Kevin Ernst
  • Markus Gonser
  • Matthew Rocklin
  • Martin Durant
  • Nir
  • Sinhrks
  • Talmaj Marinc
  • Vlad Frolov
  • Will Warner

And the following developers have contributed to the dask/distributed repository:

  • Antoine Pitrou
  • Ben Schreck
  • bmaisonn
  • Brett Naul
  • Demian Wassermann
  • Israel Saeta Pérez
  • John Crickett
  • Joseph Crail
  • Malte Gerken
  • Martin Durant
  • Matthew Rocklin
  • Min RK
  • strets123

Biased Benchmarks

This work is supported by Continuum Analytics the XDATA Program and the Data Driven Discovery Initiative from the Moore Foundation

Summary

Performing benchmarks to compare software is surprisingly difficult to do fairly, even assuming the best intentions of the author. Technical developers can fall victim to a few natural human failings:

  1. We judge other projects by our own objectives rather than the objectives under which that project was developed
  2. We fail to use other projects with the same expertise that we have for our own
  3. We naturally gravitate towards cases at which our project excels
  4. We improve our software during the benchmarking process
  5. We don’t release negative results

We discuss each of these failings in the context of current benchmarks I’m working on comparing Dask and Spark Dataframes.

Introduction

Last week I started comparing performance between Dask Dataframes (a project that I maintain) and Spark Dataframes (the current standard). My initial results showed that Dask.dataframes were overall much faster, somewhere like 5x.

These results were wrong. They weren't wrong in a factual sense, the experiments that I ran were clear and reproducible, but there was so much bias in how I selected, set up, and ran those experiments that the end result was misleading. After checking results myself and then having other experts come in and check my results I now see much more sensible numbers. At the moment both projects are within a factor of two of each other most of the time, with some interesting exceptions either way.

This blogpost outlines the ways in which library authors can fool themselves when performing benchmarks, using my recent experience as an anecdote. I hope that this encourages authors to check themselves, and encourages readers to be more critical of numbers that they see in the future.

This problem exists as well in academic research. For a pop-science rendition I recommend “The Experiment Experiment” on the Planet Money Podcast.

Skewed Objectives

Feature X is so important. I wonder how the competition fares?

Every project is developed with different applications in mind and so has different strengths and weaknesses. If we approach a benchmark caring only about our original objectives and dismissing the objectives of the other projects then we’ll very likely trick ourselves.

For example consider reading CSV files. Dask's CSV reader is based off of Pandas' CSV reader, which was the target of a lot of effort and love; this is because CSV was so important to the finance community where Pandas grew up. Spark's CSV solution is less awesome, but that's less about the quality of Spark and more a statement of the fact that Spark users tend not to use CSV. If they're using text-based formats they're much more likely to use line-delimited JSON, which is typical in Spark's common use cases (web diagnostics, click logs, etc..) Pandas/Dask came from the scientific and finance worlds where CSV is king while Spark came from the web world where JSON reigns.

Conversely, Dask.dataframe hasn’t bothered to hook up the pandas.read_json function to Dask.dataframe yet. Surprisingly it never comes up. Both projects can correctly say that the other project’s solution to what they consider the standard text-based file format is less-than-awesome. Comparing performance here either way will likely lead to misguided conclusions.

So when benchmarking data ingestion maybe we look around a bit, see that both claim to support Parquet well, and use that as the basis for comparison.

Skewed Experience

Whoa, this other project has a lot of configuration parameters! Lets just use the defaults

Software is often easy to set up, but requires experience to set up optimally. Authors are naturally more adept at setting up their own software than the software of their competition.

My original (and flawed) solution to this was to “just use the defaults” on both projects. Given my inability to tune Spark (there are several dozen parameters) I decided to also not tune Dask and run under default settings. I figured that this would be a good benchmark not only of the software, but on choices for sane defaults, which is a good design principle in itself.

However this failed spectacularly because I was making unconscious decisions like the size of machines that I was using for the experiment, CPU/memory ratios, etc.. It turns out that Spark’s defaults are optimized for very small machines (or more likely, small YARN containers) and use only 1GB of memory per executor by default while Dask is typically run on larger boxes or has the full use of a single machine in a single shared-memory process. My standard cluster configurations were biased towards Dask before I even considered running a benchmark.

Similarly the APIs of software projects are complex and for any given problem there is often both a fast way and a general-but-slow way. Authors naturally choose the fast way on their own system but inadvertently choose the general way that comes up first when reading documentation for the other project. It often takes months of hands-on experience to understand a project well enough to definitively say that you're not doing things in a dumb way.

In both cases here I think the only solution is to collaborate with someone that primarily uses the other system.

Preference towards strengths

Oh hey, we’re doing really well here. This is great! Lets dive into this a bit more.

It feels great to see your project doing well. This emotional pleasure response is powerful. It’s only natural that we pursue that feeling more, exploring different aspects of it. This can skew our writing as well. We’ll find that we’ve decided to devote 80% of the text to what originally seemed like a small set of features, but which now seems like the main point.

It’s important that we define a set of things we’re going to study ahead of time and then stick to those things. When we run into cases where our project fails we should take that as an opportunity to raise an issue for future (though not current) development.

Tuning during experimentation

Oh, I know why this is slow. One sec, let me change something in the code.

I’m doing this right now. Dask dataframe shuffles are generally slower than Spark dataframe shuffles. On numeric data this used to be around a 2x difference, now it’s more like a 1.2x difference (at least on my current problem and machine). Overall this is great, seeing that another project was beating Dask motivated me to dive in (see dask/distributed #932) and this will result in a better experience for users in the future. As a developer this is also how I operate. I define a benchmark, profile my code, identify hot-spots, and optimize. Business as usual.

However as an author of a comparative benchmark this is also somewhat dishonest; I'm not giving the Spark developers the same opportunity to find and fix similar performance issues in their software before I publish my results. I'm also giving a biased picture to my readers. I've made all of the pieces that I'm going to show off fast while neglecting the others. Picking benchmarks, optimizing the project to make them fast, and then publishing those results gives the incorrect impression that the entire project has been optimized to that level.

Omission

So, this didn’t go as planned. Lets wait a few months until the next release.

There is no motivation to publish negative results. Unless of course you’ve just written a blogpost announcing that you plan to release benchmarks in the near future. Then you’re really forced to release numbers, even if they’re mixed.

That’s ok, mixed numbers can be informative. They build trust and community. And we all talk about open source community driven software, so these should be welcome.

Straight up bias

Look, we’re not in grad-school any more. We’ve got to convince companies to actually use this stuff.

Everything we’ve discussed so far assumes best intentions, that the author is acting in good faith, but falling victim to basic human failings.

However many developers today (including myself) are paid and work for for-profit companies that need to make money. To an increasing extent making this money depends on community mindshare, which means publishing benchmarks that sway users to our software. Authors have bosses that they’re trying to impress or the content and tone of an article may be influenced by other people within the company other than the stated author.

I’ve been pretty lucky working with Continuum Analytics (my employer) in that they’ve been pretty hands-off with technical writing. For other employers that may be reading, we’ve actually had an easier time getting business because of the honest tone in these blogposts in some cases. Potential clients generally have the sense that we’re trustworthy.

Technical honesty goes a surprisingly long way towards implying technical proficiency.

Developing Convex Optimization Algorithms in Dask

This work is supported by Continuum Analytics, the XDATA Program, and the Data Driven Discovery Initiative from the Moore Foundation.

Summary

We build distributed optimization algorithms with Dask.  We show both simple examples and also benchmarks from a nascent dask-glm library for generalized linear models.  We also talk about the experience of learning Dask to do this kind of work.

This blogpost is co-authored by Chris White (Capital One) who knows optimization and Matthew Rocklin (Continuum Analytics) who knows distributed computing.

Introduction

Many machine learning and statistics models (such as logistic regression) depend on convex optimization algorithms like Newton’s method, stochastic gradient descent, and others.  These optimization algorithms are both pragmatic (they’re used in many applications) and mathematically interesting.  As a result these algorithms have been the subject of study by researchers and graduate students around the world for years both in academia and in industry.

Things got interesting about five or ten years ago when datasets grew beyond the size of working memory and “Big Data” became a buzzword.  Parallel and distributed solutions for these algorithms have become the norm, and a researcher’s skillset now has to extend beyond linear algebra and optimization theory to include parallel algorithms and possibly even network programming, especially if you want to explore and create more interesting algorithms.

However, relatively few people understand both mathematical optimization theory and the details of distributed systems. Typically algorithmic researchers depend on the APIs of distributed computing libraries like Spark or Flink to implement their algorithms. In this blogpost we explore the extent to which Dask can be helpful in these applications. We approach this from two perspectives:

  1. Algorithmic researcher (Chris): someone who knows optimization and iterative algorithms like Conjugate Gradient, Dual Ascent, or GMRES but isn’t so hot on distributed computing topics like sockets, MPI, load balancing, and so on
  2. Distributed systems developer (Matt): someone who knows how to move bytes around and keep machines busy but doesn’t know the right way to do a line search or handle a poorly conditioned matrix

Prototyping Algorithms in Dask

Given knowledge of algorithms and of NumPy array computing it is easy to write parallel algorithms with Dask. For a range of complicated algorithmic structures we have two straightforward choices:

  1. Use parallel multi-dimensional arrays to construct algorithms from common operations like matrix multiplication, SVD, and so on. This mirrors mathematical algorithms well but lacks some flexibility.
  2. Create algorithms by hand that track operations on individual chunks of in-memory data and dependencies between them. This is very flexible but requires a bit more care.

Coding up either of these options from scratch can be a daunting task, but with Dask it can be as simple as writing NumPy code.

Let’s build up an example of fitting a large linear regression model using both built-in array parallelism and fancier, more customized parallelization features that Dask offers. The dask.array module helps us to easily parallelize standard NumPy functionality using the same syntax – we’ll start there.

Data Creation

Dask has many ways to create dask arrays; to get us started quickly prototyping let’s create some random data in a way that should look familiar to NumPy users.

import dask
import dask.array as da
import numpy as np
from dask.distributed import Client

client = Client()

## create inputs with a bunch of independent normals
beta = np.random.random(100)  # random beta coefficients, no intercept
X = da.random.normal(0, 1, size=(1000000, 100), chunks=(100000, 100))
y = X.dot(beta) + da.random.normal(0, 1, size=1000000, chunks=(100000,))

## make sure all chunks are ~equally sized
X, y = dask.persist(X, y)
client.rebalance([X, y])

Observe that X is a dask array stored in 10 chunks, each of size (100000, 100). Also note that X.dot(beta) runs smoothly for both numpy and dask arrays, so we can write code that basically works in either world.

Caveat: If X is a numpy array and beta is a dask array, X.dot(beta) will output an in-memory numpy array. This is usually not desirable as you want to carefully choose when to load something into memory. One fix is to use multipledispatch to handle odd edge cases; for a starting example, check out the dot code here.

Dask also has convenient visualization features built in that we will leverage; below we visualize our data in its 10 independent chunks:

Create data for dask-glm computations

Array Programming

If you can write iterative array-based algorithms in NumPy, then you can write iterative parallel algorithms in Dask

As we’ve already seen, Dask inherits much of the NumPy API that we are familiar with, so we can write simple NumPy-style iterative optimization algorithms that will leverage the parallelism dask.array has built-in already. For example, if we want to naively fit a linear regression model on the data above, we are trying to solve the following convex optimization problem:

Recall that in non-degenerate situations this problem has a closed-form solution that is given by:
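$$\beta^* = (X^T X)^{-1} X^T y$$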

We can compute $\beta^*$ using the above formula with Dask:

## naive solution
beta_star = da.linalg.solve(X.T.dot(X), X.T.dot(y))

>>> abs(beta_star.compute() - beta).max()
0.0024817567237768179

Sometimes a direct solve is too costly, and we want to solve the above problem using only simple matrix-vector multiplications. To this end, let’s take this one step further and actually implement a gradient descent algorithm which exploits parallel matrix operations. Recall that gradient descent iteratively refines an initial estimate of beta via the update:
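$$\beta^{+} = \beta - \alpha \nabla f(\beta) = \beta - 2\alpha\, X^T (X\beta - y)$$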

where $\alpha$ can be chosen based on a number of different “step-size” rules; for the purposes of exposition, we will stick with a constant step-size:

## quick step-size calculation to guarantee convergence
_, s, _ = np.linalg.svd(2 * X.T.dot(X))
step_size = 1 / s - 1e-8

## define some parameters
max_steps = 100
tol = 1e-8
beta_hat = np.zeros(100)  # initial guess

for k in range(max_steps):
    Xbeta = X.dot(beta_hat)
    func = ((y - Xbeta) ** 2).sum()
    gradient = 2 * X.T.dot(Xbeta - y)

    ## Update
    obeta = beta_hat
    beta_hat = beta_hat - step_size * gradient
    new_func = ((y - X.dot(beta_hat)) ** 2).sum()
    beta_hat, func, new_func = dask.compute(beta_hat, func, new_func)  # <--- Dask code

    ## Check for convergence
    change = np.absolute(beta_hat - obeta).max()
    if change < tol:
        break

>>> abs(beta_hat - beta).max()
0.0024817567259038942

It’s worth noting that almost all of this code is exactly the same as the equivalent NumPy code. Because Dask.array and NumPy share the same API it’s pretty easy for people who are already comfortable with NumPy to get started with distributed algorithms right away. The only thing we had to change was how we produce our original data (da.random.normal instead of np.random.normal) and the call to dask.compute at the end of the update state. The dask.compute call tells Dask to go ahead and actually evaluate everything we’ve told it to do so far (Dask is lazy by default). Otherwise, all of the mathematical operations, matrix multiplies, slicing, and so on are exactly the same as with Numpy, except that Dask.array builds up a chunk-wise parallel computation for us and Dask.distributed can execute that computation in parallel.

To better appreciate all the scheduling that is happening in one update step of the above algorithm, here is a visualization of the computation necessary to compute beta_hat and the new function value new_func:

Gradient descent step Dask graph

Each rectangle is an in-memory chunk of our distributed array and every circle is a numpy function call on those in-memory chunks. The Dask scheduler determines where and when to run all of these computations on our cluster of machines (or just on the cores of our laptop).

Array Programming + dask.delayed

Now that we’ve seen how to use the built-in parallel algorithms offered by Dask.array, let’s go one step further and talk about writing more customized parallel algorithms. Many distributed “consensus” based algorithms in machine learning are based on the idea that each chunk of data can be processed independently in parallel, and send their guess for the optimal parameter value to some master node. The master then computes a consensus estimate for the optimal parameters and reports that back to all of the workers. Each worker then processes their chunk of data given this new information, and the process continues until convergence.

From a parallel computing perspective this is a pretty simple map-reduce procedure. Any distributed computing framework should be able to handle this easily. We’ll use this as a very simple example for how to use Dask’s more customizable parallel options.

One such algorithm is the Alternating Direction Method of Multipliers, or ADMM for short. For the sake of this post, we will consider the work done by each worker to be a black box.

We will also be considering a regularized version of the problem above, namely:

$\min_{\beta} \; \|y - X\beta\|_2^2 + \lambda \|\beta\|_1$

At the end of the day, all we will do is:

  • create NumPy functions which define how each chunk updates its parameter estimates
  • wrap those functions in dask.delayed
  • call dask.compute and process the individual estimates, again using NumPy

First we need to define some local functions that the chunks will use to update their individual parameter estimates, and import the black box local_update step from dask_glm; also, we will need the so-called shrinkage operator (which is the proximal operator for the $l1$-norm in our problem):

from dask_glm.algorithms import local_update

def local_f(beta, X, y, z, u, rho):
    return ((y - X.dot(beta))**2).sum() + (rho / 2) * np.dot(beta - z + u, beta - z + u)

def local_grad(beta, X, y, z, u, rho):
    return 2 * X.T.dot(X.dot(beta) - y) + rho * (beta - z + u)

def shrinkage(beta, t):
    return np.maximum(0, beta - t) - np.maximum(0, -beta - t)

## set some algorithm parameters
max_steps = 10
lamduh = 7.2
rho = 1.0

(n, p) = X.shape
nchunks = X.npartitions

XD = X.to_delayed().flatten().tolist()  # A list of pointers to remote numpy arrays
yD = y.to_delayed().flatten().tolist()  # ... one for each chunk

# the initial consensus estimate
z = np.zeros(p)

# an array of the individual "dual variables" and parameter estimates,
# one for each chunk of data
u = np.array([np.zeros(p) for i in range(nchunks)])
betas = np.array([np.zeros(p) for i in range(nchunks)])

for k in range(max_steps):

    # process each chunk in parallel, using the black-box 'local_update' magic
    new_betas = [dask.delayed(local_update)(xx, yy, bb, z, uu, rho,
                                            f=local_f, fprime=local_grad)
                 for xx, yy, bb, uu in zip(XD, yD, betas, u)]
    new_betas = np.array(dask.compute(*new_betas))

    # everything else is NumPy code occurring at "master"
    beta_hat = 0.9 * new_betas + 0.1 * z

    # create consensus estimate
    zold = z.copy()
    ztilde = np.mean(beta_hat + np.array(u), axis=0)
    z = shrinkage(ztilde, lamduh / (rho * nchunks))

    # update dual variables
    u += beta_hat - z

>>> # Number of coefficients zeroed out due to L1 regularization
>>> print((z == 0).sum())
12

There is of course a little bit more work occurring in the above algorithm, but it should be clear that the distributed operations are not the difficult piece. Using dask.delayed we were able to express a simple map-reduce algorithm like ADMM with similarly simple Python for loops and delayed function calls. Dask.delayed is keeping track of all of the function calls we wanted to make and what other function calls they depend on. For example, all of the local_update calls can happen independently of each other, but the consensus computation blocks on all of them.
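A stripped-down sketch of that dependency structure (toy functions standing in for the real per-chunk update, not the ADMM code above):

import numpy as np
import dask

@dask.delayed
def local_work(chunk):
    return chunk.mean()               # stand-in for the per-chunk update

@dask.delayed
def consensus(*partials):
    return np.mean(partials)          # depends on, and blocks on, every local_work result

chunks = [np.arange(i, i + 5) for i in range(0, 20, 5)]
partials = [local_work(c) for c in chunks]    # independent; free to run in parallel
result = consensus(*partials).compute()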

We hope that both parallel algorithms shown above (gradient descent, ADMM) were straightforward to someone reading with an optimization background. These implementations run well on a laptop, a single multi-core workstation, or a thousand-node cluster if necessary. We’ve been building somewhat more sophisticated implementations of these algorithms (and others) in dask-glm. They are more sophisticated from an optimization perspective (stopping criteria, step size, asynchronicity, and so on) but remain as simple from a distributed computing perspective.

Experiment

We compare dask-glm implementations against Scikit-learn on a laptop, and then show them running on a cluster.

Reproducible notebook is available here

We’re building more sophisticated versions of the algorithms above in dask-glm.  This project has convex optimization algorithms for gradient descent, proximal gradient descent, Newton’s method, and ADMM.  These implementations extend the implementations above by also thinking about stopping criteria, step sizes, and other niceties that we avoided above for simplicity.

In this section we show off these algorithms by performing a simple numerical experiment that compares the numerical performance of proximal gradient descent and ADMM alongside Scikit-Learn’s LogisticRegression and SGD implementations on a single machine (a personal laptop) and then follows up by scaling the dask-glm options to a moderate cluster.  

Disclaimer: These experiments are crude. We’re using artificial data, we’re not tuning parameters or even finding parameters at which these algorithms are producing results of the same accuracy. The goal of this section is just to give a general feeling of how things compare.

We create data

## size of problem (no. observations)
N = 8e6
chunks = 1e6
seed = 20009
beta = (np.random.random(15) - 0.5) * 3

X = da.random.random((N, len(beta)), chunks=chunks)
y = make_y(X, beta=np.array(beta), chunks=chunks)

X, y = dask.persist(X, y)
client.rebalance([X, y])

And run each of our algorithms as follows:

# Dask-GLM Proximal Gradient
result = proximal_grad(X, y, lamduh=alpha)

# Dask-GLM ADMM
X2 = X.rechunk((1e5, None)).persist()  # ADMM prefers smaller chunks
y2 = y.rechunk(1e5).persist()
result = admm(X2, y2, lamduh=alpha)

# Scikit-Learn LogisticRegression
nX, ny = dask.compute(X, y)  # sklearn wants numpy arrays
result = LogisticRegression(penalty='l1', C=1).fit(nX, ny).coef_

# Scikit-Learn Stochastic Gradient Descent
result = SGDClassifier(loss='log', penalty='l1', l1_ratio=1,
                       n_iter=10, fit_intercept=False).fit(nX, ny).coef_

We then compare with the $L_{\infty}$ norm (the largest absolute difference).

abs(result-beta).max()

Times and $L_\infty$ distance from the true “generative beta” for these parameters are shown in the table below:

Algorithm             Error     Duration (s)
Proximal Gradient     0.0227    128
ADMM                  0.0125    34.7
LogisticRegression    0.0132    79
SGDClassifier         0.0456    29.4

Again, please don’t take these numbers too seriously: these algorithms all solve regularized problems, so we don’t expect the results to necessarily be close to the underlying generative beta (even asymptotically). The numbers above are meant to demonstrate that they all return results which were roughly the same distance from the beta above. Also, Dask-glm is using a full four-core laptop while SKLearn is restricted to a single core.

In the sections below we include profile plots for proximal gradient and ADMM. These show the operations that each of eight threads was doing over time. You can mouse-over rectangles/tasks and zoom in using the zoom tools in the upper right. You can see the difference in complexity of the algorithms. ADMM is much simpler from Dask’s perspective but also saturates hardware better for this chunksize.

Profile Plot for Proximal Gradient Descent

Profile Plot for ADMM

The general takeaway here is that dask-glm performs comparably to Scikit-Learn on a single machine. If your problem fits in memory on a single machine you should continue to use Scikit-Learn and Statsmodels. The real benefit to the dask-glm algorithms is that they scale and can run efficiently on data that is larger-than-memory by operating from disk on a single computer or on a cluster of computers working together.

Cluster Computing

As a demonstration, we run a larger version of the data above on a cluster of eight m4.2xlarges on EC2 (8 cores and 30GB of RAM each.)

We create a larger dataset with 800,000,000 rows and 15 columns across eight processes.

N = 8e8
chunks = 1e7
seed = 20009
beta = (np.random.random(15) - 0.5) * 3

X = da.random.random((N, len(beta)), chunks=chunks)
y = make_y(X, beta=np.array(beta), chunks=chunks)

X, y = dask.persist(X, y)

We then run the same proximal_grad and admm operations from before:

# Dask-GLM Proximal Gradient
result = proximal_grad(X, y, lamduh=alpha)

# Dask-GLM ADMM
X2 = X.rechunk((1e6, None)).persist()  # ADMM prefers smaller chunks
y2 = y.rechunk(1e6).persist()
result = admm(X2, y2, lamduh=alpha)

Proximal grad completes in around seventeen minutes while ADMM completes in around four minutes. Profiles for the two computations are included below:

Profile Plot for Proximal Gradient Descent

We include only the first few iterations here. Otherwise this plot is several megabytes.

Link to fullscreen plot

Profile Plot for ADMM

Link to fullscreen plot

These both obtained similar $L_{\infty}$ errors to what we observed before.

Algorithm             Error      Duration (s)
Proximal Gradient     0.0306     1020
ADMM                  0.00159    270

Although this time we had to be careful about a couple of things:

  1. We explicitly deleted the old data after rechunking (ADMM prefers different chunksizes than proximal_gradient) because our full dataset, 100GB, is close enough to our total distributed RAM (240GB) that it’s a good idea to avoid keeping replicas around needlessly. Things would have run fine, but spilling excess data to disk would have negatively affected performance.
  2. We set the OMP_NUM_THREADS=1 environment variable to avoid over-subscribing our CPUs. Surprisingly, not doing so led both to worse performance and to non-deterministic results, an issue that we’re still tracking down.

Analysis

The algorithms in Dask-GLM are new and need development, but are in a usable state by people comfortable operating at this technical level. Additionally, we would like to attract other mathematical and algorithmic developers to this work. We’ve found that Dask provides a nice balance between being flexible enough to support interesting algorithms, while being managed enough to be usable by researchers without a strong background in distributed systems. In this section we’re going to discuss the things that we learned from both Chris’ (mathematical algorithms) and Matt’s (distributed systems) perspective and then talk about possible future work. We encourage people to pay attention to future work; we’re open to collaboration and think that this is a good opportunity for new researchers to meaningfully engage.

Chris’s perspective

  1. Creating distributed algorithms with Dask was surprisingly easy; there is still a small learning curve around when to call things like persist, compute, rebalance, and so on, but that can’t be avoided. Using Dask for algorithm development has been a great learning environment for understanding the unique challenges associated with distributed algorithms (including communication costs, among others).
  2. Getting the particulars of algorithms correct is non-trivial; there is still work to be done in better understanding the tolerance settings vs. accuracy tradeoffs that are occurring in many of these algorithms, as well as fine-tuning the convergence criteria for increased precision.
  3. On the software development side, reliably testing optimization algorithms is hard. Finding provably correct optimality conditions that should be satisfied which are also numerically stable has been a challenge for me.
  4. Working on algorithms in isolation is not nearly as fun as collaborating on them; please join the conversation and contribute!
  5. Most importantly from my perspective, I’ve found there is a surprisingly large amount of misunderstanding in “the community” surrounding what optimization algorithms do in the world of predictive modeling, what problems they each individually solve, and whether or not they are interchangeable for a given problem. For example, Newton’s method can’t be used to optimize an l1-regularized problem, and the coefficient estimates from an l1-regularized problem are fundamentally (and numerically) different from those of an l2-regularized problem (and from those of an unregularized problem). My own personal goal is that the API for dask-glm exposes these subtle distinctions more transparently and leads to more thoughtful modeling decisions “in the wild”.

Matt’s perspective

This work triggered a number of concrete changes within the Dask library:

  1. We can convert Dask.dataframes to Dask.arrays. This is particularly important because people want to do pre-processing with dataframes but then switch to efficient multi-dimensional arrays for algorithms.
  2. We had to unify the single-machine scheduler and distributed scheduler APIs a bit, notably adding a persist function to the single machine scheduler. This was particularly important because Chris generally prototyped on his laptop but we wanted to write code that was effective on clusters.
  3. Scheduler overhead can be a problem for the iterative dask-array algorithms (gradient descent, proximal gradient descent, BFGS). This is particularly a problem because NumPy is very fast. Often our tasks take only a few milliseconds, which makes Dask’s overhead of 200us per task become very relevant (this is why you see whitespace in the profile plots above). We’ve started resolving this problem in a few ways like more aggressive task fusion and lower overheads generally, but this will be a medium-term challenge. In practice for dask-glm we’ve started handling this just by choosing chunksizes well. I suspect that for dask-glm in particular we’ll just develop auto-chunksize heuristics that will mostly solve this problem. However we expect this problem to recur in other work with scientists on HPC systems who have similar situations.
  4. A couple of things can be tricky for algorithmic users:
    1. Placing the calls to asynchronously start computation (persist, compute). In practice Chris did a good job here and then I came through and tweaked things afterwards. The web diagnostics ended up being crucial to identify issues.
    2. Avoiding accidentally calling NumPy functions on dask.arrays and vice versa. We’ve improved this on the dask.array side, and they now operate intelligently when given numpy arrays. Changing this on the NumPy side is harder until NumPy protocols change (which is planned).

Future work

There are a number of things we would like to do, both in terms of measurement and for the dask-glm project itself. We welcome people to voice their opinions (and join development) on the following issues:

  1. Asynchronous Algorithms
  2. User APIs
  3. Extend GLM families
  4. Write more extensive and rigorous algorithm tests, both for satisfying provable optimality criteria and for robustness to various input data
  5. Begin work on smart initialization routines

What is your perspective here, gentle reader? Both Matt and Chris can use help on this project. We hope that some of the issues above provide seeds for community engagement. We welcome other questions, comments, and contributions either as github issues or comments below.

Acknowledgements

Thanks also go to Hussain Sultan (Capital One) and Tom Augspurger for collaboration on Dask-GLM and to Will Warner (Continuum) for reviewing and editing this post.

Dask Release 0.14.1


This work is supported by Continuum Analytics, the XDATA Program, and the Data Driven Discovery Initiative from the Moore Foundation.

I’m pleased to announce the release of Dask version 0.14.1. This release contains a variety of performance and feature improvements. This blogpost includes some notable features and changes since the last release on February 27th.

As always you can conda install from conda-forge

conda install -c conda-forge dask distributed

or you can pip install from PyPI

pip install dask[complete] --upgrade

Arrays

Recent work in distributed computing and machine learning has motivated new performance-oriented and usability changes to how we handle arrays.

Automatic chunking and operation on NumPy arrays

Many interactions between Dask arrays and NumPy arrays now work smoothly. NumPy arrays are made lazy and are appropriately chunked to match the operation and the Dask array.

>>> x = np.ones(10)                 # a numpy array
>>> y = da.arange(10, chunks=(5,))  # a dask array
>>> z = x + y                       # combined become a dask.array
>>> z
dask.array<add, shape=(10,), dtype=float64, chunksize=(5,)>

>>> z.compute()
array([  1.,   2.,   3.,   4.,   5.,   6.,   7.,   8.,   9.,  10.])

Reshape

Reshaping distributed arrays is simple in simple cases, and can be quite complex in complex cases. Reshape now supports a much broader set of shape transformations where any dimension can be collapsed into or merged with other dimensions.

>>> x = da.ones((2, 3, 4, 5, 6), chunks=(2, 2, 2, 2, 2))
>>> x.reshape((6, 2, 2, 30, 1))
dask.array<reshape, shape=(6, 2, 2, 30, 1), dtype=float64, chunksize=(3, 1, 2, 6, 1)>

This operation ends up being quite useful in a number of distributed array cases.

Optimize Slicing to Minimize Communication

Dask.array slicing optimizations are now careful to produce graphs that avoid situations that could cause excess inter-worker communication. The details of how they do this are a bit out of scope for a short blogpost, but the history here is interesting.

Historically dask.arrays were used almost exclusively by researchers with large on-disk arrays stored as HDF5 or NetCDF files. These users primarily used the single machine multi-threaded scheduler. We heavily tailored Dask array optimizations to this situation and made that community pretty happy. Now as some of that community switches to cluster computing on larger datasets the optimization goals shift a bit. We have tons of distributed disk bandwidth but really want to avoid communicating large results between workers. Supporting both use cases is possible and I think that we’ve achieved that in this release so far, but it’s starting to require increasing levels of care.

Micro-optimizations

With distributed computing also comes larger graphs and a growing importance of graph-creation overhead. This has been optimized somewhat in this release. We expect this to be a focus going forward.

DataFrames

Set_index

Set_index is smarter in two ways:

  1. If you set_index on a column that happens to be sorted then we’ll identify that and avoid a costly shuffle. This was always possible with the sorted= keyword but users rarely used this feature. Now this is automatic.
  2. Similarly when setting the index we can look at the size of the data and determine if there are too many or too few partitions and rechunk the data while shuffling. This can significantly improve performance if there are too many partitions (a common case).
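For example, both improvements now kick in on an ordinary call like the following (a hypothetical snippet with a made-up path and column name, not taken from the changelog):

import dask.dataframe as dd

df = dd.read_csv('s3://bucket/2017-*.csv')   # hypothetical input

# If 'timestamp' happens to already be sorted, Dask detects this and skips the
# shuffle; it may also repartition to a more sensible number of partitions.
df = df.set_index('timestamp')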

Shuffle performance

We’ve micro-optimized some parts of dataframe shuffles. Big thanks to the Pandas developers for the help here. This accelerates set_index, joins, groupby-applies, and so on.

Fastparquet

The fastparquet library has seen a lot of use lately and has undergone a number of community bugfixes.

Importantly, Fastparquet now supports Python 2.

We strongly recommend Parquet as the standard data storage format for Dask dataframes (and Pandas DataFrames).
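As a quick sketch of the round trip (hypothetical paths; a rough example rather than an excerpt from the docs):

import dask.dataframe as dd

df = dd.read_csv('s3://bucket/2015-*.csv')           # hypothetical input
df.to_parquet('s3://bucket/2015.parquet')            # write, typically via fastparquet
df2 = dd.read_parquet('s3://bucket/2015.parquet')    # reads back much faster than CSV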

dask/fastparquet #87

Distributed Scheduler

Replay remote exceptions

Debugging is hard in part because exceptions happen on remote machines where normal debugging tools like pdb can’t reach. Previously we were able to bring back the traceback and exception, but you couldn’t dive into the stack trace to investigate what went wrong:

def div(x, y):
    return x / y

>>> future = client.submit(div, 1, 0)

>>> future
<Future: status: error, key: div-4a34907f5384bcf9161498a635311aeb>

>>> future.result()  # getting result re-raises exception locally
<ipython-input-3-398a43a7781e> in div()
      1 def div(x, y):
----> 2     return x / y

ZeroDivisionError: division by zero

Now Dask can bring a failing task and all necessary data back to the local machine and rerun it so that users can leverage the normal Python debugging toolchain.

>>> client.recreate_error_locally(future)
<ipython-input-3-398a43a7781e> in div(x, y)
      1 def div(x, y):
----> 2     return x / y

ZeroDivisionError: division by zero

Now if you’re in IPython or a Jupyter notebook you can use the %debug magic to jump into the stacktrace, investigate local variables, and so on.

In [8]: %debug
> <ipython-input-3-398a43a7781e>(2)div()
      1 def div(x, y):
----> 2     return x / y

ipdb> pp x
1
ipdb> pp y
0

dask/distributed #894

Async/await syntax

Dask.distributed uses Tornado for network communication and Tornado coroutines for concurrency. Normal users rarely interact with Tornado coroutines; they aren’t familiar to most people so we opted instead to copy the concurrent.futures API. However some complex situations are much easier to solve if you know a little bit of async programming.

Fortunately, the Python ecosystem seems to be embracing this change towards native async code with the async/await syntax in Python 3. In an effort to motivate people to learn async programming and to gently nudge them towards Python 3, Dask.distributed now supports async/await in a few cases.

You can wait on a dask Future

async def f():
    future = client.submit(func, *args, **kwargs)
    result = await future

You can put the as_completed iterator into an async for loop

async for future in as_completed(futures):
    result = await future
    ... do stuff with result ...

And, because Tornado supports the await protocols you can also use the existing shadow concurrency API (everything prepended with an underscore) with await. (This was doable before.)

results = client.gather(futures)         # synchronous
...
results = await client._gather(futures)  # asynchronous

If you’re in Python 2 you can always do this with normal yield and the tornado.gen.coroutine decorator.

dask/distributed #952

Inproc transport

In the last release we enabled Dask to communicate over more things than just TCP. In practice this doesn’t come up (TCP is pretty useful). However in this release we now support single-machine “clusters” where the clients, scheduler, and workers are all in the same process and transfer data cost-free over in-memory queues.

This allows the in-memory user community to use some of the more advanced features (asynchronous computation, spill-to-disk support, web-diagnostics) that are only available in the distributed scheduler.

This is on by default if you create a cluster with LocalCluster without using Nanny processes.

>>> from dask.distributed import LocalCluster, Client
>>> cluster = LocalCluster(nanny=False)
>>> client = Client(cluster)

>>> client
<Client: scheduler='inproc://192.168.1.115/8437/1' processes=1 cores=4>

>>> from threading import Lock         # Not serializable
>>> lock = Lock()                      # Won't survive going over a socket
>>> [future] = client.scatter([lock])  # Yet we can send to a worker
>>> future.result()                    # ... and back
<unlocked _thread.lock object at 0x7fb7f12d08a0>

dask/distributed #919

Connection pooling for inter-worker communications

Workers now maintain a pool of sustained connections between each other. This pool is of a fixed size and removes connections with a least-recently-used policy. It avoids re-connection delays when transferring data between workers. In practice this shaves off a millisecond or two from every communication.

This is actually a revival of an old feature that we had turned off last year when it became clear that the performance here wasn’t a problem.

Along with other enhancements, this takes our round-trip latency down to 11ms on my laptop.

In [10]: %%time
    ...: for i in range(1000):
    ...:     future = client.submit(inc, i)
    ...:     result = future.result()
    ...:
CPU times: user 4.96 s, sys: 348 ms, total: 5.31 s
Wall time: 11.1 s

There may be room for improvement here though. For comparison here is the same test with the concurrent.futures.ProcessPoolExecutor.

In [14]: e = ProcessPoolExecutor(8)

In [15]: %%time
    ...: for i in range(1000):
    ...:     future = e.submit(inc, i)
    ...:     result = future.result()
    ...:
CPU times: user 320 ms, sys: 56 ms, total: 376 ms
Wall time: 442 ms

Also, just to be clear, this measures total roundtrip latency, not overhead. Dask’s distributed scheduler overhead remains in the low hundreds of microseconds.

dask/distributed #935

There has been activity around Dask and machine learning:

  • dask-learn is undergoing some performance enhancements. It turns out that when you offer distributed grid search people quickly want to scale up their computations to hundreds of thousands of trials.
  • dask-glm now has a few decent algorithms for convex optimization. The authors of this wrote a blogpost very recently if you’re interested: Developing Convex Optimization Algorithms in Dask
  • dask-xgboost lets you hand off distributed data in Dask dataframes or arrays and hand it directly to a distributed XGBoost system (that Dask will nicely set up and tear down for you). This was a nice example of easy hand-off between two distributed services running in the same processes.

Acknowledgements

The following people contributed to the dask/dask repository since the 0.14.0 release on February 27th

  • Antoine Pitrou
  • Brian Martin
  • Elliott Sales de Andrade
  • Erik Welch
  • Francisco de la Peña
  • jakirkham
  • Jim Crist
  • Jitesh Kumar Jha
  • Julien Lhermitte
  • Martin Durant
  • Matthew Rocklin
  • Markus Gonser
  • Talmaj

The following people contributed to the dask/distributed repository since the 1.16.0 release on February 27th

  • Antoine Pitrou
  • Ben Schreck
  • Elliott Sales de Andrade
  • Martin Durant
  • Matthew Rocklin
  • Phil Elson

Dask and Pandas and XGBoost


This work is supported by Continuum Analytics, the XDATA Program, and the Data Driven Discovery Initiative from the Moore Foundation

Summary

This post talks about distributing Pandas Dataframes with Dask and then handing them over to distributed XGBoost for training.

More generally it discusses the value of launching multiple distributed systems in the same shared-memory processes and smoothly handing data back and forth between them.

Introduction

XGBoost is a well-loved library for a popular class of machine learning algorithms, gradient boosted trees. It is used widely in business and is one of the most popular solutions in Kaggle competitions. For larger datasets or faster training, XGBoost also comes with its own distributed computing system that lets it scale to multiple machines on a cluster. Fantastic. Distributed gradient boosted trees are in high demand.

However before we can use distributed XGBoost we need to do three things:

  1. Prepare and clean our possibly large data, probably with a lot of Pandas wrangling
  2. Set up XGBoost master and workers
  3. Hand our cleaned data from a bunch of distributed Pandas dataframes to XGBoost workers across our cluster

This ends up being surprisingly easy. This blogpost gives a quick example using Dask.dataframe to do distributed Pandas data wrangling, then using a new dask-xgboost package to setup an XGBoost cluster inside the Dask cluster and perform the handoff.

After this example we’ll talk about the general design and what this means for other distributed systems.

Example

We have a ten-node cluster with eight cores each (m4.2xlarges on EC2)

import dask
from dask.distributed import Client, progress

>>> client = Client('172.31.33.0:8786')
>>> client.restart()
<Client: scheduler='tcp://172.31.33.0:8786' processes=10 cores=80>

We load the Airlines dataset using dask.dataframe (just a bunch of Pandas dataframes spread across a cluster) and do a bit of preprocessing:

import dask.dataframe as dd

# Subset of the columns to use
cols = ['Year', 'Month', 'DayOfWeek', 'Distance',
        'DepDelay', 'CRSDepTime', 'UniqueCarrier', 'Origin', 'Dest']

# Create the dataframe
df = dd.read_csv('s3://dask-data/airline-data/20*.csv',
                 usecols=cols,
                 storage_options={'anon': True})

df = df.sample(frac=0.2)  # XGBoost requires a bit of RAM, we need a larger cluster

is_delayed = (df.DepDelay.fillna(16) > 15)  # column of labels
del df['DepDelay']  # Remove delay information from training dataframe

df['CRSDepTime'] = df['CRSDepTime'].clip(upper=2399)

df, is_delayed = dask.persist(df, is_delayed)  # start work in the background

This loaded a few hundred pandas dataframes from CSV data on S3. We then had to downsample because how we are going to use XGBoost in the future seems to require a lot of RAM. I am not an XGBoost expert. Please forgive my ignorance here. At the end we have two dataframes:

  • df: Data from which we will learn if flights are delayed
  • is_delayed: Whether or not those flights were delayed.

Data scientists familiar with Pandas will probably be familiar with the code above. Dask.dataframe is very similar to Pandas, but operates on a cluster.

>>> df.head()

        Year  Month  DayOfWeek  CRSDepTime UniqueCarrier Origin Dest  Distance
182193  2000      1          2         800            WN    LAX  OAK       337
83424   2000      1          6        1650            DL    SJC  SLC       585
346781  2000      1          5        1140            AA    ORD  LAX      1745
375935  2000      1          2        1940            DL    PHL  ATL       665
309373  2000      1          4        1028            CO    MCI  IAH       643
>>> is_delayed.head()
182193    False
83424     False
346781    False
375935    False
309373    False
Name: DepDelay, dtype: bool

Categorize and One Hot Encode

XGBoost doesn’t want to work with text data like destination=”LAX”. Instead we create new indicator columns for each of the known airports and carriers. This expands our data into many boolean columns. Fortunately Dask.dataframe has convenience functions for all of this baked in (thank you Pandas!)

>>> df2 = dd.get_dummies(df.categorize()).persist()

This expands our data out considerably, but makes it easier to train on.

>>> len(df2.columns)
685

Split and Train

Great, now we’re ready to split our distributed dataframes

data_train, data_test = df2.random_split([0.9, 0.1], random_state=1234)
labels_train, labels_test = is_delayed.random_split([0.9, 0.1], random_state=1234)

Start up a distributed XGBoost instance, and train on this data

%%time
import dask_xgboost as dxgb

params = {'objective': 'binary:logistic', 'nround': 1000,
          'max_depth': 16, 'eta': 0.01, 'subsample': 0.5,
          'min_child_weight': 1, 'tree_method': 'hist',
          'grow_policy': 'lossguide'}

bst = dxgb.train(client, params, data_train, labels_train)

CPU times: user 355 ms, sys: 29.7 ms, total: 385 ms
Wall time: 54.5 s

Great, so we were able to train an XGBoost model on this data in about a minute using our ten machines. What we get back is just a plain XGBoost Booster object.

>>> bst
<xgboost.core.Booster at 0x7fa1c18c4c18>

We could use this on normal Pandas data locally

import xgboost as xgb

pandas_df = data_test.head()
dtest = xgb.DMatrix(pandas_df)

>>> bst.predict(dtest)
array([ 0.464578  ,  0.46631625,  0.47434333,  0.47245741,  0.46194169], dtype=float32)

Or we can use dask-xgboost again to train on our distributed holdout data, getting back another Dask series.

>>> predictions = dxgb.predict(client, bst, data_test).persist()
>>> predictions
Dask Series Structure:
npartitions=93
None    float32
None        ...
         ...
None        ...
None        ...
Name: predictions, dtype: float32
Dask Name: _predict_part, 93 tasks

Evaluate

We can bring these predictions to the local process and use normal Scikit-learn operations to evaluate the results.

>>> from sklearn.metrics import roc_auc_score, roc_curve
>>> print(roc_auc_score(labels_test.compute(),
...                     predictions.compute()))
0.654800768411

fpr, tpr, _ = roc_curve(labels_test.compute(), predictions.compute())

# Taken from
# http://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html#sphx-glr-auto-examples-model-selection-plot-roc-py
plt.figure(figsize=(8, 8))
lw = 2
plt.plot(fpr, tpr, color='darkorange', lw=lw, label='ROC curve')
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic example')
plt.legend(loc="lower right")
plt.show()

We might want to play with our parameters above or try different data to improve our solution. The point here isn’t that we predicted airline delays well, it was that if you are a data scientist who knows Pandas and XGBoost, everything we did above seemed pretty familiar. There wasn’t a whole lot of new material in the example above. We’re using the same tools as before, just at a larger scale.

Analysis

OK, now that we’ve demonstrated that this works let’s talk a bit about what just happened and what that means generally for cooperation between distributed services.

What dask-xgboost does

The dask-xgboost project is pretty small and pretty simple (200 TLOC). Given a Dask cluster of one central scheduler and several distributed workers it starts up an XGBoost scheduler in the same process running the Dask scheduler and starts up an XGBoost worker within each of the Dask workers. They share the same physical processes and memory spaces. Dask was built to support this kind of situation, so this is relatively easy.
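As a rough sketch of that co-location pattern (not the actual dask-xgboost code; the sidecar module and its start_worker call below are placeholders), client.run is one way to start another system’s worker inside every Dask worker process:

from dask.distributed import Client

client = Client('172.31.33.0:8786')          # same cluster as above

def start_sidecar_worker(host, port):
    # Placeholder: start the other system's worker inside this Dask worker
    # process so the two systems share processes and memory
    import some_distributed_system           # hypothetical package
    some_distributed_system.start_worker(host, port)

client.run(start_sidecar_worker, '172.31.33.0', 9091)   # runs once on every worker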

Then we ask the Dask.dataframe to fully materialize in RAM and we ask where all of the constituent Pandas dataframes live. We tell each Dask worker to give all of the Pandas dataframes that it has to its local XGBoost worker and then just let XGBoost do its thing. Dask doesn’t power XGBoost, it just sets it up, gives it data, and lets it do its work in the background.

People often ask what machine learning capabilities Dask provides and how they compare with other distributed machine learning libraries like H2O or Spark’s MLLib. For gradient boosted trees the 200-line dask-xgboost package is the answer. Dask has no need to make such an algorithm because XGBoost already exists, works well, and provides Dask users with a fully featured and efficient solution.

Because both Dask and XGBoost can live in the same Python process they can share bytes between each other without cost, can monitor each other, etc.. These two distributed systems co-exist together in multiple processes in the same way that NumPy and Pandas operate together within a single process. Sharing distributed processes with multiple systems can be really beneficial if you want to use multiple specialized services easily and avoid large monolithic frameworks.

Connecting to Other distributed systems

A while ago I wrote a similar blogpost about hosting TensorFlow from Dask in exactly the same way that we’ve done here. It was similarly easy to setup TensorFlow alongside Dask, feed it data, and let TensorFlow do its thing.

Generally speaking this “serve other libraries” approach is how Dask operates when possible. We’re only able to cover the breadth of functionality that we do today because we lean heavily on the existing open source ecosystem. Dask.arrays use Numpy arrays, Dask.dataframes use Pandas, and now the answer to gradient boosted trees with Dask is just to make it really really easy to use distributed XGBoost. Ta da! We get a fully featured solution that is maintained by other devoted developers, and the entire connection process was done over a weekend (see dmlc/xgboost #2032 for details).

Since this has come out we’ve had requests to support other distributed systems like Elemental and to do general hand-offs to MPI computations. If we’re able to start both systems with the same set of processes then all of this is pretty doable. Many of the challenges of inter-system collaboration go away when you can hand numpy arrays between the workers of one system to the workers of the other system within the same processes.

Acknowledgements

Thanks to Tianqi Chen and Olivier Grisel for their help when building and testing dask-xgboost. Thanks to Will Warner for his help in editing this post.

Streaming Python Prototype


This work is supported by Continuum Analytics, and the Data Driven Discovery Initiative from the Moore Foundation.

This blogpost is about experimental software. The project may change or be abandoned without warning. You should not depend on anything within this blogpost.

This week I built a small streaming library for Python. This was originally an exercise to help me understand streaming systems like Storm, Flink, Spark-Streaming, and Beam, but the end result of this experiment is not entirely useless, so I thought I’d share it. This blogpost will talk about my experience building such a system and what I valued when using it. Hopefully it elevates interest in streaming systems among the Python community.

Background with Iterators

Python has sequences and iterators. We’re used to mapping, filtering and aggregating over lists and generators happily.

seq = [1, 2, 3, 4, 5]
seq = map(inc, seq)
seq = filter(iseven, seq)

>>> sum(seq)  # 2 + 4 + 6
12

If these iterators are infinite, for example if they are coming from some infinite data feed like a hardware sensor or stock market signal then most of these pieces still work except for the final aggregation, which we replace with an accumulating aggregation.

def get_data():
    i = 0
    while True:
        i += 1
        yield i

seq = get_data()
seq = map(inc, seq)
seq = filter(iseven, seq)
seq = accumulate(lambda total, x: total + x, seq)

>>> next(seq)  # 2
2
>>> next(seq)  # 2 + 4
6
>>> next(seq)  # 2 + 4 + 6
12

This is usually a fine way to handle infinite data streams. However this approach becomes awkward if you don’t want to block on calling next(seq) and have your program hang until new data comes in. This approach also becomes awkward when you want to branch off your sequence to multiple outputs and consume from multiple inputs. Additionally there are operations like rate limiting, time windowing, etc. that occur frequently but are tricky to implement if you are not comfortable using threads and queues. These complications often push people to a computation model that goes by the name streaming.
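To make that awkwardness concrete, here is one way branching plays out with plain iterators (a toy sketch using itertools.tee):

import itertools

seq = iter(range(10))                          # stand-in for an infinite feed
evens_src, odds_src = itertools.tee(seq, 2)    # branch the iterator
evens = (x for x in evens_src if x % 2 == 0)
odds = (x for x in odds_src if x % 2 == 1)

# Each branch still blocks on next(), and tee buffers values without bound
# if one branch is consumed much faster than the other.
print(next(evens), next(odds))                 # 0 1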

To introduce streaming systems in this blogpost I’ll use my new tiny library, currently called streams (better name to come in the future). However if you decide to use streaming systems in your workplace then you should probably use some other more mature library instead. Common recommendations include the following:

  • ReactiveX (RxPy)
  • Flink
  • Storm (Streamparse)
  • Beam
  • Spark Streaming

Streams

We make a stream, which is an infinite sequence of data into which we can emit values and from which we can subscribe to make new streams.

from streams import Stream

source = Stream()

From here we replicate our example above. This follows the standard map/filter/reduce chaining API.

s = (source.map(inc)
           .filter(iseven)
           .accumulate(lambda total, x: total + x))

Note that we haven’t pushed any data into this stream yet, nor have we said what should happen when data leaves. So that we can look at results, let’s make a list and push data into it when data leaves the stream.

results = []
s.sink(results.append)  # call the append method on every element leaving the stream

And now let’s push some data in at the source and see it arrive at the sink:

>>> for x in [1, 2, 3, 4, 5]:
...     source.emit(x)

>>> results
[2, 6, 12]

We’ve accomplished the same result as our infinite iterator, except that rather than pulling data with next we push data through with source.emit. And we’ve done all of this at only a 10x slowdown over normal Python iterators :) (this library takes a few microseconds per element rather than CPython’s normal 100ns overhead).

This will get more interesting in the next few sections.

Branching

This approach becomes more interesting if we add multiple inputs and outputs.

source = Stream()
s = source.map(inc)

evens = s.filter(iseven)
evens.accumulate(add)

odds = s.filter(isodd)
odds.accumulate(sub)

Or we can combine streams together

second_source = Stream()
s = combine_latest(second_source, odds).map(sum)

So you may have multiple different input sources updating at different rates and you may have multiple outputs, perhaps some going to a diagnostics dashboard, others going to long-term storage, others going to a database, etc.. A streaming library makes it relatively easy to set up infrastructure and pipe everything to the right locations.

Time and Back Pressure

When dealing with systems that produce and consume data continuously you often want to control the flow so that the rates of production are not greater than the rates of consumption. For example if you can only write data to a database at 10MB/s or if you can only make 5000 web requests an hour then you want to make sure that the other parts of the pipeline don’t feed you too much data, too quickly, which would eventually lead to a buildup in one place.

To deal with this, as our operations push data forward they also accept Tornado Futures as a receipt.

Upstream: Hey Downstream! Here is some data for you
Downstream: Thanks Upstream!  Let me give you a Tornado future in return.
            Make sure you don't send me any more data until that future
            finishes.
Upstream: Got it, Thanks!  I will pass this to the person who gave me the
          data that I just gave to you.

Under normal operation you don’t need to think about Tornado futures at all (many Python users aren’t familiar with asynchronous programming) but it’s nice to know that the library will keep track of balancing out flow. The code below uses @gen.coroutine and yield, as is common for Tornado coroutines. This is similar to the async/await syntax in Python 3. Again, you can safely ignore it if you’re not familiar with asynchronous programming.

@gen.coroutine
def write_to_database(data):
    with connect('my-database:1234/table') as db:
        yield db.write(data)

source = Stream()
(source.map(...)
       .accumulate(...)
       .sink(write_to_database))  # <- sink produces a Tornado future

for data in infinite_feed:
    yield source.emit(data)       # <- that future passes through everything
                                  #    and ends up here to be waited on

There are also a number of operations to help you buffer flow in the right spots, control rate limiting, etc..

source = Stream()

(source.timed_window(interval=0.050)  # Capture all records of the last 50ms into batches
       .filter(len)                   # Remove empty batches
       .map(...)                      # Do work on each batch
       .buffer(10)                    # Allow ten batches to pile up here
       .sink(write_to_database))      # Potentially rate-limiting stage

I’ve written enough little utilities like timed_window and buffer to discover both that in a full system you would want more of these, and that they are easy to write. Here is the definition of timed_window

class timed_window(Stream):
    def __init__(self, interval, child, loop=None):
        self.interval = interval
        self.buffer = []
        self.last = gen.moment

        Stream.__init__(self, child, loop=loop)
        self.loop.add_callback(self.cb)

    def update(self, x, who=None):
        self.buffer.append(x)
        return self.last

    @gen.coroutine
    def cb(self):
        while True:
            L, self.buffer = self.buffer, []
            self.last = self.emit(L)
            yield self.last
            yield gen.sleep(self.interval)

If you are comfortable with Tornado coroutines or asyncio then my hope is that this should feel natural.

Recursion and Feedback

By connecting the sink of one stream to the emit function of another we can create feedback loops. Here is a stream that produces the Fibonacci sequence. To stop it from overwhelming our local process we added in a rate limiting step:

from streams import Stream

source = Stream()
s = source.sliding_window(2).map(sum)
L = s.sink_to_list()                 # store result in a list

s.rate_limit(0.5).sink(source.emit)  # pipe output back to input

source.emit(0)  # seed with initial values
source.emit(1)

>>> L
[1, 2, 3, 5]

>>> L  # wait a couple seconds, then check again
[1, 2, 3, 5, 8, 13, 21, 34]

>>> L  # wait a couple seconds, then check again
[1, 2, 3, 5, 8, 13, 21, 34, 55, 89]

Note: due to the time rate-limiting functionality this example relied on an event loop running somewhere in another thread. This is the case for example in a Jupyter notebook, or if you have a Dask Client running.

Things that this doesn’t do

If you are familiar with streaming systems then you may say the following:

Let’s not get ahead of ourselves; there’s way more to a good streaming system than what is presented here. You need to handle parallelism, fault tolerance, out-of-order elements, event/processing times, etc..

… and you would be entirely correct. What is presented here is not in any way a competitor to existing systems like Flink for production-level data engineering problems. There is a lot of logic that hasn’t been built here (and it’s good to remember that this project was built at night over a week).

Although some of those things, and in particular the distributed computing bits, we may get for free.

Distributed computing

So, during the day I work on Dask, a Python library for parallel and distributed computing. The core task schedulers within Dask are more than capable of running these kinds of real-time computations. They handle far more complex real-time systems every day including few-millisecond latencies, node failures, asynchronous computation, etc.. People use these features today inside companies, but they tend to roll their own system rather than use a high-level API (indeed, they chose Dask because their system was complex enough or private enough that rolling their own was a necessity). Dask lacks any kind of high-level streaming API today.

Fortunately, the system we described above can be modified fairly easily to use a Dask Client to submit functions rather than run them locally.

from dask.distributed import Client

client = Client()           # start Dask in the background

(source.to_dask()
       .scatter()           # send data to a cluster
       .map(...)            # this happens on the cluster
       .accumulate(...)     # this happens on the cluster
       .gather()            # gather results back to local machine
       .sink(...))          # This happens locally

Other things that this doesn’t do, but could with modest effort

There are a variety of ways that we could improve this with modest cost:

  1. Streams of sequences: We can be more efficient if we pass not individual elements through a Stream, but rather lists of elements. This will let us lose the microseconds of overhead that we have now per element and let us operate at pure Python (100ns) speeds.
  2. Streams of NumPy arrays / Pandas dataframes: Rather than pass individual records we might pass bits of Pandas dataframes through the stream. So for example rather than filtering elements we would filter out rows of the dataframe. Rather than compute at Python speeds we can compute at C speeds. We’ve built a lot of this logic before for dask.dataframe. Doing this again is straightforward but somewhat time consuming.
  3. Annotate elements: we want to pass through event time, processing time, and presumably other metadata
  4. Convenient Data IO utilities: We would need some convenient way to move data in and out of Kafka and other common continuous data streams.

None of these things are hard. Many of them are afternoon or weekend projects if anyone wants to pitch in.

Reasons I like this project

This was originally built strictly for educational purposes. I (and hopefully you) now know a bit more about streaming systems, so I’m calling it a success. It wasn’t designed to compete with existing streaming systems, but still there are some aspects of it that I like quite a bit and want to highlight.

  1. Lightweight setup: You can import it and go without setting up any infrastructure. It can run (in a limited way) on a Dask cluster or on an event loop, but it’s also fully operational in your local Python thread. There is no magic in the common case. Everything up until time-handling runs with tools that you learn in an introductory programming class.
  2. Small and maintainable: The codebase is currently a few hundred lines. It is also, I claim, easy for other people to understand. Here is the code for filter:

    class filter(Stream):
        def __init__(self, predicate, child):
            self.predicate = predicate
            Stream.__init__(self, child)

        def update(self, x, who=None):
            if self.predicate(x):
                return self.emit(x)
  3. Composable with Dask: Handling distributed computing is tricky to do well. Fortunately this project can offload much of that worry to Dask. The dividing line between the two systems is pretty clear and, I think, could lead to a decently powerful and maintainable system if we spend time here.
  4. Low performance overhead: Because this project is so simple it has overheads in the few-microseconds range when in a single process.
  5. Pythonic: All other streaming systems were originally designed for Java/Scala engineers. While they have APIs that are clearly well thought through they are sometimes not ideal for Python users or common Python applications.

Future Work

This project needs both users and developers.

I find it fun and satisfying to work on and so encourage others to play around. The codebase is short and, I think, easily digestible in an hour or two.

This project was built without a real use case (see the project’s examples directory for a basic Daskified web crawler). It could use patient users with real-world use cases to test-drive things and hopefully provide PRs adding necessary features.

I genuinely don’t know if this project is worth pursuing. This blogpost is a test to see if people have sufficient interest to use and contribute to such a library or if the best solution is to carry on with any of the fine solutions that already exist.

pip install git+https://github.com/mrocklin/streams

Asynchronous Optimization Algorithms with Dask


This work is supported by Continuum Analytics, the XDATA Program, and the Data Driven Discovery Initiative from the Moore Foundation.

Summary

In a previous post we built convex optimization algorithms with Dask that ran efficiently on a distributed cluster and were important for a broad class of statistical and machine learning algorithms.

We now extend that work by looking at asynchronous algorithms. We show the following:

  1. APIs within Dask to build asynchronous computations generally, not just for machine learning and optimization
  2. Reasons why asynchronous algorithms are valuable in machine learning
  3. A concrete asynchronous algorithm (Async ADMM) and its performance on a toy dataset

This blogpost is co-authored by Chris White (Capital One) who knows optimization and Matthew Rocklin (Continuum Analytics) who knows distributed computing.

Reproducible notebook available here

Asynchronous vs Blocking Algorithms

When we say asynchronous we contrast it against synchronous or blocking.

In a blocking algorithm you send out a bunch of work and then wait for the result. Dask’s normal .compute() interface is blocking. Consider the following computation where we score a bunch of inputs in parallel and then find the best:

import dask

scores = [dask.delayed(score)(x) for x in L]  # many lazy calls to the score function
best = dask.delayed(max)(scores)
best = best.compute()  # Trigger all computation and wait until complete

This blocks. We can’t do anything while it runs. If we’re in a Jupyter notebook we’ll see a little asterisk telling us that we have to wait.

A Jupyter notebook cell blocking on a dask computation

In a non-blocking or asynchronous algorithm we send out work and track results as they come in. We are still able to run commands locally while our computations run in the background (or on other computers in the cluster). Dask has a variety of asynchronous APIs, but the simplest is probably the concurrent.futures API where we submit functions and then can wait and act on their return.

from dask.distributed import Client, as_completed
client = Client('scheduler-address:8786')

# Send out several computations
futures = [client.submit(score, x) for x in L]

# Find max as results arrive
best = 0
for future in as_completed(futures):
    score = future.result()
    if score > best:
        best = score

These two solutions are computationally equivalent. They do the same work and run in the same amount of time. The blocking dask.delayed solution is probably simpler to write down but the non-blocking futures + as_completed solution lets us be more flexible.

For example, if we get a score that is good enough then we might stop early. If we find that certain kinds of values are giving better scores than others then we might submit more computations around those values while cancelling others, changing our computation during execution.
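For instance, a minimal sketch of early stopping (continuing from the futures in the snippet above; the 0.99 threshold is arbitrary):

from dask.distributed import as_completed

best = 0
for future in as_completed(futures):
    score = future.result()
    best = max(best, score)
    if best > 0.99:               # hypothetical "good enough" threshold
        for f in futures:
            f.cancel()            # cancel work we no longer need
        break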

This ability to monitor and adapt a computation during execution is one reason why people choose asynchronous algorithms. In the case of optimization algorithms we are doing a search process and frequently updating parameters. If we are able to update those parameters more frequently then we may be able to slightly improve every subsequently launched computation. Asynchronous algorithms enable increased flow of information around the cluster in comparison to more lock-step batch-iterative algorithms.

Asynchronous ADMM

In our last blogpost we showed a simplified implementation of Alternating Direction Method of Multipliers (ADMM) with dask.delayed. We saw that in a distributed context it performed well when compared to a more traditional distributed gradient descent. This algorithm works by solving a small optimization problem on every chunk of our data using our current parameter estimates, bringing these back to the local process, combining them, and then sending out new computation on updated parameters.

Now we alter this algorithm to update asynchronously, so that our parameters change continuously as partial results come in in real-time. Instead of sending out and waiting on batches of results, we now consume and emit a constant stream of tasks with slightly improved parameter estimates.

We show three algorithms in sequence:

  1. Synchronous: The original synchronous algorithm
  2. Asynchronous-single: updates parameters with every new result
  3. Asynchronous-batched: updates with all results that have come in since we last updated.

Setup

We create fake data

n, k, chunksize = 50000000, 100, 50000

beta = np.random.random(k)  # random beta coefficients, no intercept
zero_idx = np.random.choice(len(beta), size=10)
beta[zero_idx] = 0  # set some parameters to 0

X = da.random.normal(0, 1, size=(n, k), chunks=(chunksize, k))
y = X.dot(beta) + da.random.normal(0, 2, size=n, chunks=(chunksize,))  # add noise

X, y = persist(X, y)  # trigger computation in the background

We define local functions for ADMM. These correspond to solving an l1-regularized Linear regression problem:

def local_f(beta, X, y, z, u, rho):
    return ((y - X.dot(beta))**2).sum() + (rho / 2) * np.dot(beta - z + u, beta - z + u)

def local_grad(beta, X, y, z, u, rho):
    return 2 * X.T.dot(X.dot(beta) - y) + rho * (beta - z + u)

def shrinkage(beta, t):
    return np.maximum(0, beta - t) - np.maximum(0, -beta - t)

local_update2 = partial(local_update, f=local_f, fprime=local_grad)

lamduh = 7.2  # regularization parameter

# algorithm parameters
rho = 1.2
abstol = 1e-4
reltol = 1e-2

z = np.zeros(p)  # the initial consensus estimate

# an array of the individual "dual variables" and parameter estimates,
# one for each chunk of data
u = np.array([np.zeros(p) for i in range(nchunks)])
betas = np.array([np.zeros(p) for i in range(nchunks)])

Finally, because ADMM doesn’t want to work on distributed arrays, but instead on lists of remote numpy arrays (one numpy array per chunk of the dask.array), we convert each of our dask.arrays into a list of dask.delayed objects:

XD = X.to_delayed().flatten().tolist()  # a list of numpy arrays, one for each chunk
yD = y.to_delayed().flatten().tolist()

Synchronous ADMM

In this algorithm we send out many tasks to run, collect their results, update parameters, and repeat. In this simple implementation we continue for a fixed amount of time but in practice we would want to check some convergence criterion.

start = time.time()

while time() - start < MAX_TIME:
    # process each chunk in parallel, using the black-box 'local_update' function
    betas = [delayed(local_update2)(xx, yy, bb, z, uu, rho)
             for xx, yy, bb, uu in zip(XD, yD, betas, u)]
    betas = np.array(da.compute(*betas))  # collect results back

    # Update Parameters
    ztilde = np.mean(betas + np.array(u), axis=0)
    z = shrinkage(ztilde, lamduh / (rho * nchunks))
    u += betas - z  # update dual variables

    # track convergence metrics
    update_metrics()

Asynchronous ADMM

In the asynchronous version we send out only enough tasks to occupy all of our workers. We collect results one by one as they finish, update parameters, and then send out a new task.

# Submit enough tasks to occupy our current workers
starting_indices = np.random.choice(nchunks, size=ncores*2, replace=True)
futures = [client.submit(local_update, XD[i], yD[i], betas[i], z, u[i], rho,
                         f=local_f, fprime=local_grad)
           for i in starting_indices]
index = dict(zip(futures, starting_indices))

# An iterator that returns results as they come in
pool = as_completed(futures, with_results=True)

start = time.time()
count = 0

while time() - start < MAX_TIME:
    # Get next completed result
    future, local_beta = next(pool)
    i = index.pop(future)
    betas[i] = local_beta
    count += 1

    # Update parameters (this could be made more efficient)
    ztilde = np.mean(betas + np.array(u), axis=0)

    if count < nchunks:  # artificially inflate beta in the beginning
        ztilde *= nchunks / (count + 1)
    z = shrinkage(ztilde, lamduh / (rho * nchunks))
    update_metrics()

    # Submit new task to the cluster
    i = random.randint(0, nchunks - 1)
    u[i] += betas[i] - z
    new_future = client.submit(local_update2, XD[i], yD[i], betas[i], z, u[i], rho)
    index[new_future] = i
    pool.add(new_future)

Batched Asynchronous ADMM

With enough distributed workers we find that our parameter-updating loop on the client can be the limiting factor. After profiling it seems that our client was bound not by updating parameters, but rather by computing the performance metrics that we are going to use for the convergence plots below (so not actually a limitation in practice). However we decided to leave this in because it is good practice for what is likely to occur in larger clusters, where the single machine that updates parameters is possibly overwhelmed by a high volume of updates from the workers. To resolve this, we build in batching.

Rather than update our parameters one by one, we update them with however many results have come in so far. This provides a natural defense against a slow client. This approach smoothly shifts our algorithm back over to the synchronous solution when the client becomes overwhelmed. (though again, at this scale we’re fine).

Conveniently, the as_completed iterator has a .batches() method that iterates over all of the results that have come in so far.

# ... same setup as before

pool = as_completed(new_betas, with_results=True)
batches = pool.batches()            # <<<--- this is new

while time() - start < MAX_TIME:
    # Get all tasks that have come in since we checked last time
    batch = next(batches)           # <<<--- this is new
    for future, result in batch:
        i = index.pop(future)
        betas[i] = result
        count += 1

    ztilde = np.mean(betas + np.array(u), axis=0)

    if count < nchunks:
        ztilde *= nchunks / (count + 1)
    z = shrinkage(ztilde, lamduh / (rho * nchunks))
    update_metrics()

    # Submit as many new tasks as we collected
    for _ in batch:                 # <<<--- this is new
        i = random.randint(0, nchunks - 1)
        u[i] += betas[i] - z
        new_fut = client.submit(local_update2, XD[i], yD[i], betas[i], z, u[i], rho)
        index[new_fut] = i
        pool.add(new_fut)

Visual Comparison of Algorithms

To show the qualitative difference between the algorithms we include profile plots of each. Note the following:

  1. Synchronous has blocks of full CPU use followed by blocks of no use
  2. The Asynchronous methods are smoother
  3. The Asynchronous single-update method has a lot of whitespace / time when CPUs are idling. This is artificial: the code that tracks convergence diagnostics for the plots below is wasteful and sits inside the client's inner loop
  4. We intentionally leave in this wasteful code so that we can reduce it by batching in the third plot, which is more saturated.

You can zoom in using the tools to the upper right of each plot. You can view the full profile in a full window by clicking on the “View full page” link.

Synchronous

View full page

Asynchronous single-update

View full page

Asynchronous batched-update

View full page

Plot Convergence Criteria

Primal residual for async-admm

Analysis

To get a better sense of what these plots convey, recall that optimization problems always come in pairs: the primal problem is typically the main problem of interest, and the dual problem is a closely related problem that provides information about the constraints in the primal problem. Perhaps the most famous example of duality is the Max-flow-min-cut Theorem from graph theory. In many cases, solving both of these problems simultaneously leads to gains in performance, which is what ADMM seeks to do.

In our case, the constraint in the primal problem is that all workers must agree on the optimum parameter estimate. Consequently, we can think of the dual variables (one for each chunk of data) as measuring the “cost” of agreement for their respective chunks. Intuitively, they will start out small and grow incrementally to find the right “cost” for each worker to have consensus. Eventually, they will level out at an optimum cost.

So:

  • the primal residual plot measures the amount of disagreement; “small” values imply agreement
  • the dual residual plot measures the total “cost” of agreement; this increases until the correct cost is found

The plots then tell us the following:

  • the cost of agreement is higher for asynchronous algorithms, which makes sense because each worker is always working with a slightly out-of-date global parameter estimate, making consensus harder
  • blocked ADMM doesn’t update at all until shortly after 5 seconds have passed, whereas async has already had time to converge. (In practice with real data, we would probably specify that all workers need to report in every K updates).
  • asynchronous algorithms take a little while for the information to properly diffuse, but once that happens they converge quickly.
  • both asynchronous and synchronous converge almost immediately; this is most likely due to a high degree of homogeneity in the data (which was generated to fit the model well). Our next experiment should involve real world data.

What we could have done better

Analysis-wise, we expect richer results from performing this same experiment on a real-world dataset that isn’t as homogeneous as the current toy dataset.

Performance-wise, we can get much better CPU saturation by doing two things:

  1. Not running our convergence diagnostics, or making them much faster
  2. Not running full np.mean computations over all of beta when we’ve only updated a few elements. Instead we should maintain a running aggregation of these results (see the sketch below).
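
As a rough illustration of the second point, here is a minimal sketch of maintaining a running sum so that updating one chunk's parameters costs O(p) rather than a full pass over every chunk. The names and shapes here are hypothetical, not taken from the code above.

import numpy as np

nchunks, p = 8, 5
betas = np.zeros((nchunks, p))
u = np.zeros((nchunks, p))
running_sum = (betas + u).sum(axis=0)   # maintained incrementally below

def update_block(i, new_beta, new_u):
    """Swap in chunk i's new values and adjust the running sum in place."""
    global running_sum
    running_sum += (new_beta + new_u) - (betas[i] + u[i])
    betas[i] = new_beta
    u[i] = new_u
    return running_sum / nchunks        # equivalent to np.mean(betas + u, axis=0)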

With these two changes (each of which is easy) we’re fairly confident that we can scale out to decently large clusters while still saturating hardware.


Dask Development Log


This work is supported by Continuum Analytics and the Data Driven Discovery Initiative from the Moore Foundation

To increase transparency I’m blogging weekly(ish) about the work done on Dask and related projects during the previous week. This log covers work done between 2017-04-20 and 2017-04-28. Nothing here is ready for production. This blogpost is written in haste, so refined polish should not be expected.

Development in Dask and Dask-related projects during the last week includes the following notable changes:

  1. Improved Joblib support, accelerating existing Scikit-Learn code
  2. A dask-glm powered LogisticRegression estimator that is scikit-learn compatible
  3. Additional Parquet support by Arrow
  4. Sparse arrays
  5. Better spill-to-disk behavior
  6. AsyncIO compatible Client
  7. TLS (SSL) support
  8. NumPy __array_ufunc__ protocol

Joblib

Scikit-Learn parallelizes most of its algorithms with Joblib, which provides a simple interface for embarrassingly parallel computations. Dask has been able to hijack Joblib code and serve as its backend for some time now, but this had some limitations, particularly because we would repeatedly send data back and forth between the workers and the client for every batch of computations.

import distributed.joblib
from joblib import Parallel, parallel_backend

with parallel_backend('dask.distributed', scheduler_host='HOST:PORT'):
    # normal Joblib code

Now there is a scatter= keyword, which allows you to pre-scatter select variables out to all of the Dask workers. This significantly cuts down on overhead, especially on machine learning workloads where most of the data doesn’t change very much.

# Send the training data only once to each worker
with parallel_backend('dask.distributed', scheduler_host='localhost:8786',
                      scatter=[digits.data, digits.target]):
    search.fit(digits.data, digits.target)

Early trials indicate that computations like scikit-learn’s RandomForest scale nicely on a cluster without any additional code.

This is particularly nice because it allows Dask and Scikit-Learn to play well together without having to introduce Dask within the Scikit-Learn codebase at all. From a maintenance perspective this combination is very attractive.

Work done by Jim Crist in dask/distributed #1022

Dask-GLM Logistic Regression

The convex optimization solvers in the dask-glm project allow us to solve common machine learning and statistics problems in parallel and at scale. Historically this young library has contained only optimization solvers and relatively little in the way of user API.

This week dask-glm grew new LogisticRegression and LinearRegression estimators that expose the scalable convex optimization algorithms within dask-glm through a Scikit-Learn compatible interface. This can both speed up solutions on a single computer and provide solutions for datasets that were previously too large to fit in memory.

from dask_glm.estimators import LogisticRegression

est = LogisticRegression()
est.fit(my_dask_array, labels)

This notebook compares performance to the latest release of scikit-learn on a dataset with 5,000,000 rows running on a single machine. Dask-glm beats scikit-learn by a factor of four, which is also roughly the number of cores on the development machine. However, in response, this notebook by Olivier Grisel shows the development version of scikit-learn (with a new algorithm) beating out dask-glm by a factor of six. This just goes to show that being smarter about your algorithms is almost always a better use of time than adopting parallelism.

Work done by Tom Augspurger and Chris White in dask/dask-glm #40

Parquet with Arrow

The Parquet format is quickly becoming a standard for parallel and distributed dataframes. There are currently two Parquet reader/writers accessible from Python: fastparquet, a NumPy/Numba solution, and Parquet-CPP, a C++ solution with wrappers provided by Arrow. Dask.dataframe has supported Parquet for a while now via fastparquet.

However, users will now have an option to use Arrow instead by switching the engine= keyword in the dd.read_parquet function.

df = dd.read_parquet('/path/to/mydata.parquet', engine='fastparquet')
df = dd.read_parquet('/path/to/mydata.parquet', engine='arrow')

Hopefully this capability increases the use of both projects and results in greater feedback to those libraries so that they can continue to advance Python’s access to the Parquet format. As a gentle reminder, you can typically get much faster query times by switching from CSV to Parquet. This is often much more effective than parallel computing.

Work by Wes McKinney in dask/dask #2223.

Sparse Arrays

There is a small multi-dimensional sparse array library here: https://github.com/mrocklin/sparse. It allows us to represent arrays compactly in memory when most entries are zero. This differs from the standard solution in scipy.sparse, which can only support arrays of dimension two (matrices) and not greater.

pip install sparse
>>> import numpy as np
>>> x = np.random.random(size=(10, 10, 10, 10))
>>> x[x < 0.9] = 0
>>> x.nbytes
80000
>>> import sparse
>>> s = sparse.COO(x)
>>> s
<COO: shape=(10, 10, 10, 10), dtype=float64, nnz=1074>
>>> s.nbytes
12888
>>> sparse.tensordot(s, s, axes=((1, 0, 3), (2, 1, 0))).sum(axis=1)
array([ 100.93868073,  128.72312323,  119.12997217,  118.56304153,
        133.24522101,   98.33555365,   90.25304866,   98.99823973,
        100.57555847,   78.27915528])

Additionally, this sparse library more faithfully follows the numpy.ndarray API, which is exactly what dask.array expects. Because of this close API matching dask.array is able to parallelize around sparse arrays just as easily as it parallelizes around dense numpy arrays. This gives us a decent distributed multidimensional sparse array library relatively cheaply.

>>> import dask.array as da
>>> x = da.random.random(size=(10000, 10000, 10000, 10000),
...                      chunks=(100, 100, 100, 100))
>>> x[x < 0.9] = 0
>>> s = x.map_blocks(sparse.COO)  # parallel array of sparse arrays

Work on the sparse library is so far by myself and Jake VanderPlas and is available here. Work connecting this up to Dask.array is in dask/dask #2234.

Better spill to disk behavior

I’ve been playing with a 50GB sample of the 1TB Criteo dataset on my laptop (this is where I’m using sparse arrays). To make computations flow a bit faster I’ve improved the performance of Dask’s spill-to-disk policies.

Now, rather than depend on (cloud)pickle we use Dask’s network protocol, which handles data more efficiently, compresses well, and has special handling for common and important types like NumPy arrays and things built out of NumPy arrays (like sparse arrays).
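
For the curious, here is a minimal sketch of what a round trip through that protocol looks like from user code. This is an illustration of the serialization machinery rather than the actual spill-to-disk code path, and the array size is arbitrary.

import numpy as np
from distributed.protocol import serialize, deserialize

x = np.random.random(1000000)

# serialize returns a small header plus raw memory frames, with special
# handling for NumPy arrays (no pickle round trip required)
header, frames = serialize(x)
y = deserialize(header, frames)

assert (x == y).all()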

As a result reading and writing excess data to disk is significantly faster. When performing machine learning computations (which are fairly heavy-weight) disk access is now fast enough that I don’t notice it in practice and running out of memory doesn’t significantly impact performance.

This is only really relevant when using common types (like numpy arrays) and when your computation to disk access ratio is relatively high (such as is the case for analytic workloads), but it was a simple fix and yielded a nice boost to my personal productivity.

Work by myself in dask/distributed #946.

AsyncIO compatible Client

The Dask.distributed scheduler maintains a fully asynchronous API for use with non-blocking systems like Tornado or AsyncIO. Because Dask supports Python 2, all of our internal code is written with Tornado. While Tornado and AsyncIO can work together, this generally requires a bit of excess book-keeping, like turning Tornado futures into AsyncIO futures.

Now there is an AsyncIO specific Client that only includes non-blocking methods that are AsyncIO native. This allows for more idiomatic asynchronous code in Python 3.

async with AioClient('scheduler-address:8786') as c:
    future = c.submit(func, *args, **kwargs)
    result = await future

Work by Krisztián Szűcs in dask/distributed #1029.

TLS (SSL) support

TLS (previously called SSL) is a common and trusted solution for authentication and encryption. It is a feature commonly requested by companies and institutions where intra-network security is important. This is currently being worked on at dask/distributed #1034. I encourage anyone whom this may affect to engage on that pull request.

Work by Antoine Pitrou in dask/distributed #1034 and previously by Marius van Niekerk in dask/distributed #866.

NumPy __array_ufunc__

This recent change in NumPy (literally merged as I was typing this blogpost) allows other array libraries to take control of the existing NumPy ufuncs, so if you call something like np.exp(my_dask_array) this will no longer convert to a NumPy array, but will instead call the appropriate dask.array.exp function. This is a big step towards writing generic array code that works both on NumPy arrays and on other array projects like dask.array, xarray, bcolz, sparse, etc.
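
A minimal sketch of the intended behavior, assuming a NumPy build that includes __array_ufunc__ and a Dask version that implements the protocol:

import numpy as np
import dask.array as da

x = da.ones((1000, 1000), chunks=(100, 100))

# With __array_ufunc__ in place this dispatches to da.exp and stays lazy,
# instead of converting x into an in-memory NumPy array first
y = np.exp(x)
print(type(y))   # a dask.array, not a numpy.ndarray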

As with all large changes in NumPy this was accomplished through a collaboration of many people. PR in numpy/numpy #8247.

Dask Release 0.14.3


This work is supported by Continuum Analytics and the Data Driven Discovery Initiative from the Moore Foundation.

I’m pleased to announce the release of Dask version 0.14.3. This release contains a variety of performance and feature improvements. This blogpost includes some notable features and changes since the last release on March 22nd.

As always you can conda install from conda-forge

conda install -c conda-forge dask distributed

or you can pip install from PyPI

pip install dask[complete] --upgrade

Conda packages should be on the default channel within a few days.

Arrays

Sparse Arrays

Dask.arrays now support sparse arrays and mixed dense/sparse arrays.

>>> import dask.array as da
>>> x = da.random.random(size=(10000, 10000, 10000, 10000),
...                      chunks=(100, 100, 100, 100))
>>> x[x < 0.99] = 0
>>> import sparse
>>> s = x.map_blocks(sparse.COO)  # parallel array of sparse arrays

In order to support sparse arrays we did two things:

  1. Made dask.array support ndarray containers other than NumPy, as long as they were API compatible
  2. Made a small sparse array library that was API compatible to the numpy.ndarray

This process was pretty easy and could be extended to other systems. This also allows for different kinds of ndarrays in the same Dask array, as long as interactions between the arrays are well defined (using the standard NumPy protocols like __array_priority__ and so on.)

Documentation: http://dask.pydata.org/en/latest/array-sparse.html

Update: there is already a pull request for Masked arrays

Reworked FFT code

The da.fft submodule has been extended to include most of the functions in np.fft, with the caveat that multi-dimensional FFTs will only work along single-chunk dimensions. Still, given that rechunking is decently fast today this can be very useful for large image stacks.
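
A minimal sketch of the single-chunk caveat: transform along an axis that lives in one chunk per block, and rechunk first if it does not. The shapes here are arbitrary.

import dask.array as da

# Each row of 4096 samples sits in a single chunk, so an FFT along the
# last axis (the default) is fine
x = da.random.random((8, 4096), chunks=(1, 4096))
X = da.fft.fft(x)
X.compute()

# If the transform axis were split across chunks we would rechunk first:
# x = x.rechunk((8, 4096))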

Documentation: http://dask.pydata.org/en/latest/array-api.html#fast-fourier-transforms

Constructor Plugins

You can now run arbitrary code whenever a dask array is constructed. This empowers users to build in their own policies like rechunking, warning users, or eager evaluation. A dask.array plugin takes in a dask array and returns either a new dask array or None; in the latter case the original array is used.

>>> def f(x):
...     print('%d bytes' % x.nbytes)

>>> with dask.set_options(array_plugins=[f]):
...     x = da.ones((10, 1), chunks=(5, 1))
...     y = x.dot(x.T)
80 bytes
80 bytes
800 bytes
800 bytes

This can be used, for example, to convert dask.array code into numpy code to identify bugs quickly:

>>> with dask.set_options(array_plugins=[lambda x: x.compute()]):
...     x = da.arange(5, chunks=2)

>>> x  # this was automatically converted into a numpy array
array([0, 1, 2, 3, 4])

Or to warn users if they accidentally produce an array with large chunks:

def warn_on_large_chunks(x):
    shapes = list(itertools.product(*x.chunks))
    nbytes = [x.dtype.itemsize * np.prod(shape) for shape in shapes]
    if any(nb > 1e9 for nb in nbytes):
        warnings.warn("Array contains very large chunks")

with dask.set_options(array_plugins=[warn_on_large_chunks]):
    ...

These features were heavily requested by the climate science community, which tends to serve both highly technical computer scientists, and less technical climate scientists who were running into issues with the nuances of chunking.

DataFrames

Dask.dataframe changes are both numerous, and very small, making it difficult to give a representative accounting of recent changes within a blogpost. Typically these include small changes to either track new Pandas development, or to fix slight inconsistencies in corner cases (of which there are many.)

Still, two highlights follow:

Rolling windows with time intervals
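
The series s below is not constructed in this post; one hypothetical way to build a time-indexed Dask series that produces output like the following is:

import pandas as pd
import dask.dataframe as dd

# A one-second-frequency series; any values work since we only count them
index = pd.date_range('2017-01-01', periods=10, freq='1s')
s = dd.from_pandas(pd.Series(1.0, index=index), npartitions=2)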

>>> s.rolling('2s').count().compute()
2017-01-01 00:00:00    1.0
2017-01-01 00:00:01    2.0
2017-01-01 00:00:02    2.0
2017-01-01 00:00:03    2.0
2017-01-01 00:00:04    2.0
2017-01-01 00:00:05    2.0
2017-01-01 00:00:06    2.0
2017-01-01 00:00:07    2.0
2017-01-01 00:00:08    2.0
2017-01-01 00:00:09    2.0
dtype: float64

Read Parquet data with Arrow

Dask now supports reading Parquet data with both fastparquet (a Numpy/Numba solution) and Arrow and Parquet-CPP.

df = dd.read_parquet('/path/to/mydata.parquet', engine='fastparquet')
df = dd.read_parquet('/path/to/mydata.parquet', engine='arrow')

Hopefully this capability increases the use of both projects and results in greater feedback to those libraries so that they can continue to advance Python’s access to the Parquet format.

Graph Optimizations

Dask performs a few passes of simple linear-time graph optimizations before sending a task graph to the scheduler. These optimizations currently vary by collection type, for example dask.arrays have different optimizations than dask.dataframes. These optimizations can greatly improve performance in some cases, but can also increase overhead, which becomes very important for large graphs.

As Dask has grown into more communities, each with strong and differing performance constraints, we’ve found that we needed to allow each community to define its own optimization schemes. The defaults have not changed, but now you can override them with your own. This can be set globally or with a context manager.

def my_optimize_function(graph, keys):
    """ Takes a task graph and a list of output keys, returns new graph """
    new_graph = {...}
    return new_graph

with dask.set_options(array_optimize=my_optimize_function,
                      dataframe_optimize=None,
                      delayed_optimize=my_other_optimize_function):
    x, y = dask.compute(x, y)

Documentation: http://dask.pydata.org/en/latest/optimize.html#customizing-optimization

Speed improvements

Additionally, task fusion has been significantly accelerated. This is very important for large graphs, particularly in dask.array computations.

Web Diagnostics

The distributed scheduler’s web diagnostic page is now served from within the dask scheduler process. This is both good and bad:

  • Good: It is much easier to make new visuals
  • Bad: Dask and Bokeh now share a single CPU

Because Bokeh and Dask now share the same Tornado event loop we no longer need to send messages between them before pushing updates out to a web browser. The Bokeh server has full access to all of the scheduler state. This lets us build new diagnostic pages more easily. This setup has been around for a while but was largely used for development. In this version we’ve made the new dashboard the default and turned off the old one.

The cost here is that the Bokeh server can take 10-20% of the CPU available to the scheduler. If you are running a computation that heavily taxes the scheduler then you might want to close your diagnostic pages. Fortunately, this almost never happens; the dask scheduler is typically fast enough to never get close to this limit.

Tornado difficulties

Beware that the current versions of Bokeh (0.12.5) and Tornado (4.5) do not play well together. This has been fixed in development versions, and installing with conda is fine, but if you naively pip install then you may experience bad behavior.

Joblib

The Dask.distributed Joblib backend now includes a scatter= keyword, allowing you to pre-scatter select variables out to all of the Dask workers. This significantly cuts down on overhead, especially on machine learning workloads where most of the data doesn’t change very much.

# Send the training data only once to each worker
with parallel_backend('dask.distributed', scheduler_host='localhost:8786',
                      scatter=[digits.data, digits.target]):
    search.fit(digits.data, digits.target)

Early trials indicate that computations like scikit-learn’s RandomForest scale nicely on a cluster without any additional code.

Documentation: http://distributed.readthedocs.io/en/latest/joblib.html

Preload scripts

When starting a dask.distributed scheduler or worker, people often want to include a bit of custom setup code, for example to configure loggers or authenticate with some network system. This has always been possible if you start the scheduler and workers from within Python, but it is tricky if you want to use the command line interface. Now you can write your custom code as a separate standalone script and ask the command line interface to run it for you at startup:

# scheduler-setup.py
from distributed.diagnostics.plugin import SchedulerPlugin

class MyPlugin(SchedulerPlugin):
    """ Prints a message whenever a worker is added to the cluster """
    def add_worker(self, scheduler=None, worker=None, **kwargs):
        print("Added a new worker at", worker)

def dask_setup(scheduler):
    plugin = MyPlugin()
    scheduler.add_plugin(plugin)
dask-scheduler --preload scheduler-setup.py

This makes it easier for people to adapt Dask to their particular institution.

Documentation: http://distributed.readthedocs.io/en/latest/setup.html#customizing-initialization

Network Interfaces (for infiniband)

Many people use Dask on high performance supercomputers. This hardware differs from typical commodity clusters or cloud services in several ways, including very high performance network interconnects like InfiniBand. Typically these systems also have normal ethernet and other networks. You’re probably familiar with this on your own laptop when you have both ethernet and wireless:

$ ifconfig
lo          Link encap:Local Loopback                       # Localhost
            inet addr:127.0.0.1  Mask:255.0.0.0
            inet6 addr: ::1/128 Scope:Host
eth0        Link encap:Ethernet  HWaddr XX:XX:XX:XX:XX:XX   # Ethernet
            inet addr:192.168.0.101
            ...
ib0         Link encap:Infiniband                           # Fast InfiniBand
            inet addr:172.42.0.101

The default systems Dask uses to determine network interfaces often choose ethernet by default. If you are on an HPC system then this is likely not optimal. You can direct Dask to choose a particular network interface with the --interface keyword

$ dask-scheduler --interface ib0
distributed.scheduler - INFO -   Scheduler at: tcp://172.42.0.101:8786

$ dask-worker tcp://172.42.0.101:8786 --interface ib0

Efficient as_completed

The as_completed iterator returns futures in the order in which they complete. It is the base of many asynchronous applications using Dask.

>>> x, y, z = client.map(inc, [0, 1, 2])

>>> for future in as_completed([x, y, z]):
...     print(future.result())
2
0
1

It can now also wait to yield an element only after the result also arrives

>>> for future, result in as_completed([x, y, z], with_results=True):
...     print(result)
2
0
1

And also yield all futures (and results) that have finished up until this point.

>>> for futures in as_completed([x, y, z]).batches():
...     print(client.gather(futures))
(2, 0)
(1,)

Both of these help to decrease the overhead of tight inner loops within asynchronous applications.

Example blogpost here: http://matthewrocklin.com/blog/work/2017/04/19/dask-glm-2

Co-released libraries

This release is aligned with a number of other related libraries, notably Pandas, and several smaller libraries for accessing data, including s3fs, hdfs3, fastparquet, and python-snappy, each of which has seen numerous updates over the past few months. Much of the work on these latter libraries is being coordinated by Martin Durant.

Acknowledgements

The following people contributed to the dask/dask repository since the 0.14.1 release on March 22nd

  • Antoine Pitrou
  • Dmitry Shachnev
  • Erik Welch
  • Eugene Pakhomov
  • Jeff Reback
  • Jim Crist
  • John A Kirkham
  • Joris Van den Bossche
  • Martin Durant
  • Matthew Rocklin
  • Michal Ficek
  • Noah D Brenowitz
  • Stuart Archibald
  • Tom Augspurger
  • Wes McKinney
  • wikiped

The following people contributed to the dask/distributed repository since the 1.16.1 release on March 22nd

  • Antoine Pitrou
  • Bartosz Marcinkowski
  • Ben Schreck
  • Jim Crist
  • Jens Nie
  • Krisztián Szűcs
  • Lezyes
  • Luke Canavan
  • Martin Durant
  • Matthew Rocklin
  • Phil Elson

Dask Release 0.15.0


This work is supported by Continuum Analytics and the Data Driven Discovery Initiative from the Moore Foundation.

I’m pleased to announce the release of Dask version 0.15.0. This release contains performance and stability enhancements as well as some breaking changes. This blogpost outlines notable changes since the last release on May 5th.

As always you can conda install Dask:

conda install dask distributed

or pip install from PyPI

pip install dask[complete] --upgrade

Conda packages are available both on the defaults and conda-forge channels.

Full changelogs are available here:

Some notable changes follow.

NumPy ufuncs operate as Dask.array ufuncs

Thanks to recent changes in NumPy 1.13.0, NumPy ufuncs now operate as Dask.array ufuncs. Previously they would convert their arguments into Numpy arrays and then operate concretely.

import dask.array as da
import numpy as np

x = da.arange(10, chunks=(5,))

# Before
>>> np.negative(x)
array([ 0, -1, -2, -3, -4, -5, -6, -7, -8, -9])

# Now
>>> np.negative(x)
dask.array<negative, shape=(10,), dtype=int64, chunksize=(5,)>

To celebrate this change we’ve also improved support for more of the NumPy ufunc and reduction API, such as support for out parameters. This means that a non-trivial subset of the actual NumPy API works directly out of the box with dask.arrays. This makes it easier to write code that seamlessly works with either array type.

Note: the ufunc feature requires that you update NumPy to 1.13.0 or later. Packages are available through PyPI and conda on the defaults and conda-forge channels.

Asynchronous Clients

The Dask.distributed API is capable of operating within a Tornado or Asyncio event loop, which can be useful when integrating with other concurrent systems like web servers, or when building more advanced algorithms in machine learning and other fields. The API for this used to be somewhat hidden, known only to a few, and used underscores to signify that methods were asynchronous.

# Before
client = Client(start=False)
await client._start()

future = client.submit(func, *args)
result = await client._gather(future)

These methods are still around, but the process of starting the client has changed and we now recommend using the fully public methods even in asynchronous situations (these used to block).

# Now
client = await Client(asynchronous=True)

future = client.submit(func, *args)
result = await client.gather(future)  # no longer use the underscore

You can also await futures directly:

result = await future

You can use yield instead of await if you prefer Python 2.

More information is available at https://distributed.readthedocs.org/en/latest/asynchronous.html.

Single-threaded scheduler moves from dask.async to dask.local

The single-machine scheduler used to live in the dask.async module. With async becoming a keyword since Python 3.5 we’re forced to rename this. You can now find the code in dask.local. This will particularly affect anyone who was using the single-threaded scheduler, previously known as dask.async.get_sync. The term dask.get can be used to reliably refer to the single-threaded base scheduler across versions.
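
A small sketch of what the rename looks like in practice, using the get= keyword from this era to select the scheduler explicitly (the bag computation here is just a placeholder):

# Before 0.15.0:  from dask.async import get_sync
# Now:
from dask.local import get_sync

import dask.bag as db

b = db.from_sequence(range(10))

# run this one computation on the single-threaded scheduler, e.g. for debugging
result = b.sum().compute(get=get_sync)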

Retired the distributed.collections module

Early blogposts referred to functions like futures_to_dask_array which resided in the distributed.collections module. These have since been entirely replaced by better interactions between Futures and Delayed objects. This module has been removed entirely.

Always create new directories with the –local-directory flag

Dask workers create a directory where they can place temporary files. Typically this goes into your operating system’s temporary directory (/tmp on Linux and Mac).

Some users on network file systems specify this directory explicitly with the dask-worker ... --local-directory option, pointing to some other better place like a local SSD drive. Previously Dask would dump files into the provided directory. Now it will create a new subdirectory and place files there. This tends to be much more convenient for users on network file systems.

$ dask-worker scheduler-address:8786 --local-directory /scratch
$ ls /scratch
worker-1234/
$ ls /scratch/worker-1234/
user-script.py disk-storage/ ...

Bag.map no longer automatically expands tuples

Previously the map method would inspect functions and automatically expand tuples to fill arguments:

import dask.bag as db

b = db.from_sequence([(1, 10), (2, 20), (3, 30)])

>>> b.map(lambda x, y: x + y).compute()
[11, 22, 33]

While convenient, this behavior gave rise to corner cases and stopped us from being able to support multi-bag mapping functions. It has since been removed. As an advantage though, you can now map two co-partitioned bags together.

a = db.from_sequence([1, 2, 3])
b = db.from_sequence([10, 20, 30])

>>> db.map(lambda x, y: x + y, a, b).compute()
[11, 22, 33]

Styling

Clients and Futures have nicer HTML reprs that show up in the Jupyter notebook.

And the dashboard stays a decent width and has a new navigation bar with links to other dashboard pages. This template is now consistently applied to all dashboard pages.

Multi-client coordination

More primitives to help coordinate between multiple clients on the same cluster have been added. These include Queues and shared Variables for futures.
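
A minimal sketch of how two clients might coordinate through these primitives. The scheduler address is a placeholder, and both clients would connect to the same scheduler.

from dask.distributed import Client, Queue, Variable

client = Client('scheduler-address:8786')

# One client pushes futures onto a named queue ...
q = Queue('work-to-do')
q.put(client.submit(sum, [1, 2, 3]))

# ... and any other client connected to the same scheduler can pull them off
future = q.get()
print(future.result())    # 6

# A shared Variable holds a single value (often a future) visible to all clients
stop = Variable('stop-flag')
stop.set(False)
print(stop.get())         # False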

Joblib performance through pre-scattering

When using Dask to power Joblib computations (such as occur in Scikit-Learn) with the joblib.parallel_backend context manager, you can now pre-scatter select data to all workers. This can significantly speed up some scikit-learn computations by reducing repeated data transfer.

import distributed.joblib
from sklearn.externals.joblib import parallel_backend

# Serialize the training data only once to each worker
with parallel_backend('dask.distributed', scheduler_host='localhost:8786',
                      scatter=[digits.data, digits.target]):
    search.fit(digits.data, digits.target)

Other Array Improvements

  • Filled out the dask.array.fft module
  • Added a basic dask.array.stats module with functions like chisquare
  • Support the @ matrix multiply operator (see the sketch below)
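
A minimal sketch of the @ operator on dask arrays (shapes and chunk sizes are arbitrary):

import dask.array as da

x = da.random.random((1000, 500), chunks=(250, 250))

# Python's @ operator now maps onto dask.array's parallel matrix multiply
xtx = x.T @ x            # lazy dask.array of shape (500, 500)
xtx.compute()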

General performance and stability

As usual, a number of bugs were identified and resolved and a number of performance optimizations were implemented. Thank you to all users and developers who continue to help identify and implement areas for improvement. Users should generally have a smoother experience.

Removed ZMQ networking backend

We have removed the experimental ZeroMQ networking backend. This was not particularly useful in practice. However it was very effective in serving as an example while we were making our network communication layer pluggable with different protocols.

The following related projects have also been released recently and may be worth updating:

  • NumPy 1.13.0
  • Pandas 0.20.2
  • Bokeh 0.12.6
  • Fastparquet 0.1.0
  • S3FS 0.1.1
  • Cloudpickle 0.3.1 (pip)
  • lz4 0.10.0 (pip)

Acknowledgements

The following people contributed to the dask/dask repository since the 0.14.3 release on May 5th:

  • Antoine Pitrou
  • Elliott Sales de Andrade
  • Ghislain Antony Vaillant
  • John A Kirkham
  • Jim Crist
  • Joseph Crail
  • Juan Nunez-Iglesias
  • Julien Lhermitte
  • Martin Durant
  • Matthew Rocklin
  • Samantha Hughes
  • Tom Augspurger

The following people contributed to the dask/distributed repository since the 1.16.2 release on May 5th:

  • A. Jesse Jiryu Davis
  • Antoine Pitrou
  • Brett Naul
  • Eugene Van den Bulke
  • Fabian Keller
  • Jim Crist
  • Krisztián Szűcs
  • Matthew Rocklin
  • Simon Perkins
  • Thomas Arildsen
  • Viacheslav Ostroukh

Programmatic Bokeh Servers


This work is supported by Continuum Analytics

This was cross posted to the Bokeh blog here. Please consider referencing and sharing that post via social media instead of this one.

This blogpost shows how to start a very simple bokeh server application programmatically. For more complex examples, or for the more standard command line interface, see the Bokeh documentation.

Motivation

Many people know Bokeh as a tool for building web visualizations from languages like Python. However I find that Bokeh’s true value is in serving live-streaming, interactive visualizations that update with real-time data. I personally use Bokeh to serve real-time diagnostics for a distributed computing system. In this case I embed Bokeh directly into my library. I’ve found it incredibly useful and easy to deploy sophisticated and beautiful visualizations that help me understand the deep inner-workings of my system.

Most of the (excellent) documentation focuses on stand-alone applications using the Bokeh server

$ bokeh serve myapp.py

However, as a developer who wants to integrate Bokeh into my application, starting up a separate process from the command line doesn’t work for me. Also, I find that starting things from Python tends to be a bit simpler on my brain. I thought I’d provide some examples on how to do this within a Jupyter notebook.

Launch Bokeh Servers from a Notebook

The code below starts a Bokeh server running on port 5000 that provides a single route to / that serves a single figure with a line-plot. The imports are a bit wonky, but the amount of code necessary here is relatively small.

from bokeh.server.server import Server
from bokeh.application import Application
from bokeh.application.handlers.function import FunctionHandler
from bokeh.plotting import figure, ColumnDataSource

def make_document(doc):
    fig = figure(title='Line plot!', sizing_mode='scale_width')
    fig.line(x=[1, 2, 3], y=[1, 4, 9])

    doc.title = "Hello, world!"
    doc.add_root(fig)

apps = {'/': Application(FunctionHandler(make_document))}

server = Server(apps, port=5000)
server.start()

We make a function make_document which is called every time someone visits our website. This function can create plots, call functions, and generally do whatever it wants. Here we make a simple line plot and register that plot with the document with the doc.add_root(...) method.

This starts a Tornado web server and creates a new document whenever someone connects, similar to web frameworks like Flask. In this case our web server piggybacks on the Jupyter notebook’s own IOLoop. Because Bokeh is built on Tornado it can play nicely with other async applications like Tornado or Asyncio.

Live Updates

I find that Bokeh’s real strength comes when you want to stream live data into the browser. Doing this by hand generally means serializing your data on the server, figuring out how web sockets work, sending the data to the client/browser and then updating plots in the browser.

Bokeh handles this by keeping a synchronized table of data on the client and the server, the ColumnDataSource. If you define plots around the column data source and then push more data into the source then Bokeh will handle the rest. Updating your plots in the browser just requires pushing more data into the column data source on the server.

In the example below every time someone connects to our server we make a new ColumnDataSource, make an update function that adds a new record into it, and set up a callback to call that function every 100ms. We then make a plot around that data source to render the data as colored circles.

Because this is a new Bokeh server we start this on a new port, though in practice if we had multiple pages we would just add them as multiple routes in the apps variable.

import random

def make_document(doc):
    source = ColumnDataSource({'x': [], 'y': [], 'color': []})

    def update():
        new = {'x': [random.random()],
               'y': [random.random()],
               'color': [random.choice(['red', 'blue', 'green'])]}
        source.stream(new)

    doc.add_periodic_callback(update, 100)

    fig = figure(title='Streaming Circle Plot!', sizing_mode='scale_width',
                 x_range=[0, 1], y_range=[0, 1])
    fig.circle(source=source, x='x', y='y', color='color', size=10)

    doc.title = "Now with live updating!"
    doc.add_root(fig)

apps = {'/': Application(FunctionHandler(make_document))}

server = Server(apps, port=5001)
server.start()

By changing around the figures (or combining multiple figures, text, other visual elements, and so on) you have full freedom over the visual styling of your web service. By changing around the update function you can pull data from sensors, shove in more interesting data, and so on. This toy example is meant to provide the skeleton of a simple application; hopefully you can fill in details from your application.
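
As a hypothetical aside on the earlier point about multiple pages: serving several routes from one server just means adding more entries to the apps dictionary. Here make_circles_document and the port number are placeholders.

from bokeh.server.server import Server
from bokeh.application import Application
from bokeh.application.handlers.function import FunctionHandler

# Two routes served by the same Bokeh server
apps = {'/': Application(FunctionHandler(make_document)),
        '/circles': Application(FunctionHandler(make_circles_document))}

server = Server(apps, port=5002)
server.start()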

Real example

Here is a simple example taken from Dask’s dashboard that maintains a streaming time series plot with the number of idle and saturated workers in a Dask cluster.

def make_document(doc):
    source = ColumnDataSource({'time': [time(), time() + 1],
                               'idle': [0, 0.1],
                               'saturated': [0, 0.1]})

    x_range = DataRange1d(follow='end', follow_interval=20000, range_padding=0)

    fig = figure(title="Idle and Saturated Workers Over Time",
                 x_axis_type='datetime',
                 y_range=[-0.1, len(scheduler.workers) + 0.1],
                 height=150, tools='', x_range=x_range, **kwargs)

    fig.line(source=source, x='time', y='idle', color='red')
    fig.line(source=source, x='time', y='saturated', color='green')
    fig.yaxis.minor_tick_line_color = None

    fig.add_tools(ResetTool(reset_size=False),
                  PanTool(dimensions="width"),
                  WheelZoomTool(dimensions="width"))

    doc.add_root(fig)

    def update():
        result = {'time': [time() * 1000],
                  'idle': [len(scheduler.idle)],
                  'saturated': [len(scheduler.saturated)]}
        source.stream(result, 10000)

    doc.add_periodic_callback(update, 100)

You can also have buttons, sliders, widgets, and so on. I rarely use these personally though so they don’t interest me as much.

Final Thoughts

I’ve found the Bokeh server to be incredibly helpful in my work and also very approachable once you understand how to set one up (as you now do). I hope that this post serves people well. This blogpost is available as a Jupyter notebook if you want to try it out yourself.

Use Apache Parquet


This work is supported by Continuum Analytics and the Data Driven Discovery Initiative from the Moore Foundation.

This is a tiny blogpost to encourage you to use Parquet instead of CSV for your dataframe computations. I’ll use Dask.dataframe here but Pandas would work just as well. I’ll also use my local laptop here, but Parquet is an excellent format to use on a cluster.

CSV is convenient, but slow

I have the NYC taxi cab dataset on my laptop stored as CSV

mrocklin@carbon:~/data/nyc/csv$ ls
yellow_tripdata_2015-01.csv  yellow_tripdata_2015-07.csv
yellow_tripdata_2015-02.csv  yellow_tripdata_2015-08.csv
yellow_tripdata_2015-03.csv  yellow_tripdata_2015-09.csv
yellow_tripdata_2015-04.csv  yellow_tripdata_2015-10.csv
yellow_tripdata_2015-05.csv  yellow_tripdata_2015-11.csv
yellow_tripdata_2015-06.csv  yellow_tripdata_2015-12.csv

This is a convenient format for humans because we can read it directly.

mrocklin@carbon:~/data/nyc/csv$ head yellow_tripdata_2015-01.csv
VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RateCodeID,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
2,2015-01-15 19:05:39,2015-01-15
19:23:42,1,1.59,-73.993896484375,40.750110626220703,1,N,-73.974784851074219,40.750617980957031,1,12,1,0.5,3.25,0,0.3,17.05
1,2015-01-10 20:33:38,2015-01-10
20:53:28,1,3.30,-74.00164794921875,40.7242431640625,1,N,-73.994415283203125,40.759109497070313,1,14.5,0.5,0.5,2,0,0.3,17.8
1,2015-01-10 20:33:38,2015-01-10
20:43:41,1,1.80,-73.963340759277344,40.802787780761719,1,N,-73.951820373535156,40.824413299560547,2,9.5,0.5,0.5,0,0,0.3,10.8
1,2015-01-10 20:33:39,2015-01-10
20:35:31,1,.50,-74.009086608886719,40.713817596435547,1,N,-74.004325866699219,40.719985961914063,2,3.5,0.5,0.5,0,0,0.3,4.8
1,2015-01-10 20:33:39,2015-01-10
20:52:58,1,3.00,-73.971176147460938,40.762428283691406,1,N,-74.004180908203125,40.742652893066406,2,15,0.5,0.5,0,0,0.3,16.3
1,2015-01-10 20:33:39,2015-01-10
20:53:52,1,9.00,-73.874374389648438,40.7740478515625,1,N,-73.986976623535156,40.758193969726563,1,27,0.5,0.5,6.7,5.33,0.3,40.33
1,2015-01-10 20:33:39,2015-01-10
20:58:31,1,2.20,-73.9832763671875,40.726009368896484,1,N,-73.992469787597656,40.7496337890625,2,14,0.5,0.5,0,0,0.3,15.3
1,2015-01-10 20:33:39,2015-01-10
20:42:20,3,.80,-74.002662658691406,40.734142303466797,1,N,-73.995010375976563,40.726325988769531,1,7,0.5,0.5,1.66,0,0.3,9.96
1,2015-01-10 20:33:39,2015-01-10
21:11:35,3,18.20,-73.783042907714844,40.644355773925781,2,N,-73.987594604492187,40.759357452392578,2,52,0,0.5,0,5.33,0.3,58.13

We can use tools like Pandas or Dask.dataframe to read in all of this data. Because the data is large-ish, I’ll use Dask.dataframe

mrocklin@carbon:~/data/nyc/csv$ du -hs .
22G .
In [1]: import dask.dataframe as dd

In [2]: %time df = dd.read_csv('yellow_tripdata_2015-*.csv')
CPU times: user 340 ms, sys: 12 ms, total: 352 ms
Wall time: 377 ms

In [3]: df.head()
Out[3]:
   VendorID tpep_pickup_datetime tpep_dropoff_datetime  passenger_count  \
0         2  2015-01-15 19:05:39   2015-01-15 19:23:42                1
1         1  2015-01-10 20:33:38   2015-01-10 20:53:28                1
2         1  2015-01-10 20:33:38   2015-01-10 20:43:41                1
3         1  2015-01-10 20:33:39   2015-01-10 20:35:31                1
4         1  2015-01-10 20:33:39   2015-01-10 20:52:58                1

   trip_distance  pickup_longitude  pickup_latitude  RateCodeID  \
0           1.59        -73.993896        40.750111           1
1           3.30        -74.001648        40.724243           1
2           1.80        -73.963341        40.802788           1
3           0.50        -74.009087        40.713818           1
4           3.00        -73.971176        40.762428           1

  store_and_fwd_flag  dropoff_longitude  dropoff_latitude  payment_type  \
0                  N         -73.974785         40.750618             1
1                  N         -73.994415         40.759109             1
2                  N         -73.951820         40.824413             2
3                  N         -74.004326         40.719986             2
4                  N         -74.004181         40.742653             2

   fare_amount  extra  mta_tax  tip_amount  tolls_amount  \
0         12.0    1.0      0.5        3.25           0.0
1         14.5    0.5      0.5        2.00           0.0
2          9.5    0.5      0.5        0.00           0.0
3          3.5    0.5      0.5        0.00           0.0
4         15.0    0.5      0.5        0.00           0.0

   improvement_surcharge  total_amount
0                    0.3         17.05
1                    0.3         17.80
2                    0.3         10.80
3                    0.3          4.80
4                    0.3         16.30

In [4]: from dask.diagnostics import ProgressBar

In [5]: ProgressBar().register()

In [6]: df.passenger_count.sum().compute()
[########################################] | 100% Completed |  3min 58.8s
Out[6]: 245566747

We were able to ask questions about this data (and learn that 250 million people rode cabs in 2015) even though it is too large to fit into memory. This is because Dask is able to operate lazily from disk. It reads in the data on an as-needed basis and then forgets it when it no longer needs it. This takes a while (4 minutes) but does just work.

However, when we read this data many times from disk we start to become frustrated by this four minute cost. In Pandas we suffered this cost once as we moved data from disk to memory. On larger datasets when we don’t have enough RAM we suffer this cost many times.

Parquet is faster

Lets try this same process with Parquet. I happen to have the same exact data stored in Parquet format on my hard drive.

mrocklin@carbon:~/data/nyc$ du -hs nyc-2016.parquet/
17G nyc-2016.parquet/

It is stored as a bunch of individual files, but we don’t actually care about that. We’ll always refer to the directory as the dataset. These files are stored in binary format. We can’t read them as humans

mrocklin@carbon:~/data/nyc$ head nyc-2016.parquet/part.0.parquet
<a bunch of illegible bytes>

But computers are much more able to both read and navigate this data. Lets do the same experiment from before:

In [1]: import dask.dataframe as dd

In [2]: df = dd.read_parquet('nyc-2016.parquet/')

In [3]: df.head()
Out[3]:
  tpep_pickup_datetime  VendorID tpep_dropoff_datetime  passenger_count  \
0  2015-01-01 00:00:00         2   2015-01-01 00:00:00                3
1  2015-01-01 00:00:00         2   2015-01-01 00:00:00                1
2  2015-01-01 00:00:00         1   2015-01-01 00:11:26                5
3  2015-01-01 00:00:01         1   2015-01-01 00:03:49                1
4  2015-01-01 00:00:03         2   2015-01-01 00:21:48                2

   trip_distance  pickup_longitude  pickup_latitude  RateCodeID  \
0           1.56        -74.001320        40.729057           1
1           1.68        -73.991547        40.750069           1
2           4.00        -73.971436        40.760201           1
3           0.80        -73.860847        40.757294           1
4           2.57        -73.969017        40.754269           1

  store_and_fwd_flag  dropoff_longitude  dropoff_latitude  payment_type  \
0                  N         -74.010208         40.719662             1
1                  N           0.000000          0.000000             2
2                  N         -73.921181         40.768269             2
3                  N         -73.868111         40.752285             2
4                  N         -73.994133         40.761600             2

   fare_amount  extra  mta_tax  tip_amount  tolls_amount  \
0          7.5    0.5      0.5         0.0           0.0
1         10.0    0.0      0.5         0.0           0.0
2         13.5    0.5      0.5         0.0           0.0
3          5.0    0.5      0.5         0.0           0.0
4         14.5    0.5      0.5         0.0           0.0

   improvement_surcharge  total_amount
0                    0.3           8.8
1                    0.3          10.8
2                    0.0          14.5
3                    0.0           6.3
4                    0.3          15.8

In [4]: from dask.diagnostics import ProgressBar

In [5]: ProgressBar().register()

In [6]: df.passenger_count.sum().compute()
[########################################] | 100% Completed |  2.8s
Out[6]: 245566747

Same values, but now our computation happens in three seconds, rather than four minutes. We’re cheating a little bit here (pulling out the passenger count column is especially easy for Parquet) but generally Parquet will be much faster than CSV. This lets us work from disk comfortably without worrying about how much memory we have.

Convert

So do yourself a favor and convert your data

In [1]: import dask.dataframe as dd

In [2]: df = dd.read_csv('csv/yellow_tripdata_2015-*.csv')

In [3]: from dask.diagnostics import ProgressBar

In [4]: ProgressBar().register()

In [5]: df.to_parquet('yellow_tripdata.parquet')
[############                            ] | 30% Completed |  1min 54.7s

If you want to be more clever you can specify dtypes and compression when converting. This can definitely help give you significantly greater speedups, but just using the default settings will still be a large improvement.
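
For example, one hedged sketch of what that might look like. The specific dtype choices and the SNAPPY compression codec are illustrative only, and snappy requires the python-snappy package to be installed.

import dask.dataframe as dd

# Declare dtypes up front and parse timestamps while reading the CSVs
df = dd.read_csv('csv/yellow_tripdata_2015-*.csv',
                 dtype={'store_and_fwd_flag': 'category',
                        'payment_type': 'uint8'},
                 parse_dates=['tpep_pickup_datetime', 'tpep_dropoff_datetime'])

# Compress the Parquet output on the way out
df.to_parquet('yellow_tripdata.parquet', compression='SNAPPY')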

Advantages

Parquet enables the following:

  1. Binary representation of data, allowing for speedy conversion of bytes-on-disk to bytes-in-memory
  2. Columnar storage, meaning that you can load in as few columns as you need without loading the entire dataset (see the sketch after this list)
  3. Row-chunked storage so that you can pull out data from a particular range without touching the others
  4. Per-chunk statistics so that you can find subsets quickly
  5. Compression
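
For example, point 2 looks like this in Dask, here against the dataset above (the column choice is arbitrary):

import dask.dataframe as dd

# Only these two columns are read from disk; the rest of the file is skipped
df = dd.read_parquet('nyc-2016.parquet/',
                     columns=['passenger_count', 'tip_amount'])
df.passenger_count.sum().compute()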

Parquet Versions

There are two nice Python packages with support for the Parquet format:

  1. pyarrow: Python bindings for the Apache Arrow and Apache Parquet C++ libraries
  2. fastparquet: a direct NumPy + Numba implementation of the Parquet format

Both are good. Both can do most things. Each has separate strengths. The code above used fastparquet by default but you can change this in Dask with the engine='arrow' keyword if desired.
