
PyData and the GIL


This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project

tl;dr Many PyData projects release the GIL. Multi-core parallelism is alive and well.

Introduction

Machines grow more cores every year. My cheap laptop has four cores and a heavy workstation rivals a decent cluster without the hardware hassle. When I bring this up in conversation people often ask about the GIL and whether or not this poses a problem to the PyData ecosystem and shared-memory parallelism.

Q: Given the growth of shared-memory parallelism, should the PyData ecosystem be concerned about the GIL?

A: No, we should be very excited about this growth. We’re well poised to exploit it.

For those who aren’t familiar, the Global Interpreter Lock (GIL) is a CPython feature/bug that stops threads from manipulating Python objects in parallel. This cripples Pure Python shared-memory parallelism.

This sounds like a big deal but it doesn’t really affect the PyData stack (NumPy/Pandas/SciKits). Most PyData projects don’t spend much time in Python code. They spend 99% of their time in C/Fortran/Cython code. This code can often release the GIL. The following projects release the GIL at various stages:

  • NumPy
  • SciPy
  • Numba (if requested) (example docs)
  • SciKit Learn
  • Anything that mostly uses the above projects
  • if you add more in the comments then I will post them here
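
To make the benefit concrete, here is a minimal sketch of what releasing the GIL buys us, using Numba's nogil option (the function and array sizes here are made up for illustration):

import numpy as np
import numba
from threading import Thread

@numba.jit(nopython=True, nogil=True)
def sum_squares(x):
    # Compiled loop; with nogil=True it runs without holding the GIL
    total = 0.0
    for i in range(x.shape[0]):
        total += x[i] * x[i]
    return total

arrays = [np.random.random(1000000) for _ in range(4)]
threads = [Thread(target=sum_squares, args=(a,)) for a in arrays]
for t in threads:
    t.start()
for t in threads:
    t.join()    # all four compiled loops crunch numbers at the same time

We discard the results here for brevity; the point is only that the threads genuinely run in parallel.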

Our software stack has roots in scientific computing which has an amazing relationship with using all-of-the-hardware. I would like to see the development community lean in to the use of shared-memory parallelism. This feels like a large low-hanging fruit.

Quick Example with dask.array

As a quick example, we compute a large random dot product with dask.array and look at top. Dask.array computes large array operations by breaking arrays up into many small NumPy arrays and then executing those array operations in multiple threads.

In [1]: import dask.array as da

In [2]: x = da.random.random((10000, 10000), blockshape=(1000, 1000))

In [3]: float(x.dot(x.T).sum())
Out[3]: 250026827523.19141

Full resource utilization with Python

Technical note: my BLAS is set to use one thread only, the parallelism in the above example is strictly due to multiple Python worker threads, and not due to parallelism in the underlying native code.
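
If you want to replicate that setup, one common way to pin BLAS to a single thread is through environment variables; which variable matters depends on your BLAS build, so treat this as a sketch:

import os

# These must be set before NumPy (and its BLAS) is first imported
os.environ['OMP_NUM_THREADS'] = '1'        # OpenMP-threaded BLAS builds
os.environ['OPENBLAS_NUM_THREADS'] = '1'   # OpenBLAS
os.environ['MKL_NUM_THREADS'] = '1'        # Intel MKL

import numpy as np    # BLAS calls now stay single-threaded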

Note the 361.0% CPU utilization in the ipython process.

Because the PyData stack is fundamentally based on native compiled code, multiple Python threads can crunch data in parallel without worrying about the GIL. The GIL does not have to affect us in a significant way.

That’s not true, the GIL hurts Python in the following cases

Text

We don’t have a good C/Fortran/Cython solution for text. When given a pile-of-text-files we often switch from threads to processes and use the multiprocessing module. This limits inter-worker communication but this is rarely an issue for this kind of embarrassingly parallel work.

The multiprocessing workflow is fairly simple. I’ve written about this in the toolz docs and in a blogpost about dask.bag.
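
A minimal sketch of that workflow (the count_words function and file paths are hypothetical):

from glob import glob
from multiprocessing import Pool

def count_words(filename):
    # Pure-Python text munging holds the GIL, so we use processes rather than threads
    with open(filename) as f:
        return len(f.read().split())

if __name__ == '__main__':
    filenames = glob('data/*.txt')       # hypothetical pile-of-text-files
    with Pool() as pool:
        counts = pool.map(count_words, filenames)
    print(sum(counts))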

Pandas

Pandas does not yet release the GIL in computationally intensive code. It probably could though. This requires momentum from the community and some grunt-work by some of the Pandas devs. I have a small issue here and I think that Phil Cloud is looking into it.

PyData <3 Shared Memory Parallelism

If you’re looking for more speed in compute-bound applications then consider threading and heavy workstation machines. I personally find this approach to be more convenient than spinning up a cluster.


Towards Out-of-core DataFrames

$
0
0

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project

This post primarily targets developers. It is on experimental code that is not ready for users.

tl;dr Can we build dask.frame? One approach involves indexes and a lot of shuffling.

Dask arrays work

Over the last two months we’ve watched the creation of dask, a task scheduling specification, and dask.array, a project to implement out-of-core nd-arrays using blocked algorithms. (blogposts: 1, 2, 3, 4, 5, 6). This worked pretty well. Dask.array is available on the main conda channel and on PyPI and, for the most part, is a pleasant drop-in replacement for a subset of NumPy operations. I’m really happy with it.

conda install dask
or
pip install dask

There is still work to do, in particular I’d like to interact with people who have real-world problems, but for the most part dask.array feels ready.

On to dask frames

Can we do for Pandas what we’ve just done for NumPy?

Question: Can we represent a large DataFrame as a sequence of in-memory DataFrames and perform most Pandas operations using task scheduling?

Answer: I don’t know. Let’s try.

Naive Approach

If we represent a dask.array as an N-d grid of NumPy ndarrays, then maybe we should represent a dask.frame as a 1-d grid of Pandas DataFrames; they’re kind of like arrays.

Figures: dask.array | Naive dask.frame

This approach supports the following operations:

  • Elementwise operations df.a + df.b
  • Row-wise filtering df[df.a > 0]
  • Reductions df.a.mean()
  • Some split-apply-combine operations that combine with a standard reduction like df.groupby('a').b.mean(). Essentially anything you can do with df.groupby(...).agg(...)

The reductions and split-apply-combine operations require some cleverness. This is how Blaze works now and how it does out-of-core operations in these notebooks: Blaze and CSVs, Blaze and Binary Storage.
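
To make that cleverness concrete, here is a toy sketch of a blocked reduction, treating a would-be dask.frame as nothing more than a list of in-memory DataFrames (made-up data):

import pandas as pd

blocks = [pd.DataFrame({'a': [1.0, 2.0, 3.0]}),
          pd.DataFrame({'a': [4.0, 5.0]})]

# A blocked mean combines cheap per-block partial results
total = sum(block['a'].sum() for block in blocks)
count = sum(len(block) for block in blocks)
print(total / count)    # 3.0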

However this approach does not support the following operations:

  • Joins
  • Split-apply-combine with more complex transform or apply combine steps
  • Sliding window or resampling operations
  • Anything involving multiple datasets

Partition on the Index values

Instead of partitioning based on the size of blocks, we partition on value ranges of the index.

Figures: Partition on block size | Partition on index value

This opens up a few more operations

  • Joins are possible when both tables share the same index. Because we have information about index values we know which blocks from one side need to communicate to which blocks from the other (see the sketch after this list).
  • Split-apply-combine with transform/apply steps are possible when the grouper is the index. In this case we’re guaranteed that each group is in the same block. This opens up general df.groupby(...).apply(...)
  • Rolling or resampling operations are easy on the index if we share a small amount of information between blocks as we do in dask.array for ghosting operations.
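
Here is the join sketch referenced above: when both sides share the same index divisions, matching partitions join independently with ordinary Pandas calls (made-up data):

import pandas as pd

left_blocks  = [pd.DataFrame({'balance': [100, 200]}, index=['Alice', 'Bob']),
                pd.DataFrame({'balance': [300]},      index=['Edith'])]
right_blocks = [pd.DataFrame({'city': ['LA', 'NYC']}, index=['Alice', 'Bob']),
                pd.DataFrame({'city': ['SF']},        index=['Edith'])]

# No cross-partition communication needed; each pair joins on its own
joined = [l.join(r) for l, r in zip(left_blocks, right_blocks)]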

We note the following theme:

Complex operations are easy if the logic aligns with the index

And so a recipe for many complex operations becomes:

  1. Re-index your data along the proper column
  2. Perform easy computation

Re-indexing out-of-core data

To be explicit, imagine we have a large time-series of transactions indexed by time and partitioned by day. The data for every day is in a separate DataFrame.

Block 1
-------
                     credit    name
time
2014-01-01 00:00:00     100     Bob
2014-01-01 01:00:00     200   Edith
2014-01-01 02:00:00    -300   Alice
2014-01-01 03:00:00     400     Bob
2014-01-01 04:00:00    -500  Dennis
...

Block 2
-------
                     credit    name
time
2014-01-02 00:00:00     300    Andy
2014-01-02 01:00:00     200   Edith
...

We want to reindex this data and shuffle all of the entries so that now we partition on the name of the person. Perhaps all of the A’s are in one block while all of the B’s are in another.

Block 1
-------
                       time  credit
name
Alice   2014-04-30 00:00:00     400
Alice   2014-01-01 00:00:00     100
Andy    2014-11-12 00:00:00    -200
Andy    2014-01-18 00:00:00     400
Andy    2014-02-01 00:00:00    -800
...

Block 2
-------
                       time  credit
name
Bob     2014-02-11 00:00:00     300
Bob     2014-01-05 00:00:00     100
...

Re-indexing and shuffling large data is difficult and expensive. We need to find good values on which to partition our data so that we get regularly sized blocks that fit nicely into memory. We also need to shuffle entries from all of the original blocks to all of the new ones. In principle every old block has something to contribute to every new one.

We can’t just call DataFrame.sort because the entire data might not fit in memory and most of our sorting algorithms assume random access.

We do this in two steps

  1. Find good division values to partition our data. These should partition the data into blocks of roughly equal size.
  2. Shuffle our old blocks into new blocks along the new partitions found in step one.

Find divisions by external sorting

One approach to find new partition values is to pull out the new index from each block, perform an out-of-core sort, and then take regularly spaced values from that array.

  1. Pull out new index column from each block

    indexes = [block['new-column-index'] for block in blocks]
    
  2. Perform out-of-core sort on that column

    sorted_index = fancy_out_of_core_sort(indexes)
    
  3. Take values at regularly spaced intervals, e.g.

    partition_values = sorted_index[::1000000]
    

We implement this using parallel in-block sorts, followed by a streaming merge process using the heapq module. It works but is slow.
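
A rough sketch of that process, assuming each block's index column fits in memory on its own (array sizes and spacing are made up):

import heapq
import numpy as np

# Stand-ins for the new index column pulled out of each block
indexes = [np.random.random(1000000) for _ in range(4)]

# Parallelizable in-block sorts, then a lazy streaming merge with heapq
sorted_blocks = [np.sort(block) for block in indexes]
merged = heapq.merge(*[iter(block) for block in sorted_blocks])

# Take regularly spaced values from the merged stream as partition divisions
spacing = 1000000
partition_values = [x for i, x in enumerate(merged) if i % spacing == 0]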

Possible Improvements

This could be accelerated through one of the following options:

  1. A streaming numeric solution that works directly on iterators of NumPy arrays (numtoolz anyone?)
  2. Not sorting at all. We only actually need approximate regularly spaced quantiles. A brief literature search hints that there might be some good solutions.

Shuffle

Now that we know the values on which we want to partition we ask each block to shard itself into appropriate pieces and shove all of those pieces into a spill-to-disk dictionary. Another process then picks up these pieces and calls pd.concat to merge them in to the new blocks.

For the out-of-core dict we’re currently using Chest. Turns out that serializing DataFrames and writing them to disk can be tricky. There are several good methods with about an order of magnitude performance difference between them.
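
A stripped-down sketch of that shuffle, using a plain dict in place of a spill-to-disk store like Chest (blocks and divisions are made up):

import bisect
from collections import defaultdict
import pandas as pd

partition_values = [5, 15]                  # divisions found in the previous step
blocks = [pd.DataFrame({'x': [1., 2.]}, index=[3, 10]),
          pd.DataFrame({'x': [3., 4.]}, index=[7, 20])]

shards = defaultdict(list)                  # stand-in for the spill-to-disk dict
for block in blocks:
    # Assign each row to an output partition by its index value, then shard
    owners = [bisect.bisect(partition_values, i) for i in block.index]
    for partition_id, piece in block.groupby(owners):
        shards[partition_id].append(piece)

new_blocks = {k: pd.concat(pieces) for k, pieces in shards.items()}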

This works but my implementation is slow

Here is an example with a snippet of the NYCTaxi data (this is small).

In [1]: import dask.frame as dfr

In [2]: d = dfr.read_csv('/home/mrocklin/data/trip-small.csv', chunksize=10000)

In [3]: d.head(3)   # This is fast
Out[3]:
                          medallion                      hack_license  \
0  89D227B655E5C82AECF13C3F540D4CF4  BA96DE419E711691B9445D6A6307C170
1  0BD7C8F5BA12B88E0B67BED28BEA73D8  9FD8F69F0804BDB5549F40E9DA1BE472
2  0BD7C8F5BA12B88E0B67BED28BEA73D8  9FD8F69F0804BDB5549F40E9DA1BE472

  vendor_id  rate_code store_and_fwd_flag      pickup_datetime  \
0       CMT          1                  N  2013-01-01 15:11:48
1       CMT          1                  N  2013-01-06 00:18:35
2       CMT          1                  N  2013-01-05 18:49:41

      dropoff_datetime  passenger_count  trip_time_in_secs  trip_distance  \
0  2013-01-01 15:18:10                4                382            1.0
1  2013-01-06 00:22:54                1                259            1.5
2  2013-01-05 18:54:23                1                282            1.1

   pickup_longitude  pickup_latitude  dropoff_longitude  dropoff_latitude
0        -73.978165        40.757977         -73.989838         40.751171
1        -74.006683        40.731781         -73.994499         40.750660
2        -74.004707        40.737770         -74.009834         40.726002

In [4]: d2 = d.set_index(d.passenger_count, out_chunksize=10000)   # This takes some time

In [5]: d2.head(3)
Out[5]:
                                        medallion  \
passenger_count
0                3F3AC054811F8B1F095580C50FF16090
1                4C52E48F9E05AA1A8E2F073BB932E9AA
1                FF00E5D4B15B6E896270DDB8E0697BF7

                                     hack_license vendor_id  rate_code  \
passenger_count
0                E00BD74D8ADB81183F9F5295DC619515       VTS          5
1                307D1A2524E526EE08499973A4F832CF       VTS          1
1                0E8CCD187F56B3696422278EBB620EFA       VTS          1

                store_and_fwd_flag      pickup_datetime     dropoff_datetime  \
passenger_count
0                              NaN  2013-01-13 03:25:00  2013-01-13 03:42:00
1                              NaN  2013-01-13 16:12:00  2013-01-13 16:23:00
1                              NaN  2013-01-13 15:05:00  2013-01-13 15:15:00

                 passenger_count  trip_time_in_secs  trip_distance  \
passenger_count
0                              0               1020           5.21
1                              1                660           2.94
1                              1                600           2.18

                 pickup_longitude  pickup_latitude  dropoff_longitude  \
passenger_count
0                      -73.986900        40.743736         -74.029747
1                      -73.976753        40.790123         -73.984802
1                      -73.982719        40.767147         -73.982170

                 dropoff_latitude
passenger_count
0                       40.741348
1                       40.758518
1                       40.746170

In [6]: d2.blockdivs   # our new partition values
Out[6]: (2, 3, 6)

In [7]: d.blockdivs    # our original partition values
Out[7]: (10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000, 90000)

Some Problems

  • First, we have to evaluate the dask as we go. Every set_index operation (and hence many groupbys and joins) forces an evaluation. We can no longer, as in the dask.array case, endlessly compound high-level operations to form more and more complex graphs and then only evaluate at the end. We need to evaluate as we go.

  • Sorting/shuffling is slow. This is for a few reasons including the serialization of DataFrames and sorting being hard.

  • How feasible is it to frequently re-index a large amount of data? When do we reach the stage of “just use a database”?

  • Pandas doesn’t yet release the GIL, so this is all single-core. See post on PyData and the GIL.

  • My current solution lacks basic functionality. I’ve skipped the easy things to first ensure that the hard stuff is doable.

Help!

I know less about tables than about arrays. I’m ignorant of the literature and common solutions in this field. If anything here looks suspicious then please speak up. I could really use your help.

Additionally the Pandas API is much more complex than NumPy’s. If any experienced devs out there feel like jumping in and implementing fairly straightforward Pandas features in a blocked way, I’d be obliged.

Efficiently Store Pandas DataFrames


This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project

tl;dr We benchmark several options to store Pandas DataFrames to disk. Good options exist for numeric data but text is a pain. Categorical dtypes are a good option.

Introduction

For dask.frame I need to read and write Pandas DataFrames to disk. Both disk bandwidth and serialization speed limit storage performance.

  • Disk bandwidth, between 100MB/s and 800MB/s for a notebook hard drive, is limited purely by hardware. Not much we can do here except buy better drives.
  • Serialization cost though varies widely by library and context. We can be smart here. Serialization is the conversion of a Python variable (e.g. DataFrame) to a stream of bytes that can be written raw to disk.

Typically we use libraries like pickle to serialize Python objects. For dask.frame we really care about doing this quickly so we’re going to also look at a few alternatives.

Contenders

  • pickle - The standard library pure Python solution
  • cPickle - The standard library C solution
  • pickle.dumps(data, protocol=2) - pickle and cPickle support multiple protocols. Protocol 2 is good for numeric data.
  • json - using the standard-library json module, we encode the values and index as lists of ints/strings
  • json-no-index - Same as above except that we don’t encode the index of the DataFrame, e.g. 0, 1, ... We’ll find that JSON does surprisingly well on pure text data.
  • msgpack - A binary JSON alternative
  • CSV - The venerable pandas.read_csv and DataFrame.to_csv
  • hdfstore - Pandas’ custom HDF5 storage format

Additionally we mention but don’t include the following:

  • dill and cloudpickle - formats commonly used for function serialization. These perform about the same as cPickle
  • hickle - A pickle interface over HDF5. This does well on NumPy data but doesn’t support Pandas DataFrames well.

Experiment

Disclaimer: We’re about to issue performance numbers on a toy dataset. You should not trust that what follows generalizes to your data. You should look at your own data and run benchmarks yourself. My benchmarks lie.

We create a DataFrame with two columns, one with numeric data, and one with text. The text column has repeated values (1000 unique values, each repeated 1000 times) while the numeric column is all unique. This is fairly typical of data that I see in the wild.

df = pd.DataFrame({'text': [str(i % 1000) for i in range(1000000)],
                   'numbers': range(1000000)})

Now we time the various dumps and loads methods of the different serialization libraries and plot the results below.

Time costs to serialize numeric data

As a point of reference writing the serialized result to disk and reading it back again should take somewhere between 0.05s and 0.5s on standard hard drives. We want to keep serialization costs below this threshold.

Thank you to Michael Waskom for making those charts (see twitter conversation and his alternative charts)

Gist to recreate plots here: https://gist.github.com/mrocklin/4f6d06a2ccc03731dd5f

Further Disclaimer: These numbers average from multiple repeated calls to loads/dumps. Actual performance in the wild is likely worse.

Observations

We have good options for numeric data but not for text. This is unfortunate; serializing ASCII text should be cheap. We lose here because we store text in a Series with the NumPy dtype ‘O’ for generic Python objects. We don’t have a dedicated variable length string dtype. This is tragic.

For numeric data the successful systems record a small amount of metadata and then dump the raw bytes. The main takeaway from this is that you should use the protocol=2 keyword argument to pickle. This option isn’t well known but strongly impacts performance.

Note: Aaron Meurer notes in the comments that for Python 3 users protocol=3 is already default. Python 3 users can trust the default protocol= setting to be efficient and should not specify protocol=2.
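
A small sketch of the difference on numeric data (timings and sizes will vary with your data; on Python 2 the default was the slow ASCII protocol 0):

import pickle
import numpy as np
import pandas as pd

df = pd.DataFrame({'numbers': np.arange(1000000)})

ascii_bytes  = pickle.dumps(df, protocol=0)    # old text-based protocol
binary_bytes = pickle.dumps(df, protocol=2)    # small metadata header plus raw bytes

print(len(ascii_bytes), len(binary_bytes))     # the binary payload is much smaller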

Time costs to serialize numeric data

Some thoughts on text

  1. Text should be easy to serialize. It’s already text!

  2. JSON-no-index serializes the text values of the dataframe (not the integer index) as a list of strings. This assumes that the data are strings which is why it’s able to outperform the others, even though it’s not an optimized format. This is what we would gain if we had a string dtype rather than relying on the NumPy Object dtype, 'O'.

  3. MsgPack is surprisingly fast compared to cPickle.

  4. MsgPack is oddly unbalanced; it can dump text data very quickly but takes a while to load it back in. Can we improve msgpack load speeds?

  5. CSV text loads are fast. Hooray for pandas.read_csv.

Some thoughts on numeric data

  1. Both pickle(..., protocol=2) and msgpack dump raw bytes. These are well below disk I/O speeds. Hooray!

  2. There isn’t much reason to compare performance below this level.

Categoricals to the Rescue

Pandas recently added support for categorical data. We use categorical data when our values take on a fixed number of possible options with potentially many repeats (like stock ticker symbols.) We enumerate these possible options (AAPL: 1, GOOG: 2, MSFT: 3, ...) and use those numbers in place of the text. This works well when there are many more observations/rows than there are unique values. Recall that in our case we have one million rows but only one thousand unique values. This is typical for many kinds of data.

This is great! We’ve shrunk the amount of text data by a factor of a thousand, replacing it with cheap-to-serialize numeric data.

>>> df['text'] = df['text'].astype('category')
>>> df.text
0         0
1         1
2         2
3         3
...
999997    997
999998    998
999999    999
Name: text, Length: 1000000, dtype: category
Categories (1000, object): [0 < 1 < 10 < 100 ... 996 < 997 < 998 < 999]

Let’s consider the costs of doing this conversion and of serializing it afterwards, relative to the cost of just serializing the raw text.

                               seconds
Serialize Original Text       1.042523
Convert to Categories         0.072093
Serialize Categorical Data    0.028223

When our data is amenable to categories then it’s cheaper to convert-then-serialize than it is to serialize the raw text. Repeated serializations are just pure-win. Categorical data is good for other reasons too; computations on object dtype in Pandas generally happen at Python speeds. If you care about performance then categoricals are definitely something to roll in to your workflow.

Final Thoughts

  1. Several excellent serialization options exist, each with different strengths.
  2. A combination of good serialization support for numeric data and Pandas categorical dtypes enable efficient serialization and storage of DataFrames.
  3. Object dtype is bad for PyData. String dtypes would be nice. I’d like to shout out to DyND, a possible NumPy replacement that would resolve this.
  4. MsgPack provides surprisingly good performance over custom Python solutions; why is that?
  5. I suspect that we could improve performance by special casing Object dtypes and assuming that they contain only text.

Partition and Shuffle


This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project

This post primarily targets developers.

tl;dr We partition out-of-core dataframes efficiently.

Partition Data

Many efficient parallel algorithms require intelligently partitioned data.

For time-series data we might partition into month-long blocks. For text-indexed data we might have all of the “A”s in one group and all of the “B”s in another. These divisions let us arrange work with foresight.

To extend Pandas operations to larger-than-memory data, efficient partition algorithms are critical. This is tricky when the data doesn’t fit in memory.

Partitioning is fundamentally hard

Data locality is the root of all performance
    -- A Good Programmer

Partitioning/shuffling is inherently non-local. Every block of input data needs to separate and send bits to every block of output data. If we have a thousand partitions then that’s a million little partition shards to communicate. Ouch.

Shuffling data between partitions

Consider the following setup

  100GB dataset
/ 100MB partitions
= 1,000 input partitions

To partition we need to shuffle data from the input partitions into a similar number of output partitions

  1,000 input partitions
* 1,000 output partitions
= 1,000,000 partition shards

If our communication/storage of those shards has even a millisecond of latency then we run into problems.

  1,000,000 partition shards
x 1ms
≈ 17 minutes

Previously I stored the partition-shards individually on the filesystem using cPickle. This was a mistake. It was very slow because it treated each of the million shards independently. Now we aggregate shards headed for the same out-block and write out many at a time, bundling overhead. We balance this against memory constraints. This stresses both Python latencies and memory use.

BColz, now for very small data

Fortunately we have a nice on-disk chunked array container that supports append in Cython. BColz (formerly BLZ, formerly CArray) does this for us. It wasn’t originally designed for this use case but performs admirably.

Briefly, BColz is…

  • A binary store (like HDF5)
  • With columnar access (useful for tabular computations)
  • That stores data in cache-friendly sized blocks
  • With a focus on compression
  • Written mostly by Francesc Alted (PyTables) and Valentin Haenel

It includes two main objects:

  • carray: An on-disk numpy array
  • ctable: A named collection of carrays to represent a table/dataframe
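
A minimal sketch of the append behaviour we lean on (the rootdir path here is made up):

import numpy as np
import bcolz

# Chunked, compressed, on-disk array with cheap incremental appends
a = bcolz.carray(np.arange(5), rootdir='example-carray', mode='w')
a.append(np.arange(5, 10))
a.flush()

print(len(a))     # 10
print(a[2:8])     # slices read back transparently from disk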

Partitioned Frame

We use carray to make a new data structure pframe with the following operations:

  • Append DataFrame to collection, and partition it along the index on known block divisions blockdivs
  • Extract DataFrame corresponding to a particular partition

Internally we invent two new data structures:

  • cframe: Like ctable this stores column information in a collection of carrays. Unlike ctable this maps perfectly onto the custom block structure used internally by Pandas. For internal use only.
  • pframe: A collection of cframes, one for each partition.

Partitioned Frame design

Through bcolz.carray, cframe manages efficient incremental storage to disk. PFrame partitions incoming data and feeds it to the appropriate cframe.

Example

Create test dataset

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({'a': [1, 2, 3, 4],
   ...:                    'b': [1., 2., 3., 4.]},
   ...:                   index=[1, 4, 10, 20])

Create pframe like our test dataset, partitioning on divisions 5, 15. Append the single test dataframe.

In [3]: from pframe import pframe

In [4]: pf = pframe(like=df, blockdivs=[5, 15])

In [5]: pf.append(df)

Pull out partitions

In [6]: pf.get_partition(0)
Out[6]:
   a  b
1  1  1
4  2  2

In [7]: pf.get_partition(1)
Out[7]:
    a  b
10  3  3

In [8]: pf.get_partition(2)
Out[8]:
    a  b
20  4  4

Continue to append data…

In [9]: df2 = pd.DataFrame({'a': [10, 20, 30, 40],
   ...:                     'b': [10., 20., 30., 40.]},
   ...:                    index=[1, 4, 10, 20])

In [10]: pf.append(df2)

… and partitions grow accordingly.

In [12]: pf.get_partition(0)
Out[12]:
    a   b
1    1   1
4    2   2
1   10  10
4   20  20

We can continue this until our disk fills up. This runs near peak I/O speeds (on my low-power laptop with admittedly poor I/O.)

Performance

I’ve partitioned the NYCTaxi trip dataset a lot this week, posting my results to the Continuum chat with messages like the following:

I think I've got it to work, though it took all night and my hard drive filled up.
Down to six hours and it actually works.
Three hours!
By removing object dtypes we're down to 30 minutes
20!  This is actually usable.
OK, I've got this to six minutes.  Thank goodness for Pandas categoricals.
Five.
Down to about three and a half with multithreading, but only if we stop blosc from segfaulting.

And that’s where I am now. It’s been a fun week. Here is a tiny benchmark.

>>> import pandas as pd
>>> import numpy as np
>>> from pframe import pframe

>>> df = pd.DataFrame({'a': np.random.random(1000000),
...                    'b': np.random.poisson(100, size=1000000),
...                    'c': np.random.random(1000000),
...                    'd': np.random.random(1000000).astype('f4')}).set_index('a')

Set up a pframe to match the structure of this DataFrame. Partition the index into divisions of size 0.1.

>>> pf = pframe(like=df,
...             blockdivs=[.1, .2, .3, .4, .5, .6, .7, .8, .9],
...             chunklen=2**15)

Dump the random data into the Partition Frame one hundred times and compute effective bandwidths.

>>> for i in range(100):
...     pf.append(df)
CPU times: user 39.4 s, sys: 3.01 s, total: 42.4 s
Wall time: 40.6 s

>>> pf.nbytes
2800000000

>>> pf.nbytes / 40.6 / 1e6      # MB/s
68.9655172413793

>>> pf.cbytes / 40.6 / 1e6      # Actual compressed bytes on disk
41.5172952955665

We partition and store on disk random-ish data at 68MB/s (cheating with compression). This is on my old small notebook computer with a weak processor and hard drive I/O bandwidth at around 100 MB/s.

Theoretical Comparison to External Sort

There isn’t much literature to back up my approach. That concerns me. There is a lot of literature however on external sorting and they often cite our partitioning problem as a use case. Perhaps we should do an external sort?

I thought I’d quickly give some reasons why I think the current approach is theoretically better than an out-of-core sort; hopefully someone smarter can come by and tell me why I’m wrong.

We don’t need a full sort, we need something far weaker. External sort requires at least two passes over the data while the method above requires one full pass through the data as well as one additional pass through the index column to determine good block divisions. These divisions should be of approximately equal size. The approximate size can be pretty rough. I don’t think we would notice a variation of a factor of five in block sizes. Task scheduling lets us be pretty sloppy with load imbalance as long as we have many tasks.

I haven’t implemented a good external sort though so I’m only able to argue theory here. I’m likely missing important implementation details.

  • PFrame code lives in a dask branch at the moment. It depends on a couple of BColz PRs (#163, #164)

Profiling Data Throughput


This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project

Disclaimer: This post is on experimental/buggy code.

tl;dr We measure the costs of processing semi-structured data like JSON blobs.

Semi-structured Data

Semi-structured data is ubiquitous and computationally painful. Consider the following JSON blobs:

{'name': 'Alice',   'payments': [1, 2, 3]}
{'name': 'Bob',     'payments': [4, 5]}
{'name': 'Charlie', 'payments': None}

This data doesn’t fit nicely into NumPy or Pandas and so we fall back to dynamic pure-Python data structures like dicts and lists. Python’s core data structures are surprisingly good, about as good as compiled languages like Java, but dynamic data structures present some challenges for efficient parallel computation.

Volume

Semi-structured data is often at the beginning of our data pipeline and so often has the greatest size. We may start with 100GB of raw data, reduce to 10GB to load into a database, and finally aggregate down to 1GB for analysis, machine learning, etc., 1kB of which becomes a plot or table.

                         Data Bandwidth (MB/s)    In Parallel (MB/s)
Disk I/O                                   500                   500
Decompression                              100                   500
Deserialization                             50                   250
In-memory computation                     2000                     ∞
Shuffle                                      9                    30

Common solutions for large semi-structured data include Python iterators, multiprocessing, Hadoop, and Spark as well as proper databases like MongoDB and ElasticSearch. Two months ago we built dask.bag, a toy dask experiment for semi-structured data. Today we’ll strengthen the dask.bag project and look more deeply at performance in this space.

We measure performance with data bandwidth, usually in megabytes per second (MB/s). We’ll build intuition for why dealing with this data is costly.

Dataset

As a test dataset we play with a dump of GitHub data from https://www.githubarchive.org/. This data records every public github event (commit, comment, pull request, etc.) in the form of a JSON blob. This data is fairly representative of a broader class of problems. Often people want to do fairly simple analytics, like find the top ten committers to a particular repository, or clean the data before they load it into a database.

We’ll play around with this data using dask.bag. This is both to get a feel for what is expensive and to provide a cohesive set of examples. In truth we won’t do any real analytics on the github dataset; we’ll find that the expensive parts come well before analytic computation.

Items in our data look like this:

>>> import json
>>> import dask.bag as db

>>> path = '/home/mrocklin/data/github/2013-05-0*.json.gz'

>>> db.from_filenames(path).map(json.loads).take(1)
({u'actor': u'mjcramer',
  u'actor_attributes': {u'gravatar_id': u'603762b7a39807503a2ee7fe4966acd1',
                        u'login': u'mjcramer',
                        u'type': u'User'},
  u'created_at': u'2013-05-01T00:01:28-07:00',
  u'payload': {u'description': u'',
               u'master_branch': u'master',
               u'ref': None,
               u'ref_type': u'repository'},
  u'public': True,
  u'repository': {u'created_at': u'2013-05-01T00:01:28-07:00',
                  u'description': u'',
                  u'fork': False,
                  u'forks': 0,
                  u'has_downloads': True,
                  u'has_issues': True,
                  u'has_wiki': True,
                  u'id': 9787210,
                  u'master_branch': u'master',
                  u'name': u'settings',
                  u'open_issues': 0,
                  u'owner': u'mjcramer',
                  u'private': False,
                  u'pushed_at': u'2013-05-01T00:01:28-07:00',
                  u'size': 0,
                  u'stargazers': 0,
                  u'url': u'https://github.com/mjcramer/settings',
                  u'watchers': 0},
  u'type': u'CreateEvent',
  u'url': u'https://github.com/mjcramer/settings'},)

Disk I/O and Decompression – 100-500 MB/s

                                          Data Bandwidth (MB/s)
Read from disk with open                                    500
Read from disk with gzip.open                               100
Parallel read from disk with gzip.open                      500

A modern laptop hard drive can theoretically read data from disk to memory at 800 MB/s. So we could burn through a 10GB dataset in fifteen seconds on our laptop. Workstations with RAID arrays can do a couple GB/s. In practice I get around 500 MB/s on my personal laptop.

In [1]: import json
In [2]: import dask.bag as db
In [3]: from glob import glob
In [4]: path = '/home/mrocklin/data/github/2013-05-0*.json.gz'

In [5]: %time compressed = '\n'.join(open(fn).read() for fn in glob(path))
CPU times: user 75.1 ms, sys: 1.07 s, total: 1.14 s
Wall time: 1.14 s

In [6]: len(compressed) / 0.194 / 1e6      # MB/s
508.5912175438597

To reduce storage and transfer costs we often compress data. This requires CPU effort whenever we want to operate on the stored values. This can limit data bandwidth.

In [7]: import gzip
In [8]: %time total = '\n'.join(gzip.open(fn).read() for fn in glob(path))
CPU times: user 12.2 s, sys: 18.7 s, total: 30.9 s
Wall time: 30.9 s

In [9]: len(total) / 30.9 / 1e6            # MB/s  total bandwidth
Out[9]: 102.16563844660195

In [10]: len(compressed) / 30.9 / 1e6      # MB/s  compressed bandwidth
Out[10]: 18.763559482200648

So we lose some data bandwidth through compression. Where we could previously process 500 MB/s we’re now down to only 100 MB/s. If we count bytes in terms of the amount stored on disk then we’re only hitting 18 MB/s. We’ll get around this with multiprocessing.

Decompression and Parallel processing – 500 MB/s

Fortunately we often have more cores than we know what to do with. Parallelizing reads can hide much of the decompression cost.

In [12]: import dask.bag as db

In [13]: %time nbytes = db.from_filenames(path).map(len).sum().compute()
CPU times: user 130 ms, sys: 402 ms, total: 532 ms
Wall time: 5.5 s

In [14]: nbytes / 5.5 / 1e6
Out[14]: 573.9850932727272

Dask.bag infers that we need to use gzip from the filename. Dask.bag currently uses multiprocessing to distribute work, allowing us to reclaim our 500 MB/s throughput on compressed data. We also could have done this with multiprocessing, straight Python, and a little elbow-grease.
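
For reference, a rough hand-rolled equivalent of that parallel read using multiprocessing directly (assuming the same path glob as above):

import gzip
from glob import glob
from multiprocessing import Pool

def read_length(fn):
    with gzip.open(fn) as f:        # decompression happens inside a worker process
        return len(f.read())

if __name__ == '__main__':
    path = '/home/mrocklin/data/github/2013-05-0*.json.gz'
    with Pool() as pool:
        nbytes = sum(pool.map(read_length, glob(path)))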

Deserialization – 30 MB/s

                          Data Bandwidth (MB/s)
json.loads                                   30
ujson.loads                                  50
Parallel ujson.loads                        150

Once we decompress our data we still need to turn bytes into meaningful data structures (dicts, lists, etc..) Our GitHub data comes to us as JSON. This JSON contains various encodings and bad characters so, just for today, we’re going to punt on bad lines. Converting JSON text to Python objects explodes out in memory a bit, so we’ll consider a smaller subset for this part, a single day.

In [20]: def loads(line):
   ....:     try: return json.loads(line)
   ....:     except: return None

In [21]: path = '/home/mrocklin/data/github/2013-05-01-*.json.gz'

In [22]: lines = list(db.from_filenames(path))

In [23]: %time blobs = list(map(loads, lines))
CPU times: user 10.7 s, sys: 760 ms, total: 11.5 s
Wall time: 11.3 s

In [24]: len(total) / 11.3 / 1e6
Out[24]: 33.9486321238938

In [25]: len(compressed) / 11.3 / 1e6
Out[25]: 6.2989179646017694

So in terms of actual bytes of JSON we can only convert about 30MB per second. If we count in terms of the compressed data we store on disk then this looks more bleak at only 6 MB/s.

This can be improved by using faster libraries – 50 MB/s

The ultrajson library, ujson, is pretty slick and can improve our performance a bit. This is what Pandas uses under the hood.

In [28]: import ujson
In [29]: def loads(line):
   ....:     try: return ujson.loads(line)
   ....:     except: return None

In [30]: %time blobs = list(map(loads, lines))
CPU times: user 6.37 s, sys: 1.17 s, total: 7.53 s
Wall time: 7.37 s

In [31]: len(total) / 7.37 / 1e6
Out[31]: 52.05149837177748

In [32]: len(compressed) / 7.37 / 1e6
Out[32]: 9.657771099050203

Or through Parallelism – 150 MB/s

This can also be accelerated through parallelism, just like decompression. It’s a bit cumbersome to show parallel deserialization in isolation. Instead we’ll show all of them together. This will under-estimate performance but is much easier to code up.

In [33]: %time db.from_filenames(path).map(loads).count().compute()
CPU times: user 32.3 ms, sys: 822 ms, total: 854 ms
Wall time: 2.8 s

In [38]: len(total) / 2.8 / 1e6
Out[38]: 137.00697964285717

In [39]: len(compressed) / 2.8 / 1e6
Out[39]: 25.420633214285715

Mapping and Grouping - 2000 MB/s

                              Data Bandwidth (MB/s)
Simple Python operations                       1400
Complex CyToolz operations                     2600

Once we have data in memory, pure Python is relatively fast. CyToolz even more so.

In [55]: %time set(d['type'] for d in blobs)
CPU times: user 162 ms, sys: 123 ms, total: 285 ms
Wall time: 268 ms
Out[55]:
{u'CommitCommentEvent',
 u'CreateEvent',
 u'DeleteEvent',
 u'DownloadEvent',
 u'FollowEvent',
 u'ForkEvent',
 u'GistEvent',
 u'GollumEvent',
 u'IssueCommentEvent',
 u'IssuesEvent',
 u'MemberEvent',
 u'PublicEvent',
 u'PullRequestEvent',
 u'PullRequestReviewCommentEvent',
 u'PushEvent',
 u'WatchEvent'}

In [56]: len(total) / 0.268 / 1e6
Out[56]: 1431.4162052238805

In [57]: import cytoolz

In [58]: %time _ = cytoolz.groupby('type', blobs)    # CyToolz FTW
CPU times: user 144 ms, sys: 0 ns, total: 144 ms
Wall time: 144 ms

In [59]: len(total) / 0.144 / 1e6
Out[59]: 2664.024604166667

So slicing and logic are essentially free. The cost of compression and deserialization dominates actual computation time. Don’t bother optimizing fast per-record code, especially if CyToolz has already done so for you. Of course, you might be doing something expensive per record. If so then most of this post isn’t relevant for you.

Shuffling - 5-50 MB/s

                                       Data Bandwidth (MB/s)
Naive groupby with on-disk shuffle                        25
Clever foldby without shuffle                            250

For complex logic, like full groupbys and joins, we need to communicate large amounts of data between workers. This communication forces us to go through another full serialization/write/deserialization/read cycle. This hurts. And so, the single most important message from this post:

Avoid communication-heavy operations on semi-structured data. Structure your data and load into a database instead.

That being said, people will inevitably ignore this advice so we need to have a not-terrible fallback.

In [62]: %time dict(db.from_filenames(path)
   ....:              .map(loads)
   ....:              .groupby('type')
   ....:              .map(lambda (k, v): (k, len(v))))
CPU times: user 46.3 s, sys: 6.57 s, total: 52.8 s
Wall time: 2min 14s
Out[62]:
{'CommitCommentEvent': 17889,
 'CreateEvent': 210516,
 'DeleteEvent': 14534,
 'DownloadEvent': 440,
 'FollowEvent': 35910,
 'ForkEvent': 67939,
 'GistEvent': 7344,
 'GollumEvent': 31688,
 'IssueCommentEvent': 163798,
 'IssuesEvent': 102680,
 'MemberEvent': 11664,
 'PublicEvent': 1867,
 'PullRequestEvent': 69080,
 'PullRequestReviewCommentEvent': 17056,
 'PushEvent': 960137,
 'WatchEvent': 173631}

In [63]: len(total) / 134 / 1e6    # MB/s
Out[63]: 23.559091

This groupby operation goes through the following steps:

  1. Read from disk
  2. Decompress GZip
  3. Deserialize with ujson
  4. Do in-memory groupbys on chunks of the data
  5. Reserialize with msgpack (a bit faster)
  6. Append group parts to disk
  7. Read in new full groups from disk
  8. Deserialize msgpack back to Python objects
  9. Apply length function per group

Some of these steps have great data bandwidths, some less-so. When we compound many steps together our bandwidth suffers. We get about 25 MB/s total. This is about what pyspark gets (although today pyspark can parallelize across multiple machines while dask.bag can not.)

Disclaimer: the numbers above are for dask.bag and could very easily be due to implementation flaws rather than inherent challenges.

>>> import pyspark
>>> sc = pyspark.SparkContext('local[8]')

>>> rdd = sc.textFile(path)
>>> dict(rdd.map(loads)
...         .keyBy(lambda d: d['type'])
...         .groupByKey()
...         .map(lambda (k, v): (k, len(v)))
...         .collect())

I would be interested in hearing from people who use full groupby on BigData. I’m quite curious to hear how this is used in practice and how it performs.

Creative Groupbys - 250 MB/s

Don’t use groupby. You can often work around it with cleverness. Our example above can be handled with streaming grouping reductions (see toolz docs.) This requires more thinking from the programmer but avoids the costly shuffle process.

In [66]: %time dict(db.from_filenames(path)
   ....:              .map(loads)
   ....:              .foldby('type', lambda total, d: total + 1, 0, lambda a, b: a + b))
Out[66]:
{'CommitCommentEvent': 17889,
 'CreateEvent': 210516,
 'DeleteEvent': 14534,
 'DownloadEvent': 440,
 'FollowEvent': 35910,
 'ForkEvent': 67939,
 'GistEvent': 7344,
 'GollumEvent': 31688,
 'IssueCommentEvent': 163798,
 'IssuesEvent': 102680,
 'MemberEvent': 11664,
 'PublicEvent': 1867,
 'PullRequestEvent': 69080,
 'PullRequestReviewCommentEvent': 17056,
 'PushEvent': 960137,
 'WatchEvent': 173631}
CPU times: user 322 ms, sys: 604 ms, total: 926 ms
Wall time: 13.2 s

In [67]: len(total) / 13.2 / 1e6    # MB/s
Out[67]: 239.16047181818183

We can also spell this with PySpark which performs about the same.

>>> dict(rdd.map(loads)                          # PySpark equivalent
...         .keyBy(lambda d: d['type'])
...         .combineByKey(lambda d: 1, lambda total, d: total + 1, lambda a, b: a + b)
...         .collect())

Use a Database

By the time you’re grouping or joining datasets you probably have structured data that could fit into a dataframe or database. You should transition from dynamic data structures (dicts/lists) to dataframes or databases as early as possible. DataFrames and databases compactly represent data in formats that don’t require serialization; this improves performance. Databases are also very clever about reducing communication.

Tools like pyspark, toolz, and dask.bag are great for initial cleanings of semi-structured data into a structured format but they’re relatively inefficient at complex analytics. For inconveniently large data you should consider a database as soon as possible. That could be some big-data-solution or often just Postgres.

Better data structures for semi-structured data?

Dynamic data structures (dicts, lists) are overkill for semi-structured data. We don’t need or use their full power but we inherit all of their limitations (e.g. serialization costs.) Could we build something NumPy/Pandas-like that could handle the blob-of-JSON use-case? Probably.

DyND is one such project. DyND is a C++ project with Python bindings written by Mark Wiebe and Irwin Zaid and historically funded largely by Continuum and XData under the same banner as Blaze/Dask. It could probably handle the semi-structured data problem case if given a bit of love. It handles variable length arrays, text data, and missing values all with numpy-like semantics:

>>> from dynd import nd

>>> data = [{'name': 'Alice',                      # Semi-structured data
...          'location': {'city': 'LA', 'state': 'CA'},
...          'credits': [1, 2, 3]},
...         {'name': 'Bob',
...          'credits': [4, 5],
...          'location': {'city': 'NYC', 'state': 'NY'}}]

>>> dtype = '''var * {name: string,
...                   location: {city: string,
...                              state: string[2]},
...                   credits: var * int}'''       # Shape of our data

>>> x = nd.array(data, type=dtype)                 # Create DyND array

>>> x                                              # Store compactly in memory
nd.array([["Alice", ["LA", "CA"], [1, 2, 3]], ["Bob", ["NYC", "NY"], [4, 5]]])

>>> x.location.city                                # Nested indexing
nd.array(["LA", "NYC"], type="strided * string")

>>> x.credits                                      # Variable length data
nd.array([[1, 2, 3], [4, 5]], type="strided * var * int32")

>>> x.credits * 10                                 # And computation
nd.array([[10, 20, 30], [40, 50]], type="strided * var * int32")

Sadly DyND has functionality gaps which limit usability.

>>> -x.credits                                     # Sadly incomplete :(
TypeError: bad operand type for unary -

I would like to see DyND mature to the point where it could robustly handle semi-structured data. I think that this would be a big win for productivity that would make projects like dask.bag and pyspark obsolete for a large class of use-cases. If you know Python and C++ and would like to help DyND grow, I’m sure that Mark and Irwin would love the help.

Comparison with PySpark

Dask.bag pros:

  1. Doesn’t engage the JVM, no heap errors or fiddly flags to set
  2. Conda/pip installable. You could have it in less than twenty seconds from now.
  3. Slightly faster in-memory implementations thanks to cytoolz; this isn’t important though
  4. Good handling of lazy results per-partition
  5. Faster / lighter weight start-up times
  6. (Subjective) I find the API marginally cleaner

PySpark pros:

  1. Supports distributed computation (this is obviously huge)
  2. More mature, more filled out API
  3. HDFS integration

Dask.bag reinvents a wheel; why bother?

  1. Given the machinery inherited from dask.array and toolz, dask.bag is very cheap to build and maintain. It’s around 500 significant lines of code.
  2. PySpark throws Python processes inside a JVM ecosystem which can cause some confusion among users and a performance hit. A task scheduling system in the native code ecosystem would be valuable.
  3. Comparison and competition is healthy
  4. I’ve been asked to make a distributed array. I suspect that distributed bag is a good first step.

State of Dask


This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project

tl;dr We lay out the pieces of Dask, a system for parallel computing

Introduction

Dask started five months ago as a parallel on-disk array; it has since broadened out. I’ve enjoyed writing about its development tremendously. With the recent 0.5.0 release I decided to take a moment to give an overview of dask’s various pieces, their state, and current development.

Collections, graphs, and schedulers

Dask modules can be separated as follows:

Figure: dask collections, task graphs, and schedulers

On the left there are collections like arrays, bags, and dataframes. These copy APIs for NumPy, PyToolz, and Pandas respectively and are aimed towards data science users, allowing them to interact with larger datasets. Operations on these dask collections produce task graphs which are recipes to compute the desired result using many smaller computations that each fit in memory. For example if we want to sum a trillion numbers then we might break the numbers into million element chunks, sum those, and then sum the sums. A previously impossible task becomes a million and one easy ones.
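
A toy version of that sum-of-sums idea, on a thousand numbers rather than a trillion:

import numpy as np

x = np.arange(1000)
chunks = [x[i:i + 10] for i in range(0, len(x), 10)]    # many small pieces

partial_sums = [chunk.sum() for chunk in chunks]        # independent, memory-friendly tasks
total = sum(partial_sums)                               # one final cheap reduction
assert total == x.sum()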

On the right there are schedulers. Schedulers execute task graphs in different situations, usually in parallel. Notably there are a few schedulers for a single machine, and a new prototype for a distributed scheduler.

In the center is the directed acyclic graph. This graph serves as glue between collections and schedulers. The dask graph format is simple and doesn’t include any dask classes; it’s just functions, dicts, and tuples and so is easy to build on and low-tech enough to understand immediately. This separation is very useful to dask during development; improvements to one side immediately affect the other and new developers have had surprisingly little trouble. Also developers from a variety of backgrounds have been able to come up to speed in about an hour.

This separation is useful to other projects too. Directed acyclic graphs are popular today in many domains. By exposing dask’s schedulers publicly, other projects can bypass dask collections and go straight for the execution engine.

A flattering quote from a github issue:

dask has been very helpful so far, as it allowed me to skip implementing all of the usual graph operations. Especially doing the asynchronous execution properly would have been a lot of work.

Who uses dask?

Dask developers work closely with a few really amazing users:

  1. Stephan Hoyer at Climate Corp has integrated dask.array into xray, a library to manage large volumes of meteorological data (and other labeled arrays).

  2. Scikit image now includes an apply_parallel operation (github PR) that uses dask.array to parallelize image processing routines. (work by Blake Griffith)

  3. Mariano Tepper, a postdoc at Duke, uses dask in his research on matrix factorizations. Mariano is also the primary author of the dask.array.linalg module, which includes efficient and stable QR and SVD for tall and skinny matrices. See Mariano’s paper on arXiv.

  4. Finally I personally use dask on daily work related to the XData project. This tends to drive some of the newer features.

A few other groups pop up on github from time to time; I’d love to know more detail about how people use dask.

What works and what doesn’t

Dask is modular. Each of the collections and each of the schedulers are effectively separate projects. These subprojects are at different states of development. Knowing the stability of each subproject can help you to determine how you use and depend on dask.

Dask.array and dask.threaded work well, are stable, and see constant use. They receive relatively minor bug reports which are dealt with swiftly.

Dask.bag and dask.multiprocessing undergo more API churn but are mostly ready for public use, with a couple of caveats. Neither dask.dataframe nor dask.distributed is ready for public use; they undergo significant API churn and have known errors.

Current work

The current state of development as I see it is as follows:

  1. Dask.bag and dask.dataframe are progressing nicely. My personal work depends on these modules, so they see a lot of attention.
    • At the moment I focus on grouping and join operations through fast shuffles; I hope to write about this problem soon.
    • The Pandas API is large and complex. Reimplementing a subset of it in a blocked way is straightforward but also detailed and time consuming. This would be a great place for community contributions.
  2. Dask.distributed is new. It needs its tires kicked, but it’s an exciting development.
    • For deployment we’re planning to bootstrap off of IPython parallel which already has decent coverage of many parallel job systems, (see #208 by Blake)
  3. Dask.array development these days focuses on outreach. We’ve found application domains where dask is very useful; we’d like to find more.
  4. The collections (Array, Bag, DataFrame) don’t cover all cases. I would like to start finding uses for the task schedulers in isolation. They serve as a release valve in complex situations.

More information

You can install dask with conda

conda install dask

or with pip

pip install dask
or
pip install dask[array]
or
pip install dask[bag]

You can read more about dask at the docs or github.

Pandas Categoricals


tl;dr: Pandas Categoricals efficiently encode and dramatically improve performance on data with text categories

Disclaimer: Categoricals were created by the Pandas development team and not by me.

There is More to Speed Than Parallelism

I usually write about parallelism. As a result people ask me how to parallelize their slow computations. The answer is usually just to use Pandas in a better way:

  • Q: How do I make my pandas code faster with parallelism?
  • A: You don’t need parallelism; you can use Pandas better

This is almost always simpler and more effective than using multiple cores or multiple machines. You should look towards parallelism only after you’ve made sane choices about storage format, compression, data representation, etc..

Today we’ll talk about how Pandas can represent categorical text data numerically. This is a cheap and underused trick to get an order of magnitude speedup on common queries.

Categoricals

Often our data includes text columns with many repeated elements. Examples:

  • Stock symbols – GOOG, AAPL, MSFT, ...
  • Gender – Female, Male, ...
  • Experiment outcomes – Healthy, Sick, No Change, ...
  • States – California, Texas, New York, ...

We usually represent these as text. Pandas represents text with the object dtype which holds a normal Python string. This is a common culprit for slow code because object dtypes run at Python speeds, not at Pandas’ normal C speeds.

Pandas categoricals are a new and powerful feature that encodes categorical data numerically so that we can leverage Pandas’ fast C code on this kind of text data.

>>> # Example dataframe with names, balances, and genders as object dtypes
>>> df = pd.DataFrame({'name': ['Alice', 'Bob', 'Charlie', 'Danielle'],
...                    'balance': [100.0, 200.0, 300.0, 400.0],
...                    'gender': ['Female', 'Male', 'Male', 'Female']},
...                   columns=['name', 'balance', 'gender'])

>>> df.dtypes              # Oh no!  Slow object dtypes!
name        object
balance    float64
gender      object
dtype: object

We can represent columns with many repeats, like gender, more efficiently by using categoricals. This stores our original data in two pieces

  • Original data

     Female, Male, Male, Female
    
  1. Index mapping each category to an integer

    Female: 0
    Male: 1
    ...
    
  2. Normal array of integers

    0, 1, 1, 0
    

This integer array is more compact and is now a normal C array. This allows for normal C-speeds on previously slow object dtype columns. Categorizing a column is easy:

In [5]: df['gender'] = df['gender'].astype('category')   # Categorize!

Let’s look at the result.

In [6]: df                          # DataFrame looks the same
Out[6]:
       name  balance  gender
0     Alice      100  Female
1       Bob      200    Male
2   Charlie      300    Male
3  Danielle      400  Female

In [7]: df.dtypes                   # But dtypes have changed
Out[7]:
name         object
balance     float64
gender     category
dtype: object

In [8]: df.gender                   # Note Categories at the bottom
Out[8]:
0    Female
1      Male
2      Male
3    Female
Name: gender, dtype: category
Categories (2, object): [Female, Male]

In [9]: df.gender.cat.categories    # Category index
Out[9]: Index([u'Female', u'Male'], dtype='object')

In [10]: df.gender.cat.codes        # Numerical values
Out[10]:
0    0
1    1
2    1
3    0
dtype: int8                         # Stored in single bytes!

Notice that we can store our genders much more compactly as single bytes. We can continue to add genders (there are more than just two) and Pandas will use new values (2, 3, …) as necessary.

Our dataframe looks and feels just like it did before. Pandas internals will smooth out the user experience so that you don’t notice that you’re actually using a compact array of integers.

Performance

Let’s look at a slightly larger example to see the performance difference.

We take a small subset of the NYC Taxi dataset and group by medallion ID to find the taxi drivers who drove the longest distance during a certain period.

In [1]: import pandas as pd

In [2]: df = pd.read_csv('trip_data_1_00.csv')

In [3]: %time df.groupby(df.medallion).trip_distance.sum().sort(ascending=False, inplace=False).head()
CPU times: user 161 ms, sys: 0 ns, total: 161 ms
Wall time: 175 ms
Out[3]:
medallion
1E76B5DCA3A19D03B0FB39BCF2A2F534    870.83
6945300E90C69061B463CCDA370DE5D6    832.91
4F4BEA1914E323156BE0B24EF8205B73    811.99
191115180C29B1E2AF8BE0FD0ABD138F    787.33
B83044D63E9421B76011917CE280C137    782.78
Name: trip_distance, dtype: float64

That took around 170ms. We categorize in about the same time.

In [4]: %time df['medallion'] = df['medallion'].astype('category')
CPU times: user 168 ms, sys: 12.1 ms, total: 180 ms
Wall time: 197 ms

Now that we have numerical categories our computation runs in 20ms, about an order of magnitude faster.

In [5]: %time df.groupby(df.medallion).trip_distance.sum().sort(ascending=False, inplace=False).head()
CPU times: user 16.4 ms, sys: 3.89 ms, total: 20.3 ms
Wall time: 20.3 ms
Out[5]:
medallion
1E76B5DCA3A19D03B0FB39BCF2A2F534    870.83
6945300E90C69061B463CCDA370DE5D6    832.91
4F4BEA1914E323156BE0B24EF8205B73    811.99
191115180C29B1E2AF8BE0FD0ABD138F    787.33
B83044D63E9421B76011917CE280C137    782.78
Name: trip_distance, dtype: float64

We see almost an order of magnitude speedup after we do the one-time-operation of replacing object dtypes with categories. Most other computations on this column will be similarly fast. Our memory use drops dramatically as well.

Conclusion

Pandas Categoricals efficiently encode repetitive text data. Categoricals are useful for data like stock symbols, gender, experiment outcomes, cities, states, etc.. Categoricals are easy to use and greatly improve performance on this data.

We have several options to increase performance when dealing with inconveniently large or slow data. Good choices in storage format, compression, column layout, and data representation can dramatically improve query times and memory use. Each of these choices is as important as parallelism but isn’t overly hyped and so is often overlooked.

Jeff Reback gave a nice talk on categoricals (and other features in Pandas) at PyData NYC 2014 and is giving another this weekend at PyData London.

Distributed Scheduling


This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project

tl;dr: We evaluate dask graphs with a variety of schedulers and introduce a new distributed memory scheduler.

Dask.distributed is new and is not battle-tested. Use at your own risk and adjust expectations accordingly.

Evaluate dask graphs

Most dask users use the dask collections, Array, Bag, and DataFrame. These collections are convenient ways to produce dask graphs. A dask graph is a dictionary of tasks. A task is a tuple with a function and arguments.

The graph comprising a dask collection (like a dask.array) is available through its .dask attribute.

>>> import dask.array as da
>>> x = da.arange(15, chunks=(5,))    # 0..14 in three chunks of size five

>>> x.dask    # dask array holds the graph to create the full array
{('x', 0): (np.arange, 0, 5),
 ('x', 1): (np.arange, 5, 10),
 ('x', 2): (np.arange, 10, 15)}

Further operations on x create more complex graphs

>>> z = (x + 100).sum()
>>> z.dask
{('x', 0): (np.arange, 0, 5),
 ('x', 1): (np.arange, 5, 10),
 ('x', 2): (np.arange, 10, 15),
 ('y', 0): (add, ('x', 0), 100),
 ('y', 1): (add, ('x', 1), 100),
 ('y', 2): (add, ('x', 2), 100),
 ('z', 0): (np.sum, ('y', 0)),
 ('z', 1): (np.sum, ('y', 1)),
 ('z', 2): (np.sum, ('y', 2)),
 ('z',): (sum, [('z', 0), ('z', 1), ('z', 2)])}

Hand-made dask graphs

We can make dask graphs by hand without dask collections. This involves creating a dictionary of tuples of functions.

>>> def add(a, b):
...     return a + b

>>> # x = 1
>>> # y = 2
>>> # z = add(x, y)

>>> dsk = {'x': 1,
...        'y': 2,
...        'z': (add, 'x', 'y')}

We evaluate these graphs with one of the dask schedulers

>>> from dask.threaded import get
>>> get(dsk, 'z')    # Evaluate graph with multiple threads
3

>>> from dask.multiprocessing import get
>>> get(dsk, 'z')    # Evaluate graph with multiple processes
3

We separate the evaluation of the graphs from their construction.

Distributed Scheduling

The separation of graphs from evaluation allows us to create new schedulers. In particular there exists a scheduler that operates on multiple machines in parallel, communicating over ZeroMQ.

This system has a single centralized scheduler, several workers, and potentially several clients.

Clients send graphs to the central scheduler which farms out those tasks to workers and coordinates the execution of the graph. While the scheduler centralizes metadata, the workers themselves handle transfer of intermediate data in a peer-to-peer fashion. Once the graph completes the workers send data to the scheduler which passes it through to the appropriate user/client.

Example

And so now we can execute our dask graphs in parallel across multiple machines.

$ ipython  # On your laptop

>>> def add(a, b):
...     return a + b

>>> dsk = {'x': 1,
...        'y': 2,
...        'z': (add, 'x', 'y')}

>>> from dask.threaded import get
>>> get(dsk, 'z')  # use threads
3

>>> from dask.distributed import Client
>>> c = Client('tcp://notebook:5555')
>>> c.get(dsk, 'z')  # use distributed network
3


$ ipython  # Remote Process #1:  Scheduler
>>> from dask.distributed import Scheduler
>>> s = Scheduler(port_to_workers=4444,
...               port_to_clients=5555,
...               hostname='notebook')

$ ipython  # Remote Process #2:  Worker
>>> from dask.distributed import Worker
>>> w = Worker('tcp://notebook:4444')

$ ipython  # Remote Process #3:  Worker
>>> from dask.distributed import Worker
>>> w = Worker('tcp://notebook:4444')

Choose Your Scheduler

This graph is small. We didn’t need a distributed network of machines to compute it (a single thread would have been much faster) but this simple example can be easily extended to more important cases, including generic use with the dask collections (Array, Bag, DataFrame). You can control the scheduler with a keyword argument to any compute call.

>>> import dask.array as da
>>> x = da.random.normal(10, 0.1, size=(1000000000,), chunks=(1000000,))
>>> x.mean().compute(get=get)    # use threads
>>> x.mean().compute(get=c.get)  # use distributed network

Alternatively you can set the default scheduler in dask with dask.set_options

>>> import dask
>>> dask.set_options(get=c.get)  # use distributed scheduler by default

Known Limitations

We intentionally made the simplest and dumbest distributed scheduler we could think of. Because dask separates graphs from schedulers we can iterate on this problem many times, building better schedulers after learning what is important. This current scheduler learns from our shared-memory system but is the first dask scheduler that has to think about distributed memory. As a result it has the following known limitations:

  1. It does not consider data locality. While linear chains of tasks will execute on the same machine we don’t think much about executing multi-input tasks on nodes where only some of the data is local.
  2. In particular, this scheduler isn’t optimized for data-local file-systems like HDFS. It’s still happy to read data from HDFS, but this results in unnecessary network communication. We’ve found that it’s great when paired with S3.
  3. This scheduler is new and hasn’t yet had its tires kicked. Vocal beta users are most welcome.
  4. We haven’t thought much about deployment. E.g. somehow you need to ssh into a bunch of machines and start up workers, then tear them down when you’re done. Dask.distributed can bootstrap off of an IPython Parallel cluster, and we’ve integrated it into anaconda-cluster but deployment remains a tough problem.

The dask.distributed module is available in the last release but I suggest using the development master branch. There will be another release in early July.

Further Information

Blake Griffith has been playing with dask.distributed and dask.bag together on data from http://githubarchive.org. He plans to write a blogpost to give a better demonstration of the use of dask.distributed on real world problems. Look for that post in the next week or two.

You can read more about the internal design of dask.distributed at the dask docs.

Thanks

Special thanks to Min Ragan-Kelley, John Jacobsen, Ben Zaitlen, and Hugo Shi for their advice on building distributed systems.

Also thanks to Blake Griffith for serving as original user/developer and for smoothing over the user experience.


Write Complex Parallel Algorithms


This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project

tl;dr: We discuss the use of complex dask graphs for non-trivial algorithms. We show off an on-disk parallel SVD.

Most Parallel Computation is Simple

Most parallel workloads today are fairly trivial:

>>> import dask.bag as db
>>> b = db.from_s3('githubarchive-data', '2015-01-01-*.json.gz').map(json.loads).map(lambda d: d['type'] == 'PushEvent').count()

Graphs for these computations look like the following:

Embarrassingly parallel dask graph

This is great; these are simple problems to solve efficiently in parallel. Generally these simple computations occur at the beginning of our analyses.

Sophisticated Algorithms can be Complex

Later in our analyses we want more complex algorithms for statistics, machine learning, etc. Often this stage fits comfortably in memory, so we don’t worry about parallelism and can use statsmodels or scikit-learn on the gigabyte result we’ve gleaned from terabytes of data.

However, if our reduced result is still large then we need to think about sophisticated parallel algorithms. This is fresh space with lots of exciting academic and software work.

Example: Parallel, Stable, Out-of-Core SVD

I’d like to show off work by Mariano Tepper, who is responsible for dask.array.linalg. In particular he has a couple of wonderful algorithms for the Singular Value Decomposition (SVD) (also strongly related to Principal Components Analysis (PCA).) Really I just want to show off this pretty graph.

>>> import dask.array as da
>>> x = da.ones((5000, 1000), chunks=(1000, 1000))
>>> u, s, v = da.linalg.svd(x)

Parallel SVD dask graph

This algorithm computes the exact SVD (up to numerical precision) of a large tall-and-skinny matrix in parallel in many small chunks. This allows it to operate out-of-core (from disk) and use multiple cores in parallel. At the bottom we see the construction of our trivial array of ones, followed by many calls to np.linalg.qr on each of the blocks. Then there is a lot of rearranging of various pieces as they are stacked, multiplied, and undergo more rounds of np.linalg.qr and np.linalg.svd. The resulting arrays are available in many chunks at the top and second-from-top rows.

The dask dict for one of these arrays, s, looks like the following:

>>> s.dask
{('x', 0, 0): (np.ones, (1000, 1000)),
 ('x', 1, 0): (np.ones, (1000, 1000)),
 ('x', 2, 0): (np.ones, (1000, 1000)),
 ('x', 3, 0): (np.ones, (1000, 1000)),
 ('x', 4, 0): (np.ones, (1000, 1000)),
 ('tsqr_2_QR_st1', 0, 0): (np.linalg.qr, ('x', 0, 0)),
 ('tsqr_2_QR_st1', 1, 0): (np.linalg.qr, ('x', 1, 0)),
 ('tsqr_2_QR_st1', 2, 0): (np.linalg.qr, ('x', 2, 0)),
 ('tsqr_2_QR_st1', 3, 0): (np.linalg.qr, ('x', 3, 0)),
 ('tsqr_2_QR_st1', 4, 0): (np.linalg.qr, ('x', 4, 0)),
 ('tsqr_2_R', 0, 0): (operator.getitem, ('tsqr_2_QR_st2', 0, 0), 1),
 ('tsqr_2_R_st1', 0, 0): (operator.getitem, ('tsqr_2_QR_st1', 0, 0), 1),
 ('tsqr_2_R_st1', 1, 0): (operator.getitem, ('tsqr_2_QR_st1', 1, 0), 1),
 ('tsqr_2_R_st1', 2, 0): (operator.getitem, ('tsqr_2_QR_st1', 2, 0), 1),
 ('tsqr_2_R_st1', 3, 0): (operator.getitem, ('tsqr_2_QR_st1', 3, 0), 1),
 ('tsqr_2_R_st1', 4, 0): (operator.getitem, ('tsqr_2_QR_st1', 4, 0), 1),
 ('tsqr_2_R_st1_stacked', 0, 0): (np.vstack, [('tsqr_2_R_st1', 0, 0),
                                              ('tsqr_2_R_st1', 1, 0),
                                              ('tsqr_2_R_st1', 2, 0),
                                              ('tsqr_2_R_st1', 3, 0),
                                              ('tsqr_2_R_st1', 4, 0)]),
 ('tsqr_2_QR_st2', 0, 0): (np.linalg.qr, ('tsqr_2_R_st1_stacked', 0, 0)),
 ('tsqr_2_SVD_st2', 0, 0): (np.linalg.svd, ('tsqr_2_R', 0, 0)),
 ('tsqr_2_S', 0): (operator.getitem, ('tsqr_2_SVD_st2', 0, 0), 1)}

So to write complex parallel algorithms we write down dictionaries of tuples of functions.

The dask schedulers take care of executing this graph in parallel using multiple threads. Here is a profile result of a larger computation on a 30000x1000 array:

Low Barrier to Entry

Looking at this graph you may think “Wow, Mariano is awesome” and indeed he is. However, he is more an expert at linear algebra than at Python programming. Dask graphs (just dictionaries) are simple enough that a domain expert was able to look at them, say “Yeah, I can do that”, and write down the very complex algorithms associated with his domain, leaving the execution of those algorithms up to the dask schedulers.

You can see the source code that generates the above graphs on GitHub.

Approximate SVD dask graph

Randomized Parallel Out-of-Core SVD

A few weeks ago a genomics researcher asked for an approximate/randomized variant to SVD. Mariano had a solution up in a few days.

>>> import dask.array as da
>>> x = da.ones((5000, 1000), chunks=(1000, 1000))
>>> u, s, v = da.linalg.svd_compressed(x, k=100, n_power_iter=2)

I’ll omit the full dict for obvious space reasons.

Final Thoughts

Dask graphs let us express parallel algorithms with very little extra complexity. There are no special objects or frameworks to learn, just dictionaries of tuples of functions. This allows domain experts to write sophisticated algorithms without fancy code getting in their way.

Custom Parallel Workflows


This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project

tl;dr: We motivate the expansion of parallel programming beyond big collections. We discuss the usability of custom dask graphs.

Recent Parallel Work Focuses on Big Collections

Parallel databases, Spark, and Dask collections all provide large distributed collections that handle parallel computation for you. You put data into the collection, program with a small set of operations like map or groupby, and the collections handle the parallel processing. This idea has become so popular that there are now a dozen projects promising big and friendly Pandas clones.

This is good. These collections provide usable, high-level interfaces for a large class of common problems.

Custom Workloads

However, many workloads are too complex for these collections. Workloads might be complex either because they come from sophisticated algorithms (as we saw in a recent post on SVD) or because they come from the real world, where problems tend to be messy.

In these cases I tend to see people do two things

  1. Fall back to multiprocessing, MPI or some other explicit form of parallelism
  2. Perform mental gymnastics to fit their problem into Spark using a clever choice of keys. These cases often fail to achieve much speedup.

Direct Dask Graphs

Historically I’ve recommended the manual construction of dask graphs in these cases. Manual construction of dask graphs lets you specify fairly arbitrary workloads that then use the dask schedulers to execute in parallel. The dask docs hold the following example of a simple data processing pipeline:

def load(filename):
    ...

def clean(data):
    ...

def analyze(sequence_of_data):
    ...

def store(result):
    ...

dsk = {'load-1': (load, 'myfile.a.data'),
       'load-2': (load, 'myfile.b.data'),
       'load-3': (load, 'myfile.c.data'),
       'clean-1': (clean, 'load-1'),
       'clean-2': (clean, 'load-2'),
       'clean-3': (clean, 'load-3'),
       'analyze': (analyze, ['clean-%d' % i for i in [1, 2, 3]]),
       'store': (store, 'analyze')}

from dask.multiprocessing import get
get(dsk, 'store')  # executes in parallel

Feedback from users is that this is interesting and powerful but that programming directly in dictionaries is not intuitive, doesn’t integrate well with IDEs, and is prone to error.

Introducing dask.do

To create the same custom parallel workloads using normal-ish Python code we use the dask.do function. This do function turns any normal Python function into a delayed version that adds to a dask graph. The do function lets us rewrite the computation above as follows:

from dask import do

loads = [do(load)('myfile.a.data'),
         do(load)('myfile.b.data'),
         do(load)('myfile.c.data')]
cleaned = [do(clean)(x) for x in loads]
analysis = do(analyze)(cleaned)
result = do(store)(analysis)

The explicit function calls here don’t perform work directly; instead they build up a dask graph which we can then execute in parallel with our choice of scheduler.

from dask.multiprocessing import get
result.compute(get=get)

This interface was suggested by Gael Varoquaux based on his experience with joblib. It was implemented by Jim Crist in PR (#408).

Example: Nested Cross Validation

I sat down with a machine learning student, Gabriel Krummenacher, and worked to parallelize a small piece of code that does nested cross validation. Below is a comparison of the sequential implementation and the version parallelized with dask.do:

You can safely skip reading this code in depth. The take-away is that it’s somewhat involved but that the addition of parallelism is light.

parallelized cross validation code

The parallel version runs about four times faster on my notebook. Disclaimer: The sequential version presented here is just a downgraded version of the parallel code, which is why they look so similar. This is available on github.

So the result of our normal imperative-style for-loop code is a fully parallelizable dask graph. We visualize that graph below.

test_score.visualize()

Cross validation dask graph

Help!

Is this a useful interface? It would be great if people could try this out and generate feedback on dask.do.

For more information on dask.do see the dask imperative documentation.

Caching


This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project

tl;dr: Caching improves performance under repetitive workloads. Traditional LRU policies don’t fit data science well. We propose a new caching policy.

Humans Repeat Stuff

Consider the dataset that you’ve worked on most recently. How many times have you loaded it from disk into memory? How many times have you repeated almost the same computations on that data?

Exploratory data science workloads involve repetition of very similar computations. These computations share structure. By caching frequently used results (or parts of results) we may be able to considerably speed up exploratory data analysis.

Caching in other contexts

The web community loves caching. Database backed web applications almost always guard their database lookups with a system like memcached which devotes some amount of memory to caching frequent and recent queries. Because humans visit mostly the same pages over and over again this can reduce database load by an order of magnitude. Even if humans visit different pages with different inputs these pages often share many elements.

Limited caching resources

Given infinite memory we would cache every result that we’ve ever computed. This would give us instant recall of anything that wasn’t novel. Sadly our memory resources are finite and so we evict cached results that don’t seem to be worth keeping around.

Traditionally we use a policy like Least Recently Used (LRU). This policy evicts results that have not been requested for a long time. This is cheap and works well for web and systems applications.

LRU doesn’t fit analytic workloads

Unfortunately LRU doesn’t fit analytic workloads well. Analytic workloads have a large spread of computation times and of storage costs. While most web application database queries take roughly the same amount of time (100ms-1000ms) and take up roughly the same amount of space to store (1-10kb), the computation and storage costs of analytic computations can easily vary by many orders of magnitude (spreads in the millions or billions are common.)

Consider the following two common computations of a large NumPy array:

  1. x.std() # costly to recompute, cheap to store
  2. x.T # cheap to recompute, costly to store

In the first case, x.std(), this might take a second on a large array (somewhat expensive) but takes only a few bytes to store. This result is so cheap to store that we’re happy to keep it in our cache for a long time, even if it’s infrequently requested.

In the second case, x.T, the result is cheap to compute (just a metadata change in the array) and executes in microseconds. However the result might take gigabytes of memory to store. We don’t want to keep this in our cache, even if it’s very frequently requested; it takes up all of the space for other potentially more useful (and smaller) results and we can recompute it trivially anyway.

So we want to keep cached results that have the following properties:

  1. Costly to recompute (in seconds)
  2. Cheap to store (in bytes)
  3. Frequently used
  4. Recently used

Proposed Caching Policy

Here is an alternative to LRU that respects the objectives stated above.

Every time someone accesses an entry in our cache, we increment the score associated with that entry by the following value:

Where compute time is the time it took to compute the result in the first place, nbytes is the number of bytes that it takes to store the result, epsilon is a small number that determines the halflife of what “recently” means, and t is an auto-incrementing tick time increased at every access.

This has units of inverse bandwidth (s/byte), gives more importance to recent results through a slowly growing exponential factor, and amplifies the score of frequently requested results in a roughly linear fashion.

We maintain these scores in a heap, keep track of the total number of bytes, and cull the cache as necessary to keep storage costs beneath a fixed budget. Updates cost O(log(k)) for k the number of elements in the cache.
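A rough sketch of what such a policy could look like in Python (this is not cachey’s actual code; the increment compute_time / nbytes * (1 + epsilon) ** tick is inferred from the description above, and a simple minimum scan stands in for the heap):

def score_increment(compute_time, nbytes, epsilon, tick):
    # Expensive-to-compute, cheap-to-store results score higher;
    # the (1 + epsilon) ** tick factor favors recent accesses.
    return compute_time / nbytes * (1 + epsilon) ** tick

class TinyCache:
    def __init__(self, available_bytes, epsilon=0.001):
        self.available_bytes = available_bytes
        self.total_bytes = 0
        self.epsilon = epsilon
        self.tick = 0
        self.scores = {}   # key -> score
        self.data = {}     # key -> (value, nbytes, compute_time)

    def get(self, key):
        value, nbytes, compute_time = self.data[key]
        self.tick += 1
        self.scores[key] += score_increment(compute_time, nbytes,
                                            self.epsilon, self.tick)
        return value

    def put(self, key, value, nbytes, compute_time):
        self.tick += 1
        self.data[key] = (value, nbytes, compute_time)
        self.scores[key] = score_increment(compute_time, nbytes,
                                           self.epsilon, self.tick)
        self.total_bytes += nbytes
        while self.total_bytes > self.available_bytes and self.scores:
            victim = min(self.scores, key=self.scores.get)  # lowest score goes first
            _, vbytes, _ = self.data.pop(victim)
            del self.scores[victim]
            self.total_bytes -= vbytes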

Cachey

I wrote this up into a tiny library called cachey. This is experimental code and subject to wild API changes (including renaming.)

The central object is a Cache that asks for the following:

  1. Number of available bytes to devote to the cache
  2. Halflife on importance (the number of accesses that occur to reduce the importance of a cached result by half) (default 1000)
  3. A lower limit on costs to consider entry to the cache (default 0)

Example

Here is a tiny example:

>>> from cachey import Cache
>>> c = Cache(available_bytes=1e9)  # 1 GB

>>> c.put('Hello', 'world!', cost=3)
>>> c.get('Hello')
'world!'

More interesting example

The cache includes a memoize decorator. Let’s memoize pd.read_csv.

In [1]: import pandas as pd
In [2]: from cachey import Cache
In [3]: c = Cache(1e9)
In [4]: read_csv = c.memoize(pd.read_csv)

In [5]: %time df = read_csv('accounts.csv')
CPU times: user 262 ms, sys: 27.7 ms, total: 290 ms
Wall time: 303 ms

In [6]: %time df = read_csv('accounts.csv')  # second read is free
CPU times: user 77 µs, sys: 16 µs, total: 93 µs
Wall time: 93 µs

In [7]: c.total_bytes / c.available_bytes
Out[7]: 0.096

So we create a new function read_csv that operates exactly like pandas.read_csv except that it holds on to recent results in c, a cache. This particular CSV file created a dataframe that filled a tenth of our cache space. The more often we request this CSV file the more its score will grow and the more likely it is to remain in the cache into the future. If other memoized functions using this same cache produce more valuable results (costly to compute, cheap to store) and we run out of space then this result will be evicted and we’ll have to recompute our result if we ask for read_csv('accounts.csv') again.

Memoize everything

Just memoizing read_csv isn’t very interesting. The pd.read_csv function operates at a constant data bandwidth of around 100 MB/s. The caching policies around cachey really shine when they get to see all of our computations. For example it could be that we don’t want to hold on to the results of read_csv because these take up a lot of space. If we find ourselves doing the same groupby computations then we might prefer to use our gigabyte of caching space to store those instead (see the sketch after this list), both because

  1. groupby computations take a long time to compute
  2. groupby results are often very compact in memory
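For example, with the same cache c from above we might memoize a whole groupby-bearing function (a hypothetical sketch; the accounts.csv column names are made up):

import pandas as pd
from cachey import Cache

c = Cache(1e9)

def total_by_id(filename):
    df = pd.read_csv(filename)              # big, cheap-to-recompute intermediate
    return df.groupby('id').amount.sum()    # small, expensive-to-recompute result

total_by_id = c.memoize(total_by_id)

total_by_id('accounts.csv')   # first call computes and caches the small result
total_by_id('accounts.csv')   # repeated calls come straight from the cache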

Cachey and Dask

I’m slowly working on integrating cachey into dask’s shared memory scheduler.

Dask is in a good position to apply cachey to many computations. It can look at every task in a task graph and consider the result for inclusion into the cache. We don’t need to explicitly memoize every function we want to use; dask can do this for us on the fly.

Additionally dask has a nice view of our computation as a collection of sub-tasks. Similar computations (like mean and variance) often share sub-tasks.

Future Work

Cachey is new and untested but potentially useful now, particularly through the memoize method shown above. It’s a small and simple codebase. I would love to hear if people find this kind of caching policy valuable.

I plan to think about the following in the near future:

  1. How do we build a hierarchy of caches to share between memory and disk?
  2. How do we cleanly integrate cachey into dask (see PR #502) and how do we make dask amenable to caching (see PR #510)?

A Weekend with Asyncio


This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project

tl;dr: I learned asyncio and rewrote part of dask.distributed with it; this details my experience

asyncio

The asyncio library provides concurrent programming in the style of Go, Clojure’s core.async library, or more traditional libraries like Twisted. Asyncio offers a programming paradigm that lets many moving parts interact without involving separate threads. These separate parts explicitly yield control to each other and to a central authority and then regain control as others yield control to them. This lets one escape traps like race conditions on shared state, a web of callbacks, lost error reporting, and general confusion.
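Here is a tiny sketch of that style, written with modern async/await syntax rather than the yield-from / trollius syntax used in the code discussed below; the coroutines hand control back to the event loop at each await and communicate over a queue.

import asyncio

async def worker(name, queue):
    while True:
        item = await queue.get()          # yield control until an item arrives
        if item is None:                  # sentinel: shut down cleanly
            break
        print(name, 'processed', item)

async def main():
    queue = asyncio.Queue()
    task = asyncio.ensure_future(worker('w1', queue))
    for i in range(3):
        await queue.put(i)                # hand items to the worker
    await queue.put(None)
    await task                            # wait for the worker to finish

asyncio.run(main())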

I’m not going to write too much about asyncio. Instead I’m going to briefly describe my problem, link to a solution, and then dive into good-and-bad points about using asyncio while they’re fresh in my mind.

Exercise

I won’t actually discuss the application much after this section; you can safely skip this.

I decided to rewrite the dask.distributed Worker using asyncio. This worker has to do the following:

  1. Store local data in a dictionary (easy)
  2. Perform computations on that data as requested by a remote connection (act as a server in a client-server relationship)
  3. Collect data from other workers when we don’t have all of the necessary data for a computation locally (peer-to-peer)
  4. Serve data to other workers who need our data for their own computations (peer-to-peer)

It’s a sort of distributed RPC mechanism with peer-to-peer value sharing. Metadata for who-has-what data is stored in a central metadata store; this could be something like Redis.

The current implementation of this is a nest of threads, queues, and callbacks. It’s not bad and performs well but tends to be hard for others to develop.

Additionally I want to separate the worker code because it’s useful outside of dask.distributed. Other distributed computation solutions exist in my head that rely on this technology.

For the moment the code lives here: https://github.com/mrocklin/dist. I like the design. The module-level docstring of worker.py is short and informative. But again, I’m not going to discuss the application yet; instead, here are some thoughts on learning/developing with asyncio.

General Thoughts

Disclaimer I am a novice concurrent programmer. I write lots of parallel code but little concurrent code. I have never used existing frameworks like Twisted.

I liked the experience of using asyncio and recommend the paradigm to anyone building concurrent applications.

The Good:

  • I can write complex code that involves multiple asynchronous calls, complex logic, and exception handling all in a single place. Complex application logic is no longer spread in many places.
  • Debugging is much easier now that I can throw import pdb; pdb.set_trace() lines into my code and expect them to work (this fails when using threads).
  • My code fails more gracefully, further improving the debug experience. Ctrl-C works.
  • The paradigm shared by Go, Clojure’s core.async, and Python’s asyncio felt viscerally good. I was able to reason well about my program as I was building it and made nice diagrams about explicitly which sequential processes interacted with which others over which channels. I am much more confident of the correctness of the implementation and the design of my program. However, after having gone through this exercise I suspect that I could now implement just about the same design without asyncio. The design paradigm was perhaps as important as the library itself.
  • I have to support Python 2. Fortunately I found the trollius port of asyncio to be very usable. It looks like it was a direct fork-then-modify of tulip.

The Bad:

  • There wasn’t a ZeroMQ connectivity layer for Trollius (though aiozmq exists in Python 3) so I ended up having to use threads anyway for inter-node I/O. This, combined with ZeroMQ’s finicky behavior, did mean that my program crashed hard sometimes. I’m considering switching to plain sockets (which are supported natively by Trollius and asyncio) because of this.
  • While exceptions raise cleanly I can’t determine from where they originate. There are no line numbers or tracebacks. Debugging in a concurrent environment is hard; my experience was definitely better than threads but still could be improved. I hope that asyncio in Python 3.4 has better debugging support.
  • The API documentation is thorough but Stack Overflow coverage, general best practices, and examples are very sparse. The project is new so there isn’t much to go on. I found that reading documentation for Go and presentations on Clojure’s core.async were far more helpful in preparing me to use asyncio than any of the asyncio docs/presentations.

Future

I intend to pursue this into the future and, if the debugging experience is better in Python 3, am considering rewriting the dask.distributed Scheduler in Python 3 with asyncio proper. This is possible because the Scheduler doesn’t have to be compatible with user code.

I found these videos to be useful:

  1. Stuart Halloway on core.async
  2. David Nolen on core.async

Efficient Tabular Storage


tl;dr: We discuss efficient techniques for on-disk storage of tabular data, notably the following:

  • Binary stores
  • Column stores
  • Categorical support
  • Compression
  • Indexed/Partitioned stores

We use the NYC Taxi dataset for examples, and introduce a small project, Castra.

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project

Larger than Memory Data and Disk I/O

We analyze large datasets (10-100GB) on our laptop by extending memory with disk. Tools like dask.array and dask.dataframe make this easier for array and tabular data.

Interaction times can improve significantly (from minutes to seconds) if we choose to store our data on disk efficiently. This is particularly important for large data because we can no longer separately “load in our data” while we get a coffee and then iterate rapidly on our dataset once it’s comfortably in memory.

Larger-than-memory datasets force interactive workflows to include the hard drive.

CSV is convenient but slow

CSV is great. It’s human readable, accessible by every tool (even Excel!), and pretty simple.

CSV is also slow. The pandas.read_csv parser maxes out at 100MB/s on simple data. This doesn’t include any keyword arguments like datetime parsing that might slow it down further. Consider the time to parse a 24GB dataset:

24GB / (100MB/s) == 4 minutes

A four minute delay is too long for interactivity. We need to operate in seconds rather than minutes otherwise people leave to work on something else. This improvement from a few minutes to a few seconds is entirely possible if we choose better formats.

Example with CSVs

As an example let’s play with the NYC Taxi dataset using dask.dataframe, a library that copies the Pandas API but operates in chunks off of disk.

>>> import dask.dataframe as dd
>>> df = dd.read_csv('csv/trip_data_*.csv',
...                  skipinitialspace=True,
...                  parse_dates=['pickup_datetime', 'dropoff_datetime'])
>>> df.head()
medallionhack_licensevendor_idrate_codestore_and_fwd_flagpickup_datetimedropoff_datetimepassenger_counttrip_time_in_secstrip_distancepickup_longitudepickup_latitudedropoff_longitudedropoff_latitude
089D227B655E5C82AECF13C3F540D4CF4BA96DE419E711691B9445D6A6307C170CMT1N2013-01-01 15:11:482013-01-01 15:18:1043821.0-73.97816540.757977-73.98983840.751171
10BD7C8F5BA12B88E0B67BED28BEA73D89FD8F69F0804BDB5549F40E9DA1BE472CMT1N2013-01-06 00:18:352013-01-06 00:22:5412591.5-74.00668340.731781-73.99449940.750660
20BD7C8F5BA12B88E0B67BED28BEA73D89FD8F69F0804BDB5549F40E9DA1BE472CMT1N2013-01-05 18:49:412013-01-05 18:54:2312821.1-74.00470740.737770-74.00983440.726002
3DFD2202EE08F7A8DC9A57B02ACB81FE251EE87E3205C985EF8431D850C786310CMT1N2013-01-07 23:54:152013-01-07 23:58:2022440.7-73.97460240.759945-73.98473440.759388
4DFD2202EE08F7A8DC9A57B02ACB81FE251EE87E3205C985EF8431D850C786310CMT1N2013-01-07 23:25:032013-01-07 23:34:2415602.1-73.97625040.748528-74.00258640.747868

Time Costs

It takes a second to load the first few lines but 11 to 12 minutes to roll through the entire dataset. We make a zoomable picture below of a random sample of the taxi pickup locations in New York City. This example is taken from a full example notebook here.

df2 = df[(df.pickup_latitude > 40) &
         (df.pickup_latitude < 42) &
         (df.pickup_longitude > -75) &
         (df.pickup_longitude < -72)]

sample = df2.sample(frac=0.0001)
pickup = sample[['pickup_latitude', 'pickup_longitude']]

result = pickup.compute()

from bokeh.plotting import figure, show, output_notebook
p = figure(title="Pickup Locations")
p.scatter(result.pickup_longitude, result.pickup_latitude, size=3, alpha=0.2)

Eleven minutes is a long time

This result takes eleven minutes to compute, almost all of which is parsing CSV files. While this may be acceptable for a single computation we invariably make mistakes and start over or find new avenues in our data to explore. Each step in our thought process now takes eleven minutes, ouch.

Interactive exploration of larger-than-memory datasets requires us to evolve beyond CSV files.

Principles to store tabular data

What efficient techniques exist for tabular data?

A good solution may have the following attributes:

  1. Binary
  2. Columnar
  3. Categorical support
  4. Compressed
  5. Indexed/Partitioned

We discuss each of these below.

Binary

Consider the text ‘1.23’ as it is stored in a CSV file and how it is stored as a Python/C float in memory:

  • CSV: 1.23
  • C/Python float: 0x3f9d70a4

These look very different. When we load 1.23 from a CSV textfile we need to translate it to 0x3f9d70a4; this takes time.

A binary format stores our data on disk exactly how it will look in memory; we store the bytes 0x3f9d70a4 directly on disk so that when we load data from disk to memory no extra translation is necessary. Our file is no longer human readable but it’s much faster.
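For instance, Python’s struct module shows the four bytes behind that hexadecimal value:

>>> import struct
>>> struct.pack('<f', 1.23)   # 32-bit float, little-endian: the bytes 0x3f9d70a4
b'\xa4p\x9d?'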

This gets more intense when we consider datetimes:

  • CSV: 2015-08-25 12:13:14
  • NumPy datetime representation: 1440529994000000 (as an integer)

Every time we parse a datetime we need to compute how many microseconds it has been since the epoch. This calculation needs to take into account things like how many days in each month, and all of the intervening leap years. This is slow. A binary representation would record the integer directly on disk (as 0x51e278694a680) so that we can load our datetimes directly into memory without calculation.
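For instance, NumPy already stores datetimes this way; a quick illustration of viewing the underlying bytes as integers, with no parsing involved:

>>> import numpy as np
>>> t = np.array(['2015-08-25 12:13:14'], dtype='datetime64[us]')
>>> t.view('int64')   # the microseconds-since-epoch integer, read directly from memory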

Columnar

Many analytic computations only require a few columns at a time, often only one, e.g.

>>>df.passenger_counts.value_counts().compute().sort_index()0375511196050392230971533718735443519779598525396662828773082392412912551Name:passenger_count,dtype:int64

Of our 24 GB we may only need 2GB. Columnar storage means storing each column separately from the others so that we can read relevant columns without passing through irrelevant columns.

Our CSV example fails at this. While we only want two columns, pickup_datetime and pickup_longitude, we pass through all of our data to collect the relevant fields. The pickup location data is mixed with all the rest.

Categoricals

Categoricals encode repetitive text columns (normally very expensive) as integers (very very cheap) in a way that is invisible to the user.

Consider the following (mostly text) columns of our NYC taxi dataset:

>>> df[['medallion', 'vendor_id', 'rate_code', 'store_and_fwd_flag']].head()
                          medallion vendor_id  rate_code store_and_fwd_flag
0  89D227B655E5C82AECF13C3F540D4CF4       CMT          1                  N
1  0BD7C8F5BA12B88E0B67BED28BEA73D8       CMT          1                  N
2  0BD7C8F5BA12B88E0B67BED28BEA73D8       CMT          1                  N
3  DFD2202EE08F7A8DC9A57B02ACB81FE2       CMT          1                  N
4  DFD2202EE08F7A8DC9A57B02ACB81FE2       CMT          1                  N

Each of these columns represents elements of a small set:

  • There are two vendor ids
  • There are twenty one rate codes
  • There are three store-and-forward flags (Y, N, missing)
  • There are about 13000 taxi medallions. (still a small number)

And yet we store these elements in large and cumbersome dtypes:

In [4]: df[['medallion', 'vendor_id', 'rate_code', 'store_and_fwd_flag']].dtypes
Out[4]:
medallion             object
vendor_id             object
rate_code              int64
store_and_fwd_flag    object
dtype: object

We use int64 for the rate code, which could easily have fit into an int8, an opportunity for an 8x improvement in memory use. The object dtype used for strings in Pandas and Python takes up a lot of memory and is quite slow:

In [1]: import sys
In [2]: sys.getsizeof('CMT')  # bytes
Out[2]: 40

Categoricals replace the original column with a column of integers (of the appropriate size, often int8) along with a small index mapping those integers to the original values. I’ve written about categoricals before so I won’t go into too much depth here. Categoricals increase both storage and computational efficiency by about 10x if you have text data that describes elements in a category.
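For example, in Pandas:

>>> import pandas as pd
>>> s = pd.Series(['CMT', 'VTS', 'CMT', 'CMT'], dtype='category')
>>> s.cat.codes        # small integer codes: 0, 1, 0, 0
>>> s.cat.categories   # Index(['CMT', 'VTS'], dtype='object')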

Compression

After we’ve encoded everything well and separated our columns we find ourselves limited by disk I/O read speeds. Disk read bandwidths range from 100MB/s (laptop spinning disk hard drive) to 2GB/s (RAID of SSDs). This read speed strongly depends on how large our reads are. The bandwidths given above reflect large sequential reads such as you might find when reading all of a 100MB file in one go. Performance degrades for smaller reads. Fortunately, for analytic queries we’re often in the large sequential read case (hooray!)

We reduce disk read times through compression. Consider the datetimes of the NYC taxi dataset. These values are repetitive and slowly changing; a perfect match for modern compression techniques.

>>> ind = df.index.compute()  # this is on presorted index data (see castra section below)
>>> ind
DatetimeIndex(['2013-01-01 00:00:00', '2013-01-01 00:00:00',
               '2013-01-01 00:00:00', '2013-01-01 00:00:00',
               '2013-01-01 00:00:00', '2013-01-01 00:00:00',
               '2013-01-01 00:00:00', '2013-01-01 00:00:00',
               '2013-01-01 00:00:00', '2013-01-01 00:00:00',
               ...
               '2013-12-31 23:59:42', '2013-12-31 23:59:47',
               '2013-12-31 23:59:48', '2013-12-31 23:59:49',
               '2013-12-31 23:59:50', '2013-12-31 23:59:51',
               '2013-12-31 23:59:54', '2013-12-31 23:59:55',
               '2013-12-31 23:59:57', '2013-12-31 23:59:57'],
              dtype='datetime64[ns]', name=u'pickup_datetime',
              length=169893985, freq=None, tz=None)

Benchmark datetime compression

We can use a modern compression library, like fastlz or blosc to compress this data at high speeds.

In [36]: import blosc

In [37]: %time compressed = blosc.compress_ptr(address=ind.values.ctypes.data,
    ...:                                       items=len(ind),
    ...:                                       typesize=ind.values.dtype.alignment,
    ...:                                       clevel=5)
CPU times: user 3.22 s, sys: 332 ms, total: 3.55 s
Wall time: 512 ms

In [40]: len(compressed) / ind.nbytes  # compression ratio
Out[40]: 0.14296813539337488

In [41]: ind.nbytes / 0.512 / 1e9  # Compression bandwidth (GB/s)
Out[41]: 2.654593515625

In [42]: %time _ = blosc.decompress(compressed)
CPU times: user 1.3 s, sys: 438 ms, total: 1.74 s
Wall time: 406 ms

In [43]: ind.nbytes / 0.406 / 1e9  # Decompression bandwidth (GB/s)
Out[43]: 3.3476647290640393

We store 7x fewer bytes on disk (thus septupling our effective disk I/O) at the cost of an extra decompression step that runs at about 3 GB/s. If we’re on a really nice Macbook pro hard drive (~600 MB/s) then this is a clear and substantial win. The worse the hard drive, the better this is.

But sometimes compression isn’t as nice

Some data is more or less compressible than other data. The following column of floating point data does not compress as nicely.

In [44]: x = df.pickup_latitude.compute().values

In [45]: %time compressed = blosc.compress_ptr(x.ctypes.data, len(x), x.dtype.alignment, clevel=5)
CPU times: user 5.87 s, sys: 0 ns, total: 5.87 s
Wall time: 925 ms

In [46]: len(compressed) / x.nbytes
Out[46]: 0.7518617315969132

This compresses more slowly and only provides marginal benefit. Compression may still be worth it on slow disk but this isn’t a huge win.

The pickup_latitude column isn’t compressible because most of the information isn’t repetitive. The numbers to the far right of the decimal point are more or less random.

40.747868

Other floating point columns may compress well, particularly when they are rounded to small and meaningful decimal values.

Compression rules of thumb

Optimal compression requires thought. General rules of thumb include the following:

  • Compress integer dtypes
  • Compress datetimes
  • If your data is slowly varying (e.g. sorted time series) then use a shuffle filter (default in blosc)
  • Don’t bother much with floating point dtypes
  • Compress categoricals (which are just integer dtypes)

Avoid gzip and bz2

Finally, avoid gzip and bz2. These are both very common and very slow. If dealing with text data, consider snappy (also available via blosc.)

Indexing/Partitioning

One column usually dominates our queries. In time-series data this is time. For personal data this is the user ID.

Just as column stores let us avoid irrelevant columns, partitioning our data along a preferred index column lets us avoid irrelevant rows. We may need the data for the last month and don’t need several years’ worth. We may need the information for Alice and don’t need the information for Bob.

Traditional relational databases provide indexes on any number of columns or sets of columns. This is excellent if you are using a traditional relational database. Unfortunately the data structures to provide arbitrary indexes don’t mix well with some of the attributes discussed above and we’re limited to a single index that partitions our data into sorted blocks.

Some projects that implement these principles

Many modern distributed database storage systems designed for analytic queries implement these principles well. Notable players include Redshift and Parquet.

Additionally newer single-machine data stores like Dato’s SFrame and BColz follow many of these principles. Finally many people have been doing this for a long time with custom use of libraries like HDF5.

It turns out that these principles are actually quite easy to implement with the right tools (thank you #PyData). The rest of this post will talk about a tiny 500 line project, Castra, that implements these principles and gets good speedups on biggish Pandas data.

Castra

With these goals in mind we built Castra, a binary partitioned compressed columnstore with builtin support for categoricals and integration with both Pandas and dask.dataframe.

Load data from CSV files, sort on index, save to Castra

Here we load in our data from CSV files, sort on the pickup datetime column, and store to a castra file. This takes about an hour (as compared to eleven minutes for a single read). Again, you can view the full notebook here.

>>> import dask.dataframe as dd
>>> df = dd.read_csv('csv/trip_data_*.csv',
...                  skipinitialspace=True,
...                  parse_dates=['pickup_datetime', 'dropoff_datetime'])

>>> (df.set_index('pickup_datetime', compute=False)
...    .to_castra('trip.castra', categories=True))

Profit

Now we can take advantage of columnstores, compression, and binary representation to perform analytic queries quickly. Here is code to create a histogram of trip distance. The plot of the results follows below.

Note that this is especially fast because Pandas now releases the GIL on value_counts operations (all groupby operations really). This takes around 20 seconds on my machine on the last release of Pandas vs 5 seconds on the development branch. Moving from CSV files to Castra moved the bottleneck of our computation from disk I/O to processing speed, allowing improvements like multi-core processing to really shine.

We plot the result of the above computation with Bokeh below. Note the spike around 20km. This is around the distance from Midtown Manhattan to LaGuardia airport.

I’ve shown Castra used above with dask.dataframe but it works fine with straight Pandas too.

Credit

Castra was started by myself and Valentin Haenel (current maintainer of bloscpack and bcolz) during an evening sprint following PyData Berlin. Several bugfixes and refactors followed, contributed by Phil Cloud and Jim Crist.

Castra is roughly 500 lines long. It’s a tiny project which is both good and bad. It’s being used experimentally and there are some heavy disclaimers in the README. This post is not intended as a sales pitch for Castra, but rather to provide a vocabulary to talk about efficient tabular storage.

Response to twitter traffic: again, this blogpost is not saying “use Castra!” Rather it says “don’t use CSVs!” and consider more efficient storage for interactive use. Other, more mature solutions exist or could be built. Castra was an experiment of “how fast can we get without doing too much work.”

Distributed Prototype


This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project

tl;dr: We demonstrate a prototype distributed computing library and discuss data locality.

Distributed Computing

Here’s a new prototype library for distributed computing. It could use some critical feedback.

This blogpost uses distributed on a toy example. I won’t talk about the design here, but the docs should be a quick and informative read. I recommend the quickstart in particular.

We’re going to do a simple computation a few different ways on a cluster of four nodes. The computation will be

  1. Make 1000 random numpy arrays, each of size 1,000,000
  2. Compute the sum of each array
  3. Compute the total sum of the sums

We’ll do this directly with a distributed Pool and again with a dask graph.

Start up a Cluster

I have a cluster of four m3.xlarges on EC2

$ ssh node1
$ dcenter

$ ssh node2
$ dworker node1:8787

$ ssh node3
$ dworker node1:8787

$ ssh node4
$ dworker node1:8787

Notes on how I set up my cluster.

Pool

On the client side we spin up a distributed Pool and point it to the center node.

>>> from distributed import Pool
>>> pool = Pool('node1:8787')

Then we create a bunch of random numpy arrays:

>>> import numpy as np
>>> arrays = pool.map(np.random.random, [1000000] * 1000)

Our result is a list of proxy objects that point back to individual numpy arrays on the worker computers. We don’t move data until we need to. (Though we could call .get() on this to collect the numpy array from the worker.)

>>> arrays[0]
RemoteData<center=10.141.199.202:8787, key=3e446310-6...>

Further computations on this data happen on the cluster, on the worker nodes that hold the data already.

>>> sums = pool.map(np.sum, arrays)

This avoids costly data transfer times. Data transfer will happen when necessary though, as when we compute the final sum. This forces communication because all of the intermediate sums must move to one node for the final addition.

>>> total = pool.apply(np.sum, args=(sums,))
>>> total.get()  # finally transfer result to local machine
499853416.82058007

distributed.dask

Now we do the same computation all at once by manually constructing a dask graph (beware, this can get gnarly, friendlier approaches exist below.)

>>> dsk = dict()
>>> for i in range(1000):
...     dsk[('x', i)] = (np.random.random, 1000000)
...     dsk[('sum', i)] = (np.sum, ('x', i))
>>> dsk['total'] = (sum, [('sum', i) for i in range(1000)])

>>> from distributed.dask import get
>>> get('node1', 8787, dsk, 'total')
500004095.00759566

Apparently not everyone finds dask dictionaries to be pleasant to write by hand. You could also use this with dask.imperative or dask.array.

dask.imperative

def get2(dsk, keys):
    """ Make `get` scheduler that hardcodes the IP and Port """
    return get('node1', 8787, dsk, keys)
>>> from dask.imperative import do
>>> arrays = [do(np.random.random)(1000000) for i in range(1000)]
>>> sums = [do(np.sum)(x) for x in arrays]
>>> total = do(np.sum)(sums)
>>> total.compute(get=get2)
499993637.00844824

dask.array

>>> import dask.array as da
>>> x = da.random.random(1000000 * 1000, chunks=(1000000,))
>>> x.sum().compute(get=get2)
500000250.44921482

The dask approach was smart enough to delete all of the intermediates that it didn’t need. It could have run intelligently on far more data than even our cluster could hold. With the pool we manage data ourselves manually.

>>> from distributed import delete
>>> delete(('node0', 8787), arrays)

Mix and Match

We can also mix these abstractions and put the results from the pool into dask graphs.

>>> arrays = pool.map(np.random.random, [1000000] * 1000)
>>> dsk = {('sum', i): (np.sum, x) for i, x in enumerate(arrays)}
>>> dsk['total'] = (sum, [('sum', i) for i in range(1000)])

Discussion

The Pool and get user interfaces are independent from each other but both use the same underlying network and both build off of the same codebase. With distributed I wanted to build a system that would allow me to experiment easily. I’m mostly happy with the result so far.

One non-trivial theme here is data-locality. We keep intermediate results on the cluster and schedule jobs on computers that already have the relevant data if possible. The workers can communicate with each other if necessary so that any worker can do any job, but we try to arrange jobs so that workers don’t have to communicate if not necessary.

Another non-trivial aspect is that the high level dask.array example works without any tweaking of dask. Dask’s separation of schedulers from collections means that existing dask.array code (or dask.dataframe, dask.bag, dask.imperative code) gets to evolve as we experiment with new fancier schedulers.

Finally, I hope that the cluster setup here feels pretty minimal. You do need some way to run a command on a bunch of machines but most people with clusters have some mechanism to do that, even if it’s just ssh as I did above. My hope is that distributed lowers the bar for non-trivial cluster computing in Python.

Disclaimer

Everything here is very experimental. The library itself is broken and unstable. It was made in the last few weeks and hasn’t been used on anything serious. Please adjust expectations accordingly and provide critical feedback.

Data Bandwidth


This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project

tl;dr: We list and combine common bandwidths relevant in data science

Understanding data bandwidths helps us to identify bottlenecks and write efficient code. Both hardware and software can be characterized by how quickly they churn through data. We present a rough list of relevant data bandwidths and discuss how to use this list when optimizing a data pipeline.

Name                                      Bandwidth (MB/s)
Memory copy 3000
Basic filtering in C/NumPy/Pandas 3000
Fast decompression 1000
SSD Large Sequential Read 500
Interprocess communication (IPC) 300
msgpack deserialization 125
Gigabit Ethernet 100
Pandas read_csv 100
JSON Deserialization 50
Slow decompression (e.g. gzip/bz2) 50
SSD Small Random Read 20
Wireless network 1

Disclaimer: all numbers in this post are rule-of-thumb and vary by situation

Understanding these scales can help you to identify how to speed up your program. For example, there is no need to use a faster network or disk if you store your data as JSON.

Combining bandwidths

Complex data pipelines involve many stages. The rule to combine bandwidths is to add up the inverses of the bandwidths, then take the inverse again:
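In symbols, for stages with bandwidths B1, B2, …, Bn:

    total bandwidth = 1 / (1/B1 + 1/B2 + ... + 1/Bn)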

This is the same principle behind adding conductances in serial within electrical circuits. One quickly learns to optimize the slowest link in the chain first.

Example

When we read data from disk (500 MB/s) and then deserialize it from JSON (50 MB/s) our full bandwidth is 45 MB/s:

If we invest in a faster hard drive system that has 2 GB/s of read bandwidth then we get only marginal performance improvement:

However if we invest in a faster serialization technology, like msgpack (125 MB/s), then we double our effective bandwidth.
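A quick check of these numbers with the rule above (a small Python sketch; values in MB/s):

def combine(*bandwidths):
    # combined bandwidth of serial stages: inverse of the summed inverses
    return 1 / sum(1.0 / b for b in bandwidths)

combine(500, 50)     # disk + JSON           -> ~45 MB/s
combine(2000, 50)    # faster disk + JSON    -> ~49 MB/s
combine(500, 125)    # disk + msgpack        -> ~100 MB/s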

This example demonstrates that we should focus on the weakest bandwidth first. Cheap changes like switching from JSON to msgpack can be more effective than expensive changes, like purchasing expensive hardware for fast storage.

Overlapping Bandwidths

We can overlap certain classes of bandwidths. In particular we can often overlap communication bandwidths with computation bandwidths. In our disk+JSON example above we can probably hide the disk reading time completely. The same would go for network applications if we handle sockets correctly.

Parallel Bandwidths

We can parallelize some computational bandwidths. For example we can parallelize JSON deserialization by our number of cores to quadruple the effective bandwidth 50 MB/s * 4 = 200 MB/s. Typically communication bandwidths are not parallelizable per core.
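A small sketch of that kind of parallel deserialization with a process pool (hypothetical data; the actual speedup depends on your machine):

import json
from concurrent.futures import ProcessPoolExecutor

def parse(lines):
    return [json.loads(line) for line in lines]

if __name__ == '__main__':
    lines = ['{"value": %d}' % i for i in range(100000)]
    chunks = [lines[i::4] for i in range(4)]          # split the work four ways
    with ProcessPoolExecutor(max_workers=4) as pool:
        records = sum(pool.map(parse, chunks), [])    # parse chunks on separate cores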


Disk Bandwidth


This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project

tl;dr: Disk read and write bandwidths depend strongly on block size.

Disk read/write bandwidths on commodity hardware vary between 10 MB/s (or slower) to 500 MB/s (or faster on fancy hardware). This variance can be characterized by the following rules:

  1. Reading/writing large blocks of data is faster than reading/writing small blocks of data
  2. Reading is faster than writing
  3. Solid state drives are faster than spinning disk especially for many small reads/writes

In this notebook we experiment with the dependence of disk bandwidth on file size. The result of this experiment is the following image, which depicts the read and write bandwidths of a commercial laptop SSD as we vary block size:

Analysis

We see that this particular hard drive wants to read/write data in chunksizes of 1-100 MB. If we can arrange our data so that we consistently pull off larger blocks of data at a time then we can read through data quite quickly at 500 MB/s. We can churn through a 30 GB dataset in one minute. Sophisticated file formats take advantage of this by storing similar data consecutively. For example column stores store all data within a single column in single large blocks.

Difficulties when measuring disk I/O

Your file system is sophisticated. It will buffer both reads and writes in RAM as you think you’re writing to disk. In particular, this guards your disk somewhat against the “many small writes” regime of terrible performance. This is great, your file system does a fantastic job (bringing write numbers up from 0.1 MB/s to 20 MB/s or so) but it makes it a bit tricky to benchmark properly. In the experiment above we fsync each file after write to flush write buffers and explicitly clear all buffers before entering the read section.
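A minimal sketch of this kind of measurement (not the notebook’s actual code): write a fixed amount of data in blocks of a given size, fsync before stopping the clock, and repeat for several block sizes; reads are measured similarly after clearing buffers.

import os
import time

def write_bandwidth(path, total_bytes=10**9, block_size=10**6):
    block = b'\0' * block_size
    start = time.time()
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
    try:
        for _ in range(total_bytes // block_size):
            os.write(fd, block)
        os.fsync(fd)                  # flush write buffers before stopping the clock
    finally:
        os.close(fd)
    return total_bytes / (time.time() - start) / 1e6   # MB/s

for size in [10**4, 10**5, 10**6, 10**7, 10**8]:
    print(size, write_bandwidth('scratch.bin', block_size=size))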

Anecdotally I also learned that my operating system caps write speeds at 30 MB/s when operating off of battery power. This anecdote demonstrates how particular your hard drive may be when controlled by a file system. It is worth remembering that your hard drive is a physical machine and not just a convenient abstraction.

Introducing Dask distributed


tl;dr: We analyze JSON data on a cluster using pure Python projects.

Dask, a Python library for parallel computing, now works on clusters. During the past few months I and others have extended dask with a new distributed memory scheduler. This enables dask’s existing parallel algorithms to scale across 10s to 100s of nodes, and extends a subset of PyData to distributed computing. Over the next few weeks I and others will write about this system. Please note that dask+distributed is developing quickly and so the API is likely to shift around a bit.

Today we start simple with the typical cluster computing problem, parsing JSON records, filtering, and counting events using dask.bag and the new distributed scheduler. We’ll dive into more advanced problems in future posts.

A video version of this blogpost is available here.

GitHub Archive Data on S3

GitHub releases data dumps of their public event stream as gzip-compressed, line-delimited JSON. This data is too large to fit comfortably into memory, even on a sizable workstation. We could stream it from disk but, due to the compression and JSON encoding, this takes a while and so slogs down interactive use. For an interactive experience with data like this we need a distributed cluster.

Setup and Data

We provision nine m3.2xlarge nodes on EC2. These have eight cores and 30GB of RAM each. On this cluster we provision one scheduler and nine workers (see setup docs). (More on launching in later posts.) We have five months of data, from 2015-01-01 to 2015-05-31, on the githubarchive-data bucket in S3. This data is publicly available if you want to play with it on EC2. You can download the full dataset at https://www.githubarchive.org/ .

The first record looks like the following:

{'actor': {'avatar_url': 'https://avatars.githubusercontent.com/u/9152315?',
  'gravatar_id': '',
  'id': 9152315,
  'login': 'davidjhulse',
  'url': 'https://api.github.com/users/davidjhulse'},
 'created_at': '2015-01-01T00:00:00Z',
 'id': '2489368070',
 'payload': {'before': '86ffa724b4d70fce46e760f8cc080f5ec3d7d85f',
  'commits': [{'author': {'email': 'david.hulse@live.com',
     'name': 'davidjhulse'},
    'distinct': True,
    'message': 'Altered BingBot.jar\n\nFixed issue with multiple account support',
    'sha': 'a9b22a6d80c1e0bb49c1cf75a3c075b642c28f81',
    'url': 'https://api.github.com/repos/davidjhulse/davesbingrewardsbot/commits/a9b22a6d80c1e0bb49c1cf75a3c075b642c28f81'}],
  'distinct_size': 1,
  'head': 'a9b22a6d80c1e0bb49c1cf75a3c075b642c28f81',
  'push_id': 536740396,
  'ref': 'refs/heads/master',
  'size': 1},
 'public': True,
 'repo': {'id': 28635890,
  'name': 'davidjhulse/davesbingrewardsbot',
  'url': 'https://api.github.com/repos/davidjhulse/davesbingrewardsbot'},
 'type': 'PushEvent'}

So we have a large dataset on S3 and a moderate sized play cluster on EC2, which has access to S3 data at about 100MB/s per node. We’re ready to play.

Play

We start an ipython interpreter on our local laptop and connect to the dask scheduler running on the cluster. For the purposes of timing, the cluster is on the East Coast while the local machine is in California on commercial broadband internet.

>>> from distributed import Executor, s3
>>> e = Executor('54.173.84.107:8786')
>>> e
<Executor: scheduler=54.173.84.107:8786 workers=72 threads=72>

Our seventy-two worker processes come from nine workers with eight processes each. We chose processes rather than threads for this task because computations will be bound by the GIL. We will change this to threads in later examples.

We start by loading a single month of data into distributed memory.

import json

text = s3.read_text('githubarchive-data', '2015-01', compression='gzip')
records = text.map(json.loads)
records = e.persist(records)

The data lives in S3 in hourly files as gzip-encoded, line-delimited JSON. The s3.read_text and text.map functions produce dask.bag objects which track our operations in a lazily built task graph. When we ask the executor to persist this collection we ship those tasks off to the scheduler to run on all of the workers in parallel. The persist function gives us back another dask.bag pointing to these remotely running results. This persist function returns immediately, and the computation happens on the cluster in the background asynchronously. We gain control of our interpreter immediately while the cluster hums along.

The cluster takes around 40 seconds to download, decompress, and parse this data. If you watch the video embedded above you’ll see fancy progress-bars.

We ask for a single record. This returns in around 200ms, which is fast enough that it feels instantaneous to a human.

>>>records.take(1)({'actor':{'avatar_url':'https://avatars.githubusercontent.com/u/9152315?','gravatar_id':'','id':9152315,'login':'davidjhulse','url':'https://api.github.com/users/davidjhulse'},'created_at':'2015-01-01T00:00:00Z','id':'2489368070','payload':{'before':'86ffa724b4d70fce46e760f8cc080f5ec3d7d85f','commits':[{'author':{'email':'david.hulse@live.com','name':'davidjhulse'},'distinct':True,'message':'Altered BingBot.jar\n\nFixed issue with multiple account support','sha':'a9b22a6d80c1e0bb49c1cf75a3c075b642c28f81','url':'https://api.github.com/repos/davidjhulse/davesbingrewardsbot/commits/a9b22a6d80c1e0bb49c1cf75a3c075b642c28f81'}],'distinct_size':1,'head':'a9b22a6d80c1e0bb49c1cf75a3c075b642c28f81','push_id':536740396,'ref':'refs/heads/master','size':1},'public':True,'repo':{'id':28635890,'name':'davidjhulse/davesbingrewardsbot','url':'https://api.github.com/repos/davidjhulse/davesbingrewardsbot'},'type':'PushEvent'},)

This particular event is a 'PushEvent'. Let’s quickly see all the kinds of events. For fun, we’ll also time the interaction:

>>> %time records.pluck('type').frequencies().compute()
CPU times: user 112 ms, sys: 0 ns, total: 112 ms
Wall time: 2.41 s
[('ReleaseEvent', 44312),
 ('MemberEvent', 69757),
 ('IssuesEvent', 693363),
 ('PublicEvent', 14614),
 ('CreateEvent', 1651300),
 ('PullRequestReviewCommentEvent', 214288),
 ('PullRequestEvent', 680879),
 ('ForkEvent', 491256),
 ('DeleteEvent', 256987),
 ('PushEvent', 7028566),
 ('IssueCommentEvent', 1322509),
 ('GollumEvent', 150861),
 ('CommitCommentEvent', 96468),
 ('WatchEvent', 1321546)]

And we compute the total number of events for this month.

>>> %time records.count().compute()
CPU times: user 134 ms, sys: 133 µs, total: 134 ms
Wall time: 1.49 s
14036706

We see that it takes a few seconds to walk through the data (and pay all of the scheduling overhead). The scheduler adds about a millisecond of overhead per task, and there are about 1000 partitions/files here (the GitHub data is split by hour, and there are roughly 730 hours in a month), so most of the cost here is overhead.

Investigate Jupyter

We investigate the activities of Project Jupyter. We chose this project because it’s sizable and because we understand the players involved and so can check our accuracy. This will require us to filter our data to a much smaller subset, then find popular repositories and members.

>>> jupyter = (records.filter(lambda d: d['repo']['name'].startswith('jupyter/'))
...                   .repartition(10))
>>> jupyter = e.persist(jupyter)

All records, regardless of event type, have a repository with a name like 'organization/repository' in typical GitHub fashion. We keep all records whose repository name starts with 'jupyter/'. Additionally, because this dataset is likely much smaller, we push all of these records into just ten partitions. This dramatically reduces scheduling overhead. The persist call hands this computation off to the scheduler and gives us back a collection that points to the remotely computing results. Filtering this month for Jupyter events takes about 7.5 seconds. Afterwards, computations on this subset feel snappy.

>>> %time jupyter.count().compute()
CPU times: user 5.19 ms, sys: 97 µs, total: 5.28 ms
Wall time: 199 ms
747

>>> %time jupyter.take(1)
CPU times: user 7.01 ms, sys: 259 µs, total: 7.27 ms
Wall time: 182 ms
({'actor': {'avatar_url': 'https://avatars.githubusercontent.com/u/26679?',
            'gravatar_id': '',
            'id': 26679,
            'login': 'marksteve',
            'url': 'https://api.github.com/users/marksteve'},
  'created_at': '2015-01-01T13:25:44Z',
  'id': '2489612400',
  'org': {'avatar_url': 'https://avatars.githubusercontent.com/u/7388996?',
          'gravatar_id': '',
          'id': 7388996,
          'login': 'jupyter',
          'url': 'https://api.github.com/orgs/jupyter'},
  'payload': {'action': 'started'},
  'public': True,
  'repo': {'id': 5303123,
           'name': 'jupyter/nbviewer',
           'url': 'https://api.github.com/repos/jupyter/nbviewer'},
  'type': 'WatchEvent'},)

So the first event of the year was by 'marksteve' who decided to watch the 'nbviewer' repository on new year’s day.

Notice that these computations take around 200ms. I can’t get below this from my local machine, so we’re likely bound by communicating to such a remote location. A 200ms latency is not great if you’re playing a video game, but it’s decent for interactive computing.

Here are all of the Jupyter repositories touched in the month of January,

>>> %time jupyter.pluck('repo').pluck('name').distinct().compute()
CPU times: user 2.84 ms, sys: 4.03 ms, total: 6.86 ms
Wall time: 204 ms
['jupyter/dockerspawner',
 'jupyter/design',
 'jupyter/docker-demo-images',
 'jupyter/jupyterhub',
 'jupyter/configurable-http-proxy',
 'jupyter/nbshot',
 'jupyter/sudospawner',
 'jupyter/colaboratory',
 'jupyter/strata-sv-2015-tutorial',
 'jupyter/tmpnb-deploy',
 'jupyter/nature-demo',
 'jupyter/nbcache',
 'jupyter/jupyter.github.io',
 'jupyter/try.jupyter.org',
 'jupyter/jupyter-drive',
 'jupyter/tmpnb',
 'jupyter/tmpnb-redirector',
 'jupyter/nbgrader',
 'jupyter/nbindex',
 'jupyter/nbviewer',
 'jupyter/oauthenticator']

And here are the ten most active people across these repositories.

>>> %time (jupyter.pluck('actor').pluck('login').frequencies().topk(10, lambda kv: kv[1]).compute())
CPU times: user 8.03 ms, sys: 90 µs, total: 8.12 ms
Wall time: 226 ms
[('rgbkrk', 156),
 ('minrk', 87),
 ('Carreau', 87),
 ('KesterTong', 74),
 ('jhamrick', 70),
 ('bollwyvl', 25),
 ('pkt', 18),
 ('ssanderson', 13),
 ('smashwilson', 13),
 ('ellisonbg', 13)]

Nothing too surprising here if you know these folks.

Full Dataset

The full five months of data is too large to fit in memory, even for this cluster. When we represent semi-structured data like this with dynamic data structures like lists and dictionaries, there is quite a bit of memory bloat. Some careful attention to efficient semi-structured storage here could save us from having to switch to such a large cluster, but that will have to be the topic of another post.
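
As a rough local illustration of that bloat, here is a small sketch comparing a single JSON line to the Python objects it parses into; the exact numbers vary by Python version and record contents:

import json
import sys

line = '{"type": "PushEvent", "public": true, "id": "2489368070"}'
record = json.loads(line)

print(len(line))              # bytes of raw JSON text
print(sys.getsizeof(record))  # bytes of the outer dict alone, already larger than the
                              # text, and this ignores the nested key/value objects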

Instead, we operate efficiently on this dataset by flowing it through memory, persisting only the records we care about. The distributed dask scheduler descends from the single-machine dask scheduler, which was quite good at flowing through a computation and intelligently removing intermediate results.

From a user API perspective, we call persist only on the jupyter dataset, and not the full records dataset.

>>> full = (s3.read_text('githubarchive-data', '2015', compression='gzip')
...           .map(json.loads))

>>> jupyter = (full.filter(lambda d: d['repo']['name'].startswith('jupyter/'))
...                .repartition(10))
>>> jupyter = e.persist(jupyter)

It takes 2m36s on nine m3.2xlarge nodes to download, decompress, and parse the five months of publicly available GitHub events, keeping only the Jupyter events.

There were seven thousand such events.

>>> jupyter.count().compute()
7065

We find which repositories saw the most activity during that time:

>>> %time (jupyter.pluck('repo').pluck('name').frequencies().topk(20, lambda kv: kv[1]).compute())
CPU times: user 6.98 ms, sys: 474 µs, total: 7.46 ms
Wall time: 219 ms
[('jupyter/jupyterhub', 1262),
 ('jupyter/nbgrader', 1235),
 ('jupyter/nbviewer', 846),
 ('jupyter/jupyter_notebook', 507),
 ('jupyter/jupyter-drive', 505),
 ('jupyter/notebook', 451),
 ('jupyter/docker-demo-images', 363),
 ('jupyter/tmpnb', 284),
 ('jupyter/jupyter_client', 162),
 ('jupyter/dockerspawner', 149),
 ('jupyter/colaboratory', 134),
 ('jupyter/jupyter_core', 127),
 ('jupyter/strata-sv-2015-tutorial', 108),
 ('jupyter/jupyter_nbconvert', 103),
 ('jupyter/configurable-http-proxy', 89),
 ('jupyter/hubpress.io', 85),
 ('jupyter/jupyter.github.io', 84),
 ('jupyter/tmpnb-deploy', 76),
 ('jupyter/nbconvert', 66),
 ('jupyter/jupyter_qtconsole', 59)]

We see that projects like jupyterhub were quite active during that time while, surprisingly, nbconvert saw relatively little action.

Local Data

The Jupyter data is quite small and easily fits in a single machine. Let’s bring the data to our local machine so that we can compare times:

>>> %time L = jupyter.compute()
CPU times: user 4.74 s, sys: 10.9 s, total: 15.7 s
Wall time: 30.2 s

It takes surprisingly long to download the data, but once it’s here, we can iterate far more quickly with basic Python.

>>> from toolz.curried import pluck, frequencies, topk, pipe

>>> %time pipe(L, pluck('repo'), pluck('name'), frequencies, dict.items, topk(20, key=lambda kv: kv[1]), list)
CPU times: user 11.8 ms, sys: 0 ns, total: 11.8 ms
Wall time: 11.5 ms
[('jupyter/jupyterhub', 1262),
 ('jupyter/nbgrader', 1235),
 ('jupyter/nbviewer', 846),
 ('jupyter/jupyter_notebook', 507),
 ('jupyter/jupyter-drive', 505),
 ('jupyter/notebook', 451),
 ('jupyter/docker-demo-images', 363),
 ('jupyter/tmpnb', 284),
 ('jupyter/jupyter_client', 162),
 ('jupyter/dockerspawner', 149),
 ('jupyter/colaboratory', 134),
 ('jupyter/jupyter_core', 127),
 ('jupyter/strata-sv-2015-tutorial', 108),
 ('jupyter/jupyter_nbconvert', 103),
 ('jupyter/configurable-http-proxy', 89),
 ('jupyter/hubpress.io', 85),
 ('jupyter/jupyter.github.io', 84),
 ('jupyter/tmpnb-deploy', 76),
 ('jupyter/nbconvert', 66),
 ('jupyter/jupyter_qtconsole', 59)]

The difference here is 20x, which is a good reminder that, once you no longer have a large problem, you should probably eschew distributed systems and act locally.

Conclusion

Downloading, decompressing, parsing, filtering, and counting JSON records is the new wordcount. It’s the first problem anyone sees. Fortunately it’s both easy to solve and the common case. Woo hoo!

Here we saw that dask+distributed handle the common case decently well and with a Pure Python stack. Typically Python users rely on a JVM technology like Hadoop/Spark/Storm to distribute their computations. Here we have Python distributing Python; there are some usability gains to be had here like nice stack traces, a bit less serialization overhead, and attention to other Pythonic style choices.

Over the next few posts I intend to deviate from this common case. Most “Big Data” technologies were designed to solve typical data munging problems found in web companies or with simple database operations in mind. Python users care about these things too, but they also reach out to a wide variety of fields. In dask+distributed development we care about the common case, but also support less traditional workflows that are commonly found in the life, physical, and algorithmic sciences.

By designing to support these more extreme cases we’ve nailed some common pain points in current distributed systems. Today we’ve seen low latency and remote control; in the future we’ll see others.

What doesn’t work

I’ll have an honest section like this at the end of each upcoming post describing what doesn’t work, what still feels broken, or what I would have done differently with more time.

  • The imports for dask and distributed are still strange. They’re two separate codebases that play very nicely together. Unfortunately the functionality you need is sometimes in one or in the other and it’s not immediately clear to the novice user where to go. For example dask.bag, the collection we’re using for records, jupyter, etc. is in dask but the s3 module is within the distributed library. We’ll have to merge things at some point in the near-to-moderate future. Ditto for the API: there are compute methods both on the dask collections (records.compute()) and on the distributed executor (e.compute(records)) that behave slightly differently.

  • We lack an efficient distributed shuffle algorithm. This is very important if you want to use operations like .groupby (which you should avoid anyway). The user API doesn’t yet cleanly warn users that this is missing in the distributed case, which is kind of a mess. (It works fine on a single machine.) Efficient alternatives like foldby are available; see the sketch after this list.

  • I would have liked to run this experiment directly on the cluster to see how low we could have gone below the 200ms barrier we ran into here.
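
For reference, here is a sketch of the foldby alternative mentioned above, counting events per type without a shuffle. It assumes the records bag from earlier, and the argument names follow the dask.bag/toolz foldby convention; the commented numbers are the ones computed earlier in this post.

counts = records.foldby(key=lambda d: d['type'],          # group key: the event type
                        binop=lambda total, _: total + 1,  # count within each partition
                        initial=0,
                        combine=lambda a, b: a + b,        # merge per-partition counts
                        combine_initial=0)
counts.compute()   # e.g. [('PushEvent', 7028566), ('WatchEvent', 1321546), ...]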

Links

  • dask, the original project
  • distributed, the distributed memory scheduler powering the cluster computing
  • dask.bag, the user API we’ve used in this post
  • This post largely repeats work by Blake Griffith in a similar post last year with an older iteration of the dask distributed scheduler

Pandas on HDFS with Dask Dataframes


This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project

In this post we use Pandas in parallel across an HDFS cluster to read CSV data. We coordinate these computations with dask.dataframe. A screencast version of this blogpost is available here and the previous post in this series is available here.

To start, we connect to our scheduler, import the hdfs module from the distributed library, and read our CSV data from HDFS.

>>> from distributed import Executor, hdfs, progress
>>> e = Executor('127.0.0.1:8786')
>>> e
<Executor: scheduler=127.0.0.1:8786 workers=64 threads=64>

>>> nyc2014 = hdfs.read_csv('/nyctaxi/2014/*.csv',
...                         parse_dates=['pickup_datetime', 'dropoff_datetime'],
...                         skipinitialspace=True)

>>> nyc2015 = hdfs.read_csv('/nyctaxi/2015/*.csv',
...                         parse_dates=['tpep_pickup_datetime', 'tpep_dropoff_datetime'])

>>> nyc2014, nyc2015 = e.persist([nyc2014, nyc2015])
>>> progress(nyc2014, nyc2015)

Our data comes from the New York City Taxi and Limousine Commission which publishes all yellow cab taxi rides in NYC for various years. This is a nice model dataset for computational tabular data because it’s large enough to be annoying while also deep enough to be broadly appealing. Each year is about 25GB on disk and about 60GB in memory as a Pandas DataFrame.

HDFS breaks up our CSV files into 128MB chunks on various hard drives spread throughout the cluster. The dask.distributed workers each read the chunks of bytes local to them and call the pandas.read_csv function on these bytes, producing 391 separate Pandas DataFrame objects spread throughout the memory of our eight worker nodes. The returned objects, nyc2014 and nyc2015, are dask.dataframe objects which present a subset of the Pandas API to the user, but farm out all of the work to the many Pandas dataframes they control across the network.
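
For intuition, each worker effectively does something like the following on its local block of bytes. This is only a sketch with a tiny fabricated block, not dask's actual internal code:

from io import BytesIO
import pandas as pd

# stand-in for one locally stored chunk of CSV bytes (in reality ~128MB from HDFS)
block_of_bytes = (b"VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,fare_amount\n"
                  b"2,2015-01-15 19:05:39,2015-01-15 19:23:42,12.0\n")

df_part = pd.read_csv(BytesIO(block_of_bytes),
                      parse_dates=['tpep_pickup_datetime', 'tpep_dropoff_datetime'])
print(df_part.dtypes)   # the datetime columns come back as datetime64[ns]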

Play with Distributed Data

If we wait for the data to load fully into memory then we can perform pandas-style analysis at interactive speeds.

>>> nyc2015.head()

   VendorID tpep_pickup_datetime tpep_dropoff_datetime  passenger_count
0         2  2015-01-15 19:05:39   2015-01-15 19:23:42                1
1         1  2015-01-10 20:33:38   2015-01-10 20:53:28                1
2         1  2015-01-10 20:33:38   2015-01-10 20:43:41                1
3         1  2015-01-10 20:33:39   2015-01-10 20:35:31                1
4         1  2015-01-10 20:33:39   2015-01-10 20:52:58                1

   trip_distance  pickup_longitude  pickup_latitude  RateCodeID store_and_fwd_flag
0           1.59        -73.993896        40.750111           1                  N
1           3.30        -74.001648        40.724243           1                  N
2           1.80        -73.963341        40.802788           1                  N
3           0.50        -74.009087        40.713818           1                  N
4           3.00        -73.971176        40.762428           1                  N

   dropoff_longitude  dropoff_latitude  payment_type  fare_amount  extra  mta_tax
0         -73.974785         40.750618             1         12.0    1.0      0.5
1         -73.994415         40.759109             1         14.5    0.5      0.5
2         -73.951820         40.824413             2          9.5    0.5      0.5
3         -74.004326         40.719986             2          3.5    0.5      0.5
4         -74.004181         40.742653             2         15.0    0.5      0.5

   tip_amount  tolls_amount  improvement_surcharge  total_amount
0        3.25             0                    0.3         17.05
1        2.00             0                    0.3         17.80
2        0.00             0                    0.3         10.80
3        0.00             0                    0.3          4.80
4        0.00             0                    0.3         16.30

>>> len(nyc2014)
165114373
>>> len(nyc2015)
146112989

Interestingly it appears that the NYC cab industry has contracted a bit in the last year. There are fewer cab rides in 2015 than in 2014.

When we ask for something like the length of the full dask.dataframe we actually ask for the length of all of the hundreds of Pandas dataframes and then sum them up. This process of reaching out to all of the workers completes in around 200-300 ms, which is generally fast enough to feel snappy in an interactive session.
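
Conceptually it is similar to the following sketch (not the actual implementation): ask every Pandas partition for its length and then sum the small results.

partition_lengths = nyc2014.map_partitions(len)   # one tiny task per Pandas partition
total = partition_lengths.sum().compute()         # combine a few hundred integers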

The dask.dataframe API looks just like the Pandas API, except that we call .compute() when we want an actual result.

>>> nyc2014.passenger_count.sum().compute()
279997507.0

>>> nyc2015.passenger_count.sum().compute()
245566747

Dask.dataframes build a plan to get your result and the distributed scheduler coordinates that plan on all of the little Pandas dataframes on the workers that make up our dataset.
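
A small sketch of that plan-building in action: nothing runs until we call compute, and the optional visualize step (which requires graphviz) renders the plan for inspection.

expr = nyc2015.passenger_count.sum()   # lazy expression; only builds a task graph
# expr.visualize()                     # optionally render the graph of tasks
result = expr.compute()                # the distributed scheduler runs the plan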

Pandas for Metadata

Let’s appreciate for a moment all the work we didn’t have to do around CSV handling because Pandas magically handled it for us.

>>> nyc2015.dtypes
VendorID                          int64
tpep_pickup_datetime     datetime64[ns]
tpep_dropoff_datetime    datetime64[ns]
passenger_count                   int64
trip_distance                   float64
pickup_longitude                float64
pickup_latitude                 float64
RateCodeID                        int64
store_and_fwd_flag               object
dropoff_longitude               float64
dropoff_latitude                float64
payment_type                      int64
fare_amount                     float64
extra                           float64
mta_tax                         float64
tip_amount                      float64
tolls_amount                    float64
improvement_surcharge           float64
total_amount\r                  float64
dtype: object

We didn’t have to find columns or specify data-types. We didn’t have to parse each value with an int or float function as appropriate. We didn’t have to parse the datetimes, but instead just specified a parse_dates= keyword. The CSV parsing happened about as quickly as can be expected for this format, clocking in at a network total of a bit under 1 GB/s.

Pandas is well loved because it removes all of these little hurdles from the life of the analyst. If we tried to reinvent a new “Big-Data-Frame” we would have to reimplement all of the work already well done inside of Pandas. Instead, dask.dataframe just coordinates and reuses the code within the Pandas library. It is successful largely due to work from core Pandas developers, notably Masaaki Horikoshi (@sinhrks), who have done tremendous work to align the API precisely with the Pandas core library.

Analyze Tips and Payment Types

In an effort to demonstrate the abilities of dask.dataframe we ask a simple question of our data, “how do New Yorkers tip?”. The 2015 NYCTaxi data is quite good about breaking down the total cost of each ride into the fare amount, tip amount, and various taxes and fees. In particular this lets us measure the percentage that each rider decided to pay in tip.

>>> nyc2015[['fare_amount', 'tip_amount', 'payment_type']].head()

   fare_amount  tip_amount  payment_type
0         12.0        3.25             1
1         14.5        2.00             1
2          9.5        0.00             2
3          3.5        0.00             2
4         15.0        0.00             2

In the first two lines we see evidence supporting the 15-20% tip standard common in the US. The following three lines interestingly show zero tip. Judging only by these first five lines (a very small sample) we see a strong correlation here with the payment type. We analyze this a bit more by counting occurrences in the payment_type column both for the full dataset, and filtered by zero tip:

>>> %time nyc2015.payment_type.value_counts().compute()
CPU times: user 132 ms, sys: 0 ns, total: 132 ms
Wall time: 558 ms
1    91574644
2    53864648
3      503070
4      170599
5          28
Name: payment_type, dtype: int64

>>> %time nyc2015[nyc2015.tip_amount == 0].payment_type.value_counts().compute()
CPU times: user 212 ms, sys: 4 ms, total: 216 ms
Wall time: 1.69 s
2    53862557
1     3365668
3      502025
4      170234
5          26
Name: payment_type, dtype: int64

We find that almost all zero-tip rides correspond to payment type 2, and that almost all payment type 2 rides don’t tip. My un-scientific hypothesis here is that payment type 2 corresponds to cash fares and that we’re observing a tendency of drivers not to record cash tips. However we would need more domain knowledge about our data to actually make this claim with any degree of authority.

Analyze Tip Fractions

Let’s make a new column, tip_fraction, and then look at the average of this column grouped by day of week and by hour of day.

First, we need to filter out bad rows: both rows with this odd payment type and rows with zero fare (there are a surprising number of free cab rides in NYC). Second, we create a new column equal to the ratio of tip_amount / fare_amount.

>>> df = nyc2015[(nyc2015.fare_amount > 0) & (nyc2015.payment_type != 2)]
>>> df = df.assign(tip_fraction=(df.tip_amount / df.fare_amount))

Next we group by the pickup datetime column in order to see how the average tip fraction changes by day of week and by hour. The groupby and datetime handling of Pandas makes these operations trivial.

>>> dayofweek = df.groupby(df.tpep_pickup_datetime.dt.dayofweek).tip_fraction.mean()
>>> hour = df.groupby(df.tpep_pickup_datetime.dt.hour).tip_fraction.mean()

>>> dayofweek, hour = e.persist([dayofweek, hour])
>>> progress(dayofweek, hour)

Grouping by day-of-week doesn’t show anything too striking to my eye. However, I would like to note how generous NYC cab riders seem to be; a 23-25% tip can be quite nice:

>>> dayofweek.compute()
tpep_pickup_datetime
0    0.237510
1    0.236494
2    0.236073
3    0.246007
4    0.242081
5    0.232415
6    0.259974
Name: tip_fraction, dtype: float64

But grouping by hour shows that late night and early morning riders are more likely to tip extravagantly:

>>> hour.compute()
tpep_pickup_datetime
0     0.263602
1     0.278828
2     0.293536
3     0.276784
4     0.348649
5     0.248618
6     0.233257
7     0.216003
8     0.221508
9     0.217018
10    0.225618
11    0.231396
12    0.225186
13    0.235662
14    0.237636
15    0.228832
16    0.234086
17    0.240635
18    0.237488
19    0.272792
20    0.235866
21    0.242157
22    0.243244
23    0.244586
Name: tip_fraction, dtype: float64

We plot this with matplotlib and see a nice trough during business hours and a surge in the early morning, with an astonishing peak of 34% at 4am.
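
The figure itself isn't reproduced here, but a minimal sketch of how one might draw it from the hour series computed above looks like this (the original figure's styling may differ):

import matplotlib.pyplot as plt

tip_by_hour = hour.compute()        # small Pandas Series: hour -> mean tip fraction
tip_by_hour.plot(figsize=(10, 4))   # line plot of tip fraction by pickup hour
plt.xlabel('Hour of day')
plt.ylabel('Mean tip fraction')
plt.show()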

Performance

Let’s dive into a few operations that run at different time scales. This gives a good understanding of the strengths and limits of the scheduler.

>>> %time nyc2015.head()
CPU times: user 4 ms, sys: 0 ns, total: 4 ms
Wall time: 20.9 ms

This head computation is about as fast as a film projector. You could perform this roundtrip computation between every consecutive frame of a movie; to a human eye this appears fluid. In the last post we asked about how low we could bring latency. In that post we were running computations from my laptop in California and so were bound by transcontinental latencies of 200ms. This time, because we’re operating from the cluster, we can get down to 20ms. We’re only able to be this fast because we touch only a single data element, the first partition. Things change when we need to touch the entire dataset.

>>> %time len(nyc2015)
CPU times: user 48 ms, sys: 0 ns, total: 48 ms
Wall time: 271 ms

The length computation takes 200-300 ms. This computation takes longer because we touch every individual partition of the data, of which there are 178. The scheduler incurs about 1ms of overhead per task; add a bit of latency and you get the ~200ms total. This means that the scheduler will likely be the bottleneck whenever computations are very fast, such as is the case for computing len. Really, this is good news; it means that by improving the scheduler we can reduce these durations even further.
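
A back-of-envelope check of that claim, using only numbers quoted in this post (the ~1ms-per-task overhead, the 178 partitions, and the ~20ms round trip measured for head above); this is an estimate, not a measurement:

n_partitions = 178
overhead_per_task = 0.001      # ~1 ms of scheduler overhead per task
roundtrip = 0.02               # ~20 ms in-cluster round trip, as seen for head()
print(n_partitions * overhead_per_task + roundtrip)   # ~0.2 s, in line with the
                                                      # 200-300 ms measured above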

If you look at the groupby computations above you can add the numbers in the progress bars to show that we computed around 3000 tasks in around 7s. It looks like this computation is about half scheduler overhead and about half bound by actual computation.

Conclusion

We used dask+distributed on a cluster to read CSV data from HDFS into a dask dataframe. We then used dask.dataframe, which looks identical to the Pandas dataframe, to manipulate our distributed dataset intuitively and efficiently.

We looked a bit at the performance characteristics of simple computations.

What doesn’t work

As always I’ll have a section like this that honestly says what doesn’t work well and what I would have done with more time.

  • Dask dataframe implements a commonly used subset of Pandas functionality, not all of it. It’s surprisingly hard to communicate the exact bounds of this subset to users. Notably, in the distributed setting we don’t have a shuffle algorithm, so groupby(...).apply(...) and some joins are not yet possible.

  • If you want to use threads, you’ll need Pandas 0.18.0 which, at the time of this writing, was still in release candidate stage. This Pandas release fixes some important GIL related issues.

  • The 1ms overhead per task limit is significant. While we can still scale out to clusters far larger than what we have here, we probably won’t be able to strongly accelerate very quick operations until we reduce this number.

  • We use the hdfs3 library to read data from HDFS. This library seems to work great but is new and could use more active users to flush out bug reports.

Setup and Data

You can obtain public data from the New York City Taxi and Limousine Commission here. I downloaded this onto the head node and dumped it into HDFS with commands like the following:

$ wget https://storage.googleapis.com/tlc-trip-data/2015/yellow_tripdata_2015-{01..12}.csv
$ hdfs dfs -mkdir /nyctaxi
$ hdfs dfs -mkdir /nyctaxi/2015
$ hdfs dfs -put yellow*.csv /nyctaxi/2015/

The cluster was hosted on EC2 and was comprised of nine m3.2xlarges with 8 cores and 30GB of RAM each. Eight of these nodes were used as workers; they used processes for parallelism, not threads.

Distributed Dask Arrays


This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project

In this post we analyze weather data across a cluster using NumPy in parallel with dask.array. We focus on the following:

  1. How to set up the distributed scheduler with a job scheduler like Sun GridEngine.
  2. How to load NetCDF data from a network file system (NFS) into distributed RAM
  3. How to manipulate data with dask.arrays
  4. How to interact with distributed data using IPython widgets

This blogpost has an accompanying screencast which might be a bit more fun than this text version.

This is the third in a sequence of blogposts about dask.distributed:

  1. Dask Bags on GitHub Data
  2. Dask DataFrames on HDFS
  3. Dask Arrays on NetCDF data

Setup

We wanted to emulate the typical academic cluster setup, using a job scheduler like Sun GridEngine (similar to SLURM, Torque, PBS scripts, and other technologies), a shared network file system, and typical binary stored arrays in NetCDF files (similar to HDF5).

To this end we used Starcluster, a quick way to set up such a cluster on EC2 with SGE and NFS, and we downloaded data from the European Centre for Medium-Range Weather Forecasts (ECMWF).

To deploy dask’s distributed scheduler with SGE we made a scheduler on the master node:

sgeadmin@master:~$ dscheduler
distributed.scheduler - INFO - Start Scheduler at:  172.31.7.88:8786

And then used the qsub command to start four dask workers, pointing to the scheduler address:

sgeadmin@master:~$ qsub -b y -V dworker 172.31.7.88:8786
Your job 1 ("dworker") has been submitted
sgeadmin@master:~$ qsub -b y -V dworker 172.31.7.88:8786
Your job 2 ("dworker") has been submitted
sgeadmin@master:~$ qsub -b y -V dworker 172.31.7.88:8786
Your job 3 ("dworker") has been submitted
sgeadmin@master:~$ qsub -b y -V dworker 172.31.7.88:8786
Your job 4 ("dworker") has been submitted

After a few seconds these workers start on various nodes in the cluster and connect to the scheduler.

Load sample data on a single machine

On the shared NFS drive we’ve downloaded several NetCDF3 files, each holding the global temperature every six hours for a single day:

>>> from glob import glob
>>> filenames = sorted(glob('*.nc3'))
>>> filenames[:5]
['2014-01-01.nc3',
 '2014-01-02.nc3',
 '2014-01-03.nc3',
 '2014-01-04.nc3',
 '2014-01-05.nc3']

We use conda to install the netCDF4 library and make a small function to read the t2m variable for “temperature at two meters elevation” from a single filename:

$ conda install netcdf4
import netCDF4

def load_temperature(fn):
    with netCDF4.Dataset(fn) as f:
        return f.variables['t2m'][:]

This converts a single file into a single numpy array in memory. We could call this on an individual file locally as follows:

>>> load_temperature(filenames[0])
array([[[253.96238624, 253.96238624, 253.96238624, ..., 253.96238624,
         253.96238624, 253.96238624],
        [252.80590921, 252.81070124, 252.81389593, ..., 252.79792249,
         252.80111718, 252.80271452],
        ...

>>> load_temperature(filenames[0]).shape
(4, 721, 1440)

Our dataset has dimensions of (time, latitude, longitude). Note above that each day has four time entries (measurements every six hours).

The NFS set up by Starcluster is unfortunately quite small. We were only able to fit around five months of data (136 days) on the shared disk.

Load data across cluster

We want to call the load_temperature function on our list filenames on each of our four workers. We connect a dask Executor to our scheduler address and then map our function on our filenames:

>>> from distributed import Executor, progress
>>> e = Executor('172.31.7.88:8786')
>>> e
<Executor: scheduler=172.31.7.88:8786 workers=4 threads=32>

>>> futures = e.map(load_temperature, filenames)
>>> progress(futures)

After this completes we have several numpy arrays scattered about the memory of each of our four workers.

Coordinate with dask.array

We coordinate these many numpy arrays into a single logical dask array as follows:

>>> from distributed.collections import futures_to_dask_arrays
>>> xs = futures_to_dask_arrays(futures)   # many small dask arrays

>>> import dask.array as da
>>> x = da.concatenate(xs, axis=0)         # one large dask array, joined by time
>>> x
dask.array<concate..., shape=(544, 721, 1440), dtype=float64, chunksize=(4, 721, 1440)>

This single logical dask array is comprised of 136 numpy arrays spread across our cluster. Operations on the single dask array will trigger many operations on each of our numpy arrays.
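
For example (a sketch using the x array built above), a reduction over the time axis turns into one small NumPy reduction per chunk plus a cheap combine step:

mean_over_time = x.mean(axis=0)      # lazy dask array of shape (721, 1440)
result = mean_over_time.compute()    # runs ~136 per-chunk means on the workers, then combines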

Interact with Distributed Data

We can now interact with our dataset using standard NumPy syntax and other PyData libraries. Below we pull out a single time slice and render it to the screen with matplotlib.

from matplotlib import pyplot as plt

plt.imshow(x[100, :, :].compute(), cmap='viridis')
plt.colorbar()

In the screencast version of this post we hook this up to an IPython slider widget and scroll around time, which is fun.

Speed

We benchmark a few representative operations to look at the strengths and weaknesses of the distributed system.

Single element

This single element computation accesses a single number from a single NumPy array of our dataset. It is bound by a network roundtrip from client to scheduler, to worker, and back.

>>> %time x[0, 0, 0].compute()
CPU times: user 4 ms, sys: 0 ns, total: 4 ms
Wall time: 9.72 ms

Single time slice

This time slice computation pulls around 8 MB from a single NumPy array on a single worker. It is likely bound by network bandwidth.

>>> %time x[0].compute()
CPU times: user 24 ms, sys: 24 ms, total: 48 ms
Wall time: 274 ms

Mean computation

This mean computation touches every number in every NumPy array across all of our workers. Computing means is quite fast, so this is likely bound by scheduler overhead.

>>> %time x.mean().compute()
CPU times: user 88 ms, sys: 0 ns, total: 88 ms
Wall time: 422 ms

Interactive Widgets

To make these times feel more visceral we hook up these computations to IPython Widgets.

This first example looks fairly fluid. This only touches a single worker and returns a small result. It is cheap because it indexes in a way that is well aligned with how our NumPy arrays are split up by time.

@interact(time=[0, x.shape[0] - 1])
def f(time):
    return x[time, :, :].mean().compute()

This second example is less fluid because we index across our NumPy chunks. Each computation touches all of our data. It’s still not bad though and quite acceptable by today’s standards of interactive distributed data science.

@interact(lat=[0, x.shape[1] - 1])
def f(lat):
    return x[:, lat, :].mean().compute()

Normalize Data

Until now we’ve only performed simple calculations on our data, usually grabbing out means. The image of the temperature above looks unsurprising. The image is dominated by the facts that land is warmer than oceans and that the equator is warmer than the poles. No surprises there.

To make things more interesting we subtract off the mean and divide by the standard deviation over time. This will tell us how unexpectedly hot or cold a particular point was, relative to all measurements of that point over time. This gives us something like a geo-located Z-Score.

z = (x - x.mean(axis=0)) / x.std(axis=0)
z = e.persist(z)
progress(z)

plt.imshow(z[slice].compute(), cmap='RdBu_r')
plt.colorbar()

We can now see much more fine structure of the currents of the day. In the screencast version we hook this dataset up to a slider as well and inspect various times.

I’ve avoided displaying GIFs of full images changing in this post to keep the size down; however, we can easily render a plot of average temperature by latitude changing over time here:

import numpy as np

xrange = 90 - np.arange(z.shape[1]) / 4

@interact(time=[0, z.shape[0] - 1])
def f(time):
    plt.figure(figsize=(10, 4))
    plt.plot(xrange, z[time].mean(axis=1).compute())
    plt.ylabel("Normalized Temperature")
    plt.xlabel("Latitude (degrees)")

Conclusion

We showed how to use distributed dask.arrays on a typical academic cluster. I’ve had several conversations with different groups about this topic; it seems to be a common case. I hope that the instructions at the beginning of this post prove to be helpful to others.

It is really satisfying to me to couple interactive widgets with data on a cluster in an intuitive way. This sort of fluid interaction on larger datasets is a core problem in modern data science.

What didn’t work

As always I’ll include a section like this on what didn’t work well or what I would have done with more time:

  • No high-level read_netcdf function: We had to use the mid-level API of executor.map to construct our dask array. This is a bit of a pain for novice users. We should probably adapt existing high-level functions in dask.array to robustly handle the distributed data case.
  • Need a larger problem: Our dataset could have fit into a Macbook Pro. A larger dataset that could not have been efficiently investigated from a single machine would have really cemented the need for this technology.
  • Easier deployment: The solution above with qsub was straightforward but not always accessible to novice users. Additionally while SGE is common there are several other systems that are just as common. We need to think of nice ways to automate this for the user.
  • XArray integration: Many people use dask.array on single machines through XArray, an excellent library for the analysis of labeled nd-arrays especially common in climate science. It would be good to integrate this new distributed work into the XArray project. I suspect that doing this mostly involves handling the data ingest problem described above.
  • Reduction speed: The computation of normalized temperature, z, took a surprisingly long time. I’d like to look into what is holding up that computation.

Dask EC2 Startup Script


This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project

A screencast version of this post is available here: https://youtu.be/KGlhU9kSfVk

Summary

Copy-pasting the following commands gives you a Dask cluster on EC2.

pip install dec2
dec2 up --keyname YOUR-AWS-KEY-NAME
        --keypair ~/.ssh/YOUR-AWS-KEY-FILE.pem
        --count 9   # Provision nine nodes
        --nprocs 8  # Use eight separate worker processes per node

dec2 ssh            # SSH into head node
ipython             # Start IPython console on head node
from distributed import Executor, s3, progress

e = Executor('127.0.0.1:8786')

df = s3.read_csv('dask-data/nyc-taxi/2015', lazy=False)
progress(df)
df.head()

You will have to use your own AWS credentials, but you’ll get fast distributed Pandas access on the NYCTaxi data across a cluster, loaded from S3.

Motivation

Reducing barriers to entry enables curious play.

Curiosity drives us to play with new tools. We love the idea that previously difficult tasks will suddenly become easy, expanding our abilities and opening up a range of newly solvable problems.

However, as our problems grow more complex our tools grow more cumbersome and setup costs increase. This cost stops us from playing around, which is a shame, because playing is good both for the education of the user and for the development of the tool. Tool makers who want feedback are strongly incentivized to decrease setup costs, especially for the play case.

In February we introduced dask.distributed, a lightweight distributed computing framework for Python. We focused on processing data with high level abstractions like dataframes and arrays in the following blogposts:

  1. Analyze GitHub JSON record data in S3
  2. Use Dask DataFrames on CSV data in HDFS
  3. Process NetCDF data with Dask arrays on a traditional cluster

Today we present a simple setup script to launch dask.distributed on EC2, enabling any user with AWS credentials to repeat these experiments easily.

dec2

Devops tooling and EC2 to the rescue

DEC2 does the following:

  1. Provisions nodes on EC2 using your AWS credentials
  2. Installs Anaconda on those nodes
  3. Deploys a dask.distributed Scheduler on the head node and Workers on the rest of the nodes
  4. Helps you to SSH into the head node or connect from your local machine
$ pip install dec2
$ dec2 up --help
Usage: dec2 up [OPTIONS]

Options:
  --keyname TEXT                Keyname on EC2 console  [required]
  --keypair PATH                Path to the keypair that matches the keyname [required]
  --name TEXT                   Tag name on EC2
  --region-name TEXT            AWS region  [default: us-east-1]
  --ami TEXT                    EC2 AMI  [default: ami-d05e75b8]
  --username TEXT               User to SSH to the AMI  [default: ubuntu]
  --type TEXT                   EC2 Instance Type  [default: m3.2xlarge]
  --count INTEGER               Number of nodes  [default: 4]
  --security-group TEXT         Security Group Name  [default: dec2-default]
  --volume-type TEXT            Root volume type  [default: gp2]
  --volume-size INTEGER         Root volume size (GB)  [default: 500]
  --file PATH                   File to save the metadata  [default: cluster.yaml]
  --provision / --no-provision  Provision salt on the nodes  [default: True]
  --dask / --no-dask            Install Dask.Distributed in the cluster [default: True]
  --nprocs INTEGER              Number of processes per worker  [default: 1]
  -h, --help                    Show this message and exit.

Note: dec2 was largely built by Daniel Rodriguez

Run

As an example we use dec2 to create a new cluster of nine nodes. Each worker will run with eight processes, rather than using threads.

dec2 up --keyname my-key-name
        --keypair ~/.ssh/my-key-file.pem
        --count 9       # Provision nine nodes
        --nprocs 8      # Use eight separate worker processes per node

Connect

We ssh into the head node and start playing in an IPython terminal:

localmachine:~$ dec2 ssh                          # SSH into head node
ec2-machine:~$ ipython                            # Start IPython console
In [1]: from distributed import Executor, s3, progress

In [2]: e = Executor('127.0.0.1:8786')

In [3]: e
Out[3]: <Executor: scheduler=127.0.0.1:8786 workers=64 threads=64>

Notebooks

Alternatively we set up a globally visible Jupyter notebook server:

localmachine:~$ dec2 dask-distributed address    # Get location of head node
Scheduler Address: XXX:XXX:XXX:XXX:8786

localmachine:~$ dec2 ssh                          # SSH into head node
ec2-machine:~$ jupyter notebook --ip="*"          # Start Jupyter Server

Then navigate to http://XXX:XXX:XXX:XXX:8888 from your local browser. Note, this method is not secure, see Jupyter docs for a better solution.

Public datasets

We repeat the experiments from our earlier blogpost on NYCTaxi data.

df = s3.read_csv('dask-data/nyc-taxi/2015', lazy=False)
progress(df)
df.head()

   VendorID tpep_pickup_datetime tpep_dropoff_datetime  passenger_count
0         2  2015-01-15 19:05:39   2015-01-15 19:23:42                1
1         1  2015-01-10 20:33:38   2015-01-10 20:53:28                1
2         1  2015-01-10 20:33:38   2015-01-10 20:43:41                1
3         1  2015-01-10 20:33:39   2015-01-10 20:35:31                1
4         1  2015-01-10 20:33:39   2015-01-10 20:52:58                1

   trip_distance  pickup_longitude  pickup_latitude  RateCodeID store_and_fwd_flag
0           1.59        -73.993896        40.750111           1                  N
1           3.30        -74.001648        40.724243           1                  N
2           1.80        -73.963341        40.802788           1                  N
3           0.50        -74.009087        40.713818           1                  N
4           3.00        -73.971176        40.762428           1                  N

   dropoff_longitude  dropoff_latitude  payment_type  fare_amount  extra  mta_tax
0         -73.974785         40.750618             1         12.0    1.0      0.5
1         -73.994415         40.759109             1         14.5    0.5      0.5
2         -73.951820         40.824413             2          9.5    0.5      0.5
3         -74.004326         40.719986             2          3.5    0.5      0.5
4         -74.004181         40.742653             2         15.0    0.5      0.5

   tip_amount  tolls_amount  improvement_surcharge  total_amount
0        3.25             0                    0.3         17.05
1        2.00             0                    0.3         17.80
2        0.00             0                    0.3         10.80
3        0.00             0                    0.3          4.80
4        0.00             0                    0.3         16.30

Acknowledgments

The dec2 startup script is largely the work of Daniel Rodriguez. Daniel usually works on Anaconda for cluster management, which is normally part of an Anaconda subscription but is also free for moderate use (4 nodes). That product does things similar to dec2, but much more maturely.

DEC2 was inspired by the excellent spark-ec2 setup script, which is how most Spark users, myself included, were first able to try out the library. The spark-ec2 script empowered many new users to try out a distributed system for the first time.

The S3 work was largely done by Hussain Sultan (Capital One) and Martin Durant (Continuum).

What didn’t work

  • Originally we automatically started an IPython console on the local machine and connected it to the remote scheduler. This felt slick, but was error prone due to mismatches between the user’s environment and the remote cluster’s environment.
  • It’s tricky to replicate functionality that’s part of a proprietary product (Anaconda for cluster management). Fortunately, Continuum management has been quite supportive.
  • There aren’t many people in the data science community who know Salt, the system that backs dec2. I expect maintenance to be a bit tricky moving forward, especially during periods when Daniel and other developers are occupied.