Dask Benchmarks

This work is supported by Continuum Analytics and the Data Driven Discovery Initiative from the Moore Foundation.

Summary

We measure the performance of Dask’s distributed scheduler for a variety of different workloads under increasing scales of both problem and cluster size. This helps to answer questions about dask’s scalability and also helps to educate readers on the sorts of computations that scale well.

We will vary our computations in a few ways to see how they stress performance. We consider the following:

  1. Computational and communication patterns like embarrassingly parallel, fully sequential, bulk communication, many-small communication, nearest neighbor, tree reductions, and dynamic graphs.
  2. Varying task duration ranging from very fast (microsecond) tasks, to 100ms and 1s long tasks. Faster tasks make it harder for the central scheduler to keep up with the workers.
  3. Varying cluster size from one two-core worker to 256 two-core workers and varying dataset size which we scale linearly with the number of workers. This means that we’re measuring weak scaling.
  4. Varying APIs between tasks, multidimensional arrays and dataframes all of which have cases in the above categories but depend on different in-memory computational systems like NumPy or Pandas.

We will start with benchmarks for straight tasks, which are the most flexible system and also the easiest to understand. This will help us to understand scaling limits on arrays and dataframes.

Note: we did not tune our benchmarks or configuration at all for these experiments. They are well below what is possible, but perhaps representative of what a beginning user might experience upon setting up a cluster without expertise or thinking about configuration.

A Note on Benchmarks and Bias

you can safely skip this section if you’re in a rush

This is a technical document, not a marketing piece. These benchmarks adhere to the principles laid out in this blogpost and attempt to avoid those pitfalls around developer bias. In particular the following are true:

  1. We decided on a set of benchmarks before we ran them on a cluster.
  2. We did not improve the software or tweak the benchmarks after seeing the results. They were run on the current public release of Dask, which was put out weeks ago, not on a development branch.
  3. The computations were constructed naively, as a novice would write them. They were not tweaked for extra performance.
  4. The cluster was configured naively, without attention to scale or special parameters.

We estimate that expert use would result in about a 5-10x scaling improvement over what we’ll see. We’ll detail how to improve scaling with expert methods at the bottom of the post.

All that being said, the author of this blogpost is paid to write this software and so you probably shouldn’t trust him. We invite readers to explore things independently. All configuration, notebooks, plotting code, and data are available below:


Tasks

We start by benchmarking the task scheduling API. Dask’s task scheduling APIs are at the heart of the other “big data” APIs (like dataframes). We start with tasks because they’re the simplest and most raw representation of Dask. Mostly we’ll run the following functions on integers, but you could fill in any function here, like a pandas dataframe method or sklearn routine.

import time

def inc(x):
    return x + 1

def add(x, y):
    return x + y

def slowinc(x, delay=0.1):
    time.sleep(delay)
    return x + 1

def slowadd(x, y, delay=0.1):
    time.sleep(delay)
    return x + y

def slowsum(L, delay=0.1):
    time.sleep(delay)
    return sum(L)

Embarrassingly Parallel Tasks

We run the following code on our cluster and measure how long each takes to complete:

futures = client.map(slowinc, range(4 * n), delay=1)  # 1s delay
wait(futures)

futures = client.map(slowinc, range(100 * n_cores))  # 100ms delay
wait(futures)

futures = client.map(inc, range(n_cores * 200))  # fast
wait(futures)

We see that for fast tasks the system can process around 2000-3000 tasks per second. This is mostly bound by scheduler and client overhead. Adding more workers into the system doesn’t give us any more tasks per second. However if our tasks take any amount of time (like 100ms or 1s) then we see decent speedups.

If you switch to linear scales on the plots, you’ll see that as we get out to 512 cores we start to slow down by about a factor of two. I’m surprised to see this behavior (hooray benchmarks) because all of Dask’s scheduling decisions are independent of cluster size. My first guess is that the scheduler is getting swamped with administrative messages, but we’ll have to dig in a bit deeper here.

Tree Reduction

Not all computations are embarrassingly parallel. Many computations have dependencies between them. Consider a tree reduction, where we combine neighboring elements until there is only one left. This stresses task dependencies and small data movement.

from dask import delayed

L = range(2**7 * n)
while len(L) > 1:  # while there is more than one element left
    # add neighbors together
    L = [delayed(slowadd)(a, b) for a, b in zip(L[::2], L[1::2])]

L[0].compute()

We see similar scaling to the embarrassingly parallel case. Things proceed linearly until they get to around 3000 tasks per second, at which point they fall behind linear scaling. Dask doesn’t seem to mind dependencies, even custom situations like this one.

Nearest Neighbor

Nearest neighbor computations are common in data analysis when you need to share a bit of data between neighboring elements, as frequently occurs in timeseries computations on dataframes, overlapping image processing on arrays, or PDE computations.

L = range(20 * n)
L = client.map(slowadd, L[:-1], L[1:])
L = client.map(slowadd, L[:-1], L[1:])
wait(L)

Scaling is similar to the tree reduction case. Interesting dependency structures don’t incur significant overhead or scaling costs.

Sequential

We consider a computation that isn’t parallel at all, but is instead highly sequential. Increasing the number of workers shouldn’t help here (there is only one thing to do at a time) but this does demonstrate the extra stresses that arise from a large number of workers. Note that we have turned off task fusion for this, so here we’re measuring how many roundtrips can occur between the scheduler and worker every second.

x = 1
for i in range(100):
    x = delayed(inc)(x)

x.compute()

So we get something like 100 roundtrips per second, or around 10ms roundtrip latencies. It turns out that a decent chunk of this cost was due to an optimization; workers prefer to batch small messages for higher throughput. In this case that optimization hurts us. Still though, we’re about 2-4x faster than video frame-rate here (video runs at around 24Hz or 40ms between frames).

Client in the loop

Finally we consider a reduction that consumes whichever futures finish first and adds them together. This is an example of using client-side logic within the computation, which is often helpful in complex algorithms. This also scales a little bit better because there are fewer dependencies to track within the scheduler. The client takes on a bit of the load.

from dask.distributed import as_completed

futures = client.map(slowinc, range(n * 20))

pool = as_completed(futures)
batches = pool.batches()

while True:
    try:
        batch = next(batches)
        if len(batch) == 1:
            batch += next(batches)
    except StopIteration:
        break
    future = client.submit(slowsum, batch)
    pool.add(future)

Tasks: Complete

We show most of the plots from above for comparison.

Arrays

When we combine NumPy arrays with the task scheduling system above we get dask.array, a distributed multi-dimensional array. This section shows computations like the last section (maps, reductions, nearest-neighbor), but now these computations are motivated by actual data-oriented computations and involve real data movement.

Create Dataset

We make a square array with somewhat random data. This array scales with the number of cores. We cut it into uniform chunks of size 2000 by 2000.

N = int(5000 * math.sqrt(n_cores))
x = da.random.randint(0, 10000, size=(N, N), chunks=(2000, 2000))
x = x.persist()
wait(x)

Creating this array is embarrassingly parallel. There is an odd corner in the graph here that I’m not able to explain.

Elementwise Computation

We perform some numerical computation element-by-element on this array.

y = da.sin(x) ** 2 + da.cos(x) ** 2
y = y.persist()
wait(y)

This is also embarrassingly parallel. Each task here takes around 300ms (the time it takes to call this on a single 2000 by 2000 numpy array chunk).

Reductions

We reduce the array, here by computing its standard deviation. This is implemented as a tree reduction.

x.std().compute()

Random Access

We get a single element from the array. This shouldn’t get any faster with more workers, but it may get slower depending on how much base-line load a worker adds to the scheduler.

x[1234, 4567].compute()

We get around 400-800 bytes per second (each access returns a single eight-byte integer, so this is roughly 50-100 accesses per second), which translates to response times of 10-20ms, about twice the speed of video framerate. We see that performance does degrade once we have a hundred or so active connections.

Communication

We add the array to its transpose. This forces different chunks to move around the network so that they can add to each other. Roughly half of the array moves on the network.

y = x + x.T
y = y.persist()
wait(y)

The task structure of this computation is something like nearest-neighbors. It has a regular pattern with a small number of connections per task. It’s really more a test of the network hardware, which we see does not impose any additional scaling limitations (this looks like normal slightly-sub-linear scaling).

Rechunking

Sometimes communication is composed of many small transfers. For example if you have a time series of images so that each image is a chunk, you might want to rechunk the data so that all of the time values for each pixel are in a chunk instead. Doing this can be very challenging because every output chunk requires a little bit of data from every input chunk, resulting in potentially n-squared transfers.

y = x.rechunk((20000, 200)).persist()
wait(y)

y = y.rechunk((200, 20000)).persist()
wait(y)

This computation can be very hard. We see that dask does it more slowly than fast computations like reductions, but it still scales decently well up to hundreds of workers.

Nearest Neighbor

Dask.array includes the ability to overlap small bits of neighboring blocks to enable functions that require a bit of continuity like derivatives or spatial smoothing functions.

y = x.map_overlap(slowinc, depth=1, delay=0.1).persist()
wait(y)

Arrays: Complete

DataFrames

We can combine Pandas Dataframes with Dask to obtain Dask dataframes, distributed tables. This section will be much like the last section on arrays but will instead focus on pandas-style computations.

Create Dataset

We make an array of random integers with ten columns and two million rows per core, cut into chunks of one million rows. We turn this into a dataframe of integers:

x = da.random.randint(0, 10000, size=(N, 10), chunks=(1000000, 10))
df = dd.from_dask_array(x).persist()
wait(df)

Elementwise

We can perform 100ms tasks or try out a bunch of arithmetic.

y = df.map_partitions(slowinc, meta=df).persist()
wait(y)

y = (df[0] + 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 + 10).persist()
wait(y)

Random access

Similarly we can try random access with loc.

df.loc[123456].compute()

Reductions

We can try reductions along the full dataset or a single series:

df.std().compute()
df[0].std().compute()

Groupby aggregations like df.groupby(...).column.mean() operate very similarly to reductions, just with slightly more complexity.

df.groupby(0)[1].mean().compute()

Shuffles

However operations like df.groupby(...).apply(...) are much harder to accomplish because we actually need to construct the groups. This requires a full shuffle of all of the data, which can be quite expensive.

This is also the same operation that occurs when we sort or call set_index.

df.groupby(0).apply(len).compute()  # this would be faster as df.groupby(0).size()

y = df.set_index(1).persist()
wait(y)

This still performs decently and scales well out to a hundred or so workers.

Timeseries operations

Timeseries operations often require nearest neighbor computations. Here we look at rolling aggregations, but cumulative operations, resampling, and so on are all much the same.

y = df.rolling(5).mean().persist()
wait(y)

Dataframes: Complete

Analysis

Let’s start with a few main observations:

  1. The longer your individual tasks take, the better Dask (or any distributed system) will scale. As you increase the number of workers you should also endeavor to increase average task size, for example by increasing the in-memory size of your array chunks or dataframe partitions.
  2. The Dask scheduler + Client currently maxes out at around 3000 tasks per second. Another way to put this is that if our computations take 100ms then we can saturate about 300 cores (3000 tasks/second × 0.1 seconds/task ≈ 300 busy cores), which is more-or-less what we observe here.
  3. Adding dependencies is generally free in modest cases such as in a reduction or nearest-neighbor computation. It doesn’t matter what structure your dependencies take, as long as parallelism is still abundant.
  4. Adding more substantial dependencies, such as in array rechunking or dataframe shuffling, can be more costly, but dask collection algorithms (array, dataframe) are built to maintain scalability even at scale.
  5. The scheduler seems to slow down at 256 workers, even for long task lengths. This suggests that we may have an overhead issue that needs to be resolved.

Expert Approach

So given our experience here, let’s now tweak settings to make Dask run well. We want to avoid two things:

  1. Lots of independent worker processes
  2. Lots of small tasks

So let’s change some things:

  1. Bigger workers: Rather than have 256 two-core workers, let’s deploy 32 sixteen-core workers.
  2. Bigger chunks: Rather than have 2000 by 2000 numpy array chunks, let’s bump this up to 10,000 by 10,000.

    Rather than 1,000,000 row Pandas dataframe partitions let’s bump this up to 10,000,000.

    These sizes are still well within comfortable memory limits. Each is about a Gigabyte in our case.
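
As a rough sketch, those settings translate into something like the following (the worker launch command and chunk sizes here are illustrative, not a tuned prescription):

# Bigger workers: for example, when launching workers by hand,
#   dask-worker scheduler-address:8786 --nprocs 1 --nthreads 16
# gives 32 sixteen-core workers rather than 256 two-core workers.

# Bigger chunks: larger array chunks and dataframe partitions mean
# fewer, longer tasks for the scheduler to track.
x = da.random.randint(0, 10000, size=(N, N), chunks=(10000, 10000))

df = dd.from_dask_array(
    da.random.randint(0, 10000, size=(N, 10), chunks=(10000000, 10))
).persist()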

When we make these changes we find that all metrics improve at larger scales. Some notable improvements are included in a table below (sorry for not having pretty plots in this case).

Benchmark                                    Small    Big      Unit
Tasks: Embarrassingly parallel               3500     3800     tasks/s
Array: Elementwise sin(x)**2 + cos(x)**2     2400     6500     MB/s
DataFrames: Elementwise arithmetic           9600     66000    MB/s
Arrays: Rechunk                              4700     4800     MB/s
DataFrames: Set index                        1400     1000     MB/s

We see that for some operations we can get significant improvements (dask.dataframe is now churning through data at around 60 GB/s) and for other operations that are largely scheduler or network bound this doesn’t strongly improve the situation (and sometimes hurts).

Still though, even with naive settings we’re routinely pushing through 10s of gigabytes a second on a modest cluster. These speeds are available for a very wide range of computations.

Final thoughts

Hopefully these notes help people to understand Dask’s scalability. Like all tools it has limits, but even under normal settings Dask should scale well out to a hundred workers or so. Once you reach this limit you might want to start taking other factors into consideration, especially threads-per-worker and block size, both of which can help push well into the thousands-of-cores range.

The included notebooks are self contained, with code to both run and time the computations as well as produce the Bokeh figures. I would love to see other people reproduce these benchmarks (or others!) on different hardware or with different settings.

Tooling

This blogpost made use of the following tools:

  1. Dask-kubernetes: for deploying clusters of varying sizes on Google compute engine
  2. Bokeh: for plotting (gallery)
  3. gcsfs: for storage on Google cloud storage

Scikit-Image and Dask Performance

This weekend at the SciPy 2017 sprints I worked alongside Scikit-image developers to investigate parallelizing scikit-image with Dask.

Here is a notebook of our work.

Dask Release 0.15.2

This work is supported by Anaconda Inc. and the Data Driven Discovery Initiative from the Moore Foundation.

I’m pleased to announce the release of Dask version 0.15.2. This release contains stability enhancements and bug fixes. This blogpost outlines notable changes since the 0.15.0 release on June 11th.

You can conda install Dask:

conda install dask

or pip install from PyPI

pip install dask[complete] --upgrade

Conda packages are available both on the defaults and conda-forge channels.

Full changelogs are available here:

Some notable changes follow.

New dask-core and dask conda packages

On conda there are now three relevant Dask packages:

  1. dask-core: Package that includes only the core Dask package. This has no dependencies other than the standard library. This is primarily intended for down-stream libraries that depend on certain parts of Dask.
  2. distributed: Dask’s distributed scheduler, depends on Tornado, cloudpickle, and other libraries.
  3. dask: Metapackage that includes dask-core, distributed, and all relevant libraries like NumPy, Pandas, Bokeh, etc. This is what users are intended to install.

This organization is designed to both allow downstream libraries to only depend on the parts of Dask that they need while also making the default behavior for users all-inclusive.

Downstream libraries may want to change conda dependencies from dask to dask-core. They will then need to be careful to include the necessary libraries (like numpy or cloudpickle) based on their user community.

Improved Deployment

Due to increased deployment on Docker or other systems with complex networking rules, dask-worker processes now include separate --contact-address and --listen-address keywords that can be used to specify the addresses that they advertise and the addresses on which they listen. This is especially helpful when the perspective of one’s network can shift dramatically.

dask-worker scheduler-address:8786 \
            --contact-address 192.168.0.100:9000 \
            --listen-address 172.142.0.100:9000
# advertise 192.168.0.100:9000 to others, but listen on 172.142.0.100:9000

Additionally other services like the HTTP and Bokeh servers now respect the hosts provided by --listen-address or --host keywords and will not be visible outside of the specified network.

Avoid memory, file descriptor, and process leaks

There were a few occasions where Dask would leak resources in complex situations. Many of these have now been cleaned up. We’re grateful to all those who were able to provide very detailed case studies that demonstrated these issues and even more grateful to those who participated in resolving them.

There is undoubtedly more work to do here and we look forward to future collaboration.

Array and DataFrame APIs

As usual, Dask array and dataframe have a new set of functions that fill out their API relative to NumPy and Pandas.

See the full APIs for further reference:

Deprecations

Officially deprecated dask.distributed.Executor; users should use dask.distributed.Client instead. Previously Executor was an alias of Client.

Removed Bag.concat; users should use Bag.flatten instead.

Removed magic tuple unpacking in Bag.map, like bag.map(lambda x, y: x + y). Users should unpack manually instead, as shown below.
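
For example, a small sketch of what that migration looks like (the sequence of pairs here is made up for illustration):

import dask.bag as db

b = db.from_sequence([(1, 10), (2, 20), (3, 30)])

# Previously: b.map(lambda x, y: x + y) would unpack each tuple automatically.
# Now: accept the tuple as a single argument and unpack it yourself.
result = b.map(lambda xy: xy[0] + xy[1]).compute()  # [11, 22, 33]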

Julia

Developers from Invenia have been building Julia workers and clients that operate with the Dask.distributed scheduler. They have been helpful in raising issues necessary to ensure cross-language support.

Acknowledgements

The following people contributed to the dask/dask repository since the 0.15.0 release on June 11th:

  • Bogdan
  • Elliott Sales de Andrade
  • Bruce Merry
  • Erik Welch
  • Fabian Keller
  • James Bourbeau
  • Jeff Reback
  • Jim Crist
  • John A Kirkham
  • Luke Canavan
  • Mark Dunne
  • Martin Durant
  • Matthew Rocklin
  • Olivier Grisel
  • Søren Fuglede Jørgensen
  • Stephan Hoyer
  • Tom Augspurger
  • Yu Feng

The following people contributed to the dask/distributed repository since the 1.17.1 release on June 14th:

  • Antoine Pitrou
  • Dan Brown
  • Elliott Sales de Andrade
  • Eric Davies
  • Erik Welch
  • Evan Welch
  • John A Kirkham
  • Jim Crist
  • James Bourbeau
  • Jeremiah Lowin
  • Julius Neuffer
  • Martin Durant
  • Matthew Rocklin
  • Paul Anton Letnes
  • Peter Waller
  • Sohaib Iftikhar
  • Tom Augspurger

Additionally we’re happy to announce that John Kirkham (@jakirkham) has accepted commit rights to the Dask organization and become a core contributor. John has been active throughout the Dask project, and particularly active in Dask.array.

Dask Release 0.15.3

This work is supported by Anaconda Inc. and the Data Driven Discovery Initiative from the Moore Foundation.

I’m pleased to announce the release of Dask version 0.15.3. This release contains stability enhancements and bug fixes. This blogpost outlines notable changes since the 0.15.2 release on August 30th.

You can conda install Dask:

conda install -c conda-forge dask

or pip install from PyPI

pip install dask[complete] --upgrade

Conda packages are available on the conda-forge channel. They will be on defaults in a few days.

Full changelogs are available here:

Some notable changes follow.

Masked Arrays

Dask.array now supports masked arrays similar to NumPy.

In [1]: import dask.array as da

In [2]: x = da.arange(10, chunks=(5,))

In [3]: mask = x % 2 == 0

In [4]: m = da.ma.masked_array(x, mask)

In [5]: m
Out[5]: dask.array<masked_array, shape=(10,), dtype=int64, chunksize=(5,)>

In [6]: m.compute()
Out[6]:
masked_array(data=[-- 1 -- 3 -- 5 -- 7 -- 9],
             mask=[ True False  True False  True False  True False  True False],
             fill_value=999999)

This work was primarily done by Jim Crist and partially funded by the UK Met office in support of the Iris project.

Constants in atop

Dask.array experts will be familiar with the atop function, which powers a non-trivial amount of dask.array and is commonly used by people building custom algorithms. This function now supports constants when the index given is None.

atop(func, 'ijk', x, 'ik', y, 'kj', CONSTANT, None)
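
To make that slightly more concrete, here is a small hedged example of passing a scalar constant into atop; the arrays and function are made up for illustration, and atop expects a dtype keyword for the output:

import operator
import dask.array as da

x = da.ones((4, 4), chunks=(2, 2))

# The None index marks the 10 as a constant with no array dimensions,
# so it is passed unchanged to every block-level call of operator.add.
y = da.atop(operator.add, 'ij', x, 'ij', 10, None, dtype=x.dtype)

y.compute()  # every element equals 11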

Memory management for workers

Dask workers spill excess data to disk when they reach 60% of their allotted memory limit. Previously we only measured memory use by adding up the memory use of every piece of data produced by the worker. This could fail in a few situations:

  1. Our per-data estimates were faulty
  2. User code consumed a large amount of memory without our tracking it

To compensate we now also periodically check the memory use of the worker process using system utilities from the psutil module. We spill data to disk if process memory rises above 70% of the limit, stop running new tasks if it rises above 80%, and restart the worker if it rises above 95% (assuming that the worker has a nanny process).
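
Schematically, the new check looks something like the following; this is a simplified sketch of the logic described above, not the actual worker code:

import psutil

proc = psutil.Process()            # the worker process itself
memory_limit = 100e9               # bytes given to this worker

frac = proc.memory_info().rss / memory_limit

if frac > 0.95:
    pass   # restart the worker (requires a nanny process)
elif frac > 0.80:
    pass   # pause: stop running new tasks until memory drops
elif frac > 0.70:
    pass   # start spilling least-recently-used data to disk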

Breaking Change: Previously the --memory-limit keyword to the dask-worker process specified the 60% “start pushing to disk” limit. So if you had 100GB of RAM then you previously might have started a dask-worker as follows:

dask-worker ... --memory-limit 60e9  # previously: specify the 60% spill target

And the worker would start pushing to disk once it had 60GB of data in memory. However, now we are changing this meaning to be the full amount of memory given to the process.

dask-worker ... --memory-limit 100e9  # now: specify the full memory limit

Of course, you don’t have to specify this limit (many don’t). It will be chosen for you automatically. If you’ve never cared about this then you shouldn’t start caring now.

More about memory management here: http://distributed.readthedocs.io/en/latest/worker.html?highlight=memory-limit#memory-management

Statistical Profiling

Workers now poll their worker threads every 10ms and keep a running count of which functions are being used. This information is available on the diagnostic dashboard as a new “Profile” page. It provides information that is orthogonal to, and generally more detailed than, the typical task-stream plot.
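
The general idea behind this kind of statistical profiling is easy to sketch in pure Python; this is an illustration of the technique, not the code the workers actually run:

import sys
import threading
import time
from collections import Counter

counts = Counter()

def watch(interval=0.010):
    # Every 10ms, look at the frame each thread is currently executing
    # and keep a running count keyed by file and function name.
    while True:
        for frame in sys._current_frames().values():
            counts[(frame.f_code.co_filename, frame.f_code.co_name)] += 1
        time.sleep(interval)

threading.Thread(target=watch, daemon=True).start()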

These plots are available on each worker, and an aggregated view is available on the scheduler. The timeseries on the bottom allows you to select time windows of your computation to restrict the parallel profile.

More information about diagnosing performance available here: http://distributed.readthedocs.io/en/latest/diagnosing-performance.html

Acknowledgements

The following people contributed to the dask/dask repository since the 0.15.2 release on August 30th

  • Adonis
  • Christopher Prohm
  • Danilo Horta
  • jakirkham
  • Jim Crist
  • Jon Mease
  • jschendel
  • Keisuke Fujii
  • Martin Durant
  • Matthew Rocklin
  • Tom Augspurger
  • Will Warner

The following people contributed to the dask/distributed repository since the 1.18.3 release on September 2nd:

  • Casey Law
  • Edrian Irizarry
  • Matthew Rocklin
  • rbubley
  • Tom Augspurger
  • ywangd

Notes on Kafka in Python

Summary

I recently investigated the state of Python libraries for Kafka. This blogpost contains my findings.

Both PyKafka and confluent-kafka have mature implementations and are maintained by invested companies. Confluent-kafka is generally faster while PyKafka is arguably better designed and documented for Python usability.

Conda packages are now available for both. I hope to extend one or both to support asynchronous workloads with Tornado.

Disclaimer: I am not an expert in this space. I have no strong affiliation with any of these projects. This is a report based on my experience of the past few weeks. I don’t encourage anyone to draw conclusions from this work. I encourage people to investigate on their own.

Introduction

Apache Kafka is a common data system for streaming architectures. It manages rolling buffers of byte messages and provides a scalable mechanism to publish or subscribe to those buffers in real time. While Kafka was originally designed within the JVM space the fact that it only manages bytes makes it easy to access from native code systems like C/C++ and Python.

Python Options

Today there are three independent Kafka implementations in Python, two of which are optionally backed by a C implementation, librdkafka, for speed:

  • kafka-python: The first on the scene, a pure-Python Kafka client with robust documentation and an API that is fairly faithful to the original Java API. This implementation has the most stars on GitHub and the most active development team (by number of committers), but it also lacks a connection to the fast C library. I’ll admit that I didn’t spend enough time on this project to judge it well because of this.

  • PyKafka: The second implementation chronologically. This library is maintained by Parse.ly, a web analytics company that heavily uses both streaming systems and Python. PyKafka’s API is more creative and designed to follow common Python idioms rather than the Java API. PyKafka has both a pure Python implementation and connections to the low-level librdkafka C library for increased performance.

  • Confluent-kafka: The final implementation chronologically. It is maintained by Confluent, the primary for-profit company that supports and maintains Kafka. This library is the fastest, but also the least accessible from a Python perspective. This implementation is written as a CPython C extension and the documentation is minimal. However, if you are coming from the Java API then this is entirely consistent with that experience, so that documentation probably suffices.

Performance

Confluent-kafka message-consumption bandwidths are around 50% higher and message-production bandwidths are around 3x higher than PyKafka, both of which are significantly higher than kafka-python. I’m taking these numbers from this blogpost which gives benchmarks comparing the three libraries. The primary numeric results follow below:

Note: It’s worth noting that this blogpost was moving smallish 100 byte messages around. I would hope that Kafka would perform better (closer to network bandwidths) when messages are of a decent size.

Producer Throughput

                            time_in_seconds   MBs/s   Msgs/s
confluent_kafka_producer    5.4               17      183000
pykafka_producer_rdkafka    16                6.1     64000
pykafka_producer            57                1.7     17000
python_kafka_producer       68                1.4     15000

Consumer Throughput

                            time_in_seconds   MBs/s   Msgs/s
confluent_kafka_consumer    3.8               25      261000
pykafka_consumer_rdkafka    6.1               17      164000
pykafka_consumer            29                3.2     34000
python_kafka_consumer       26                3.6     38000

Note: I discovered this article in parsely/pykafka #559, which has a good conversation about the three libraries.

I profiled PyKafka in these cases and it doesn’t appear that these code paths have yet been optimized. I expect that modest effort could close that gap considerably. This difference seems to be more from lack of interest than any hard design constraint.

It’s not clear how critical these speeds are. According to the PyKafka maintainers at Parse.ly they haven’t actually turned on the librdkafka optimizations in their internal pipelines, and are instead using the slow Pure Python implementation, which is apparently more than fast enough for common use. Getting messages out of Kafka just isn’t their bottleneck. It may be that these 250,000 messages/sec limits are not significant in most applications. I suspect that this matters more in bulk analysis workloads than in online applications.

Pythonic vs Java APIs

It took me a few tries to get confluent-kafka to work. It wasn’t clear what information I needed to pass to the constructor to connect to Kafka, and when I gave the wrong information I received no message that I had done anything incorrectly. Docstrings and documentation were both minimal. In contrast, PyKafka’s API and error messages quickly led me to correct behavior and I was up and running within a minute.

However, I persisted with confluent-kafka, found the right Java documentation, and eventually did get things up and running. Once this happened everything fell into place and I was able to easily build applications with Confluent-kafka that were both simple and fast.
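
For context, here is roughly what a minimal produce/consume round trip looks like in each library; the broker address, topic name, and group id below are placeholders and error handling is kept to a minimum:

from confluent_kafka import Producer, Consumer

p = Producer({'bootstrap.servers': 'localhost:9092'})
p.produce('my-topic', b'hello world')
p.flush()

c = Consumer({'bootstrap.servers': 'localhost:9092', 'group.id': 'my-group'})
c.subscribe(['my-topic'])
msg = c.poll(1.0)              # returns None if nothing arrives in time
if msg is not None and not msg.error():
    print(msg.value())

from pykafka import KafkaClient

client = KafkaClient(hosts='localhost:9092')
topic = client.topics[b'my-topic']

with topic.get_sync_producer() as producer:
    producer.produce(b'hello world')

consumer = topic.get_simple_consumer()
for message in consumer:
    print(message.value)
    break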

Development experience

I would like to add asynchronous support to one or both of these libraries so that they can read or write data in a non-blocking fashion and play nicely with other asynchronous systems like Tornado or Asyncio. I started investigating this with both libraries on GitHub.

Developers

Both libraries have a maintainer who is somewhat responsive and whose time is funded by the parent company. Both maintainers seem active on a day-to-day basis and handle contributions from external developers.

Both libraries are fully active with a common pattern of a single main dev merging work from a number of less active developers. Distributions of commits over the last six months look similar:

confluent-kafka-python$ git shortlog -ns --since "six months ago"
38  Magnus Edenhill
5  Christos Trochalakis
4  Ewen Cheslack-Postava
1  Simon Wahlgren

pykafka$ git shortlog -ns --since "six months ago"
52  Emmett Butler
23  Emmett J. Butler
20  Marc-Antoine Parent
18  Tanay Soni
5  messense
1  Erik Stephens
1  Jeff Widman
1  Prateek Shrivastava
1  aleatha
1  zpcui

Codebase

In regards to the codebases I found that PyKafka was easier to hack on for a few reasons:

  1. Most of PyKafka is written in Python rather than C extensions, and so it is more accessible to a broader development base. I find that Python C extensions are not pleasant to work with, even if you are comfortable with C.
  2. PyKafka appears to be much more extensively tested. PyKafka actually spins up a local Kafka instance to do comprehensive integration tests while Confluent-kafka seems to only test the API without actually running against a real Kafka instance.
  3. For what it’s worth, PyKafka maintainers responded quickly to an issue on Tornado. Confluent-kafka maintainers still have not responded to a comment on an existing Tornado issue, even though that comment had significantly more content (a working prototype).

To be clear, no maintainer has any responsibility to answer my questions on github. They are likely busy with other things that are of more relevance to their particular mandate.

Conda packages

I’ve pushed/updated recipes for both packages on conda-forge. You can install them as follows:

conda install -c conda-forge pykafka                 # Linux, Mac, Windows
conda install -c conda-forge python-confluent-kafka  # Linux, Mac

In both cases these are built against the fast librdkafka C library (except on Windows) and install that library as well.

Future plans

I’ve recently started work on streaming systems and pipelines for Dask, so I’ll probably continue to investigate this space. I’m still torn between the two implementations. There are strong reasons to use either of them.

Culturally I am drawn to Parse.ly’s PyKafka library. They’re clearly Python developers writing for Python users. However the costs of using a non-Pythonic system here just aren’t that large (Kafka’s API is small), and Confluent’s interests are more aligned with investing in Kafka long term than are Parse.ly’s.

Streaming Dataframes

This work is supported by Anaconda Inc and the Data Driven Discovery Initiative from the Moore Foundation

This post is about experimental software. This is not ready for public use. All code examples and API in this post are subject to change without warning.

Summary

This post describes a prototype project to handle continuous data sources of tabular data using Pandas and Streamz.

Introduction

Some data never stops. It arrives continuously in a constant, never-ending stream. This happens in financial time series, web server logs, scientific instruments, IoT telemetry, and more. Algorithms to handle this data are slightly different from what you find in libraries like NumPy and Pandas, which assume that they know all of the data up-front. It’s still possible to use NumPy and Pandas, but you need to combine them with some cleverness and keep enough intermediate data around to compute marginal updates when new data comes in.

Example: Streaming Mean

For example, imagine that we have a continuous stream of CSV files arriving and we want to print out the mean of our data over time. Whenever a new CSV file arrives we need to recompute the mean of the entire dataset. If we’re clever we keep around enough state so that we can compute this mean without looking back over the rest of our historical data. We can accomplish this by keeping running totals and running counts as follows:

total = 0
count = 0

for filename in filenames:   # filenames is an infinite iterator
    df = pd.read_csv(filename)
    total = total + df.sum()
    count = count + df.count()
    mean = total / count
    print(mean)

Now as we add new files to our filenames iterator our code prints out new means that are updated over time. We don’t have a single mean result; we have a continuous stream of mean results that are each valid for the data up to that point. Our output data is an infinite stream, just like our input data.

When our computations are linear and straightforward like this a for loop suffices. However when our computations have several streams branching out or converging, possibly with rate limiting or buffering between them, this for-loop approach can grow complex and difficult to manage.

Streamz

A few months ago I pushed a small library called streamz, which handled control flow for pipelines, including linear map operations, operations that accumulated state, branching, joining, as well as back pressure, flow control, feedback, and so on. Streamz was designed to handle all of the movement of data and signaling of computation at the right time. This library was quietly used by a couple of groups and now feels fairly clean and useful.
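
As a small illustration of that control-flow layer (independent of dataframes), a running-total pipeline with streamz might look roughly like this; the functions and values are made up for illustration:

from streamz import Stream

source = Stream()

# Build the pipeline: square each element, keep a running total,
# and print every updated total as it flows through.
source.map(lambda x: x ** 2).accumulate(lambda acc, new: acc + new).sink(print)

for i in range(5):
    source.emit(i)   # pushing data in drives the whole pipeline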

Streamz was designed to handle the control flow of such a system, but did nothing to help you with streaming algorithms. Over the past week I’ve been building a dataframe module on top of streamz to help with common streaming tabular data situations. This module uses Pandas and implements a subset of the Pandas API, so hopefully it will be easy to use for programmers with existing Python knowledge.

Example: Streaming Mean

Our example above could be written as follows with streamz:

source = Stream.filenames('path/to/dir/*.csv')  # stream of filenames
sdf = (source.map(pd.read_csv)                  # stream of Pandas dataframes
             .to_dataframe(example=...))        # logical streaming dataframe

sdf.mean().stream.sink(print)                   # printed stream of mean values

This example is no more clear than the for-loop version. On its own this is probably a worse solution than what we had before, just because it involves new technology. However it starts to become useful in two situations:

  1. You want to do more complex streaming algorithms

    sdf = sdf[sdf.name == 'Alice']
    sdf.x.groupby(sdf.y).mean().sink(print)
    # or
    sdf.x.rolling('300ms').mean()

    It would require more cleverness to build these algorithms with a for loop as above.

  2. You want to do multiple operations, deal with flow control, etc..

    sdf.mean().sink(print)
    sdf.x.sum().rate_limit(0.500).sink(write_to_database)
    ...

    Branching off computations, routing data correctly, and handling time can all be challenging to accomplish consistently.

Jupyter Integration and Streaming Outputs

During development we’ve found it very useful to have live updating outputs in Jupyter.

Usually when we evaluate code in Jupyter we have static inputs and static outputs:

However now both our inputs and our outputs are live:

We accomplish this using a combination of ipywidgets and Bokeh plots both of which provide nice hooks to change previous Jupyter outputs and work well with the Tornado IOLoop (streamz, Bokeh, Jupyter, and Dask all use Tornado for concurrency). We’re able to build nicely responsive feedback whenever things change.

In the following example we build our CSV to dataframe pipeline that updates whenever new files appear in a directory. Whenever we drag files to the data directory on the left we see that all of our outputs update on the right.

What is supported?

This project is very young and could use some help. There are plenty of holes in the API. That being said, the following works well:

Elementwise operations:

sdf['z'] = sdf.x + sdf.y
sdf = sdf[sdf.z > 2]

Simple reductions:

sdf.sum()
sdf.x.mean()

Groupby reductions:

sdf.groupby(sdf.x).y.mean()

Rolling reductions by number of rows or time window:

sdf.rolling(20).x.mean()
sdf.rolling('100ms').x.quantile(0.9)

Real time plotting with Bokeh (one of my favorite features):

sdf.plot()

What’s missing?

  1. Parallel computing: The core streamz library has an optional Dask backend for parallel computing. I haven’t yet made any attempt to attach this to the dataframe implementation.
  2. Data ingestion from common streaming sources like Kafka. We’re in the process now of building asynchronous-aware wrappers around Kafka Python client libraries, so this is likely to come soon.
  3. Out-of-order data access: soon after parallel data ingestion (like reading from multiple Kafka partitions at once) we’ll need to figure out how to handle out-of-order data access. This is doable, but will take some effort. This is where more mature libraries like Flink are quite strong.
  4. Performance: Some of the operations above (particularly rolling operations) do involve non-trivial copying, especially with larger windows. We’re relying heavily on the Pandas library which wasn’t designed with rapidly changing data in mind. Hopefully future iterations of Pandas (Arrow/libpandas/Pandas 2.0?) will make this more efficient.
  5. Filled out API: Many common operations (like variance) haven’t yet been implemented. Some of this is due to laziness and some is due to wanting to find the right algorithm.
  6. Robust plotting: Currently this works well for numeric data with a timeseries index but not so well for other data.

But most importantly, this project needs to be used by people with real problems, to help us understand what here is valuable and what is unpleasant.

Help would be welcome with any of this.

You can install this from github

pip install git+https://github.com/mrocklin/streamz.git

Documentation and code are here:

Current work

Current and upcoming work is focused on data ingestion from Kafka and parallelizing with Dask.

Optimizing Data Structure Access in Python

This work is supported by Anaconda Inc and the Data Driven Discovery Initiative from the Moore Foundation

Last week at PyCon DE I had the good fortune to meet Stefan Behnel, one of the core developers of Cython. Together we worked to optimize a small benchmark that is representative of Dask’s central task scheduler, a pure-Python application that is primarily data structure bound.

Our benchmark is a toy problem that creates three data structures that index each other with dictionaries, lists, and sets, and then does some simple arithmetic. (You don’t need to understand this benchmark deeply to read this article.)

import random
import time

nA = 100
nB = 100
nC = 100

A = {'A-%d' % i: ['B-%d' % random.randint(0, nB - 1)
                  for i in range(random.randint(0, 5))]
     for i in range(nA)}

B = {'B-%d' % i: {'C-%d' % random.randint(0, nC - 1)
                  for i in range(random.randint(1, 3))}
     for i in range(nB)}

C = {'C-%d' % i: i for i in range(nC)}

data = ['A-%d' % i for i in range(nA)]

def f(A, B, C, data):
    for a_key in data:
        b_keys = A[a_key]
        for b_key in b_keys:
            for c_key in B[b_key]:
                C[c_key] += 1

start = time.time()
for i in range(10000):
    f(A, B, C, data)
end = time.time()

print("Duration: %0.3f seconds" % (end - start))
$ python benchmark.py
Duration: 1.12 seconds

This is an atypical problem for Python optimization because it is primarily bound by data structure access (dicts, lists, sets), rather than by the numerical operations commonly optimized by Cython (nested for loops over floating point arithmetic). Python is already decently fast here, typically within a factor of 2-5x of compiled languages like Java or C++, but still we’d like to improve this when possible.

In this post we combine two different methods to optimize data-structure bound workloads:

  1. Compiling Python code with Cython with no other annotations
  2. Interning strings for more rapid dict lookups

Finally at the end of the post we also run the benchmark under PyPy to compare performance.

Cython

First we compile our Python code with Cython. Normally when using Cython we annotate our variables with types, giving the compiler enough information to avoid using Python altogether. However in our case we don’t have many numeric operations and we’re going to be using Python data structures regardless, so this won’t help much. We compile our original Python code without alteration.

cythonize -i benchmark.py

And run

$ python -c "import benchmark"
Duration: 0.73 seconds

This gives us a decent speedup from 1.1 seconds to 0.73 seconds. This isn’t huge relative to typical Cython speedups (which are often 10-100x) but would be a very welcome change for our scheduler, where we’ve been chasing 5% optimizations for a while now.

Interning Strings

Our second trick is to intern strings. This means that we try to always have only one copy of every string. This improves performance when doing dictionary lookups because of the following:

  1. Python computes the hash of the string only once (strings cache their hash value once computed)
  2. Python checks for object identity (fast) before moving on to value equality (slow)

Or, anecdotally, the identity check text is text is faster in Python than the equality check text == text. If you ensure that there is only one copy of every string then you only need to do identity comparisons like text is text.

So, if any time we see a string "abc" it is exactly the same object as all other "abc" strings in our program, then string-dict lookups will only require a pointer/integer equality check, rather than having to do a full string comparison.
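
A small illustration of the idea, using the standard library’s sys.intern rather than the hand-rolled helper below:

import sys

a = 'A-%d' % 123     # built at runtime, so not automatically interned
b = 'A-%d' % 123

a == b               # True: equal by value
a is b               # False: two distinct objects, so dict lookups fall back
                     # to a full character-by-character comparison

a2 = sys.intern(a)
b2 = sys.intern(b)
a2 is b2             # True: one shared object, an identity check suffices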

Adding string interning to our benchmark looks like the following:

inter = {}

def intern(x):
    try:
        return inter[x]
    except KeyError:
        inter[x] = x
        return x

A = {intern('A-%d' % i): [intern('B-%d' % random.randint(0, nB - 1))
                          for i in range(random.randint(0, 5))]
     for i in range(nA)}

B = {intern('B-%d' % i): {intern('C-%d' % random.randint(0, nC - 1))
                          for i in range(random.randint(1, 3))}
     for i in range(nB)}

C = {intern('C-%d' % i): i for i in range(nC)}

data = [intern('A-%d' % i) for i in range(nA)]

# The rest of the benchmark is as before

This brings our duration from 1.1s down to 0.75s. Note that this is without the separate Cython improvements described just above.

Cython + Interning

We can combine both optimizations. This brings us to around 0.45s, a 2-3x improvement over our original time.

cythonize -i benchmark2.py

$ python -c "import benchmark2"
Duration: 0.46 seconds

PyPy

Alternatively, we can just run everything in PyPy.

$ pypy3 benchmark1.py  # original
Duration: 0.25 seconds

$ pypy3 benchmark2.py  # includes interning
Duration: 0.20 seconds

So PyPy can be quite a bit faster than Cython on this sort of code (which is not a big surprise). Interning helps a bit, but not quite as much.

This is fairly encouraging. The Dask scheduler can run under PyPy even while Dask clients and workers run under normal CPython (for use with the full PyData stack).

Preliminary Results on Dask Benchmark

We started this experiment with the assumption that our toy benchmark somehow represented the Dask scheduler in terms of performance characteristics. This assumption, of course, is false. The Dask scheduler is significantly more complex and it is difficult to build a single toy example to represent its performance.

When we try these tricks on a slightly more complex benchmark that actually uses the Dask scheduler we find the following results:

  • Cython: almost no effect
  • String Interning: almost no effect
  • PyPy: almost no effect

However I have only spent a brief amount of time on this (twenty minutes?) and so I hope that the lack of a performance gain here is due to lack of effort.

If anyone is interested in this I hope that this blogpost contains enough information to get anyone started if they want to investigate further.

Dask Release 0.16.0

This work is supported by Anaconda Inc. and the Data Driven Discovery Initiative from the Moore Foundation.

I’m pleased to announce the release of Dask version 0.16.0. This is a major release with new features, breaking changes, and stability improvements. This blogpost outlines notable changes since the 0.15.3 release on September 24th.

You can conda install Dask:

conda install dask

or pip install from PyPI:

pip install dask[complete] --upgrade

Conda packages are available on both conda-forge and default channels.

Full changelogs are available here:

Some notable changes follow.

Breaking Changes

  • The dask.async module was moved to dask.local for Python 3.7 compatibility (async becomes a reserved keyword in Python 3.7). This was previously deprecated and is now fully removed.
  • The distributed scheduler’s diagnostic JSON pages have been removed and replaced by more informative templated HTML.
  • The commonly used private methods _keys and _optimize have been replaced by the Dask collection interface (see below).

Dask collection interface

It is now easier to implement custom collections using the Dask collection interface.

Dask collections (arrays, dataframes, bags, delayed) interact with Dask schedulers (single-machine, distributed) with a few internal methods. We formalized this interface into protocols like .__dask_graph__() and .__dask_keys__() and have published that interface. Any object that implements the methods described in that document will interact with all Dask scheduler features as a first-class Dask object.

class MyDaskCollection(object):
    def __dask_graph__(self):
        ...

    def __dask_keys__(self):
        ...

    def __dask_optimize__(self, ...):
        ...

    ...

This interface has already been implemented within the XArray project for labeled and indexed arrays. Now all XArray classes (DataSet, DataArray, Variable) are fully understood by all Dask schedulers. They are as first-class as dask.arrays or dask.dataframes.

import xarray as xa
from dask.distributed import Client

client = Client()

ds = xa.open_mfdataset('*.nc', ...)

ds = client.persist(ds)  # XArray objects integrate seamlessly with Dask schedulers

Work on Dask’s collection interfaces was primarily done by Jim Crist.

Bandwidth and Tornado 5 compatibility

Dask is built on the Tornado library for concurrent network programming. In an effort to improve inter-worker bandwidth on exotic hardware (Infiniband), Dask developers are proposing changes to Tornado’s network infrastructure.

However, in order to use these changes Dask itself needs to run on the next version of Tornado in development, Tornado 5.0.0, which breaks a number of interfaces on which Dask has relied. Dask developers have been resolving these and we encourage other PyData developers to do the same. For example, neither Bokeh nor Jupyter work on Tornado 5.0.0-dev.

Dask inter-worker bandwidth is peaking at around 1.5-2GB/s on a network theoretically capable of 3GB/s. GitHub issue: pangeo #6

Dask worker bandwidth

Network performance and Tornado compatibility are primarily being handled by Antoine Pitrou.

Parquet Compatibility

Dask.dataframe can use either of the two common Parquet libraries in Python, Apache Arrow and Fastparquet. Each has its own strengths and its own base of users who prefer it. We’ve significantly extended Dask’s parquet test suite to cover each library, extending roundtrip compatibility. Notably, you can now both read and write with PyArrow.

df.to_parquet('...', engine='fastparquet')
df = dd.read_parquet('...', engine='pyarrow')

There is still work to be done here. The variety of parquet reader/writers and conventions out there makes completely solving this problem difficult. It’s nice seeing the various projects slowly converge on common functionality.

This work was jointly done by Uwe Korn, Jim Crist, and Martin Durant.

Retrying Tasks

One of the most requested features for the Dask.distributed scheduler is the ability to retry failed tasks. This is particularly useful to people using Dask as a task queue, rather than as a big dataframe or array.

future = client.submit(func, *args, retries=5)

Task retries were primarily built by Antoine Pitrou.

Transactional Work Stealing

The Dask.distributed task scheduler performs load balancing through work stealing. Previously this would sometimes result in the same task running simultaneously in two locations. Now stealing is transactional, meaning that it will avoid accidentally running the same task twice. This behavior is especially important for people using Dask tasks for side effects.

It is still possible for the same task to run twice, but now this only happens in more extreme situations, such as when a worker dies or a TCP connection is severed, neither of which are common on standard hardware.

Transactional work stealing was primarily implemented by Matthew Rocklin.

New Diagnostic Pages

There is a new set of diagnostic web pages available in the Info tab of the dashboard. These pages provide more in-depth information about each worker and task, but are not dynamic in any way. They use Tornado templates rather than Bokeh plots, which means that they are less responsive but are much easier to build. This is an easy and cheap way to expose more scheduler state.

Task page of Dask's scheduler info dashboard

Nested compute calls

Calling .compute() within a task now invokes the same distributed scheduler. This enables writing more complex workloads with less thought to starting worker clients.

import dask
from dask.distributed import Client

client = Client()  # only works for the newer scheduler

@dask.delayed
def f(x):
    ...
    return dask.compute(...)  # can call dask.compute within a delayed task

dask.compute([f(i) for ...])

Nested compute calls were primarily developed by Matthew Rocklin and Olivier Grisel.

More aggressive Garbage Collection

The workers now explicitly call gc.collect() at various times when under memory pressure and when releasing data. This helps to avoid some memory leaks, especially when using Pandas dataframes. Doing this carefully proved to require a surprising degree of detail.

Improved garbage collection was primarily implemented and tested by Fabian Keller and Olivier Grisel, with recommendations by Antoine Pitrou.

Dask-ML

A variety of Dask Machine Learning projects are now being assembled under one unified repository, dask-ml. We encourage users and researchers alike to read through that project. We believe there are many useful and interesting approaches contained within.

The work to assemble and curate these algorithms is primarily being handled by Tom Augspurger.

XArray

The XArray project for indexed and labeled arrays is also releasing their major 0.10.0 release this week, which includes many performance improvements, particularly for using Dask on larger datasets.

Acknowledgements

The following people contributed to the dask/dask repository since the 0.15.3 release on September 24th:

  • Ced4
  • Christopher Prohm
  • fjetter
  • Hai Nguyen Mau
  • Ian Hopkinson
  • James Bourbeau
  • James Munroe
  • Jesse Vogt
  • Jim Crist
  • John Kirkham
  • Keisuke Fujii
  • Matthias Bussonnier
  • Matthew Rocklin
  • mayl
  • Martin Durant
  • Olivier Grisel
  • severo
  • Simon Perkins
  • Stephan Hoyer
  • Thomas A Caswell
  • Tom Augspurger
  • Uwe L. Korn
  • Wei Ji
  • xwang777

The following people contributed to the dask/distributed repository since the 1.19.1 release on September 24th:

  • Alvaro Ulloa
  • Antoine Pitrou
  • chkoar
  • Fabian Keller
  • Ian Hopkinson
  • Jim Crist
  • Kelvin Yang
  • Krisztián Szűcs
  • Matthew Rocklin
  • Mike DePalatis
  • Olivier Grisel
  • rbubley
  • Tom Augspurger

The following people contributed to the dask/dask-ml repository

  • Evan Welch
  • Matthew Rocklin
  • severo
  • Tom Augspurger
  • Trey Causey

In addition, we are proud to announce that Olivier Grisel has accepted commit rights to the Dask projects. Olivier has been particularly active on the distributed scheduler, and on related projects like Joblib, SKLearn, and Cloudpickle.


Dask Development Log

This work is supported by Anaconda Inc and the Data Driven Discovery Initiative from the Moore Foundation

To increase transparency I’m trying to blog more often about the current work going on around Dask and related projects. Nothing here is ready for production. This blogpost is written in haste, so refined polish should not be expected.

Current development in Dask and Dask-related projects includes the following efforts:

  1. A possible change to our community communication model
  2. A rewrite of the distributed scheduler
  3. Kubernetes and Helm Charts for Dask
  4. Adaptive deployment fixes
  5. Continued support for NumPy and Pandas’ growth
  6. Spectral Clustering in Dask-ML

Community Communication

Dask community communication generally happens in Github issues for bug and feature tracking, the Stack Overflow #dask tag for user questions, and an infrequently used Gitter chat.

Separately, Dask developers who work for Anaconda Inc (there are about five of us part-time) use an internal company chat and a closed weekly video meeting. We’re now trying to migrate away from closed systems when possible.

Details about future directions are in dask/dask #2945. Thoughts and comments on that issue would be welcome.

Scheduler Rewrite

When you start building clusters with 1000 workers the distributed scheduler can become a bottleneck on some workloads. After working with PyPy and Cython development teams we’ve decided to rewrite parts of the scheduler to make it more amenable to acceleration by those technologies. Note that no actual acceleration has occurred yet, just a refactor of internal state.

Previously the distributed scheduler was focused around a large set of Python dictionaries, sets, and lists that indexed into each other heavily. This was done both to keep the code technology simple and for performance reasons (Python core data structures are fast). However, compiler technologies like PyPy and Cython can optimize Python object access down to C speeds, so we’re experimenting with switching away from Python data structures to Python objects to see how much this is able to help.
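
To give a flavor of the kind of change (the class and attribute names here are illustrative, not the actual scheduler code), state moves from several parallel dictionaries keyed by task key onto one object per task:

# Before: parallel dictionaries, all keyed by task key
dependencies = {}    # key -> set of keys
priorities = {}      # key -> numeric priority
who_has = {}         # key -> set of worker addresses

# After: one object per task, whose attribute access tools like
# Cython and PyPy can optimize down to struct-like lookups
class TaskState(object):
    __slots__ = ('key', 'dependencies', 'priority', 'who_has')

    def __init__(self, key):
        self.key = key
        self.dependencies = set()
        self.priority = None
        self.who_has = set()

tasks = {}           # key -> TaskState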

This change will be invisible operationally (the full test suite remains virtually unchanged), but will be a significant change to the scheduler’s internal state. We’re keeping around a compatibility layer, but people who were building their own diagnostics around the internal state should review the new changes.

Ongoing work by Antoine Pitrou in dask/distributed #1594

Kubernetes and Helm Charts for Dask

In service of the Pangeo project to enable scalable data analysis of atmospheric and oceanographic data we’ve been improving the tooling around launching Dask on Cloud infrastructure, particularly leveraging Kubernetes.

To that end we’re making some flexible Docker containers and Helm Charts for Dask, and hope to combine them with JupyterHub in the coming weeks.

Work done by myself in the following repositories. Feedback would be very welcome. I am learning on the job with Helm here.

If you use Helm on Kubernetes then you might want to try the following:

helm repo add dask https://dask.github.io/helm-chart
helm update
helm install dask/dask

This installs a full Dask cluster and a Jupyter server. The Docker containers contain entry points that allow their environments to be updated with custom packages easily.

This work extends prior work on the previous package, dask-kubernetes, but is slightly more modular for use alongside other systems.

Adaptive deployment fixes

Adaptive deployments, where a cluster manager scales a Dask cluster up or down based on current workload, recently got a makeover, including a number of bug fixes around odd or infrequent behavior.

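As a rough illustration, adaptive mode is typically turned on with a single call on the cluster object (a minimal sketch; keyword names and availability may vary between versions):

from dask.distributed import LocalCluster, Client

cluster = LocalCluster()              # any cluster object exposing .adapt()
cluster.adapt(minimum=1, maximum=10)  # scale between 1 and 10 workers based on load
client = Client(cluster)
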
Work done by Russ Bubley here:

Keeping up with NumPy and Pandas

NumPy 1.14 is due to release soon. Dask.array had to update how it handled structured dtypes in dask/dask #2694 (Work by Tom Augspurger).

Dask.dataframe is gaining the ability to merge/join simultaneously on columns and indices, following a similar feature released in Pandas 0.22. Work done by Jon Mease in dask/dask #2960
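
Once released, this should enable joins along the following lines (a sketch; left and right are hypothetical dask dataframes, and the keywords mirror the pandas merge API):

import dask.dataframe as dd

# join the left frame's 'id' column against the right frame's index
joined = dd.merge(left, right, left_on='id', right_index=True, how='inner')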

Spectral Clustering in Dask-ML

Dask-ML recently added an approximate and scalable Spectral Clustering algorithm in dask/dask-ml #91 (gallery example).
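
Usage follows the Scikit-Learn estimator interface (a minimal sketch; the parameters shown are assumptions, and X stands in for a dask array of samples):

from dask_ml.cluster import SpectralClustering

model = SpectralClustering(n_clusters=3)
model.fit(X)   # X: array-like of shape (n_samples, n_features)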

Pangeo: JupyterHub, Dask, and XArray on the Cloud

This work is supported by Anaconda Inc, the NSF EarthCube program, and UC Berkeley BIDS

A few weeks ago a few of us stood up pangeo.pydata.org, an experimental deployment of JupyterHub, Dask, and XArray on Google Container Engine (GKE) to support atmospheric and oceanographic data analysis on large datasets. This follows on recent work to deploy Dask and XArray for the same workloads on super computers. This system is a proof of concept that has taught us a great deal about how to move forward. This blogpost briefly describes the problem, the system, then describes the collaboration, and finally discusses a number of challenges that we’ll be working on in coming months.

The Problem

Atmospheric and oceanographic sciences collect (with satellites) and generate (with simulations) large datasets that they would like to analyze with distributed systems. Libraries like Dask and XArray already solve this problem computationally if scientists have their own clusters, but we seek to expand access by deploying on cloud-based systems. We build a system to which people can log in, get Jupyter Notebooks, and launch Dask clusters without much hassle. We hope that this increases access, and connects more scientists with more cloud-based datasets.

The System

We integrate several pre-existing technologies to build a system where people can log in, get access to a Jupyter notebook, launch distributed compute clusters using Dask, and analyze large datasets stored in the cloud. They have a full user environment available to them through a website, can leverage thousands of cores for computation, and use existing APIs and workflows that look familiar to how they work on their laptop.

A video walk-through follows below:

We assembled this system from a number of pieces and technologies:

  • JupyterHub: Provides both the ability to launch single-user notebook servers and handles user management for us. In particular we use the KubeSpawner and the excellent documentation at Zero to JupyterHub, which we recommend to anyone interested in this area.
  • KubeSpawner: A JupyterHub spawner that makes it easy to launch single-user notebook servers on Kubernetes systems
  • JupyterLab: The newer version of the classic notebook, which we use to provide a richer remote user interface, complete with terminals, file management, and more.
  • XArray: Provides computation on NetCDF-style data. XArray extends NumPy and Pandas to enable scientists to express complex computations on complex datasets in ways that they find intuitive.
  • Dask: Provides the parallel computation behind XArray
  • Daskernetes: Makes it easy to launch Dask clusters on Kubernetes
  • Kubernetes: In case it’s not already clear, all of this is based on Kubernetes, which manages launching programs (like Jupyter notebook servers or Dask workers) on different machines, while handling load balancing, permissions, and so on
  • Google Container Engine: Google’s managed Kubernetes service. Every major cloud provider now has such a system, which makes us happy about not relying too heavily on one system
  • GCSFS: A Python library providing intuitive access to Google Cloud Storage, either through Python file interfaces or through a FUSE file system
  • Zarr: A chunked array storage format that is suitable for the cloud

Collaboration

We were able to build, deploy, and use this system to answer real science questions in a couple weeks. We feel that this result is significant in its own right, and is largely because we collaborated widely. This project required the expertise of several individuals across several projects, institutions, and funding sources. Here are a few examples of who did what from which organization. We list institutions and positions mostly to show the roles involved.

  • Alistair Miles, Professor, Oxford: Helped to optimize Zarr for XArray on GCS
  • Jacob Tomlinson, Staff, UK Met Informatics Lab: Developed original JADE deployment and early Dask-Kubernetes work.
  • Joe Hamman, Postdoc, National Center for Atmospheric Research: Provided scientific use case, data, and work flow. Tuned XArray and Zarr for efficient data storing and saving.
  • Martin Durant, Software developer, Anaconda Inc.: Tuned GCSFS for many-access workloads. Also provided FUSE system for NetCDF support
  • Matt Pryor, Staff, Centre for Environmental Data Analysis: Extended original JADE deployment and early Dask-Kubernetes work.
  • Matthew Rocklin, Software Developer, Anaconda Inc.: Integration and performance testing.
  • Ryan Abernathey, Assistant Professor, Columbia University: XArray + Zarr support, scientific use cases, coordination
  • Stephan Hoyer, Software engineer, Google: XArray support
  • Yuvi Panda, Staff, UC Berkeley BIDS and Data Science Education Program: Provided assistance configuring JupyterHub with KubeSpawner. Also prototyped the Daskernetes Dask + Kubernetes tool.

Notice the mix of academic and for-profit institutions. Also notice the mix of scientists, staff, and professional software developers. We believe that this mixture helps ensure the efficient construction of useful solutions.

Lessons

This experiment has taught us a few things that we hope to explore further:

  1. Users can launch Kubernetes deployments from Kubernetes pods, such as launching Dask clusters from their JupyterHub single-user notebooks.

    To do this well we need to start defining user roles more explicitly within JupyterHub. We need to give users a safe and isolated space on the cluster to use without affecting their neighbors.

  2. HDF5 and NetCDF on cloud storage is an open question

    The file formats used for this sort of data are pervasive, but not particularly convenient or efficient on cloud storage. In particular, the libraries used to read them make many small reads, each of which is costly when operating on cloud object storage.

    I see a few options:

    1. Use FUSE file systems, but tune them with tricks like read-ahead and caching in order to compensate for HDF’s access patterns
    2. Use the HDF group’s proposed HSDS service, which promises to resolve these issues
    3. Adopt new file formats that are more cloud friendly. Zarr is one such example that has so far performed admirably, but certainly doesn’t have the long history of trust that HDF and NetCDF have earned.
  3. Environment customization is important and tricky, especially when adding distributed computing.

    Immediately after showing this to science groups they want to try it out with their own software environments. They can do this easily in their notebook session with tools like pip or conda, but to apply those same changes to their dask workers is a bit more challenging, especially when those workers come and go dynamically.

    We have solutions for this. They can build and publish Docker images. They can add environment variables to specify extra pip or conda packages. They can deploy their own pangeo deployment for their own group.

    However these have all taken some work to do well so far. We hope that some combination of Binder-like publishing and small modification tricks like environment variables resolve this problem.

  4. Our docker images are very large. This means that users sometimes need to wait a minute or more for their session or their dask workers to start up (less after things have warmed up a bit).

    It is surprising how much of this comes from conda and node packages. We hope to resolve this both by improving our Docker hygiene and by engaging packaging communities to audit package size.

  5. Explore other clouds

    We started with Google just because their Kubernetes support has been around the longest, but all major cloud providers (Google, AWS, Azure) now provide some level of managed Kubernetes support. Everything we’ve done has been cloud-vendor agnostic, and various groups with data already on other clouds have reached out and are starting deployment on those systems.

  6. Combine efforts with other groups

    We’re actually not the first group to do this. The UK Met Informatics Lab quietly built a similar prototype, JADE (Jupyter and Dask Environment) many months ago. We’re now collaborating to merge efforts.

    It’s also worth mentioning that they prototyped the first iteration of Daskernetes.

  7. Reach out to other communities

    While we started our collaboration with atmospheric and oceanographic scientists, these same solutions apply to many other disciplines. We should investigate other fields and start collaborations with those communities.

  8. Improve Dask + XArray algorithms

    When we try new problems in new environments we often uncover new opportunities to improve Dask’s internal scheduling algorithms. This case is no different :)

Much of this upcoming work is happening in the upstream projects so this experimentation is both of concrete use to ongoing scientific research as well as more broad use to the open source communities that these projects serve.

Community uptake

We presented this at a couple conferences over the past week.

We found that this project aligns well with current efforts from many government agencies to publish large datasets on cloud stores (mostly S3). Many of these data publication endeavors seek a computational system to enable access for the scientific public. Our project seems to complement these needs without significant coordination.

Disclaimers

While we encourage people to try out pangeo.pydata.org we also warn you that this system is immature. In particular it has the following issues:

  1. it is insecure, please do not host sensitive data
  2. it is unstable, and may be taken down at any time
  3. it is small, we only have a handful of cores deployed at any time, mostly for experimentation purposes

However it is also open, and instructions to deploy your own live here.

Come help

We are a growing group comprised of many institutions including technologists, scientists, and open source projects. There is plenty to do and plenty to discuss. Please engage with us at github.com/pangeo-data/pangeo/issues/new

Write Dumb Code

The best way you can contribute to an open source project is to remove lines of code from it. We should endeavor to write code that a novice programmer can easily understand without explanation or that a maintainer can understand without significant time investment.

As students we attempt increasingly challenging problems with increasingly sophisticated technologies. We first learn loops, then functions, then classes, etc.. We are praised as we ascend this hierarchy, writing longer programs with more advanced technology. We learn that experienced programmers use monads while new programmers use for loops.

Then we graduate and find a job or open source project to work on with others. We search for something that we can add, and implement a solution pridefully, using all the tricks that we learned in school.

Ah ha! I can extend this project to do X! And I can use inheritance here! Excellent!

We implement this feature and feel accomplished, and with good reason. Programming in real systems is no small accomplishment. This was certainly my experience. I was excited to write code and proud that I could show off all of the things that I knew how to do to the world. As evidence of my historical love of programming technology, here is a linear algebra language built with another meta-programming language. Notice that no one has touched this code in several years.

However after maintaining code a bit more I now think somewhat differently.

  1. We should not seek to build software. Software is the currency that we pay to solve problems, which is our actual goal. We should endeavor to build as little software as possible to solve our problems.
  2. We should use technologies that are as simple as possible, so that as many people as possible can use and extend them without needing to understand our advanced techniques. We should use advanced techniques only when we are not smart enough to figure out how to use more common techniques.

Neither of these points is novel. Most people I meet agree with them to some extent, but somehow we forget them when we go to contribute to a new project. The instincts to contribute by building and to demonstrate sophistication often take over.

Software is a cost

Every line that you write costs people time. It costs you time to write it of course, but you are willing to make this personal sacrifice. However this code also costs the reviewers their time to understand it. It costs future maintainers and developers their time as they fix and modify your code. They could be spending this time outside in the sunshine or with their family.

So when you add code to a project you should feel meek. It should feel as though you are eating with your family and there isn’t enough food on the table. You should take only what you need and no more. The people with you will respect you for your efforts to restrict yourself. Solving problems with less code is hard, but it is a burden that you take on yourself to lighten the burdens of others.

Complex technologies are harder to maintain

As students, we demonstrate merit by using increasingly advanced technologies. Our measure of worth depends on our ability to use functions, then classes, then higher order functions, then monads, etc. in public projects. We show off our solutions to our peers and feel pride or shame according to our sophistication.

However when working with a team to solve problems in the world the situation is reversed. Now we strive to solve problems with code that is as simple as possible. When we solve a problem simply we enable junior programmers to extend our solution to solve other problems. Simple code enables others and boosts our impact. We demonstrate our value by solving hard problems with only basic techniques.

Look! I replaced this recursive function with a for loop and it still does everything that we need it to. I know it’s not as clever, but I noticed that the interns were having trouble with it and I thought that this change might help.

If you are a good programmer then you don’t need to demonstrate that you know cool tricks. Instead, you can demonstrate your value by solving a problem in a simple way that enables everyone on your team to contribute in the future.

But moderation, of course

That being said, over-adherence to the “build things with simple tools” dogma can be counterproductive. Often a recursive solution can be much simpler than a for-loop solution, and often using a Class or a Monad is the right approach. But we should be mindful when using these technologies that we are building our own system, a system with which others have had no experience.

The Case for Numba in Community Code

The numeric Python community should consider adopting Numba more widely within community code.

Numba is strong in performance and usability, but historically weak in ease of installation and community trust. This blogpost discusses these strengths and weaknesses from the perspective of an OSS library maintainer. It uses other more technical blogposts written on the topic as references. It is biased in favor of wider adoption given recent changes to the project.

Let’s start with a wildly unprophetic quote from Jake Vanderplas in 2013:

I’m becoming more and more convinced that Numba is the future of fast scientific computing in Python.

– Jake Vanderplas, 2013-06-15

http://jakevdp.github.io/blog/2013/06/15/numba-vs-cython-take-2/

We’ll use the following blogposts by other community members throughout this post. They’re all good reads and are more technical, showing code examples, performance numbers, etc..

At the end of the blogpost these authors will also share some thoughts on Numba today, looking back with some hindsight.

Disclaimer: I work alongside many of the Numba developers within the same company and am partially funded through the same granting institution.

Compiled code in Python

Many open source numeric Python libraries need to write efficient low-level code that works well on Numpy arrays, but is more complex than the Numpy library itself can express. Typically they use one of the following options:

  1. C-extensions: mostly older projects like NumPy and Scipy
  2. Cython: probably the current standard for mainline projects, like scikit-learn, pandas, scikit-image, geopandas, and so on
  3. Standalone C/C++ codebases with Python wrappers: for newer projects that target inter-language operation, like XTensor and Arrow

Each of these choices has tradeoffs in performance, packaging, attracting new developers and so on. Ideally we want a solution that is …

  1. Fast: about as fast as C/Fortran
  2. Easy: Is accessible to a broad base of developers and maintainers
  3. Builds easily: Introduces few complications in building and packaging
  4. Installs easily: Introduces few install and runtime dependencies
  5. Trustworthy: Is well trusted within the community, both in terms of governance and long term maintenance

The two main approaches today, Cython and C/C++, both do well on most of these objectives. However neither is perfect. Some issues that arise include the following:

  • Cython
    • Often requires effort to make fast
    • Is often only used by core developers. Requires expertise to use well.
    • Introduces mild packaging pain, though this pain is solved frequently enough that experienced community members are used to dealing with it
  • Standalone C/C++
    • Sometimes introduces complex build and packaging concerns
    • Is often only used by core developers. These projects have difficulty attracting the Python community’s standard developer pool (though they do attract developers from other communities).

There are some other options out there like Numba and Pythran that, while they provide excellent performance and usability benefits, are rarely used. Let’s look into Numba’s benefits and drawbacks more closely.

Numba Benefits

Numba is generally well regarded from a technical perspective (it’s fast, easy to use, well maintained, etc.) but has historically not been trusted due to packaging and community concerns.

In any test of either performance or usability Numba almost always wins (or ties for the win). It does all of the compiler optimization tricks you expect. It supports both for-loopy code as well as Numpy-style slicing and bulk operation code. It requires almost no additional information from the user (assuming that you’re ok with JIT behavior) and so is very approachable, and very easy for novices to use well.

This means that we get phrases like the following:

  • https://dionhaefner.github.io/2016/11/suck-less-scientific-python-part-2-efficient-number-crunching/
    • “This is rightaway faster than NumPy.”
    • “In fact, we can infer from this that numba managed to generate pure C code from our function and that it did it already previously.”
    • “Numba delivered the best performance on this problem, while still being easy to use.”
  • https://dionhaefner.github.io/2016/11/suck-less-scientific-python-part-2-efficient-number-crunching/
    • “Using numba is very simple; just apply the jit decorator to the function you want to get compiled. In this case, the function code is exactly the same as before”
    • “Wow! A speedup by a factor of about 400, just by applying a decorator to the function. “
  • http://jakevdp.github.io/blog/2015/02/24/optimizing-python-with-numpy-and-numba/
    • “Much better! We’re now within about a factor of 3 of the Fortran speed, and we’re still writing pure Python!”
    • “I should emphasize here that I have years of experience with Cython, and in this function I’ve used every Cython optimization there is … By comparison, the Numba version is a simple, unadorned wrapper around plainly-written Python code.”
  • http://jakevdp.github.io/blog/2013/06/15/numba-vs-cython-take-2/
    • Numba is extremely simple to use. We just wrap our python function with autojit (JIT stands for “just in time” compilation) to automatically create an efficient, compiled version of the function
    • Adding this simple expression speeds up our execution by over a factor of over 1400! For those keeping track, this is about 50% faster than the version of Numba that I tested last August on the same machine.
    • The Cython version, despite all the optimization, is a few percent slower than the result of the simple Numba decorator!
  • http://stephanhoyer.com/2015/04/09/numba-vs-cython-how-to-choose/
    • “Using Numba is usually about as simple as adding a decorator to your functions”
    • “Numba is usually easier to write for the simple cases where it works”
  • https://murillogroupmsu.com/numba-versus-c/
    • “Numba allows for speedups comparable to most compiled languages with almost no effort”
    • “We find that Numba is more than 100 times as fast as basic Python for this application. In fact, using a straight conversion of the basic Python code to C++ is slower than Numba.”

In all cases where authors compared Numba to Cython for numeric code (Cython is probably the standard for these cases) Numba always performs as-well-or-better and is always much simpler to write.

Here is a code example from Jake’s second blogpost:

Example: Code Complexity

# From http://jakevdp.github.io/blog/2015/02/24/optimizing-python-with-numpy-and-numba/

# Numba                                 # Cython
import numpy as np                      import numpy as np
import numba                            cimport cython
                                        from libc.math cimport sqrt

                                        @cython.boundscheck(False)
@numba.jit                              @cython.wraparound(False)
def pairwise_python(X):                 def pairwise_cython(double[:, ::1] X):
    M = X.shape[0]                          cdef int M = X.shape[0]
    N = X.shape[1]                          cdef int N = X.shape[1]
                                            cdef double tmp, d
    D = np.empty((M, M), dtype=np.float)    cdef double[:, ::1] D = np.empty((M, M),
                                                                             dtype=np.float64)
    for i in range(M):                      for i in range(M):
        for j in range(M):                      for j in range(M):
            d = 0.0                                 d = 0.0
            for k in range(N):                      for k in range(N):
                tmp = X[i, k] - X[j, k]                 tmp = X[i, k] - X[j, k]
                d += tmp * tmp                          d += tmp * tmp
            D[i, j] = np.sqrt(d)                    D[i, j] = sqrt(d)
    return D                                return np.asarray(D)

The algorithmic body of each function (the nested for loops) is identical. However the Cython code is more verbose with annotations, both at the function definition (which we would expect for any AOT compiler), but also within the body of the function for various utility variables. The Numba code is just straight Python + Numpy code. We could remove the @numba.jit decorator and step through our function with normal Python.

Example: Numpy Operations

Additionally Numba lets us use Numpy syntax directly in the function, so for example the following function is well accelerated by Numba, even though it already fits NumPy’s syntax well.

# from https://flothesof.github.io/optimizing-python-code-numpy-cython-pythran-numba.html
import numba
import numpy as np

@numba.jit
def laplace_numba(image):
    """Laplace operator in NumPy for 2D images. Accelerated using numba."""
    laplacian = (image[:-2, 1:-1] + image[2:, 1:-1]
                 + image[1:-1, :-2] + image[1:-1, 2:]
                 - 4*image[1:-1, 1:-1])
    thresh = np.abs(laplacian) > 0.05
    return thresh

Mixing and matching Numpy-style with for-loop style is often helpful when writing complex numeric algorithms.

Benchmarks in these blogposts show that Numba is both simpler to use and often as-fast-or-faster than more commonly used technologies like Cython.

Numba drawbacks

So, given these advantages why didn’t Jake’s original prophecy hold true?

I believe that there are three primary reasons why Numba has not been more widely adopted among other open source projects:

  1. LLVM Dependency: Numba depends on LLVM, which was historically difficult to install without a system package manager (like apt-get, brew) or conda. Library authors are not willing to exclude users that use other packaging toolchains, particularly Python’s standard tool, pip.
  2. Community Trust: Numba is largely developed within a single for-profit company (Anaconda Inc.) and its developers are not well known by other library maintainers.
  3. Lack of Interpretability: Numba’s output, LLVM, is less well understood by the community than Cython’s output, C (discussed in original-author comments in the last section)

All three of these are excellent reasons to avoid adding a dependency. Technical excellence alone is insufficient, and must be considered alongside community and long-term maintenance concerns.

But Numba has evolved recently

LLVM

Numba now depends on the easier-to-install library llvmlite, which, as of a few months ago, is pip installable with binary wheels on Windows, Mac, and Linux. The llvmlite package is still a heavy-ish runtime dependency (42MB), but that’s significantly less than large Cython libraries like Pandas or SciPy.

If your concern was about the average user’s inability to install Numba, then I think that this concern has been resolved.

Community

Numba has three community problems:

  1. Development of Numba has traditionally happened within the closed walls of Anaconda Inc (formerly Continuum Analytics)
  2. The Numba maintainers are not well known within the broader Python community
  3. There used to be a proprietary version, Numba Pro

This combination strongly attached Numba’s image to Continuum’s for-profit ventures, making community-oriented software maintainers understandably wary of dependence, for fear that dependence on this library might be used for Continuum’s financial gain at the expense of community users.

Things have changed significantly.

Numba Pro was abolished years ago. The funding for the project today comes more often from Anaconda Inc. consulting revenue, hardware vendors looking to ensure that Python runs as efficiently as possible on their systems, and from generous donations from the Gordon and Betty Moore foundation to ensure that Numba serves the open source Python community.

Developers outside of Anaconda Inc. now have core commit access, which forces communication to happen in public channels, notably GitHub (which was standard before) and Gitter chat (which is relatively new).

The maintainers are still relatively unknown within the broader community. This isn’t due to any sort of conspiracy, but is instead due more to shyness or having interests outside of OSS. Antoine, Siu, Stan, and Stuart are all considerate, funny, and clever fellows with strong enthusiasm for compilers, OSS, and performance. They are quite responsive on the Numba mailing list should you have any questions or concerns.

If your concern was about Numba trapping users into a for-profit mode, then that seems to have been resolved years ago.

If your concern is more about not knowing who is behind the project then I encourage you to reach out. I would be surprised if you don’t walk away pleased.

The Continued Cases Against Numba

For completeness, let’s list a number of reasons why it is still quite reasonable to avoid Numba today:

  1. It isn’t a community standard
  2. Numba hasn’t attracted a wide developer base (compilers are hard), and so is probably still dependent on financial support for paid developers
  3. You want to speed up non-numeric code that includes classes, dicts, lists, etc., for which you need Cython or PyPy
  4. You want to build a library that is useful outside of Python, and so plan to build most numeric algorithms on C/C++/Fortran
  5. You prefer ahead-of-time compilation and want to avoid JIT times
  6. While llvmlite is cheaper than LLVM, it’s still 50MB
  7. Understanding the compiled results is hard, and you don’t have good familiarity with LLVM

Numba features we didn’t talk about

  1. Multi-core parallelism
  2. GPUs
  3. Run-time Specialization to the CPU you’re running on
  4. Easy to swap out for other JIT compilers, like PyPy, if they arise in the future

Update from the original blogpost authors

After writing the above I reached out both to Stan and Siu from Numba and to the original authors of the referenced blogposts to get some of their impressions now having the benefit of additional experience.

Here are a few choice responses:

  1. Stan:

    I think one of the biggest arguments against Numba still is time. Due to a massive rewrite of the code base, Numba, in its present form, is ~3 years old, which isn’t that old for a project like this. I think it took PyPy at least 5-7 years to reach a point where it was stable enough to really trust. Cython is 10 years old. People have good reason to be conservative with taking on new core dependencies.

  2. Jake:

    One thing I think would be worth fleshing-out a bit (you mention it in the final bullet list) is the fact that numba is kind of a black box from the perspective of the developer. My experience is that it works well for straightforward applications, but when it doesn’t work well it’s *extremely difficult to diagnose what the problem might be.*

    Contrast that with Cython, where the html annotation output does wonders for understanding your bottlenecks both at a non-technical level (“this is dark yellow so I should do something different”) and a technical level (“let me look at the C code that was generated”). If there’s a similar tool for numba, I haven’t seen it.

  • Florian:

    Elaborating on Jake’s answer, I completely agree that Cython’s annotation tool does wonders in terms of understanding your code. In fact, numba does possess this too, but as a command-line utility. I tried to demonstrate this in my blogpost, but exporting the CSS in the final HTML render kind of mangles my blog post so here’s a screenshot:

    Numba HTML annotations

    This is a case where jit(nopython=True) works, so there seems to be no coloring at all.

    Florian also pointed to the SciPy 2017 tutorial by Gil Forsyth and Lorena Barba

  • Dion:

    I hold Numba in high regard, and the speedups impress me every time. I use it quite often to optimize some bottlenecks in our production code or data analysis pipelines (unfortunately not open source). And I love how Numba makes some functions like scipy.optimize.minimize or scipy.ndimage.generic_filter well-usable with minimal effort.

    However, I would never use Numba to build larger systems, precisely for the reason Jake mentioned. Subjectively, Numba feels hard to debug, has cryptic error messages, and seemingly inconsistent behavior. It is not a “decorate and forget” solution; instead it always involves plenty of fiddling to get right.

    That being said, if I were to build some high-level scientific library à la Astropy with some few performance bottlenecks, I would definitely favor Numba over Cython (and if it’s just to spare myself the headache of getting a working C compiler on Windows).

  • Stephan:

    I wonder if there are any examples of complex codebases (say >1000 LOC) using Numba. My sense is that this is where Numba’s limitations will start to become more evident, but maybe newer features like jitclass would make this feasible.

As a final take-away, you might want to follow Florian’s advice and watch Gil and Lorena’s tutorial here:

HDF in the Cloud

Multi-dimensional data, such as is commonly stored in HDF and NetCDF formats, is difficult to access on traditional cloud storage platforms. This post outlines the situation, the following possible solutions, and their strengths and weaknesses.

  1. Cloud Optimized GeoTIFF: We can use modern and efficient formats from other domains, like Cloud Optimized GeoTIFF
  2. HDF + FUSE: Continue using HDF, but mount cloud object stores as a file system with FUSE
  3. HDF + Custom Reader: Continue using HDF, but teach it how to read from S3, GCS, ADL, …
  4. Build a Distributed Service: Allow others to serve this data behind a web API, built however they think best
  5. New Formats for Scientific Data: Design a new format, optimized for scientific data in the cloud

Not Tabular Data

If your data fits into a tabular format, such that you can use tools like SQL, Pandas, or Spark, then this post is not for you. You should consider Parquet, ORC, or any of a hundred other excellent formats or databases that are well designed for use on cloud storage technologies.

Multi-Dimensional Data

We’re talking about data that is multi-dimensional and regularly strided. This data often occurs in simulation output (like climate models), biomedical imaging (like an MRI scan), or needs to be efficiently accessed across a number of different dimensions (like many complex time series). Here is an image from the popular XArray library to put you in the right frame of mind:

This data is often stored in blocks such that, say, each 100x100x100 chunk of data is stored together, and can be accessed without reading through the rest of the file.

A few file formats allow this layout, the most popular of which is the HDF format, which has been the standard for decades and forms the basis for other scientific formats like NetCDF. HDF is a powerful and efficient format capable of handling both complex hierarchical data systems (filesystem-in-a-file) and efficiently blocked numeric arrays. Unfortunately HDF is difficult to access from cloud object stores (like S3), which presents a challenge to the scientific community.
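
To make the blocked layout concrete, here is roughly what creating such a chunked dataset looks like with the h5py library (a minimal sketch; the file name and sizes are arbitrary):

import h5py

with h5py.File('example.h5', 'w') as f:
    # each 100x100x100 block is stored contiguously and can be read independently
    dset = f.create_dataset('data', shape=(1000, 1000, 1000),
                            chunks=(100, 100, 100), dtype='f4')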

The Opportunity and Challenge of Cloud Storage

The scientific community generates several petabytes of HDF data annually. Supercomputer simulations (like a large climate model) produce a few petabytes. Planned NASA satellite missions will produce hundreds of petabytes a year of observational data. All of these tend to be stored in HDF.

To increase access, institutions now place this data on the cloud. Hopefully this generates more social value from existing simulations and observations, as they are ideally now more accessible to any researcher or any company without coordination with the host institution.

Unfortunately, the layout of HDF files makes them difficult to query efficiently on cloud storage systems (like Amazon’s S3, Google’s GCS, or Microsoft’s ADL). The HDF format is complex and metadata is strewn throughout the file, so that a complex sequence of reads is required to reach a specific chunk of data. The only pragmatic way to read a chunk of data from an HDF file today is to use the existing HDF C library, which expects to receive a C FILE object, pointing to a normal file system (not a cloud object store) (this is not entirely true, as we’ll see below).

So organizations like NASA are dumping large amounts of HDF onto Amazon’s S3 that no one can actually read, except by downloading the entire file to their local hard drive, and then pulling out the particular bits that they need with the HDF library. This is inefficient. It misses out on the potential that cloud-hosted public data can offer to our society.

The rest of this post discusses a few of the options to solve this problem, including their advantages and disadvantages.

  1. Cloud Optimized GeoTIFF: We can use modern and efficient formats from other domains, like Cloud Optimized GeoTIFF

    Good: Fast, well established

    Bad: Not sophisticated enough to handle some scientific domains

  2. HDF + FUSE: Continue using HDF, but mount cloud object stores as a file system with Filesystem in Userspace, aka FUSE

    Good: Works with existing files, no changes to the HDF library necessary, useful in non-HDF contexts as well

    Bad: It’s complex, probably not performance-optimal, and has historically been brittle

  3. HDF + Custom Reader: Continue using HDF, but teach it how to read from S3, GCS, ADL, …

    Good: Works with existing files, no complex FUSE tricks

    Bad: Requires plugins to the HDF library and tweaks to downstream libraries (like Python wrappers). Will require effort to make performance optimal

  4. Build a Distributed Service: Allow others to serve this data behind a web API, built however they think best

    Good: Lets other groups think about this problem and evolve complex backend solutions while maintaining stable frontend API

    Bad: Complex to write and deploy. Probably not free. Introduces an intermediary between us and our data.

  5. New Formats for Scientific Data: Design a new format, optimized for scientific data in the cloud

    Good: Fast, intuitive, and modern

    Bad: Not a community standard

Now we discuss each option in more depth.

Use Other Formats, like Cloud Optimized GeoTIFF

We could use formats other than HDF and NetCDF that are already well established. The two that I hear most often proposed are Cloud Optimized GeoTIFF and Apache Parquet. Both are efficient, well designed for cloud storage, and well established as community standards. If you haven’t already, I strongly recommend reading Chris Holmes’ (Planet) blog series on Cloud Native Geospatial.

These formats are well designed for cloud storage because they support random access well with relatively few communications and with relatively simple code. If you needed to you could look at the Cloud Optimized GeoTIFF spec, and within an hour of reading, get an image that you wanted using nothing but a few curl commands. Metadata is in a clear centralized place. That metadata provides enough information to issue further commands to get the relevant bytes from the object store. Those bytes are stored in a format that is easily interpreted by a variety of common tools across all platforms.

However, neither of these formats are sufficiently expressive to handle some of the use cases of HDF and NetCDF. Recall our image earlier about atmospheric data:

Our data isn’t a parquet table, nor is it a stack of geo-images. While it’s true that you could store any data in these formats, for example by saving each horizontal slice as a GeoTIFF, or each spatial point as a row in a Parquet table, these storage layouts would be inefficient for regular access patterns. Some parts of the scientific community need blocked layouts for multi-dimensional array data.

HDF and Filesystems in Userspace (FUSE)

We could access HDF data on the cloud now if we were able to convince our operating system that S3 or GCS or ADL were a normal file system. This is a reasonable goal; cloud object stores look and operate much like normal file systems. They have directories that you can list and navigate. They have files/objects that you can copy, move, rename, and from which you can read or write small sections.

We can achieve this using an operating systems trick, FUSE, or Filesystem in Userspace. This allows us to make a program that the operating system treats as a normal file system. Several groups have already done this for a variety of cloud providers. Here is an example with the gcsfs Python library

$ pip install gcsfs --upgrade
$ mkdir /gcs
$ gcsfuse bucket-name /gcs --background
Mounting bucket bucket-name to directory /gcs

$ ls /gcs
my-file-1.hdf
my-file-2.hdf
my-file-3.hdf
...

Now we point our HDF library to a NetCDF file in that directory (which actually points to an object on Google Cloud Storage), and it happily uses C File objects to read and write data. The operating system passes the read/write requests to gcsfs, which goes out to the cloud to get data, and then hands it back to the operating system, which hands it to HDF. All normal HDF operations just work, although they may now be significantly slower. The cloud is further away than local disk.

This slowdown is significant because the HDF library makes many small 4kB reads in order to gather the metadata necessary to pull out a chunk of data. Each of those tiny reads made sense when the data was local, but now each one triggers a separate web request. This means that users can sit for minutes just to open a file.

Fortunately, we can be clever. By buffering and caching data, we can reduce the number of web requests. For example, when asked to download 4kB we actually download 100kB or 1MB. If some of the future 4kB reads are within this 1MB then we can return them immediately. Looking at HDF traces, it looks like we can probably reduce “dozens” of web requests to “a few”.
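
Here is a rough sketch of that read-ahead idea in Python (illustrative only; this is not the actual gcsfs implementation):

class ReadAheadFile:
    """Wrap a file-like object and serve small reads from a larger cached block."""
    def __init__(self, raw, blocksize=2**20):
        self.raw = raw              # underlying file-like object (e.g. a cloud object)
        self.blocksize = blocksize  # fetch at least 1 MB per remote request
        self.cache_start = None
        self.cache = b''
        self.offset = 0

    def seek(self, offset):
        self.offset = offset

    def read(self, size):
        start, end = self.offset, self.offset + size
        cached = (self.cache_start is not None
                  and start >= self.cache_start
                  and end <= self.cache_start + len(self.cache))
        if not cached:                       # cache miss: fetch a larger block
            self.raw.seek(start)
            self.cache = self.raw.read(max(size, self.blocksize))
            self.cache_start = start
        data = self.cache[start - self.cache_start:end - self.cache_start]
        self.offset = end
        return data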

FUSE also requires elevated operating system permissions, which can introduce challenges if working from Docker containers (which is likely on the cloud). Docker containers running FUSE need to be running in privileged mode. There are some tricks around this, but generally FUSE brings some excess baggage.

HDF and a Custom Reader

The HDF library doesn’t need to use C File pointers, we can extend it to use other storage backends as well. Virtual File Layers are an extension mechanism within HDF5 that could allow it to target cloud object stores. This has already been done to support Amazon’s S3 object store twice:

  1. Once by the HDF group, S3VFD (currently private),
  2. Once by Joe Jevnik and Scott Sanderson (Quantopian) at https://h5s3.github.io/h5s3/ (highly experimental)

This provides an alternative to FUSE that is better because it doesn’t require privileged access, but is worse because it only solves this problem for HDF and not all file access.

In either case we’ll need to do look-ahead buffering and caching to get reasonable performance (or see below).

Centralize Metadata

Alternatively, we might centralize metadata in the HDF file in order to avoid many hops throughout that file. This would remove the need to perform clever file-system caching and buffering tricks.

Here is a brief technical explanation from Joe Jevnik:

Regarding the centralization of metadata: this is already a feature of hdf5 and is used by many of the built-in drivers. This optimization is enabled by setting the H5FD_FEAT_AGGREGATE_METADATA and H5FD_FEAT_ACCUMULATE_METADATA feature flags in your VFL driver’s query function. The first flag says that the hdf5 library should pre-allocate a large region to store metadata, future metadata allocations will be served from this pool. The second flag says that the library should buffer metadata changes into large chunks before dispatching the VFL driver’s write function. Both the default driver (sec2) and h5s3 enable these optimizations. This is further supported by using the H5FD_FLMAP_DICHOTOMY free list option which uses one free list for metadata allocations and another for non-metadata allocations. If you really want to ensure that the metadata is aggregated, even without a repack, you can use the built-in ‘multi’ driver which dispatches different allocation flavors to their own driver.

Distributed Service

We could offload this problem to a company, like the non-profit HDF group or a for-profit cloud provider like Google, Amazon, or Microsoft. They would solve this problem however they like, and expose a web API that we can hit for our data.

This would be a distributed service of several computers on the cloud near our data that takes our requests for what data we want, performs whatever tricks it deems appropriate to get that data, and then delivers it to us. This fleet of machines will still have to address the problems listed above, but we can let them figure it out, and presumably they’ll learn as they go.

However, this has both technical and social costs. Technically this is complex, and they’ll have to handle a new set of issues around scalability, consistency, and so on that are already solved(ish) in the cloud object stores. Socially this creates an intermediary between us and our data, which we may not want both for reasons of cost and trust.

The HDF group is working on such a service, HSDS that works on Amazon’s S3 (or anything that looks like S3). They have created a h5pyd library that is a drop-in replacement for the popular h5py Python library.

Presumably a cloud provider, like Amazon, Google, or Microsoft could do this as well. By providing open standards like OpenDAP they might attract more science users onto their platform to more efficiently query their cloud-hosted datasets.

The satellite imagery company Planet already has such a service.

New Formats for Scientific Data

Alternatively, we can move on from the HDF file format, and invent a new data storage specification that fits cloud storage (or other storage) more cleanly without worrying about supporting the legacy layout of existing HDF files.

This has already been going on, informally, for years. Most often we see people break large arrays into blocks, store each block as a separate object in the cloud object store with a suggestive name, and store a metadata file describing how the blocks relate to each other. This looks something like the following:

/metadata.json
/0.0.0.dat
/0.0.1.dat
/0.0.2.dat
 ...
/10.10.8.dat
/10.10.9.dat
/10.10.10.dat

There are many variants:

  • They might extend this to have groups or sub-arrays in sub-directories.
  • They might choose to compress the individual blocks in the .dat files or not.
  • They might choose different encoding schemes for the metadata and the binary blobs.

But generally most array people on the cloud do something like this with their research data, and they’ve been doing it for years. It works efficiently, is easy to understand and manage, and transfers well to any cloud platform, onto a local file system, or even into a standalone zip file or small database.

There are two groups that have done this in a more mature way, defining both modular standalone libraries to manage their data and proper specification documents that inform others how to interpret this data in a long-term stable way.

These are both well maintained and well designed libraries (by my judgment), in Python and Java respectively. They offer layouts like the layout above, although with more sophistication. Entertainingly, their specs are similar enough that another library, Z5, built a cross-compatible parser for each in C++. This unintended uniformity is a good sign. It means that both developer groups were avoiding creativity, and have converged on a sensible common solution. I encourage you to read the Zarr Spec in particular.
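
For a sense of what these libraries feel like in practice, here is a minimal Zarr example (the store path and sizes are arbitrary):

import zarr

# a chunked array backed by a directory of block files, much like the layout above
z = zarr.open('example.zarr', mode='w',
              shape=(10000, 10000), chunks=(1000, 1000), dtype='f8')
z[:1000, :1000] = 1.0   # writes only the blocks that this slice touches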

However, technical merits are not alone sufficient to justify a shift in data format, especially for the archival datasets of record that we’re discussing. The institutions in charge of this data have multi-decade horizons and so move slowly. For them, moving off of the historical community standard would be a major shift.

And so we need to answer a couple of difficult questions:

  1. How hard is it to make HDF efficient in the cloud?
  2. How hard is it to shift the community to a new standard?

A Sense of Urgency

These questions are important now. NASA and other agencies are pushing NetCDF data into the Cloud today and will be increasing these rates substantially in the coming years.

From earthdata.nasa.gov/cmr-and-esdc-in-cloud (via Ryan Abernathey)

From its current cumulative archive size of almost 22 petabytes (PB), the volume of data in the EOSDIS archive is expected to grow to almost 247 PB by 2025, according to estimates by NASA’s Earth Science Data Systems (ESDS) Program. Over the next five years, the daily ingest rate of data into the EOSDIS archive is expected to reach 114 terabytes (TB) per day, with the upcoming joint NASA/Indian Space Research Organization Synthetic Aperture Radar (NISAR) mission (scheduled for launch by 2021) contributing an estimated 86 TB per day of data when operational.

This is only one example of many agencies in many domains pushing scientific data to the cloud.

Acknowledgements

Thanks to Joe Jevnik (Quantopian), John Readey (HDF Group), Rich Signell (USGS), and Ryan Abernathey (Columbia University) for their feedback when writing this article. This conversation started within the Pangeo collaboration.

Dask Release 0.17.0

This work is supported by Anaconda Inc. and the Data Driven Discovery Initiative from the Moore Foundation.

I’m pleased to announce the release of Dask version 0.17.0. This is a significant major release with new features, breaking changes, and stability improvements. This blogpost outlines notable changes since the 0.16.0 release on November 21st.

You can conda install Dask:

conda install dask -c conda-forge

or pip install from PyPI:

pip install dask[complete] --upgrade

Full changelogs are available here:

Some notable changes follow.

Deprecations

  • Removed dask.dataframe.rolling_* methods, which were previously deprecated both in dask.dataframe and in pandas. These are replaced with the rolling.* namespace (see the short migration sketch after this list)
  • We’ve generally stopped maintenance of the dask-ec2 project to launch dask clusters on Amazon’s EC2 using Salt. We generally recommend Kubernetes instead, both for Amazon’s EC2 and for Google and Azure as well

    dask.pydata.org/en/latest/setup/kubernetes.html

  • Internal state of the distributed scheduler has changed significantly. This may affect advanced users who were inspecting this state for debugging or diagnostics.
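
As a migration sketch for the rolling deprecation above (df is a hypothetical dask dataframe with a numeric column x; the accessor mirrors pandas):

# before (removed):  dd.rolling_mean(df.x, 3)
# after: use the rolling accessor, matching pandas
result = df.x.rolling(window=3).mean()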

Task Ordering

As Dask encounters more complex problems from more domains we continually run into problems where its current heuristics do not perform optimally. This release includes a rewrite of our static task prioritization heuristics. This will improve Dask’s ability to traverse complex computations in a way that keeps memory use low.

To aid debugging we also integrated these heuristics into the GraphViz-style plots that come from the visualize method.

x = da.random.random(...)
...
x.visualize(color='order', cmap='RdBu')

Nested Joblib

Dask supports parallelizing Scikit-Learn by extending Scikit-Learn’s underlying library for parallelism, Joblib. This allows Dask to distribute some SKLearn algorithms across a cluster just by wrapping them with a context manager.

This relationship has been strengthened, and particular attention has been focused when nesting one parallel computation within another, such as occurs when you train a parallel estimator, like RandomForest, within another parallel computation, like GridSearchCV. Previously this would result in spawning too many threads/processes and generally oversubscribing hardware.

Due to recent combined development within both Joblib and Dask, these sorts of situations can now be resolved efficiently by handing them off to Dask, providing speedups even in single-machine cases:

from sklearn.externals import joblib
import distributed.joblib  # register the dask joblib backend
from dask.distributed import Client

client = Client()

est = ParallelEstimator()
gs = GridSearchCV(est)

with joblib.parallel_backend('dask'):
    gs.fit()

See Tom Augspurger’s recent post with more details about this work:

Thanks to Tom Augspurger, Jim Crist, and Olivier Grisel who did most of this work.

Scheduler Internal Refactor

The distributed scheduler has been significantly refactored to change it from a forest of dictionaries:

priority = {'a': 1, 'b': 2, 'c': 3}
dependencies = {'a': {'b'}, 'b': {'c'}, 'c': []}
nbytes = {'a': 1000, 'b': 1000, 'c': 28}

To a bunch of objects:

tasks = {'a': Task('a', priority=1, nbytes=1000, dependencies=...),
         'b': Task('b', priority=2, nbytes=1000, dependencies=...),
         'c': Task('c', priority=3, nbytes=28, dependencies=[])}

(there is much more state than what is listed above, but hopefully the examples above are clear.)

There were a few motivations for this:

  1. We wanted to try out Cython and PyPy, for which objects like this might be more effective than dictionaries.
  2. We believe that this is probably a bit easier for developers new to the schedulers to understand. The proliferation of state dictionaries was not highly discoverable.

Goal one ended up not working out. We have not yet been able to make the scheduler significantly faster under Cython or PyPy with this new layout. There is even a slight memory increase with these changes. However we have been happy with the results in code readability, and we hope that others find this useful as well.

Thanks to Antoine Pitrou, who did most of the work here.

User Priorities

You can now submit tasks with different priorities.

x = client.submit(f, 1, priority=10)   # Higher priority preferred
y = client.submit(f, 1, priority=-10)  # Lower priority happens later

To be clear, Dask has always had priorities; they just weren’t easily user-settable. Higher priorities are given precedence. The default priority for all tasks is zero. You can also submit priorities for collections (like arrays and dataframes):

df = df.persist(priority=5)  # give this computation higher priority.

Several related projects are also undergoing releases:

  • Tornado is updating to version 5.0 (there is a beta out now). This is a major change that will put Tornado on the Asyncio event loop in Python 3. It also includes many performance enhancements for high-bandwidth networks.
  • Bokeh 0.12.14 was just released.

    Note that you will need to update Dask to work with this version of Bokeh

  • Daskernetes, a new project for launching Dask on Kubernetes clusters

Acknowledgements

The following people contributed to the dask/dask repository since the 0.16.0 release on November 14th:

  • Albert DeFusco
  • Apostolos Vlachopoulos
  • castalheiro
  • James Bourbeau
  • Jon Mease
  • Ian Hopkinson
  • Jakub Nowacki
  • Jim Crist
  • John A Kirkham
  • Joseph Lin
  • Keisuke Fujii
  • Martijn Arts
  • Martin Durant
  • Matthew Rocklin
  • Markus Gonser
  • Nir
  • Rich Signell
  • Roman Yurchak
  • S. Andrew Sheppard
  • sephib
  • Stephan Hoyer
  • Tom Augspurger
  • Uwe L. Korn
  • Wei Ji
  • Xander Johnson

The following people contributed to the dask/distributed repository since the 1.20.0 release on November 14th:

  • Alexander Ford
  • Antoine Pitrou
  • Brett Naul
  • Brian Broll
  • Bruce Merry
  • Cornelius Riemenschneider
  • Daniel Li
  • Jim Crist
  • Kelvin Yang
  • Matthew Rocklin
  • Min RK
  • rqx
  • Russ Bubley
  • Scott Sievert
  • Tom Augspurger
  • Xander Johnson

Craft Minimal Bug Reports

Following up on a post on supporting users in open source this post lists some suggestions on how to ask a maintainer to help you with a problem.

You don’t have to follow these suggestions. They are optional. They make it more likely that a project maintainer will spend time helping you. It’s important to remember that their willingness to support you for free is optional too.

Crafting minimal bug reports is essential for the life and maintenance of community-driven open source projects. Doing this well is an incredible service to the community.

Minimal Complete Verifiable Examples

I strongly recommend following Stack Overflow’s guidelines on Minimal Complete Verifiable Examples. I’ll include brief highlights here:

… code should be …

  • Minimal – Use as little code as possible that still produces the same problem
  • Complete – Provide all parts needed to reproduce the problem
  • Verifiable – Test the code you’re about to provide to make sure it reproduces the problem

Let’s be clear: this is hard and takes time.

As a question-asker I find that creating an MCVE often takes 10-30 minutes for a simple problem. Fortunately this work is usually straightforward, even if I don’t know very much about the package I’m having trouble with. Most of the work to create a minimal example is about removing all of the code that was specific to my application, and as the question-asker I am probably the most qualified person to do that.

When answering questions I often point people to StackOverflow’s MCVE document. They sometimes come back with a better-but-not-yet-minimal example. This post clarifies a few common issues.

As a running example I’m going to use Pandas dataframe problems.

Don’t post data

You shouldn’t post the file that you’re working with. Instead, try to see if you can reproduce the problem with just a few lines of data rather than the whole thing.

Having to download a file, unzip it, etc. makes it much less likely that someone will actually run your example in their free time.

Don’t

I’ve uploaded my data to Dropbox and you can get it here: my-data.csv.gz

import pandas as pd

df = pd.read_csv('my-data.csv.gz')

Do

You should be able to copy-paste the following to get enough of my data to cause the problem:

import pandas as pd

df = pd.DataFrame({'account-start': ['2017-02-03', '2017-03-03', '2017-01-01'],
                   'client': ['Alice Anders', 'Bob Baker', 'Charlie Chaplin'],
                   'balance': [-1432.32, 10.43, 30000.00],
                   'db-id': [1234, 2424, 251],
                   'proxy-id': [525, 1525, 2542],
                   'rank': [52, 525, 32],
                   ...})

Actually don’t include your data at all

Actually, your data probably has lots of information that is very specific to your application. Your eyes gloss over it but a maintainer doesn’t know what is relevant and what isn’t, so it will take them time to digest it if you include it. Instead see if you can reproduce your same failure with artificial or random data.

Don’t

Here is enough of my data to reproduce the problem

import pandas as pd

df = pd.DataFrame({'account-start': ['2017-02-03', '2017-03-03', '2017-01-01'],
                   'client': ['Alice Anders', 'Bob Baker', 'Charlie Chaplin'],
                   'balance': [-1432.32, 10.43, 30000.00],
                   'db-id': [1234, 2424, 251],
                   'proxy-id': [525, 1525, 2542],
                   'rank': [52, 525, 32],
                   ...})

Do

My actual problem is about finding the best ranked employee over a certain time period, but we can reproduce the problem with this simpler dataset. Notice that the dates are out of order in this data (2000-01-02 comes after 2000-01-03). I found that this was critical to reproducing the error.

import pandas as pd

df = pd.DataFrame({'account-start': ['2000-01-01', '2000-01-03', '2000-01-02'],
                   'db-id': [1, 2, 3],
                   'name': ['Alice', 'Bob', 'Charlie']})

As we shrink down our example problem we often discover a lot about what causes the problem. This discovery is valuable and something that only the question-asker is capable of doing efficiently.

See how small you can make things

To make it even easier, see how small you can make your data. For example if working with tabular data (like Pandas), then how many columns do you actually need to reproduce the failure? How many rows do you actually need to reproduce the failure? Do the columns need to be named as you have them now or could they be just “A” and “B” or descriptive of the types within?

Do

import pandas as pd

df = pd.DataFrame({'datetime': ['2000-01-03', '2000-01-02'],
                   'id': [1, 2]})

Remove unnecessary steps

Is every line in your example absolutely necessary to reproduce the error? If you’re able to delete a line of code then please do. Because you already understand your problem you are much more efficient at doing this than the maintainer is. They probably know more about the tool, but you know more about your code.

Don’t

The groupby step below is raising a warning that I don’t understand

df = pd.DataFrame(...)
df = df[df.value > 0]
df = df.fillna(0)
df.groupby(df.x).y.mean()  # <-- this produces the error

Do

The groupby step below is raising a warning that I don’t understand

df = pd.DataFrame(...)
df.groupby(df.x).y.mean()  # <-- this produces the error

Use Syntax Highlighting

When using Github you can enclose code blocks in triple-backticks (the character on the top-left of your keyboard on US-standard QWERTY keyboards). It looks like this:

```python
x = 1
```

Provide complete tracebacks

You know all of that stuff between your code and the exception that is hard to make sense of? You should include it.

Don’t

I get a ZeroDivisionError from the following code:

```python
def div(x, y):
    return x / y

div(1, 0)
```

Do

I get a ZeroDivisionError from the following code:

```python
def div(x, y):
    return x / y

div(1, 0)
```

```python
ZeroDivisionError                         Traceback (most recent call last)
<ipython-input-4-7b96263abbfa> in <module>()
----> 1 div(1, 0)

<ipython-input-3-7685f97b4ce5> in div(x, y)
      1 def div(x, y):
----> 2     return x / y
      3

ZeroDivisionError: division by zero
```

If the traceback is long that’s ok. If you really want to be clean you can put it in <details> brackets.

I get a ZeroDivisionError from the following code:

```python
def div(x, y):
    return x / y

div(1, 0)
```

### Traceback

<details>

```python
ZeroDivisionError                         Traceback (most recent call last)
<ipython-input-4-7b96263abbfa> in <module>()
----> 1 div(1, 0)

<ipython-input-3-7685f97b4ce5> in div(x, y)
      1 def div(x, y):
----> 2     return x / y
      3

ZeroDivisionError: division by zero
```

</details>

Summer Student Projects 2018


Around this time of year students look for Summer projects. Often they get internships at potential future employers. Sometimes they become more engaged in open source software.

This blogpost contains some projects that I think are appropriate for a summer student in a computational field. They reflect my biases (which, assuming you read my blog, you’re ok with) and are by no means comprehensive of opportunities within the Scientific Python ecosystem. To be perfectly clear I’m only providing ideas and context here, I offer neither funding nor mentorship.

Criteria for a good project

  1. Is well defined and tightly scoped to reduce uncertainty about what a successful outcome looks like, and to reduce the necessity for high-level advising
  2. Is calibrated so that an industrious student can complete it in a few months
  3. It’s useful, but also peripheral. It has value to the ecosystem but is not critical enough that a core dev is likely to complete the task in the next few months, or be overly picky about the implementation.
  4. It’s interesting, and is likely to stimulate thought within the student
  5. It teaches valuable skills that will help the student in a future job search
  6. It can lead to future work, if the student makes a strong connection

The projects listed here target someone who already has decent knowledge of the fundamentals of the PyData or SciPy ecosystem (numpy, pandas, general understanding of programming, etc..). They are somewhat focused around Dask and other projects that I personally work on.

Distributed GPU NDArrays with CuPy, Bohrium, or other

Dask arrays coordinate many NumPy arrays to operate in parallel. Dask Array includes all of the parallel algorithms, leaving the in-memory implementation to the NumPy chunks.

But the chunk arrays don’t actually have to be NumPy arrays, they just have to look similar enough to NumPy arrays to fool Dask Array. We’ve done this before with sparse arrays which implement a subset of the numpy.ndarray API, but with sparse storage, and it has worked nicely.
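A rough sketch of the idea, assuming CuPy is installed and using `from_array`’s `asarray=` keyword to avoid coercing chunks back to NumPy (details would need verification against current Dask and CuPy; this is illustrative, not a tested recipe):

```python
# A minimal sketch: wrap a CuPy array so that each Dask chunk stays on the GPU.
import cupy
import dask.array as da

x = cupy.random.random((10000, 1000))
d = da.from_array(x, chunks=(1000, 1000), asarray=False)  # don't coerce chunks to NumPy
y = d.sum(axis=0)      # the parallel algorithm comes from Dask,
result = y.compute()   # the per-chunk work runs on the GPU via CuPy
```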

There are a few GPU NDArray projects out there, such as CuPy and Bohrium, that satisfy much of the NumPy interface.

It would be valuable to do the same thing with Dask Array and these GPU libraries. This might give us a decent general-purpose distributed GPU array relatively cheaply. This would engage the following:

  1. Knowledge of GPUs and performance implications of using them
  2. NumPy protocols (assuming that the GPU library will still need some changes to make it fully compatible)
  3. Distributed performance, focusing on bandwidths between various parts of the architecture
  4. Profiling and benchmarking

Github issue for conversation is here: dask/dask #3007

Use Numba and Dask for Numerical Simulations

While Python is very popular in data analytics it has been less successful in hard-core numeric algorithms and simulation, which are typically done in C++/Fortran and MPI. This is because Python is perceived to be too slow for serious numerical computing.

Yet with recent advances in Numba for fast in-core computing and Dask for parallel computing things may be changing. Certainly fine-tuned C++/Fortran + MPI can out-perform Numba and Dask, but by how much? If the answer is only 10% or so then it could be that the lower barrier to entry of Numba, or the dynamic scaling of Dask, can make them competitive in fields where Python has not previously had a major impact.

For which kinds of problems is a dynamic JITted language almost-as-good as C++/MPI? For which kinds of problems is the dynamic nature of these tools valuable, either due to more rapid development, greater flexibility in accepting community created modules, dynamic load balancing, or other reasons?

This project would require the student to come in with an understanding of their own field, the kinds of computational problems that are relevant there, and an understanding of the performance characteristics that might make dynamic systems tolerable. They would learn about optimization and profiling, and would characterize the relevant costs of dynamic languages in a slightly more modern era.
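As a rough illustration of the kind of pairing involved (the `smooth` kernel below is made up for this example, not taken from any benchmark), one might combine a Numba-compiled kernel with Dask’s `map_blocks`:

```python
import numba
import numpy as np
import dask.array as da

@numba.njit
def smooth(x):
    # simple three-point moving average, compiled by Numba
    out = np.empty_like(x)
    out[0] = x[0]
    out[-1] = x[-1]
    for i in range(1, x.shape[0] - 1):
        out[i] = (x[i - 1] + x[i] + x[i + 1]) / 3.0
    return out

x = da.random.random(1000000, chunks=100000)
y = x.map_blocks(smooth)   # Numba handles the inner loop, Dask the parallelism
y.mean().compute()
```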

Blocked Numerical Linear Algebra

Dask arrays contain some algorithms for blocked linear algebra, like least squares, QR, LU, Cholesky, etc.., but no particular attention has been paid to them.

It would be interesting to investigate the performance of these algorithms and compare them to proper distributed BLAS/LAPACK implementations. This will very likely lead to opportunities to improve the algorithms and possibly some of Dask’s internal machinery.
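For example, the existing tall-and-skinny QR routine could serve as a first benchmarking target (a minimal sketch):

```python
import dask
import dask.array as da

# Tall-and-skinny array: many more rows than columns, chunked along the rows
x = da.random.random((1000000, 100), chunks=(100000, 100))
q, r = da.linalg.qr(x)       # blocked QR factorization
q, r = dask.compute(q, r)    # materialize both factors in one pass
```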

Dask-R and Dask-Julia

Someone with understanding of R’s or Julia’s networking stack could adapt Dask’s distributed scheduler for those languages. Recall that the dask.distributed network consists of a central scheduler, many distributed workers, and one or more user-facing clients. Currently these are all written in Python and only really useful from that language.

Making this system useful in another language would require rewriting the client and worker code, but would not require rewriting the scheduler code, which is intentionally language agnostic. Fortunately the client and worker are both relatively simple codebases (relative to the scheduler at least) and minimal implementations could probably be written in around 1-2k lines each.

This would not provide the high-level collections like dask.array or dask.dataframe, but would provide all of the distributed networking, load balancing, resilience, etc.. that is necessary to build a distributed computing stack. It would also allow others to come later and build the high level collections that would be appropriate for that language (presumably R and Julia user communities don’t want exactly Pandas-style dataframe semantics anyway).

This is discussed further in dask/distributed #586 and has actually been partially implemented in Julia in the Invenia project.

This would require some knowledge of network programming and, ideally, async programming in either R or Julia.

High-Level NumPy Optimizations

Projects like Numpy and Dask array compute what the user says, even if a more efficient solution exists.

(x + 1)[:5]   # what the user said
x[:5] + 1     # faster and equivalent solution

It would be useful to have a project that exactly copies the Numpy API, but constructs a symbolic representation of that computation instead of performing the work. This would enable a few important use cases that we’ve seen arise recently. These include both applications from just analyzing the symbolic representation and also applications from changing it to a more optimal form:

  1. You could analyze this representation and warn users about intermediate stages that require a lot of RAM or compute time
  2. You could suggest ideal chunking patterns based on the full computation
  3. You could communicate this computation over the network to a remote server to perform the computation
  4. You could visualize the computation to help users or students understand what they’re computing
  5. You could manipulate the representation into more efficient forms (such as what is shown above)

The first part of this would be to construct a class that behaves like a Numpy array but constructs a symbolic tree representation instead. This would be similar to Sympy, Theano, Tensorflow, Blaze.expr or similar projects, but it would have much smaller scope and would not be at all creative in designing new APIs. I suspect that you could bootstrap this project quickly using systems like dask.array, which already do all of the shape and dtype computations for you. This is also a good opportunity to connect to recent and ongoing work in Numpy to establish protocols that allow other array libraries (like this one) to work smoothly with existing Numpy code.

Then you would start to build some of the analyses listed above on top of this representation. Some of these are harder than others to do robustly, but presumably they would get easier in time.
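To make the shape of the project concrete, here is a toy sketch of the kind of class involved; every name in it is invented for illustration and none of it comes from an existing library:

```python
# A toy symbolic array: it records operations instead of performing them.
class Expr:
    def __init__(self, op, *args):
        self.op, self.args = op, args

    def __add__(self, other):
        return Expr('add', self, other)

    def __getitem__(self, index):
        return Expr('getitem', self, index)

    def __repr__(self):
        return '%s(%s)' % (self.op, ', '.join(map(repr, self.args)))

x = Expr('array', 'x')
expr = (x + 1)[:5]
# expr is now getitem(add(array('x'), 1), slice(0, 5, None));
# an optimizer could rewrite it to add(getitem(array('x'), slice(0, 5, None)), 1)
```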

Dask Release 0.17.2


This work is supported by Anaconda Inc. and the Data Driven Discovery Initiative from the Moore Foundation.

I’m pleased to announce the release of Dask version 0.17.2. This is a minor release with new features and stability improvements. This blogpost outlines notable changes since the 0.17.0 release on February 12th.

You can conda install Dask:

conda install dask

or pip install from PyPI:

pip install dask[complete] --upgrade

Full changelogs are available here:

Some notable changes follow:

Tornado 5.0

Tornado is a popular framework for concurrent network programming that Dask relies on heavily. Tornado recently released a major version update that included some major features for Dask, as well as a couple of bugs.

The new IOStream.read_into method allows Dask communications (or anyone using this API) to move large datasets more efficiently over the network with fewer copies. This enables Dask to take advantage of high performance networking available on modern super-computers. On the Cheyenne system, where we tested this, we were able to get the full 3GB/s bandwidth available through the Infiniband network with this change (when using a few worker processes).

Many thanks to Antoine Pitrou and Ben Darnell for their efforts on this.

At the same time there were some unforeseen issues in the update to Tornado 5.0. More pervasive use of bytearrays over bytes caused issues with compression libraries like Snappy and Python 2 that were not expecting these types. There is a brief window in distributed.__version__ == 1.21.3 that enables this functionality if Tornado 5.0 is present but will misbehave if Snappy is also present.

HTTP File System

Dask leverages a file-system-like protocol for access to remote data. This is what makes commands like the following work:

import dask.dataframe as dd

df = dd.read_parquet('s3://...')
df = dd.read_parquet('hdfs://...')
df = dd.read_parquet('gcs://...')

We have now added http and https file systems for reading data directly from web servers. These also support random access if the web server supports range queries.

df = dd.read_parquet('https://...')

As with S3, HDFS, GCS, … you can also use these tools outside of Dask development. Here we read the first twenty bytes of the Pandas license:

from dask.bytes.http import HTTPFileSystem

http = HTTPFileSystem()
with http.open('https://raw.githubusercontent.com/pandas-dev/pandas/master/LICENSE') as f:
    print(f.read(20))
b'BSD 3-Clause License'

Thanks to Martin Durant who did this work and manages Dask’s byte handling generally. See remote data documentation for more information.

Fixed a correctness bug in Dask dataframe’s shuffle

We identified and resolved a correctness bug in dask.dataframe’s shuffle that resulted in some rows being dropped during complex operations like joins and groupby-applies with many partitions.

See dask/dask #3201 for more information.

Cluster super-class and intelligent adaptive deployments

There are many Python subprojects that help you deploy Dask on different cluster resource managers like Yarn, SGE, Kubernetes, PBS, and more. These have all converged to have more-or-less the same API that we have now combined into a consistent interface that downstream projects can inherit from in distributed.deploy.Cluster.

Now that we have a consistent interface we have started to invest more in improving the interface and intelligence of these systems as a group. This includes both pleasant IPython widgets like the following:

as well as improved logic around adaptive deployments. Adaptive deployments allow clusters to scale themselves automatically based on current workload. If you have recently submitted a lot of work the scheduler will estimate its duration and ask for an appropriate number of workers to finish the computation quickly. When the computation has finished the scheduler will release the workers back to the system to free up resources.

The logic here has improved substantially including the following:

  • You can specify minimum and maximum limits on your adaptivity (see the sketch after this list)
  • The scheduler estimates computation duration and asks for workers appropriately
  • There is some additional delay in giving back workers to avoid hysteresis, or cases where we repeatedly ask for and return workers
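A minimal sketch of what this looks like from the user side, using LocalCluster as a stand-in for any of the deployment-specific Cluster subclasses (the exact keyword names should be checked against the version you run):

```python
from dask.distributed import Client, LocalCluster

cluster = LocalCluster(n_workers=0)    # start with no workers
cluster.adapt(minimum=2, maximum=20)   # let the cluster scale between 2 and 20 workers
client = Client(cluster)
# submit work as usual; workers are added and released with the workload
```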

Some news from related projects:

  • The young daskernetes project was renamed to dask-kubernetes. This displaces a previous project (that had not been released) for launching Dask on Google Cloud Platform. That project has been renamed to dask-gke.
  • A new project, dask-jobqueue, was started to handle launching Dask clusters on traditional batch queuing systems like PBS, SLURM, SGE, TORQUE, etc.. This project grew out of the Pangeo collaboration
  • A Dask Helm chart has been added to Helm’s stable channel

Acknowledgements

The following people contributed to the dask/dask repository since the 0.17.0 release on February 12th:

  • Anderson Banihirwe
  • Dan Collins
  • Dieter Weber
  • Gabriele Lanaro
  • John Kirkham
  • James Bourbeau
  • Julien Lhermitte
  • Matthew Rocklin
  • Martin Durant
  • Max Epstein
  • nkhadka
  • okkez
  • Pangeran Bottor
  • Rich Postelnik
  • Scott M. Edenbaum
  • Simon Perkins
  • Thrasibule
  • Tom Augspurger
  • Tor E Hagemann
  • Uwe L. Korn
  • Wes Roach

The following people contributed to the dask/distributed repository since the 1.21.0 release on February 12th:

  • Alexander Ford
  • Andy Jones
  • Antoine Pitrou
  • Brett Naul
  • Joe Hamman
  • John Kirkham
  • Loïc Estève
  • Matthew Rocklin
  • Matti Lyra
  • Sven Kreiss
  • Thrasibule
  • Tom Augspurger

Beyond Numpy Arrays in Python


Executive Summary

In recent years Python’s array computing ecosystem has grown organically to support GPUs, sparse, and distributed arrays. This is wonderful and a great example of the growth that can occur in decentralized open source development.

However to solidify this growth and apply it across the ecosystem we now need to do some central planning to move from a pair-wise model where packages need to know about each other to an ecosystem model where packages can negotiate by developing and adhering to community-standard protocols.

With moderate effort we can define a subset of the Numpy API that works well across all of them, allowing the ecosystem to more smoothly transition between hardware. This post describes the opportunities and challenges to accomplish this.

We start by discussing two kinds of libraries:

  1. Libraries that implement the Numpy API
  2. Libraries that consume the Numpy API and build new functionality on top of it

Libraries that Implement the Numpy API

The Numpy array is one of the foundations of the numeric Python ecosystem, and serves as the standard model for similar libraries in other languages. Today it is used to analyze satellite and biomedical imagery, financial models, genomes, oceans and the atmosphere, super-computer simulations, and data from thousands of other domains.

However, Numpy was designed several years ago, and its implementation is no longer optimal for some modern hardware, particularly multi-core workstations, many-core GPUs, and distributed clusters.

Fortunately other libraries implement the Numpy array API on these other architectures:

  • CuPy: implements the Numpy API on GPUs with CUDA
  • Sparse: implements the Numpy API for sparse arrays that are mostly zeros
  • Dask array: implements the Numpy API in parallel for multi-core workstations or distributed clusters

So even when the Numpy implementation is no longer ideal, the Numpy API lives on in successor projects.

Note: the Numpy implementation remains ideal most of the time. Dense in-memory arrays are still the common case. This blogpost is about the minority of cases where Numpy is not ideal

So today we can write similar code across Numpy, GPU, sparse, and parallel arrays:

import numpy as np

x = np.random.random(...)        # Runs on a single CPU
y = x.T.dot(np.log(x) + 1)
z = y - y.mean(axis=0)
print(z[:5])

import cupy as cp

x = cp.random.random(...)        # Runs on a GPU
y = x.T.dot(cp.log(x) + 1)
z = y - y.mean(axis=0)
print(z[:5].get())

import dask.array as da

x = da.random.random(...)        # Runs on many CPUs
y = x.T.dot(da.log(x) + 1)
z = y - y.mean(axis=0)
print(z[:5].compute())

...

Additionally, each of the deep learning frameworks (TensorFlow, PyTorch, MXNet) has a Numpy-like thing that is similar-ish to Numpy’s API, but definitely not trying to be an exact match.

Libraries that consume and extend the Numpy API

At the same time as the development of Numpy APIs for different hardware, many libraries today build algorithmic functionality on top of the Numpy API:

  1. XArray for labeled and indexed collections of arrays
  2. Autograd and Tangent: for automatic differentiation
  3. TensorLy for higher order array factorizations
  4. Dask array which coordinates many Numpy-like arrays into a logical parallel array

    (dask array both consumes and implements the Numpy API)

  5. Opt Einsum for more efficient einstein summation operations

These projects and more enhance array computing in Python, building on new features beyond what Numpy itself provides.

There are also projects like Pandas, Scikit-Learn, and SciPy, that use Numpy’s in-memory internal representation. We’re going to ignore these libraries for this blogpost and focus on those libraries that only use the high-level Numpy API and not the low-level representation.

Opportunities and Challenges

Given the two groups of projects:

  1. New libraries that implement the Numpy API (CuPy, Sparse, Dask array)
  2. New libraries that consume and extend the Numpy API (XArray, Autograd/tangent, TensorLy, Einsum)

We want to use them together, applying Autograd to CuPy, TensorLy to Sparse, and so on, including all future implementations that might follow. This is challenging.

Unfortunately, while all of the array implementations’ APIs are very similar to Numpy’s API, their functions are distinct objects.

>>> numpy.sin is cupy.sin
False

This creates problems for the consumer libraries, because now they need to switch out which functions they use depending on which array-like objects they’ve been given.

def f(x):
    if isinstance(x, numpy.ndarray):
        return np.sin(x)
    elif isinstance(x, cupy.ndarray):
        return cupy.sin(x)
    elif ...

Today each array project implements a custom plugin system that they use to switch between some of the array options. Links to these plugin mechanisms are below if you’re interested:

For example XArray can use either Numpy arrays or Dask arrays. This has been hugely beneficial to users of that project, which today seamlessly transition from small in-memory datasets on their laptops to 100TB datasets on clusters, all using the same programming model. However when considering adding sparse or GPU arrays to XArray’s plugin system, it quickly became clear that this would be expensive today.

Building, maintaining, and extending these plugin mechanisms is costly. The plugin systems in each project are not alike, so any new array implementation has to go to each library and build the same code several times. Similarly, any new algorithmic library must build plugins to every ndarray implementation. Each library has to explicitly import and understand each other library, and has to adapt as those libraries change over time. This coverage is not complete, and so users lack confidence that their applications are portable between hardware.

Pair-wise plugin mechanisms make sense for a single project, but are not an efficient choice for the full ecosystem.

Solutions

I see two solutions today:

  1. Build a new library that holds dispatch-able versions of all of the relevant Numpy functions and convince everyone to use it instead of Numpy internally

  2. Build this dispatch mechanism into Numpy itself

Each has challenges.

Build a new centralized plugin library

We can build a new library, here called arrayish, that holds dispatch-able versions of all of the relevant Numpy functions. We then convince everyone to use it instead of Numpy internally.

So in each array-like library’s codebase we write code like the following:

# inside numpy's codebase
import arrayish
import numpy

@arrayish.sin.register(numpy.ndarray, numpy.sin)
@arrayish.cos.register(numpy.ndarray, numpy.cos)
@arrayish.dot.register(numpy.ndarray, numpy.ndarray, numpy.dot)
...

# inside cupy's codebase
import arrayish
import cupy

@arrayish.sin.register(cupy.ndarray, cupy.sin)
@arrayish.cos.register(cupy.ndarray, cupy.cos)
@arrayish.dot.register(cupy.ndarray, cupy.ndarray, cupy.dot)
...

and so on for Dask, Sparse, and any other Numpy-like libraries.

In all of the algorithm libraries (like XArray, autograd, TensorLy, …) we use arrayish instead of Numpy

# inside XArray's codebase
# import numpy
import arrayish as numpy

This is the same plugin solution as before, but now we build a community standard plugin system that hopefully all of the projects can agree to use.

This reduces the big n-by-m cost of maintaining several plugin systems to a more manageable n-plus-m cost of using a single plugin system in each library. This centralized project would also benefit, perhaps, from being better maintained than any individual project is likely to manage on its own.

However this has costs:

  1. Getting many different projects to agree on a new standard is hard
  2. Algorithmic projects will need to start using arrayish internally, adding new imports like the following:

    import arrayish as numpy

    And this will certainly cause some complications internally

  3. Someone needs to build and maintain the central infrastructure

Hameer Abbasi put together a rudimentary prototype for arrayish here: github.com/hameerabbasi/arrayish. There has been some discussion about this topic, using XArray+Sparse as an example, in pydata/sparse #1

Dispatch from within Numpy

Alternatively, the central dispatching mechanism could live within Numpy itself.

Numpy functions could learn to hand control over to their arguments, allowing the array implementations to take over when possible. This would allow existing Numpy code to work on externally developed array implementations.

There is precedent for this. The __array_ufunc__ protocol allows any class that defines the __array_ufunc__ method to take control of any Numpy ufunc like np.sin or np.exp. Numpy reductions like np.sum already look for .sum methods on their arguments and defer to them if possible.

Some array projects, like Dask and Sparse, already implement the __array_ufunc__ protocol. There is also an open PR for CuPy. Here is an example showing Numpy functions on Dask arrays cleanly.

>>> import numpy as np
>>> import dask.array as da
>>> x = da.ones(10, chunks=(5,))   # A Dask array

>>> np.sum(np.exp(x))              # Apply a Numpy function to a Dask array
dask.array<sum-aggregate, shape=(), dtype=float64, chunksize=()>   # get a Dask array

I recommend that all Numpy-API compatible array projects implement the __array_ufunc__ protocol.
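For projects that haven’t seen it before, here is a toy sketch of what the protocol looks like; the LoggedArray class is invented purely for illustration:

```python
import numpy as np

class LoggedArray:
    """A tiny array wrapper that intercepts Numpy ufuncs via __array_ufunc__."""
    def __init__(self, data):
        self.data = np.asarray(data)

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        # Unwrap any LoggedArray inputs, apply the ufunc, and re-wrap the result
        arrays = [x.data if isinstance(x, LoggedArray) else x for x in inputs]
        print("intercepted %s" % ufunc.__name__)
        return LoggedArray(getattr(ufunc, method)(*arrays, **kwargs))

x = LoggedArray([1.0, 2.0, 3.0])
y = np.exp(x)   # prints "intercepted exp" and returns a LoggedArray
```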

This works for many functions, but not all. Other operations like tensordot, concatenate, and stack occur frequently in algorithmic code but are not covered here.

This solution avoids the community challenges of the arrayish solution above. Everyone is accustomed to aligning themselves to Numpy’s decisions, and relatively little code would need to be rewritten.

The challenge with this approach is that historically Numpy has moved more slowly than the rest of the ecosystem. For example the __array_ufunc__ protocol mentioned above was discussed for several years before it was merged. Fortunately Numpy has recently received funding to help it make changes like this more rapidly. The full-time developers hired under this funding have just started though, and it’s not clear how much of a priority this work is for them at first.

For what it’s worth I’d prefer to see this Numpy protocol solution take hold.

Final Thoughts

In recent years Python’s array computing ecosystem has grown organically to support GPUs, sparse, and distributed arrays. This is wonderful and a great example of the growth that can occur in decentralized open source development.

However to solidify this growth and apply it across the ecosystem we now need to do some central planning to move from a pair-wise model where packages need to know about each other to an ecosystem model where packages can negotiate by developing and adhering to community-standard protocols.

The community has done this transition before (Numeric + Numarray -> Numpy, the Scikit-Learn fit/predict API, etc..) usually with surprisingly positive results.

The open questions I have today are the following:

  1. How quickly can Numpy adapt to this demand for protocols while still remaining stable for its existing role as foundation of the ecosystem
  2. What algorithmic domains can be written in a cross-hardware way that depends only on the high-level Numpy API, and doesn’t require specialization at the data structure level. Clearly some domains exist (XArray, automatic differentiation), but how common are these?
  3. Once a standard protocol is in place, what other array-like implementations might arise? In-memory compression? Probabilistic? Symbolic?

Update

After discussing this topic at the May NumPy Developer Sprint at BIDS a few of us have drafted a Numpy Enhancement Proposal (NEP) available here.

Dask Release 0.18.0


This work is supported by Anaconda Inc.

I’m pleased to announce the release of Dask version 0.18.0. This is a major release with breaking changes and new features. The last release was 0.17.5 on May 4th. This blogpost outlines notable changes since the last release blogpost for 0.17.2 on March 21st.

You can conda install Dask:

conda install dask

or pip install from PyPI:

pip install dask[complete] --upgrade

Full changelogs are available here:

We list some breaking changes below, followed up by changes that are less important, but still fun.

Notable Breaking changes

Centralized configuration

Taking full advantage of Dask sometimes requires user configuration, especially in a distributed setting. This might be to control logging verbosity, specify cluster configuration, provide credentials for security, or any of several other options that arise in production.

We’ve found that different computing cultures like to specify configuration in several different ways:

  1. Configuration files
  2. Environment variables
  3. Directly within Python code

Previously this was handled with a variety of different solutions among the different dask subprojects. The dask-distributed project had one system, dask-kubernetes had another, and so on.

Now we centralize configuration in the dask.config module, which collects configuration from config files, environment variables, and runtime code, and makes it centrally available to all Dask subprojects. A number of Dask subprojects (dask.distributed, dask-kubernetes, and dask-jobqueue), are being co-released at the same time to take advantage of this.

If you were actively using Dask.distributed’s configuration files some things have changed:

  1. The configuration is now namespaced and more heavily nested. Here is an example from the dask.distributed default config file today:

    distributed:
      version: 2
      scheduler:
        allowed-failures: 3   # number of retries before a task is considered bad
        work-stealing: True   # workers should steal tasks from each other
        worker-ttl: null      # like '60s'. Workers must heartbeat faster than this
      worker:
        multiprocessing-method: forkserver
        use-file-locking: True
  2. The default configuration location has moved from ~/.dask/config.yaml to ~/.config/dask/distributed.yaml, where it will live along side several other files like kubernetes.yaml, jobqueue.yaml, and so on.

However, your old configuration files will still be found and their values will be used appropriately. We don’t make any attempt to migrate your old config values to the new location though. You may want to delete the auto-generated ~/.dask/config.yaml file at some point, if you felt like being particularly clean.
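At runtime the merged configuration is available through dask.config.get and dask.config.set; a minimal sketch using the allowed-failures value from the default file above (it assumes the distributed defaults have been loaded):

```python
import dask

# Read a value from the merged configuration (files, environment, runtime)
dask.config.get('distributed.scheduler.allowed-failures')   # 3 by default

# Override it for the current process (or use it as a context manager)
dask.config.set({'distributed.scheduler.allowed-failures': 10})
```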

You can learn more about Dask’s configuration in Dask’s configuration documentation

Replaced the common get= keyword with scheduler=

Dask can execute code with a variety of scheduler backends based on threads, processes, single-threaded execution, or distributed clusters.

Previously, users selected between these backends using the somewhat generically named get= keyword:

x.compute(get=dask.threaded.get)
x.compute(get=dask.multiprocessing.get)
x.compute(get=dask.local.get_sync)

We’ve replaced this with a newer, and hopefully more clear, scheduler= keyword:

x.compute(scheduler='threads')
x.compute(scheduler='processes')
x.compute(scheduler='single-threaded')

The get= keyword has been deprecated and will raise a warning. It will be removed entirely on the next major release.

For more information, see documentation on selecting different schedulers

Replaced dask.set_options with dask.config.set

Related to the configuration changes, we now include runtime state in the configuration. Previously people used to set runtime state with the dask.set_options context manager. Now we recommend using dask.config.set:

with dask.set_options(scheduler='threads'):  # Before
    ...

with dask.config.set(scheduler='threads'):   # After
    ...

The dask.set_options function is now an alias to dask.config.set.

Removed the dask.array.learn subpackage

This was unadvertised and saw very little use. All functionality (and much more) is now available in Dask-ML.

Other

  • We’ve removed the token= keyword from map_blocks and moved the functionality to the name= keyword.
  • The dask.distributed.worker_client automatically rejoins the threadpool when you close the context manager
  • The Dask.distributed protocol now interprets msgpack arrays as tuples rather than lists. As a reminder, using different versions of Dask.distributed workers/scheduler/clients is not supported.

Fun changes

Arrays

Generalized Universal Functions

Dask.array now supports Numpy-style Generalized Universal Functions (gufuncs) transparently. This means that you can apply normal Numpy GUFuncs, like eig in the example below, directly onto Dask arrays:

import dask.array as da
import numpy as np

# Apply a Numpy GUFunc, eig, directly onto a Dask array
x = da.random.normal(size=(10, 10, 10), chunks=(2, 10, 10))
w, v = np.linalg._umath_linalg.eig(x, output_dtypes=(float, float))
# w and v are dask arrays with eig applied along the latter two axes

Numpy has gufuncs of many of its internal functions, but they haven’t yet decided to switch these out to the public API. Additionally we can define GUFuncs with other projects, like Numba:

import numba
from numba import float64

@numba.vectorize([float64(float64, float64)])
def f(x, y):
    return x + y

z = f(x, y)  # if x and y are dask arrays, then z will be too

What I like about this is that Dask and Numba developers didn’t coordinate at all on this feature, it’s just that they both support the Numpy GUFunc protocol, so you get interactions like this for free.
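You can also define your own gufuncs from ordinary Python functions with the da.as_gufunc decorator; a brief sketch (gufoo is just an illustrative name):

```python
import numpy as np
import dask.array as da

@da.as_gufunc(signature="(i)->()", output_dtypes=float, vectorize=True)
def gufoo(x):
    return np.mean(x, axis=-1)   # reduce over the core dimension

x = da.random.normal(size=(10, 20), chunks=(5, 20))  # core dimension in one chunk
y = gufoo(x)                                         # a Dask array of shape (10,)
```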

For more information see Dask’s GUFunc documentation. This work was done by Markus Gonser (@magonser).

New “auto” value for rechunking

Dask arrays now accept a value, “auto”, wherever a chunk value would previously be accepted. This asks Dask to rechunk those dimensions to achieve a good default chunk size.

x = x.rechunk({
    0: x.shape[0],  # single chunk in this dimension
    # 1: 100e6 / x.dtype.itemsize / x.shape[0],  # before we had to calculate manually
    1: 'auto',      # now we allow this dimension to respond to get an ideal chunk size
})

# or
x = da.from_array(img, chunks='auto')

This also checks the array.chunk-size config value for optimal chunk sizes

>>> dask.config.get('array.chunk-size')
'128MiB'

To be clear though, this doesn’t do “automatic chunking”, which is a very hard problem in general. Users still need to be aware of their computations and how they want to chunk, this just makes it marginally easier to make good decisions.

Algorithmic improvements

Dask.array gained a full einsum implementation thanks to Simon Perkins.

Also, thanks to work from Jeremy Chan, Dask.array’s QR decompositions became nicer in two ways:

  1. They support short-and-fat arrays
  2. The tall-and-skinny variant now operates more robustly in less memory. Here is a friendly GIF of execution:

Native support for the Zarr format for chunked n-dimensional arrays landed thanks to Martin Durant and John A Kirkham. Zarr has been especially useful due to its speed, simple spec, support of the full NetCDF style conventions, and amenability to cloud storage.
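A minimal sketch of round-tripping an array through Zarr (the store path here is illustrative):

```python
import dask.array as da

x = da.random.random((10000, 10000), chunks=(1000, 1000))
da.to_zarr(x, 'mydata.zarr')      # write the chunked array to a Zarr store
y = da.from_zarr('mydata.zarr')   # lazily read it back
```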

Dataframes

As usual, Dask Dataframes had many small improvements. Of note is continued compatibility with Pandas as they released 0.23, and some new data ingestion formats.

Pandas 0.23.0 support

Thanks to Tom Augspurger Dask.dataframe is consistent with changes in the recent Pandas 0.23 release.

Orc support

Dask.dataframe has grown a reader for the Apache ORC format.

Orc is a format for tabular data storage that is common in the Hadoop ecosystem. The new dd.read_orc function parallelizes around similarly new ORC functionality within PyArrow. Thanks to Jim Crist for the work on the Arrow side and Martin Durant for parallelizing it with Dask.

Read_json support

Dask.dataframe now has also grown a reader for JSON files.

The dd.read_json function matches most of the pandas.read_json API.

This came about shortly after a recent PyCon 2018 talk comparing Spark and Dask dataframe where Irina Truong mentioned that it was missing. Thanks to Martin Durant and Irina Truong for this contribution.
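Minimal sketches of the two new readers (the paths and options are illustrative):

```python
import dask.dataframe as dd

orc_df = dd.read_orc('s3://bucket/path/*.orc')         # Apache ORC files
json_df = dd.read_json('data/records.*.json',
                       orient='records', lines=True)   # line-delimited JSON
```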

See the dataframe data ingestion documentation for more information about either JSON, ORC, or any of the other formats supported by Dask.dataframe.

Joblib

The Joblib library for parallel computing within Scikit-Learn has had a Dask backend for a while now. While it has always been pretty easy to use, it’s now becoming much easier to use well without much expertise. After using this in practice for a while together with the Scikit-Learn developers, we’ve identified and smoothed over a number of usability issues. These changes will only be fully available after the next Scikit-Learn release (hopefully soon) at which point we’ll probably release a new blogpost dedicated to the topic.
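For reference, the context-manager pattern looks roughly like this today (a sketch assuming scikit-learn’s vendored joblib and a running Dask cluster):

```python
from dask.distributed import Client
from sklearn.externals import joblib

client = Client()  # start or connect to a Dask cluster

with joblib.parallel_backend("dask"):
    # any joblib-backed scikit-learn code here runs on the Dask workers
    ...
```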

This release is timed with the following packages:

  1. dask
  2. distributed
  3. dask-kubernetes

There is also a new repository for deploying applications on YARN (a job scheduler common in Hadoop environments) called skein. Early adopters welcome.

Acknowledgements

Since March 21st, the following people have contributed to the following repositories:

The core Dask repository for parallel algorithms:

  • Andrethrill
  • Beomi
  • Brendan Martin
  • Christopher Ren
  • Guido Imperiale
  • Diane Trout
  • fjetter
  • Frederick
  • Henry Doupe
  • James Bourbeau
  • Jeremy Chen
  • Jim Crist
  • John A Kirkham
  • Jon Mease
  • Jörg Dietrich
  • Kevin Mader
  • Ksenia Bobrova
  • Larsr
  • Marc Pfister
  • Markus Gonser
  • Martin Durant
  • Matt Lee
  • Matthew Rocklin
  • Pierre-Bartet
  • Scott Sievert
  • Simon Perkins
  • Stefan van der Walt
  • Stephan Hoyer
  • Tom Augspurger
  • Uwe L. Korn
  • Yu Feng

The dask/distributed repository for distributed computing:

  • Bmaisonn
  • Grant Jenks
  • Henry Doupe
  • Irene Rodriguez
  • Irina Truong
  • John A Kirkham
  • Joseph Atkins-Turkish
  • Kenneth Koski
  • Loïc Estève
  • Marius van Niekerk
  • Martin Durant
  • Matthew Rocklin
  • Olivier Grisel
  • Russ Bubley
  • Tom Augspurger
  • Tony Lorenzo

The dask-kubernetes repository for deploying Dask on Kubernetes

  • Brendan Martin
  • J Gerard
  • Matthew Rocklin
  • Olivier Grisel
  • Yuvi Panda

The dask-jobqueue repository for deploying Dask on HPC job schedulers

  • Guillaume Eynard-Bontemps
  • jgerardsimcock
  • Joe Hamman
  • Joseph Hamman
  • Loïc Estève
  • Matthew Rocklin
  • Ray Bell
  • Rich Signell
  • Shawn Taylor
  • Spencer Clark

The dask-ml repository for scalable machine learning:

  • Christopher Ren
  • Jeremy Chen
  • Matthew Rocklin
  • Scott Sievert
  • Tom Augspurger