
Dask Scaling Limits


This work is supported by Anaconda Inc.

History

For the first year of Dask’s life it focused exclusively on single node parallelism. We felt then that efficiently supporting 100+GB datasets on personal laptops or 1TB datasets on large workstations was a sweet spot for productivity, especially when avoiding the pain of deploying and configuring distributed systems. We still believe in the efficiency of single-node parallelism, but in the years since, Dask has extended itself to support larger distributed systems.

After that first year, Dask focused equally on both single-node and distributed parallelism. We maintain two entirely separate schedulers, one optimized for each case. This allows Dask to be very simple to use on single machines, but also scale up to thousand-node clusters and 100+TB datasets when needed with the same API.

Dask’s distributed system has a single central scheduler and many distributed workers. This is a common architecture today that scales out to a few thousand nodes. Roughly speaking Dask scales about the same as a system like Apache Spark, but less well than a high-performance system like MPI.

An Example

Most Dask examples in blogposts or talks are on modestly sized datasets, usually in the 10-50GB range. This, combined with Dask's history of focusing on medium-sized data on single machines, may have given people a more modest impression of Dask's scale than is warranted.

As a small nudge, here is an example using Dask to interact with 50 36-core nodes on an artificial terabyte dataset.

This is a typical size for a modest Dask cluster. We usually see Dask deployments either in the tens of machines (usually on Hadoop-style or ad-hoc enterprise clusters) or in the few-thousand range (usually on high performance computers or cloud deployments). We're showing the modest case here only due to lack of resources; everything in that example should work fine when scaled out another couple orders of magnitude.

Challenges to Scaling Out

For the rest of the article we'll talk about common issues that we see getting in the way of scaling out today. These observations come from experience with both the open source community and private contracts.

Simple Map-Reduce style

If you’re doing simple map-reduce style parallelism then things will be pretty smooth out to a large number of nodes. However, there are still some limitations to keep in mind:

  1. The scheduler will have at least one, and possibly a few connections open to each worker. You’ll want to ensure that your machines can have many open file handles at once. Some Linux distributions cap this at 1024 by default, but it is easy to change.

  2. The scheduler has an overhead of around 200 microseconds per task. So if each task takes one second then your scheduler can saturate 5000 cores, but if each task takes only 100ms then your scheduler can only saturate around 500 cores, and so on. Task duration imposes an inversely proportional constraint on scaling.

    If you want to scale beyond this then each task will need to do more work to amortize the overhead. Often this means moving inner for loops inside individual tasks rather than spreading each iteration across many small tasks (see the sketch after this list).
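The back-of-the-envelope arithmetic from point 2, along with a quick check of the open-file limit from point 1, can be sketched in a few lines of Python. This is illustrative only; the resource module is POSIX-specific.

import resource

# Point 1: check the per-process open-file limit (Linux/macOS)
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("open file limit:", soft, "(soft)", hard, "(hard)")

# Point 2: the number of cores the scheduler can keep busy is roughly
# the task duration divided by the per-task scheduler overhead
overhead = 200e-6                 # ~200 microseconds per task
for duration in [1.0, 0.1, 0.01]:
    print(f"{duration}s tasks -> ~{int(duration / overhead)} saturable cores")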

More complex algorithms

If you’re doing more complex algorithms (which is common among Dask users) then many more things can break along the way. High performance computing isn’t about doing any one thing well, it’s about doing nothing badly. This section lists a few issues that arise for larger deployments:

  1. Dask collection algorithms may be suboptimal.

    The parallel algorithms in Dask-array/bag/dataframe/ml are pretty good, but as Dask scales out to larger clusters and its algorithms are used by more domains we invariably find that small corners of the API fail beyond a certain point. Luckily these are usually pretty easy to fix after they are reported.

  2. The graph size may grow too large for the scheduler

    The metadata describing your computation has to all fit on a single machine, the Dask scheduler. This metadata, the task graph, can grow large if you're not careful. It's nice to have a scheduler process with at least a few gigabytes of memory if you're going to be processing million-node task graphs (see the sketch after this list for a quick way to count tasks ahead of time). A task takes up around 1kB of memory if you're careful to avoid closing over any unnecessary local data.

  3. The graph serialization time may become annoying for interactive use

    Again, if you have million-node task graphs you're going to be serializing them and passing them from the client to the scheduler. This is fine, assuming they fit at both ends, but it can take some time and limit interactivity. If you press compute and nothing shows up on the dashboard for a minute or two, this is what's happening.

  4. The interactive dashboard plots stop being as useful

    Those beautiful plots on the dashboard were mostly designed for deployments with 1-100 nodes, but not 1000s. Seeing the start and stop time of every task of a million-task computation just isn’t something that our brains can fully understand.

    This is something that we would like to improve. If anyone out there is interested in scalable performance diagnostics, please get involved.

  5. Other components that you rely on, like distributed storage, may also start to break

    Dask provides users more power than they’re accustomed to. It’s easy for them to accidentally clobber some other component of their systems, like distributed storage, a local database, the network, and so on, with too many requests.

    Many of these systems provide abstractions that are very well tested and stable for normal single-machine use, but that quickly become brittle when you have a thousand machines acting on them with the full creativity of a novice user. Dask provides some primitives like distributed locks and queues to help control access to these resources, but it's on the user to use them well and not break things.
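As a rough way to gauge point 2 above before running anything, you can count the tasks in a collection's graph. The array and chunk sizes below are arbitrary; the point is only that the count is available up front.

import dask.array as da

# 10,000 chunks of 1,000 x 1,000 elements each
x = da.random.random((100_000, 100_000), chunks=(1_000, 1_000))
y = (x + x.T).sum()

# Number of tasks the scheduler will have to hold in memory
print(len(y.__dask_graph__()))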

Conclusion

Dask scales happily out to tens of nodes, like in the example above, or to thousands of nodes, which I’m not showing here simply due to lack of resources.

Dask provides this scalability while still maintaining the flexibility and freedom to build custom systems that have defined the project since it began. However, the combination of scalability and freedom makes it hard for Dask to fully protect users from breaking things. It's much easier to protect users when you can constrain what they can do. When users stick to standard workflows like Dask dataframe or Dask array they'll probably be fine, but when operating with full creativity at the thousand-node scale some expertise will invariably be necessary. We try hard to provide the diagnostics and tools necessary to investigate issues and control operation. The project is getting better at this every day, in large part due to some expert users out there.

A Call for Examples

Do you use Dask on more than one machine to do interesting work? We’d love to hear about it either in the comments below, or in this online form.


Dask Development Log


This work is supported by Anaconda Inc

To increase transparency I’m trying to blog more often about the current work going on around Dask and related projects. Nothing here is ready for production. This blogpost is written in haste, so refined polish should not be expected.

Current efforts for June 2018 in Dask and Dask-related projects include the following:

  1. Yarn Deployment
  2. More examples for machine learning
  3. Incremental machine learning
  4. HPC Deployment configuration

Yarn deployment

Dask developers often get asked "How do I deploy Dask on my Hadoop/Spark/Hive cluster?" We haven't had a very good answer until recently.

Most Hadoop/Spark/Hive clusters are actually Yarn clusters. Yarn is the cluster manager behind most clusters that run Hadoop/Spark/Hive jobs, including any cluster purchased from a vendor like Cloudera or Hortonworks. If your application can run on Yarn then it can be a first-class citizen there.

Unfortunately Yarn has really only been accessible through a Java API, and so has been difficult for Dask to interact with. That’s changing now with a few projects, including:

  • dask-yarn: an easy way to launch Dask on Yarn clusters
  • skein: an easy way to launch generic services on Yarn clusters (this is primarily what backs dask-yarn)
  • conda-pack: an easy way to bundle a conda environment into a relocatable archive, which is useful when launching Python applications on Yarn

This work is all being done by Jim Crist who is, I believe, currently writing up a blogpost about the topic at large. Dask-yarn was soft-released last week though, so people should give it a try and report feedback on the dask-yarn issue tracker. If you ever wanted direct help on your cluster, now is the right time because Jim is working on this actively and is not yet drowned in user requests so generally has a fair bit of time to investigate particular cases.

from dask_yarn import YarnCluster
from dask.distributed import Client

# Create a cluster where each worker has two cores and eight GB of memory
cluster = YarnCluster(environment='environment.tar.gz',
                      worker_vcores=2,
                      worker_memory="8GB")

# Scale out to ten such workers
cluster.scale(10)

# Connect to the cluster
client = Client(cluster)

More examples for machine learning

Dask maintains a Binder of simple examples that show off various ways to use the project. This allows people to click a link on the web and quickly be taken to a Jupyter notebook running on the cloud. It’s a fun way to quickly experience and learn about a new project.

Previously we had a single example for arrays, dataframes, delayed, machine learning, etc.

Now Scott Sievert is expanding the examples within the machine learning section. He has submitted the following two so far:

  1. Incremental training with Scikit-Learn and large datasets
  2. Dask and XGBoost

I believe he’s planning on more. If you use dask-ml and have recommendations or want to help, you might want to engage in the dask-ml issue tracker or dask-examples issue tracker.

Incremental training

The incremental training mentioned as an example above is also new-ish. This is a Scikit-Learn style meta-estimator that wraps around other estimators that support the partial_fit method. It enables training on large datasets in an incremental or batchwise fashion.

Before

from sklearn.linear_model import SGDClassifier

sgd = SGDClassifier(...)

import pandas as pd

for filename in filenames:
    df = pd.read_csv(filename)
    X, y = ...
    sgd.partial_fit(X, y)

After

from sklearn.linear_model import SGDClassifier
from dask_ml.wrappers import Incremental

sgd = SGDClassifier(...)
inc = Incremental(sgd)

import dask.dataframe as dd

df = dd.read_csv(filenames)
X, y = ...
inc.fit(X, y)

Analysis

From a parallel computing perspective this is a very simple and un-sexy way of doing things. However my understanding is that it’s also quite pragmatic. In a distributed context we leave a lot of possible computation on the table (the solution is inherently sequential) but it’s fun to see the model jump around the cluster as it absorbs various chunks of data and then moves on.

Incremental training with Dask-ML

There’s ongoing work on how best to combine this with other work like pipelines and hyper-parameter searches to fill in the extra computation.

This work was primarily done by Tom Augspurger with help from Scott Sievert

Dask User Stories

Dask developers are often asked “Who uses Dask?”. This is a hard question to answer because, even though we’re inundated with thousands of requests for help from various companies and research groups, it’s never fully clear who minds having their information shared with others.

We’re now trying to crowdsource this information in a more explicit way by having users tell their own stories. Hopefully this helps other users in their field understand how Dask can help and when it might (or might not) be useful to them.

We originally collected this information in a Google Form but have since then moved it to a Github repository. Eventually we’ll publish this as a proper web site and include it in our documentation.

If you use Dask and want to share your story this is a great way to contribute to the project. Arguably Dask needs more help with spreading the word than it does with technical solutions.

HPC Deployments

The Dask Jobqueue package for deploying Dask on traditional HPC machines is nearing another release. We’ve changed around a lot of the parameters and configuration options in order to improve the onboarding experience for new users. It has been going very smoothly in recent engagements with new groups, but will mean a breaking change for existing users of the sub-project.

Who uses Dask?


This work is supported by Anaconda Inc

People often ask general questions like “Who uses Dask?” or more specific questions like the following:

  1. For what applications do people use Dask dataframe?
  2. How many machines do people often use with Dask?
  3. How far does Dask scale?
  4. Does Dask get used on imaging data?
  5. Does anyone use Dask with Kubernetes/Yarn/SGE/Mesos/… ?
  6. Does anyone in the insurance industry use Dask?

These questions lead to interesting and productive conversations in which new users can dive into historical use cases, which informs whether and how they use the project in the future.

New users can learn a lot from existing users.

To further enable this conversation we’ve made a new tiny project, dask-stories. This is a small documentation page where people can submit how they use Dask and have that published for others to see.

To seed this site six generous users have written down how their group uses Dask. You can read about them here:

  1. Sidewalk Labs: Civic Modeling
  2. Genome Sequencing for Mosquitoes
  3. Full Spectrum: Credit and Banking
  4. Ice Cube: Detecting Cosmic Rays
  5. Pangeo: Earth Science
  6. NCAR: Hydrologic Modeling

We've focused on a few questions, available in our template, that emphasize problems over technology and invite negative as well as positive feedback, to give a complete picture.

  1. Who am I?
  2. What problem am I trying to solve?
  3. How does Dask help?
  4. What pain points did I run into with Dask?
  5. What technology do I use around Dask?

Easy to Contribute

Contributions to this site are simple Markdown documents submitted as pull requests to github.com/dask/dask-stories. The site is then built with ReadTheDocs and updated immediately. We tried to make this as smooth and familiar to our existing userbase as possible.

This is important. Sharing real-world experiences like this is probably more valuable than code contributions to the Dask project at this stage. Dask is more technically mature than it is well-known. Users look to other users to help them understand a project (think of every time you've Googled for "some tool in some topic").

If you use Dask today in an interesting way then please share your story. The world would love to hear your voice.

If you maintain another project you might consider implementing the same model. I hope that this proves successful enough for other projects in the ecosystem to reuse.

Dask Development Log, Scipy 2018


This work is supported by Anaconda Inc

To increase transparency I’m trying to blog more often about the current work going on around Dask and related projects. Nothing here is ready for production. This blogpost is written in haste, so refined polish should not be expected.

Last week many Dask developers gathered for the annual SciPy 2018 conference. As a result, very little work was completed, but many projects were started or discussed. To reflect this change in activity this blogpost will highlight possible changes and opportunities for readers to further engage in development.

Dask on HPC Machines

The dask-jobqueue project was a hit at the conference. Dask-jobqueue helps people launch Dask on traditional job schedulers like PBS, SGE, SLURM, Torque, LSF, and others that are commonly found on high performance computers. These machines are very common among scientific, research, and high performance machine learning groups, but are often a bit hard to use with anything other than MPI.
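For readers who haven't seen it, usage looks roughly like the sketch below. The queue name, resource requests, and walltime are site-specific placeholders rather than recommendations.

from dask_jobqueue import PBSCluster
from dask.distributed import Client

cluster = PBSCluster(cores=36,              # cores per job
                     memory="100GB",        # memory per job
                     queue="regular",       # placeholder queue name
                     walltime="02:00:00")
cluster.scale(10)                           # ask the job scheduler for ten such workers
client = Client(cluster)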

This project came up in the Pangeo talk, lightning talks, and the Dask Birds of a Feather session.

During the sprints a number of people came by and we went through the process of configuring Dask on common supercomputers like Cheyenne, Titan, and Cori. This process usually takes around fifteen minutes and will likely be the subject of a future blogpost. We published known-good configurations for these clusters on our configuration documentation.

Additionally, there is a JupyterHub issue to improve documentation on best practices to deploy JupyterHub on these machines. The community has done this well a few times now, and it might be time to write up something for everyone else.

Get involved

If you have access to a supercomputer then please try things out. There is a 30-minute Youtube video screencast on the dask-jobqueue documentation that should help you get started.

If you are an administrator on a supercomputer you might consider helping to build a configuration file and place it in /etc/dask for your users. You might also want to get involved in the JupyterHub on HPC conversation.

Dask / Scikit-learn talk

Olivier Grisel and Tom Augspurger prepared and delivered a great talk on the current state of the new Dask-ML project.

MyBinder and Bokeh Servers

Not a Dask change, but Min Ragan-Kelley showed how to run services through mybinder.org that are not just Jupyter notebooks. As an example, here is a repository that deploys a Bokeh server application with a single click.

I think that by composing with Binder, Min has effectively created a free-to-use hosted Bokeh server service. Presumably this same model could be adapted to other applications just as easily.

Dask and Automated Machine Learning with TPOT

Dask and TPOT developers are discussing parallelizing the automated machine learning tool TPOT.

TPOT uses genetic algorithms to search over a space of scikit-learn style pipelines to automatically find a decently performing pipeline and model. This involves a fair amount of computation which Dask can help to parallelize out to multiple machines.

Get involved

Trivial things work now, but to make this efficient we’ll need to dive in a bit more deeply. Extending that pull request to dive within pipelines would be a good task if anyone wants to get involved. This would help to share intermediate results between pipelines.

Dask and Scikit-Optimize

Among various features, Scikit-optimize offers a BayesSearchCV object that is like Scikit-Learn's GridSearchCV and RandomizedSearchCV, but is a bit smarter about how to choose new parameters to test given previous results. Hyper-parameter optimization is a low-hanging fruit for Dask-ML workloads today, so we investigated how the project might help here.
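To give a concrete sense of the kind of experiment this is, the sketch below runs BayesSearchCV with its cross-validation fits farmed out through the Dask joblib backend. This is one plausible way to wire the pieces together, not the actual code under discussion, and it assumes a joblib/distributed combination in which creating a Client makes the "dask" backend available.

import joblib
from dask.distributed import Client
from skopt import BayesSearchCV
from sklearn.svm import SVC
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)
client = Client()                      # local cluster for demonstration

search = BayesSearchCV(
    SVC(),
    {"C": (1e-3, 1e+3, "log-uniform"),
     "gamma": (1e-4, 1e-1, "log-uniform")},
    n_iter=16,
)

# Individual fits run on Dask workers through joblib
with joblib.parallel_backend("dask"):
    search.fit(X, y)

print(search.best_params_)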

So far we're just experimenting with the Scikit-Learn/Dask integration through joblib to see what opportunities there are. Discussion among Dask and Scikit-Optimize developers is happening here:

Centralize PyData/Scipy tutorials on Binder

We’re putting a bunch of the PyData/Scipy tutorials on Binder, and hope to embed snippets of Youtube videos into the notebooks themselves.

This effort lives here:

Motivation

The PyData and SciPy community delivers tutorials as part of most conferences. This activity generates both educational Jupyter notebooks and explanatory videos that teach people how to use the ecosystem.

However, this content isn’t very discoverable after the conference. People can search on Youtube for their topic of choice and hopefully find a link to the notebooks to download locally, but this is a somewhat noisy process. It’s not clear which tutorial to choose and it’s difficult to match up the video with the notebooks during exercises. We’re probably not getting as much value out of these resources as we could be.

To help increase access we’re going to try a few things:

  1. Produce a centralized website with links to recent tutorials delivered for each topic
  2. Ensure that those notebooks run easily on Binder
  3. Embed sections of the talk on Youtube within each notebook so that the explanation of the section is tied to the exercises

Get involved

This only really works long-term under a community maintenance model. So far we’ve only done a few hours of work and there is still plenty to do in the following tasks:

  1. Find good tutorials for inclusion
  2. Ensure that they work well on mybinder.org
    • are self-contained and don’t rely on external scripts to run
    • have an environment.yml or requirements.txt
    • don’t require a lot of resources
  3. Find video for the tutorial
  4. Submit a pull request to each tutorial repository that embeds a link to the YouTube talk, at the proper time, in the top cell of the notebook

Dask, Actors, and Ray

I really enjoyed the talk on Ray, another distributed task scheduler for Python. I suspect that Dask will steal ideas about actors for stateful operation. I hope that Ray takes on ideas for using standard Python interfaces so that more of the community can adopt it more quickly. I encourage people to check out the talk and give Ray a try. It's pretty slick.

Planning conversations for Dask-ML

Dask and Scikit-learn developers had the opportunity to sit down again and raise a number of issues to help plan near-term development. This focused mostly around building important case studies to motivate future development, and identifying algorithms and other projects to target for near-term integration.

Case Studies

Algorithms

Get involved

We could use help in building out case studies to drive future development in the project. There are also several algorithmic places to get involved. Dask-ML is a young and fast-moving project with many opportunities for new developers to get involved.

Dask and UMAP for low-dimensional embeddings

Leland McInnes gave a great talk, Uniform Manifold Approximation and Projection for Dimensionality Reduction, in which he lays out a well-founded algorithm for dimensionality reduction, similar to PCA or t-SNE, but with some nice properties. He worked together with some Dask developers, and we identified some challenges due to Dask array slicing with random-ish slices.

A proposal to fix this problem lives here, if anyone wants a fun problem to work on:

Dask stories

We soft-launched Dask Stories, a webpage and project to collect and share stories about how people use Dask in practice. We're also publishing a separate blogpost about this today.

See blogpost: Who uses Dask?

If you use Dask and want to share your story we would absolutely welcome your experience. Having people like yourself share how they use Dask is incredibly important for the project.

Pickle isn't slow, it's a protocol


This work is supported by Anaconda Inc

tl;dr: Pickle isn't slow, it's a protocol. Protocols are important for ecosystems.

A recent Dask issue showed that using Dask with PyTorch was slow because sending PyTorch models between Dask workers took a long time (Dask GitHub issue).

This turned out to be because serializing PyTorch models with pickle was very slow (1 MB/s for GPU based models, 50 MB/s for CPU based models). There is no architectural reason why this needs to be this slow. Every part of the hardware pipeline is much faster than this.

We could have fixed this in Dask by special-casing PyTorch models (Dask has its own optional serialization system for performance), but being good ecosystem citizens, we decided to raise the performance problem in an issue upstream (PyTorch Github issue). This resulted in a five-line fix to PyTorch that turned a 1-50 MB/s serialization bandwidth into a 1 GB/s bandwidth, which is more than fast enough for many use cases (PR to PyTorch).

     def __reduce__(self):
-        return type(self), (self.tolist(),)
+        b = io.BytesIO()
+        torch.save(self, b)
+        return (_load_from_bytes, (b.getvalue(),))
+def _load_from_bytes(b):
+    return torch.load(io.BytesIO(b))

Thanks to the PyTorch maintainers this problem was solved pretty easily. PyTorch tensors and models now serialize efficiently in Dask or in any other Python library that might want to use them in distributed systems like PySpark, IPython parallel, Ray, or anything else without having to add special-case code or do anything special. We didn’t solve a Dask problem, we solved an ecosystem problem.

However before we solved this problem we discussed things a bit. This comment stuck with me:

[GitHub comment screenshot: a PyTorch maintainer saying that PyTorch's pickle implementation is slow and recommending PyTorch's specialized serialization functions instead]

This comment contains two beliefs that are both very common, and that I find somewhat counter-productive:

  1. Pickle is slow
  2. You should use our specialized methods instead

I'm sort of picking on the PyTorch maintainers here a bit (sorry!), but I've found that these beliefs are quite widespread, so I'd like to address them here.

Pickle is slow

Pickle is not slow. Pickle is a protocol. We implement pickle. If it’s slow then it is our fault, not Pickle’s.

To be clear, there are many reasons not to use Pickle.

  • It’s not cross-language
  • It’s not very easy to parse
  • It doesn’t provide random access
  • It’s insecure
  • etc..

So you shouldn't store your data or build public services with Pickle, but for things like moving data on a wire it's a great default choice if you're moving data strictly between Python processes in a trusted and uniform environment.

It's great because it's as fast as you can make it (up to a memory copy) and other libraries in the ecosystem can use it without needing to special-case your code into theirs.

This is the change we did for PyTorch.

     def __reduce__(self):
-        return type(self), (self.tolist(),)
+        b = io.BytesIO()
+        torch.save(self, b)
+        return (_load_from_bytes, (b.getvalue(),))
+def _load_from_bytes(b):
+    return torch.load(io.BytesIO(b))

The slow part wasn’t Pickle, it was the .tolist() call within __reduce__ that converted a PyTorch tensor into a list of Python ints and floats. I suspect that the common belief of “Pickle is just slow” stopped anyone else from investigating the poor performance here. I was surprised to learn that a project as active and well maintained as PyTorch hadn’t fixed this already.

As a reminder, you can implement the pickle protocol by providing the __reduce__ method on your class. The __reduce__ function returns a loading function and sufficient arguments to reconstitute your object. Here we used torch’s existing save/load functions to create a bytestring that we could pass around.
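The same pattern works for any object that already knows how to serialize itself efficiently. Here is a minimal sketch for a hypothetical class wrapping a large NumPy array; the class and helper names are made up for illustration.

import io
import numpy as np


def _load_model(raw):
    # Loading function returned by __reduce__: rebuild the object from bytes
    return Model(np.load(io.BytesIO(raw)))


class Model:
    def __init__(self, weights):
        self.weights = weights

    def __reduce__(self):
        # Serialize with an efficient binary format rather than Python lists
        buf = io.BytesIO()
        np.save(buf, self.weights)
        return (_load_model, (buf.getvalue(),))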

Just use our specialized option

Specialized options can be great. They can have nice APIs with many options, they can tune themselves to specialized communication hardware if it exists (like RDMA or NVLink), and so on. But people need to learn about them first, and learning about them can be hard in two ways.

Hard for users

Today we use a large and rapidly changing set of libraries. It's hard for users to become experts in all of them. Increasingly we rely on new libraries making it easy for us by adhering to standard APIs, providing informative error messages that lead to good behavior, and so on.

Hard for other libraries

Other libraries that need to interact definitely won’t read the documentation, and even if they did it’s not sensible for every library to special case every other library’s favorite method to turn their objects into bytes. Ecosystems of libraries depend strongly on the presence of protocols and a strong consensus around implementing them consistently and efficiently.

Sometimes Specialized Options are Appropriate

There are good reasons to support specialized options. Sometimes you need more than 1GB/s bandwidth. While this is rare in general (very few pipelines process faster than 1GB/s/node), it is true in the particular case of PyTorch when they are doing parallel training on a single machine with multiple processes. Soumith (PyTorch maintainer) writes the following:

When sending Tensors over multiprocessing, our custom serializer actually shortcuts them through shared memory, i.e. it moves the underlying Storages to shared memory and restores the Tensor in the other process to point to the shared memory. We did this for the following reasons:

  • Speed: we save on memory copies, especially if we amortize the cost of moving a Tensor to shared memory before sending it into the multiprocessing Queue. The total cost of actually moving a Tensor from one process to another ends up being O(1), and independent of the Tensor’s size

  • Sharing: If Tensor A and Tensor B are views of each other, once we serialize and send them, we want to preserve this property of them being views. This is critical for neural-nets where it’s common to re-view the weights / biases and use them for another. With the default pickle solution, this property is actually lost.

Dask Development Log


This work is supported by Anaconda Inc

To increase transparency I’m trying to blog more often about the current work going on around Dask and related projects. Nothing here is ready for production. This blogpost is written in haste, so refined polish should not be expected.

Over the last two weeks we’ve seen activity in the following areas:

  1. An experimental Actor solution for stateful processing
  2. Machine learning experiments with hyper-parameter selection and parameter servers.
  3. Development of more preprocessing transformers
  4. Statistical profiling of the distributed scheduler’s internal event loop thread and internal optimizations
  5. A new release of dask-yarn
  6. A new narrative on dask-stories about modelling mobile networks
  7. Support for LSF clusters in dask-jobqueue
  8. Test suite cleanup for intermittent failures

Stateful processing with Actors

Some advanced workloads want to directly manage and mutate state on workers. A task-based framework like Dask can be forced into this kind of workload using long-running-tasks, but it’s an uncomfortable experience. To address this we’ve been adding an experimental Actors framework to Dask alongside the standard task-scheduling system. This provides reduced latencies, removes scheduling overhead, and provides the ability to directly mutate state on a worker, but loses niceties like resilience and diagnostics.

The idea to adopt Actors was shamelessly stolen from the Ray Project :)

Work for Actors is happening in dask/distributed #2133.

class Counter:
    def __init__(self):
        self.n = 0

    def increment(self):
        self.n += 1
        return self.n

counter = client.submit(Counter, actor=True).result()

>>> future = counter.increment()
>>> future.result()
1

Machine learning experiments

Hyper parameter optimization on incrementally trained models

Many Scikit-Learn-style estimators feature a partial_fit method that enables incremental training on batches of data. This is particularly well suited to systems like Dask array or Dask dataframe, which are built from many batches of Numpy arrays or Pandas dataframes. It's a nice fit because all of the computational algorithm work is already done in Scikit-Learn; Dask just has to administratively move models around to data and call scikit-learn (or other machine learning models that follow the fit/transform/predict/score API). This approach provides a nice community interface between parallelism and machine learning developers.

However, this training is inherently sequential because the model only trains on one batch of data at a time. We’re leaving a lot of processing power on the table.

To address this we can combine incremental training with hyper-parameter selection and train several models on the same data at the same time. This is often required anyway, and lets us be more efficient with our computation.

However there are many ways to do incremental training with hyper-parameter selection, and the right algorithm likely depends on the problem at hand. This is an active field of research and so it’s hard for a general project like Dask to pick and implement a single method that works well for everyone. There is probably a handful of methods that will be necessary with various options on them.

To support experimentation here we've been building some lower-level tooling that we think will be helpful in a variety of cases. It accepts a policy from the user as a Python function that gets scores from recent evaluations and says how much further to progress on each set of hyper-parameters before checking in again. This lets us express a few common approaches, like random search with early-stopping conditions, successive halving, and variations of those, without having to write any Dask code.
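As a very rough sketch of what such a policy function might look like (the function name, and the exact shape of its inputs and outputs, are illustrative rather than the real interface):

def stop_on_plateau(scores):
    """Given a history of scores per hyper-parameter set, say how many
    more partial_fit calls each should get before the next check-in."""
    instructions = {}
    for ident, history in scores.items():
        if len(history) >= 2 and history[-1] <= history[-2]:
            instructions[ident] = 0    # stopped improving: drop this candidate
        else:
            instructions[ident] = 1    # keep training on one more batch
    return instructions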

This work is done by Scott Sievert and myself

Successive halving and random search

Parameter Servers

To improve the speed of training large models Scott Sievert has been using Actors (mentioned above) to develop simple examples of parameter servers. These are helping to identify and motivate performance and diagnostic improvements within Dask itself:

Dataframe Preprocessing Transformers

We've started to orient some of the Dask-ML work around case studies. Our first, written by Scott Sievert, uses the Criteo dataset for ads. It's a good example of a combined dense/sparse dataset that can be somewhat large (around 1TB). The first challenge we're running into is preprocessing. This has led to a few preprocessing improvements:

Some of these are also based off of improved dataframe handling features in the upcoming 0.20 release for Scikit-Learn.

This work is done by Roman Yurchak, James Bourbeau, Daniel Severo, and Tom Augspurger.

Profiling the main thread

Profiling concurrent code is hard. Traditional profilers like cProfile become confused when control passes between many different coroutines. This means that we haven't done a very comprehensive job of profiling and tuning the distributed scheduler and workers. Statistical profilers, on the other hand, tend to do a bit better. We've taken the statistical profiler that we usually use on Dask worker threads (available in the dashboard on the "Profile" tab) and applied it to the central administrative threads running the Tornado event loop as well. This has highlighted a few issues that we weren't able to spot before, and should hopefully result in reduced overhead in future releases.

Profile of event loop thread

New release of Dask-Yarn

There is a new release of Dask-Yarn and the underlying library for managing Yarn jobs, Skein. These include a number of bug-fixes and improved concurrency primitives for YARN applications

This work was done by Jim Crist

Support for LSF clusters in Dask-Jobqueue

Dask-jobqueue supports Dask use on traditional HPC cluster managers like SGE, SLURM, PBS, and others. We’ve recently added support for LSF clusters

Work was done in dask/dask-jobqueue #78 by Ray Bell.

New Dask Story on mobile networks

The Dask Stories repository holds narratives about how people use Dask. Sameer Lalwani recently added a story about using Dask to model mobile communication networks. It's worth a read.

Test suite cleanup

The dask.distributed test suite has been suffering from intermittent failures recently. These are tests that fail very infrequently, and so are hard to catch when writing them, but show up when future unrelated PRs run the test suite on continuous integration and get failures. They add friction to the development process, but are expensive to track down (testing distributed systems is hard).

We’re taking a bit of time this week to track these down. Progress here:

Building SAGA optimization for Dask arrays


This work is supported by ETH Zurich, Anaconda Inc, and the Berkeley Institute for Data Science

At a recent Scikit-learn/Scikit-image/Dask sprint at BIDS, Fabian Pedregosa (a machine learning researcher and Scikit-learn developer) and Matthew Rocklin (Dask core developer) sat down together to develop an implementation of the incremental optimization algorithm SAGA on parallel Dask datasets. The result is a sequential algorithm that can be run on any dask array, and so allows the data to be stored on disk or even distributed among different machines.

It was interesting both to see how the algorithm performed and to see the ease and challenges of running a research algorithm on a Dask distributed dataset.

Start

We started with an initial implementation that Fabian had written for Numpy arrays using Numba. The following code solves an optimization problem of the form $\min_x \sum_{i=1}^{n} f(a_i^T x, b_i)$:

import numpy as np
from numba import njit
from sklearn.linear_model.sag import get_auto_step_size
from sklearn.utils.extmath import row_norms


@njit
def deriv_logistic(p, y):
    # derivative of logistic loss
    # same as in lightning (with minus sign)
    p *= y
    if p > 0:
        phi = 1. / (1 + np.exp(-p))
    else:
        exp_t = np.exp(p)
        phi = exp_t / (1. + exp_t)
    return (phi - 1) * y


@njit
def SAGA(A, b, step_size, max_iter=100):
    """
    SAGA algorithm

    A : n_samples x n_features numpy array
    b : n_samples numpy array with values -1 or 1
    """
    n_samples, n_features = A.shape
    memory_gradient = np.zeros(n_samples)
    gradient_average = np.zeros(n_features)
    x = np.zeros(n_features)  # vector of coefficients
    step_size = 0.3 * get_auto_step_size(row_norms(A, squared=True).max(),
                                         0, 'log', False)
    for _ in range(max_iter):
        # sample randomly
        idx = np.arange(memory_gradient.size)
        np.random.shuffle(idx)
        # .. inner iteration ..
        for i in idx:
            grad_i = deriv_logistic(np.dot(x, A[i]), b[i])
            # .. update coefficients ..
            delta = (grad_i - memory_gradient[i]) * A[i]
            x -= step_size * (delta + gradient_average)
            # .. update memory terms ..
            gradient_average += (grad_i - memory_gradient[i]) * A[i] / n_samples
            memory_gradient[i] = grad_i
        # monitor convergence
        print('gradient norm:', np.linalg.norm(gradient_average))
    return x

This implementation is a simplified version of the SAGA implementation that Fabian uses regularly as part of his research, and that assumes that $f$ is the logistic loss, i.e., $f(z) = \log(1 + \exp(-z))$. It can be used to solve problems with other values of $f$ by overwriting the function deriv_logistic.
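For example, a least-squares variant only needs the derivative of the squared loss with respect to the prediction. Something like the following (a sketch, not part of Fabian's code) could stand in for deriv_logistic:

from numba import njit


@njit
def deriv_squared(p, y):
    # derivative of 0.5 * (p - y)**2 with respect to the prediction p
    return p - y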

We wanted to apply it across a parallel Dask array by applying it to each chunk of the Dask array (a smaller Numpy array) one at a time, carrying a set of parameters along the way.

Development Process

In order to better understand the challenges of writing Dask algorithms, Fabian did most of the actual coding to start. Fabian is a good example of a researcher who knows how to program well and how to design ML algorithms, but who has no direct exposure to the Dask library. This was an educational opportunity both for Fabian and for Matt. Fabian learned how to use Dask, and Matt learned how to introduce Dask to researchers like Fabian.

Step 1: Build a sequential algorithm with pure functions

To start we actually didn't use Dask at all; instead, Fabian modified his implementation in a few ways:

  1. It should operate over a list of Numpy arrays. A list of Numpy arrays is similar to a Dask array, but simpler.
  2. It should separate blocks of logic into separate functions. These will eventually become tasks, so they should be sizable chunks of work. In this case, this led to the creation of the function _chunk_saga, which performs an iteration of the SAGA algorithm on a subset of the data.
  3. These functions should not modify their inputs, nor should they depend on global state. All information that those functions require (like the parameters that we’re learning in our algorithm) should be explicitly provided as inputs.

These requested modifications affect performance a bit; we end up making more copies of the parameters and more copies of intermediate state. In terms of programming difficulty this took a bit of time (around a couple of hours) but was a straightforward task that Fabian didn't seem to find challenging or foreign.

These changes resulted in the following code:

from numba import njit
from sklearn.utils.extmath import row_norms
from sklearn.linear_model.sag import get_auto_step_size


@njit
def _chunk_saga(A, b, n_samples, f_deriv, x, memory_gradient,
                gradient_average, step_size):
    # Make explicit copies of inputs
    x = x.copy()
    gradient_average = gradient_average.copy()
    memory_gradient = memory_gradient.copy()

    # Sample randomly
    idx = np.arange(memory_gradient.size)
    np.random.shuffle(idx)

    # .. inner iteration ..
    for i in idx:
        grad_i = f_deriv(np.dot(x, A[i]), b[i])

        # .. update coefficients ..
        delta = (grad_i - memory_gradient[i]) * A[i]
        x -= step_size * (delta + gradient_average)

        # .. update memory terms ..
        gradient_average += (grad_i - memory_gradient[i]) * A[i] / n_samples
        memory_gradient[i] = grad_i

    return x, memory_gradient, gradient_average


def full_saga(data, max_iter=100, callback=None):
    """
    data: list of (A, b), where A is a n_samples x n_features
    numpy array and b is a n_samples numpy array
    """
    n_samples = 0
    for A, b in data:
        n_samples += A.shape[0]
    n_features = data[0][0].shape[1]
    memory_gradients = [np.zeros(A.shape[0]) for (A, b) in data]
    gradient_average = np.zeros(n_features)
    x = np.zeros(n_features)
    steps = [get_auto_step_size(row_norms(A, squared=True).max(),
                                0, 'log', False)
             for (A, b) in data]
    step_size = 0.3 * np.min(steps)

    for _ in range(max_iter):
        for i, (A, b) in enumerate(data):
            x, memory_gradients[i], gradient_average = _chunk_saga(
                A, b, n_samples, deriv_logistic, x,
                memory_gradients[i], gradient_average, step_size)
        if callback is not None:
            print(callback(x, data))

    return x

Step 2: Apply dask.delayed

Once the functions neither modified their inputs nor relied on global state, we went over a dask.delayed example and then applied the @dask.delayed decorator to the functions that Fabian had written. Fabian did this in about five minutes and, to our mutual surprise, things actually worked.

@dask.delayed(nout=3)                       # <<<---- New
@njit
def _chunk_saga(A, b, n_samples, f_deriv, x, memory_gradient,
                gradient_average, step_size):
    ...


def full_saga(data, max_iter=100, callback=None):
    n_samples = 0
    for A, b in data:
        n_samples += A.shape[0]
    data = dask.persist(*data)              # <<<---- New
    ...
    for _ in range(max_iter):
        for i, (A, b) in enumerate(data):
            x, memory_gradients[i], gradient_average = _chunk_saga(
                A, b, n_samples, deriv_logistic, x,
                memory_gradients[i], gradient_average, step_size)
        cb = dask.delayed(callback)(x, data)  # <<<---- Changed
        x, cb = dask.persist(x, cb)           # <<<---- New
        print(cb.compute())

However, they didn't work that well. When we took a look at the Dask dashboard we found that there was a lot of dead space, a sign that we were still doing a lot of computation on the client side.

Step 3: Diagnose and add more dask.delayed calls

While things worked, they were also fairly slow. If you notice the dashboard plot above you’ll see that there is plenty of white in between colored rectangles. This shows that there are long periods where none of the workers is doing any work.

This is a common sign that we're mixing work between the workers (which shows up on the dashboard) and the client. The solution to this is usually more targeted use of dask.delayed. Dask delayed is trivial to start using, but does require some experience to use well. It's important to keep track of which operations and variables are delayed and which aren't. There is some cost to mixing between them.

At this point Matt stepped in and added delayed in a few more places and the dashboard plot started looking cleaner.

@dask.delayed(nout=3)                                           # <<<---- New
@njit
def _chunk_saga(A, b, n_samples, f_deriv, x, memory_gradient,
                gradient_average, step_size):
    ...


def full_saga(data, max_iter=100, callback=None):
    n_samples = 0
    for A, b in data:
        n_samples += A.shape[0]
    n_features = data[0][0].shape[1]
    data = dask.persist(*data)                                  # <<<---- New
    memory_gradients = [dask.delayed(np.zeros)(A.shape[0])
                        for (A, b) in data]                     # <<<---- Changed
    gradient_average = dask.delayed(np.zeros)(n_features)       # <<<---- Changed
    x = dask.delayed(np.zeros)(n_features)                      # <<<---- Changed
    steps = [dask.delayed(get_auto_step_size)(
                 dask.delayed(row_norms)(A, squared=True).max(),
                 0, 'log', False)
             for (A, b) in data]                                # <<<---- Changed
    step_size = 0.3 * dask.delayed(np.min)(steps)               # <<<---- Changed

    for _ in range(max_iter):
        for i, (A, b) in enumerate(data):
            x, memory_gradients[i], gradient_average = _chunk_saga(
                A, b, n_samples, deriv_logistic, x,
                memory_gradients[i], gradient_average, step_size)
        cb = dask.delayed(callback)(x, data)                    # <<<---- Changed
        x, memory_gradients, gradient_average, step_size, cb = \
            dask.persist(x, memory_gradients, gradient_average,
                         step_size, cb)                         # <<<---- New
        print(cb.compute())                                     # <<<---- Changed

    return x

From a dask perspective this now looks good. We see that one partial_fit call is active at any given time with no large horizontal gaps between partial_fit calls. We’re not getting any parallelism (this is just a sequential algorithm) but we don’t have much dead space. The model seems to jump between the various workers, processing on a chunk of data before moving on to new data.

Step 4: Profile

The dashboard image above gives confidence that our algorithm is operating as it should. The block-sequential nature of the algorithm comes out cleanly, and the gaps between tasks are very short.

However, when we look at the profile plot of the computation across all of our cores (Dask constantly runs a profiler on all threads on all workers to get this information) we see that most of our time is spent compiling Numba code.

We started a conversation for this on the numba issue tracker which has since been resolved. That same computation over the same time now looks like this:

The tasks, which used to take seconds, now take tens of milliseconds, so we can process through many more chunks in the same amount of time.

Future Work

This was a useful experience to build an interesting algorithm. Most of the work above took place in an afternoon. We came away from this activity with a few tasks of our own:

  1. Build a normal Scikit-Learn style estimator class for this algorithm so that people can use it without thinking too much about delayed objects, and can instead just use dask arrays or dataframes
  2. Integrate some of Fabian’s research on this algorithm that improves performance with sparse data and in multi-threaded environments.
  3. Think about how to improve the learning experience so that dask.delayed can teach new users how to use it correctly

Cloud Lock-in and Open Standards


This post is from conversations with Peter Wang, Yuvi Panda, and several others. Yuvi expresses his own views on this topic on his blog.

Summary

When moving to the cloud we should be mindful to avoid vendor lock-in by adopting open standards.

Adoption of cloud computing

Cloud computing is taking over both for-profit enterprises and public/scientific institutions. The Cloud is cheap, flexible, requires little up-front investment, and enables greater collaboration. Cloud vendors like Amazon Web Services (AWS), Google Cloud Platform (GCP), and Microsoft Azure compete to create stable, easy-to-use platforms that serve the needs of a variety of institutions, both big and small. This presents a great opportunity for society, but also a risk of large-scale lock-in in the future.

Cloud vendors build services to lock in users

Some of the competition between cloud vendors is about providing lower costs, higher availability, improved scaling, and so on, that are strictly a benefit for consumers. This is great.

However some of the competition is in the form of services that are specialized to a particular commercial cloud, like Amazon Lambda, Google Tensor Processing Units (TPUs), or Azure Notebooks. These services are valuable to enterprise and public institutions alike, but they lock users in long-term. If you build your system around one of these systems then moving to a different cloud in the future becomes expensive. This stickiness to your current cloud “locks you in” to using that particular cloud, even if policies change in the future in ways that you dislike.

This is OK, lock-in is a standard business practice. We shouldn’t fault these commercial companies for making good business decisions. However, it’s something that we should keep in mind as we invest public effort in these technologies.

Open standards counter lock-in technologies

One way to counter lock-in is to promote the adoption of open standard technologies that are shared among cloud platforms. If these open standard technologies become popular enough then cloud platforms must offer them alongside their proprietary technologies in order to stay competitive, removing one of their options for lock-in.

Examples with Kubernetes and Parquet

For example, consider Kubernetes, a popular resource manager for clusters. While Kubernetes was originally promoted by Google, it was developed in the open by a broader community, gained global adoption, and is now available across all three major commercial clouds. Today if you write your infrastructure on Kubernetes you can move your distributed services between clouds easily, or can move your system onto an on-premises cluster if that becomes necessary. You retain the freedom to move around in the future with low cost.

Consider also the open Parquet data format. If you store your data in Parquet then you can move that data between any cloud’s storage system easily, or can move that data to your in-house hardware without going through a painful database export process.

Technologies like Kubernetes and Parquet displace proprietary technologies like Amazon’s Elastic Container Service (ECS), which locks users into AWS, or Google’s BigQuery which keeps users on GCP with data gravity. This is fine, Amazon and Google can still compete for users with any of their other excellent services, but they’ve been pushed up the stack a bit, away from technologies that are infrastructural and towards technologies that are more about convenience and high-level usability.

What we can do

Wide adoption of open standard infrastructure protects us from the control of cloud vendors.

If you are a public institution considering the cloud then please think about the services you plan to adopt and their potential to lock in your institution over the long run. These services may still make sense, but you should probably have a conversation with your team and adopt them mindfully. You might consider developing a plan to extract yourself from that cloud in the future and see how your decisions affect the cost of that plan.

If you are an open source developer then please consider investing your effort around open standards instead of around proprietary tooling. By focusing our effort on open standards we provide public institutions with viable options for a safe cloud.


High level performance of Pandas, Dask, Spark, and Arrow


This work is supported by Anaconda Inc

Question

How does Dask dataframe performance compare to Pandas? Also, what about Spark dataframes and what about Arrow? How do they compare?

I get this question every few weeks. This post is to avoid repetition.

Caveats

  1. This answer is likely to change over time. I’m writing this in August 2018
  2. This question and answer are very high level. More technical answers are possible, but not contained here.

Answers

Pandas

If you’re coming from Python and have smallish datasets then Pandas is the right choice. It’s usable, widely understood, efficient, and well maintained.

Benefits of Parallelism

The performance benefit (or drawback) of using a parallel dataframe like Dask dataframes or Spark dataframes over Pandas will differ based on the kinds of computations you do:

  1. If you’re doing small computations then Pandas is always the right choice. The administrative costs of parallelizing will outweigh any benefit. You should not parallelize if your computations are taking less than, say, 100ms.

  2. For simple operations like filtering, cleaning, and aggregating large data you should expect linear speedup from using a parallel dataframe.

    If you’re on a 20-core computer you might expect a 20x speedup. If you’re on a 1000-core cluster you might expect a 1000x speedup, assuming that you have a problem big enough to spread across 1000 cores. As you scale up administrative overhead will increase, so you should expect the speedup to decrease a bit.

  3. For complex operations like distributed joins it’s more complicated. You might get linear speedups like above, or you might even get slowdowns. Someone experienced in database-like computations and parallel computing can probably predict pretty well which computations will do well.

However, configuration may be required. Often people find that parallel solutions don’t meet expectations when they first try them out. Unfortunately most distributed systems require some configuration to perform optimally.

There are other options to speed up Pandas

Many people looking to speed up Pandas don’t need parallelism. There are often several other tricks like encoding text data, using efficient file formats, avoiding groupby.apply, and so on that are more effective at speeding up Pandas than switching to parallelism.
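A sketch of what those tricks look like in practice; the file and column names are made up for illustration.

import pandas as pd

df = pd.read_csv("events.csv")                     # hypothetical dataset

# Encode repetitive text columns as categoricals to cut memory use
df["country"] = df["country"].astype("category")

# Prefer built-in grouped aggregations over groupby.apply
summary = df.groupby("country")["value"].agg(["mean", "count"])

# Store in an efficient binary file format for faster reads next time
df.to_parquet("events.parquet")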

Comparing Apache Spark and Dask

Assuming that yes, I do want parallelism, should I choose Apache Spark, or Dask dataframes?

This is often decided more by cultural preferences (JVM vs Python, all-in-one-tool vs integration with other tools) than performance differences, but I’ll try to outline a few things here:

  • Spark dataframes will be much better when you have large SQL-style queries (think 100+ line queries) where their query optimizer can kick in.
  • Dask dataframes will be much better when queries go beyond typical database queries. This happens most often in time series, random access, and other complex computations.
  • Spark will integrate better with JVM and data engineering technology. Spark will also come with everything pre-packaged. Spark is its own ecosystem.
  • Dask will integrate better with Python code. Dask is designed to integrate with other libraries and pre-existing systems. If you're coming from an existing Pandas-based workflow then it's usually much easier to evolve to Dask (see the sketch after this list).
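To illustrate that last point, evolving a Pandas workflow often amounts to swapping the import and adding a compute() call. A small sketch with made-up file and column names:

import pandas as pd
import dask.dataframe as dd

# Pandas version
df = pd.read_csv("transactions.csv")
totals = df.groupby("account").amount.sum()

# Dask version: same API, many partitions, lazy until compute()
ddf = dd.read_csv("transactions-*.csv")
totals = ddf.groupby("account").amount.sum().compute()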

Generally speaking for most operations you’ll be fine using either one. People often choose between Pandas/Dask and Spark based on cultural preference. Either they have people that really like the Python ecosystem, or they have people that really like the Spark ecosystem.

Dataframes are also only a small part of each project. Spark and Dask both do many other things that aren’t dataframes. For example Spark has a graph analysis library, Dask doesn’t. Dask supports multi-dimensional arrays, Spark doesn’t. Spark is generally higher level and all-in-one while Dask is lower-level and focuses on integrating into other tools.

For more information, see Dask’s “Comparison to Spark documentation”.

Apache Arrow

What about Arrow? Is Arrow faster than Pandas?

This question doesn’t quite make sense… yet.

Arrow is not a replacement for Pandas. Arrow is a way to move data around between different systems and different file formats. Arrow does not do computation today. If you use Pandas or Spark or Dask you might be using Arrow without even knowing it. Today Arrow is more useful for other libraries than it is to end-users.
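A small sketch of Arrow in that interchange role, moving a Pandas dataframe into Arrow's in-memory format and through a Parquet file; the file name is illustrative.

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({"x": [1, 2, 3], "y": ["a", "b", "c"]})

# Convert to Arrow's language-agnostic columnar format
table = pa.Table.from_pandas(df)

# Hand the data to any Arrow-aware system, here via a Parquet file
pq.write_table(table, "example.parquet")
roundtrip = pq.read_table("example.parquet").to_pandas()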

However, this is likely to change in the future. Arrow developers plan to write computational code around Arrow that we would expect to be faster than the code in either Pandas or Spark. This is probably a year or two away though. There will probably be some effort to make this semi-compatible with Pandas, but it’s much too early to tell.

Dask Release 0.19.0


This work is supported by Anaconda Inc.

I’m pleased to announce the release of Dask version 0.19.0. This is a major release with bug fixes and new features. The last release was 0.18.2 on July 23rd. This blogpost outlines notable changes since the last release blogpost for 0.18.0 on June 14th.

You can conda install Dask:

conda install dask

or pip install from PyPI:

pip install dask[complete] --upgrade

Full changelogs are available here:

Notable Changes

A ton of work has happened over the past two months, but most of the changes are small and diffuse. Stability, feature parity with upstream libraries (like Numpy and Pandas), and performance have all significantly improved, but in ways that are difficult to condense into blogpost form.

That being said, here are a few of the more exciting changes in the new release.

Python Versions

We’ve dropped official support for Python 3.4 and added official support for Python 3.7.

Deploy on Hadoop Clusters

Over the past few months Jim Crist has built a suite of tools to deploy applications on YARN, the primary cluster manager used in Hadoop clusters.

  • Conda-pack: packs up Conda environments for redistribution to distributed clusters, especially when Python or Conda may not be present.
  • Skein: easily launches and manages YARN applications from non-JVM systems
  • Dask-Yarn: a thin library around Skein to launch and manage Dask clusters (a brief usage sketch follows this list)
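A rough sketch of how these pieces fit together; the environment file name and resource numbers are placeholders, not recommendations.

from dask_yarn import YarnCluster
from dask.distributed import Client

# Launch Dask workers as a YARN application, shipping along a
# conda-pack'ed environment so Python need not be pre-installed
cluster = YarnCluster(environment='environment.tar.gz',
                      worker_vcores=2,
                      worker_memory='4GB')
cluster.scale(4)           # ask YARN for four workers

client = Client(cluster)   # then use Dask as usual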

Jim has written about Skein and Dask-Yarn in two recent blogposts:

Implement Actors

Some advanced workloads want to directly manage and mutate state on workers. A task-based framework like Dask can be forced into this kind of workload using long-running-tasks, but it’s an uncomfortable experience.

To address this we’ve added an experimental Actors framework to Dask alongside the standard task-scheduling system. This provides reduced latencies, removes scheduling overhead, and provides the ability to directly mutate state on a worker, but loses niceties like resilience and diagnostics. The idea to adopt Actors was shamelessly stolen from the Ray Project :)

class Counter:
    def __init__(self):
        self.n = 0

    def increment(self):
        self.n += 1
        return self.n

counter = client.submit(Counter, actor=True).result()

>>> future = counter.increment()
>>> future.result()
1

You can read more about actors in the Actors documentation.

Dashboard improvements

The Dask dashboard is a critical tool to understand distributed performance. There are a few accessibility issues that trip up beginning users that we’ve addressed in this release.

Save task stream plots

You can now save a task stream record by wrapping a computation in the get_task_stream context manager.

from dask.distributed import Client, get_task_stream
client = Client(processes=False)

import dask
df = dask.datasets.timeseries()

with get_task_stream(plot='save', filename='my-task-stream.html') as ts:
    df.x.std().compute()

>>> ts.data
[{'key': "('make-timeseries-edc372a35b317f328bf2bb5e636ae038', 0)",
  'nbytes': 8175440,
  'startstops': [('compute', 1535661384.2876947, 1535661384.3366017)],
  'status': 'OK',
  'thread': 139754603898624,
  'worker': 'inproc://192.168.50.100/15417/2'},
 ...

This gives you the start and stop time of every task on every worker during that window. It also saves that data as an HTML file that you can share with others. This is very valuable for communicating performance issues within a team. I typically upload the HTML file as a gist and then share it with rawgit.com.

$ gist my-task-stream.html
https://gist.github.com/f48a121bf03c869ec586a036296ece1a

Robust to different screen sizes

The Dashboard’s layout was designed to be used on a single screen, side-by-side with a Jupyter notebook. This is how many Dask developers operate when working on a laptop; however, it is not how many users operate, for one of two reasons:

  1. They are working in an office setting where they have several screens
  2. They are new to Dask and uncomfortable splitting their screen into two halves

In these cases the styling of the dashboard becomes odd. Fortunately, Luke Canavan and Derek Ludwig recently improved the CSS for the dashboard considerably, allowing it to switch between narrow and wide screens. Here is a snapshot.

Jupyter Lab Extension

You can now embed Dashboard panes directly within Jupyter Lab using the newly updated dask-labextension.

jupyter labextension install dask-labextension

This allows you to layout your own dashboard directly within JupyterLab. You can combine plots from different pages, control their sizing, and so on. You will need to provide the address of the dashboard server (http://localhost:8787 by default on local machines) but after that everything should persist between sessions. Now when I open up JupyterLab and start up a Dask Client, I get this:

Thanks to Ian Rose for doing most of the work here.

Outreach

Dask Stories

People who use Dask have been writing about their experiences at Dask Stories. In the last couple of months the following people have written about and contributed their experience:

  1. Civic Modelling at Sidewalk Labs by Brett Naul
  2. Genome Sequencing for Mosquitoes by Alistair Miles
  3. Lending and Banking at Full Spectrum by Hussain Sultan
  4. Detecting Cosmic Rays at IceCube by James Bourbeau
  5. Large Data Earth Science at Pangeo by Ryan Abernathey
  6. Hydrological Modelling at the National Center for Atmospheric Research by Joe Hamman
  7. Mobile Networks Modeling by Sameer Lalwani
  8. Satellite Imagery Processing at the Space Science and Engineering Center by David Hoese

These stories help people understand where Dask is and is not applicable, and provide useful context around how it gets used in practice. We welcome further contributions to this project. It’s very valuable to the broader community.

Dask Examples

The Dask-Examples repository maintains easy-to-run examples using Dask on a small machine, suitable for an entry-level laptop or for a small cloud instance. These are hosted on mybinder.org and are integrated into our documentation. A number of new examples have arisen recently, particularly in machine learning. We encourage people to try them out by clicking the link below.

Binder

Other Projects

  • The dask-image project was recently released. It includes a number of image processing routines around dask arrays.

    This project is mostly maintained by John Kirkham.

  • Dask-ML saw a recent bugfix release

  • The TPOT library for automated machine learning recently published a new release that adds Dask support to parallelize their model training. More information is available on the TPOT documentation

Acknowledgements

Since June 14th, the following people have contributed to the following repositories:

The core Dask repository for parallel algorithms:

  • Anderson Banihirwe
  • Andre Thrill
  • Aurélien Ponte
  • Christoph Moehl
  • Cloves Almeida
  • Daniel Rothenberg
  • Danilo Horta
  • Davis Bennett
  • Elliott Sales de Andrade
  • Eric Bonfadini
  • GPistre
  • George Sakkis
  • Guido Imperiale
  • Hans Moritz Günther
  • Henrique Ribeiro
  • Hugo
  • Irina Truong
  • Itamar Turner-Trauring
  • Jacob Tomlinson
  • James Bourbeau
  • Jan Margeta
  • Javad
  • Jeremy Chen
  • Jim Crist
  • Joe Hamman
  • John Kirkham
  • John Mrziglod
  • Julia Signell
  • Marco Rossi
  • Mark Harfouche
  • Martin Durant
  • Matt Lee
  • Matthew Rocklin
  • Mike Neish
  • Robert Sare
  • Scott Sievert
  • Stephan Hoyer
  • Tobias de Jong
  • Tom Augspurger
  • WZY
  • Yu Feng
  • Yuval Langer
  • minebogy
  • nmiles2718
  • rtobar

The dask/distributed repository for distributed computing:

  • Anderson Banihirwe
  • Aurélien Ponte
  • Bartosz Marcinkowski
  • Dave Hirschfeld
  • Derek Ludwig
  • Dror Birkman
  • Guillaume EB
  • Jacob Tomlinson
  • Joe Hamman
  • John Kirkham
  • Loïc Estève
  • Luke Canavan
  • Marius van Niekerk
  • Martin Durant
  • Matt Nicolls
  • Matthew Rocklin
  • Mike DePalatis
  • Olivier Grisel
  • Phil Tooley
  • Ray Bell
  • Tom Augspurger
  • Yu Feng

The dask/dask-examples repository for easy-to-run examples:

  • Albert DeFusco
  • Dan Vatterott
  • Guillaume EB
  • Matthew Rocklin
  • Scott Sievert
  • Tom Augspurger
  • mholtzscher

Public Institutions and Open Source Software


As general purpose open source software displaces domain-specific all-in-one solutions, many institutions are re-assessing how they build and maintain software to support their users. This is true across for-profit enterprises, government agencies, universities, and home-grown communities.

While this shift brings opportunities for growth and efficiency, it also raises questions and challenges about how these institutions should best serve their communities as they grow increasingly dependent on software developed and controlled outside of their organization.

  • How do they ensure that this software will persist for many years?
  • How do they influence this software to better serve the needs of their users?
  • How do they transition users from previous all-in-one solutions to a new open source platform?
  • How do they continue to employ their existing employees who have historically maintained software in this field?
  • If they have a mandate to support this field, what is the best role for them to play, and how can they justify their efforts to the groups that control their budget?

This blogpost investigates this situation from the perspective of large organizations that serve the public good, such as government funding agencies or research institutes like the National Science Foundation, NASA, DoD, and so on. I hope to write separately about this topic from both enterprise and community perspectives in the future.

This blogpost provides context, describes a few common approaches and their outcomes, and draws some general conclusions.

Example

To make this concrete, place yourself in the following situation:

You manage a software engineering department within a domain specific institution (like NASA). Your group produces the de facto standard software suite that almost all scientists in your domain use; or at least that used to be true. Today, a growing number of scientists use general-purpose software stacks, like Scientific Python or R-Stats, that are maintained by a much wider community outside of your institution. While the software solution that your group maintains is still very much in use today, you think that it is unlikely that it will continue to be relevant in five-to-ten years.

What should you do? How should your institution change its funding and employment model to reflect this new context, where the software they depend on is relevant to many other institutions as well?

Common approaches that sometimes end poorly

We list a few common approaches, and some challenges or potential outcomes.

  1. Open your existing stack for community development

    You’ve heard that by placing your code and development cycle on Github that your software will be adopted by other groups and that you’ll be able to share maintenance costs. You don’t consider your software to be intellectual property so you go ahead and hope for the best.

    Positive Outcome: Your existing user-base may appreciate this and some of them may start submitting bugs and even patches.

    Negative Outcome: Your software stack is already quite specific to your domain. It’s unlikely that you will see the same response as a general purpose project like Jupyter, Pandas, or Spark.

    Additionally, maintaining an open development model is hard. You have to record all of your conversations and decision-making online. When people from other domains come with ideas you will have to choose between supporting them and supporting your own mission, which will frequently come into conflict.

    In practice your engineers will start ignoring external users, and as a result your external users will stay away.

  2. Start a new general purpose open source framework

    Positive: If your project becomes widely adopted then you both get some additional development help, but perhaps more importantly your institution gains reputation as an interesting place to work. This helps tremendously with hiring.

    Negative: This is likely well beyond the mandate of your institution.

    Additionally, it’s difficult to produce general purpose software that attracts a broad audience. This is for both technical and administrative reasons:

    1. You need to have feedback from domains outside of your own. Let’s say you’re NASA and you want to make a new image processing tool. You need to talk to satellite engineers (which you have), and also microscopists, medical researchers, astronomers, ecologists, machine learning researchers, and so on. It’s unlikely that your institution has enough exposure outside of your domain to design general purpose software successfully. Even very good software usually fails to attract a community.

    2. If you succeeded in a good general purpose design you would also have to employ people to support users from all of those domains, and this is probably outside of your mandate.

  3. Pull open source tools into your organization, but don’t engage externally.

    You’re happy to use open source tools but don’t see a need to engage external communities. You’ll pull tools inside of your organization and then manage and distribute them internally as usual.

    Positive: You get the best of software today, but also get to manage the experience internally, giving you the control that your organization needs. This is well within the comfort zone of both your legal and IT departments.

    Negative: As the outside world changes you will struggle to integrate these changes with your internal processes. Eventually your version of the software will diverge and become an internal fork that you have to maintain. This locks you into current-day functionality and puts you on the hook to integrate critical patches as they arise. Additionally you miss out on opportunities to move the software in directions that your organization would find valuable.

  4. Get out of the business of maintaining software entirely. Your institution may no longer be strictly needed in this role.

    The open source communities seem to have a decent pipeline to support users, so let’s just divert our userbase to them and focus on other efforts.

    Positive: You reduce operating budget and can allocate your people to other problems.

    Negative: This causes an upset to your employees, your user community, and to the maintainers of the open source software projects that are being asked to absorb your userbase. The maintainers are likely to burn-out from having to support so many people and so your users will be a bit lost. Additionally the open source software probably doesn’t do all of the things that the old software does and your existing users aren’t experts at the new software stack.

    Everyone needs help getting to know each other, and your institution is uniquely positioned to facilitate this.

Common approaches that sometimes end well

We now discuss a few concrete things to help where your institution is likely uniquely positioned to play a critical role:

  1. Maintain the existing stack

    It’s likely that the existing all-in-one stack will still be used for years by current researchers and automated systems. You’re not off the hook to provide maintenance. However, you might want to communicate an end-of-life plan to your userbase, saying that new feature requests are unlikely to be implemented and that support will expire in a few years’ time (or whatever is appropriate for your domain).

  2. Develop learning materials to help users transition to the new stack

    Your institution uniquely understands the needs of your user community. There are analyses, visualizations, storage formats, and so on that are common among your users and are easy with the all-in-one solution, but which take effort with the new system.

    You can build teaching materials like blogposts, how-to guides, and online tutorials that assist the non-early-adopters to transition more smoothly. You can provide in-person teaching at conferences and meetings common to your community. It’s likely that the open source ecosystem that you’re transitioning to already has materials that you can copy-and-modify for your domain.

    Assist users in reporting bugs, and filter these to reduce burden on the upstream OSS projects. It’s likely that your users will express problems in a way that is particular to their domain, but which doesn’t make sense to maintainers.

    • User says: Satellite mission MODIS data doesn’t load properly
    • Developer wants to hear: GeoTIFF loader fails when given S3 route

    Your engineers can help to filter and translate these requests into technical terms appropriate for the upstream project. The people previously working on the all-in-one solution have a unique combination of technical and domain experience that is essential to mediate these interactions.

  3. Contribute code upstream either to mainline packages or new subpackages

    After you’ve spent some time developing training materials and assisting users to transition you’ll have a much better sense of what needs to be fixed upstream, or what small new packages might be helpful. You’ll also have much more experience interacting with the upstream development community so that you get a sense of how they operate and they’ll better understand your needs. Now is a good time to start contributing actual code, either into the package itself if sufficiently general purpose, or into a spinoff project that is small in scope and focuses on your domain.

    It’s very tempting to jump directly to this step and create new software, but waiting a while and focusing on teaching can often yield more long-lasting results.

  4. Enable employees to spend time maintaining upstream projects and ensure that this time counts towards their career advancement

    By this time you will have a few employees who have taken on a maintenance role within the upstream projects. You should explicitly give them time within the work schedule to continue these activities. The work that they do may not be explicitly within scope for your institution, but having them active within the core of a project makes sure that the mission of your institution will be well represented in the future of the open source project.

    These people will often do this work on their own at home or on the side in personal time if you don’t step in. Your institution should make it clear to them that it values these activities and that they are good for the employee’s career within your institution. Besides ensuring that your institution’s perspective is represented within the core of the upstream projects you’re also improving retention of a core employee.

  5. Co-write grant proposals with maintainers of the core project

    If your institution provides or applies for external funding, consider writing up a grant proposal that has your institution and the OSS project collaborating for a few years. You focus on your domain and leave the OSS project maintainers to handle some of the gritty technical details that you anticipate will arise. Most of those details are hopefully things that are general purpose and that they wanted to fix anyway. It’s usually pretty easy to find a set of features that both parties would be very happy to see completed.

    Many OSS maintainers work at an institution that allows them to accept grant funding (at least as a sub-contractor under your award), but may not have access to the same domain-specific funding channels to which your institution has access. If they don’t work at an institution for which accepting a grant sub-award is reasonable (maybe they work at a bank) then they might consider accepting the award through a non-profit foundation like NumFOCUS.

Summary

As primary development of software moves from inside the institution to outside, a cultural shift occurs. The institution transitions from leading development and being fully in charge of its own destiny to playing a support role in a larger ecosystem. They can first help to lead their user community through this transition, and through that develop the experience and relationships to act effectively within the broader ecosystem.

From this process they often find roles within the ecosystem that are critical. Large public institutions are some of the only organizations with the public mandate and long time horizon to maintain public software over decade-long time scales. They have a crucial role to play both in domain specific software packages that directly overlap with their mandate and also with more general purpose packages on which their domain depends indirectly. Finding their new role and learning to engage with external organizations is a skill that they’ll need to re-learn as they engage with an entirely new set of partners.

This growth requires time and patience at several levels of the organization. At the end of the day many people within the institution will have to collaborate with people outside while still staying within the mandate and career track of the institution itself.

Dask Development Log


This work is supported by Anaconda Inc

To increase transparency I’m trying to blog more often about the current work going on around Dask and related projects. Nothing here is ready for production. This blogpost is written in haste, so refined polish should not be expected.

Since the last update in the 0.19.0 release blogpost two weeks ago we’ve seen activity in the following areas:

  1. Update Dask examples to use JupyterLab on Binder
  2. Render Dask examples into static HTML pages for easier viewing
  3. Consolidate and unify disparate documentation
  4. Retire the hdfs3 library in favor of the solution in Apache Arrow.
  5. Continue work on hyper-parameter selection for incrementally trained models
  6. Publish two small bugfix releases
  7. Blogpost from the Pangeo community about combining Binder with Dask
  8. Skein/Yarn Update

1: Update Dask Examples to use JupyterLab extension

The new dask-labextension embeds Dask’s dashboard plots into a JupyterLab session so that you can get easy access to information about your computations from Jupyter directly. This was released a few weeks ago as part of the previous release post.

However since then we’ve hooked this up to our live examples system that lets users try out Dask on a small cloud instance using mybinder.org. If you want to try out Dask and JupyterLab together then head here:

Binder

Thanks to Ian Rose for managing this.

2: Render Dask Examples as static documentation

Using the nbsphinx Sphinx extension to automatically run and render Jupyter Notebooks, we’ve turned our live examples repository into static documentation for easy viewing.
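The setup for this is small. A minimal sketch of the relevant Sphinx configuration might look like the following; the exact options used in the real repository may differ.

# conf.py (excerpt)
extensions = [
    'nbsphinx',      # run and render .ipynb files as documentation pages
]

# Re-execute notebooks at build time so their output stays current
nbsphinx_execute = 'always'

exclude_patterns = ['_build', '**.ipynb_checkpoints']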

These examples are currently available at https://dask.org/dask-examples/ but will soon be available at examples.dask.org and from the navbar at all Dask pages.

Thanks to Tom Augspurger for putting this together.

3: Consolidate documentation under a single org and style

Dask documentation is currently spread out over many small hosted sites, each associated with a particular subpackage like dask-ml, dask-kubernetes, dask-distributed, etc. This eases development (developers are encouraged to modify documentation as they modify code) but results in a fragmented experience because users don’t know how to discover and efficiently explore our full documentation.

To resolve this we’re doing two things:

  1. Moving all sites under the dask.org domain

    Anaconda Inc, the company that employs several of the Dask developers (myself included) recently donated the domain dask.org to NumFOCUS. We’ve been slowly moving over all of our independent sites to use that location for our documentation.

  2. Develop a uniform Sphinx theme dask-sphinx-theme

    This has both uniform styling and also includes a navbar that gets automatically shared between the projects. The navbar makes it easy to discover and explore content and is something that we can keep up-to-date in a single repository.

You can see how this works by going to any of the Dask sites, like docs.dask.org.

Thanks to Tom Augspurger for managing this work and Andy Terrel for patiently handling things on the NumFOCUS side and domain name side.

4: Retire the hdfs3 library

For years the Dask community has maintained the hdfs3 library that allows for native access to the Hadoop file system from Python. This used Pivotal’s libhdfs3 library, written in C++, and was for a long while the only performant, mature way to manipulate HDFS from Python.

Since then though PyArrow has developed efficient bindings to the standard libhdfs library and exposed it through their Pythonic file system interface, which is fortunately Dask-compatible.

We’ve been telling people to use the Arrow solution for a while now and thought we’d now make that official (see dask/hdfs3 #170). As of the last bugfix release Dask will use Arrow by default and, while the hdfs3 library is still available, Dask maintainers probably won’t spend much time on it in the future.
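In practice this mostly means that the familiar hdfs:// paths keep working, now backed by Arrow. Here is a hedged sketch; the host name, port, and file paths are placeholders.

import dask.dataframe as dd
import pyarrow as pa

# Direct file-system access through Arrow's libhdfs bindings
hdfs = pa.hdfs.connect(host='namenode', port=8020)
print(hdfs.ls('/data'))

# Dask paths that start with hdfs:// now use the Arrow-backed driver
df = dd.read_csv('hdfs:///data/2018-*.csv')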

Thanks to Martin Durant for building and maintaining HDFS3 over all this time.

5: Hyper-parameter selection for incrementally trained models

In Dask-ML we continue to work on hyper-parameter selection for models that implement the partial_fit API. We’ve built algorithms and infrastructure to handle this well, and are currently fine-tuning the API, parameter names, and so on.
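As a rough sketch of the kind of usage under discussion (class and parameter names are illustrative and still in flux, so treat this as a direction rather than a stable API):

from dask_ml.datasets import make_classification
from dask_ml.model_selection import IncrementalSearchCV  # name still under discussion
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=100000, chunks=10000)

params = {'alpha': [1e-5, 1e-4, 1e-3, 1e-2],
          'penalty': ['l1', 'l2']}

# Train many models incrementally with partial_fit,
# dropping poor performers as more data arrives
search = IncrementalSearchCV(SGDClassifier(), params)
search.fit(X, y, classes=[0, 1])
print(search.best_params_)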

If you have any interest in this process, come on over to dask/dask-ml #356.

Thanks to Tom Augspurger and Scott Sievert for this work.

6: Two small bugfix releases

We’ve been trying to increase the frequency of bugfix releases while things are stable. Since our last writing there have been two minor bugfix releases. You can read more about them here:

7: Binder + Dask

The Pangeo community has done work to integrate Binder with Dask and has written about the process here: Pangeo meets Binder

Thanks to Joe Hamman for this work and the blogpost.

8: Skein/Yarn Update

The Dask-Yarn connection to deploy Dask on Hadoop clusters uses a library, Skein, to easily manage YARN jobs from Python.

Skein has seen a lot of activity over the last few weeks, including the following:

  1. A Web UI for the project. See jcrist/skein #68
  2. A Tensorflow on Yarn project from Criteo that uses Skein. See github.com/criteo/tf-yarn

This work is mostly managed by Jim Crist and other Skein contributors.

So you want to contribute to open source


Welcome new open source contributor!

I appreciated receiving the e-mail where you said you were excited about getting into open source and were particularly interested in working on a project that I maintain. This post has a few thoughts on the topic.

First, please forgive me for sending you to this post rather than responding with a personal e-mail. Your situation is common today, so I thought I’d write up thoughts in a public place, rather than respond personally.

This post has two parts:

  1. Some pragmatic steps on how to get started
  2. A personal recommendation to think twice about where you focus your time

Look for good first issues on Github

Most open source software (OSS) projects have a “Good first issue” label on their Github issue tracker. Here is a screenshot of how to find the “good first issue” label on the Pandas project:

(note that this may be named something else like “Easy to fix”)

This contains a list of issues that are important, but also good introductions for new contributors like yourself. I recommend looking through that list to see if something interests you. The issue should include a clear description of the problem, and some suggestions on how it might be resolved. If you need to ask questions, you can make an account on Github and ask them there.

It is very common for people to ask questions on Github. We understand that this may cause some anxiety your first time (I always find it really hard to introduce myself to a new community), but a “Good first issue” issue is a safe place to get started. People expect newcomers to show up there.

Read developer guidelines

Many projects will specify developer guidelines like how to check out a codebase, run tests, write and style code, formulate a pull request, and so on. This is usually in their documentation under a label like “Developer guidelines”, “Developer docs”, or “Contributing”.

If you do a web search for “pandas developer docs” then this page is the first hit: pandas.pydata.org/pandas-docs/stable/contributing.html

These pages can be long, but they have a lot of good information. Reading through them is a good learning experience.

But this may not be as fun as you think

Open source software is a field of great public interest today, but day-to-day it may be more tedious than you expect. Most OSS work is dull. Maintainers spend most of their time discussing grammatical rules for documentation, discovering obscure compiler bugs, or handling e-mails. They spend very little time inventing cool algorithms. You may notice this yourself as you look through the issue list. What fraction of them excite you?

I say this not to discourage you (indeed, please come help!) but just as a warning. Many people leave OSS pretty quickly. This can be for many reasons, but lack of interest is certainly one of them.

The desire to maintain software is rarely enough to keep people engaged in open source long term

So work on projects that are personal to you

You are more than a programmer. You already have life experience and skills that can benefit your programming, and you have life needs that programming can enhance.

The people who stay with an OSS project are often people who need that project for something else.

  • A musician may contribute to a composition or recording software that they use at work
  • A teacher may contribute to educational software to help their students
  • Community organizers may contribute to geospatial software to help them plan activities or understand local issues.

So my biggest piece of advice is not to contribute to a package because it is popular or exciting, but rather to wait until you run into a problem with a piece of software that you use daily, and then contribute a fix to that project. It is more rewarding to contribute to something that is already part of your life, and as an active user you already have a ton of experience and a place in the community. You are much more likely to succeed as a contributor if you have been using the project for a long time.

Anatomy of an OSS Institutional Visit


I recently visited the UK Meteorology Office, a moderately large organization that serves the weather and climate forecasting needs of the UK (and several other nations). I was there with other open source colleagues including Joe Hamman and Ryan May from open source projects like Dask, Xarray, JupyterHub, MetPy, Cartopy, and the broader Pangeo community.

This visit was like many other visits I’ve had over the years that are centered around showing open source tooling to large institutions, so I thought I’d write about it in hopes that it helps other people in this situation in the future.

My goals for these visits are the following:

  1. Teach the institution about software projects and approaches that may help them to have a more positive impact on the world
  2. Engage them in those software projects and hopefully spread around the maintenance and feature development burden a bit

Step 1: Meet allies on the ground

We were invited by early adopters within the institution, both within the UK Met Office’s Informatics Lab, a research and incubation group within the broader organization, and the Analysis, Visualization, and Data (AVD) group, which serves 500 analysts at the Met Office with its suite of open source tooling.

Both of these groups are forward thinking, already use and appreciate the tools that we were talking about, and hope to leverage our presence to evangelize what they’ve already been saying throughout their institution. They need outside experts to provide external validation; that’s our job.

The goals for the early adopters are the following:

  1. Reinforce the message they’ve already been saying internally, that these tools and approaches can improve operations within the institution
  2. Discuss specific challenges that they’ve been having with the software directly with maintainers
  3. Design future approaches to adopt within their forward-thinking groups

So our visit was split between meeting a variety of groups within the institution (analysts, IT, …) and talking shop.

Step 2: Talk to IT

One of our first visits was a discussion with a cross-department team of people architecting a variety of data processing systems throughout the company. Joe Hamman and I gave a quick talk about Dask, XArray, and the Pangeo community. Because this was more of an IT-focused group I went first, answered the standard onslaught of IT-related questions about Dask, and established credibility. Then Joe took over and demonstrated the practical relevance of the approach from their users’ perspective.

We’ve done this tag-team approach a number of times and it’s always effective. Having a technical person speak to technical concerns while also having a scientist demonstrating organizational value seems to establish credibility across a wide range of people.

However it’s still important to tailor the message to the group at hand. IT-focused groups like this one are usually quite conservative about adding new technology, and they face constant pressure from users asking for things that will generally cause problems. We chose to start with low-level technical details because it lets them engage with the problem at a level where they can meaningfully test and assess the situation.

Step 3: Give a talk to a broader audience

Our early-adopter allies had also arranged a tech-talk with a wider audience across the office. This was part of a normal lecture series, so we had a large crowd, along with a video recording within the institution for future viewers. The audience this time was a combination of analysts (users of our software), some IT, and an executive or two.

Joe and I gave essentially the same talk, but this time we reversed the order, focusing first on the scientific objectives, and then following up with a more brief summary on how the software accomplishes this. A pretty constant message in this talk was …

other institutions like yours have already adopted this approach and are experiencing transformative change as a result

We provide social proof by showing that lots of other popular projects and developer communities integrate with these tools, and that many large government organizations (peers to the UK Met Office) are already adopting these tools and seeing efficiency gains.

Our goals for this section are the following:

  1. Encourage the users within the audience to apply pressure to their management/IT to make it easier for them to integrate these tools to their everyday workflow

  2. Convince management that this is a good approach.

    This means two things for them:

    1. These methods are well established outside of the institution, and not just something that their engineers are presently enamored with
    2. These methods can enable transformative change within the organization

Step 4: Talk to many smaller groups

After we gave the talk to the larger audience we met with many smaller groups. These were groups that managed the HPC systems, were in charge of storing data on the cloud, ran periodic data processing pipelines, etc.. Doing this after the major talk is useful, because people arrive with a pretty good sense of what the software does, and how it might help them. Conversations then become more specific quickly.

Step 5: Engage with possible maintainers

During this process I had the good fortune to work with Peter Killick and Bill Little, who had done a bit of work on Dask in the past and were interested in doing more. Before coming to the Met Office we found a bug that was relevant to them, but that also involved learning some more Dask skills. We worked on it off and on during the visit. It was great to get to know them better, and with more familiarity they are hopefully more likely to fix issues that arise in the future.

Step 6: Engage with other core developers

Between the visitors and our hosts we had several core developers present on related projects (XArray, Iris, Dask, Cartopy, Metpy, …). This was a good time not just for evangelism and growing the community, but also for making long-term plans about existing projects, identifying structural issues in the ecosystem, and identifying new projects to fix those issues.

There was good conversation around the future relationship between Xarray and Iris, two similar packages that could play better together. We discussed current JupyterHub deployments both within the UK Met Office and without. Developers for the popular Cartopy library got together. A couple of us prototyped a very early stage unstructured mesh data structure.

These visits are one of the few times when a mostly distributed community gets together and can make in-person plans. Sitting down with a blank sheet of paper is a useful exercise that is still remarkably difficult to replicate remotely.

Step 7: Have a good time

It turns out that the Southwest corner of England is full of fine pubs, and even better walking. I’m thankful to Phil Elson and Jo Camp for hosting me over the weekend where we succeeded in chatting about things other than work.

Support Python 2 with Cython


Summary

Many popular Python packages are dropping support for Python 2 next month. This will be painful for several large institutions. Cython can provide a temporary fix by letting us compile a Python 3 codebase into something usable by Python 2 in many cases.

It’s not clear if we should do this, but it’s an interesting and little known feature of Cython.

Background: Dropping Python 2 Might be Harder than we Expect

Many major numeric Python packages are dropping support for Python 2 at the end of this year. This includes packages like Numpy, Pandas, and Scikit-Learn. Jupyter already dropped Python 2 earlier this year.

For most developers in the ecosystem this isn’t a problem. Most of our packages are Python-3 compatible and we’ve learned how to switch libraries. However, for larger companies or government organizations it’s often far harder to switch. The PyCon 2017 keynote by Lisa Guo and Hui Ding from Instagram gives a good look into why this can be challenging for large production codebases and also gives a good example of someone successfully navigating this transition.

It will be interesting to see what happens when Numpy, Pandas, and Scikit-Learn start publishing Python-3 only releases. We may uncover a lot of pain within larger institutions. In that case, what should we do?

(Although, to be fair, the data science stack tends to get used more often in isolated user environments, which tend to be more amenable to making the Python 2-3 switch than web-services production codebases).

Cython

The Cython compiler provides a possible solution that I don’t hear discussed very often, so I thought I’d cover it briefly.

The Cython compiler can convert a Python 3 codebase into a C-Extension module that is usable by both Python 2 and 3. We could probably use Cython to prepare Python 2 packages for a large subset of the numeric Python ecosystem after that ecosystem drops Python 2.

Let’s see an example…

Example

Here we show a small Python project that uses Python 3 language features. (source code here)

py32test$ tree .
.
├── py32test
│   ├── core.py
│   └── __init__.py
└── setup.py

1 directory, 3 files
# py32test/core.py

def inc(x: int) -> int:          # Uses typing annotations
    return x + 1


def greet(name: str) -> str:
    return f'Hello, {name}!'     # Uses format strings

# py32test/__init__.py

from .core import inc, greet

We see that this code uses both typing annotations and format strings, two language features that are well-loved by Python-3 enthusiasts, and entirely inaccessible if you want to continue supporting Python-2 users.

We also show the setup.py script, which includes a bit of Cython code if we’re running under Python 2.

# setup.py
import os
import sys
from setuptools import setup, find_packages

if sys.version_info[0] == 2:
    from Cython.Build import cythonize
    kwargs = {'ext_modules': cythonize(os.path.join("py32test", "*.py"),
                                       language_level='3')}
else:
    kwargs = {}

setup(name='py32test',
      version='1.0.0',
      packages=find_packages(),
      **kwargs)

This package works fine in Python 2

>>> import sys
>>> sys.version_info
sys.version_info(major=2, minor=7, micro=14, releaselevel='final', serial=0)

>>> import py32test
>>> py32test.inc(100)
101

>>> py32test.greet(u'user')
u'Hello, user!'

In general things seem to work fine. There are a couple of gotchas, though.

Potential problems

  1. We can’t use any libraries that are Python 3 only, like asyncio.

  2. Semantics may differ slightly, for example I was surprised (though pleased) to see the following behavior.

    >>> py32test.greet('user')  # <<--- note that I'm sending a str, not unicode object
    TypeError: Argument 'name' has incorrect type (expected unicode, got str)

    I suspect that this is tunable with a keyword parameter somewhere in Cython. More generally this is a warning that we would need to be careful because semantics may differ slightly between Cython and CPython.

  3. Introspection becomes difficult. Tools like pdb, getting frames and stack traces, and so forth will probably not be as easy when going through Cython.

  4. Python 2 users would have to go through a compilation step to get development versions. Most Python 2 users will probably just wait for proper releases or will install compilers locally.

  5. Moved imports like from collections.abc import Mapping are not supported, though presumably changes like this could be baked into Cython in the future.

So this would probably take a bit of work to make clean, but fortunately most of this work wouldn’t affect the project’s development day-to-day.

Should we do this?

Just because we can support Python 2 in this way doesn’t mean that we should. Long term, institutions do need to drop Python 2 and either move on to Python 3 or to some other language. Tricks like using Cython only delay the inevitable and, due to the complexities above, may end up adding as much headache for developers as Python 2 itself.

However, as someone who maintains a sizable Python-2 compatible project that is used by large institutions, and whose livelihood depends a bit on continued uptake, I’ll admit that I’m hesitant to jump onto the Python 3 Statement. For me personally, seeing Cython as an option to provide continued support makes me much more comfortable with dropping Python 2.

I also think that maintaining a conda channel of Cython-compiled Python-2-compatible packages would be an excellent effort for a for-profit company like Anaconda Inc, Enthought, or Quansight (or someone new). Companies may be willing to pay for access to such a channel, and presumably the company providing these packages would then be incentivized to improve support for the Cython compiler.


First Impressions of GPUs and PyData


I recently moved from Anaconda to NVIDIA within the RAPIDS team, which is building a PyData-friendly GPU-enabled data science stack. For my first week I explored some of the current challenges of working with GPUs in the PyData ecosystem. This post shares my first impressions and also outlines plans for near-term work.

First, let’s start with the value proposition of GPUs: significant speed increases over traditional CPUs.

GPU Performance

Like many PyData developers, I’m loosely aware that GPUs are sometimes fast, but I don’t deal with them often enough to have strong feelings about them.

To get a more visceral feel for the performance differences, I logged into a GPU machine, opened up CuPy (a Numpy-like GPU library developed mostly by Chainer developers in Japan) and cuDF (a Pandas-like library in development at NVIDIA) and did a couple of small speed comparisons:

Compare Numpy and Cupy

>>> import numpy, cupy

>>> x = numpy.random.random((10000, 10000))
>>> y = cupy.random.random((10000, 10000))

>>> %timeit (numpy.sin(x) ** 2 + numpy.cos(x) ** 2 == 1).all()
402 ms ± 8.59 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

>>> %timeit (cupy.sin(y) ** 2 + cupy.cos(y) ** 2 == 1).all()
746 µs ± 1.26 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

On this mundane example, the GPU computation is a full 500x faster.

Compare Pandas and cuDF

>>> import pandas as pd, numpy as np, cudf

>>> pdf = pd.DataFrame({'x': np.random.random(10000000),
...                     'y': np.random.randint(0, 10000000, size=10000000)})
>>> gdf = cudf.DataFrame.from_pandas(pdf)

>>> %timeit pdf.x.mean()            # 30x faster
50.2 ms ± 970 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit gdf.x.mean()
1.42 ms ± 5.84 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

>>> %timeit pdf.groupby('y').mean() # 40x faster
1.15 s ± 46.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit gdf.groupby('y').mean()
54 ms ± 182 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

>>> %timeit pdf.merge(pdf, on='y')  # 30x faster
10.3 s ± 38.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit gdf.merge(gdf, on='y')
280 ms ± 856 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

This is a 30-40x speedup for normal dataframe computing. Operations that previously took ten seconds now process in near-interactive speeds.

These were done naively on one GPU on a DGX machine. The dataframe examples were cherry-picked to find supported operations (see dataframe issues below).

Analysis

This speed difference is potentially transformative to a number of scientific disciplines. I intentionally tried examples that were more generic than typical deep learning workloads today, examples that might represent more traditional scientific computing and data processing tasks.

GPUs seem to offer orders-of-magnitude performance increases over traditional CPUs (at least in the naive cases presented above). This speed difference is an interesting lever for us to push on, and is what made me curious about working for NVIDIA in the first place.

Roadblocks

However, there are many reasons why people don’t use GPUs for general purpose computational programming today. I thought I’d go through a few of them in this blogpost so we can see the sorts of things that we would need to resolve.

  • Not everyone has a GPU. They can be large and expensive
  • Installing CUDA-enabled libraries can be tricky, even with conda
  • Current CUDA-enabled libraries don’t yet form a coherent ecosystem with established conventions
  • Many of the libraries around RAPIDS need specific help:
    • cuDF is immature, and needs many simple API and feature improvements
    • Array computing libraries need protocols to share data and functionality
    • Deep learning libraries have functionality, but don’t share functionality easily
    • Deploying Dask on multi-GPU systems can be improved
    • Python needs better access to high performance communication libraries

This is just my personal experience which, let me be clear, is only limited to a few days. I’m probably wrong about many topics I discuss below.

Not everyone has a GPU

GPUs can be expensive and hard to put into consumer laptops, so there is a simple availability problem. Most people can’t just crack open a laptop, start IPython or a Jupyter notebook, and try something out immediately.

However, most data scientists, actual scientists, and students that I run into today do have some access to GPU resources through their institution. Many companies, labs, and universities today have purchased a GPU cluster that, more often than not, sits under-utilized. These are often an ssh command away, and generally available.

Two weeks ago I visited with Joe Hamman, a scientific collaborator at NCAR and UW eScience institute and he said “Oh yeah, we have a GPU cluster at work that I never use”. About 20 minutes later he had a GPU stack installed and was running an experiment very much like what we did above.

Installing CUDA-enabled libraries is complicated by drivers

Before conda packages, wheels, Anaconda, and conda-forge, installing the PyData software stack (Numpy, Pandas, Scikit-Learn, Matplotlib) was challenging. This was because users had to match a combination of system libraries, compiler stacks, and Python packages. “Oh, you’re on Mac? First brew install X, then make sure you have gfortran, then pip install scipy…”

The ecosystem solved this pain by bringing the entire stack under the single umbrella of conda, where everything could be managed consistently; alternatively, the pain was greatly diminished by pip wheels.

Unfortunately, CUDA drivers have to be managed on the system side, so we’re back to matching system libraries with Python libraries, depending on what CUDA version you’re using.

Here are PyTorch’s installation instructions as an example:

  • CUDA 8.0:conda install pytorch torchvision cuda80 -c pytorch
  • CUDA 9.2:conda install pytorch torchvision -c pytorch
  • CUDA 10.0:conda install pytorch torchvision cuda100 -c pytorch
  • No CUDA:conda install pytorch-cpu torchvision-cpu -c pytorch

Additionally, these conventions differ from the conventions used by Anaconda’s packaging of TensorFlow and NVIDIA’s packaging of RAPIDS. This inconsistency in convention makes it unlikely that a novice user will get a working system if they don’t do some research ahead of time. PyData survives by courting non-expert computer users (they’re often experts in some other field) so this is a challenge.

There is some work happening in Conda that can help with this in the future. Regardless, we will need to develop better shared conventions between the different Python GPU development groups.

No community standard build infrastructure exists

After speaking about this with John Kirkham (Conda Forge maintainer), he suggested that the situation is also a bit like the conda ecosystem before conda-forge, where everyone built their own packages however they liked and uploaded them to anaconda.org without agreeing on a common build environment. As much of the scientific community knows, this inconsistency can lead to a fragmented stack, where certain families of packages work well only with certain packages within their family.

Development teams are fragmented across companies

Many of the large GPU-enabled packages are being developed by large teams within private institutions. There isn’t a strong culture of cross-team collaboration.

After working with the RAPIDS team at NVIDIA for a week my sense is that this is only due to being unaware of how to act, and not any nefarious purpose (or they were very good at hiding that nefarious purpose). I suspect that the broader community will be able to bring these groups more into the open quickly if they engage.

RAPIDS and Dask need specific attention

Now we switch and discuss technical issues in the RAPIDS stack, and Dask’s engagement with GPUs generally. This will be lower-level, but shows the kinds of things that I hope to work on technically over the coming months.

Generally the goal of RAPIDS is to build a data science stack around conventional numeric computation that mimics the PyData/SciPy stack. They seem to be targeting libraries like:

  • Pandas by building a new library, cuDF
  • Scikit-Learn / traditional non-deep machine learning by building a new library cuML
  • Numpy by leveraging existing libraries like CuPy, PyTorch, TensorFlow, and focusing on improving interoperation within the ecosystem

This work is driven by the standard collection of scientific and data-centric applications: imaging, earth science, ETL, finance, and so on.

Now let’s talk about the current challenges for those systems. In general, none of this stack is yet mature (except for the array-computing-in-deep-learning case).

cuDF is missing Pandas functionality

When I showed cuDF at the top of this post, I ran the following computations, which ran 30-40x as fast as Pandas.

gdf.x.mean()
gdf.groupby('y').mean()
gdf.merge(gdf, on='y')

What I failed to show was that many operations erred. The cuDF library has great promise, but still needs work filling out the Pandas API.

# There are many holes in the cuDF API
cudf.read_csv(...)            # works
cudf.read_parquet(...)        # fails if compression is present

df.x.mean()                   # works
df.mean()                     # fails

df.groupby('id').mean()       # works
df.groupby('id').x.mean()     # fails
df.x.groupby(df.id).mean()    # fails

Fortunately, this work is happening quickly (GitHub issues seem to turn quickly into PRs) and is mostly straightforward on the Python side. This is a good opportunity for community members who are looking to have a quick impact. There are lots of low-hanging fruit.

Additionally, there are areas where the cuDF semantics don’t match Pandas semantics. In general this is fine (not everyone loves Pandas semantics) but it makes it difficult as we try to wrap Dask DataFrame around cuDF. We would like to grow Dask DataFrame so that it can accept Pandas-like dataframes, giving us out-of-core GPU dataframes on a single node and distributed GPU dataframes across multiple GPUs or nodes, and we would like to grow cuDF so that it can fit into this expectation.

This work has to happen both at the low-level C++/CUDA code, and also at the Python level. The sense I get is that NVIDIA has a ton of people available at the CUDA level, but only a few (very good) people at the Python level who are working to keep up (come help!).

Array computing is robust, but fragmented

The Numpy experience is much smoother, mostly because of the excitement around deep learning over the last few years. Many large tech companies have made their own deep learning framework, each of which contains a partial clone of the Numpy API. These include libraries like TensorFlow, PyTorch, Chainer/CuPy, and others.

This is great because these libraries provide high quality functionality to choose from, but is also painful because the ecosystem is heavily fragmented. Data allocated with TensorFlow can’t be computed on with Numba or CuPy operations.

We can help to heal this rift with a few technical approaches:

  • A standard to communicate low-level information about GPU arrays between frameworks. This would include information about an array like a device memory pointer, datatype, shape, strides, and so on, similar to what is in the Python buffer protocol today.

    This would allow people to allocate an array with one framework, but then use computational operations defined in another framework.

    The Numba team prototyped something like this a few months ago, and the CuPy team seemed happy enough with it. See cupy/cupy #1144

    @property
    def __cuda_array_interface__(self):
        desc = {
            'shape': self.shape,
            'typestr': self.dtype.str,
            'descr': self.dtype.descr,
            'data': (self.data.mem.ptr, False),
            'version': 0,
        }
        if not self._c_contiguous:
            desc['strides'] = self._strides
        return desc

    This was also, I believe, accepted into PyTorch.

  • A standard way for developers to write backend-agnostic array code. Currently my favorite approach is to use Numpy functions as a lingua franca, and to allow the frameworks to hijack those functions and interpret them as they will.

    This was proposed and accepted within Numpy itself in NEP-0018 and has been pushed forward by people like Stephan Hoyer, Hameer Abbasi, Marten van Kerkwijk, and Eric Wieser.

    This is also useful for other array libraries, like pydata/sparse and Dask Array, and would go a long way towards unifying operations with libraries like XArray. A small sketch of this approach follows the list.
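To make the second idea concrete, here is a minimal sketch of backend-agnostic code, assuming a NumPy/CuPy combination recent enough to support this dispatch protocol:

import numpy as np
import cupy  # assumes a CUDA-capable machine with CuPy installed

def normalize(x):
    # Written purely in terms of NumPy functions; with NEP-18 style
    # dispatch the same code runs on whichever array type it receives
    return (x - np.mean(x)) / np.std(x)

normalize(np.random.random(1000))    # runs with NumPy on the CPU
normalize(cupy.random.random(1000))  # runs with CuPy on the GPU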

cuML needs features, Scikit-Learn needs datastructure agnosticism

While deep learning on the GPU is commonplace today, more traditional algorithms like GLMs, random forests, preprocessing and so on haven’t received the same thorough treatment.

Fortunately the ecosystem is well prepared to accept work in this space, largely because Scikit Learn established a simple pluggable API early on. Building new estimators in external libraries that connect to the ecosystem well is straightforward.

We should be able to build isolated estimators that can be dropped into existing workflows piece by piece, leveraging the existing infrastructure within other Scikit-Learn-compatible projects.

# This code is aspirational

from sklearn.model_selection import RandomSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfTransformer

# from sklearn.feature_extraction.text import HashingVectorizer
from cuml.feature_extraction.text import HashingVectorizer  # swap out for GPU versions

# from sklearn.linear_model import LogisticRegression, RandomForest
from cuml.linear_model import LogisticRegression, RandomForest

pipeline = make_pipeline([HashingVectorizer(),   # use Scikit-Learn infrastructure
                          TfidfTransformer(),
                          LogisticRegression()])

RandomSearchCV(pipeline).fit(data, labels)

Note, the example above is aspirational (that cuml code doesn’t exist yet) and probably naive (I don’t know ML well).

However, aside from the straightforward task of building these GPU-enabled estimators (which seems to be routine for the CUDA developers at NVIDIA) there are still challenges around cleanly passing non-Numpy arrays around, coercing only when necessary, and so on that we’ll need to work out within Scikit-Learn.

Fortunately this work has already started because of Dask Array, which has the same problem. The Dask and Scikit-Learn communities have been collaborating to better enable pluggability over the last year. Hopefully this additional use case proceeds along these existing efforts, but now with more support.

Deep learning frameworks are overly specialized

The SciPy/PyData stack thrived because it was modular and adaptable to new situations. There are many small issues around integrating components of the deep learning frameworks into the more general ecosystem.

We went through a similar experience with Dask early on, when the Python ecosystem wasn’t ready for parallel computing. As Dask expanded we ran into many small issues around parallel computing that hadn’t been addressed before because, for the most part, few people used Python for parallelism at the time.

  • Various libraries didn’t release the GIL (thanks for the work Pandas, Scikit-Image, and others!)
  • Various libraries weren’t threadsafe in some cases (like h5py, and even Scikit-Learn in one case)
  • Function serialization still needed work (thanks cloudpickle developers!)
  • Compression libraries were unmaintained (like LZ4)
  • Networking libraries weren’t used to high bandwidth workloads (thanks Tornado devs!)

These issues were fixed by a combination of Dask developers and the broader community (it’s amazing what people will do if you provide a well-scoped and well-described problem on GitHub). These libraries were designed to be used with other libraries, and so they were well incentivized to improve their usability by the broader ecosystem.

Today deep learning frameworks have these same problems. They rarely serialize well, aren’t threadsafe when called by external threads, and so on. This is to be expected: most people using a tool like TensorFlow or PyTorch operate almost entirely within those frameworks. These projects aren’t being stressed against the rest of the ecosystem (no one puts PyTorch arrays as columns in Pandas, or pickles them to send across a wire). Taking tools that were designed for narrow workflows and encouraging them towards general purpose collaboration takes time and effort, both technically and socially.

The non-deep-learning OSS community has not yet made a strong effort to engage the deep-learning developer communities. This should be an interesting social experiment between two different dev cultures. I suspect that these groups have different styles and can learn from each other.

Note: Chainer/CuPy is a notable exception here. The Chainer library (another deep learning framework) explicitly separates its array library, CuPy, which makes it easier to deal with. This, combined with a strict adherence to the Numpy API, is probably why they’ve been the early target for most ongoing Python OSS interactions.

Dask needs convenience scripts for GPU deployment

On high-end systems it is common to have several GPUs on a single machine. Programming across these GPUs is challenging because you need to think about data locality, load balancing, and so on.

Dask is well-positioned to handle this for users. However, most people using Dask and GPUs today have a complex setup script that includes a combination of environment variables, dask-worker calls, additional calls to CUDA profiling utilities, and so on. We should make a simple LocalGPUCluster Python object that people can easily call within a local script, similar to how they do today for LocalCluster.
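As a point of reference, the CPU-only analogue already works today; the commented lines below sketch the convenience we’re after (LocalGPUCluster here is hypothetical and does not exist yet, the name is only illustrative).

from dask.distributed import Client, LocalCluster

cluster = LocalCluster()   # starts local workers, one per CPU core by default
client = Client(cluster)

# Aspirational GPU equivalent: start one worker per GPU, handling
# CUDA_VISIBLE_DEVICES, device memory limits, and so on for the user.
# cluster = LocalGPUCluster()
# client = Client(cluster)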

Additionally, this problem applies to the multi-gpu-multi-node case, and will require us to be creative with the existing distributed deployment solutions (like dask-kubernetes, dask-yarn, and dask-jobqueue). Of course, adding complexity like this without significantly impacting the non-GPU case and adding to community maintenance costs will be an interesting challenge, and will require creativity.

Python needs High Performance Communication libraries

High-end GPU systems often use high-end networking. This becomes especially important once GPUs cut computation time significantly, because communication can quickly become the new bottleneck.

Last year Antoine worked to improve Tornado’s handling of high-bandwidth connections to get about 1GB/s per process on Infiniband networks from Python. We may need to go well above this, both for Infiniband and for more exotic networking solutions. In particular NVIDIA has systems that support efficient transfer directly between GPU devices without going through host memory.

There is already work here that we can leverage. The OpenUCX project offloads exotic networking solutions (like Infiniband) to a uniform API. They’re now working to provide a Python-accessible API that we can then connect to Dask. This is also good for Dask CPU users, because Infiniband connections will become more efficient (HPC users rejoice), and for the general Python HPC community, which will finally have a reasonable Python API for high performance networking. This work currently targets the Asyncio event loop.

As an aside, this is the kind of work I’d like to see coming out of NVIDIA (and other large technology companies) in the future. It helps to connect Python users to their specific hardware yes, but also helps lots of other systems and provides general infrastructure applicable across the community at the same time.

Come help!

This post started with the promise of 100x speed improvements (at least for computation), and then outlined many of the challenges to getting there. These challenges are serious, but also mostly straightforward technical and social engineering problems. There is a lot of basic work with a high potential payoff.

NVIDIA’s plan to build a GPU-compatible data science stack seems ambitious, but they seem to be treating the problem seriously, and seem willing to put resources and company focus behind the problem.

If any of the work above sounds interesting to you please engage either as an …

  • Individual: either as an open source contributor (RAPIDS is Apache 2.0 licensed and seems to be operating as a normal OSS project on GitHub) or as an employee (see active job postings; the team is currently remotely distributed).

    There is lots of exciting work to do here.

    .. or as an …

  • Institution: You may already have both an under-utilized cluster of GPUs within your institution, and also a large team of data scientists familiar with Python but unfamiliar with CUDA. NVIDIA seems eager to find partners who are interested in mutual arrangements to build out functionality for specific domains. Please reach out if this sounds familiar.

GPU Dask Arrays, first steps


The following code creates and manipulates 2 TB of randomly generated data.

import dask.array as da

rs = da.random.RandomState()
x = rs.normal(10, 1, size=(500000, 500000), chunks=(10000, 10000))

(x + 1)[::2, ::2].sum().compute(scheduler='threads')

On a single CPU core, this computation takes about two and a half hours.

On an eight-GPU single-node system this computation takes nineteen seconds.

Combine Dask Array with CuPy

Actually this computation isn’t that impressive. It’s a simple workload, for which most of the time is spent creating and destroying random data. The computation and communication patterns are simple, reflecting the simplicity commonly found in data processing workloads.

What is impressive is that we were able to create a distributed parallel GPU array quickly by composing these three existing libraries:

  1. CuPy provides a partial implementation of Numpy on the GPU.

  2. Dask Array provides chunked algorithms on top of Numpy-like libraries like Numpy and CuPy.

    This enables us to operate on more data than we could fit in memory by operating on that data in chunks.

  3. The Dask distributed task scheduler runs those algorithms in parallel, easily coordinating work across many CPU cores or GPUs.

These tools already exist. We had to connect them together with a small amount of glue code and minor modifications. By mashing these tools together we can quickly build and switch between different architectures to explore what is best for our application.

For this example we relied on the following changes upstream:

Comparison among single/multi CPU/GPU

We can now easily run some experiments on different architectures. This is easy because …

  • We can switch between CPU and GPU by switching between Numpy and CuPy.
  • We can switch between single/multi-CPU-core and single/multi-GPU by switching between Dask’s different task schedulers.

These libraries allow us to quickly judge the costs of this computation for the following hardware choices:

  1. Single-threaded CPU
  2. Multi-threaded CPU with 40 cores (80 H/T)
  3. Single-GPU
  4. Multi-GPU on a single machine with 8 GPUs

We present code for these four choices below, but first, we present a table of results.

Results

Architecture        Time
Single CPU Core     2hr 39min
Forty CPU Cores     11min 30s
One GPU             1min 37s
Eight GPUs          19s

Setup

import cupy
import dask.array as da

# generate a chunked dask array of many numpy random arrays
rs = da.random.RandomState()
x = rs.normal(10, 1, size=(500000, 500000), chunks=(10000, 10000))

print(x.nbytes / 1e9)  # 2 TB
# 2000.0

CPU timing

(x + 1)[::2, ::2].sum().compute(scheduler='single-threaded')
(x + 1)[::2, ::2].sum().compute(scheduler='threads')

Single GPU timing

We switch from CPU to GPU by changing our data source to generate CuPy arrays rather than NumPy arrays. Everything else should more or less work the same without special handling for CuPy.

(This actually isn’t true yet, many things in dask.array will break for non-NumPy arrays, but we’re working on it actively both within Dask, within NumPy, and within the GPU array libraries. Regardless, everything in this example works fine.)

# generate a chunked dask array of many cupy random arrays
rs = da.random.RandomState(RandomState=cupy.random.RandomState)  # <-- we specify cupy here
x = rs.normal(10, 1, size=(500000, 500000), chunks=(10000, 10000))

(x + 1)[::2, ::2].sum().compute(scheduler='single-threaded')

Multi GPU timing

from dask.distributed import Client, LocalCUDACluster  # this is experimental

cluster = LocalCUDACluster()
client = Client(cluster)

(x + 1)[::2, ::2].sum().compute()

And again, here are the results:

Architecture        Time
Single CPU Core     2hr 39min
Forty CPU Cores     11min 30s
One GPU             1min 37s
Eight GPUs          19s

First, this is my first time playing with a 40-core system. I was surprised to see that many cores, and I was pleased to see that Dask’s normal threaded scheduler happily saturates them.

Later on, though, CPU utilization did dive down to around 5000-6000%, and if you do the math you’ll see that we’re not getting a full 40x speedup. My guess is that performance would improve with some mixture of threads and processes, like ten processes with eight threads each.

The jump from the biggest multi-core CPU to a single GPU is still an order of magnitude though. The jump to multi-GPU is another order of magnitude, and brings the computation down to 19s, which is short enough that I’m willing to wait for it to finish before walking away from my computer.

Actually, it’s quite fun to watch on the dashboard (especially after you’ve been waiting for three hours for the sequential solution to run):

Conclusion

This computation was simple, but the range in architecture just explored was extensive. We swapped out the underlying architecture from CPU to GPU (which had an entirely different codebase) and tried both multi-core CPU parallelism as well as multi-GPU many-core parallelism.

We did this in less than twenty lines of code, making this experiment something that an undergraduate student or other novice could perform at home. We’re approaching a point where experimenting with multi-GPU systems is approachable to non-experts (at least for array computing).

Here is a notebook for the experiment above

Room for improvement

We can work to expand the computation above in a variety of directions. There is a ton of work we still have to do to make this reliable.

  1. Use more complex array computing workloads

    The Dask Array algorithms were designed first around Numpy. We’ve only recently started making them more generic to other kinds of arrays (like GPU arrays, sparse arrays, and so on). As a result there are still many bugs when exploring these non-Numpy workloads.

    For example, if you were to switch sum for mean in the computation above you would get an error, because our mean computation contains an easy-to-fix bug that assumes NumPy arrays specifically.

  2. Use Pandas and cuDF instead of Numpy and CuPy

    The cuDF library aims to reimplement the Pandas API on the GPU, much like how CuPy reimplements the NumPy API. Using Dask DataFrame with cuDF will require some work on both sides, but is quite doable.

    I believe that there is plenty of low-hanging fruit here.

  3. Improve and move LocalCUDACluster

    The LocalCUDACluster class used above is an experimental Cluster type that creates as many workers locally as you have GPUs, and assigns each worker to prefer a different GPU. This makes it easy for people to load balance across GPUs on a single-node system without thinking too much about it. This appears to be a common pain point in the ecosystem today.

    However, the LocalCUDACluster probably shouldn’t live in the dask/distributed repository (it seems too CUDA specific) so will probably move to some dask-cuda repository. Additionally there are still many questions about how to handle concurrency on top of GPUs, balancing between CPU cores and GPU cores, and so on.

  4. Multi-node computation

    There’s no reason that we couldn’t accelerate computations like these further by using multiple multi-GPU nodes. This is doable today with manual setup, but we should also improve the existing deployment solutions dask-kubernetes, dask-yarn, and dask-jobqueue, to make this easier for non-experts who want to use a cluster of multi-GPU resources.

  5. Expense

    The machine I ran this on is expensive. Well, it’s nowhere close to as expensive to own and operate as a traditional cluster that you would need for these kinds of results, but it’s still well beyond the price point of a hobbyist or student.

    It would be useful to run this on a more budget system to get a sense of the tradeoffs on more reasonably priced systems. I should probably also learn more about provisioning GPUs on the cloud.

Come help!

If the work above sounds interesting to you then come help! There is a lot of low-hanging and high impact work to do.

If you’re interested in being paid to focus more on these topics, then consider applying for a job. The NVIDIA corporation is hiring around the use of Dask with GPUs.

That’s a fairly generic posting. If you’re interested but the posting doesn’t seem to fit, then please apply anyway and we’ll tweak things.

The Role of a Maintainer


What are the expectations and best practices for maintainers of open source software libraries? How can we do this better?

This post frames the discussion and then follows with best practices based on my personal experience and opinions. I make no claim that these are correct.

Let us Assume External Responsibility

First, the most common answer to this question is the following:

  • Q: What are expectations on OSS maintainers?
  • A: Nothing at all. They’re volunteers.

However, let’s assume for a moment that these maintainers are paid to maintain the project some modest amount, like 10 hours a week.

How can they best spend this time?

What is a Maintainer?

Next, let’s disambiguate the roles of developer, reviewer, and maintainer:

  1. Developers fix bugs and create features. They write code and docs and generally are agents of change in a software project. There are often many more developers than reviewers or maintainers.

  2. Reviewers are known experts in a part of a project and are called on to review the work of developers, mostly to make sure that the developers don’t break anything, but also to point them to related work, ensure common development practices, and pass on institutional knowledge. There are often more developers than reviewers, and more reviewers than maintainers.

  3. Maintainers are loosely aware of the entire project. They track ongoing work and make sure that it gets reviewed and merged in a timely manner. They direct the orchestra of developers and reviewers, making sure that they connect to each other appropriately, often serving as dispatcher.

    Maintainers also have final responsibility. If no reviewer can be found for an important contribution, they review. If no developer can be found to fix an important bug, they develop. If something goes wrong, it’s eventually the maintainer’s fault.

Best Practices

Now let’s get into the best practices of a maintainer, again assuming the context that they are paid to do this about 25% of their time for a moderately busy software project (perhaps 10-50 issues/contributions per day).

Timely Response

“Welcome Bob! Nice question. I’m currently a bit busy right now, but I think that if you look through these notes that they might point you in the right direction. I should have time to check back here by Thursday.”

The maintainer is often the first person a new contributor meets. Like a concierge or greeter at a hotel or restaurant, it’s good to say “Hi” when someone shows up, even if that’s all you can say at that moment.

When someone is raising an issue or contributing a patch, try to give them a response within 24 hours, even if it’s not a very helpful response. It’s scary to ask something in public, and much scarier to try to write code and contribute it to a project. Acknowledging the contributor’s question or work helps them to relax and feel welcome.

Answer Easy Questions Easily

“Thanks for the question Bob. I think that what you’re looking for is described in these docs. Also, in the future we welcome these sorts of questions on Stack Overflow.”

Answer simple questions quickly, and move on.

After answering, you might also direct new users and contributors to places like Stack Overflow where they can ask questions in the future.

Find Help for Hard Problems

“Hey Alice, you’re an expert on X, would you mind checking this out?”

You probably have a small team of expert reviewers to ask for help for tricky problems. Get to know them well and use them. It’s not your job to solve every problem, and you probably don’t have time anyway. It’s your job to make sure that the right reviewer sees the problem, and then track that it gets resolved.

But also, don’t overuse your reviewers. Everyone has a tolerance for how much they’re willing to help. You may have to spread things out a little. Getting to know your reviewers personally and learning their interests can help you to make decisions about when and where to use them.

But follow up

“Just checking in here. It looks like Reviewer Alice has asked for X. Developer Bob, is this something that you have time to do? Also it looks like Developer Bob has a question about Y. Reviewer Alice, do you have any suggestions for Bob on this?”

The reviewers you rely on are likely to swoop in, convey expertise, and then disappear back to their normal lives. Thank them for their help, and don’t rely on them to track the work to completion, that’s your job. You may have to direct conversation a bit.

We often see the following timeline:

  • Developer: “Hi! I made a patch for X!”
  • Maintainer: “Welcome! Nice patch! Hey Reviewer, you know X really well, could you take a look?”
  • Reviewer: “Hi Developer! Great patch! I found a bunch of things that were wrong with it, but I think that you can probably fix them easily!”
  • Developer: “Oh yeah! You’re right, great, I’ve fixed everything you said!”
  • silence

At this point, jump in again!

  • Maintainer: “OK, it looks like you’ve handled everything that the reviewer asked for. Nice work! Reviewer, anything else? Otherwise, I plan to merge this shortly”

    “Also, I notice that your code didn’t pass the linter. Please check out this doc for how to run our auto-linter.”

  • Developer: “Done!”
  • Maintainer: “OK, merging. Thanks Developer!”

This situation is nice, but not ideal from the maintainer’s perspective. Ideally the reviewer would finish things up, but often they don’t. In some cases it might even be their job to finish things up, but even then it’s your job to make sure that things get finished; if the reviewer doesn’t show up, it’s on the maintainer.

All responsibility eventually falls back onto the maintainer.

Prioritize your time

“It’d be great if you could provide a minimal example as described here. Please let us know if you’re able to manage this. Otherwise, given time constraints, I’m not sure what we can do to help further.”

Some contributors are awesome and do everything right. They ask questions with minimal reproducible examples. They provide well-tested code. Everything. They’re awesome.

Others aren’t as awesome. They ask for a lot of your time to help them solve problems that have little to do with your software. You probably can’t solve every problem well, and working on their problems steals important time away that you could be spending improving documentation or process.

Politely thank misinformed users and direct them towards standard documentation on how to ask questions and raise bug reports, like Stack Overflow’s MCVE guide or possibly this post on crafting minimal bug reports.

Get used to a lot of e-mail

“Ah, the joy of e-mail filters”

If you’re tracking every conversation then you’ll be getting hundreds of e-mails a day. You don’t need to read all of these in depth (this would take most of your 10 hours a week) but you probably should skim them. Don’t worry, with practice you can do this quickly.

You certainly don’t want this in your personal inbox, so you may want to invest a bit of time in organizing e-mail filters and labels. Personally I use GMail and with their filters and keyboard shortcuts I can usually blast through e-mail pretty quickly. (Hooray for j/k Vim-style bindings in GMail)

I check my Github inbox a few times a day. It usually takes 20-30 minutes in the morning (due to all of the people active when I’m asleep) and 10 minutes during other times in the day. I look forward to this time. It’s nice seeing a lot of people being productive.

Mostly during this period I’m looking for anyone who is blocked, and if necessary I respond with something to help unblock them.

But check the issue tracker periodically

“Whoops! I totally dropped this! Sorry!”

You’ll miss things. People will force-push to Github (which triggers no alert), or the last reply on an issue will be yours, with no response since. There are also just old issues and PRs that have slipped through and aren’t coming up in your e-mail.

So every day or two, it’s good to go through all issues and PRs that have a glimmer of life in them and check in. Often you’ll find that someone has done work and the reviewer has left, or that the developer has just left. Now is a good time to send a gentle message

Just checking in, what’s the status here?

Sometimes they’ll be busy at work and will come back in a couple days. Sometimes they’ll need a push in the right direction. Sometimes they’ll disappear completely, and you have to decide whether to finish the work or let it linger.

Weekends

“Now that I’m done with work for the week, I can finally finish up this PR. Wait, where did everybody go?”

This is hard, but a lot of developers aren’t contributing as part of their day job, and so they work primarily on the weekends. If they only work on the weekends and you only respond during the week, then iteration cycles get very long, which is frustrating for everyone and unlikely to result in successful work.

Personally I spend a bit of time in the morning and evening doing light maintenance. That’s a personal choice though.

Don’t let reviewers drag things out too long

“Can you just add one more thing?”

Developers want to get their work in quickly. Reviewers sometimes want to ask lots of questions, make small suggestions, and sometimes have very long conversations. At some point you need to step in and ask for a definitive set of necessary changes. This gives the developer a sense of finality, a light at the end of the tunnel.

Very long review periods are a common anti-pattern today; they are destructive to attracting new developers and contribute to reviewer burnout. A reviewer should ideally iterate once or twice with a developer with in-depth comments and then we should be done. This breaks down in a few ways:

  1. The Slow Rollout: The reviewer provides a few small suggestions, waits for changes, then provides a few more, and so on, resulting in many iterations.
  2. Serial Reviewers: After one reviewer finishes up, a second reviewer arrives with their own set of requested changes.
  3. Reviewer Disagreement: The two reviewers provide different suggestions and the developer makes changes and undoes those changes based on who spoke last.
  4. Endless Discussion: The reviewers then engage in very long and detailed technical conversation, scaring away the original developer.

This is reviewer breakdown, and it’s up to the maintainer to identify that it’s happening, step in and say the following:

OK, I think that what we have is a sufficient start. I recommend that we merge what we have here and continue the rest of this conversation in a separate issue

Thank Developers

“I appreciate all your work here. In particular I’m really happy about these tests!”

They look up to you. A small amount of praise from you can make their day and encourage them to continue contributing to open source.

Also, as with normal life, if you can call out some specific thing that they did well it becomes more personal.

Encourage Excellent Developers to Review, and Allow Excellent Reviewers to Maintain

“Hey Bob, this new work is similar to work that you’ve done before. Any interest in taking a look?”

Over time you will notice that a repeat developer returns to the project frequently, often to work on a particular topic. When a new contribution arrives in that part of the code, you might intentionally invite the repeat developer to review that work. You might invite them to become a reviewer.

Similarly, when you find that a skilled reviewer frequently handles issues on their own, you should find ways to give them ownership over that section of the code. Defer to their judgement. Make sure that they have commit rights and the ability to publish new packages (if the code is separate enough to allow for that). You should clear the way for them to become a maintainer.

To be clear, I wouldn’t encourage everyone. Even very good developers can be bad reviewers or maintainers. Bad reviewers can be unwelcoming and destructive to the process in a variety of ways. This activity requires social skills that aren’t universally held, regardless of programming skill.

Take a Vacation (But Tell Someone)

“I’d like to check out for a week. Alice, would you mind keeping an eye on things?”

Maintaining a project with a few peers is wonderful because it’s easier for people to take breaks and attend to their mental health. However, it’s important to make people aware of your absence during vacations or illness. A quick word to a colleague about an absence, expected or otherwise, can help to keep things running smoothly.

Many OSS projects today have a single core maintainer. This is hard on them and hard on the project (solo-maintainers tend to quickly become gruff). This post is designed with this problem in mind. Hopefully as we develop a vocabulary and conversation around the administrative sides of maintenance it will become easier to identify and encourage these behaviors in new maintainers.

Summary

“Thanks for taking the time”

Maintaining a project is not about being a great developer or a clever reviewer. It’s about enabling others, removing road-blocks before they arise, and identifying and resolving difficult social situations. It has much more to do with logistics, coordination, and social behaviors than it has to do with algorithms and version control.

I have to admit, I’m not great at this, but I’m trying to become better. Maintaining a software project is a learned skill and takes effort. The reward can be significant though.

A well maintained project is pleasant to work on and attracts a productive team of friendly developers and reviewers that support each other. It’s also a great way to learn more about how people use your software project than you ever could while writing code. The activity of maintaining software gives you enough exposure to see where the project is headed and what’s possible going forward.

Write Short Blogposts


I encourage my colleagues to write blogposts more frequently. This is for a few reasons:

  1. It informs your broader community what you’re up to, and allows that community to communicate back to you quickly.

    You communicating to the community fosters a sense of collaboration, openness, and trust. You gain collaborators, build momentum behind your work, and curate a body of knowledge that early adopters can consume to become experts quickly.

    Getting feedback from your community helps you to course-correct early in your work, and stops you from wasting time in inefficient courses of action.

    You can only work for a long time without communicating if you are either entirely confident in what you’re doing, or reckless, or both.

  2. It increases your visibility, and so is good for your career.

    I have a great job. I find my work to be both intellectually stimulating and ethically worthwhile. I’m also decently well paid for it.

    I have this job largely because a potential employer a long time ago read my blog when I was just starting out.

  3. It focuses you towards tasks that are relevant to others.

    When you set an expectation of regularly publishing a blogpost, you find yourself seeking out work that will engage people. This is a nice objective to try to optimize, and keeps you on task serving the general public.

    Obviously, you don’t want to take this too far, but my guess is that most developers could use a little more orientation towards public interest.

  4. It improves your communication skills

    Your effort is valuable only if people understand and adopt it. This is true for technical programming, but also true for most efforts. Writing regularly encourages you to think about effective communication.

But I’m too busy

Ah! Indeed.

How are we supposed to get work done if we’re writing blogposts all the time?

We should reduce our standards. Blogposts don’t have to be big endeavors that solve a huge problem. They are not scientific journal articles. They can just be longer-than-average tweets.

We can write short blogposts in a fixed amount of time. Personally, I like timing myself (I started this blogpost just 18 minutes ago) and time-boxing myself (I plan to publish it this afternoon before my next meeting).

One of my favorite bloggers, John D. Cook, writes an excellent blog that follows this concise approach. Here are some recent posts from his blog:

These are all about a page long, and are a great read. Honestly, I probably wouldn’t read a scientific journal article on any of these topics (they’re not my field), but I know that John’s posts are concise, so I’m happier to dive in than I would be for someone who I know writes very long posts.

Summary

You should write more. It can be easy if you don’t make a big deal out of it.

Timings

  • 12:30: Talking to a colleague about writing short blogposts. Decide that I should stop one-on-one communication, and just write a blogpost about this.
  • 12:51: Start writing
  • 13:16: Finish first draft, render it to take a look and edit
  • 13:25: Finish
  • 13:26: Publish
  • 13:27: Tweet
  • 13:29: Back to work!
  • 13:32: Oops! Typo.

HTML outputs in Jupyter


Summary

User interaction in data science projects can be improved by adding a small amount of visual design.

To motivate effort around visual design we show several simple-yet-useful examples. The code behind these examples is small and accessible to most Python developers, even if they don’t have much HTML experience.

This post in particular focuses on Jupyter’s ability to add HTML output to any object. These outputs can either be full-fledged interactive widgets, or just rich static outputs like tables or diagrams. We hope that by showing examples here we will inspire some thoughts in other projects.

This post was supported by replies to this tweet. The rest of this post is just examples.
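Before the library examples, here is a toy sketch of the mechanism itself (my own example, not from any project below): any Python object can define a _repr_html_ method, and Jupyter will render the returned HTML in place of the plain text repr.

class Temperature:
    """Toy object that renders itself as colored HTML in Jupyter."""

    def __init__(self, celsius):
        self.celsius = celsius

    def __repr__(self):
        return "Temperature({})".format(self.celsius)   # plain consoles see this

    def _repr_html_(self):
        color = "red" if self.celsius > 30 else "blue"
        return "<b style='color: {}'>{} &deg;C</b>".format(color, self.celsius)


Temperature(35)   # a notebook cell renders this as bold red text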

Iris

I originally decided to write this post after reading a blogpost from the UK Met Office that included the HTML output of their library Iris.

(work by Peter Killick, post by Theo McCaie)

The fact that the output provided by an interactive session is the same output that you would provide in a published result helps everyone. The interactive user gets high quality feedback, and there is no productionizing step as we move from the nitty-gritty of development to polish of publication.

Shapely

Shapely deals with shapes, such as those used in geographic information systems. Shapely objects render themselves as shapes in Jupyter using SVG.

from shapely.geometry import Polygon

Polygon([(0, 0), (0, 1), (1, 0)])

Note that the image that you see above isn’t an image like a PNG or JPEG. It’s SVG. If you were to look at the source of this post you would see the actual values and content of the triangle.

(Thanks to Andy Jones for the pointer)

Logs

This logs widget provides collapsible log outputs from a set of workers in a cluster.

The code for these is simple. Here’s the implementation for the logs widget:

class Log(str):
    """A container for logs."""

    def _widget(self):
        from ipywidgets import HTML
        return HTML(value="<pre><code>{logs}</code></pre>".format(logs=self))

    def _ipython_display_(self, **kwargs):
        return self._widget()._ipython_display_(**kwargs)


class Logs(dict):
    """A container for multiple logs."""

    def _widget(self):
        from ipywidgets import Accordion
        accordion = Accordion(children=[log._widget() for log in self.values()])
        [accordion.set_title(i, title) for i, title in enumerate(self.keys())]
        return accordion

    def _ipython_display_(self, **kwargs):
        return self._widget()._ipython_display_(**kwargs)
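A small usage sketch (the worker names and log text here are invented):

logs = Logs({
    "worker-1": Log("INFO task started\nINFO task finished"),
    "worker-2": Log("INFO connection established"),
})
logs   # in a notebook this renders as an Accordion with one collapsible <pre> per worker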

This solves a common usability problem, of getting blasted with a ton of output in Jupyter and then having to scroll around. Using relatively simple HTML helps us avoid this deluge of information and makes it much more nicely navigable.

(Work by Jacob Tomlinson)

Xarray

Here is another in a PR for Xarray.

It’s all about optionally exposing information; you can play with it over at nbviewer.

(Work by Benoit Bovy)

Snakeviz

The Snakeviz project makes it easy to visualize and interact with profiling output. I personally love this project (thanks Matt Davis!).

Now that it can embed its output directly in Jupyter it’s even easier to use.

Snakeviz output in Jupyter

(Thanks to Andy Jones for the tip)

SymPy

SymPy uses unicode for its rich outputs. This makes it valid both in the notebook and also in the console.

SymPy rich output

These outputs are critical when trying to understand complex mathematical expressions. They make SymPy practical for day-to-day symbolic work.
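For instance (a small example of my own, not taken from the SymPy documentation):

import sympy

sympy.init_printing()                   # enable Unicode / LaTeX pretty printing

x = sympy.Symbol("x")
sympy.Integral(sympy.sqrt(1 / x), x)    # Unicode art in the console, LaTeX in the notebook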

PyMC3

As Colin Carroll says, “PyMC3 borrowed _repr_latex_ from sympy, and the graphviz graph from dask-delayed”

PyMC3 rich output

Dask Array

Inspired by these, I took a look at how we render Dask Arrays to the screen:

In [1]: import dask.array as da

In [2]: da.ones((10, 1024, 1024), chunks=(1, 256, 256))
Out[2]: dask.array<ones, shape=(10, 1024, 1024), dtype=float64, chunksize=(1, 256, 256)>

This is compact, but it could be improved. I spent a day and made this:

dask/dask #4794

It was designed around common problems I’ve had in trying to convey information about chunking to new users of the library (this commonly translates into performance problems for novices). I found myself relying on it for chunking information almost immediately. I think it will do a lot of good, it looks nice, and was easy to implement.

Richer Visualization Libraries

The examples above are all fairly simple relative to proper visualization or widget libraries like Bokeh, Altair, IPyLeaflet, and more. These libraries use far more sophisticated HTML/JS to get far more exciting results.

I’m not going to go too deeply into these more sophisticated libraries because my main intention here is to encourage a broader use of rich outputs by the rest of us, even those of us that are not sophisticated with web frontend technologies.

Final thoughts

We should do more of this. It’s easy and impactful.

Even if you know only a little bit of HTML you can probably have a large impact here. Even very sophisticated and polished libraries like NumPy or Scikit-Learn have, I believe, a lot of low hanging fruit here. It’s a good place to have impact.
