This work is supported by Continuum Analytics, the XDATA Program, and the Data Driven Discovery Initiative from the Moore Foundation.
To increase transparency I’m blogging weekly about the work done on Dask and related projects during the previous week. This log covers work done between 2017-01-01 and 2017-01-17. Nothing here is ready for production. This blogpost is written in haste, so polished prose should not be expected.
Themes of the last couple of weeks:
- Stability enhancements for the distributed scheduler and micro-release
- NASA Grant writing
- Dask-EC2 script
- Dataframe categorical flexibility (work in progress)
- Communication refactor (work in progress)
Stability enhancements and micro-release
We’ve released dask.distributed version 1.15.1, which includes important bugfixes after the recent 1.15.0 release. A number of small issues conspired to erroneously remove tasks. This was generally OK because the Dask scheduler was able to heal the missing pieces (using the same machinery that makes Dask resilient), and so we didn’t notice the flaw until the system was deployed in some of the more serious Dask deployments in the wild. PR dask/distributed #804 contains a full writeup in case anyone is interested. The writeup ends with the following line:
This was a nice exercise in how coupling mostly-working components can easily yield a faulty system.
This release also includes other fixes, such as compatibility with the new Bokeh 0.12.4 release.
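If you want to confirm that you have picked up the micro-release, a minimal sketch is to check the installed version from Python (the printed string is whatever your environment actually has):

import distributed

# 1.15.1 is the micro-release that includes the task-release fix above
print(distributed.__version__)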
NASA Grant Writing
I’ve been writing a proposal to NASA to help fund distributed Dask+XArray work for atmospheric and oceanographic science at the 100TB scale. Many thanks to our scientific collaborators who are offering support here.
Dask-EC2 startup
The Dask-EC2 project deploys Anaconda, a Dask cluster, and Jupyter notebooks on Amazon’s Elastic Compute Cloud (EC2) with a small command line interface:
pip install dask-ec2 --upgrade
dask-ec2 up --keyname KEYNAME \
--keypair /path/to/ssh-key \
--type m4.2xlarge \
--count 8
This project can be either very useful for people just getting started and for Dask developers when we run benchmarks, or it can be horribly broken if AWS or Dask interfaces change and we don’t keep this project maintained. Thanks to a great effort from Ben Zaitlen, dask-ec2 is again in the very useful state, where I’m hoping it will stay for some time.
If you’ve always wanted to try Dask on a real cluster and you already have AWS credentials, then this is probably the easiest way to get started.
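Once the cluster is up you can connect to it from a Python session. The sketch below is a rough example, assuming a placeholder scheduler address; dask-ec2 reports the real address when it finishes:

from dask.distributed import Client
import dask.array as da

# Connect to the remote scheduler; replace the placeholder address below
# with the one that `dask-ec2 up` prints
client = Client('54.0.0.1:8786')

# Run a small computation on the cluster to confirm everything is wired up
x = da.random.random((10000, 10000), chunks=(1000, 1000))
print(x.mean().compute())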
This already seems to be paying dividends. There have been a few unrelated pull requests from new developers this week.
Dataframe Categorical Flexibility
Categoricals can significantly improve performance on text-based data. Currently Dask’s dataframes support categoricals, but they expect to know all of the categories up-front. This is easy if this set is small, like the ["Healthy", "Sick"] categories that might arise in medical research, but requires a full dataset read if the categories are not known ahead of time, like the names of all of the patients.
Jim Crist is changing this so that Dask can operate on categorical columns with unknown categories at dask/dask #1877. The constituent pandas dataframes all have possibly different categories that are merged as necessary. This distinction may seem small, but it limits performance in a surprising number of real-world use cases.
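As a rough illustration of the current behavior (a small, hypothetical example, not taken from the PR), converting a text column today goes through a categorize() pass that reads the data to learn the full set of categories:

import pandas as pd
import dask.dataframe as dd

# A tiny, hypothetical text column standing in for a much larger dataset
pdf = pd.DataFrame({'status': ['Healthy', 'Sick', 'Healthy', 'Sick']})
ddf = dd.from_pandas(pdf, npartitions=2)

# Today, categorize() scans the data to discover the global set of categories
ddf = ddf.categorize(columns=['status'])
print(ddf.dtypes)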
Communication Refactor
Since the recent worker refactor and optimizations it has become clear that inter-worker communication is a dominant bottleneck in some intensive applications. Antoine Pitrou is currently refactoring Dask’s network communication layer, making room for more communication options in the future. This is an ambitious project, and I for one am very happy to have someone like Antoine looking into it.