This work is supported by Continuum Analytics, the XDATA Program, and the Data Driven Discovery Initiative from the Moore Foundation.
To increase transparency I’m blogging weekly about the work done on Dask and related projects during the previous week. This log covers work done between 2017-01-01 and 2017-01-17. Nothing here is ready for production. This blogpost is written in haste, so polished prose should not be expected.
Themes of the last couple of weeks:
- Stability enhancements for the distributed scheduler and micro-release
- NASA Grant writing
- Dask-EC2 script
- Dataframe categorical flexibility (work in progress)
- Communication refactor (work in progress)
Stability enhancements and micro-release
We’ve released dask.distributed version 1.15.1, which includes important bugfixes after the recent 1.15.0 release. A number of small issues conspired to erroneously remove tasks. This was generally OK because the Dask scheduler was able to heal the missing pieces (using the same machinery that makes Dask resilient), and so we didn’t notice the flaw until the system was deployed in some of the more serious Dask deployments in the wild. PR dask/distributed #804 contains a full writeup in case anyone is interested. The writeup ends with the following line:
This was a nice exercise in how coupling mostly-working components can easily yield a faulty system.
This release also includes other fixes, such as compatibility with the new Bokeh 0.12.4 release.
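If you want to confirm that you have picked up the micro-release, a minimal sketch is to check the installed version from Python (the printed string is whatever your environment actually has):

import distributed

# 1.15.1 is the micro-release that includes the task-release fix above
print(distributed.__version__)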
NASA Grant Writing
I’ve been writing a proposal to NASA to help fund distributed Dask+XArray work for atmospheric and oceanographic science at the 100TB scale. Many thanks to our scientific collaborators who are offering support here.
Dask-EC2 startup
The Dask-EC2 project deploys Anaconda, a Dask cluster, and Jupyter notebooks on Amazon’s Elastic Compute Cloud (EC2) with a small command line interface:
pip install dask-ec2 --upgrade
dask-ec2 up --keyname KEYNAME \
--keypair /path/to/ssh-key \
--type m4.2xlarge \
--count 8
This project can be either very useful for people just getting started and for Dask developers when we run benchmarks, or it can be horribly broken if AWS or Dask interfaces change and we don’t keep this project maintained. Thanks to a great effort from Ben Zaitlen, dask-ec2 is again in the very useful state, where I’m hoping it will stay for some time.
If you’ve always wanted to try Dask on a real cluster and you already have AWS credentials, then this is probably the easiest way to get started.
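Once the cluster is up you can connect to it from a Python session. The sketch below is a rough example, assuming a placeholder scheduler address; dask-ec2 reports the real address when it finishes:

from dask.distributed import Client
import dask.array as da

# Connect to the remote scheduler; replace the placeholder address below
# with the one that `dask-ec2 up` prints
client = Client('54.0.0.1:8786')

# Run a small computation on the cluster to confirm everything is wired up
x = da.random.random((10000, 10000), chunks=(1000, 1000))
print(x.mean().compute())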
This already seems to be paying dividends. There have been a few unrelated pull requests from new developers this week.
Dataframe Categorical Flexibility
Categoricals can significantly improve performance on text-based data. Currently Dask’s dataframes support categoricals, but they expect to know all of the categories up-front. This is easy if this set is small, like the ["Healthy", "Sick"] categories that might arise in medical research, but requires a full dataset read if the categories are not known ahead of time, like the names of all of the patients.
Jim Crist is changing this so that Dask can operate on categorical columns with unknown categories at dask/dask #1877. The constituent pandas dataframes all have possibly different categories that are merged as necessary. This distinction may seem small, but it limits performance in a surprising number of real-world use cases.
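As a rough illustration of the current behavior (a small, hypothetical example, not taken from the PR), converting a text column today goes through a categorize() pass that reads the data to learn the full set of categories:

import pandas as pd
import dask.dataframe as dd

# A tiny, hypothetical text column standing in for a much larger dataset
pdf = pd.DataFrame({'status': ['Healthy', 'Sick', 'Healthy', 'Sick']})
ddf = dd.from_pandas(pdf, npartitions=2)

# Today, categorize() scans the data to discover the global set of categories
ddf = ddf.categorize(columns=['status'])
print(ddf.dtypes)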
Communication Refactor
Since the recent worker refactor and optimizations it has become clear that inter-worker communication is a dominant bottleneck in some intensive applications. Antoine Pitrou is currently refactoring Dask’s network communication layer, making room for more communication options in the future. This is an ambitious project, and I for one am very happy to have someone like Antoine looking into it.