Working notes by Matthew Rocklin

PyData and the GIL

March 9, 2015, 5:00 pm

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project. tl;dr Many PyData projects release the GIL. Multi-core parallelism is alive and well. Introduction: Machines...

Towards Out-of-core DataFrames

March 10, 2015, 5:00 pm

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project. This post primarily targets developers. It is on experimental code that is not ready for users. tl;dr Can...

Efficiently Store Pandas DataFrames

March 15, 2015, 5:00 pm

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project. tl;dr We benchmark several options to store Pandas DataFrames to disk. Good options exist for numeric...

Partition and Shuffle

March 24, 2015, 5:00 pm

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project. This post primarily targets developers. tl;dr We partition out-of-core dataframes efficiently. Partition...

Profiling Data Throughput

April 20, 2015, 5:00 pm

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project. Disclaimer: This post is on experimental/buggy code. tl;dr We measure the costs of processing...

State of Dask

May 18, 2015, 5:00 pm

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project. tl;dr We lay out the pieces of Dask, a system for parallel computing. Introduction: Dask started five months...

Pandas Categoricals

June 17, 2015, 5:00 pm

tl;dr: Pandas Categoricals efficiently encode and dramatically improve performance on data with text categories. Disclaimer: Categoricals were created by the Pandas development team and not by me. There...

Distributed Scheduling

June 22, 2015, 5:00 pm

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project. tl;dr: We evaluate dask graphs with a variety of schedulers and introduce a new distributed memory...

Write Complex Parallel Algorithms

June 25, 2015, 5:00 pm

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project. tl;dr: We discuss the use of complex dask graphs for non-trivial algorithms. We show off an on-disk...

Custom Parallel Workflows

July 22, 2015, 5:00 pm

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project. tl;dr: We motivate the expansion of parallel programming beyond big collections. We discuss the usability...

Caching

August 2, 2015, 5:00 pm

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project. tl;dr: Caching improves performance under repetitive workloads. Traditional LRU policies don’t fit data...

A Weekend with Asyncio

August 9, 2015, 5:00 pm

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project. tl;dr: I learned asyncio and rewrote part of dask.distributed with it; this details my...

Efficient Tabular Storage

August 27, 2015, 5:00 pm

tl;dr: We discuss efficient techniques for on-disk storage of tabular data, notably the following: binary stores, column stores, categorical support, compression, and indexed/partitioned stores. We use NYCTaxi...

Distributed Prototype

October 8, 2015, 5:00 pm

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project. tl;dr: We demonstrate a prototype distributed computing library and discuss data locality. Distributed...

Data Bandwidth

December 28, 2015, 4:00 pm

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project. tl;dr: We list and combine common bandwidths relevant in data science. Understanding data bandwidths helps...

Disk Bandwidth

December 28, 2015, 4:00 pm

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project. tl;dr: Disk read and write bandwidths depend strongly on block size. Disk read/write bandwidths on...

Introducing Dask distributed

February 16, 2016, 4:00 pm

tl;dr: We analyze JSON data on a cluster using pure Python projects. Dask, a Python library for parallel computing, now works on clusters. During the past few months I and others have extended dask with...

Pandas on HDFS with Dask Dataframes

February 21, 2016, 4:00 pm

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project. In this post we use Pandas in parallel across an HDFS cluster to read CSV data. We coordinate these...

Distributed Dask Arrays

February 25, 2016, 4:00 pm

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project. In this post we analyze weather data across a cluster using NumPy in parallel with dask.array. We focus...

Dask EC2 Startup Script

March 15, 2016, 5:00 pm

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project. A screencast version of this post is available here: https://youtu.be/KGlhU9kSfVk Summary: Copy-pasting the...
