Channel: Working notes by Matthew Rocklin - SciPy
Browsing all 100 articles

PyData and the GIL

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project. tl;dr: Many PyData projects release the GIL. Multi-core parallelism is alive and well. Introduction: Machines...

Towards Out-of-core DataFrames

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project. This post primarily targets developers. It is on experimental code that is not ready for users. tl;dr: Can...

Efficiently Store Pandas DataFrames

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project. tl;dr: We benchmark several options to store Pandas DataFrames to disk. Good options exist for numeric...

Partition and Shuffle

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project. This post primarily targets developers. tl;dr: We partition out-of-core dataframes efficiently. Partition...

Profiling Data Throughput

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project. Disclaimer: This post is on experimental/buggy code. tl;dr: We measure the costs of processing...

State of Dask

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project. tl;dr: We lay out the pieces of Dask, a system for parallel computing. Introduction: Dask started five months...

Pandas Categoricals

tl;dr: Pandas Categoricals efficiently encode and dramatically improve performance on data with text categories. Disclaimer: Categoricals were created by the Pandas development team and not by me. There...

Distributed Scheduling

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project. tl;dr: We evaluate dask graphs with a variety of schedulers and introduce a new distributed memory...

Write Complex Parallel Algorithms

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project. tl;dr: We discuss the use of complex dask graphs for non-trivial algorithms. We show off an on-disk...

Custom Parallel Workflows

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project. tl;dr: We motivate the expansion of parallel programming beyond big collections. We discuss the usability...

Caching

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project. tl;dr: Caching improves performance under repetitive workloads. Traditional LRU policies don’t fit data...

A Weekend with Asyncio

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project. tl;dr: I learned asyncio and rewrote part of dask.distributed with it; this details my...

Efficient Tabular Storage

tl;dr: We discuss efficient techniques for on-disk storage of tabular data, notably the following: binary stores, column stores, categorical support, compression, and indexed/partitioned stores. We use NYCTaxi...

Distributed Prototype

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project. tl;dr: We demonstrate a prototype distributed computing library and discuss data locality. Distributed...

Data Bandwidth

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project. tl;dr: We list and combine common bandwidths relevant in data science. Understanding data bandwidths helps...

Disk Bandwidth

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project. tl;dr: Disk read and write bandwidths depend strongly on block size. Disk read/write bandwidths on...

Introducing Dask distributed

tl;dr: We analyze JSON data on a cluster using pure Python projects. Dask, a Python library for parallel computing, now works on clusters. During the past few months I and others have extended dask with...

Pandas on HDFS with Dask Dataframes

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project. In this post we use Pandas in parallel across an HDFS cluster to read CSV data. We coordinate these...

Distributed Dask Arrays

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project. In this post we analyze weather data across a cluster using NumPy in parallel with dask.array. We focus...

Dask EC2 Startup Script

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project. A screencast version of this post is available here: https://youtu.be/KGlhU9kSfVk Summary: Copy-pasting the...
