PyData and the GIL
This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project. tl;dr Many PyData projects release the GIL. Multi-core parallelism is alive and well. Introduction: Machines...
Towards Out-of-core DataFrames
This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project. This post primarily targets developers. It is on experimental code that is not ready for users. tl;dr Can...
Efficiently Store Pandas DataFrames
This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project. tl;dr We benchmark several options to store Pandas DataFrames to disk. Good options exist for numeric...
Partition and Shuffle
This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project. This post primarily targets developers. tl;dr We partition out-of-core dataframes efficiently. Partition...
Profiling Data Throughput
This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project. Disclaimer: This post is on experimental/buggy code. tl;dr We measure the costs of processing...
State of Dask
This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project. tl;dr We lay out the pieces of Dask, a system for parallel computing. Introduction: Dask started five months...
Pandas Categoricals
tl;dr: Pandas Categoricals efficiently encode and dramatically improve performance on data with text categories. Disclaimer: Categoricals were created by the Pandas development team and not by me. There...
Distributed Scheduling
This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project. tl;dr: We evaluate dask graphs with a variety of schedulers and introduce a new distributed memory...
Write Complex Parallel Algorithms
This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project. tl;dr: We discuss the use of complex dask graphs for non-trivial algorithms. We show off an on-disk...
Custom Parallel Workflows
This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project. tl;dr: We motivate the expansion of parallel programming beyond big collections. We discuss the usability...
Caching
This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project. tl;dr: Caching improves performance under repetitive workloads. Traditional LRU policies don’t fit data...
A Weekend with Asyncio
This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project. tl;dr: I learned asyncio and rewrote part of dask.distributed with it; this details my...
Efficient Tabular Storage
tl;dr: We discuss efficient techniques for on-disk storage of tabular data, notably the following: binary stores, column stores, categorical support, compression, and indexed/partitioned stores. We use NYCTaxi...
Distributed Prototype
This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project. tl;dr: We demonstrate a prototype distributed computing library and discuss data locality. Distributed...
Data Bandwidth
This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project. tl;dr: We list and combine common bandwidths relevant in data science. Understanding data bandwidths helps...
Disk Bandwidth
This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project. tl;dr: Disk read and write bandwidths depend strongly on block size. Disk read/write bandwidths on...
Introducing Dask distributed
tl;dr: We analyze JSON data on a cluster using pure Python projects. Dask, a Python library for parallel computing, now works on clusters. During the past few months I and others have extended dask with...
Pandas on HDFS with Dask Dataframes
This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project. In this post we use Pandas in parallel across an HDFS cluster to read CSV data. We coordinate these...
Distributed Dask Arrays
This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project. In this post we analyze weather data across a cluster using NumPy in parallel with dask.array. We focus...
Dask EC2 Startup Script
This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project. A screencast version of this post is available here: https://youtu.be/KGlhU9kSfVk Summary: Copy-pasting the...