Posts by null

Writing Parquet Files with Dask using to_parquet

01 January 2022

This blog post explains how to write Parquet files with Dask using the to_parquet method.

Read more ...

Why we passed on Kubernetes

01 January 2022

Kubernetes is great if you need to organize many always-on services and have in-house expertise, but can add an extra burden and abstraction when deploying a single bursty service like Dask, especially in a user environment with quickly changing needs.

Read more ...

Use Mambaforge to Conda Install PyData Stack on your Apple M1 Silicon Machine

01 January 2022

Running PyData libraries on an Apple M1 machine requires you to use an ARM64-tailored version of conda-forge. This article provides a step-by-step guide of how to set that up on your machine using mambaforge.

Read more ...

Understanding Managed Dask (Dask as a Service)

01 January 2022

‍Dask is a flexible Python library for parallel and distributed computing. There are a number of ways you can create Dask clusters, each with their own benefits. In this article, we explore how Coiled provides a managed cloud infrastructure solution for Dask users, addressing:

Read more ...

Tackling unmanaged memory with Dask

01 January 2022

TL;DR: unmanaged memory is RAM that the Dask scheduler is not directly aware of and which can cause workers to run out of memory and cause computations to hang and crash.

Read more ...

Speed up a pandas query 10x with these 6 Dask DataFrame tricks

01 January 2022

Updated April 18th, 2024: For Dask versions >= 2024.3.0, Dask will perform a number of the optimizations discussed in this blog post automatically. See this demo for more details.

Read more ...

Spark to Dask: The Good, Bad, and Ugly of Moving from Spark to Dask

01 January 2022

Apache Spark has long been a popular tool for handling petabytes of data in analytics applications. It’s often used for big data and machine learning, and most organizations use it with cloud infrastructure to run models and build algorithms. Spark is no doubt a fast analytical tool that provides high-speed queries for large datasets, but recent client testimonials tell us that Dask is even faster. So, what should you keep in mind when moving from Spark to Dask?

Read more ...

Seven Stages of Open Software

01 January 2022

This post lays out the different stages of openness in Open Source Software (OSS) and the benefits and costs of each.

Read more ...

Setting a Dask DataFrame index

01 January 2022

This post demonstrates how to change a DataFrame index with set_index and explains when you should perform this operation.

Read more ...

Search at Grubhub and User Intent

01 January 2022

Alex Egg, Senior Data Scientist at Grubhub, joins Matt Rocklin and Hugo Bowne-Anderson to talk and code about how Dask and distributed compute are used throughout the user intent classification pipeline at Grubhub!

Read more ...

Scikit-learn in the Cloud with Coiled

01 January 2022

You can train scikit-learn models in parallel using the scikit-learn joblib interface. This allows scikit-learn to take full advantage of the multiple cores in your machine (or, spoiler alert, on your cluster) and speed up training.

Read more ...

Scale your data science workflows with Python and Dask

01 January 2022

Data Scientists are increasingly using Python and the Python ecosystem of tools for their analysis. Combined with the growing popularity of big data, this brings the challenge of scaling data science workflows. Dask is a library built for this exact purpose - making it easy to scale your Python code, and serve as a toolbox for distributed computing!

Read more ...

Save Money with Spot

01 January 2022

The cloud is wonderful but expensive.

Read more ...

Repartitioning Dask DataFrames

01 January 2022

This article explains how to redistribute data among partitions in a Dask DataFrame with repartitioning…

Read more ...

Reducing memory usage in Dask workloads by 80%

01 January 2022

There’s a saying in emergency response: “slow is smooth, smooth is fast”.

Read more ...

Reduce memory usage with Dask dtypes

01 January 2022

Columns in Dask DataFrames are typed, which means they can only hold certain values (e.g. integer columns can’t hold string values). This post gives an overview of DataFrame datatypes (dtypes), explains how to set dtypes when reading data, and shows how to change column types.

Read more ...

PyArrow Strings in Dask DataFrames

01 January 2022

pandas 2.0 has been released! 🎉

Read more ...

Prioritizing Pragmatic Performance for Dask

01 January 2022

Many people say the following to me:

Read more ...

Perform a Spatial Join in Python

01 January 2022

This blog explains how to perform a spatial join in Python. Knowing how to perform a spatial join is an important asset in your data-processing toolkit: it enables you to join two datasets based on spatial predicates. For example, you can join a point-based dataset with a polygon-based dataset based on whether the points fall within the polygon.

Read more ...

Introducing the Dask Active Memory Manager

01 January 2022

Historically, out-of-memory errors and excessive memory requirements have frequently been a pain point for Dask users. Two of the main causes of memory-related headaches are data duplication and imbalance between workers.

Read more ...

How to Merge Dask DataFrames

01 January 2022

This post demonstrates how to merge Dask DataFrames and discusses important considerations when making large joins.

Read more ...

How to Convert a pandas Dataframe into a Dask Dataframe

01 January 2022

In this post, we will cover:

Read more ...

How Coiled sets memory limit for Dask workers

01 January 2022

While running workloads to test Dask reliability, we noticed that some workers were freezing or dying when the OS stepped in and started killing processes when the system ran out of memory.

Read more ...

Filtering Dask DataFrames with loc

01 January 2022

This post explains how to filter Dask DataFrames based on the DataFrame index and on column values using loc.

Read more ...

Enterprise Dask Support

01 January 2022

Along with the Cloud SaaS product, Coiled sells enterprise support for Dask. Mostly people buy this for these three things:

Read more ...

Easily Run Python Functions in Parallel

01 January 2022

When you search for how to run a Python function in parallel, one of the first things that comes up is the multiprocessing module. The documentation describes parallelism in terms of processes versus threads and mentions it can side-step the infamous Python GIL (Global Interpreter Lock).

Read more ...

Dask on GCP

01 January 2022

‍Dask is a flexible Python library for parallel and distributed computing. There are a number of ways you can create Dask clusters, each with their own benefits. In this article, we explore how Coiled provides a managed cloud infrastructure solution for deploying Dask on Google Cloud, addressing:

Read more ...

Dask on Azure

01 January 2022

‍Dask is a flexible Python library for parallel and distributed computing. There are a number of ways you can create Dask clusters, each with their own benefits. In this article, we explore how Coiled provides a managed cloud infrastructure solution for deploying Dask on Azure, addressing:

Read more ...

Dask on AWS

01 January 2022

‍Dask is a flexible Python library for parallel and distributed computing. There are a number of ways you can create Dask clusters, each with their own benefits. In this article, we explore how Coiled provides a managed cloud infrastructure solution for deploying Dask on AWS, addressing:

Read more ...

Dask for Parallel Python

01 January 2022

Dask is a general purpose library for parallel computing. Dask can be used on its own to parallelize Python code, or with integrations to other popular libraries to scale out common workflows.

Read more ...

Dask and the PyData Stack

01 January 2022

The PyData stack for scientific computing in Python is an interoperable collection of tools for data analysis, visualization, statistical inference, machine learning, interactive computing and more that is used across all types of industries and academic research. Dask, the open source package for scalable data science that was developed to meet the needs of modern data professionals, was born from the PyData community and is considered foundational in this computational stack. This post describes a schematic for thinking about the PyData stack, along with detailing the technical, cultural, and community forces that led to Dask becoming the go-to package for scalable analytics in Python, including its interoperability with packages such as NumPy and pandas, among many others.

Read more ...

Dask Read Parquet Files into DataFrames with read_parquet

01 January 2022

This blog post explains how to read Parquet files into Dask DataFrames. Parquet is a columnar, binary file format that has multiple advantages when compared to a row-based file format like CSV. Luckily Dask makes it easy to read Parquet files into Dask DataFrames with read_parquet.

Read more ...

Creating Disk Partitioned Lakes with Dask using partition_on

01 January 2022

This post explains how to create disk-partitioned Parquet lakes using partition_on and how to read disk-partitioned lakes with read_parquet and filters. Disk partitioning can significantly improve performance when used correctly.

Read more ...

Cost Savings with Dask and Coiled

01 January 2022

Coiled can often save money for an organization running Dask. This article goes through the most common ways in which we see that happen.

Read more ...

Convert Large JSON to Parquet with Dask

01 January 2022

You can use Coiled, the cloud-based Dask platform, to easily convert large JSON data into a tabular DataFrame stored as Parquet in a cloud object-store. Start off by iterating with Dask locally first to build and test your pipeline, then transfer the same workflow to Coiled with minimal code changes. We demonstrate a JSON to Parquet conversion for a 75GB dataset that runs without downloading the dataset to your local machine.

Read more ...

Coiled Cloud Architecture

01 January 2022

Running Dask in the cloud is easy.

Read more ...

Code Formatting Jupyter Notebooks with Black

01 January 2022

Black is an amazing Python code formatting tool that automatically corrects your code.

Read more ...

Cloud ETL automation with Github Actions and Coiled

01 January 2022

Github Actions let you launch automated jobs from your Github repository. Coiled lets you run your Python code in the cloud. Combining the two gives you lightweight workflow orchestration for heavy ETL (extract-transform-load) jobs that can run in the cloud without any complicated infrastructure provisioning or DevOps.

Read more ...

Better Shuffling in Dask: a Proof-of-Concept

01 January 2022

Updated May 16th, 2023: With release 2023.2.1, dask.dataframe introduced this shuffling method called P2P, making sorts, merges, and joins faster and using constant memory. Benchmarks show impressive improvements. See our blog post.

Read more ...

Accelerating Microstructural Analytics with Dask and Coiled

01 January 2022

In this article, we will discuss an interesting use case of Dask and Coiled: Accelerating Volumetric X-ray Microstructural Analytics using distributed and high-performance computing. This blog post is inspired by the article published in Kitware Blog and corresponding research.

Read more ...