Disco - A Powerful Erlang and Python Map/Reduce Framework

By Tait Clarridge, Mon 05 May 2014, Category Data

bigdata, disco, discoproject, erlang, hustle, map-reduce, pipeline, python

In the wake of PyData 2014 I felt it was important to share my thoughts on the power of Disco.

For those of you who may not know what Disco (also known as the Disco Project) can do and why it is so powerful: its core is written in Erlang, and the workers that run job parts are written in Python (with more languages coming). This means you can run anything from regular aggregation/exploration jobs to machine learning, while taking advantage of the great libraries and quick prototyping that Python offers.
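To make that concrete, here is (roughly) the classic word-count job from the Disco tutorial; the map and reduce are plain Python functions, and the input URL is just the sample text the tutorial points at:

```python
from disco.core import Job, result_iterator

def map(line, params):
    # Emit a (word, 1) pair for every word in the input line.
    for word in line.split():
        yield word, 1

def reduce(iter, params):
    # kvgroup groups the sorted (word, count) pairs by word.
    from disco.util import kvgroup
    for word, counts in kvgroup(sorted(iter)):
        yield word, sum(counts)

if __name__ == '__main__':
    job = Job().run(input=["http://discoproject.org/media/text/chekhov.txt"],
                    map=map,
                    reduce=reduce)
    for word, count in result_iterator(job.wait(show=True)):
        print(word, count)
```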

When comparing Disco to Hadoop, which I hope to do in detail in the very near future, it is important to note the simplicity of Disco. Because it builds on Erlang's concurrency and clustering, you don't need anything other than a single master server to submit jobs and push files into the Disco Distributed File System (DDFS). This also means there is virtually no latency when starting jobs, so you can get up and running quickly; that was a requirement for Hustle, the distributed, relational event database that Chango has open sourced. I won't go into great detail on Hustle in this particular post, but it is definitely worth taking a look at and reading about how it works.
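To give a flavour of what pushing data looks like, here is a minimal sketch using DDFS's Python API; the tag name and file path are just examples I made up:

```python
from disco.ddfs import DDFS

ddfs = DDFS()  # talks to the master configured for this node

# Push a local file into DDFS under the tag 'data:events'; the file
# becomes a replicated blob addressable by that tag.
ddfs.push('data:events', ['./events.log'])

# A job can then take the tag itself as its input, e.g.:
#   Job().run(input=['tag://data:events'], map=..., reduce=...)
```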

Another advantage of Disco is that you can run a single-node cluster on your local laptop (I run it on a MacBook Air) to test jobs and push data into DDFS. It has almost zero overhead, and you don't need to run it in a virtual machine, download a distro like Cloudera's, or fuss around with XML (a massive plus in my opinion).

Not only does Disco allow you to run regular MapReduce jobs, but starting from version 0.5.x you can also use pipelines. Pipelines are extremely powerful: a pipeline is defined as a linear sequence of stages, where the outputs of each stage are grouped into the inputs of the subsequent stage.
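As a taste of the pipeline API, here is an illustrative sketch of word count written as a two-stage pipeline; the stage names, groupings, and process-function signature follow my reading of the 0.5.x pipeline docs, so treat this as a sketch rather than the definitive form:

```python
from disco.core import Job
from disco.worker.pipeline.worker import Stage
from disco.util import kvgroup

def splitter(interface, state, label, inp):
    # First stage: read lines, emit (word, 1) pairs on output label 0.
    out = interface.output(0)
    for line in inp:
        for word in line.split():
            out.add(word, 1)

def counter(interface, state, label, inp):
    # Second stage: with sort=True the grouped inputs arrive sorted,
    # so kvgroup can sum the counts per word directly.
    out = interface.output(0)
    for word, counts in kvgroup(inp):
        out.add(word, sum(counts))

class WordCount(Job):
    # Each stage is paired with a grouping that describes how the
    # previous stage's outputs are gathered into this stage's inputs.
    pipeline = [("split", Stage("map", process=splitter)),
                ("group_label", Stage("reduce", process=counter, sort=True))]
```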

At Chango we use Disco as our data-crunching powerhouse, ploughing through our massive amounts of data with ease and letting our data scientists quickly prototype and isolate subsets of data to run machine learning algorithms on.

I will follow up this post, which is really an introduction to Disco, with some examples and some of the deeper concepts behind running pipelines.

Furthermore, I am going to help with easier packaging and bundling of Disco (including test data sets), maybe with an IPython Notebook, to give those who may not be entirely familiar with Python and MapReduce concepts a way to explore.

The Disco community is growing, and if you are interested I highly suggest joining in and trying it out on your workloads. I will release more information about the packaging (most likely using Conda) and the IPython Notebook as I get more time to develop them.