News Archive
PhpRiot Newsletter
Your Email Address:

More information

Big Data

Note: This article was originally published at Planet PHP on 20 December 2010.
Planet PHP

Big data, data science, analytics. These are some of the hottest buzzwords in tech right now. Five years ago, the boasting rights went to the geek with the largest number of users: these days he with the biggest data wins.

There are a number of approaches to dealing with vast quantities of data, but one of the best known is Apache Hadoop. Hadoop is a toolkit for managing large data sets, based originally on the Google whitepapers about MapReduce and the Google File System. For Socorro, the Mozilla crash reporting system, we use HBase, a non-relational (NoSQL) database built on the Hadoop ecosystem.

The Hadoop world is largely a Java world, since all the tools are written in Java. However, if you feel the same way about Java as Sean Coates, you should not lose hope. You, too, can use PHP to work with Hadoop.

Let's start by understanding MapReduce. This is a framework for distributed processing of large datasets.

A MapReduce job consists of two pieces of code:

A Mapper The job of the Mapper is to map input key-value pairs to output key-value pairs. A Reducer The Reducer receives and collates results from Mappers.

More parts are needed to make this work:

  • An Input reader generates splits of data for each Mapper to work through.
  • A Partition function takes the output of Mappers and chooses a destination Reducer.
  • An Output writer takes the output of the Reducers and writes it to the Hadoop Distributed File System (HDFS).

In summary, the Mapper and Reducer are the core functionality of a MapReduce job. Now, let's get set up to write a Mapper and Reducer against Hadoop with PHP.

Setting up Hadoop is a non-trivial task; luckily, a number of VMs are available to help. For this example, I am using the Training VM from Cloudera. (You'll need VMWare Player for Windows or Linux, or VMWare Fusion for OS X to run this VM.)

Once you've started the VM, open up a terminal window. (This VM is Ubuntu based.)

The VM you have just installed comes with a sample data set of the complete works of Shakespeare. You'll need to put these files into HDFS so that we can work with them. Run the following commands to put the files into HDFS:

cd ~/git/data tar vzxf shakespeare.tar.gz hadoop fs -put input /user/training/input

You can confirm this worked by viewing the files in the input directory on HDFS: hadoop fs -ls /user/training

Next, we need to create the mapper and reducer. To demonstrate these, we'll reproduce what is often referred to as the canonical MapReduce example: word count.

You can find the Java version of this code in the Cloudera Hadoop Tutorial.

As you can see (if you know Java), the mapper reads words from input, and for each word it encounters, emits to standard output the word and the value 1 to indicate that the word has been encountered. The reducer takes output from mappers and aggregates it to produce a set of words and counts.

The easiest way to communicate from PHP to Hadoop and back again is using the Hadoop Streaming API. This expects mappers and reducers to use standard input and output as a pipe for communication.

This is how we write the word count mapper in PHP, which we'll name mapper.php:


We open standard input for reading a line at a time, split that line into an array along word boundaries using a regular expressiob, and emit output as the word encountered followed by a 1. (I delimited this with tabs, but you may use whatever you like.)

Now, here's the reducer (reducer.php):

#!/usr/bin/php $count) { echo("$word $count\n"); }

Again, we read a line at a time from standard input, and summarize the r

Truncated by Planet PHP, read more at the original (another 1806 bytes)