As the number and type of data acquisition devices grows annually, the sheer size and rate of data being collected is rapidly expanding. These big data sets can contain gigabytes or terabytes of data, and can grow on the order of megabytes or gigabytes per day. While the collection of this information presents opportunities for insight, it also presents many challenges. Most algorithms are not designed to process big data sets in a reasonable amount of time or with a reasonable amount of memory. MapReduce allows you to meet many of these challenges to gain important insights from large data sets.
MapReduce is a programming technique for analyzing data sets that do not fit in memory. You may be familiar with Hadoop® MapReduce, which is a popular implementation that works with the Hadoop Distributed File System (HDFS™). MATLAB® provides a slightly different implementation of the MapReduce technique with the mapreduce function.

mapreduce uses a datastore to process data in small blocks that individually fit into memory. Each block goes through a Map phase, which formats the data to be processed. Then the intermediate data blocks go through a Reduce phase, which aggregates the intermediate results to produce a final result. The Map and Reduce phases are encoded by map and reduce functions, which are primary inputs to mapreduce. There are endless combinations of map and reduce functions to process data, so this technique is both flexible and extremely powerful for tackling large data processing tasks.

mapreduce lends itself to being extended to run in several environments. For more information about these capabilities, see Speed Up and Deploy MapReduce Using Other Products.
The utility of the mapreduce function lies in its ability to perform calculations on large collections of data. Consequently, mapreduce is not well suited to performing calculations on normal-sized data sets, which can be loaded directly into computer memory and analyzed with traditional techniques. Instead, use mapreduce to perform a statistical or analytical calculation on a data set that does not fit in memory.
Each call to the map or reduce function by mapreduce is independent of all others. For example, a call to the map function cannot depend on inputs or results from a previous call to the map function. It is best to break up such calculations into multiple calls to mapreduce.
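For instance, a calculation whose second stage depends on an aggregate computed in the first stage can be split into two passes. The following is a hedged sketch only; the function handles FirstPassMapFun, FirstPassReduceFun, SecondPassMapFun, and SecondPassReduceFun are hypothetical placeholders, not part of the example developed later in this topic:

```matlab
% First pass: compute a global aggregate with one mapreduce call.
firstOut = mapreduce(ds, @FirstPassMapFun, @FirstPassReduceFun);

% Read the small intermediate result back into memory.
firstResult = readall(firstOut);

% Second pass: a separate mapreduce call that uses firstResult,
% for example by capturing it in an anonymous map function.
secondOut = mapreduce(ds, ...
    @(data,info,kv) SecondPassMapFun(data, info, kv, firstResult), ...
    @SecondPassReduceFun);
```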
mapreduce moves each block of data in the input datastore through several phases before reaching the final output. The following figure outlines the phases of the algorithm for mapreduce.
The algorithm has the following steps:

1. mapreduce reads a block of data from the input datastore using [data,info] = read(ds), and then calls the map function to work on that block.

2. The map function receives the block of data, organizes it or performs a precursory calculation, and then uses the add and addmulti functions to add key-value pairs to an intermediate data storage object called a KeyValueStore. The number of calls to the map function by mapreduce is equal to the number of blocks in the input datastore.

3. After the map function works on all of the blocks of data in the datastore, mapreduce groups all of the values in the intermediate KeyValueStore object by unique key.

4. Next, mapreduce calls the reduce function once for each unique key added by the map function. Each unique key can have many associated values. mapreduce passes the values to the reduce function as a ValueIterator object, which is an object used to iterate over the values. The ValueIterator object for each unique key contains all the associated values for that key.

5. The reduce function uses the hasnext and getnext functions to iterate through the values in the ValueIterator object one at a time. Then, after aggregating the intermediate results from the map function, the reduce function adds final key-value pairs to the output using the add and addmulti functions. The order of the keys in the output is the same as the order in which the reduce function adds them to the final KeyValueStore object. That is, mapreduce does not explicitly sort the output.

Note: The reduce function writes the final key-value pairs to a final KeyValueStore object. From this object, mapreduce pulls the key-value pairs into the output datastore, which is a KeyValueDatastore object by default.
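As a compact illustration of steps 2 through 5, this hedged sketch shows a hypothetical map/reduce pair (not the pair developed later in this topic) that counts the total number of rows in a datastore; each function would be saved in its own file:

```matlab
function CountMapFun(data, info, intermKVStore)
    % Step 2: add one intermediate key-value pair per block,
    % recording how many rows this block contains.
    add(intermKVStore, 'count', size(data,1));
end
```

```matlab
function CountReduceFun(intermKey, intermValIter, outKVStore)
    % Steps 4-5: mapreduce calls this once for the unique key 'count',
    % passing all per-block counts in the ValueIterator.
    total = 0;
    while hasnext(intermValIter)
        total = total + getnext(intermValIter);
    end
    add(outKVStore, intermKey, total);
end
```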
This example uses a simple calculation (the mean travel distance in a set of flight data) to illustrate the steps needed to run mapreduce.
The first step to using mapreduce is to construct a datastore for the data set. Along with the map and reduce functions, the datastore for a data set is a required input to mapreduce, since it allows mapreduce to process the data in blocks. mapreduce works with most types of datastores. For example, create a TabularTextDatastore object for the airlinesmall.csv data set.
ds = tabularTextDatastore('airlinesmall.csv','TreatAsMissing','NA')
ds =

  TabularTextDatastore with properties:

                      Files: {
                             ' ...\matlab\toolbox\matlab\demos\airlinesmall.csv'
                             }
                    Folders: {
                             ' ...\matlab\toolbox\matlab\demos'
                             }
               FileEncoding: 'UTF-8'
   AlternateFileSystemRoots: {}
      PreserveVariableNames: false
          ReadVariableNames: true
              VariableNames: {'Year', 'Month', 'DayofMonth' ... and 26 more}
             DatetimeLocale: en_US

  Text Format Properties:
             NumHeaderLines: 0
                  Delimiter: ','
               RowDelimiter: '\r\n'
             TreatAsMissing: 'NA'
               MissingValue: NaN

  Advanced Text Format Properties:
            TextscanFormats: {'%f', '%f', '%f' ... and 26 more}
                   TextType: 'char'
         ExponentCharacters: 'eEdD'
               CommentStyle: ''
                 Whitespace: ' \b\t'
    MultipleDelimitersAsOne: false

  Properties that control the table returned by preview, read, readall:
      SelectedVariableNames: {'Year', 'Month', 'DayofMonth' ... and 26 more}
            SelectedFormats: {'%f', '%f', '%f' ... and 26 more}
                   ReadSize: 20000 rows
                 OutputType: 'table'
                   RowTimes: []

  Write-specific Properties:
     SupportedOutputFormats: ["txt" "csv" "xlsx" "xls" "parquet" "parq"]
        DefaultOutputFormat: "txt"
Several of the previously described options are useful in the context of mapreduce. The mapreduce function executes read on the datastore to retrieve data to pass to the map function. Therefore, you can use the SelectedVariableNames, SelectedFormats, and ReadSize options to directly configure the block size and type of data that mapreduce passes to the map function.
For example, to select the Distance (total flight distance) variable as the only variable of interest, specify SelectedVariableNames.
ds.SelectedVariableNames = 'Distance';
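The ReadSize property can be adjusted in the same way to control the block size. This is a hedged sketch; the value shown is illustrative, not a recommendation:

```matlab
ds.ReadSize = 50000;  % each call to read (and thus each map call) now receives up to 50,000 rows
```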
Now, whenever the read, readall, or preview functions act on ds, they return only information for the Distance variable. To confirm this, you can preview the first few rows of data in the datastore. This allows you to examine the format of the data that the mapreduce function will pass to the map function.
preview(ds)
ans =

  8×1 table

    Distance
    ________

      308
      296
      480
      296
      373
      308
      447
      954
To view the exact data that mapreduce will pass to the map function, use read.
For additional information and a complete summary of the available options, see Datastore.
The mapreduce function automatically calls the map and reduce functions during execution, so these functions must meet certain requirements to run properly.
The inputs to the map function are data, info, and intermKVStore:

- data and info are the result of a call to the read function on the input datastore, which mapreduce executes automatically before each call to the map function.

- intermKVStore is the name of the intermediate KeyValueStore object to which the map function needs to add key-value pairs. The add and addmulti functions use this object name to add key-value pairs. If none of the calls to the map function add key-value pairs to intermKVStore, then mapreduce does not call the reduce function and the resulting datastore is empty.
A simple example of a map function is:
function MeanDistMapFun(data, info, intermKVStore)
    distances = data.Distance(~isnan(data.Distance));
    sumLenValue = [sum(distances) length(distances)];
    add(intermKVStore, 'sumAndLength', sumLenValue);
end
This map function has only three lines, which perform some straightforward roles. The first line filters out all NaN values in the block of distance data. The second line creates a two-element vector with the total distance and count for the block, and the third line adds that vector of values to intermKVStore with the key 'sumAndLength'.

After this map function runs on all of the blocks of data in ds, the intermKVStore object contains the total distance and count for each block of distance data.
Save this function in your current folder as MeanDistMapFun.m.
The inputs to the reduce function are intermKey, intermValIter, and outKVStore:

- intermKey is the active key added by the map function. Each call to the reduce function by mapreduce specifies a new unique key from the keys in the intermediate KeyValueStore object.

- intermValIter is the ValueIterator associated with the active key, intermKey. This ValueIterator object contains all of the values associated with the active key. Scroll through the values using the hasnext and getnext functions.

- outKVStore is the name of the final KeyValueStore object to which the reduce function needs to add key-value pairs. mapreduce takes the output key-value pairs from outKVStore and returns them in the output datastore, which is a KeyValueDatastore object by default. If none of the calls to the reduce function add key-value pairs to outKVStore, then mapreduce returns an empty datastore.
A simple example of a reduce function is:
function MeanDistReduceFun(intermKey, intermValIter, outKVStore)
    sumLen = [0 0];
    while hasnext(intermValIter)
        sumLen = sumLen + getnext(intermValIter);
    end
    add(outKVStore, 'Mean', sumLen(1)/sumLen(2));
end
This reduce function loops through each of the distance and count values in intermValIter, keeping a running total of the distance and count after each pass. After this loop, the reduce function calculates the overall mean flight distance with a simple division, and then adds a single key to outKVStore.

Save this function in your current folder as MeanDistReduceFun.m.
For information about writing more advanced map and reduce functions, see Write a Map Function and Write a Reduce Function.
After you have a datastore, a map function, and a reduce function, you can call mapreduce to perform the calculation. To calculate the average flight distance in the data set, call mapreduce using ds, MeanDistMapFun, and MeanDistReduceFun.
outds = mapreduce(ds, @MeanDistMapFun, @MeanDistReduceFun);
********************************
*      MAPREDUCE PROGRESS      *
********************************
Map   0% Reduce   0%
Map  16% Reduce   0%
Map  32% Reduce   0%
Map  48% Reduce   0%
Map  65% Reduce   0%
Map  81% Reduce   0%
Map  97% Reduce   0%
Map 100% Reduce   0%
Map 100% Reduce 100%
By default, the mapreduce function displays progress information at the command line and returns a KeyValueDatastore object that points to files in the current folder. You can adjust all three of these options using the name-value pair arguments 'OutputFolder', 'OutputType', and 'Display'. For more information, see the reference page for mapreduce.
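For example, this hedged sketch uses the 'Display' and 'OutputFolder' arguments mentioned above to suppress the progress display and redirect the output files to a temporary folder:

```matlab
outds = mapreduce(ds, @MeanDistMapFun, @MeanDistReduceFun, ...
    'Display', 'off', 'OutputFolder', tempdir);
```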
Use the readall function to read the key-value pairs from the output datastore.
readall(outds)
ans =

  1×2 table

      Key           Value
    ________    ____________

    {'Mean'}    {[702.1630]}
See Also

mapreduce | tabularTextDatastore