Cluster Data with a Self-Organizing Map

Clustering data is another excellent application for neural networks. This process involves grouping data by similarity. For example, you might perform:

  • Market segmentation by grouping people according to their buying patterns

  • Data mining by partitioning data into related subsets

  • Bioinformatic analysis by grouping genes with related expression patterns

Suppose that you want to cluster flower types according to petal length, petal width, sepal length, and sepal width. You have 150 example cases for which you have these four measurements.

As with function fitting and pattern recognition, there are two ways to solve this problem:

  • Use the Neural Network Clustering App, as described in Using the Neural Network Clustering App.

  • Use command-line functions, as described in Using Command-Line Functions.

Defining a Problem

To define a clustering problem, simply arrange Q input vectors to be clustered as columns in an input matrix (see “Data Structures” for a detailed description of data formatting for static and time series data). For instance, you might want to cluster this set of 10 two-element vectors:

inputs = [7 0 6 2 6 5 6 1 0 1; 6 2 5 0 7 5 5 1 2 2]
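
If you prefer to see the full command-line workflow at a glance, a minimal sketch might look like the following. The 2-by-2 SOM grid is only an illustrative choice; larger grids are used later in this topic.

inputs = [7 0 6 2 6 5 6 1 0 1; 6 2 5 0 7 5 5 1 2 2];
net = selforgmap([2 2]);      % create a self-organizing map with a 2-by-2 grid of neurons
net = train(net,inputs);      % train with the default batch SOM algorithm
outputs = net(inputs);        % each column marks the winning neuron for one input vector
classes = vec2ind(outputs)    % convert winning-neuron indicators to cluster indices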

The next section shows how to train a network using the nctool GUI.

Using the Neural Network Clustering App

  1. If needed, open the Neural Network Start GUI with this command:

    nnstart
    

  2. Click Clustering app to open the Neural Network Clustering App. (You can also use the command nctool.)

  3. Click Next. The Select Data window appears.

  4. Click Load Example Data Set. The Clustering Data Set Chooser window appears.

  5. In this window, select Simple Clusters, and click Import. You return to the Select Data window.

  6. Click Next to continue to the Network Size window, shown in the following figure.

    For clustering problems, the self-organizing feature map (SOM) is the most commonly used network, because after the network has been trained, there are many visualization tools that can be used to analyze the resulting clusters. This network has one layer, with neurons organized in a grid. (For more information on the SOM, see “Self-Organizing Feature Maps”.) When creating the network, you specify the numbers of rows and columns in the grid. Here, the number of rows and columns is set to 10. The total number of neurons is 100. You can change this number in another run if you want.
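
    The grid dimensions you set here correspond directly to the arguments of selforgmap in the script generated later in this topic; for example, a 10-by-10 grid would be created at the command line with:

    net = selforgmap([10 10]);   % 10 rows by 10 columns = 100 neurons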

  7. Click Next. The Train Network window appears.

  8. Click Train.

    The training runs for the maximum number of epochs, which is 200.

  9. For SOM training, the weight vector associated with each neuron moves to become the center of a cluster of input vectors. In addition, neurons that are adjacent to each other in the topology should also move close to each other in the input space, so it is possible to visualize a high-dimensional input space in the two dimensions of the network topology. Investigate some of the visualization tools for the SOM. Under the Plots pane, click SOM Sample Hits.

    The default topology of the SOM is hexagonal. This figure shows the neuron locations in the topology, and indicates how many of the training data are associated with each of the neurons (cluster centers). The topology is a 10-by-10 grid, so there are 100 neurons. The maximum number of hits associated with any neuron is 31. Thus, there are 31 input vectors in that cluster.
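
    The same plot can be produced at the command line (once you have the trained network and its inputs in the workspace) with plotsomhits, one of the commented plot calls in the generated script:

    figure, plotsomhits(net,inputs)   % number of input vectors associated with each neuron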

  10. You can also visualize the SOM by displaying weight planes (also referred to as component planes). Click SOM Weight Planes in the Neural Network Clustering App.

    This figure shows a weight plane for each element of the input vector (two, in this case). They are visualizations of the weights that connect each input to each of the neurons. (Darker colors represent larger weights.) If the connection patterns of two inputs are very similar, you can assume that the inputs are highly correlated. In this case, input 1 has connections that are very different from those of input 2.
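
    The command-line equivalent is plotsomplanes, also listed among the commented plot calls in the generated script:

    figure, plotsomplanes(net)   % one weight plane per element of the input vector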

  11. In the Neural Network Clustering App, click Next to evaluate the network.

    At this point you can test the network against new data.

    If you are dissatisfied with the network's performance on the original or new data, you can increase the number of neurons, or perhaps get a larger training data set.

  12. When you are satisfied with the network performance, click Next.

  13. Use this panel to generate a MATLAB function or Simulink diagram for simulating your neural network. You can use the generated code or diagram to better understand how your neural network computes outputs from inputs, or to deploy the network with MATLAB Compiler tools and other MATLAB and Simulink code generation tools.
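
    Comparable artifacts can also be generated at the command line with genFunction and gensim; the file name below is only an example:

    genFunction(net,'myClusterNetFcn');   % standalone MATLAB function (example file name)
    gensim(net)                           % Simulink diagram of the network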

  14. Use the buttons on this screen to save your results.

    • You can click Simple Script or Advanced Script to create MATLAB® code that can be used to reproduce all of the previous steps from the command line. Creating MATLAB code can be helpful if you want to learn how to use the command-line functionality of the toolbox to customize the training process. In Using Command-Line Functions, you will investigate the generated scripts in more detail.

    • You can also save the network as net in the workspace. You can perform additional tests on it or put it to work on new inputs.
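
      For example, with net saved in the workspace, you can apply it to new data directly; newInputs below is a placeholder for your own input matrix:

      newOutputs = net(newInputs);        % newInputs is a placeholder for your own data
      newClasses = vec2ind(newOutputs);   % cluster index assigned to each new input vector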

  15. When you have generated scripts and saved your results, click Finish.

Using Command-Line Functions

The easiest way to learn how to use the command-line functionality of the toolbox is to generate scripts from the GUIs, and then modify them to customize the network training. As an example, look at the simple script that was created in step 14 of the previous section.

% Solve a Clustering Problem with a Self-Organizing Map
% Script generated by NCTOOL
%
% This script assumes these variables are defined:
%
%   simpleclusterInputs - input data.

inputs = simpleclusterInputs;

% Create a Self-Organizing Map
dimension1 = 10;
dimension2 = 10;
net = selforgmap([dimension1 dimension2]);

% Train the Network
[net,tr] = train(net,inputs);

% Test the Network
outputs = net(inputs);

% View the Network
view(net)

% Plots
% Uncomment these lines to enable various plots.
% figure, plotsomtop(net)
% figure, plotsomnc(net)
% figure, plotsomnd(net)
% figure, plotsomplanes(net)
% figure, plotsomhits(net,inputs)
% figure, plotsompos(net,inputs)

You can save the script, and then run it from the command line to reproduce the results of the previous GUI session. You can also edit the script to customize the training process. In this case, follow each of the steps in the script.

  1. The script assumes that the input vectors are already loaded into the workspace. To show the command-line operations, you can use a different data set from the one you used for the GUI operation. This example uses the iris flower data set, which consists of 150 four-element input vectors.

    load iris_dataset
    inputs = irisInputs;
    
  2. Create a network. For this example, you use a self-organizing map (SOM). This network has one layer, with the neurons organized in a grid. (For more information, see “Self-Organizing Feature Maps”.) When creating the network with selforgmap, you specify the number of rows and columns in the grid:

    dimension1 = 10;
    dimension2 = 10;
    net = selforgmap([dimension1 dimension2]);
    
  3. Train the network. The SOM network uses the default batch SOM algorithm for training.

    [net,tr] = train(net,inputs);
    
  4. During training, the training window opens and displays the training progress. To interrupt training at any point, click Stop Training.
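
    If you want to train for a different number of epochs in a later run, you can set the maximum number of epochs before calling train; the value shown mirrors the 200 epochs mentioned in the GUI section:

    net.trainParam.epochs = 200;   % maximum number of training epochs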

  5. Test the network. After the network has been trained, you can use it to compute the network outputs.

    outputs = net(inputs);
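
    Each column of outputs contains a 1 in the position of the winning neuron and 0 elsewhere. If you want cluster indices instead, you can convert the outputs with vec2ind:

    classes = vec2ind(outputs);   % index of the winning neuron (cluster) for each input vector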
    
  6. View the network diagram.

    view(net)
    

  7. For SOM training, the weight vector associated with each neuron moves to become the center of a cluster of input vectors. In addition, neurons that are adjacent to each other in the topology should also move close to each other in the input space, so it is possible to visualize a high-dimensional input space in the two dimensions of the network topology. The default SOM topology is hexagonal; to view it, enter the following commands.

    figure, plotsomtop(net)
    

    In this figure, each of the hexagons represents a neuron. The grid is 10-by-10, so there are a total of 100 neurons in this network. There are four elements in each input vector, so the input space is four-dimensional. The weight vectors (cluster centers) fall within this space.

    Because this SOM has a two-dimensional topology, you can visualize in two dimensions the relationships among the four-dimensional cluster centers. One visualization tool for the SOM is the weight distance matrix (also called the U-matrix).

  8. To view the U-matrix, click SOM Neighbor Distances in the training window.

    In this figure, the blue hexagons represent the neurons. The red lines connect neighboring neurons. The colors in the regions containing the red lines indicate the distances between neurons. The darker colors represent larger distances, and the lighter colors represent smaller distances. A band of dark segments crosses from the lower-center region to the upper-right region. The SOM network appears to have clustered the flowers into two distinct groups.
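
    The same neighbor-distance plot can be produced at the command line with plotsomnd, another of the commented plot calls in the generated script:

    figure, plotsomnd(net)   % distances between neighboring neurons (the U-matrix)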

To get more experience in command-line operations, try some of these tasks:

  • During training, open a plot window (such as the SOM weight position plot), and watch it animate.

  • Plot from the command line with functions such as plotsomhits, plotsomnc, plotsomnd, plotsomplanes, plotsompos, and plotsomtop.

Also, see the advanced script for more options when training from the command line.