splitapply

Split data into groups and apply function

Description

Y = splitapply(func,X,G) splits X into groups specified by G and applies the function func to each group. splitapply returns Y as an array that contains the concatenated outputs from func for the groups split out of X. The input argument G is a vector of positive integers that specifies the groups to which corresponding elements of X belong. If G contains NaN values, splitapply omits the corresponding values in X when it splits X into groups. To create G, you can use the findgroups function.
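As a minimal sketch of the NaN behavior described above, the following uses hypothetical toy data (X and G here are not from patients.mat). The element of X whose group number is NaN is left out of every group.

```matlab
% Hypothetical toy data; the second element has no group (NaN).
X = [10; 20; 30; 40];
G = [1; NaN; 2; 1];

% Group 1 holds 10 and 40, group 2 holds 30; the value 20 is omitted.
Y = splitapply(@sum, X, G)   % Y is [50; 30]
```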

splitapply combines two steps in the Split-Apply-Combine Workflow.

Y = splitapply(func,X1,...,XN,G) splits X1,...,XN into groups and applies func. The splitapply function calls func once per group, with corresponding elements from X1,...,XN as the N input arguments to func.
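A short sketch of the multiple-input syntax, using made-up variables qty and price (assumptions for illustration only). func receives the matching elements of both variables for each group.

```matlab
% Hypothetical toy data: quantities and unit prices for four orders.
qty   = [2; 1; 4; 3];
price = [5; 10; 1; 2];
G     = [1; 1; 2; 2];

% func is called once per group with two inputs: the group's qty and price.
total = splitapply(@(q,p) sum(q .* p), qty, price, G)   % total is [20; 10]
```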

Y = splitapply(func,T,G) splits variables of table T into groups and applies func. The splitapply function treats the variables of T as vectors, matrices, or cell arrays, depending on the data types of the table variables. If T has N variables, then func must accept N input arguments.
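The table syntax can be sketched with a small hypothetical table (the variable names A and B are illustrative). Because T has two variables, func takes two input arguments.

```matlab
% Hypothetical two-variable table.
T = table([1; 2; 3; 4], [10; 20; 30; 40], 'VariableNames', {'A','B'});
G = [1; 2; 1; 2];

% Each table variable is passed to func as a separate input argument.
Y = splitapply(@(a,b) sum(a) + sum(b), T, G)   % Y is [44; 66]
```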

[Y1,...,YM] = splitapply(___) splits variables into groups and applies func to each group, where func returns multiple output arguments. Y1,...,YM contain the concatenated outputs from func for the groups split out of the input data variables. func can return output arguments that belong to different classes, but the class of each output must be the same each time func is called. You can use this syntax with any of the input arguments of the previous syntaxes.

The number of output arguments from func need not be the same as the number of input arguments specified by X1,...,XN.
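A minimal sketch of the multiple-output syntax, assuming hypothetical toy data. It uses deal inside an anonymous function to return three outputs from one input; the third output is logical while the first two are double, which illustrates outputs of different classes.

```matlab
% Hypothetical toy data.
X = [3; 7; 1; 9; 4];
G = [1; 1; 2; 2; 2];

% One input argument, three output arguments.
% deal assigns each of its inputs to the corresponding output.
[lo, hi, isBig] = splitapply(@(x) deal(min(x), max(x), numel(x) > 2), X, G)
% lo is [3; 1], hi is [7; 9], isBig is [false; true]
```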

Examples

Calculate the mean heights by gender for groups of patients and display the results.

Load patient heights and genders from the data file patients.mat.

load patients
whos Gender Height
  Name          Size            Bytes  Class     Attributes

  Gender      100x1             11412  cell                
  Height      100x1               800  double              

Specify groups by gender with findgroups.

G = findgroups(Gender);

Split Height into groups specified by G. Calculate the mean height by gender. The first row of the output argument is the mean height of the female patients, and the second row is the mean height of the male patients.

splitapply(@mean,Height,G)
ans = 2×1

   65.1509
   69.2340

Calculate the variances of the differences in blood pressure readings for groups of patients, and display the results. The blood pressure readings are contained in two data variables. To calculate the differences, use a function that takes two input arguments.

Load blood pressure readings and smoking data for 100 patients from the data file patients.mat.

load patients
whos Systolic Diastolic Smoker
  Name             Size            Bytes  Class      Attributes

  Diastolic      100x1               800  double               
  Smoker         100x1               100  logical              
  Systolic       100x1               800  double               

Define func as a function that calculates the variances of the differences between systolic and diastolic blood-pressure readings for smokers and nonsmokers. func requires two input arguments.

func = @(x,y) var(x-y);

Use findgroups and splitapply to split the patient data into groups and calculate the variances of the differences. findgroups also returns group identifiers in smokers. The splitapply function calls func once per group, with Systolic and Diastolic as the two input arguments.

[G,smokers] = findgroups(Smoker);
varBP = splitapply(func,Systolic,Diastolic,G)
varBP = 2×1

   44.4459
   48.6783

Create a table that contains the variances of the differences, with the number of patients in each group.

numPatients = splitapply(@numel,Smoker,G);
T = table(smokers,numPatients,varBP)
T=2×3 table
    smokers    numPatients    varBP 
    _______    ___________    ______

     false         66         44.446
     true          34         48.678

Calculate the minimum, median, and maximum weights for groups of patients and return these results as arrays for each group. splitapply concatenates the output arguments so that you can distinguish output for each group from output for the other groups.

Define a function that returns the minimum, median, and maximum as a row vector.

mystats = @(x)[min(x) median(x) max(x)];

Load patient weights, genders, and status as smokers from patients.mat.

load patients
whos Weight Gender Smoker
  Name          Size            Bytes  Class      Attributes

  Gender      100x1             11412  cell                 
  Smoker      100x1               100  logical              
  Weight      100x1               800  double               

Use findgroups and splitapply to split the patient weights into groups and calculate statistics for each group.

G = findgroups(Gender,Smoker);
Y = splitapply(mystats,Weight,G)
Y = 4×3

  111.0000  131.0000  147.0000
  115.0000  131.0000  146.0000
  158.0000  181.5000  194.0000
  164.0000  181.0000  202.0000

In this example, you can return nonscalar output as row vectors because the data and grouping variables are column vectors. Each row of Y contains statistics for a different group of patients.

Calculate the mean body-mass-index (BMI) from tables of patient data. Group the patients by gender and status as smokers or nonsmokers.

Load patient data and grouping variables into tables.

load patients
DT = table(Height,Weight);
GT = table(Gender,Smoker);

Define a function that calculates mean BMI from the weights and heights of groups of patients.

meanBMIFcn = @(h,w)mean((w ./ (h.^2)) * 703);

Create a table that contains the mean BMI for each group.

[G,results] = findgroups(GT);
meanBMI = splitapply(meanBMIFcn,DT,G);
results.meanBMI = meanBMI
results=4×3 table
      Gender      Smoker    meanBMI
    __________    ______    _______

    {'Female'}    false     21.672 
    {'Female'}    true      21.669 
    {'Male'  }    false     26.578 
    {'Male'  }    true      26.458 

Calculate the minimum, mean, and maximum heights for groups of patients and return results in a table.

Define a function in a file named multiStats.m that accepts an input vector and returns the minimum, mean, and maximum values of the vector.

% Copyright 2015 The MathWorks, Inc.

function [lo,avg,hi] = multiStats(x)
lo = min(x);
avg = mean(x);
hi = max(x);
end

Load patient data into a table.

load patients
T = table(Gender,Height);
summary(T)
Variables:

    Gender: 100x1 cell array of character vectors

    Height: 100x1 double

        Values:

            Min          60   
            Median       67   
            Max          72   

Group patient heights by gender. Create a table that contains the outputs from multiStats for each group.

[G,gender] = findgroups(T.Gender);
[minHeight,meanHeight,maxHeight] = splitapply(@multiStats,T.Height,G);
result = table(gender,minHeight,meanHeight,maxHeight)
result =

  2x4 table

      gender      minHeight    meanHeight    maxHeight
    __________    _________    __________    _________

    {'Female'}       60          65.151         70    
    {'Male'  }       66          69.234         72    

Input Arguments

func

Function to apply to groups of data, specified as a function handle.

If func returns a nonscalar output argument, then the argument must be oriented so that splitapply can concatenate the output arguments from successive calls to func. For example, if the input data variables are column vectors, then func must return either a scalar or a row vector as an output argument.
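The orientation rule can be sketched with hypothetical column-vector data. The first call returns a row vector per group, which stacks into a matrix. The second call wraps each group's output in a scalar cell, a common idiom (not described on this page) for outputs whose size differs between groups.

```matlab
% Hypothetical toy data, stored as column vectors.
X = [3; 7; 1; 9; 4];
G = [1; 1; 2; 2; 2];

% Row-vector output: one row per group, stacked into a 2-by-2 matrix.
R = splitapply(@(x) [min(x) max(x)], X, G)   % R is [3 7; 1 9]

% A 1-by-1 cell is a scalar output, so groups of different sizes
% can be collected in a cell array.
C = splitapply(@(x) {x}, X, G)   % C is {[3; 7]; [1; 9; 4]}
```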

Example: Y = splitapply(@sum,X,G) returns the sums of the groups of data in X.

X

Data variable, specified as a vector, matrix, or cell array. The elements of X belong to groups specified by the corresponding elements of G.

If X is a matrix, splitapply treats each column or row as a separate data variable. The orientation of G determines whether splitapply treats the columns or rows of X as data variables.

G

Group numbers, specified as a vector of positive integers.

  • If X is a vector or cell array, then G must be the same length as X.

  • If X is a matrix, then the length of G must be equal to the number of columns or rows of X, depending on the orientation of G.

  • If the input argument is table T, then G must be a column vector. The length of G must be equal to the number of rows of T.
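A sketch of the matrix case with hypothetical data, assuming G is a column vector whose length equals the number of rows of X. Under that assumption, func receives the rows of X that belong to each group as one matrix, so summing along the first dimension yields one row of column sums per group.

```matlab
% Hypothetical 4-by-2 matrix; G has one entry per row of X.
X = [1 2; 3 4; 5 6; 7 8];
G = [1; 1; 2; 2];

% Sum down the rows of each group. Specifying the dimension keeps the
% result a row vector even if a group contains a single row.
Y = splitapply(@(m) sum(m, 1), X, G)   % Y is [4 6; 12 14]
```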

T

Data variables, specified as a table. splitapply treats each table variable as a separate data variable.

More About

Split-Apply-Combine Workflow

The Split-Apply-Combine workflow is common in data analysis. In this workflow, the analyst splits the data into groups, applies a function to each group, and combines the results. The diagram shows a typical example of the workflow and the parts of the workflow implemented by findgroups and splitapply.
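The workflow can be sketched by writing the steps out by hand and comparing the result with a single splitapply call. The data and the variable names species and weight are hypothetical; the loop is an illustration of the idea, not how splitapply is implemented.

```matlab
% Hypothetical toy data.
species = {'cat'; 'dog'; 'cat'; 'dog'; 'cat'};
weight  = [4; 20; 5; 30; 6];

% Split: findgroups assigns a group number to each element.
[G, names] = findgroups(species);

% Apply and combine by hand, one group at a time.
manual = zeros(max(G), 1);
for k = 1:max(G)
    manual(k) = mean(weight(G == k));
end

% The same apply and combine steps in one call.
viaSplitapply = splitapply(@mean, weight, G);

isequal(manual, viaSplitapply)   % displays logical 1 (true)
```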

Introduced in R2015b