Create a Dataset Array from a File

Note

The dataset data type is not recommended. To work with heterogeneous data, use the MATLAB^® table data type instead. See MATLAB table documentation for more information.

Create a Dataset Array from a Tab-Delimited Text File
Create a Dataset Array from a Comma-Separated Text File
Create a Dataset Array from an Excel File

Create a Dataset Array from a Tab-Delimited Text File

This example shows how to create a dataset array from the contents of a tab-delimited text file.

Create a dataset array using default settings.

Import the text file hospitalSmall.txt as a dataset array using the default settings.

ds = dataset('File',fullfile(matlabroot,'help/toolbox/stats/examples','hospitalSmall.txt'))

ds = 

    name              sex        age    wgt    smoke
    'SMITH'           'm'        38     176    1    
    'JOHNSON'         'm'        43     163    0    
    'WILLIAMS'        'f'        38     131    0    
    'JONES'           'f'        40     133    0    
    'BROWN'           'f'        49     119    0    
    'DAVIS'           'f'        46     142    0    
    'MILLER'          'f'        33     142    1    
    'WILSON'          'm'        40     180    0    
    'MOORE'           'm'        28     183    0    
    'TAYLOR'          'f'        31     132    0    
    'ANDERSON'        'f'        45     128    0    
    'THOMAS'          'f'        42     137    0    
    'JACKSON'         'm'        25     174    0    
    'WHITE'           'm'        39     202    1

By default, dataset uses the first row of the text file for variable names. If the first row does not contain variable names, you can specify the optional name-value pair argument 'ReadVarNames',false to change the default behavior.

The dataset array contains heterogeneous variables. The variables id, name, and sex are cell arrays of character vectors, and the other variables are numeric.

Summarize the dataset array.

You can see the data type and other descriptive statistics for each variable by using summary to summarize the dataset array.

summary(ds)

name: [14x1 cell array of character vectors]

sex: [14x1 cell array of character vectors]

age: [14x1 double]

    min    1st quartile    median    3rd quartile    max
    25     33              39.5      43              49 

wgt: [14x1 double]

    min    1st quartile    median    3rd quartile    max
    119    132             142       176             202

smoke: [14x1 double]

    min    1st quartile    median    3rd quartile    max
    0      0               0         0               1

Import observation names.

Import the text file again, this time specifying that the first column contains observation names.

ds = dataset('File',fullfile(matlabroot,'help/toolbox/stats/examples','hospitalSmall.txt'),'ReadObsNames',true)

ds = 

                sex        age    wgt    smoke
    SMITH       'm'        38     176    1    
    JOHNSON     'm'        43     163    0    
    WILLIAMS    'f'        38     131    0    
    JONES       'f'        40     133    0    
    BROWN       'f'        49     119    0    
    DAVIS       'f'        46     142    0    
    MILLER      'f'        33     142    1    
    WILSON      'm'        40     180    0    
    MOORE       'm'        28     183    0    
    TAYLOR      'f'        31     132    0    
    ANDERSON    'f'        45     128    0    
    THOMAS      'f'        42     137    0    
    JACKSON     'm'        25     174    0    
    WHITE       'm'        39     202    1

The elements of the first column in the text file, last names, are now observation names. Observation names and row names are dataset array properties. You can always add or change the observation names of an existing dataset array by modifying the property ObsNames.

Change dataset array properties.

By default, the DimNames property of the dataset array has name as the descriptor of the observation (row) dimension. dataset got this name from the first row of the first column in the text file.

Change the first element of DimNames to LastName.

ds.Properties.DimNames{1} = 'LastName';
ds.Properties

ans = 

       Description: ''
    VarDescription: {}
             Units: {}
          DimNames: {'LastName'  'Variables'}
          UserData: []
          ObsNames: {14x1 cell}
          VarNames: {'sex'  'age'  'wgt'  'smoke'}

Index into dataset array.

You can use observation names to index into a dataset array. For example, return the data for the patient with last name BROWN.

ds('BROWN',:)

ans = 

             sex        age    wgt    smoke
    BROWN    'f'        49     119    0

Note that observation names must be unique.

Create a Dataset Array from a Comma-Separated Text File

This example shows how to create a dataset array from the contents of a comma-separated text file.

Create a dataset array.

Import the file hospitalSmall.csv as a dataset array, specifying the comma-delimited format.

ds = dataset('File',fullfile(matlabroot,'help/toolbox/stats/examples','hospitalSmall.csv'),'Delimiter',',')

ds = 

    id               name              sex        age    wgt    smoke
    'YPL-320'        'SMITH'           'm'        38     176    1    
    'GLI-532'        'JOHNSON'         'm'        43     163    0    
    'PNI-258'        'WILLIAMS'        'f'        38     131    0    
    'MIJ-579'        'JONES'           'f'        40     133    0    
    'XLK-030'        'BROWN'           'f'        49     119    0    
    'TFP-518'        'DAVIS'           'f'        46     142    0    
    'LPD-746'        'MILLER'          'f'        33     142    1    
    'ATA-945'        'WILSON'          'm'        40     180    0    
    'VNL-702'        'MOORE'           'm'        28     183    0    
    'LQW-768'        'TAYLOR'          'f'        31     132    0    
    'QFY-472'        'ANDERSON'        'f'        45     128    0    
    'UJG-627'        'THOMAS'          'f'        42     137    0    
    'XUE-826'        'JACKSON'         'm'        25     174    0    
    'TRW-072'        'WHITE'           'm'        39     202    1

By default, dataset uses the first row in the text file as variable names.

Add observation names.

Use the unique identifiers in the variable id as observation names. Then, delete the variable id from the dataset array.

ds.Properties.ObsNames = ds.id;
ds.id = []

ds = 

               name              sex        age    wgt    smoke
    YPL-320    'SMITH'           'm'        38     176    1    
    GLI-532    'JOHNSON'         'm'        43     163    0    
    PNI-258    'WILLIAMS'        'f'        38     131    0    
    MIJ-579    'JONES'           'f'        40     133    0    
    XLK-030    'BROWN'           'f'        49     119    0    
    TFP-518    'DAVIS'           'f'        46     142    0    
    LPD-746    'MILLER'          'f'        33     142    1    
    ATA-945    'WILSON'          'm'        40     180    0    
    VNL-702    'MOORE'           'm'        28     183    0    
    LQW-768    'TAYLOR'          'f'        31     132    0    
    QFY-472    'ANDERSON'        'f'        45     128    0    
    UJG-627    'THOMAS'          'f'        42     137    0    
    XUE-826    'JACKSON'         'm'        25     174    0    
    TRW-072    'WHITE'           'm'        39     202    1

Delete observations.

Delete any patients with the last name BROWN. You can use strcmp to match 'BROWN' with the elements of the variable containing last names, name.

toDelete = strcmp(ds.name,'BROWN');
ds(toDelete,:) = []

ds = 

               name              sex        age    wgt    smoke
    YPL-320    'SMITH'           'm'        38     176    1    
    GLI-532    'JOHNSON'         'm'        43     163    0    
    PNI-258    'WILLIAMS'        'f'        38     131    0    
    MIJ-579    'JONES'           'f'        40     133    0    
    TFP-518    'DAVIS'           'f'        46     142    0    
    LPD-746    'MILLER'          'f'        33     142    1    
    ATA-945    'WILSON'          'm'        40     180    0    
    VNL-702    'MOORE'           'm'        28     183    0    
    LQW-768    'TAYLOR'          'f'        31     132    0    
    QFY-472    'ANDERSON'        'f'        45     128    0    
    UJG-627    'THOMAS'          'f'        42     137    0    
    XUE-826    'JACKSON'         'm'        25     174    0    
    TRW-072    'WHITE'           'm'        39     202    1

One patient having last name BROWN is deleted from the dataset array.

Return size of dataset array.

The array now has 13 observations.

size(ds)

ans =

    13     5

Note that the row and column corresponding to variable and observation names, respectively, are not included in the size of a dataset array.

Create a Dataset Array from an Excel File

This example shows how to create a dataset array from the contents of an Excel^® spreadsheet file.

Create a dataset array.

Import the data from the first worksheet in the file hospitalSmall.xlsx, specifying that the data file is an Excel spreadsheet.

ds = dataset('XLSFile',fullfile(matlabroot,'help/toolbox/stats/examples','hospitalSmall.xlsx'))

ds = 

    id               name              sex        age    wgt    smoke
    'YPL-320'        'SMITH'           'm'        38     176    1    
    'GLI-532'        'JOHNSON'         'm'        43     163    0    
    'PNI-258'        'WILLIAMS'        'f'        38     131    0    
    'MIJ-579'        'JONES'           'f'        40     133    0    
    'XLK-030'        'BROWN'           'f'        49     119    0    
    'TFP-518'        'DAVIS'           'f'        46     142    0    
    'LPD-746'        'MILLER'          'f'        33     142    1    
    'ATA-945'        'WILSON'          'm'        40     180    0    
    'VNL-702'        'MOORE'           'm'        28     183    0    
    'LQW-768'        'TAYLOR'          'f'        31     132    0    
    'QFY-472'        'ANDERSON'        'f'        45     128    0    
    'UJG-627'        'THOMAS'          'f'        42     137    0    
    'XUE-826'        'JACKSON'         'm'        25     174    0    
    'TRW-072'        'WHITE'           'm'        39     202    1

By default, dataset creates variable names using the contents of the first row in the spreadsheet.

Specify which worksheet to import.

Import the data from the second worksheet into a new dataset array.

ds2 = dataset('XLSFile',fullfile(matlabroot,'help/toolbox/stats/examples','hospitalSmall.xlsx'),'Sheet',2)

ds2 = 

    id               name              sex        age    wgt    smoke
    'TRW-072'        'WHITE'           'm'        39     202    1    
    'ELG-976'        'HARRIS'          'f'        36     129    0    
    'KOQ-996'        'MARTIN'          'm'        48     181    1    
    'YUZ-646'        'THOMPSON'        'm'        32     191    1    
    'XBR-291'        'GARCIA'          'f'        27     131    1    
    'KPW-846'        'MARTINEZ'        'm'        37     179    0    
    'XBA-581'        'ROBINSON'        'm'        50     172    0    
    'BKD-785'        'CLARK'           'f'        48     133    0

Documentation