A mini-batch datastore is an implementation of a datastore with support for reading data in batches. You can use a mini-batch datastore as a source of training, validation, test, and prediction data sets for deep learning applications that use Deep Learning Toolbox™.
To preprocess sequence, time series, or text data, build your own mini-batch datastore using the framework described here. For an example showing how to use a custom mini-batch datastore, see Train Network Using Custom Mini-Batch Datastore for Sequence Data.
Build your custom datastore interface using the custom datastore classes and objects. Then, use the custom datastore to bring your data into MATLAB®.
Designing your custom mini-batch datastore involves inheriting from the matlab.io.Datastore and matlab.io.datastore.MiniBatchable classes and implementing the required properties and methods. You can optionally add support for shuffling during training.
Processing Needs | Classes |
---|---|
Mini-batch datastore for training, validation, test, and prediction data sets in Deep Learning Toolbox | matlab.io.Datastore and matlab.io.datastore.MiniBatchable |
Mini-batch datastore with support for shuffling during training | matlab.io.Datastore, matlab.io.datastore.MiniBatchable, and matlab.io.datastore.Shuffleable |

Implement MiniBatchable Datastore
To implement a custom mini-batch datastore named MyDatastore, create a class definition file MyDatastore.m. The file must be on the MATLAB path and should contain code that inherits from the appropriate classes and defines the required properties and methods. The code for creating a mini-batch datastore for training, validation, test, and prediction data sets in Deep Learning Toolbox must:
Inherit from the classes matlab.io.Datastore and matlab.io.datastore.MiniBatchable.
Define these properties: MiniBatchSize and NumObservations.
Define these methods: hasdata, read, reset, and progress.
In addition to these steps, you can define any other properties or methods that you need to process and analyze your data.
If you are training a network and trainingOptions specifies 'Shuffle' as 'once' or 'every-epoch', then you must also inherit from the matlab.io.datastore.Shuffleable class. For more information, see Add Support for Shuffling.
This example shows how to create a custom mini-batch datastore for processing sequence data. Save the code in a file called MySequenceDatastore.m.
```matlab
classdef MySequenceDatastore < matlab.io.Datastore & ...
                       matlab.io.datastore.MiniBatchable

    properties
        Datastore
        Labels
        NumClasses
        SequenceDimension
        MiniBatchSize
    end

    properties(SetAccess = protected)
        NumObservations
    end

    properties(Access = private)
        % Index of the next file to read
        CurrentFileIndex
    end

    methods

        function ds = MySequenceDatastore(folder)
            % Construct a MySequenceDatastore object

            % Create a file datastore. The readSequence function is
            % defined following the class definition.
            fds = fileDatastore(folder, ...
                'ReadFcn',@readSequence, ...
                'IncludeSubfolders',true);
            ds.Datastore = fds;

            % Read labels from folder names
            numObservations = numel(fds.Files);
            for i = 1:numObservations
                file = fds.Files{i};
                filepath = fileparts(file);
                [~,label] = fileparts(filepath);
                labels{i,1} = label;
            end
            ds.Labels = categorical(labels);
            ds.NumClasses = numel(unique(labels));

            % Determine sequence dimension. When you define the LSTM
            % network architecture, you can use this property to
            % specify the input size of the sequenceInputLayer.
            X = preview(fds);
            ds.SequenceDimension = size(X,1);

            % Initialize datastore properties.
            ds.MiniBatchSize = 128;
            ds.NumObservations = numObservations;
            ds.CurrentFileIndex = 1;
        end

        function tf = hasdata(ds)
            % Return true if a full mini-batch of data is still available
            tf = ds.CurrentFileIndex + ds.MiniBatchSize - 1 ...
                <= ds.NumObservations;
        end

        function [data,info] = read(ds)
            % Read one mini-batch of data
            miniBatchSize = ds.MiniBatchSize;
            info = struct;

            for i = 1:miniBatchSize
                predictors{i,1} = read(ds.Datastore);
                responses(i,1) = ds.Labels(ds.CurrentFileIndex);
                ds.CurrentFileIndex = ds.CurrentFileIndex + 1;
            end

            data = preprocessData(ds,predictors,responses);
        end

        function data = preprocessData(ds,predictors,responses)
            % data = preprocessData(ds,predictors,responses) preprocesses
            % the data in predictors and responses and returns the table
            % data

            miniBatchSize = ds.MiniBatchSize;

            % Pad data to length of longest sequence.
            sequenceLengths = cellfun(@(X) size(X,2),predictors);
            maxSequenceLength = max(sequenceLengths);
            for i = 1:miniBatchSize
                X = predictors{i};

                % Pad sequence with zeros.
                if size(X,2) < maxSequenceLength
                    X(:,maxSequenceLength) = 0;
                end

                predictors{i} = X;
            end

            % Return data as a table.
            data = table(predictors,responses);
        end

        function reset(ds)
            % Reset to the start of the data
            reset(ds.Datastore);
            ds.CurrentFileIndex = 1;
        end

    end

    methods (Hidden = true)

        function frac = progress(ds)
            % Determine the fraction of data read from the datastore
            frac = (ds.CurrentFileIndex - 1) / ds.NumObservations;
        end

    end

end % end class definition
```

The MySequenceDatastore constructor uses a helper function called readSequence. You must create this function to read sequence data from a MAT-file. Define it in MySequenceDatastore.m, following the class definition.

```matlab
function data = readSequence(filename)
% data = readSequence(filename) reads the sequence X from the MAT-file
% filename

S = load(filename);
data = S.X;
end
```
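The padding step in preprocessData relies on MATLAB growing an array when you assign past its last column. The following standalone sketch isolates that step so you can see its effect outside the class. The three all-ones sequences are made-up data for illustration only.

```matlab
% Three made-up sequences with 2 features each and different lengths
% (sequence length is the number of columns).
predictors = {ones(2,3); ones(2,5); ones(2,4)};

sequenceLengths = cellfun(@(X) size(X,2),predictors);
maxSequenceLength = max(sequenceLengths);

for i = 1:numel(predictors)
    X = predictors{i};
    if size(X,2) < maxSequenceLength
        % Assigning to the last required column grows X. MATLAB fills
        % the newly created columns with zeros.
        X(:,maxSequenceLength) = 0;
    end
    predictors{i} = X;
end

% Every sequence now has the same number of columns.
paddedLengths = cellfun(@(X) size(X,2),predictors);
disp(paddedLengths')
```

Padding happens per mini-batch, so the padded length can differ from one mini-batch to the next.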
To add support for shuffling, first follow the instructions in Implement MiniBatchable Datastore and then update your implementation code in MySequenceDatastore.m to:
Inherit from an additional class matlab.io.datastore.Shuffleable.
Define the additional method shuffle.
This example code adds shuffling support to the MySequenceDatastore class. Vertical ellipses indicate where you should copy code from the MySequenceDatastore implementation.
```matlab
classdef MySequenceDatastore < matlab.io.Datastore & ...
                       matlab.io.datastore.MiniBatchable & ...
                       matlab.io.datastore.Shuffleable

    % previously defined properties
    .
    .
    .

    methods

        % previously defined methods
        .
        .
        .

        function dsNew = shuffle(ds)
            % dsNew = shuffle(ds) shuffles the files and the
            % corresponding labels in the datastore.

            % Create a copy of datastore
            dsNew = copy(ds);
            dsNew.Datastore = copy(ds.Datastore);
            fds = dsNew.Datastore;

            % Shuffle files and corresponding labels
            numObservations = dsNew.NumObservations;
            idx = randperm(numObservations);
            fds.Files = fds.Files(idx);
            dsNew.Labels = dsNew.Labels(idx);
        end

    end

end
```
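The shuffle method applies one permutation to both the file list and the labels, which is what keeps each observation paired with its label. This standalone sketch shows the same idea on plain arrays. The file names, labels, and fixed random seed are assumptions made for illustration.

```matlab
% Hypothetical file names and their labels (first letter = label).
files  = {'a1.mat';'a2.mat';'b1.mat';'b2.mat'};
labels = categorical({'a';'a';'b';'b'});

rng(0);                          % fixed seed, only to make the sketch repeatable
idx = randperm(numel(files));    % one permutation ...

shuffledFiles  = files(idx);     % ... applied to the files
shuffledLabels = labels(idx);    % ... and to the labels

% Each file is still paired with the label it started with.
for i = 1:numel(shuffledFiles)
    assert(strcmp(shuffledFiles{i}(1),char(shuffledLabels(i))));
end
```

Drawing a separate permutation for the labels would break this pairing, so the index vector is computed once and reused.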
If you have followed all the instructions presented here, then the implementation of your custom mini-batch datastore is complete. Before using this datastore, qualify it using the guidelines presented in Testing Guidelines for Custom Datastores (MATLAB).
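As a quick first check before the full guidelines, you can read through a small data set and confirm that each mini-batch has the expected shape. The sketch below assumes that MySequenceDatastore.m, as defined above, is on the MATLAB path. The temporary folder, the class folder names, and the file names are hypothetical and exist only for this check.

```matlab
% Build a small temporary data set: one subfolder per class label and
% one MAT-file per observation, each holding a variable named X.
rootFolder = tempname;
classNames = {'classA','classB'};          % hypothetical labels
for c = 1:numel(classNames)
    classFolder = fullfile(rootFolder,classNames{c});
    mkdir(classFolder);
    for k = 1:3
        X = rand(4,5 + k);                 % 4 features, varying length
        save(fullfile(classFolder,sprintf('seq%d.mat',k)),'X');
    end
end

% Assumes MySequenceDatastore.m from this topic is on the path.
ds = MySequenceDatastore(rootFolder);
ds.MiniBatchSize = 2;

reset(ds);
numBatches = 0;
while hasdata(ds)
    data = read(ds);
    numBatches = numBatches + 1;

    % Each mini-batch is a table with one row per observation.
    assert(height(data) == ds.MiniBatchSize);

    % All sequences in a mini-batch are padded to the same length.
    lengths = cellfun(@(x) size(x,2),data.predictors);
    assert(all(lengths == max(lengths)));
end

% hasdata returns true only while a full mini-batch remains.
assert(numBatches == floor(ds.NumObservations / ds.MiniBatchSize));

rmdir(rootFolder,'s');                     % clean up the temporary files
```

This check covers only the basic read loop. It does not replace the testing guidelines.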