Datastore with custom file reader
Use a FileDatastore object to manage large collections of custom format files where the collection does not necessarily fit in memory, or when a single large custom format file does not fit in memory. You can create a FileDatastore object using the fileDatastore function, specify its properties, and then import and process the data using object functions.
fds = fileDatastore(location,'ReadFcn',@fcn,Name,Value) specifies additional parameters and properties for fds using one or more name-value pair arguments. For example, you can specify which files to include in the datastore depending on their extensions with fileDatastore(location,'ReadFcn',@customreader,'FileExtensions',{'.exts','.extx'}).
location — Files or folders to include in datastore
path | DsFileSet object
Files or folders included in the datastore, specified as a path or a DsFileSet object.
path — Specify the path as a character vector, cell array of character vectors, string scalar, or string array containing the location of files or folders that are local or remote.
Local files or folders — Specify location as a local path to files or folders. If the files are not in the current folder, then the local path must be a full or relative path. Files within subfolders of the specified folder are not automatically included in the datastore. You can use the wildcard character (*) when specifying the local path. This character specifies that the datastore include all matching files or all files in the matching folders.
Remote files or folders — Specify location as the full path of the files or folders, given as a uniform resource locator (URL) of the form hdfs:///path_to_file. For more information, see Work with Remote Data.
DsFileSet object — You can also specify location as a DsFileSet object. For more information, see matlab.io.datastore.DsFileSet.
When location represents a folder, the datastore includes only supported file formats and ignores any other format. To specify a custom list of file extensions to include in your datastore, see the FileExtensions property.
Example: 'file1.ext'
Example: '../dir/data/file1.ext'
Example: {'C:\dir\data\file1.exts','C:\dir\data\file2.extx'}
Example: 'C:\dir\data\*.ext'
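The path forms described above can be sketched as follows. This is a minimal illustration, not part of the reference text: the temporary folder created with tempname, the file names, and the choice of fileread as the reader are all assumptions made so the example is self-contained.

```matlab
% Build a throwaway folder with three small files (illustrative data only).
folder = tempname;
mkdir(folder);
names = {'a.txt', 'b.txt', 'c.log'};
for k = 1:numel(names)
    fid = fopen(fullfile(folder, names{k}), 'w');
    fprintf(fid, '%s', names{k});
    fclose(fid);
end

% Wildcard path: include every file in the folder that matches *.txt.
fdsWild = fileDatastore(fullfile(folder, '*.txt'), 'ReadFcn', @fileread);

% Cell array of character vectors: include an explicit list of files.
fdsList = fileDatastore({fullfile(folder, 'a.txt'), fullfile(folder, 'c.log')}, ...
    'ReadFcn', @fileread);

numWild = numel(fdsWild.Files);   % number of files matched by the wildcard
numList = numel(fdsList.Files);   % number of files listed explicitly
```

In this sketch the wildcard datastore picks up the two .txt files, while the explicit list contains exactly the two named files regardless of extension.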
@fcn — Function that reads file data
Function that reads the file data, specified as a function handle.
The signature of the function represented by the function handle @fcn depends on the value of the specified ReadMode. The function that reads the file data must conform to one of these signatures.
ReadMode | Required function signature |
---|---|
'file' (default) | function data = MyReadFcn(filename) ... end |
'partialfile' | function [data,userdata,done] = MyReadFcn(filename,userdata) ... end |
'bytes' | function data = MyReadFcn(filename,offset,size) ... end |
The value specified in @fcn sets the value of the ReadFcn property.
Example: @customreader
Data Types: function_handle
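As an illustration of the default 'file' signature, the sketch below uses an anonymous function that takes a file name and returns the parsed contents of the whole file. The reader name readNumbers, the comma-separated file format, and the temporary folder are hypothetical choices made for this example only.

```matlab
% Write one small file of comma-separated numbers (illustrative data).
folder = tempname;
mkdir(folder);
fid = fopen(fullfile(folder, 'values.csvx'), 'w');
fprintf(fid, '1,2,3');
fclose(fid);

% A reader with the 'file' signature: data = fcn(filename).
% It reads the whole file and converts the comma-separated text to numbers.
readNumbers = @(filename) str2double(strsplit(fileread(filename), ','));

fds = fileDatastore(folder, 'ReadFcn', readNumbers);
values = read(fds);   % numeric row vector parsed from the first file
```

Because the read mode is left at its default, each call to read hands one complete file to the reader.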
Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside quotes. You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.
Example: fds = fileDatastore('C:\dir\data','ReadFcn',@customreader,'FileExtensions',{'.exts','.extx'})
'IncludeSubfolders' — Subfolder inclusion flag
true or false | 0 or 1
Subfolder inclusion flag, specified as the comma-separated pair consisting of 'IncludeSubfolders' and true, false, 0, or 1. Specify true to include all files and subfolders within each folder or false to include only the files within each folder.
If you do not specify 'IncludeSubfolders', then the default value is false.
Example: 'IncludeSubfolders',true
Data Types: logical | double
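The effect of this flag can be seen by comparing two datastores built on the same folder. The sketch below is illustrative only: the folder layout, file names, and the fileread reader are assumptions.

```matlab
% Create a folder that contains one file and one subfolder with another file.
root = tempname;
mkdir(fullfile(root, 'sub'));
fid = fopen(fullfile(root, 'top.txt'), 'w');
fprintf(fid, 'top');
fclose(fid);
fid = fopen(fullfile(root, 'sub', 'nested.txt'), 'w');
fprintf(fid, 'nested');
fclose(fid);

% Default behavior: only files directly inside root are included.
fdsFlat = fileDatastore(root, 'ReadFcn', @fileread);

% With 'IncludeSubfolders' set to true, files in subfolders are included too.
fdsDeep = fileDatastore(root, 'ReadFcn', @fileread, 'IncludeSubfolders', true);

numFlat = numel(fdsFlat.Files);
numDeep = numel(fdsDeep.Files);
```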
'FileExtensions' — Custom format file extensions
Custom format file extensions, specified as the comma-separated pair consisting of 'FileExtensions' and a character vector, cell array of character vectors, string scalar, or string array.
When you specify a file extension, the fileDatastore function creates a datastore object only for files with the specified extension. You can also create a datastore for files without any extensions by specifying 'FileExtensions' as an empty character vector, ''. If you do not specify 'FileExtensions', then fileDatastore automatically includes all files within a folder.
Example: 'FileExtensions',''
Example: 'FileExtensions','.ext'
Example: 'FileExtensions',{'.exts','.extx'}
Data Types: char | cell | string
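A short sketch of extension filtering follows. The extensions .txt, .dat, and .log, the temporary folder, and the fileread reader are hypothetical choices used only to make the example runnable.

```matlab
% Create a folder with three files that differ only in extension.
folder = tempname;
mkdir(folder);
names = {'a.txt', 'b.dat', 'c.log'};
for k = 1:numel(names)
    fid = fopen(fullfile(folder, names{k}), 'w');
    fprintf(fid, 'sample');
    fclose(fid);
end

% No 'FileExtensions': every file in the folder is included.
fdsAll = fileDatastore(folder, 'ReadFcn', @fileread);

% Restrict the datastore to two of the three extensions.
fdsSome = fileDatastore(folder, 'ReadFcn', @fileread, ...
    'FileExtensions', {'.txt', '.dat'});

numAll = numel(fdsAll.Files);
numSome = numel(fdsSome.Files);
```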
'PreviewFcn' — Function to preview input data
@ReadFcn (default) | function handle
Function to preview the input data, specified as a function handle. If you do not specify a preview function, FileDatastore uses the value specified in @ReadFcn as the default preview function. Alternatively, you can specify your own custom preview function for your data.
@ReadFcn (default) — Use ReadFcn to sample FileDatastore data. This option can lead to slower performance for tall construction.
Function handle — Use your custom preview function for FileDatastore and tall construction to sample the input data. Use PreviewFcn to provide a function that reads only the minimum needed part of input data for preview and tall construction.
The function specified by PreviewFcn must return values with the same data types that the ReadFcn returns.
Data Types: function_handle
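One way to keep previews cheap for MAT-files is to load a single small variable instead of the whole file. The sketch below assumes a hypothetical file with a small variable meta and a larger variable payload. Both the reader and the preview function return a struct, so the data types match as required.

```matlab
% Save a MAT-file with one small and one larger variable (illustrative data).
folder = tempname;
mkdir(folder);
meta = 'run-01';
payload = rand(200);
save(fullfile(folder, 'run01.mat'), 'meta', 'payload');

% ReadFcn loads the whole file; PreviewFcn loads only the small variable.
fds = fileDatastore(folder, ...
    'ReadFcn', @load, ...
    'PreviewFcn', @(filename) load(filename, 'meta'), ...
    'FileExtensions', '.mat');

sample = preview(fds);   % struct produced for previewing
full = read(fds);        % struct with every variable in the file
```

The preview function here is an anonymous wrapper around load; a named function with the same one-argument signature would serve equally well.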
'ReadMode' — Portion of the file to read
'file' (default) | 'partialfile' | 'bytes'
Portion of the file to read, specified as 'file', 'partialfile', or 'bytes'.
Value | Description |
---|---|
'file' (default) | Based on your custom read function, the file datastore reads the complete file with each call to read. |
'partialfile' | Based on your custom read function, the file datastore reads only a portion of the file with every call to the read function. |
'bytes' | Based on your custom read function, the file datastore reads a block of bytes from the file with every read operation. The block size is set by BlockSize. |
To use the subset and shuffle functions on a FileDatastore object, you must set 'ReadMode' to 'file'.
Data Types: char | string
'BlockSize' — Number of bytes to read
positive integer
Number of bytes to read with every read operation, specified as a positive integer. To ensure that you can distribute multiple blocks of a file across multiple parallel MATLAB® workers, specify BlockSize as a positive integer greater than 131072 bytes (128 kilobytes).
To specify or to change the value of BlockSize, you must first set ReadMode to 'bytes'. FileDatastore sets the default value of BlockSize based on the value specified in ReadMode.
If ReadMode is 'file' or 'partialfile', then FileDatastore sets the default BlockSize to inf.
If ReadMode is 'bytes', then FileDatastore sets the default BlockSize to 128 megabytes.
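The following sketch shows how 'bytes' mode and BlockSize might be combined. It rests on several assumptions that the text above does not confirm: that offset is a zero-based byte offset from the start of the file (so it can be passed to fseek directly), that the third argument is the number of bytes requested, and that a hypothetical local function readBlock is an acceptable reader. The file name and sizes are illustrative.

```matlab
% Write a 500000-byte binary file (illustrative data).
folder = tempname;
mkdir(folder);
file = fullfile(folder, 'payload.bin');
fid = fopen(file, 'w');
fwrite(fid, zeros(500000, 1, 'uint8'));
fclose(fid);

% ReadMode is set to 'bytes' before BlockSize, as the text requires.
fds = fileDatastore(file, 'ReadFcn', @readBlock, ...
    'ReadMode', 'bytes', 'BlockSize', 200000);

% Read block by block and record how many bytes each read returned.
blockSizes = [];
while hasdata(fds)
    block = read(fds);
    blockSizes(end+1) = numel(block); %#ok<SAGROW>
end

function data = readBlock(filename, offset, nbytes)
    % Hypothetical reader with the 'bytes' signature.
    % ASSUMPTION: offset counts bytes from the start of the file, starting at 0.
    fid = fopen(filename, 'r');
    closer = onCleanup(@() fclose(fid));
    fseek(fid, double(offset), 'bof');
    data = fread(fid, double(nbytes), '*uint8');
end
```

The block size of 200000 bytes is above the 131072-byte threshold mentioned above, so the blocks of this file could in principle be distributed across parallel workers.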
'AlternateFileSystemRoots' — Alternate file system root paths
Alternate file system root paths, specified as the comma-separated pair consisting of 'AlternateFileSystemRoots' and a string vector or a cell array. Use 'AlternateFileSystemRoots' when you create a datastore on a local machine, but need to access and process the data on another machine (possibly of a different operating system). Also, when processing data using the Parallel Computing Toolbox™ and the MATLAB Parallel Server™, and the data is stored on your local machines with a copy of the data available on different platform cloud or cluster machines, you must use 'AlternateFileSystemRoots' to associate the root paths.
To associate a set of root paths that are equivalent to one another, specify 'AlternateFileSystemRoots' as a string vector. For example,
["Z:\datasets","/mynetwork/datasets"]
To associate multiple sets of root paths that are equivalent for the datastore, specify 'AlternateFileSystemRoots' as a cell array containing multiple rows where each row represents a set of equivalent root paths. Specify each row in the cell array as either a string vector or a cell array of character vectors. For example:
Specify 'AlternateFileSystemRoots' as a cell array of string vectors.
{["Z:\datasets", "/mynetwork/datasets"];...
 ["Y:\datasets", "/mynetwork2/datasets","S:\datasets"]}
Alternatively, specify 'AlternateFileSystemRoots' as a cell array of cell arrays of character vectors.
{{'Z:\datasets','/mynetwork/datasets'};...
 {'Y:\datasets', '/mynetwork2/datasets','S:\datasets'}}
The value of 'AlternateFileSystemRoots' must satisfy these conditions:
Contains one or more rows, where each row specifies a set of equivalent root paths.
Each row specifies multiple root paths and each root path must contain at least two characters.
Root paths are unique and are not subfolders of one another.
Contains at least one root path entry that points to the location of the files.
For more information, see Set Up Datastore for Processing on Different Machines or Clusters.
Example: ["Z:\datasets","/mynetwork/datasets"]
Data Types: string | cell
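A minimal sketch of a single set of equivalent roots is shown below. The local root is a temporary folder created for the example and "/mynetwork/datasets" stands in for the equivalent path on another machine. Both are assumptions, and the second path does not need to exist on the machine that creates the datastore.

```matlab
% Create a local data folder with one file (illustrative data).
localRoot = tempname;
mkdir(localRoot);
fid = fopen(fullfile(localRoot, 'sample.txt'), 'w');
fprintf(fid, 'sample');
fclose(fid);

% One row of equivalent roots: the local folder and a hypothetical
% path under which the same data is visible on another machine.
altRoots = [string(localRoot), "/mynetwork/datasets"];

fds = fileDatastore(localRoot, 'ReadFcn', @fileread, ...
    'AlternateFileSystemRoots', altRoots);

numFiles = numel(fds.Files);
```

The row satisfies the conditions listed above: it names more than one root, the roots are distinct and not nested in each other, and one of them points to the actual location of the files.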
FileDatastore properties describe the files associated with a FileDatastore object. Except for the Files property, you can specify the value of FileDatastore properties using name-value pair arguments. To view or modify a property after creating the object, use dot notation.
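Dot notation access might look like the sketch below, which views the Files and Folders properties and then narrows Files to a single entry. The temporary folder, file names, and fileread reader are assumptions made for illustration.

```matlab
% Create a folder with two small files (illustrative data).
folder = tempname;
mkdir(folder);
names = {'first.txt', 'second.txt'};
for k = 1:numel(names)
    fid = fopen(fullfile(folder, names{k}), 'w');
    fprintf(fid, '%s', names{k});
    fclose(fid);
end

fds = fileDatastore(folder, 'ReadFcn', @fileread);

% View properties with dot notation.
filesBefore = fds.Files;      % full paths of the files in the datastore
foldersBefore = fds.Folders;  % folders used to construct the datastore

% Modify a property with dot notation: keep only the first file.
fds.Files = fds.Files(1);
numAfter = numel(fds.Files);
```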
Files — Files included in datastore
Files included in the datastore, resolved as a character vector, cell array of character vectors, string scalar, or string array, where each character vector or string is a full path to a file. The location argument in the fileDatastore and datastore functions defines Files when the datastore is created.
Example: {'C:\dir\data\file1.ext';'C:\dir\data\file2.ext'}
Example: 'hdfs:///data/*.mat'
Data Types: char | cell | string
Folders — Folders used to construct datastore
This property is read-only.
Folders used to construct the datastore, returned as a cell array of character vectors. The cell array is oriented as a column vector. Each character vector is a path to a folder that contains data files. The location argument in the fileDatastore and datastore functions defines Folders when the datastore is created.
The Folders property is reset when you modify the Files property of a FileDatastore object.
Data Types: cell
ReadFcn — Function that reads file data
Function that reads the file data, specified as a function handle. The value specified by @fcn sets the value of the ReadFcn property.
Example: @MyCustomFileReader
Data Types: function_handle
UniformRead — Vertically concatenateable flag
false (default) | true
This property is read-only.
Vertically concatenateable flag, specified as a logical true or false. Specify the value of this property when you first create the FileDatastore object.
Value | Description |
---|---|
true | Multiple reads of the FileDatastore object return uniform data that can be vertically concatenated. |
false (default) | Multiple reads of the FileDatastore object do not return uniform data that can be vertically concatenated. |
Example: fds = fileDatastore(location,'ReadFcn',@load,'UniformRead',true)
Data Types: logical | double
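The practical difference shows up in readall. In the sketch below, which assumes a hypothetical reader that returns one number per file, readall is expected to return a cell array when UniformRead is false and a single vertically concatenated array when it is true. The file names and contents are illustrative.

```matlab
% Create two files that each hold a single number (illustrative data).
folder = tempname;
mkdir(folder);
fid = fopen(fullfile(folder, 'a.txt'), 'w');
fprintf(fid, '1.5');
fclose(fid);
fid = fopen(fullfile(folder, 'b.txt'), 'w');
fprintf(fid, '2.5');
fclose(fid);

% Hypothetical reader: returns the number stored in the file as a double.
readScalar = @(filename) str2double(fileread(filename));

% UniformRead false (default): the results of the reads are kept separate.
fdsCell = fileDatastore(folder, 'ReadFcn', readScalar);
asCell = readall(fdsCell);

% UniformRead true: the results of the reads are concatenated vertically.
fdsUniform = fileDatastore(folder, 'ReadFcn', readScalar, 'UniformRead', true);
asColumn = readall(fdsUniform);
```

Setting UniformRead to true only makes sense when every file yields data with compatible dimensions, as the scalar reader here does.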
SupportedOutputFormats — List of formats supported for writing
This property is read-only.
List of formats supported for writing, returned as a row vector of strings. This property specifies the possible output formats when using writeall to write output files from the datastore.
Data Types: string
hasdata | Determine if data is available to read |
numpartitions | Number of datastore partitions |
partition | Partition a datastore |
preview | Preview subset of data in datastore |
read | Read data in datastore |
readall | Read all data in datastore |
writeall | Write datastore to files |
reset | Reset datastore to initial state |
transform | Transform datastore |
combine | Combine data from multiple datastores |
isPartitionable | Determine whether datastore is partitionable |
isShuffleable | Determine whether datastore is shuffleable |
shuffle | Shuffle all data in datastore |
subset | Create subset of datastore or file-set |
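A few of the object functions listed above can be combined as in the following sketch. The four text files, the temporary folder, and the use of upper as the transformation are illustrative assumptions.

```matlab
% Create a folder with four small text files (illustrative data).
folder = tempname;
mkdir(folder);
for k = 1:4
    fid = fopen(fullfile(folder, sprintf('file%d.txt', k)), 'w');
    fprintf(fid, 'file%d', k);
    fclose(fid);
end

fds = fileDatastore(folder, 'ReadFcn', @fileread);

% Query how many partitions the datastore offers by default.
n = numpartitions(fds);

% Split the datastore into two parts and take the first part.
firstHalf = partition(fds, 2, 1);

% Apply a transformation to whatever each read returns.
upperDs = transform(fds, @upper);
txt = read(upperDs);   % contents of the first file, converted to uppercase
```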
Create a datastore for the files in the MATLAB® demos folder that have a .mat extension.
fds = fileDatastore(fullfile(matlabroot,'toolbox','matlab','demos'),'ReadFcn',@load,'FileExtensions','.mat')
Create a datastore containing all the .mat files within the MATLAB® demos folder, specifying the load function to read the file data.
fds = fileDatastore(fullfile(matlabroot,'toolbox','matlab','demos'),'ReadFcn',@load,'FileExtensions','.mat')
Read the first file in the datastore, and then read the second file.
data1 = read(fds); data2 = read(fds);
Read all files in the datastore simultaneously.
readall(fds);
Initialize a cell array to hold the data and a counter i.
dataarray = cell(numel(fds.Files), 1); i = 1;
Reset the datastore to the first file and read the files one at a time until there is no data left. Assign the data to the array dataarray.
reset(fds);
while hasdata(fds)
    dataarray{i} = read(fds);
    i = i+1;
end
You can create a datastore to read from a large MAT-file that does not necessarily fit in memory. Assuming that each array in the large MAT-file fits in the available memory, create a datastore to read and process the data in three steps:
Write a custom reading function that reads one array at a time from a MAT-file.
Set up the parameters of the datastore function to perform partial reads.
Read one array at a time from the MAT-file.
Write a custom function that reads one array at a time from a MAT-file. The function must have a signature as described in the @ReadFcn argument of FileDatastore. Save this file in your working folder or in a folder that is on the MATLAB path. For this example, a custom function load_variable is included here.
type load_variable.m
function [data,variables,done] = load_variable(filename,variables)
    % If variable list is empty,
    % create list of variables from the file
    if isempty(variables)
        variables = who('-file', filename);
    end

    % Load a variable from the list of variables
    data = load(filename, variables{1});

    % Remove the newly-read variable from the list
    variables(1) = [];

    % Move on to the next file if this file is done reading.
    done = isempty(variables);
end
Create and set up a FileDatastore containing accidents.mat. Specify the datastore parameters to use 'partialfile' as the read mode and load_variable as the custom reading function.
fds = fileDatastore('accidents.mat','ReadMode','partialfile','ReadFcn',@load_variable);
Read the first three variables from the file using the datastore. The file accidents.mat contains nine variables and every call to read returns one variable. Therefore, to get the first three variables, call the read function three times.
data = read(fds)
data = struct with fields:
datasources: {3x1 cell}
data = read(fds)
data = struct with fields:
hwycols: 17
data = read(fds)
data = struct with fields:
hwydata: [51x17 double]
Note that the sample file accidents.mat is small and fits in memory, but you can expect similar results for large MAT-files that do not fit in memory.
To use the subset and shuffle functions on a FileDatastore object, you must set 'ReadMode' to 'file'.
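With the read mode left at 'file', subset and shuffle can be used as sketched below. The four text files, the temporary folder, and the fileread reader are assumptions made for illustration.

```matlab
% Create a folder with four small text files (illustrative data).
folder = tempname;
mkdir(folder);
for k = 1:4
    fid = fopen(fullfile(folder, sprintf('file%d.txt', k)), 'w');
    fprintf(fid, 'file%d', k);
    fclose(fid);
end

% 'ReadMode' is stated explicitly here, although 'file' is the default.
fds = fileDatastore(folder, 'ReadFcn', @fileread, 'ReadMode', 'file');

% Keep only the first two files.
firstTwo = subset(fds, [1 2]);

% Create a datastore with the same files in random order.
shuffled = shuffle(fds);

numSubset = numel(firstTwo.Files);
numShuffled = numel(shuffled.Files);
```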