Create datastore for large collections of data
ds = datastore(location,Name,Value) specifies additional parameters for ds using one or more name-value pair arguments. For example, you can create a datastore for image files by specifying 'Type','image'.
Create a datastore associated with the sample file airlinesmall.csv. This file contains airline data from the years 1987 through 2008.
To manage the import of missing data in numeric columns, use the 'TreatAsMissing' and 'MissingValue' name-value pair arguments. Specify the value 'NA' for 'TreatAsMissing' and 0 for 'MissingValue' to replace every instance of 'NA' with a 0 in the imported data.
ds = datastore('airlinesmall.csv','TreatAsMissing','NA',...
    'MissingValue',0)
ds =
  TabularTextDatastore with properties:

                      Files: {
                             ' .../devel/bat/BR2020ad/build/matlab/toolbox/matlab/demos/airlinesmall.csv'
                             }
                    Folders: {
                             '/mathworks/devel/bat/BR2020ad/build/matlab/toolbox/matlab/demos'
                             }
               FileEncoding: 'UTF-8'
   AlternateFileSystemRoots: {}
      PreserveVariableNames: false
          ReadVariableNames: true
              VariableNames: {'Year', 'Month', 'DayofMonth' ... and 26 more}
             DatetimeLocale: en_US

  Text Format Properties:
             NumHeaderLines: 0
                  Delimiter: ','
               RowDelimiter: '\r\n'
             TreatAsMissing: 'NA'
               MissingValue: 0

  Advanced Text Format Properties:
            TextscanFormats: {'%f', '%f', '%f' ... and 26 more}
                   TextType: 'char'
         ExponentCharacters: 'eEdD'
               CommentStyle: ''
                 Whitespace: ' \b\t'
    MultipleDelimitersAsOne: false

  Properties that control the table returned by preview, read, readall:
      SelectedVariableNames: {'Year', 'Month', 'DayofMonth' ... and 26 more}
            SelectedFormats: {'%f', '%f', '%f' ... and 26 more}
                   ReadSize: 20000 rows
                 OutputType: 'table'
                   RowTimes: []

  Write-specific Properties:
     SupportedOutputFormats: ["txt" "csv" "xlsx" "xls" "parquet" "parq"]
        DefaultOutputFormat: "txt"
datastore creates a TabularTextDatastore.
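The effect of 'TreatAsMissing' and 'MissingValue' can also be checked on a small file. The following is a minimal sketch, assuming a temporary CSV file whose name and contents are hypothetical and created only for illustration; it is not part of the shipped example data.

```matlab
% Write a small CSV file with an 'NA' placeholder to a temporary location.
% The file name and its contents are hypothetical.
file = [tempname '.csv'];
fid = fopen(file,'w');
fprintf(fid,'Year,ArrDelay\n1987,8\n1988,12\n1989,NA\n');
fclose(fid);

% Treat 'NA' in numeric columns as missing and import it as 0.
ds = datastore(file,'TreatAsMissing','NA','MissingValue',0);
data = readall(ds);

% Show the ArrDelay column after import.
disp(data.ArrDelay')

% Clean up the temporary file.
delete(file);
```

Reading the whole file with readall is reasonable here only because the file is tiny; for large collections, read in chunks instead.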
Create a datastore containing all .tif files in the MATLAB® path and its subfolders.
ds = datastore(fullfile(matlabroot,'toolbox','matlab'),...
    'IncludeSubfolders',true,'FileExtensions','.tif','Type','image')
ds =
  ImageDatastore with properties:

       Files: {
              ' ...\matlab\toolbox\matlab\demos\example.tif';
              ' ...\matlab\toolbox\matlab\imagesci\corn.tif'
              }
    ReadSize: 1
      Labels: {}
     ReadFcn: @readDatastoreImage
location — Files or folders to include in the datastore
path | DsFileSet object
Files or folders included in the datastore, specified as a path or a DsFileSet object.
path — Specify the path as a character vector, cell array of character vectors, string scalar, or a string array, containing the location of files or folders that are local or remote.
Local files or folders — Specify location as a local path to files or folders. If the files are not in the current folder, then the local path must be a full or relative path. Files within subfolders of the specified folder are not automatically included in the datastore. You can use the wildcard character (*) when specifying the local path. This character specifies that the datastore include all matching files or all files in the matching folders.
Remote files or folders — Specify location as the full path of the files or folders, given as a uniform resource locator (URL) of the form hdfs:///path_to_file. For more information, see Work with Remote Data.
DsFileSet object — You also can specify location as a DsFileSet object. For more information, see matlab.io.datastore.DsFileSet.
When location represents a folder, the datastore includes only supported file formats and ignores any other format. To specify a custom list of file extensions to include in your datastore, see the FileExtensions property.
For KeyValueDatastore, the files must be MAT-files or Sequence files generated by the mapreduce function. MAT-files must be in a local file system or in a network file system. Sequence files can be in a local, network, or HDFS™ file system.
For DatabaseDatastore, the location argument need not be files. For more information, see DatabaseDatastore.
Example: 'file1.csv'
Example: '../dir/data/file1.jpg'
Example: {'C:\dir\data\file1.xls','C:\dir\data\file2.xlsx'}
Example: 'C:\dir\data\*.mat'
Example: 'hdfs:///data/file1.txt'
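As an illustration of the wildcard form of location, here is a minimal sketch, assuming a temporary folder with hypothetical file names created only for this example. It shows that a pattern such as *.csv selects only the matching files.

```matlab
% Create a temporary folder holding two .csv files and one .txt file.
% The folder and file names are hypothetical.
folder = tempname;
mkdir(folder);
names = {'a.csv','b.csv','notes.txt'};
for k = 1:numel(names)
    fid = fopen(fullfile(folder,names{k}),'w');
    fprintf(fid,'x,y\n1,2\n');
    fclose(fid);
end

% The wildcard restricts the datastore to files that match the pattern.
ds = datastore(fullfile(folder,'*.csv'));
numFiles = numel(ds.Files);
disp(numFiles)

% Clean up the temporary folder and its contents.
rmdir(folder,'s');
```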
Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside quotes. You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.
Example: 'FileExtensions',{'.jpg','.tif'} includes all files with a .jpg or .tif extension for an ImageDatastore object.
'Type' — Type of datastore
'tabulartext' | 'image' | 'spreadsheet' | 'keyvalue' | 'file' | 'tall' | ...
Type of datastore, specified as the comma-separated pair consisting of 'Type' and one of the following:
Value of 'Type' | Description |
---|---|
'tabulartext' | Text files containing tabular data. The encoding of the data must be ASCII or UTF-8. |
'image' | Image files in a format such as JPEG or PNG. Acceptable files include imformats formats. |
'spreadsheet' | Spreadsheet files containing one or more sheets. |
'keyvalue' | Key-value pair data contained in MAT-files or Sequence files with data generated by mapreduce. |
'file' | Custom format files, which require a specified read function to read the data. For more information, see FileDatastore. |
'tall' | MAT-files or Sequence files produced by the write function of the tall data type. For more information, see TallDatastore. |
'parquet' | Parquet files containing column-oriented data. For more information, see ParquetDatastore. |
'database' | Data stored in a database. Requires Database Toolbox™ and an additional input argument when you use the 'Type' parameter. For more information, see DatabaseDatastore. |
If there are multiple types that support the format of the files, then use the 'Type' argument to specify a datastore type.
If you do not specify a value for 'Type', then datastore automatically determines the appropriate type of datastore to create based on the extensions of the files.
Data Types: char | string
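The difference between automatic type detection and an explicit 'Type' can be sketched as follows. This assumes a hypothetical temporary .csv file, and it assumes that the FileDatastore property ReadFcn can be passed through datastore as a name-value pair (here with fileread as the read function); treat it as an illustration, not a definitive recipe.

```matlab
% Write a small CSV file to a temporary location (hypothetical name and data).
file = [tempname '.csv'];
fid = fopen(file,'w');
fprintf(fid,'x,y\n1,2\n3,4\n');
fclose(fid);

% Without 'Type', the datastore type is chosen from the file extension.
dsAuto = datastore(file);

% With 'Type','file', the same file is handled by a custom read function.
% Passing 'ReadFcn' here is an assumption based on the FileDatastore property.
dsFile = datastore(file,'Type','file','ReadFcn',@fileread);

% Compare the resulting classes.
disp(class(dsAuto))
disp(class(dsFile))

% Clean up the temporary file.
delete(file);
```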
'IncludeSubfolders' — Include subfolders within folder
true or false | 0 or 1
Include subfolders within a folder, specified as the comma-separated pair consisting of 'IncludeSubfolders' and true (1) or false (0). Specify true to include all files and subfolders within each folder or false to include only the files within each folder.
If you do not specify 'IncludeSubfolders', then the default value is false.
The 'IncludeSubfolders' name-value pair is only valid when creating these objects:
TabularTextDatastore
ImageDatastore
SpreadsheetDatastore
FileDatastore
KeyValueDatastore
ParquetDatastore
Example: 'IncludeSubfolders',true
Data Types: logical | double
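A minimal sketch of how 'IncludeSubfolders' changes the set of files, assuming a temporary folder tree with hypothetical names created only for illustration:

```matlab
% Build a temporary folder with one file at the top level and one in a
% subfolder. All names are hypothetical.
root = tempname;
mkdir(fullfile(root,'sub'));
files = {fullfile(root,'top.csv'), fullfile(root,'sub','nested.csv')};
for k = 1:numel(files)
    fid = fopen(files{k},'w');
    fprintf(fid,'x,y\n1,2\n');
    fclose(fid);
end

% Default behavior: only files directly inside the folder are included.
dsTop = datastore(root);

% With 'IncludeSubfolders',true the nested file is included as well.
dsAll = datastore(root,'IncludeSubfolders',true);

numTop = numel(dsTop.Files);
numAll = numel(dsAll.Files);
disp([numTop numAll])

% Clean up the temporary folder tree.
rmdir(root,'s');
```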
'FileExtensions' — Extensions of files
Extensions of files, specified as the comma-separated pair consisting of 'FileExtensions' and a character vector, cell array of character vectors, string scalar, or string array. When specifying 'FileExtensions', also specify 'Type'. You can use the empty quotes '' to represent files without extensions.
If 'FileExtensions' is not specified, then datastore automatically includes all supported file extensions depending on the datastore type. If you want to include unsupported extensions, then specify each extension you want to include individually.
For TabularTextDatastore objects, supported extensions include .txt, .csv, .dat, .dlm, .asc, .text, and no extension.
For ImageDatastore objects, supported extensions include all imformats extensions.
For SpreadsheetDatastore objects, supported extensions include .xls, .xlsx, .xlsm, .xltx, and .xltm.
For TallDatastore objects, supported extensions include .mat and .seq.
For ParquetDatastore objects, supported extensions include .parquet and .parq.
The 'FileExtensions' name-value pair is only valid when creating these objects:
TabularTextDatastore
ImageDatastore
SpreadsheetDatastore
FileDatastore
KeyValueDatastore
ParquetDatastore
Example: 'FileExtensions','.jpg'
Example: 'FileExtensions',{'.txt','.text'}
Data Types: char | cell | string
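To illustrate including an extension that is not in the supported list, here is a minimal sketch. It assumes a temporary folder with hypothetical .log files that contain comma-separated text, and it specifies 'Type' together with 'FileExtensions' as described above.

```matlab
% Create a temporary folder with two .log files and one .csv file.
% The names and contents are hypothetical.
folder = tempname;
mkdir(folder);
names = {'run1.log','run2.log','summary.csv'};
for k = 1:numel(names)
    fid = fopen(fullfile(folder,names{k}),'w');
    fprintf(fid,'x,y\n1,2\n');
    fclose(fid);
end

% Include only the .log files and read them as tabular text.
ds = datastore(folder,'Type','tabulartext','FileExtensions','.log');
numLogFiles = numel(ds.Files);
disp(numLogFiles)

% Clean up the temporary folder and its contents.
rmdir(folder,'s');
```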
'AlternateFileSystemRoots' — Alternate file system root paths
Alternate file system root paths, specified as the comma-separated pair consisting of 'AlternateFileSystemRoots' and a string vector or a cell array. Use 'AlternateFileSystemRoots' when you create a datastore on a local machine, but need to access and process the data on another machine (possibly of a different operating system). Also, when you process data using Parallel Computing Toolbox™ and MATLAB® Parallel Server™, and the data is stored on your local machines with a copy available on cloud or cluster machines of a different platform, you must use 'AlternateFileSystemRoots' to associate the root paths.
To associate a set of root paths that are equivalent to one another, specify 'AlternateFileSystemRoots' as a string vector. For example,
["Z:\datasets","/mynetwork/datasets"]
To associate multiple sets of root paths that are equivalent for the datastore, specify 'AlternateFileSystemRoots' as a cell array containing multiple rows, where each row represents a set of equivalent root paths. Specify each row in the cell array as either a string vector or a cell array of character vectors. For example:
Specify 'AlternateFileSystemRoots' as a cell array of string vectors.
{["Z:\datasets","/mynetwork/datasets"];...
 ["Y:\datasets","/mynetwork2/datasets","S:\datasets"]}
Alternatively, specify 'AlternateFileSystemRoots' as a cell array of cell arrays of character vectors.
{{'Z:\datasets','/mynetwork/datasets'};...
 {'Y:\datasets','/mynetwork2/datasets','S:\datasets'}}
The value of 'AlternateFileSystemRoots'
must satisfy these conditions:
Contains one or more rows, where each row specifies a set of equivalent root paths.
Each row specifies multiple root paths and each root path must contain at least two characters.
Root paths are unique and are not subfolders of one another.
Contains at least one root path entry that points to the location of the files.
For more information, see Set Up Datastore for Processing on Different Machines or Clusters.
Example: ["Z:\datasets","/mynetwork/datasets"]
Data Types: string | cell
'TextType' — Output data type of text variables
'char' (default) | 'string'
Output data type of text variables, specified as the comma-separated pair consisting of 'TextType' and either 'char' or 'string'. If the output table from the read, readall, or preview functions contains text variables, then 'TextType' specifies the data type of those variables for TabularTextDatastore and SpreadsheetDatastore objects only. If 'TextType' is 'char', then the output is a cell array of character vectors. If 'TextType' is 'string', then the output has type string.
Data Types: char | string
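The two 'TextType' settings can be compared with a short sketch, assuming a hypothetical temporary CSV file that has one text column and one numeric column:

```matlab
% Write a small CSV file with a text column (hypothetical name and data).
file = [tempname '.csv'];
fid = fopen(file,'w');
fprintf(fid,'Carrier,Delay\nAA,5\nUA,12\n');
fclose(fid);

% Default 'TextType' is 'char': text is returned as a cell array of
% character vectors.
dsChar = datastore(file);
tChar = readall(dsChar);

% 'TextType','string': text is returned as a string array.
dsStr = datastore(file,'TextType','string');
tStr = readall(dsStr);

disp(class(tChar.Carrier))
disp(class(tStr.Carrier))

% Clean up the temporary file.
delete(file);
```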
'DatetimeType' — Type for imported date and time data
'datetime' (default) | 'text'
Type for imported date and time data, specified as the comma-separated pair consisting of 'DatetimeType' and one of these values: 'datetime' or 'text'. The 'DatetimeType' argument only applies when creating a TabularTextDatastore object.
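Value | Type for Imported Date and Time Data |
---|---|
'datetime' | MATLAB datetime data type. For more information, see datetime. |
'text' | If 'DatetimeType' is 'text', then the type of the imported date and time data depends on the value of 'TextType': a cell array of character vectors when 'TextType' is 'char', or a string array when 'TextType' is 'string'. |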
Example: 'DatetimeType','datetime'
Data Types: char | string
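A minimal sketch of the two 'DatetimeType' settings follows. It assumes a hypothetical temporary CSV file whose first column holds dates in a format that is detected as date and time data (yyyy-MM-dd here), and it leaves 'TextType' at its default of 'char'.

```matlab
% Write a small CSV file with a date column (hypothetical name and data).
file = [tempname '.csv'];
fid = fopen(file,'w');
fprintf(fid,'Date,Value\n2020-01-15,1\n2020-02-20,2\n');
fclose(fid);

% Default 'DatetimeType' is 'datetime': dates are imported as datetime values.
dsDt = datastore(file);
tDt = readall(dsDt);

% 'DatetimeType','text': dates are imported as text instead.
dsTxt = datastore(file,'DatetimeType','text');
tTxt = readall(dsTxt);

disp(class(tDt.Date))
disp(class(tTxt.Date))

% Clean up the temporary file.
delete(file);
```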
'DurationType' — Output data type of duration data
'duration' (default) | 'text'
Output data type of duration data from text files, specified as the comma-separated pair consisting of 'DurationType' and either 'duration' or 'text'.
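Value | Type for Imported Duration Data |
---|---|
'duration' | MATLAB duration data type. For more information, see duration. |
'text' | If 'DurationType' is 'text', then the type of the imported duration data depends on the value of 'TextType': a cell array of character vectors when 'TextType' is 'char', or a string array when 'TextType' is 'string'. |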
Data Types: char | string
'PreserveVariableNames' — Flag to preserve variable names
false (default) | true
Flag to preserve variable names, specified as the comma-separated pair consisting of 'PreserveVariableNames' and either true or false.
true — Preserve variable names that are not valid MATLAB identifiers, such as variable names that include spaces and non-ASCII characters.
false — Convert invalid variable names (as determined by the isvarname function) to valid MATLAB identifiers.
Starting in R2019b, variable names and row names can include any characters, including spaces and non-ASCII characters. Also, they can start with any character, not just letters. Variable and row names do not have to be valid MATLAB identifiers (as determined by the isvarname function). To preserve these variable names and row names, set 'PreserveVariableNames' to true.
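The effect of 'PreserveVariableNames' can be sketched with a header row that contains spaces. This assumes a hypothetical temporary CSV file and a release (R2019b or later) in which variable names are allowed to contain spaces.

```matlab
% Write a CSV file whose column headers contain spaces
% (hypothetical name and data).
file = [tempname '.csv'];
fid = fopen(file,'w');
fprintf(fid,'Flight Number,Arr Delay\n101,5\n202,12\n');
fclose(fid);

% Default (false): headers are converted to valid MATLAB identifiers.
dsValid = datastore(file);

% true: headers are kept exactly as they appear in the file.
dsRaw = datastore(file,'PreserveVariableNames',true);

disp(dsValid.VariableNames)
disp(dsRaw.VariableNames)

% Clean up the temporary file.
delete(file);
```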
In addition to these name-value pairs, you also can specify any of the properties of the resulting datastore object as name-value pairs, except for the Files property.
ds — Datastore for collection of data
TabularTextDatastore | ImageDatastore | SpreadsheetDatastore | KeyValueDatastore | FileDatastore | TallDatastore | ...
Datastore for a collection of data, returned as one of these objects: TabularTextDatastore, ImageDatastore, SpreadsheetDatastore, KeyValueDatastore, FileDatastore, TallDatastore, ParquetDatastore, or DatabaseDatastore. The type of the datastore depends on the type of files or the location argument. For more information, click the datastore name in the following table:
Type | Output |
---|---|
Text files | TabularTextDatastore |
Image files | ImageDatastore |
Spreadsheet files | SpreadsheetDatastore |
MAT-files or Sequence files produced by mapreduce | KeyValueDatastore |
Custom format files | FileDatastore |
MAT-files or Sequence files produced by the write function of the tall data type | TallDatastore |
Parquet Files | ParquetDatastore |
Database | DatabaseDatastore |
For each of these datastore types, the Files property is a cell array of character vectors. Each character vector is an absolute path to a file resolved by the location argument.
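The shape of the Files property can be inspected directly. The following is a minimal sketch, assuming a hypothetical temporary folder that contains a single CSV file:

```matlab
% Create a temporary folder with one CSV file (hypothetical names).
folder = tempname;
mkdir(folder);
fid = fopen(fullfile(folder,'data.csv'),'w');
fprintf(fid,'x,y\n1,2\n');
fclose(fid);

% Create the datastore from the folder and inspect the Files property.
ds = datastore(folder);
filesAreCellstr = iscellstr(ds.Files);   % cell array of character vectors
firstFileExists = isfile(ds.Files{1});   % each entry resolves to a file
disp(ds.Files{1})

% Clean up the temporary folder and its contents.
rmdir(folder,'s');
```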
CombinedDatastore | TransformedDatastore | fileDatastore | imageDatastore | imformats | javaaddpath | mapreduce | parquetDatastore | spreadsheetDatastore | tabularTextDatastore | tall