Datastore for spreadsheet files
Use a SpreadsheetDatastore
object to manage large
collections of spreadsheet files where the collection does not necessarily fit in
memory. You can create a SpreadsheetDatastore
object using the
spreadsheetDatastore
function, specify its properties, and then
import the data using object functions.
creates a spreadsheet datastore from the collection of data specified by
ssds
= spreadsheetDatastore(location
)location
.
specifies additional parameters and properties for ssds
= spreadsheetDatastore(location
,Name,Value
)ssds
using one or more name-value
pair arguments. For example,
spreadsheetDatastore(location,'FileExtensions',{'.xlsx','.xls'})
specifies which files to include in the datastore depending on the file
extensions.
location
— Files or folders to include in datastoreDsFileSet
objectFiles or folders included in the datastore, specified as a path
or a DsFileSet
object.
path — Specify the path as a character vector, cell array of character vectors, string scalar, or a string array, containing the location of files or folders that are local or remote.
Local files or folders — Specify location
as a
local path to files or folders. If the files are not in the current
folder, then local path must specify full or relative paths. Files
within subfolders of the specified folder are not automatically included
in the datastore. You can use the wildcard character (*) when specifying
the local path. This character specifies that the datastore include all
matching files or all files in the matching folders.
Remote files or folders — Specify location
to be
the full paths of the files or folders as a uniform resource locator
(URL) of the form hdfs:///path_to_file
. For more
information, see Work with Remote Data.
DsFileSet
object — You also can specify
location
as a DsFileSet
object. For more
information, see matlab.io.datastore.DsFileSet
.
When location
represents a folder, the datastore includes only
supported file formats and ignores any other format. To specify a custom list of file extensions
to include in your datastore, see the FileExtensions
property.
The spreadsheetDatastore
function supports these
extensions: .xls
, .xlsx
,
.xlsm
, .xltx
, and
.xltm
.
Example: 'file1.xlsx'
Example: '../dir/data/file1.xlsx'
Example: {'C:\dir\data\file1.xlsx','C:\dir\data\file2.xlsx'}
Example: 's3://bucketname/path_to_files/*.xls'
Specify optional
comma-separated pairs of Name,Value
arguments. Name
is
the argument name and Value
is the corresponding value.
Name
must appear inside quotes. You can specify several name and value
pair arguments in any order as
Name1,Value1,...,NameN,ValueN
.
ssds =
spreadsheetDatastore('C:\dir\spreadsheetdata','FileExtensions',{'.xls','.xlsm'})
'FileExtensions'
— Spreadsheet file extensionsSpreadsheet file extensions, specified as the comma-separated pair
consisting of 'FileExtensions'
and a character
vector, cell array of character vectors, string scalar, or string
array.
If you do not specify
'FileExtensions'
, then
spreadsheetDatastore
automatically includes all files with
.xls
, .xlsx
,
.xlsm
, .xltx
,
and .xltm
extensions in the specified
path.
If you want to include spreadsheets with non-standard
file extensions in the
SpreadsheetDatastore
, then
specify those extensions explicitly.
If you want to create a
SpreadsheetDatastore
for files
without any extensions, then specify
'FileExtensions'
as an empty
character vector, ''
.
Example: 'FileExtensions',''
Example: 'FileExtensions','.xls'
Example: 'FileExtensions',{'.xlsx','.xlsm'}
Data Types: char
| cell
| string
'IncludeSubfolders'
— Subfolder inclusion flagtrue
or false
| 0 or 1Subfolder inclusion flag, specified as the comma-separated pair
consisting of 'IncludeSubfolders'
and
true
, false
, 0, or 1.
Specify true
to include all files and subfolders
within each folder or false
to include only the
files within each folder.
If you do not specify 'IncludeSubfolders'
, then
the default value is false
.
Example: 'IncludeSubfolders',true
Data Types: logical
| double
'OutputType'
— Output datatype'auto'
(default) | 'table'
| 'timetable'
Output datatype, specified as the comma-separated pair consisting of 'OutputType'
and one of these values:
The value of OutputType
determines the data type returned by the preview
, read
, and readall
functions. Use this option in conjunction with the 'RowTimes'
name-value pair to return timetables from SpreadsheetDatastore
.
Example: 'OutputType','timetable'
Data Types: char
| string
'AlternateFileSystemRoots'
— Alternate file system root pathsAlternate file system root paths, specified as the comma-separated pair consisting of
'AlternateFileSystemRoots'
and a string vector or a cell array. Use
'AlternateFileSystemRoots'
when you create a datastore on a local
machine, but need to access and process the data on another machine (possibly of a different
operating system). Also, when processing data using the Parallel Computing Toolbox™ and the MATLAB®
Parallel Server™, and the data is stored on your local machines with a copy of the data available
on different platform cloud or cluster machines, you must use
'AlternateFileSystemRoots'
to associate the root paths.
To associate a set of root paths that are equivalent to one another, specify
'AlternateFileSystemRoots'
as a string vector. For
example,
["Z:\datasets","/mynetwork/datasets"]
To associate multiple sets of root paths that are equivalent for the datastore,
specify 'AlternateFileSystemRoots'
as a cell array containing
multiple rows where each row represents a set of equivalent root paths. Specify each row
in the cell array as either a string vector or a cell array of character vectors. For example:
Specify 'AlternateFileSystemRoots'
as a cell array of
string
vectors.
{["Z:\datasets", "/mynetwork/datasets"];... ["Y:\datasets", "/mynetwork2/datasets","S:\datasets"]}
Alternatively, specify 'AlternateFileSystemRoots'
as a cell
array of cell array of character
vectors.
{{'Z:\datasets','/mynetwork/datasets'};... {'Y:\datasets', '/mynetwork2/datasets','S:\datasets'}}
The value of 'AlternateFileSystemRoots'
must satisfy these conditions:
Contains one or more rows, where each row specifies a set of equivalent root paths.
Each row specifies multiple root paths and each root path must contain at least two characters.
Root paths are unique and are not subfolders of one another.
Contains at least one root path entry that points to the location of the files.
For more information, see Set Up Datastore for Processing on Different Machines or Clusters.
Example: ["Z:\datasets","/mynetwork/datasets"]
Data Types: string
| cell
'TextType'
— Output data type of text variables'char'
(default) | 'string'
Output data type of text variables, specified as the
comma-separated pair consisting of 'TextType'
and
either 'char'
or 'string'
.
If the output table from the read
,
readall
, or
preview
functions contains text
variables, then 'TextType'
specifies
the data type of those variables.
If 'TextType'
is
'char'
, then the output is a cell
array of character vectors.
If 'TextType'
is
'string'
, then the output has
type string
.
Data Types: char
| string
In addition to these name-value pairs, you also can specify any of the
properties on this page as name-value pairs (except for the
Files
property).
SpreadsheetDatastore
properties describe the
format of the files in a datastore object, and control how the data is read from the
datastore. Except for the Files
property, you can specify the
value of SpreadsheetDatastore
properties using name-value pair
arguments when you create the datastore object. To view or modify a property after
creating the object, use the dot notation:
ds = spreadsheetDatastore('airlinesmall_subset.xlsx'); ds.Sheets = [1 2]; ds.Range = 'A1:C6';
Files
— Files included in datastoreFiles included in the datastore, resolved as a cell array of character
vectors or a string array, where each character vector or string is a full
path to a file. The location
argument in the
spreadsheetDatastore
and datastore
functions define these files.
The first file specified in the cell array determines the variable names and format information for all files in the datastore.
Example: {'C:\dir\data\file1.xls';'C:\dir\data\file2.xls'}
Data Types: cell
| string
NumHeaderLines
— Number of lines to skip0
(default) | positive integerNumber of lines to skip at the beginning of each sheet when reading,
specified as a positive integer. When you also specify the
Range
property, NumHeaderLines
is
the number of lines to skip at the beginning of the specified block of
data.
Data Types: double
Sheets
— Sheets in files''
(default) | sheet names | sheet indicesSheets in files, specified as a character vector, cell array of character
vectors, string scalar, or string array containing sheet names, or as a
numeric vector of sheet indices. The empty character vector
''
indicates that all sheets in the files are
included.
Example: {'sheet1','sheet7'}
Example: [3 5 7]
Data Types: char
| cell
| string
| double
Range
— Row and column bounds''
(default) | character vector | string scalarRow and column bounds, specified as a character vector or string scalar
that defines a rectangular block of data in the sheets. The empty character
vector ''
indicates that the bounds are the beginning of
the file and the end of the data.
Example: 'B1:T7'
Example: 'A:C'
Data Types: char
| string
ReadVariableNames
— Indicator for reading variable namestrue
(default) | false
| 1
| 0
Indicator for reading the first row of the first file in the datastore as
variable names, specified as either true
(1
) or false
(0
).
If true
, then the first nonheader row of
the first file determines the variable names for the
data.
If false
, then the first nonheader row of
the first file contains the first row of data. Default variable
names are assigned as Var1
,
Var2
, and so on.
Data Types: logical
| double
VariableNamingRule
— Flag to preserve variable names'modify'
(default) | 'preserve'
Flag to preserve variable names, specified as the comma-separated pair consisting of
VariableNamingRule
and either true
, or
false
.
'preserve'
— Preserve variable names that are not valid
MATLAB identifiers such as variable names that include spaces and
non-ASCII characters.
'modify'
— Convert invalid variable names (as
determined by the isvarname
function) to
valid MATLAB identifiers.
Starting in R2019b, variable names and row names can include any characters, including
spaces and non-ASCII characters. Also, they can start with any characters, not just
letters. Variable and row names do not have to be valid MATLAB identifiers (as determined by the isvarname
function). To preserve these variable names and row names, set
the value of VariableNamingRule
to
'preserve'
.
Data Types: char
| string
VariableNames
— Names of variablesNames of variables in the datastore, specified as a character vector, cell array of character
vectors, string scalar, or string array. Specify the variable names in the order in
which they appear in the files. If you do not specify the variable names, the datastore
detects them from the first nonheader line in the first file. You can specify
VariableNames
with a character vector or string scalar, however
the datastore converts and stores the property value to a cell array of character
vectors. When modifying the VariableNames
property, the number of new
variable names must match the number of original variable names.
To support invalid MATLAB identifiers as variable names, such as variable names containing spaces
and non-ASCII characters, set the value of the VariableNamingRule
parameter to 'preserve'
.
If ReadVariableNames
is false
, then VariableNames
defaults to {'Var1','Var2', ...}
.
Example: {'Time','Date','Quantity'}
Data Types: char
| cell
| string
VariableTypes
— Variable types'double'
| 'char'
| 'string'
| 'categorical'
| 'datetime'
Variable types, specified as 'double'
,
'char'
, 'string'
,
'categorical'
, or 'datetime'
,
which indicates the type of each variable when reading the data.
The list of variable types corresponds with the variables in
VariableNames
. Types double
,
char
, and datetime
can be
automatically detected from the data. You can specify
VariableTypes
as a character vector or string scalar,
however spreadsheetDatastore
automatically converts it
to a cell array of character vectors or a string array.
Example: {'char','categorical'}
Data Types: char
| cell
| string
preview
, read
,
readall
TableSelectedVariableNames
— Variables to readVariables to read from the file, specified as a character vector, cell
array of character vectors, string scalar, or string array. Each character
vector or string contains the name of one variable. You can specify the
variable names in any order. You can specify
SelectedVariableNames
with a character vector or
string scalar, however spreadsheetDatastore
automatically converts it to a cell array of character vectors or a string
array.
To support invalid MATLAB identifiers as variable names, such as variable names
containing spaces and non-ASCII characters, set the value of
VariableNamingRule
parameter to
'preserve'
.
Example: {'Var3','Var7','Var4'}
Data Types: char
| cell
| string
SelectedVariableTypes
— Selected variable types'double'
| 'char'
| 'string'
| 'categorical'
| 'datetime'
Selected variable types, specified as 'double'
,
'char'
, 'string'
,
'categorical'
, or 'datetime'
,
which indicates the type of each selected variable when reading the data.
The list of variable types corresponds with the variables in
SelectedVariableNames
. Types
double
, char
, and
datetime
can be automatically detected from the data.
You can specify SelectedVariableTypes
as a character
vector or string scalar, however it is automatically converted to a cell
array of character vectors or string array.
Example: {'double','datetime'}
Data Types: char
| cell
| string
ReadSize
— Amount of data to read'file'
(default) | 'sheet'
| positive integer scalarAmount of data to read in a call to the read
function,
specified as 'file'
or 'sheet'
, or as
a positive integer scalar.
If ReadSize
is 'file'
,
then each call to read
reads all the data one
file at a time.
If ReadSize
is 'sheet'
,
then each call to read
reads all the data one
sheet at a time.
If ReadSize
is a positive integer, then
each call to read
reads the rows specified by
ReadSize
, or fewer if it reaches the end
of the data.
When you change ReadSize
from an integer scalar to
'file'
or 'sheet'
, or conversely,
the datastore resets using the reset
function.
Data Types: char
| string
| double
RowTimes
— Name of row times variableName of row times variable, specified as the comma-separated pair consisting of
'RowTimes'
and a variable name (such as
"Date"
) or a variable index (such as 3
).
RowTimes
is a timetable-related parameter. Each row of a timetable is
associated with a time, which is captured in a time vector for the timetable. The
variable specified in RowTimes
must contain a
datetime
or a duration
vector.
If the value of 'OutputType'
is 'timetable'
, but you do
not specify 'RowTimes'
, then SpreadsheetDatastore
uses the
first datetime
or duration
variable as the row
times for the timetable.
writeall
Folders
— Folders used to construct datastoreThis property is read-only.
Folders used to construct datastore, returned as a cell array of character
vectors. The cell array is oriented as a column vector. Each character
vector is a path to a folder that contains data files. The
location
argument in the
spreadsheetDatastore
and datastore
functions defines Folders
when the datastore is
created.
The Folders
property is reset when you modify the
Files
property of a
SpreadsheetDatastore
object.
Data Types: cell
SupportedOutputFormats
— List of formats supported for writingThis property is read-only.
List of formats supported for writing, returned as a row vector of strings. This property
specifies the possible output formats when using writeall
to write output files from the datastore.
Data Types: string
DefaultOutputFormat
— Default output formatThis property is read-only.
Default output format, returned as a string scalar. This property specifies the default format
when using writeall
to write output files from the datastore.
Data Types: string
hasdata | Determine if data is available to read |
numpartitions | Number of datastore partitions |
partition | Partition a datastore |
preview | Preview subset of data in datastore |
read | Read data in datastore |
readall | Read all data in datastore |
writeall | Write datastore to files |
reset | Reset datastore to initial state |
sheetnames | Query sheet names from datastore |
transform | Transform datastore |
combine | Combine data from multiple datastores |
isPartitionable | Determine whether datastore is partitionable |
isShuffleable | Determine whether datastore is shuffleable |
ssds = spreadsheetDatastore('airlinesmall_subset.xlsx')
ssds = SpreadsheetDatastore with properties: Files: { ' ...\Documents\MATLAB\Examples\matlab-ex26030327\airlinesmall_subset.xlsx' } Folders: { ' ...\OneDrive - MathWorks\Documents\MATLAB\Examples\matlab-ex26030327' } AlternateFileSystemRoots: {} Sheets: '' Range: '' Sheet Format Properties: NumHeaderLines: 0 VariableNamingRule: 'modify' ReadVariableNames: true VariableNames: {'Year', 'Month', 'DayofMonth' ... and 26 more} VariableTypes: {'double', 'double', 'double' ... and 26 more} Properties that control the table returned by preview, read, readall: SelectedVariableNames: {'Year', 'Month', 'DayofMonth' ... and 26 more} SelectedVariableTypes: {'double', 'double', 'double' ... and 26 more} ReadSize: 'file' OutputType: 'table' RowTimes: [] Write-specific Properties: SupportedOutputFormats: ["txt" "csv" "xlsx" "xls" "parquet" "parq"] DefaultOutputFormat: "xlsx"
Create a SpreadsheetDatastore
object containing the file airlinesmall_subset.xlsx
.
ssds = spreadsheetDatastore('airlinesmall_subset.xlsx')
ssds = SpreadsheetDatastore with properties: Files: { ' .../publish_examples7/tp89dfea3d/ex29672123/airlinesmall_subset.xlsx' } Folders: { '/tmp/BR2020bd_1444674_32127/publish_examples7/tp89dfea3d/ex29672123' } AlternateFileSystemRoots: {} Sheets: '' Range: '' Sheet Format Properties: NumHeaderLines: 0 VariableNamingRule: 'modify' ReadVariableNames: true VariableNames: {'Year', 'Month', 'DayofMonth' ... and 26 more} VariableTypes: {'double', 'double', 'double' ... and 26 more} Properties that control the table returned by preview, read, readall: SelectedVariableNames: {'Year', 'Month', 'DayofMonth' ... and 26 more} SelectedVariableTypes: {'double', 'double', 'double' ... and 26 more} ReadSize: 'file' OutputType: 'table' RowTimes: [] Write-specific Properties: SupportedOutputFormats: ["txt" "csv" "xlsx" "xls" "parquet" "parq"] DefaultOutputFormat: "xlsx"
Display the sheet names for the file. The file contains one sheet per year.
sheetnames(ssds,1)
ans = 13x1 string array "1996" "1997" "1998" "1999" "2000" "2001" "2002" "2003" "2004" "2005" "2006" "2007" "2008"
Specify the variable FlightNum
in the second sheet as the data of interest, and preview the first eight rows.
ssds.Sheets = 2;
ssds.SelectedVariableNames = 'FlightNum';
preview(ssds)
ans = 8x1 table FlightNum _________ 1014 1201 702 1184 1310 1759 1242 1558
Read only the first three rows of variables DepTime
and ArrTime
in the first sheet.
ssds.ReadSize = 3; ssds.Sheets = 1; ssds.SelectedVariableNames = {'DepTime','ArrTime'}; read(ssds)
ans = 3x2 table DepTime ArrTime _______ _______ 2117 2305 1252 1511 1441 1708
Read all of sheets four, five, and six.
ssds.Sheets = 4:6; readall(ssds);
Use the OutputType
and RowTimes
name-value pairs to make SpreadsheetDatastore
return timetables instead of tables.
Create a datastore for tsunamis_dated.xlsx
. Specify 'OutputType'
as 'timetable'
so that SpreadsheetDatastore
returns timetables instead of tables.
ssds = spreadsheetDatastore('tsunamis_dated.xlsx','OutputType','timetable'); preview(ssds)
ans=8×15 timetable
Date Latitude Longitude ValidityCode Validity CauseCode Cause EarthquakeMagnitude Country Location MaxHeight IidaMagnitude Intensity NumDeaths DescDeaths Time
___________ ________ _________ ____________ ____________________ _________ ____________________________ ___________________ _____________ ____________________________ _________ _____________ _________ _________ __________ ________________________
04-Mar-1952 42.15 143.85 4 {'definite tsunami'} 1 {'Earthquake' } 8.1 {'JAPAN' } {'SE. HOKKAIDO ISLAND' } 6.5 2.7 2 33 1 04-Mar-1952 01:22:41.000
10-Jul-1958 58.34 -136.52 4 {'definite tsunami'} 3 {'Earthquake and Landslide'} 8.3 {'USA' } {'SE. ALASKA, AK' } 524.26 4.6 5 5 1 10-Jul-1958 06:15:53.600
22-May-1960 -39.5 -74.5 4 {'definite tsunami'} 1 {'Earthquake' } 9.5 {'CHILE' } {'CENTRAL CHILE' } 25 4.6 4 1260 3 22-May-1960 19:11:17.000
20-Nov-1960 -6.8 -80.7 4 {'definite tsunami'} 1 {'Earthquake' } 6.8 {'PERU' } {'PERU' } 9 3.2 2.5 66 2 20-Nov-1960 22:01:56.400
28-Mar-1964 61.1 -147.5 4 {'definite tsunami'} 3 {'Earthquake and Landslide'} 9.2 {'USA' } {'PRINCE WILLIAM SOUND, AK'} 67 6.1 5 221 3 28-Mar-1964 03:36:14.000
16-Jun-1964 38.65 139.2 4 {'definite tsunami'} 1 {'Earthquake' } 7.5 {'JAPAN' } {'NW. HONSHU ISLAND' } 5.8 2.7 2 26 1 16-Jun-1964 04:01:44.300
14-Aug-1968 0.2 119.8 4 {'definite tsunami'} 1 {'Earthquake' } 7.8 {'INDONESIA'} {'BANDA SEA' } 10 3.3 3 200 3 14-Aug-1968 22:14:19.400
23-Feb-1969 -3.1 118.9 4 {'definite tsunami'} 1 {'Earthquake' } 6.9 {'INDONESIA'} {'MAKASSAR STRAIT' } 4 2 2 600 3 23-Feb-1969 00:36:56.600
When you do not specify 'RowTimes'
, spreadsheetDatastore
uses the first datetime or duration variable as the row times. In this case, the Date
variable is used for the row times. This data has two datetime variables: Date
and Time
. The Date
variable does not include information about hours, minutes, or seconds, while the Time
variable has the specific time of each event.
Specify the 'RowTimes'
option to use the event times (the Time
variable) as the row times.
ssds = spreadsheetDatastore('tsunamis_dated.xlsx','OutputType','timetable','RowTimes','Time'); preview(ssds)
ans=8×15 timetable
Time Latitude Longitude ValidityCode Validity CauseCode Cause EarthquakeMagnitude Country Location MaxHeight IidaMagnitude Intensity NumDeaths DescDeaths Date
________________________ ________ _________ ____________ ____________________ _________ ____________________________ ___________________ _____________ ____________________________ _________ _____________ _________ _________ __________ ___________
04-Mar-1952 01:22:41.000 42.15 143.85 4 {'definite tsunami'} 1 {'Earthquake' } 8.1 {'JAPAN' } {'SE. HOKKAIDO ISLAND' } 6.5 2.7 2 33 1 04-Mar-1952
10-Jul-1958 06:15:53.600 58.34 -136.52 4 {'definite tsunami'} 3 {'Earthquake and Landslide'} 8.3 {'USA' } {'SE. ALASKA, AK' } 524.26 4.6 5 5 1 10-Jul-1958
22-May-1960 19:11:17.000 -39.5 -74.5 4 {'definite tsunami'} 1 {'Earthquake' } 9.5 {'CHILE' } {'CENTRAL CHILE' } 25 4.6 4 1260 3 22-May-1960
20-Nov-1960 22:01:56.400 -6.8 -80.7 4 {'definite tsunami'} 1 {'Earthquake' } 6.8 {'PERU' } {'PERU' } 9 3.2 2.5 66 2 20-Nov-1960
28-Mar-1964 03:36:14.000 61.1 -147.5 4 {'definite tsunami'} 3 {'Earthquake and Landslide'} 9.2 {'USA' } {'PRINCE WILLIAM SOUND, AK'} 67 6.1 5 221 3 28-Mar-1964
16-Jun-1964 04:01:44.300 38.65 139.2 4 {'definite tsunami'} 1 {'Earthquake' } 7.5 {'JAPAN' } {'NW. HONSHU ISLAND' } 5.8 2.7 2 26 1 16-Jun-1964
14-Aug-1968 22:14:19.400 0.2 119.8 4 {'definite tsunami'} 1 {'Earthquake' } 7.8 {'INDONESIA'} {'BANDA SEA' } 10 3.3 3 200 3 14-Aug-1968
23-Feb-1969 00:36:56.600 -3.1 118.9 4 {'definite tsunami'} 1 {'Earthquake' } 6.9 {'INDONESIA'} {'MAKASSAR STRAIT' } 4 2 2 600 3 23-Feb-1969
You have a modified version of this example. Do you want to open this example with your edits?