Datastore for collection of Parquet files
Use a ParquetDatastore object to manage a collection of Parquet files, where each individual Parquet file fits in memory, but the entire collection of files does not necessarily fit. You can create a ParquetDatastore object using the parquetDatastore function, specify its properties, and then import and process the data using object functions.
pds = parquetDatastore(location) creates a datastore pds from the collection of Parquet files specified by location.
pds = parquetDatastore(location,Name,Value) specifies additional parameters and properties for pds using one or more name-value pair arguments.
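The two syntaxes can be sketched as follows. This is a minimal illustration, not a prescribed workflow: the scratch folder, the file names part1.parquet and part2.parquet, and the table contents are assumptions made up for the example, and the files are created with parquetwrite so that the code does not depend on any existing data.

```matlab
% Write two small tables to a scratch folder as Parquet files.
% The folder and file names are hypothetical.
folder = tempname;
mkdir(folder);
T1 = table((1:3)', [10.5; 20.25; 30], 'VariableNames', {'Id','Value'});
T2 = table((4:6)', [40; 50.5; 60],    'VariableNames', {'Id','Value'});
parquetwrite(fullfile(folder,'part1.parquet'), T1);
parquetwrite(fullfile(folder,'part2.parquet'), T2);

% Syntax 1: location only.
pds = parquetDatastore(folder);

% Syntax 2: location plus name-value pair arguments.
pdsByFile = parquetDatastore(folder, 'ReadSize', 'file', 'OutputType', 'table');

% Read the whole collection from the first datastore into one table.
allData = readall(pds);
```

Passing a folder as location lets the datastore pick up every supported file in that folder, so the same call keeps working as files are added.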
location — Files or folders to include in datastore
path | DsFileSet object
Files or folders included in the datastore, specified as a path or a DsFileSet object.
path — Specify the path as a character vector, cell array of character vectors, string scalar, or a string array, containing the location of files or folders that are local or remote.
Local files or folders — Specify location as a local path to files or folders. If the files are not in the current folder, then local path must specify full or relative paths. Files within subfolders of the specified folder are not automatically included in the datastore. You can use the wildcard character (*) when specifying the local path. This character specifies that the datastore include all matching files or all files in the matching folders.
Remote files or folders — Specify location to be the full paths of the files or folders as a uniform resource locator (URL) of the form hdfs:///path_to_file. For more information, see Work with Remote Data.
DsFileSet object — You also can specify location as a DsFileSet object. For more information, see matlab.io.datastore.DsFileSet.
When location represents a folder, the datastore includes only supported file formats and ignores any other format. To specify a custom list of file extensions to include in your datastore, see the FileExtensions property.
The parquetDatastore function supports the .parquet file format.
Example: 'myfile.parquet'
Example: '../dir/data/myfile.parquet'
Example: {'C:\dir\data\myfile01.parquet','C:\dir\data\myfile02.parquet'}
Example: 's3://bucketname/path_to_files/*.parquet'
Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside quotes. You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.
Example: 'IncludeSubfolders',true
'FileExtensions' — Extensions to include in datastore
Extensions to include in datastore, specified as the comma-separated pair consisting of 'FileExtensions' and a character vector, cell array of character vectors, string scalar, or string array.
If you do not specify 'FileExtensions', then parquetDatastore automatically includes all files with .parquet and .parq extensions in the specified path.
If you want to include Parquet files with non-standard file extensions in the parquetDatastore, then specify those extensions explicitly.
If you want to create a parquetDatastore for files without any extensions, then specify 'FileExtensions' as an empty character vector, ''.
Example: 'FileExtensions',{'.parquet','.parq'}
Example: 'FileExtensions','.myformat'
Example: 'FileExtensions',''
Data Types: char | cell | string
'IncludeSubfolders' — Subfolder inclusion flag
false (default) | true
Subfolder inclusion flag, specified as the comma-separated pair consisting of 'IncludeSubfolders' and true or false. Specify true to include all files and subfolders within each folder or false to include only the files within each folder.
If you do not specify 'IncludeSubfolders', then the default value is false.
Example: 'IncludeSubfolders',true
Data Types: logical | double
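The effect of 'IncludeSubfolders' can be seen by comparing the Files property of two datastores built on the same folder. The following sketch assumes a scratch folder with one nested subfolder. The folder name nested and the file names top.parquet and deep.parquet are hypothetical.

```matlab
% Create a scratch folder that contains one subfolder.
% All folder and file names here are hypothetical.
root = tempname;
sub  = fullfile(root, 'nested');
mkdir(sub);                         % also creates the parent folder

% Write one Parquet file at the top level and one in the subfolder.
T = table((1:4)', 'VariableNames', {'Id'});
parquetwrite(fullfile(root, 'top.parquet'), T);
parquetwrite(fullfile(sub,  'deep.parquet'), T);

% Default behavior: only files directly inside root are included.
pdsFlat = parquetDatastore(root);

% With the flag set, files in subfolders are included as well.
pdsDeep = parquetDatastore(root, 'IncludeSubfolders', true);

nFlat = numel(pdsFlat.Files);
nDeep = numel(pdsDeep.Files);
```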
'OutputType' — Output datatype
'auto' (default) | 'table' | 'timetable'
Output datatype, specified as the comma-separated pair consisting of 'OutputType' and one of these values: 'auto', 'table', or 'timetable'.
The value of OutputType determines the data type returned by the preview, read, and readall functions. Use this option in conjunction with the 'RowTimes' name-value pair to return timetables from ParquetDatastore.
Example: 'OutputType','timetable'
Data Types: char | string
'VariableNamingRule' — Flag to preserve variable names
'modify' (default) | 'preserve'
Flag to preserve variable names, specified as the comma-separated pair consisting of 'VariableNamingRule' and either 'modify' or 'preserve'.
'preserve' — Preserve variable names that are not valid MATLAB® identifiers such as variable names that include spaces and non-ASCII characters.
'modify' — Convert invalid variable names (as determined by the isvarname function) to valid MATLAB identifiers.
Starting in R2019b, variable names and row names can include any characters, including spaces and non-ASCII characters. Also, they can start with any characters, not just letters. Variable and row names do not have to be valid MATLAB identifiers (as determined by the isvarname function). To preserve these variable names and row names, set the value of VariableNamingRule to 'preserve'.
Data Types: char | string
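The difference between the two rules can be sketched with a file whose variable name contains a space. The example assumes a release that supports 'VariableNamingRule' and table variable names that are not valid identifiers. The variable name 'Loss MW', the scratch folder, and the file name loss.parquet are made up for illustration. The exact name that 'modify' produces is not shown because it depends on the conversion that MATLAB applies.

```matlab
% Write a table whose second variable name contains a space.
% The folder, file name, and variable names are hypothetical.
folder = tempname;
mkdir(folder);
T = table([1; 2; 3], [0.5; 1.5; 2.5], 'VariableNames', {'Id','Loss MW'});
parquetwrite(fullfile(folder, 'loss.parquet'), T);

% Default rule ('modify'): invalid names are converted to valid identifiers.
pdsModify = parquetDatastore(folder);

% 'preserve': names are kept exactly as stored in the file.
pdsPreserve = parquetDatastore(folder, 'VariableNamingRule', 'preserve');

disp(pdsModify.VariableNames)
disp(pdsPreserve.VariableNames)
```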
'AlternateFileSystemRoots' — Alternate file system root paths
Alternate file system root paths, specified as the comma-separated pair consisting of 'AlternateFileSystemRoots' and a string vector or a cell array. Use 'AlternateFileSystemRoots' when you create a datastore on a local machine, but need to access and process the data on another machine (possibly of a different operating system). Also, when processing data using the Parallel Computing Toolbox™ and the MATLAB Parallel Server™, and the data is stored on your local machines with a copy of the data available on different platform cloud or cluster machines, you must use 'AlternateFileSystemRoots' to associate the root paths.
To associate a set of root paths that are equivalent to one another, specify 'AlternateFileSystemRoots' as a string vector. For example,
["Z:\datasets","/mynetwork/datasets"]
To associate multiple sets of root paths that are equivalent for the datastore, specify 'AlternateFileSystemRoots' as a cell array containing multiple rows where each row represents a set of equivalent root paths. Specify each row in the cell array as either a string vector or a cell array of character vectors. For example:
Specify 'AlternateFileSystemRoots' as a cell array of string vectors.
{["Z:\datasets", "/mynetwork/datasets"];...
 ["Y:\datasets", "/mynetwork2/datasets","S:\datasets"]}
Alternatively, specify 'AlternateFileSystemRoots' as a cell array of cell array of character vectors.
{{'Z:\datasets','/mynetwork/datasets'};...
 {'Y:\datasets', '/mynetwork2/datasets','S:\datasets'}}
The value of 'AlternateFileSystemRoots' must satisfy these conditions:
Contains one or more rows, where each row specifies a set of equivalent root paths.
Each row specifies multiple root paths and each root path must contain at least two characters.
Root paths are unique and are not subfolders of one another.
Contains at least one root path entry that points to the location of the files.
For more information, see Set Up Datastore for Processing on Different Machines or Clusters.
Example: ["Z:\datasets","/mynetwork/datasets"]
Data Types: string | cell
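As a rough sketch of the single-set form, the following code associates a local scratch folder with a second root. The root "/mynetwork/datasets" is a hypothetical path standing in for the location of the same data on a cluster machine. It is not expected to exist on the machine that creates the datastore, because only one root has to point to the files.

```matlab
% Create a local folder that holds the data.
% The folder and file names are hypothetical.
localRoot = tempname;
mkdir(localRoot);
T = table((1:3)', 'VariableNames', {'Id'});
parquetwrite(fullfile(localRoot, 'part1.parquet'), T);

% One set of equivalent roots: the local folder and a hypothetical
% root where a copy of the data would live on another machine.
altRoots = [string(localRoot), "/mynetwork/datasets"];

pds = parquetDatastore(localRoot, 'AlternateFileSystemRoots', altRoots);
```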
ParquetDatastore properties describe the format of the files in a datastore object, and control how the data is read from the datastore. With the exception of the Files property, you can specify the value of ParquetDatastore properties using name-value pair arguments when you create the datastore object. To view or modify a property after creating the object, use the dot notation.
Files — Files included in datastore
Files included in the datastore, resolved as a cell array of character vectors or a string array, where each character vector or string is a full path to a file. The location argument defines these files.
The first file specified in the cell array determines the variable names and format information for all files in the datastore.
Example: {'C:\dir\data\file1.ext';'C:\dir\data\file2.ext'}
Data Types: cell | string
Folders — Folders used to construct datastore
This property is read-only.
Folders used to construct datastore, returned as a cell array of character vectors. The cell array is oriented as a column vector. Each character vector is a path to a folder that contains data files. The location argument in the parquetDatastore and datastore functions defines Folders when the datastore is created.
The Folders property is reset when you modify the Files property of a ParquetDatastore object.
Data Types: cell
ReadSize — Amount of data to read
'rowgroup' (default) | 'file' | positive integer
Amount of data to read in a call to the read function, specified as 'rowgroup', 'file', or a positive integer.
'rowgroup' — Each call to read reads the number of rows specified in the row groups of the Parquet file. To get the number of rows in row groups, see the RowGroupHeights property of the ParquetInfo object.
'file' — Each call to read reads all of the data in one file.
positive integer — Each call to read reads a maximum of ReadSize rows.
When you change ReadSize from a positive integer to 'file' or 'rowgroup', or vice versa, MATLAB resets the datastore to an unread state where no data has been read from it.
Data Types: double | char | string
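A common pattern with an integer ReadSize is to process the datastore in chunks until no data remains. The following sketch assumes a single hypothetical file, data.parquet, with 25 rows written to a scratch folder, and only counts the rows that each call to read returns.

```matlab
% Write a 25-row table to a scratch folder.
% The folder and file names are hypothetical.
folder = tempname;
mkdir(folder);
T = table((1:25)', rand(25,1), 'VariableNames', {'Id','Value'});
parquetwrite(fullfile(folder, 'data.parquet'), T);

% Each call to read returns at most 10 rows.
pds = parquetDatastore(folder, 'ReadSize', 10);

totalRows = 0;
while hasdata(pds)
    chunk = read(pds);                      % table with up to 10 rows
    totalRows = totalRows + height(chunk);  % accumulate the row count
end
```

Replacing the row count with a per-chunk computation keeps memory use bounded by the chunk size rather than by the size of the collection.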
VariableNames — Names of variables
Names of variables in the datastore, specified as a character vector, cell array of character vectors, string scalar, or string array. Specify the variable names in the order in which they appear in the files. If you do not specify the variable names, the datastore detects them from the first nonheader line in the first file. You can specify VariableNames with a character vector or string scalar. However, the datastore converts the property value and stores it as a cell array of character vectors. When modifying the VariableNames property, the number of new variable names must match the number of original variable names.
To support invalid MATLAB identifiers as variable names, such as variable names containing spaces and non-ASCII characters, set the value of the VariableNamingRule parameter to 'preserve'.
If ReadVariableNames is false, then VariableNames defaults to {'Var1','Var2', ...}.
Example: {'Time','Date','Quantity'}
Data Types: char | cell | string
SelectedVariableNames — Variables to read
Variables to read from the file, specified as a cell array of character vectors or a string array, where each character vector or string contains the name of one variable. You can specify the variable names in any order.
To support invalid MATLAB identifiers as variable names, such as variable names containing spaces and non-ASCII characters, set the value of the VariableNamingRule parameter to 'preserve'.
Example: {'Var3','Var7','Var4'}
Data Types: cell | string
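Restricting SelectedVariableNames limits what read, readall, and preview return. The following sketch assumes a small hypothetical file, events.parquet, with three variables and selects two of them using dot notation.

```matlab
% Write a table with three variables to a scratch folder.
% The folder, file name, and variable names are hypothetical.
folder = tempname;
mkdir(folder);
T = table((1:4)', [1.5; 2.5; 3.5; 4.5], ["n"; "s"; "e"; "w"], ...
    'VariableNames', {'Id','Loss','Region'});
parquetwrite(fullfile(folder, 'events.parquet'), T);

pds = parquetDatastore(folder);

% Keep only two of the three variables for subsequent reads.
pds.SelectedVariableNames = {'Region','Loss'};

subset = readall(pds);
```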
RowTimes — Name of row times variable
Name of row times variable, specified as the comma-separated pair consisting of 'RowTimes' and a variable name (such as "Date") or a variable index (such as 3).
RowTimes is a timetable-related parameter. Each row of a timetable is associated with a time, which is captured in a time vector for the timetable. The variable specified in RowTimes must contain a datetime or a duration vector.
If the value of 'OutputType' is 'timetable', but you do not specify 'RowTimes', then ParquetDatastore uses the first datetime or duration variable as the row times for the timetable.
SupportedOutputFormats — List of formats supported for writing
This property is read-only.
List of formats supported for writing, returned as a row vector of strings. This property specifies the possible output formats when using writeall to write output files from the datastore.
Data Types: string
DefaultOutputFormat — Default output format
This property is read-only.
Default output format, returned as a string scalar. This property specifies the default format when using writeall to write output files from the datastore.
Data Types: string
hasdata | Determine if data is available to read |
numpartitions | Number of datastore partitions |
partition | Partition a datastore |
preview | Preview subset of data in datastore |
read | Read data in datastore |
readall | Read all data in datastore |
writeall | Write datastore to files |
reset | Reset datastore to initial state |
transform | Transform datastore |
combine | Combine data from multiple datastores |
isPartitionable | Determine whether datastore is partitionable |
isShuffleable | Determine whether datastore is shuffleable |
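Several of these object functions are typically used together. The following sketch assumes three small hypothetical files in a scratch folder and shows read, reset, numpartitions, partition, and readall on the same datastore. The number of partitions depends on the files and is not assumed here.

```matlab
% Write three small Parquet files to a scratch folder.
% The folder and file names are hypothetical.
folder = tempname;
mkdir(folder);
for k = 1:3
    T = table((1:5)' + 5*(k-1), 'VariableNames', {'Id'});
    parquetwrite(fullfile(folder, sprintf('part%d.parquet', k)), T);
end

pds = parquetDatastore(folder);

% read advances the read position; reset returns to the unread state.
first = read(pds);
reset(pds);
again = read(pds);          % same data as the first call

% Split the datastore and read only the first partition.
n = numpartitions(pds);
part1 = partition(pds, n, 1);
rowsInPart1 = height(readall(part1));
```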
Create a ParquetDatastore object containing the file outages.parquet.
pds = parquetDatastore('outages.parquet')
pds = 
  ParquetDatastore with properties:

                       Files: {
                              ' .../devel/bat/BR2020bd/build/matlab/toolbox/matlab/demos/outages.parquet'
                              }
                     Folders: {
                              '/mathworks/devel/bat/BR2020bd/build/matlab/toolbox/matlab/demos'
                              }
               VariableNames: {1x6 cell}
       SelectedVariableNames: {1x6 cell}
                    ReadSize: 'rowgroup'
                  OutputType: 'table'
                    RowTimes: []
    AlternateFileSystemRoots: {}
      SupportedOutputFormats: [1x6 string]
         DefaultOutputFormat: "parquet"
          VariableNamingRule: 'modify'
Create a datastore for a sample Parquet file, and then read data from the file with different ReadSize values.
Create a datastore for outages.parquet, set ReadSize to 10 rows, and then read from the datastore. The value of ReadSize determines how many rows of data are read from the datastore with each call to the read function.
pds = parquetDatastore('outages.parquet','ReadSize',10);
read(pds)
ans=10×6 table
Region OutageTime Loss Customers RestorationTime Cause
___________ ____________________ ______ __________ ____________________ _________________
"SouthWest" 01-Feb-2002 12:18:00 458.98 1.8202e+06 07-Feb-2002 16:50:00 "winter storm"
"SouthEast" 23-Jan-2003 00:49:00 530.14 2.1204e+05 NaT "winter storm"
"SouthEast" 07-Feb-2003 21:15:00 289.4 1.4294e+05 17-Feb-2003 08:14:00 "winter storm"
"West" 06-Apr-2004 05:44:00 434.81 3.4037e+05 06-Apr-2004 06:10:00 "equipment fault"
"MidWest" 16-Mar-2002 06:18:00 186.44 2.1275e+05 18-Mar-2002 23:23:00 "severe storm"
"West" 18-Jun-2003 02:49:00 0 0 18-Jun-2003 10:54:00 "attack"
"West" 20-Jun-2004 14:39:00 231.29 NaN 20-Jun-2004 19:16:00 "equipment fault"
"West" 06-Jun-2002 19:28:00 311.86 NaN 07-Jun-2002 00:51:00 "equipment fault"
"NorthEast" 16-Jul-2003 16:23:00 239.93 49434 17-Jul-2003 01:12:00 "fire"
"MidWest" 27-Sep-2004 11:09:00 286.72 66104 27-Sep-2004 16:37:00 "equipment fault"
Set the ReadSize property value to 'file' and read from the datastore. Every call to the read function reads all the data in one file. Because this datastore contains a single file, one call returns all the data in the datastore.
pds.ReadSize = 'file';
data = read(pds)
data=1468×6 table
Region OutageTime Loss Customers RestorationTime Cause
___________ ____________________ ______ __________ ____________________ _________________
"SouthWest" 01-Feb-2002 12:18:00 458.98 1.8202e+06 07-Feb-2002 16:50:00 "winter storm"
"SouthEast" 23-Jan-2003 00:49:00 530.14 2.1204e+05 NaT "winter storm"
"SouthEast" 07-Feb-2003 21:15:00 289.4 1.4294e+05 17-Feb-2003 08:14:00 "winter storm"
"West" 06-Apr-2004 05:44:00 434.81 3.4037e+05 06-Apr-2004 06:10:00 "equipment fault"
"MidWest" 16-Mar-2002 06:18:00 186.44 2.1275e+05 18-Mar-2002 23:23:00 "severe storm"
"West" 18-Jun-2003 02:49:00 0 0 18-Jun-2003 10:54:00 "attack"
"West" 20-Jun-2004 14:39:00 231.29 NaN 20-Jun-2004 19:16:00 "equipment fault"
"West" 06-Jun-2002 19:28:00 311.86 NaN 07-Jun-2002 00:51:00 "equipment fault"
"NorthEast" 16-Jul-2003 16:23:00 239.93 49434 17-Jul-2003 01:12:00 "fire"
"MidWest" 27-Sep-2004 11:09:00 286.72 66104 27-Sep-2004 16:37:00 "equipment fault"
"SouthEast" 05-Sep-2004 17:48:00 73.387 36073 05-Sep-2004 20:46:00 "equipment fault"
"West" 21-May-2004 21:45:00 159.99 NaN 22-May-2004 04:23:00 "equipment fault"
"SouthEast" 01-Sep-2002 18:22:00 95.917 36759 01-Sep-2002 19:12:00 "severe storm"
"SouthEast" 27-Sep-2003 07:32:00 NaN 3.5517e+05 04-Oct-2003 07:02:00 "severe storm"
"West" 12-Nov-2003 06:12:00 254.09 9.2429e+05 17-Nov-2003 02:04:00 "winter storm"
"NorthEast" 18-Sep-2004 05:54:00 0 0 NaT "equipment fault"
⋮
You also can set the value of the ReadSize property to 'rowgroup'. For more information, see the ReadSize property of the ParquetDatastore object reference page.
Use the OutputType and RowTimes name-value pairs to make ParquetDatastore return timetables instead of tables.
Create a datastore for airlinesmall.parquet. Specify the 'OutputType' name-value pair as 'timetable'.
pds = parquetDatastore('airlinesmall.parquet','OutputType','timetable');
preview(pds)
ans=8×26 timetable
Date DayOfWeek DepTime CRSDepTime ArrTime CRSArrTime UniqueCarrier FlightNum TailNum ActualElapsedTime CRSElapsedTime AirTime ArrDelay DepDelay Origin Dest Distance TaxiIn TaxiOut Cancelled CancellationCode Diverted CarrierDelay WeatherDelay NASDelay SecurityDelay LateAircraftDelay
___________ _________ ____________________ ____________________ ____________________ ____________________ _____________ _________ _______ _________________ ______________ _______ ________ ________ ______ _____ ________ _______ _______ _________ ________________ ________ ____________ ____________ ________ _____________ _________________
21-Oct-1987 3 21-Oct-1987 06:42:00 21-Oct-1987 06:30:00 21-Oct-1987 07:35:00 21-Oct-1987 07:27:00 "PS" 1503 "NA" 3180 sec 3420 sec NaN sec 480 sec 720 sec "LAX" "SJC" 308 NaN sec NaN sec false "NA" false NaN sec NaN sec NaN sec NaN sec NaN sec
26-Oct-1987 1 26-Oct-1987 10:21:00 26-Oct-1987 10:20:00 26-Oct-1987 11:24:00 26-Oct-1987 11:16:00 "PS" 1550 "NA" 3780 sec 3360 sec NaN sec 480 sec 60 sec "SJC" "BUR" 296 NaN sec NaN sec false "NA" false NaN sec NaN sec NaN sec NaN sec NaN sec
23-Oct-1987 5 23-Oct-1987 20:55:00 23-Oct-1987 20:35:00 23-Oct-1987 22:18:00 23-Oct-1987 21:57:00 "PS" 1589 "NA" 4980 sec 4920 sec NaN sec 1260 sec 1200 sec "SAN" "SMF" 480 NaN sec NaN sec false "NA" false NaN sec NaN sec NaN sec NaN sec NaN sec
23-Oct-1987 5 23-Oct-1987 13:32:00 23-Oct-1987 13:20:00 23-Oct-1987 14:31:00 23-Oct-1987 14:18:00 "PS" 1655 "NA" 3540 sec 3480 sec NaN sec 780 sec 720 sec "BUR" "SJC" 296 NaN sec NaN sec false "NA" false NaN sec NaN sec NaN sec NaN sec NaN sec
22-Oct-1987 4 22-Oct-1987 06:29:00 22-Oct-1987 06:30:00 22-Oct-1987 07:46:00 22-Oct-1987 07:42:00 "PS" 1702 "NA" 4620 sec 4320 sec NaN sec 240 sec -60 sec "SMF" "LAX" 373 NaN sec NaN sec false "NA" false NaN sec NaN sec NaN sec NaN sec NaN sec
28-Oct-1987 3 28-Oct-1987 14:46:00 28-Oct-1987 13:43:00 28-Oct-1987 15:47:00 28-Oct-1987 14:48:00 "PS" 1729 "NA" 3660 sec 3900 sec NaN sec 3540 sec 3780 sec "LAX" "SJC" 308 NaN sec NaN sec false "NA" false NaN sec NaN sec NaN sec NaN sec NaN sec
08-Oct-1987 4 08-Oct-1987 09:28:00 08-Oct-1987 09:30:00 08-Oct-1987 10:52:00 08-Oct-1987 10:49:00 "PS" 1763 "NA" 5040 sec 4740 sec NaN sec 180 sec -120 sec "SAN" "SFO" 447 NaN sec NaN sec false "NA" false NaN sec NaN sec NaN sec NaN sec NaN sec
10-Oct-1987 6 10-Oct-1987 08:59:00 10-Oct-1987 09:00:00 10-Oct-1987 11:34:00 10-Oct-1987 11:23:00 "PS" 1800 "NA" 9300 sec 8580 sec NaN sec 660 sec -60 sec "SEA" "LAX" 954 NaN sec NaN sec false "NA" false NaN sec NaN sec NaN sec NaN sec NaN sec
When you do not also specify 'RowTimes', parquetDatastore uses the first datetime or duration variable as the row times. In this case, the Date variable is used for the row times.
Specify the 'RowTimes' option to use the arrival times (ArrTime) as the row times, instead of the flight dates.
pds = parquetDatastore('airlinesmall.parquet','OutputType','timetable','RowTimes','ArrTime');
preview(pds)
ans=8×26 timetable
ArrTime Date DayOfWeek DepTime CRSDepTime CRSArrTime UniqueCarrier FlightNum TailNum ActualElapsedTime CRSElapsedTime AirTime ArrDelay DepDelay Origin Dest Distance TaxiIn TaxiOut Cancelled CancellationCode Diverted CarrierDelay WeatherDelay NASDelay SecurityDelay LateAircraftDelay
____________________ ___________ _________ ____________________ ____________________ ____________________ _____________ _________ _______ _________________ ______________ _______ ________ ________ ______ _____ ________ _______ _______ _________ ________________ ________ ____________ ____________ ________ _____________ _________________
21-Oct-1987 07:35:00 21-Oct-1987 3 21-Oct-1987 06:42:00 21-Oct-1987 06:30:00 21-Oct-1987 07:27:00 "PS" 1503 "NA" 3180 sec 3420 sec NaN sec 480 sec 720 sec "LAX" "SJC" 308 NaN sec NaN sec false "NA" false NaN sec NaN sec NaN sec NaN sec NaN sec
26-Oct-1987 11:24:00 26-Oct-1987 1 26-Oct-1987 10:21:00 26-Oct-1987 10:20:00 26-Oct-1987 11:16:00 "PS" 1550 "NA" 3780 sec 3360 sec NaN sec 480 sec 60 sec "SJC" "BUR" 296 NaN sec NaN sec false "NA" false NaN sec NaN sec NaN sec NaN sec NaN sec
23-Oct-1987 22:18:00 23-Oct-1987 5 23-Oct-1987 20:55:00 23-Oct-1987 20:35:00 23-Oct-1987 21:57:00 "PS" 1589 "NA" 4980 sec 4920 sec NaN sec 1260 sec 1200 sec "SAN" "SMF" 480 NaN sec NaN sec false "NA" false NaN sec NaN sec NaN sec NaN sec NaN sec
23-Oct-1987 14:31:00 23-Oct-1987 5 23-Oct-1987 13:32:00 23-Oct-1987 13:20:00 23-Oct-1987 14:18:00 "PS" 1655 "NA" 3540 sec 3480 sec NaN sec 780 sec 720 sec "BUR" "SJC" 296 NaN sec NaN sec false "NA" false NaN sec NaN sec NaN sec NaN sec NaN sec
22-Oct-1987 07:46:00 22-Oct-1987 4 22-Oct-1987 06:29:00 22-Oct-1987 06:30:00 22-Oct-1987 07:42:00 "PS" 1702 "NA" 4620 sec 4320 sec NaN sec 240 sec -60 sec "SMF" "LAX" 373 NaN sec NaN sec false "NA" false NaN sec NaN sec NaN sec NaN sec NaN sec
28-Oct-1987 15:47:00 28-Oct-1987 3 28-Oct-1987 14:46:00 28-Oct-1987 13:43:00 28-Oct-1987 14:48:00 "PS" 1729 "NA" 3660 sec 3900 sec NaN sec 3540 sec 3780 sec "LAX" "SJC" 308 NaN sec NaN sec false "NA" false NaN sec NaN sec NaN sec NaN sec NaN sec
08-Oct-1987 10:52:00 08-Oct-1987 4 08-Oct-1987 09:28:00 08-Oct-1987 09:30:00 08-Oct-1987 10:49:00 "PS" 1763 "NA" 5040 sec 4740 sec NaN sec 180 sec -120 sec "SAN" "SFO" 447 NaN sec NaN sec false "NA" false NaN sec NaN sec NaN sec NaN sec NaN sec
10-Oct-1987 11:34:00 10-Oct-1987 6 10-Oct-1987 08:59:00 10-Oct-1987 09:00:00 10-Oct-1987 11:23:00 "PS" 1800 "NA" 9300 sec 8580 sec NaN sec 660 sec -60 sec "SEA" "LAX" 954 NaN sec NaN sec false "NA" false NaN sec NaN sec NaN sec NaN sec NaN sec
If you use parquetread or parquetDatastore to read the files, then the result might not have the same format or contents as the original table. For more information, see Apache Parquet Data Type Mappings.
mapreduce | parquetinfo | parquetread | tall