parquetDatastore

Datastore for collection of Parquet files

Description

Use a ParquetDatastore object to manage a collection of Parquet files, where each individual Parquet file fits in memory, but the entire collection of files does not necessarily fit. You can create a ParquetDatastore object using the parquetDatastore function, specify its properties, and then import and process the data using object functions.

Creation

Description

pds = parquetDatastore(location) creates a datastore pds from the collection of Parquet files specified by location.

pds = parquetDatastore(location,Name,Value) specifies additional parameters and properties for pds using one or more name-value pair arguments.

Input Arguments

location

Files or folders included in the datastore, specified as a path or a DsFileSet object.

  • path — Specify the path as a character vector, cell array of character vectors, string scalar, or a string array, containing the location of files or folders that are local or remote.

    • Local files or folders — Specify location as a local path to files or folders. If the files are not in the current folder, then location must specify their full or relative paths. Files within subfolders of the specified folder are not automatically included in the datastore. You can use the wildcard character (*) when specifying the local path. This character specifies that the datastore includes all matching files or all files in the matching folders.

    • Remote files or folders — Specify location to be the full paths of the files or folders as a uniform resource locator (URL) of the form hdfs:///path_to_file. For more information, see Work with Remote Data.

  • DsFileSet object — You also can specify location as a DsFileSet object. For more information, see matlab.io.datastore.DsFileSet.

When location represents a folder, the datastore includes only supported file formats and ignores any other format. To specify a custom list of file extensions to include in your datastore, see the FileExtensions property.

The parquetDatastore function supports the .parquet file format.

Example: 'myfile.parquet'

Example: '../dir/data/myfile.parquet'

Example: {'C:\dir\data\myfile01.parquet','C:\dir\data\myfile02.parquet'}

Example: 's3://bucketname/path_to_files/*.parquet'
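
As a minimal sketch of the location forms described above, the following code writes two small Parquet files to a temporary folder and then creates datastores from the folder path and from a wildcard pattern. The folder, file names, and table contents are hypothetical and exist only for illustration; parquetwrite is used here just to produce sample files.

```matlab
% Create a scratch folder with two small Parquet files (hypothetical sample data).
folder = tempname;
mkdir(folder);
T = table((1:4)', ["a";"b";"c";"d"], 'VariableNames', {'ID','Label'});
parquetwrite(fullfile(folder,'part1.parquet'), T(1:2,:));
parquetwrite(fullfile(folder,'part2.parquet'), T(3:4,:));

% location given as a folder: every supported file in the folder is included.
pdsFolder = parquetDatastore(folder);

% location given with a wildcard: every matching file is included.
pdsWild = parquetDatastore(fullfile(folder,'*.parquet'));

% Read the whole collection into one table.
allData = readall(pdsFolder);
```

Both datastores should list the same two files in their Files property.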

Name-Value Pair Arguments

Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside quotes. You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.

Example: 'IncludeSubfolders',true

'FileExtensions'

Extensions to include in datastore, specified as the comma-separated pair consisting of 'FileExtensions' and a character vector, cell array of character vectors, string scalar, or string array.

  • If you do not specify 'FileExtensions', then parquetDatastore automatically includes all files with .parquet and .parq extensions in the specified path.

  • If you want to include parquet files with non-standard file extensions in the parquetDatastore, then specify those extensions explicitly.

  • If you want to create a parquetDatastore for files without any extensions, then specify 'FileExtensions' as an empty character vector, ''.

Example: 'FileExtensions',{'.parquet','.parq'}

Example: 'FileExtensions','.myformat'

Example: 'FileExtensions',''

Data Types: char | cell | string
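
The following sketch illustrates the nonstandard-extension case. It assumes a hypothetical extension, .myformat, and produces the sample file by writing a regular Parquet file and renaming it; none of these names come from a real data set.

```matlab
% Write a sample Parquet file, then rename it to use a nonstandard extension.
folder = tempname;
mkdir(folder);
T = table([10;20;30], 'VariableNames', {'Value'});
parquetwrite(fullfile(folder,'data.parquet'), T);
movefile(fullfile(folder,'data.parquet'), fullfile(folder,'data.myformat'));

% By default only .parquet and .parq files are included, so name the
% custom extension explicitly.
pds = parquetDatastore(folder, 'FileExtensions', '.myformat');
data = readall(pds);
```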

'IncludeSubfolders'

Subfolder inclusion flag, specified as the comma-separated pair consisting of 'IncludeSubfolders' and true or false. Specify true to include all files and subfolders within each folder or false to include only the files within each folder.

If you do not specify 'IncludeSubfolders', then the default value is false.

Example: 'IncludeSubfolders',true

Data Types: logical | double
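
To see the effect of the flag, the sketch below builds a hypothetical folder that contains one Parquet file at the top level and one in a subfolder, and then compares the Files property with and without 'IncludeSubfolders'. The folder layout and file names are assumptions made for illustration.

```matlab
% Build a folder with one top-level file and one file in a subfolder.
folder = tempname;
sub = fullfile(folder,'archive');
mkdir(folder);
mkdir(sub);
T = table([1;2], 'VariableNames', {'X'});
parquetwrite(fullfile(folder,'top.parquet'), T);
parquetwrite(fullfile(sub,'nested.parquet'), T);

% Default behavior ('IncludeSubfolders' is false): top-level files only.
pdsTop = parquetDatastore(folder);

% Include files found in subfolders as well.
pdsAll = parquetDatastore(folder, 'IncludeSubfolders', true);

disp(numel(pdsTop.Files))
disp(numel(pdsAll.Files))
```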

'OutputType'

Output data type, specified as the comma-separated pair consisting of 'OutputType' and one of these values:

  • 'auto' — Detects if the output from the datastore should be a table or a timetable based on whether you specify the 'RowTimes' name-value pair. If you specify 'RowTimes' then the output is a timetable; otherwise, the output is a table.

  • 'table' — Return a table.

  • 'timetable' — Return a timetable.

The value of OutputType determines the data type returned by the preview, read, and readall functions. Use this option in conjunction with the 'RowTimes' name-value pair to return timetables from ParquetDatastore.

Example: 'OutputType','timetable'

Data Types: char | string

'VariableNamingRule'

Flag to preserve variable names, specified as the comma-separated pair consisting of 'VariableNamingRule' and either 'preserve' or 'modify'.

  • 'preserve' — Preserve variable names that are not valid MATLAB® identifiers such as variable names that include spaces and non-ASCII characters.

  • 'modify' — Convert invalid variable names (as determined by the isvarname function) to valid MATLAB identifiers.

Starting in R2019b, variable names and row names can include any characters, including spaces and non-ASCII characters. Also, they can start with any characters, not just letters. Variable and row names do not have to be valid MATLAB identifiers (as determined by the isvarname function). To preserve these variable names and row names, set the value of VariableNamingRule to 'preserve'.

Data Types: char | string
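
The difference between the two rules can be sketched as follows. The example assumes a release that accepts arbitrary table variable names (R2019b or later) and uses hypothetical column names that contain spaces; the exact names produced by 'modify' are not asserted here, only that they are valid identifiers.

```matlab
% Write a sample file whose column names are not valid MATLAB identifiers.
folder = tempname;
mkdir(folder);
T = table(["East";"West"], [12.5;30], ...
    'VariableNames', {'Region Name','Loss MW'});
file = fullfile(folder,'names.parquet');
parquetwrite(file, T);

% 'preserve' keeps the original names, including the spaces.
pdsPreserve = parquetDatastore(file, 'VariableNamingRule', 'preserve');

% 'modify' converts the names to valid MATLAB identifiers.
pdsModify = parquetDatastore(file, 'VariableNamingRule', 'modify');

disp(pdsPreserve.VariableNames)
disp(pdsModify.VariableNames)
```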

'AlternateFileSystemRoots'

Alternate file system root paths, specified as the comma-separated pair consisting of 'AlternateFileSystemRoots' and a string vector or a cell array. Use 'AlternateFileSystemRoots' when you create a datastore on a local machine, but need to access and process the data on another machine (possibly of a different operating system). Also, when you process data using Parallel Computing Toolbox™ and MATLAB Parallel Server™, and the data is stored on your local machines with a copy available on cloud or cluster machines that run a different platform, you must use 'AlternateFileSystemRoots' to associate the root paths.

  • To associate a set of root paths that are equivalent to one another, specify 'AlternateFileSystemRoots' as a string vector. For example,

    ["Z:\datasets","/mynetwork/datasets"]

  • To associate multiple sets of root paths that are equivalent for the datastore, specify 'AlternateFileSystemRoots' as a cell array containing multiple rows where each row represents a set of equivalent root paths. Specify each row in the cell array as either a string vector or a cell array of character vectors. For example:

    • Specify 'AlternateFileSystemRoots' as a cell array of string vectors.

      {["Z:\datasets", "/mynetwork/datasets"];...
       ["Y:\datasets", "/mynetwork2/datasets","S:\datasets"]}

    • Alternatively, specify 'AlternateFileSystemRoots' as a cell array of cell array of character vectors.

      {{'Z:\datasets','/mynetwork/datasets'};...
       {'Y:\datasets', '/mynetwork2/datasets','S:\datasets'}}

The value of 'AlternateFileSystemRoots' must satisfy these conditions:

  • Contains one or more rows, where each row specifies a set of equivalent root paths.

  • Each row specifies multiple root paths and each root path must contain at least two characters.

  • Root paths are unique and are not subfolders of one another.

  • Contains at least one root path entry that points to the location of the files.

For more information, see Set Up Datastore for Processing on Different Machines or Clusters.

Example: ["Z:\datasets","/mynetwork/datasets"]

Data Types: string | cell

Properties

ParquetDatastore properties describe the format of the files in a datastore object, and control how the data is read from the datastore. With the exception of the Files property, you can specify the value of ParquetDatastore properties using name-value pair arguments when you create the datastore object. To view or modify a property after creating the object, use the dot notation.

Files

Files included in the datastore, resolved as a cell array of character vectors or a string array, where each character vector or string is a full path to a file. The location argument defines these files.

The first file specified in the cell array determines the variable names and format information for all files in the datastore.

Example: {'C:\dir\data\file1.ext';'C:\dir\data\file2.ext'}

Data Types: cell | string

Folders

This property is read-only.

Folders used to construct datastore, returned as a cell array of character vectors. The cell array is oriented as a column vector. Each character vector is a path to a folder that contains data files. The location argument in the parquetDatastore and datastore functions defines Folders when the datastore is created.

The Folders property is reset when you modify the Files property of a ParquetDatastore object.

Data Types: cell

ReadSize

Amount of data to read in a call to the read function, specified as 'rowgroup', 'file', or a positive integer.

  • 'rowgroup' — Each call to read reads the number of rows specified in the row groups of the Parquet file. To get the number of rows in row groups, see the RowGroupHeights property of the ParquetInfo object.

  • 'file' — Each call to read reads all of the data in one file.

  • positive integer — Each call to read reads a maximum of ReadSize rows.

When you change ReadSize from a positive integer to 'file' or 'rowgroup', or vice versa, MATLAB resets the datastore to an unread state where no data has been read from it.

Data Types: double | char | string
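
A minimal sketch of how ReadSize shapes a read loop is shown below. It writes a hypothetical 10-row file, reads it in chunks of at most 4 rows, and then switches to 'file' to read everything in one call. The file name and data are assumptions for illustration only.

```matlab
% Write a small sample file with 10 rows.
folder = tempname;
mkdir(folder);
T = table((1:10)', 'VariableNames', {'Row'});
file = fullfile(folder,'rows.parquet');
parquetwrite(file, T);

% Numeric ReadSize: each call to read returns at most 4 rows.
pds = parquetDatastore(file, 'ReadSize', 4);
chunkHeights = [];
while hasdata(pds)
    chunk = read(pds);
    chunkHeights(end+1) = height(chunk); %#ok<AGROW>
end

% Switching from a number to 'file' resets the datastore to an unread state,
% so the next read starts from the beginning and returns the whole file.
pds.ReadSize = 'file';
whole = read(pds);
```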

VariableNames

Names of variables in the datastore, specified as a character vector, cell array of character vectors, string scalar, or string array. Specify the variable names in the order in which they appear in the files. If you do not specify the variable names, the datastore detects them from the first file. You can specify VariableNames with a character vector or string scalar; however, the datastore converts the value to a cell array of character vectors and stores it in that form. When modifying the VariableNames property, the number of new variable names must match the number of original variable names.

To support invalid MATLAB identifiers as variable names, such as variable names containing spaces and non-ASCII characters, set the value of the VariableNamingRule parameter to 'preserve'.

If ReadVariableNames is false, then VariableNames defaults to {'Var1','Var2', ...}.

Example: {'Time','Date','Quantity'}

Data Types: char | cell | string

SelectedVariableNames

Variables to read from the file, specified as a cell array of character vectors or a string array, where each character vector or string contains the name of one variable. You can specify the variable names in any order.

To support invalid MATLAB identifiers as variable names, such as variable names containing spaces and non-ASCII characters, set the value of the VariableNamingRule parameter to 'preserve'.

Example: {'Var3','Var7','Var4'}

Data Types: cell | string
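
For illustration, the sketch below uses dot notation to restrict a datastore to a subset of variables after creation. The three-column table and its variable names are hypothetical sample data.

```matlab
% Write a sample file with three variables.
folder = tempname;
mkdir(folder);
T = table(["N";"S";"E"], [1.5;2.5;3.5], [100;200;300], ...
    'VariableNames', {'Region','Loss','Customers'});
file = fullfile(folder,'sample.parquet');
parquetwrite(file, T);

pds = parquetDatastore(file);

% Select two of the three variables; the order does not need to match the file.
pds.SelectedVariableNames = {'Customers','Region'};
subset = readall(pds);
```

Reading only the variables you need can reduce the amount of data imported from each file.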

RowTimes

Name of row times variable, specified as the comma-separated pair consisting of 'RowTimes' and a variable name (such as "Date") or a variable index (such as 3).

RowTimes is a timetable-related parameter. Each row of a timetable is associated with a time, which is captured in a time vector for the timetable. The variable specified in RowTimes must contain a datetime or a duration vector.

If the value of 'OutputType' is 'timetable', but you do not specify 'RowTimes', then ParquetDatastore uses the first datetime or duration variable as the row times for the timetable.

SupportedOutputFormats

This property is read-only.

List of formats supported for writing, returned as a row vector of strings. This property specifies the possible output formats when using writeall to write output files from the datastore.

Data Types: string

DefaultOutputFormat

This property is read-only.

Default output format, returned as a string scalar. This property specifies the default format when using writeall to write output files from the datastore.

Data Types: string
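
The two output-format properties can be inspected before calling writeall, as in the following sketch. It assumes a release in which writeall is available for this datastore, and it uses hypothetical input and output folders created under the temporary directory. The result is read back with 'IncludeSubfolders' set to true so that the check does not depend on the folder layout that writeall produces.

```matlab
% Create sample input files and an empty output folder.
inFolder = tempname;
outFolder = tempname;
mkdir(inFolder);
mkdir(outFolder);
T = table((1:6)', 'VariableNames', {'Row'});
parquetwrite(fullfile(inFolder,'a.parquet'), T(1:3,:));
parquetwrite(fullfile(inFolder,'b.parquet'), T(4:6,:));

pds = parquetDatastore(inFolder);

% Inspect the formats that writeall can produce and the default among them.
disp(pds.SupportedOutputFormats)
disp(pds.DefaultOutputFormat)

% Write the datastore contents using the default output format.
writeall(pds, outFolder);

% Read the written files back to confirm that all rows were written.
pdsOut = parquetDatastore(outFolder, 'IncludeSubfolders', true);
copied = readall(pdsOut);
```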

Object Functions

hasdata            Determine if data is available to read
numpartitions      Number of datastore partitions
partition          Partition a datastore
preview            Preview subset of data in datastore
read               Read data in datastore
readall            Read all data in datastore
writeall           Write datastore to files
reset              Reset datastore to initial state
transform          Transform datastore
combine            Combine data from multiple datastores
isPartitionable    Determine whether datastore is partitionable
isShuffleable      Determine whether datastore is shuffleable
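
As a rough sketch of how several of these functions fit together, the code below partitions a datastore, reads each partition, and then applies a row filter with transform. The two sample files, the variable names ID and Loss, and the filter condition are all hypothetical.

```matlab
% Write two small sample files to a scratch folder.
folder = tempname;
mkdir(folder);
T = table((1:8)', [0;5;0;7;1;0;2;9], 'VariableNames', {'ID','Loss'});
parquetwrite(fullfile(folder,'p1.parquet'), T(1:4,:));
parquetwrite(fullfile(folder,'p2.parquet'), T(5:8,:));
pds = parquetDatastore(folder);

% Split the datastore into its default number of partitions and read each one.
n = numpartitions(pds);
rowsPerPartition = zeros(1,n);
for k = 1:n
    part = partition(pds, n, k);
    rowsPerPartition(k) = height(readall(part));
end

% Apply a function to every block of data that the datastore reads.
tds = transform(pds, @(t) t(t.Loss > 0, :));
nonzero = readall(tds);

% Return to the start of the data and look at the first few rows.
reset(pds);
firstRows = preview(pds);
```

Together, the partitions cover every row exactly once, which is what makes partition useful for dividing work among parallel workers.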

Examples

Create a ParquetDatastore object containing the file outages.parquet.

pds = parquetDatastore('outages.parquet')
pds = 
  ParquetDatastore with properties:

                       Files: {
                              ' .../devel/bat/BR2020bd/build/matlab/toolbox/matlab/demos/outages.parquet'
                              }
                     Folders: {
                              '/mathworks/devel/bat/BR2020bd/build/matlab/toolbox/matlab/demos'
                              }
               VariableNames: {1x6 cell}
       SelectedVariableNames: {1x6 cell}
                    ReadSize: 'rowgroup'
                  OutputType: 'table'
                    RowTimes: []
    AlternateFileSystemRoots: {}
      SupportedOutputFormats: [1x6 string]
         DefaultOutputFormat: "parquet"
          VariableNamingRule: 'modify'

Create a datastore for a sample Parquet file, and then read data from the file with different ReadSize values.

Create a datastore for outages.parquet, set ReadSize to 10 rows, and then read from the datastore. The value of ReadSize determines the maximum number of rows read from the datastore with each call to the read function.

pds = parquetDatastore('outages.parquet','ReadSize',10);
read(pds)
ans=10×6 table
      Region            OutageTime          Loss     Customers       RestorationTime             Cause      
    ___________    ____________________    ______    __________    ____________________    _________________

    "SouthWest"    01-Feb-2002 12:18:00    458.98    1.8202e+06    07-Feb-2002 16:50:00    "winter storm"   
    "SouthEast"    23-Jan-2003 00:49:00    530.14    2.1204e+05                     NaT    "winter storm"   
    "SouthEast"    07-Feb-2003 21:15:00     289.4    1.4294e+05    17-Feb-2003 08:14:00    "winter storm"   
    "West"         06-Apr-2004 05:44:00    434.81    3.4037e+05    06-Apr-2004 06:10:00    "equipment fault"
    "MidWest"      16-Mar-2002 06:18:00    186.44    2.1275e+05    18-Mar-2002 23:23:00    "severe storm"   
    "West"         18-Jun-2003 02:49:00         0             0    18-Jun-2003 10:54:00    "attack"         
    "West"         20-Jun-2004 14:39:00    231.29           NaN    20-Jun-2004 19:16:00    "equipment fault"
    "West"         06-Jun-2002 19:28:00    311.86           NaN    07-Jun-2002 00:51:00    "equipment fault"
    "NorthEast"    16-Jul-2003 16:23:00    239.93         49434    17-Jul-2003 01:12:00    "fire"           
    "MidWest"      27-Sep-2004 11:09:00    286.72         66104    27-Sep-2004 16:37:00    "equipment fault"

Set the ReadSize property value to 'file' and read from the datastore. Each call to the read function now reads all the data in one file. Because this datastore contains a single file, one call returns all 1468 rows.

pds.ReadSize = 'file';
data = read(pds)
data=1468×6 table
      Region            OutageTime          Loss     Customers       RestorationTime             Cause      
    ___________    ____________________    ______    __________    ____________________    _________________

    "SouthWest"    01-Feb-2002 12:18:00    458.98    1.8202e+06    07-Feb-2002 16:50:00    "winter storm"   
    "SouthEast"    23-Jan-2003 00:49:00    530.14    2.1204e+05                     NaT    "winter storm"   
    "SouthEast"    07-Feb-2003 21:15:00     289.4    1.4294e+05    17-Feb-2003 08:14:00    "winter storm"   
    "West"         06-Apr-2004 05:44:00    434.81    3.4037e+05    06-Apr-2004 06:10:00    "equipment fault"
    "MidWest"      16-Mar-2002 06:18:00    186.44    2.1275e+05    18-Mar-2002 23:23:00    "severe storm"   
    "West"         18-Jun-2003 02:49:00         0             0    18-Jun-2003 10:54:00    "attack"         
    "West"         20-Jun-2004 14:39:00    231.29           NaN    20-Jun-2004 19:16:00    "equipment fault"
    "West"         06-Jun-2002 19:28:00    311.86           NaN    07-Jun-2002 00:51:00    "equipment fault"
    "NorthEast"    16-Jul-2003 16:23:00    239.93         49434    17-Jul-2003 01:12:00    "fire"           
    "MidWest"      27-Sep-2004 11:09:00    286.72         66104    27-Sep-2004 16:37:00    "equipment fault"
    "SouthEast"    05-Sep-2004 17:48:00    73.387         36073    05-Sep-2004 20:46:00    "equipment fault"
    "West"         21-May-2004 21:45:00    159.99           NaN    22-May-2004 04:23:00    "equipment fault"
    "SouthEast"    01-Sep-2002 18:22:00    95.917         36759    01-Sep-2002 19:12:00    "severe storm"   
    "SouthEast"    27-Sep-2003 07:32:00       NaN    3.5517e+05    04-Oct-2003 07:02:00    "severe storm"   
    "West"         12-Nov-2003 06:12:00    254.09    9.2429e+05    17-Nov-2003 02:04:00    "winter storm"   
    "NorthEast"    18-Sep-2004 05:54:00         0             0                     NaT    "equipment fault"
      ⋮

You can also set the value of the ReadSize property to 'rowgroup'. For more information, see the ReadSize property of the ParquetDatastore object reference page.

Use the OutputType and RowTimes name-value pairs to make ParquetDatastore return timetables instead of tables.

Create a datastore for airlinesmall.parquet. Specify the 'OutputType' name-value pair as 'timetable'.

pds = parquetDatastore('airlinesmall.parquet','OutputType','timetable');
preview(pds)
ans=8×26 timetable
       Date        DayOfWeek          DepTime                CRSDepTime               ArrTime                CRSArrTime         UniqueCarrier    FlightNum    TailNum    ActualElapsedTime    CRSElapsedTime    AirTime    ArrDelay    DepDelay    Origin    Dest     Distance    TaxiIn     TaxiOut    Cancelled    CancellationCode    Diverted    CarrierDelay    WeatherDelay    NASDelay    SecurityDelay    LateAircraftDelay
    ___________    _________    ____________________    ____________________    ____________________    ____________________    _____________    _________    _______    _________________    ______________    _______    ________    ________    ______    _____    ________    _______    _______    _________    ________________    ________    ____________    ____________    ________    _____________    _________________

    21-Oct-1987        3        21-Oct-1987 06:42:00    21-Oct-1987 06:30:00    21-Oct-1987 07:35:00    21-Oct-1987 07:27:00        "PS"           1503        "NA"          3180 sec            3420 sec       NaN sec     480 sec     720 sec    "LAX"     "SJC"      308       NaN sec    NaN sec      false            "NA"           false        NaN sec         NaN sec       NaN sec        NaN sec            NaN sec     
    26-Oct-1987        1        26-Oct-1987 10:21:00    26-Oct-1987 10:20:00    26-Oct-1987 11:24:00    26-Oct-1987 11:16:00        "PS"           1550        "NA"          3780 sec            3360 sec       NaN sec     480 sec      60 sec    "SJC"     "BUR"      296       NaN sec    NaN sec      false            "NA"           false        NaN sec         NaN sec       NaN sec        NaN sec            NaN sec     
    23-Oct-1987        5        23-Oct-1987 20:55:00    23-Oct-1987 20:35:00    23-Oct-1987 22:18:00    23-Oct-1987 21:57:00        "PS"           1589        "NA"          4980 sec            4920 sec       NaN sec    1260 sec    1200 sec    "SAN"     "SMF"      480       NaN sec    NaN sec      false            "NA"           false        NaN sec         NaN sec       NaN sec        NaN sec            NaN sec     
    23-Oct-1987        5        23-Oct-1987 13:32:00    23-Oct-1987 13:20:00    23-Oct-1987 14:31:00    23-Oct-1987 14:18:00        "PS"           1655        "NA"          3540 sec            3480 sec       NaN sec     780 sec     720 sec    "BUR"     "SJC"      296       NaN sec    NaN sec      false            "NA"           false        NaN sec         NaN sec       NaN sec        NaN sec            NaN sec     
    22-Oct-1987        4        22-Oct-1987 06:29:00    22-Oct-1987 06:30:00    22-Oct-1987 07:46:00    22-Oct-1987 07:42:00        "PS"           1702        "NA"          4620 sec            4320 sec       NaN sec     240 sec     -60 sec    "SMF"     "LAX"      373       NaN sec    NaN sec      false            "NA"           false        NaN sec         NaN sec       NaN sec        NaN sec            NaN sec     
    28-Oct-1987        3        28-Oct-1987 14:46:00    28-Oct-1987 13:43:00    28-Oct-1987 15:47:00    28-Oct-1987 14:48:00        "PS"           1729        "NA"          3660 sec            3900 sec       NaN sec    3540 sec    3780 sec    "LAX"     "SJC"      308       NaN sec    NaN sec      false            "NA"           false        NaN sec         NaN sec       NaN sec        NaN sec            NaN sec     
    08-Oct-1987        4        08-Oct-1987 09:28:00    08-Oct-1987 09:30:00    08-Oct-1987 10:52:00    08-Oct-1987 10:49:00        "PS"           1763        "NA"          5040 sec            4740 sec       NaN sec     180 sec    -120 sec    "SAN"     "SFO"      447       NaN sec    NaN sec      false            "NA"           false        NaN sec         NaN sec       NaN sec        NaN sec            NaN sec     
    10-Oct-1987        6        10-Oct-1987 08:59:00    10-Oct-1987 09:00:00    10-Oct-1987 11:34:00    10-Oct-1987 11:23:00        "PS"           1800        "NA"          9300 sec            8580 sec       NaN sec     660 sec     -60 sec    "SEA"     "LAX"      954       NaN sec    NaN sec      false            "NA"           false        NaN sec         NaN sec       NaN sec        NaN sec            NaN sec     

When you do not also specify 'RowTimes', parquetDatastore uses the first datetime or duration variable as the row times. In this case, the Date variable is used for the row times.

Specify the 'RowTimes' option to use the arrival times (ArrTime) as the row times, instead of the flight dates.

pds = parquetDatastore('airlinesmall.parquet','OutputType','timetable','RowTimes','ArrTime');
preview(pds)
ans=8×26 timetable
          ArrTime              Date        DayOfWeek          DepTime                CRSDepTime              CRSArrTime         UniqueCarrier    FlightNum    TailNum    ActualElapsedTime    CRSElapsedTime    AirTime    ArrDelay    DepDelay    Origin    Dest     Distance    TaxiIn     TaxiOut    Cancelled    CancellationCode    Diverted    CarrierDelay    WeatherDelay    NASDelay    SecurityDelay    LateAircraftDelay
    ____________________    ___________    _________    ____________________    ____________________    ____________________    _____________    _________    _______    _________________    ______________    _______    ________    ________    ______    _____    ________    _______    _______    _________    ________________    ________    ____________    ____________    ________    _____________    _________________

    21-Oct-1987 07:35:00    21-Oct-1987        3        21-Oct-1987 06:42:00    21-Oct-1987 06:30:00    21-Oct-1987 07:27:00        "PS"           1503        "NA"          3180 sec            3420 sec       NaN sec     480 sec     720 sec    "LAX"     "SJC"      308       NaN sec    NaN sec      false            "NA"           false        NaN sec         NaN sec       NaN sec        NaN sec            NaN sec     
    26-Oct-1987 11:24:00    26-Oct-1987        1        26-Oct-1987 10:21:00    26-Oct-1987 10:20:00    26-Oct-1987 11:16:00        "PS"           1550        "NA"          3780 sec            3360 sec       NaN sec     480 sec      60 sec    "SJC"     "BUR"      296       NaN sec    NaN sec      false            "NA"           false        NaN sec         NaN sec       NaN sec        NaN sec            NaN sec     
    23-Oct-1987 22:18:00    23-Oct-1987        5        23-Oct-1987 20:55:00    23-Oct-1987 20:35:00    23-Oct-1987 21:57:00        "PS"           1589        "NA"          4980 sec            4920 sec       NaN sec    1260 sec    1200 sec    "SAN"     "SMF"      480       NaN sec    NaN sec      false            "NA"           false        NaN sec         NaN sec       NaN sec        NaN sec            NaN sec     
    23-Oct-1987 14:31:00    23-Oct-1987        5        23-Oct-1987 13:32:00    23-Oct-1987 13:20:00    23-Oct-1987 14:18:00        "PS"           1655        "NA"          3540 sec            3480 sec       NaN sec     780 sec     720 sec    "BUR"     "SJC"      296       NaN sec    NaN sec      false            "NA"           false        NaN sec         NaN sec       NaN sec        NaN sec            NaN sec     
    22-Oct-1987 07:46:00    22-Oct-1987        4        22-Oct-1987 06:29:00    22-Oct-1987 06:30:00    22-Oct-1987 07:42:00        "PS"           1702        "NA"          4620 sec            4320 sec       NaN sec     240 sec     -60 sec    "SMF"     "LAX"      373       NaN sec    NaN sec      false            "NA"           false        NaN sec         NaN sec       NaN sec        NaN sec            NaN sec     
    28-Oct-1987 15:47:00    28-Oct-1987        3        28-Oct-1987 14:46:00    28-Oct-1987 13:43:00    28-Oct-1987 14:48:00        "PS"           1729        "NA"          3660 sec            3900 sec       NaN sec    3540 sec    3780 sec    "LAX"     "SJC"      308       NaN sec    NaN sec      false            "NA"           false        NaN sec         NaN sec       NaN sec        NaN sec            NaN sec     
    08-Oct-1987 10:52:00    08-Oct-1987        4        08-Oct-1987 09:28:00    08-Oct-1987 09:30:00    08-Oct-1987 10:49:00        "PS"           1763        "NA"          5040 sec            4740 sec       NaN sec     180 sec    -120 sec    "SAN"     "SFO"      447       NaN sec    NaN sec      false            "NA"           false        NaN sec         NaN sec       NaN sec        NaN sec            NaN sec     
    10-Oct-1987 11:34:00    10-Oct-1987        6        10-Oct-1987 08:59:00    10-Oct-1987 09:00:00    10-Oct-1987 11:23:00        "PS"           1800        "NA"          9300 sec            8580 sec       NaN sec     660 sec     -60 sec    "SEA"     "LAX"      954       NaN sec    NaN sec      false            "NA"           false        NaN sec         NaN sec       NaN sec        NaN sec            NaN sec     

Limitations

If you use parquetread or parquetDatastore to read the files, then the result might not have the same format or contents as the original table. For more information, see Apache Parquet Data Type Mappings.

Introduced in R2019a