parquetwrite

Write columnar data to Parquet file

Description

parquetwrite(filename,T) writes a table or timetable T to a Parquet 2.0 file with the filename specified in filename.

parquetwrite(filename,T,Name,Value) specifies additional options with one or more name-value pair arguments. For example, you can specify 'VariableCompression' to change the compression algorithm used, or 'Version' to write the data to a Parquet 1.0 file.

Examples

Write tabular data into a Parquet file and compare the size of the same tabular data in .csv and .parquet file formats.

Read the tabular data from the file outages.csv into a table.

T = readtable('outages.csv');

Write the data to a Parquet file. By default, the parquetwrite function uses the Snappy compression scheme. To use a different compression scheme, specify the 'VariableCompression' name-value pair argument.

parquetwrite('outagesDefault.parquet',T)

Get the file sizes and compute the size of the .parquet file as a percentage of the size of the same data in .csv format.

Get size of .csv file.

fcsv = dir(which('outages.csv'));
size_csv = fcsv.bytes
size_csv = 101040

Get size of .parquet file.

fparquet  = dir('outagesDefault.parquet');
size_parquet = fparquet.bytes
size_parquet = 44881

Compute the ratio.

sizeRatio = ( size_parquet/size_csv )*100 ;
disp(['Size Ratio = ', num2str(sizeRatio) '% of original size'])
Size Ratio = 44.419% of original size

Input Arguments

filename: Name of output Parquet file, specified as a character vector or string scalar.

Depending on the location you are writing to, filename can take on one of these forms.

Location

Form

Current folder

To write to the current folder, specify the name of the file in filename.

Example: 'myData.parquet'

Other folders

To write to a folder different from the current folder, specify the full or relative path name in filename.

Example: 'C:\myFolder\myData.parquet'

Example: 'dataDir\myData.parquet'

Remote Location

To write to a remote location, filename must contain the full path of the file specified as a uniform resource locator (URL) of the form:

scheme_name://path_to_file/myData.parquet

Based on your remote location, scheme_name can be one of the values in this table.

Remote Location                  scheme_name
Amazon S3™                       s3
Windows Azure® Blob Storage      wasb, wasbs
HDFS™                            hdfs

For more information, see Work with Remote Data.

Example: 's3://bucketname/path_to_file/myData.parquet'

Data Types: char | string
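As a minimal sketch of the "other folders" form described above, the following builds the output path with fullfile so that the file separator matches the current platform. The folder name parquetDemo and the small table are hypothetical and chosen only for illustration.

```matlab
% Hypothetical output folder under the system temporary directory.
outDir = fullfile(tempdir, 'parquetDemo');
if ~isfolder(outDir)
    mkdir(outDir);
end

% Small illustrative table (not part of the original example data).
T = table([1; 2; 3], ["a"; "b"; "c"], 'VariableNames', {'Id', 'Label'});

% fullfile inserts the correct file separator for the current platform.
filename = fullfile(outDir, 'myData.parquet');
parquetwrite(filename, T)

% Confirm that the file exists at the expected location.
fileWasWritten = isfile(filename);
```

Building the path programmatically avoids hard-coding backslashes, which only work on Windows.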

T: Input data, specified as a table or timetable.

Name-Value Pair Arguments

Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside quotes. You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.

Example: parquetwrite(filename,T,'VariableCompression','gzip','Version','1.0')

'VariableCompression': Compression scheme names, specified as one of these values:

  • 'snappy', 'brotli', 'gzip', or 'uncompressed'. If you specify one compression algorithm, parquetwrite compresses all variables using that algorithm.

  • Alternatively, you can specify a cell array of character vectors or a string vector containing the names of the compression algorithms to use for each variable.

In general, 'snappy' has better performance for reading and writing, 'gzip' has a higher compression ratio at the cost of more CPU processing time, and 'brotli' typically produces the smallest file size at the cost of compression speed.

Example: parquetwrite('myData.parquet', T, 'VariableCompression', 'brotli')

Example: parquetwrite('myData.parquet', T, 'VariableCompression', {'brotli' 'snappy' 'gzip'})
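To see how the trade-offs described above play out on a particular data set, one option is to write the same table once per scheme and compare the resulting file sizes. This is only a sketch: it assumes the outages.csv sample file used earlier is on the path, the output file names are hypothetical, and the sizes depend on the data and the MATLAB release.

```matlab
% Read the sample data used in the earlier example.
T = readtable('outages.csv');

% Compression schemes accepted by 'VariableCompression'.
schemes = {'snappy', 'gzip', 'brotli', 'uncompressed'};
sizes = zeros(numel(schemes), 1);

for k = 1:numel(schemes)
    % Hypothetical output name, one file per scheme, in the temp folder.
    filename = fullfile(tempdir, ['outages_' schemes{k} '.parquet']);
    parquetwrite(filename, T, 'VariableCompression', schemes{k});

    % Record the size of the file on disk, in bytes.
    fileInfo = dir(filename);
    sizes(k) = fileInfo.bytes;
end

% Summarize the results in a table for easy comparison.
results = table(schemes', sizes, 'VariableNames', {'Scheme', 'Bytes'})
```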

'VariableEncoding': Encoding scheme names, specified as one of these values:

  • 'auto': parquetwrite uses 'plain' encoding for logical variables and 'dictionary' encoding for all others.

  • 'dictionary', 'plain': If you specify one encoding scheme, parquetwrite encodes all variables with that scheme.

  • Alternatively, you can specify a cell array of character vectors or a string vector containing the names of the encoding scheme to use for each variable.

In general, 'dictionary' encoding results in smaller file sizes, but 'plain' encoding can be faster for variables that do not contain many repeated values. If the size of the dictionary or number of unique values grows to be too big, then the encoding automatically reverts to plain encoding. For more information on Parquet encodings, see Parquet encoding definitions.

Example: parquetwrite('myData.parquet', T, 'VariableEncoding', 'plain')

Example: parquetwrite('myData.parquet', T, 'VariableEncoding', {'plain' 'dictionary' 'plain'})
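When specifying one encoding per variable, the list needs one entry for each variable in the table. A minimal sketch of building that list programmatically follows. It assumes the outages.csv sample file is on the path, and the rule "plain for numeric variables, dictionary otherwise" is an illustrative choice, not a recommendation from this page.

```matlab
% Read the sample data used in the earlier example.
T = readtable('outages.csv');

% Start with dictionary encoding for every variable.
encodings = repmat({'dictionary'}, 1, width(T));

% Switch numeric variables to plain encoding (illustrative rule only).
isNum = varfun(@isnumeric, T, 'OutputFormat', 'uniform');
encodings(isNum) = {'plain'};

% Hypothetical output file in the temp folder.
filename = fullfile(tempdir, 'outagesEncoded.parquet');
parquetwrite(filename, T, 'VariableEncoding', encodings)

% Inspect the file metadata to see which encodings were recorded.
info = parquetinfo(filename);
info.VariableEncoding
```

Deriving the list from width(T) keeps the number of entries in step with the table if variables are added or removed later.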

'Version': Parquet version to use, specified as either '1.0' or '2.0'. The default, '2.0', offers the most efficient storage, but you can select '1.0' for the broadest compatibility with external applications that support the Parquet format.

Caution

Parquet version 1.0 has a limitation that it cannot round-trip variables of type uint32 (they are read back into MATLAB® as int64).
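If a uint32 variable must be written to a version 1.0 file, one possible workaround is to cast it back after reading. The sketch below uses a small hypothetical table and assumes the values still fit in uint32 when read back.

```matlab
% Small illustrative table with a uint32 variable (hypothetical data).
T = table(uint32([1; 2; 3]), 'VariableNames', {'Count'});

% Write using Parquet version 1.0.
filename = fullfile(tempdir, 'uint32Demo.parquet');
parquetwrite(filename, T, 'Version', '1.0')

% Read the data back and inspect the class of the variable.
% According to the caution above, this is not expected to be uint32.
R = parquetread(filename);
classAfterRead = class(R.Count)

% Restore the original type if it changed during the round trip.
if ~isa(R.Count, 'uint32')
    R.Count = uint32(R.Count);
end
```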

Limitations

In some cases, parquetwrite creates files that do not represent the original table or timetable T exactly. If you use parquetread or datastore to read the files, then the result might not have the same format or contents as the original table. For more information, see Apache Parquet Data Type Mappings.
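One way to check whether a particular table is affected is to write it, read it back, and compare the class of each variable before and after. This is a sketch only: it assumes the outages.csv sample file is on the path, and the output file name is hypothetical.

```matlab
% Read the sample data and write it to a Parquet file.
T = readtable('outages.csv');
filename = fullfile(tempdir, 'outagesRoundTrip.parquet');
parquetwrite(filename, T)

% Read the file back into a new table.
R = parquetread(filename);

% Collect the class of every variable before and after the round trip.
classBefore = varfun(@class, T, 'OutputFormat', 'cell');
classAfter  = varfun(@class, R, 'OutputFormat', 'cell');

% List the variables side by side so that any changes are easy to spot.
comparison = table(T.Properties.VariableNames', classBefore', classAfter', ...
    'VariableNames', {'Variable', 'Before', 'After'})
```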

Introduced in R2019a