In MATLAB®, you can read and write data to and from a remote location, such as cloud storage in Amazon S3™ (Simple Storage Service), Microsoft® Azure® Storage Blob, and Hadoop® Distributed File System (HDFS™).
You can access remote data using datastore objects. Use a datastore to examine part of your data from your desktop version of MATLAB. Then, after prototyping your code locally, you can scale up to a cluster or the cloud. Scaling up improves execution efficiency because large calculations run best in the same location as the data. To write data to a remote location, you can use the write function on a tall or distributed array.
MATLAB enables you to use Amazon S3, an online file storage web service offered by Amazon Web Services. You can use data stored on Amazon S3 with datastore objects such as ImageDatastore, FileDatastore, SpreadsheetDatastore, or TabularTextDatastore. When you specify the location of the data, you must specify the full path to the files or folders using a uniform resource locator (URL) of the form

s3://bucketname/path_to_file

where bucketname is the name of the bucket (the S3 container) and path_to_file is the path to the file or folders.
Amazon S3 provides data storage through web services interfaces. You can use a bucket as a container to store objects in Amazon S3.
To work with remote data in Amazon S3, you must set up access first:
1. Sign up for an Amazon Web Services (AWS) root account. See Amazon Web Services: Account.
2. Using your AWS root account, create an IAM (Identity and Access Management) user. See Creating an IAM User in Your AWS Account.
3. Generate an access key to receive an access key ID and a secret access key. See Managing Access Keys for IAM Users.
4. Configure your machine with the AWS access key ID, secret access key, and region using the AWS Command Line Interface tool from https://aws.amazon.com/cli/. Alternatively, set the environment variables directly by using setenv:

AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY — Authenticate and enable use of Amazon S3 services. (You generated this pair of access key variables in step 3.)

AWS_DEFAULT_REGION (optional) — Select the geographic region of your bucket. The value of this environment variable is typically determined automatically, but the bucket owner might require that you set it manually.

AWS_SESSION_TOKEN (optional) — Specify the session token if you are using temporary security credentials, such as with AWS® Federated Authentication.
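For example, you can set these variables directly in MATLAB. The values shown are placeholders; substitute the credentials and region for your own account, and set the optional variables only if you need them:

setenv('AWS_ACCESS_KEY_ID', 'YOUR_AWS_ACCESS_KEY_ID');         % required
setenv('AWS_SECRET_ACCESS_KEY', 'YOUR_AWS_SECRET_ACCESS_KEY'); % required
setenv('AWS_DEFAULT_REGION', 'us-east-1');                     % optional: region of your bucket
setenv('AWS_SESSION_TOKEN', 'YOUR_AWS_SESSION_TOKEN');         % optional: temporary credentials only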
If you are using Parallel Computing Toolbox™, you must ensure the cluster has been configured to access S3 services. You can copy your client environment variables to the workers on a cluster by setting EnvironmentVariables in parpool, batch, createJob, or in the Cluster Profile Manager.
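For example, a minimal sketch of copying the AWS credential variables to the workers of a parallel pool, assuming a cluster profile named 'myCluster':

% Copy the client's AWS credentials to each worker in the pool
parpool('myCluster', 'EnvironmentVariables', ...
    {'AWS_ACCESS_KEY_ID', 'AWS_SECRET_ACCESS_KEY'});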
You can read data from Amazon S3 using datastore objects. For example, create an ImageDatastore, read a specified image from the datastore, and then display the image to the screen.
setenv('AWS_ACCESS_KEY_ID', 'YOUR_AWS_ACCESS_KEY_ID');
setenv('AWS_SECRET_ACCESS_KEY', 'YOUR_AWS_SECRET_ACCESS_KEY');

ds = imageDatastore('s3://bucketname/image_datastore/jpegfiles', ...
    'IncludeSubfolders', true, 'LabelSource', 'foldernames');
img = ds.readimage(1);
imshow(img)
To write data to Amazon S3, call the write
function on a distributed or tall array,
and provide the full path to a folder in the cloud storage. The following example shows
how to read tabular data from Amazon S3 into a tall array, preprocess it by removing missing entries and sorting,
and then write it back to Amazon S3.
setenv('AWS_ACCESS_KEY_ID', 'YOUR_AWS_ACCESS_KEY_ID');
setenv('AWS_SECRET_ACCESS_KEY', 'YOUR_AWS_SECRET_ACCESS_KEY');

ds = tabularTextDatastore('s3://bucketname/dataset/airlinesmall.csv', ...
    'TreatAsMissing', 'NA', 'SelectedVariableNames', {'ArrDelay'});
tt = tall(ds);
tt = sortrows(rmmissing(tt));
write('s3://bucketname/preprocessedData/', tt);
To read your tall data back, use the datastore
function.
ds = datastore('s3://bucketname/preprocessedData/');
tt = tall(ds);
MATLAB enables you to use Windows Azure® Storage Blob (WASB), an online file storage web service offered by Microsoft. You can use data stored on Azure with datastore objects such as ImageDatastore, FileDatastore, SpreadsheetDatastore, or TabularTextDatastore. When you specify the location of the data, you must specify the full path to the files or folders using a uniform resource locator (URL) of the form

wasbs://container@account/path_to_file/file.ext

where container is the name of the container, account is the name of the storage account, and path_to_file is the path to the file or folders.
Azure provides data storage through web services interfaces. You can use a blob to store data files in Azure. See Introduction to Azure for more information.
To work with remote data in Azure storage, you must set up access first:
1. Sign up for a Microsoft Azure account. See Microsoft Azure Account.
2. Set up your authentication details by setting exactly one of the following two environment variables using setenv:
MW_WASB_SAS_TOKEN — Authentication via Shared Access Signature (SAS)

Obtain an SAS. For details, see the "Get the SAS for a blob container" section in Manage Azure Blob Storage resources with Storage Explorer.

In MATLAB, set MW_WASB_SAS_TOKEN to the SAS query string. For example:

setenv MW_WASB_SAS_TOKEN '?st=2017-04-11T09%3A45%3A00Z&se=2017-05-12T09%3A45%3A00Z&sp=rl&sv=2015-12-11&sr=c&sig=E12eH4cRCLilp3Tw%2BArdYYR8RruMW45WBXhWpMzSRCE%3D'

You must set this string to a valid SAS token generated from the Azure Storage web UI or Explorer.
MW_WASB_SECRET_KEY — Authentication via one of the Account's two secret keys

Each Storage Account has two secret keys that permit administrative privilege access. This same access can be given to MATLAB without having to create an SAS token by setting the MW_WASB_SECRET_KEY environment variable. For example:

setenv MW_WASB_SECRET_KEY '1234567890ABCDEF1234567890ABCDEF1234567890ABCDEF'
If you are using Parallel Computing Toolbox, you must copy your client environment variables to the workers on a cluster by setting EnvironmentVariables in parpool, batch, createJob, or in the Cluster Profile Manager.
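For example, a minimal sketch of forwarding the SAS token to a batch job, assuming a script named processBlobData.m and a cluster profile named 'myCluster':

% Copy the client's Azure SAS token to the workers that run the batch job
job = batch('processBlobData', 'Profile', 'myCluster', ...
    'EnvironmentVariables', {'MW_WASB_SAS_TOKEN'});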
For more information, see Use Azure storage with Azure HDInsight clusters.
To read data from a WASB location, use a datastore object. To produce the file location, start with the filename file.ext, prefix it with the file path /path_to_file, and prepend your account name wasbs://container@account/. The complete data location uses the following syntax:

wasbs://container@account/path_to_file/file.ext

where container is the name of the container, account is the name of the storage account, and path_to_file is the path to the file or folders.
For example, if you have a file airlinesmall.csv in a folder /airline on a test storage account wasbs://blobContainer@storageAccount.blob.core.windows.net/, then you can create a datastore by using:
location = 'wasbs://blobContainer@storageAccount.blob.core.windows.net/airline/airlinesmall.csv';
ds = tabularTextDatastore(location, 'TreatAsMissing', 'NA', ...
    'SelectedVariableNames', {'ArrDelay'});
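You can then work with this datastore as you would a local one; for example, a quick check of the selected variable (assuming the account and file above are accessible):

preview(ds)    % display the first few rows of ArrDelay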
You can use Azure for all calculations that datastores support, including direct reading, mapreduce, tall arrays, and deep learning. For example, create an ImageDatastore, read a specified image from the datastore, and then display the image to the screen.
setenv('MW_WASB_SAS_TOKEN', 'YOUR_WASB_SAS_TOKEN');

ds = imageDatastore('wasbs://YourContainer@YourAccount.blob.core.windows.net/', ...
    'IncludeSubfolders', true, 'LabelSource', 'foldernames');
img = ds.readimage(1);
imshow(img)
To write data to Azure, call the write
function on a distributed or tall array,
and provide the full path to a folder in the cloud storage. The following example shows
how to read tabular data from Azure into a tall array, preprocess it by removing missing entries and sorting,
and then write it back to Azure.
setenv('MW_WASB_SAS_TOKEN', 'YOUR_WASB_SAS_TOKEN');

ds = tabularTextDatastore('wasbs://YourContainer@YourAccount.blob.core.windows.net/dataset/airlinesmall.csv', ...
    'TreatAsMissing', 'NA', 'SelectedVariableNames', {'ArrDelay'});
tt = tall(ds);
tt = sortrows(rmmissing(tt));
write('wasbs://YourContainer@YourAccount.blob.core.windows.net/preprocessedData/', tt);
To read your tall data back, use the datastore
function.
ds = datastore('wasbs://YourContainer@YourAccount.blob.core.windows.net/preprocessedData/');
tt = tall(ds);
You also can create a datastore for a collection of text files or sequence files that reside on the Hadoop Distributed File System (HDFS) using the tabularTextDatastore function. When you specify the location of the data, you must specify the full path to the files or folders using a uniform resource locator (URL) of one of these forms:
hdfs:/path_to_file
hdfs:///path_to_file
hdfs://hostname/path_to_file
where hostname is the name of the host or server and path_to_file is the path to the file or folders. Specifying the hostname is optional. When you do not specify the hostname, Hadoop uses the default host name associated with the Hadoop Distributed File System (HDFS) installation in MATLAB.
For example, both of these commands create a datastore for the file file1.txt in a folder named data located on a host named myserver:
ds = tabularTextDatastore('hdfs:///data/file1.txt')
ds = tabularTextDatastore('hdfs://myserver/data/file1.txt')
If hostname is specified, it must correspond to the namenode defined by the fs.default.name property in the Hadoop XML configuration files for your Hadoop cluster.
Optionally, you can include the port number. For example, this location specifies a host named myserver with port 7867, containing the file file1.txt in a folder named data:
'hdfs://myserver:7867/data/file1.txt'
The specified port number must match the port number set in your HDFS configuration.
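For example, assuming that port is set in your HDFS configuration, you can create the datastore directly from the port-qualified URL:

ds = tabularTextDatastore('hdfs://myserver:7867/data/file1.txt');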
Before reading from HDFS, use the setenv function to set the appropriate environment variable to the folder where Hadoop is installed. This folder must be accessible from the current machine.

Hadoop v1 only — Set the HADOOP_HOME environment variable.
Hadoop v2 only — Set the HADOOP_PREFIX environment variable.
If you work with both Hadoop v1 and Hadoop v2, or if the HADOOP_HOME and HADOOP_PREFIX environment variables are not set, then set the MATLAB_HADOOP_INSTALL environment variable.
For example, use this command to set the HADOOP_HOME environment variable, where hadoop-folder is the folder where Hadoop is installed and /mypath/ is the path to that folder.
setenv('HADOOP_HOME','/mypath/hadoop-folder');
If your current machine has access to HDFS data on Hortonworks or Cloudera®, then you do not have to set the HADOOP_HOME or HADOOP_PREFIX environment variables. MATLAB automatically assigns these environment variables when using Hortonworks or Cloudera application edge nodes.
When reading from HDFS, or when reading Sequence files locally, the datastore function calls the javaaddpath command. This command does the following:
Clears the definitions of all Java® classes defined by files on the dynamic class path
Removes all global variables and variables from the base workspace
Removes all compiled scripts, functions, and MEX-functions from memory
To prevent persistent variables, code files, or MEX-files from being cleared, use the mlock function.
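For example, a minimal sketch of protecting persistent state with mlock; the function name and counter here are hypothetical:

function callCounter
% callCounter  Hypothetical function whose persistent count should survive javaaddpath.
    mlock                      % lock this function in memory so its persistent data is not cleared
    persistent numCalls
    if isempty(numCalls)
        numCalls = 0;
    end
    numCalls = numCalls + 1;
    fprintf('callCounter has run %d times\n', numCalls);
end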
Use the write function to write your tall and distributed arrays to a Hadoop Distributed File System. When you call this function on a distributed or tall array, you must specify the full path to an HDFS folder. The following example shows how to read tabular data from HDFS into a tall array, preprocess it by removing missing entries and sorting, and then write it back to HDFS.
ds = tabularTextDatastore('hdfs://myserver/some/path/dataset/airlinesmall.csv', ...
    'TreatAsMissing', 'NA', 'SelectedVariableNames', {'ArrDelay'});
tt = tall(ds);
tt = sortrows(rmmissing(tt));
write('hdfs://myserver/some/path/preprocessedData/', tt);
To read your tall data back, use the datastore
function.
ds = datastore('hdfs://myserver/some/path/preprocessedData/');
tt = tall(ds);
See Also: datastore | imageDatastore | imread | imshow | javaaddpath | mlock | setenv | tabularTextDatastore | write