Work with Remote Data

In MATLAB®, you can read and write data to and from a remote location, such as cloud storage in Amazon S3™ (Simple Storage Service), Microsoft® Azure® Storage Blob, and Hadoop® Distributed File System (HDFS™).

You can access remote data using datastore objects. Use the datastore to examine part of your data from your desktop version of MATLAB. Then, after prototyping your code locally, you can scale up to a cluster or cloud. Scaling up improves efficiency because large calculations run faster in the same location as the data. To write data to a remote location, use the write function on a tall or distributed array.
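The prototyping step described above can be sketched entirely on a local machine before any remote location is involved. The following is a minimal sketch, not a prescribed workflow: the temporary folder, the file name sample.csv, and the delay values are made up for illustration, and mapreducer(0) is used on the assumption that you want the tall calculations to run in the local MATLAB session.

```matlab
% Run tall array calculations in the local MATLAB session (no parallel pool).
mapreducer(0);

% Create a small CSV file as a stand-in for the remote data set.
% The folder, file name, and values are hypothetical.
workDir = tempname;
mkdir(workDir);
csvFile = fullfile(workDir, 'sample.csv');
ArrDelay = [12; NaN; -3; 45; NaN; 0];
DepDelay = [5; 7; 0; 30; NaN; 2];
writetable(table(ArrDelay, DepDelay), csvFile);

% Same pattern as the remote examples: datastore -> tall -> preprocess -> write.
ds = tabularTextDatastore(csvFile, 'TreatAsMissing', 'NA', ...
    'SelectedVariableNames', {'ArrDelay'});
tt = tall(ds);
tt = sortrows(rmmissing(tt));

outDir = fullfile(workDir, 'preprocessedData');
write(outDir, tt);

% Read the written data back and bring it into memory.
result = gather(tall(datastore(outDir)));
disp(result)
```

Once the local version behaves as expected, the same code should carry over by swapping the local paths for remote URLs of the forms described in the sections that follow.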

Amazon S3

MATLAB lets you work with Amazon S3, an online file storage web service offered by Amazon Web Services. You can use data stored on Amazon S3 with datastore objects such as ImageDatastore, FileDatastore, SpreadsheetDatastore, or TabularTextDatastore. When you specify the location of the data, you must specify the full path to the files or folders using a uniform resource locator (URL) of the form:

s3://bucketname/path_to_file

bucketname is the name of the container (the bucket), and path_to_file is the path to the file or folder.

Amazon S3 provides data storage through web services interfaces. You can use a bucket as a container to store objects in Amazon S3.

Set Up Access

To work with remote data in Amazon S3, you must set up access first:

  1. Sign up for an Amazon Web Services (AWS) root account. See Amazon Web Services: Account.

  2. Using your AWS root account, create an IAM (Identity and Access Management) user. See Creating an IAM User in Your AWS Account.

  3. Generate an access key to receive an access key ID and a secret access key. See Managing Access Keys for IAM Users.

  4. Configure your machine with the AWS access key ID, secret access key, and region using the AWS Command Line Interface tool from https://aws.amazon.com/cli/. Alternatively, set the environment variables directly by using setenv:

    • AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY — Authenticate and enable use of Amazon S3 services. (You generated this pair of access key variables in step 3.)

    • AWS_DEFAULT_REGION (optional) — Select the geographic region of your bucket. The value of this environment variable is typically determined automatically, but the bucket owner might require that you set it manually.

    • AWS_SESSION_TOKEN (optional) — Specify the session token if you are using temporary security credentials, such as with AWS® Federated Authentication.
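If you choose the setenv route, the variables listed above could be set along these lines. This is a sketch with placeholder values: the credential strings must be replaced with your own, 'us-east-1' is only an example region, and the final check is an illustrative sanity test on the current session, not something MATLAB requires.

```matlab
% Placeholder credentials: replace with the pair generated in step 3.
setenv('AWS_ACCESS_KEY_ID', 'YOUR_AWS_ACCESS_KEY_ID');
setenv('AWS_SECRET_ACCESS_KEY', 'YOUR_AWS_SECRET_ACCESS_KEY');

% Optional: region of the bucket ('us-east-1' is only an example value).
setenv('AWS_DEFAULT_REGION', 'us-east-1');

% Optional: needed only with temporary security credentials.
setenv('AWS_SESSION_TOKEN', 'YOUR_AWS_SESSION_TOKEN');

% Confirm that the required variables are visible to this MATLAB session.
required = {'AWS_ACCESS_KEY_ID', 'AWS_SECRET_ACCESS_KEY'};
isSet = ~cellfun(@(name) isempty(getenv(name)), required);
assert(all(isSet), 'Missing AWS credential environment variables.');
```

Variables set with setenv apply to the current MATLAB session only, so they need to be set again in each new session.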

If you are using Parallel Computing Toolbox™, you must ensure the cluster has been configured to access S3 services. You can copy your client environment variables to the workers on a cluster by setting EnvironmentVariables in parpool, batch, createJob, or in the Cluster Profile Manager.

Read Data from Amazon S3

You can read data from Amazon S3 using datastore objects. For example, create an ImageDatastore, read a specified image from the datastore, and then display the image to screen.

setenv('AWS_ACCESS_KEY_ID', 'YOUR_AWS_ACCESS_KEY_ID'); 
setenv('AWS_SECRET_ACCESS_KEY', 'YOUR_AWS_SECRET_ACCESS_KEY');

ds = imageDatastore('s3://bucketname/image_datastore/jpegfiles', ...
  'IncludeSubfolders', true, 'LabelSource', 'foldernames'); 
img = ds.readimage(1);
imshow(img)

Write Data to Amazon S3

To write data to Amazon S3, call the write function on a distributed or tall array, and provide the full path to a folder in the cloud storage. The following example shows how to read tabular data from Amazon S3 into a tall array, preprocess it by removing missing entries and sorting, and then write it back to Amazon S3.

setenv('AWS_ACCESS_KEY_ID', 'YOUR_AWS_ACCESS_KEY_ID');
setenv('AWS_SECRET_ACCESS_KEY', 'YOUR_AWS_SECRET_ACCESS_KEY');

ds = tabularTextDatastore('s3://bucketname/dataset/airlinesmall.csv', ...
  'TreatAsMissing', 'NA', 'SelectedVariableNames', {'ArrDelay'});
tt = tall(ds);
tt = sortrows(rmmissing(tt));
write('s3://bucketname/preprocessedData/',tt);

To read your tall data back, use the datastore function.

ds = datastore('s3://bucketname/preprocessedData/');
tt = tall(ds);

Microsoft Azure Storage Blob

MATLAB lets you work with Windows Azure® Storage Blob (WASB), an online file storage web service offered by Microsoft. You can use data stored on Azure with datastore objects such as ImageDatastore, FileDatastore, SpreadsheetDatastore, or TabularTextDatastore. When you specify the location of the data, you must specify the full path to the files or folders using a uniform resource locator (URL) of the form:

wasbs://container@account/path_to_file/file.ext

container@account identifies the container and the storage account, and path_to_file is the path to the file or folder.

Azure provides data storage through web services interfaces. You can use a blob to store data files in Azure. See Introduction to Azure for more information.

Set Up Access

To work with remote data in Azure storage, you must set up access first:

  1. Sign up for a Microsoft Azure account. See Microsoft Azure Account.

  2. Set up your authentication details by setting exactly one of the two following environment variables using setenv:

    • MW_WASB_SAS_TOKEN — Authentication via Shared Access Signature (SAS)

      Obtain an SAS. For details, see the "Get the SAS for a blob container" section in Manage Azure Blob Storage resources with Storage Explorer.

      In MATLAB, set MW_WASB_SAS_TOKEN to the SAS query string. For example,

      setenv MW_WASB_SAS_TOKEN '?st=2017-04-11T09%3A45%3A00Z&se=2017-05-12T09%3A45%3A00Z&sp=rl&sv=2015-12-11&sr=c&sig=E12eH4cRCLilp3Tw%2BArdYYR8RruMW45WBXhWpMzSRCE%3D'

      You must set this string to a valid SAS token generated from the Azure Storage web UI or Explorer.

    • MW_WASB_SECRET_KEY — Authentication via one of the Account's two secret keys

      Each Storage Account has two secret keys that permit administrative privilege access. This same access can be given to MATLAB without having to create an SAS token by setting the MW_WASB_SECRET_KEY environment variable. For example:

      setenv MW_WASB_SECRET_KEY '1234567890ABCDEF1234567890ABCDEF1234567890ABCDEF'
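Because exactly one of the two variables should be set, a quick check in the session can catch a misconfiguration early. The following is a minimal sketch with a placeholder token; setting MW_WASB_SECRET_KEY to an empty value here is just one way to make sure only the SAS method is active in this example, and the check itself is illustrative rather than part of MATLAB.

```matlab
% Placeholder token: replace with a real SAS query string.
setenv('MW_WASB_SAS_TOKEN', 'YOUR_WASB_SAS_TOKEN');
% Empty the other variable so that only one authentication method is set.
setenv('MW_WASB_SECRET_KEY', '');

% Check that exactly one of the two variables has a value.
hasSas = ~isempty(getenv('MW_WASB_SAS_TOKEN'));
hasKey = ~isempty(getenv('MW_WASB_SECRET_KEY'));
if xor(hasSas, hasKey)
    disp('Exactly one Azure authentication variable is set.');
else
    error('Set exactly one of MW_WASB_SAS_TOKEN or MW_WASB_SECRET_KEY.');
end
```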

If you are using Parallel Computing Toolbox, you must copy your client environment variables to the workers on a cluster by setting EnvironmentVariables in parpool, batch, createJob, or in the Cluster Profile Manager.

For more information, see Use Azure storage with Azure HDInsight clusters.

Read Data from Azure

To read data from a WASB location, use a datastore object. To produce the file location, start with the file name file.ext, and prefix it with the file path /path_to_file and your container and account name wasbs://container@account/. The complete data location uses the following syntax:

wasbs://container@account/path_to_file/file.ext

container@account identifies the container and the storage account, and path_to_file is the path to the file or folder.

For example, if you have a file airlinesmall.csv in a folder /airline on a test storage account wasbs://blobContainer@storageAccount.blob.core.windows.net/, then you can create a datastore by using:

location = 'wasbs://blobContainer@storageAccount.blob.core.windows.net/airline/airlinesmall.csv';
ds = tabularTextDatastore(location, 'TreatAsMissing', 'NA', ...
  'SelectedVariableNames', {'ArrDelay'});
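The same location string can also be assembled from its parts, which may be convenient when the container or account changes between environments. This sketch reuses the example names from above; the variable names are chosen here for illustration only.

```matlab
% Parts of the location, taken from the example above.
container  = 'blobContainer';
account    = 'storageAccount.blob.core.windows.net';
pathToFile = 'airline';
fileName   = 'airlinesmall.csv';

% Assemble wasbs://container@account/path_to_file/file.ext
location = sprintf('wasbs://%s@%s/%s/%s', container, account, pathToFile, fileName);
disp(location)
```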

You can use Azure for all calculations that datastores support, including direct reading, mapreduce, tall arrays, and deep learning. For example, create an ImageDatastore, read a specified image from the datastore, and then display the image on screen.

setenv('MW_WASB_SAS_TOKEN', 'YOUR_WASB_SAS_TOKEN');  
ds = imageDatastore('wasbs://YourContainer@YourAccount.blob.core.windows.net/', ...
  'IncludeSubfolders', true, 'LabelSource', 'foldernames'); 
img = ds.readimage(1);
imshow(img)

Write Data to Azure

To write data to Azure, call the write function on a distributed or tall array, and provide the full path to a folder in the cloud storage. The following example shows how to read tabular data from Azure into a tall array, preprocess it by removing missing entries and sorting, and then write it back to Azure.

setenv('MW_WASB_SAS_TOKEN', 'YOUR_WASB_SAS_TOKEN');  

ds = tabularTextDatastore('wasbs://YourContainer@YourAccount.blob.core.windows.net/dataset/airlinesmall.csv', ...
  'TreatAsMissing', 'NA', 'SelectedVariableNames', {'ArrDelay'});
tt = tall(ds);
tt = sortrows(rmmissing(tt));
write('wasbs://YourContainer@YourAccount.blob.core.windows.net/preprocessedData/',tt);

To read your tall data back, use the datastore function.

ds = datastore('wasbs://YourContainer@YourAccount.blob.core.windows.net/preprocessedData/');
tt = tall(ds);

Hadoop Distributed File System

Specify Location of Data

You can also create a datastore for a collection of text files or sequence files that reside on the Hadoop Distributed File System (HDFS) using the tabularTextDatastore function. When you specify the location of the data, you must specify the full path to the files or folders using a uniform resource locator (URL) of one of these forms:

hdfs:/path_to_file
hdfs:///path_to_file
hdfs://hostname/path_to_file

hostname is the name of the host or server and path_to_file is the path to the file or folders. Specifying the hostname is optional. When you do not specify the hostname, Hadoop uses the default host name associated with the Hadoop Distributed File System (HDFS) installation in MATLAB.

For example, if myserver is the default host, both of these commands create a datastore for the file file1.txt in a folder named data on the host myserver:

  • ds = tabularTextDatastore('hdfs:///data/file1.txt') 
  • ds = tabularTextDatastore('hdfs://myserver/data/file1.txt') 

If hostname is specified, it must correspond to the namenode defined by the fs.default.name property in the Hadoop XML configuration files for your Hadoop cluster.

Optionally, you can include the port number. For example, this location specifies a host named myserver with port 7867, containing the file file1.txt in a folder named data:

'hdfs://myserver:7867/data/file1.txt'

The specified port number must match the port number set in your HDFS configuration.
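To make the URL variants concrete, the sketch below builds the location string with and without a host name and port. It uses the example host, port, and file from the text; the variable names are hypothetical.

```matlab
hostName   = 'myserver';        % example host from the text
port       = 7867;              % example port from the text
pathToFile = 'data/file1.txt';  % example file from the text

% No host name: the default host of the HDFS installation is used.
noHost = sprintf('hdfs:///%s', pathToFile);

% Explicit host name.
withHost = sprintf('hdfs://%s/%s', hostName, pathToFile);

% Explicit host name and port number.
withPort = sprintf('hdfs://%s:%d/%s', hostName, port, pathToFile);

disp({noHost; withHost; withPort})
```

Any of the three strings can then be passed to tabularTextDatastore as the location.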

Set Hadoop Environment Variable

Before reading from HDFS, use the setenv function to set the appropriate environment variable to the folder where Hadoop is installed. This folder must be accessible from the current machine.

  • Hadoop v1 only — Set the HADOOP_HOME environment variable.

  • Hadoop v2 only — Set the HADOOP_PREFIX environment variable.

  • If you work with both Hadoop v1 and Hadoop v2, or if the HADOOP_HOME and HADOOP_PREFIX environment variables are not set, then set the MATLAB_HADOOP_INSTALL environment variable.

For example, use this command to set the HADOOP_HOME environment variable. hadoop-folder is the folder where Hadoop is installed, and /mypath/ is the path to that folder.

setenv('HADOOP_HOME','/mypath/hadoop-folder');
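For the case where neither HADOOP_HOME nor HADOOP_PREFIX is set, a sketch along the following lines falls back to MATLAB_HADOOP_INSTALL. The folder is the placeholder path from the example above, and the accessibility warning (including its identifier) is an illustrative addition, not a check that MATLAB performs for you.

```matlab
% Placeholder path from the example above: replace with your Hadoop folder.
hadoopFolder = '/mypath/hadoop-folder';

% If neither version-specific variable is set, use MATLAB_HADOOP_INSTALL.
if isempty(getenv('HADOOP_HOME')) && isempty(getenv('HADOOP_PREFIX'))
    setenv('MATLAB_HADOOP_INSTALL', hadoopFolder);
end

% The folder must be accessible from the current machine.
if ~isfolder(hadoopFolder)
    warning('remoteData:hadoopFolderMissing', ...
        'Hadoop folder %s is not accessible from this machine.', hadoopFolder);
end
```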

HDFS Data on Hortonworks or Cloudera

If your current machine has access to HDFS data on Hortonworks or Cloudera®, then you do not have to set the HADOOP_HOME or HADOOP_PREFIX environment variables. MATLAB automatically assigns these environment variables when using Hortonworks or Cloudera application edge nodes.

Prevent Clearing Code from Memory

When reading from HDFS or when reading Sequence files locally, the datastore function calls the javaaddpath command. This command does the following:

  • Clears the definitions of all Java® classes defined by files on the dynamic class path

  • Removes all global variables and variables from the base workspace

  • Removes all compiled scripts, functions, and MEX-functions from memory

To prevent persistent variables, code files, or MEX-files from being cleared, use the mlock function.
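As an illustration of the mlock approach, the sketch below shows a function that keeps a persistent counter and locks itself in memory. The function name callCounter is hypothetical, and the code is assumed to be saved in its own file, callCounter.m.

```matlab
function n = callCounter()
% callCounter  Return the number of times this function has been called.
% mlock keeps this function, and therefore its persistent variable,
% in memory when functions are cleared.
    mlock
    persistent count
    if isempty(count)
        count = 0;
    end
    count = count + 1;
    n = count;
end
```

While the function is locked, mislocked('callCounter') returns true. Call munlock('callCounter') when you want the function to be clearable again.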

Write Data to HDFS

Use the write function to write your tall and distributed arrays to a Hadoop Distributed File System. When you call this function on a distributed or tall array, you must specify the full path to an HDFS folder. The following example shows how to read tabular data from HDFS into a tall array, preprocess it by removing missing entries and sorting, and then write it back to HDFS.

ds = tabularTextDatastore('hdfs://myserver/some/path/dataset/airlinesmall.csv', ...
  'TreatAsMissing', 'NA', 'SelectedVariableNames', {'ArrDelay'});
tt = tall(ds);
tt = sortrows(rmmissing(tt));
write('hdfs://myserver/some/path/preprocessedData/',tt);

To read your tall data back, use the datastore function.

ds = datastore('hdfs://myserver/some/path/preprocessedData/');
tt = tall(ds);
