Data Sets for Deep Learning

Use these data sets to get started with deep learning applications.

Image Data Sets

Data Set | Description | Task

Digits

The digits data set consists of 10,000 synthetic grayscale images of handwritten digits. Each image is 28-by-28 pixels and has an associated label denoting which digit the image represents (0–9). Each image has been rotated by a certain angle. When loading the images as arrays, you can also load the rotation angle of the image.

Load the digits data as in-memory numeric arrays using the digitTrain4DArrayData and digitTest4DArrayData functions.

[XTrain,YTrain,anglesTrain] = digitTrain4DArrayData;
[XTest,YTest,anglesTest] = digitTest4DArrayData;

For examples showing how to process this data for deep learning, see Monitor Deep Learning Training Progress and Train Convolutional Neural Network for Regression.

Image classification and image regression

Load the digits data as an image datastore using the imageDatastore function and specify the folder containing the image data.

dataFolder = fullfile(toolboxdir('nnet'),'nndemos','nndatasets','DigitDataset');
imds = imageDatastore(dataFolder, ...
    'IncludeSubfolders',true, ...
    'LabelSource','foldernames');
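
As a minimal sketch of how the 'foldernames' label source behaves, the following code builds a small throwaway folder tree and inspects the labels that imageDatastore assigns. The folder name labelSourceSketch, the class names, and the blank 8-by-8 placeholder images are assumptions made for illustration and are not part of the digits data set.

```matlab
% Build a small hypothetical folder tree: one subfolder per class.
rootFolder = fullfile(tempdir,'labelSourceSketch');   % assumed folder name
classNames = {'classA','classB'};                     % assumed class names
for i = 1:numel(classNames)
    classFolder = fullfile(rootFolder,classNames{i});
    if ~exist(classFolder,'dir')
        mkdir(classFolder);
    end
    for k = 1:2
        % Write a blank placeholder image into the class folder.
        imwrite(zeros(8,8,'uint8'), ...
            fullfile(classFolder,sprintf('img%d.png',k)));
    end
end

% Each image takes its label from the name of its parent folder.
imdsSketch = imageDatastore(rootFolder, ...
    'IncludeSubfolders',true, ...
    'LabelSource','foldernames');

% Count the number of images per inferred label.
tblCounts = countEachLabel(imdsSketch);
```

The countEachLabel function returns a table with one row per class, which is a quick way to confirm that the labels were inferred as expected.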

For an example showing how to process this data for deep learning, see Create Simple Deep Learning Network for Classification.

Image classification

MNIST

The MNIST data set consists of 70,000 handwritten digits split into training and test partitions of 60,000 and 10,000 images, respectively. Each image is 28-by-28 pixels and has an associated label denoting which digit the image represents (0–9).

Download the MNIST files from http://yann.lecun.com/exdb/mnist/ and load the data set into the workspace. To load the data from the files as MATLAB arrays, extract and place the files in the working directory, then use the helper functions processImagesMNIST and processLabelsMNIST, which are used in the example Train Variational Autoencoder (VAE) to Generate Images.

oldpath = addpath(fullfile(matlabroot,'examples','nnet','main'));
filenameImagesTrain = 'train-images.idx3-ubyte';
filenameLabelsTrain = 'train-labels.idx1-ubyte';
filenameImagesTest = 't10k-images.idx3-ubyte';
filenameLabelsTest = 't10k-labels.idx1-ubyte';

XTrain = processImagesMNIST(filenameImagesTrain);
YTrain = processLabelsMNIST(filenameLabelsTrain);
XTest = processImagesMNIST(filenameImagesTest);
YTest = processLabelsMNIST(filenameLabelsTest);

For an example showing how to process this data for deep learning, see Train Variational Autoencoder (VAE) to Generate Images.

To restore the path, use the path function.

path(oldpath);

Image classification

Omniglot

The Omniglot data set contains character sets for 50 alphabets, divided into 30 sets for training and 20 sets for testing. Each alphabet contains a number of characters, from 14 for Ojibwe (Canadian Aboriginal syllabics) to 55 for Tifinagh. Finally, each character has 20 handwritten observations.

Download and extract the Omniglot data set [1] from https://github.com/brendenlake/omniglot. Set downloadFolder to the location of the data.

downloadFolder = tempdir;

url = "https://github.com/brendenlake/omniglot/raw/master/python";
urlTrain = url + "/images_background.zip";
urlTest = url + "/images_evaluation.zip";

filenameTrain = fullfile(downloadFolder,"images_background.zip");
filenameTest = fullfile(downloadFolder,"images_evaluation.zip");

dataFolderTrain = fullfile(downloadFolder,"images_background");
dataFolderTest = fullfile(downloadFolder,"images_evaluation");

if ~exist(dataFolderTrain,"dir")
    fprintf("Downloading Omniglot training data set (4.5 MB)... ")
    websave(filenameTrain,urlTrain);
    unzip(filenameTrain,downloadFolder);
    fprintf("Done.\n")
end

if ~exist(dataFolderTest,"dir")
    fprintf("Downloading Omniglot test data (3.2 MB)... ")
    websave(filenameTest,urlTest);
    unzip(filenameTest,downloadFolder);
    fprintf("Done.\n")
end

To load the training and test data as image datastores, use the imageDatastore function. Specify the labels manually by extracting the labels from the file names and setting the Labels property.

imdsTrain = imageDatastore(dataFolderTrain, ...
    'IncludeSubfolders',true, ...
    'LabelSource','none');

files = imdsTrain.Files;
parts = split(files,filesep);
labels = join(parts(:,(end-2):(end-1)),'_');
imdsTrain.Labels = categorical(labels);

imdsTest = imageDatastore(dataFolderTest, ...
    'IncludeSubfolders',true, ...
    'LabelSource','none');

files = imdsTest.Files;
parts = split(files,filesep);
labels = join(parts(:,(end-2):(end-1)),'_');
imdsTest.Labels = categorical(labels);
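
The label extraction above can be sketched on its own, without downloading the data. The two file paths below are hypothetical stand-ins that follow the alphabet/character/file layout; only the split, join, and categorical calls mirror the code above.

```matlab
% Hypothetical file paths of the form <root>/<alphabet>/<character>/<file>.
files = [ ...
    fullfile("images_background","Latin","character01","0001_01.png"); ...
    fullfile("images_background","Greek","character05","0005_02.png")];

% Split each path into its folder components (one row per file).
parts = split(files,filesep);

% Join the alphabet and character folder names to form one label per file.
labels = join(parts(:,(end-2):(end-1)),'_');
labels = categorical(labels);
```

Because split returns a rectangular array, this approach relies on every file sitting at the same folder depth.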

For an example showing how to process this data for deep learning, see Train a Siamese Network to Compare Images.

Image similarity

Flowers

Image credits: [3] [4] [5] [6]

The Flowers data set contains 3670 images of flowers belonging to five classes (daisy, dandelion, roses, sunflowers, and tulips).

Download and extract the Flowers data set [2] from http://download.tensorflow.org/example_images/flower_photos.tgz. The data set is about 218 MB. Depending on your internet connection, the download process can take some time. Set downloadFolder to the location of the data.

url = 'http://download.tensorflow.org/example_images/flower_photos.tgz';
downloadFolder = tempdir;
filename = fullfile(downloadFolder,'flower_dataset.tgz');

dataFolder = fullfile(downloadFolder,'flower_photos');
if ~exist(dataFolder,'dir')
    fprintf("Downloading Flowers data set (218 MB)... ")
    websave(filename,url);
    untar(filename,downloadFolder)
    fprintf("Done.\n")
end

Load the data as an image datastore using the imageDatastore function and specify the folder containing the image data.

imds = imageDatastore(dataFolder, ...
    'IncludeSubfolders',true, ...
    'LabelSource','foldernames');

For an example showing how to process this data for deep learning, see Train Generative Adversarial Network (GAN).

Image classification

Example Food Images

The Example Food Images data set contains 978 photographs of food in nine classes (ceaser_salad, caprese_salad, french_fries, greek_salad, hamburger, hot_dog, pizza, sashimi, and sushi).

Download and extract the Example Food Images data set from https://www.mathworks.com/supportfiles/nnet/data/ExampleFoodImageDataset.zip. This data set is about 77 MB. Depending on your internet connection, the download process can take some time. Set downloadFolder to the location of the data.

url = "https://www.mathworks.com/supportfiles/nnet/data/ExampleFoodImageDataset.zip";
downloadFolder = tempdir;
filename = fullfile(downloadFolder,'ExampleFoodImageDataset.zip');

dataFolder = fullfile(downloadFolder, "ExampleFoodImageDataset");
if ~exist(dataFolder, "dir")
    fprintf("Downloading Example Food Image data set (77 MB)... ")
    websave(filename,url);
    unzip(filename,downloadFolder);
    fprintf("Done.\n")
end

For an example showing how to process this data for deep learning, see View Network Behavior Using tsne.

Image classification

CIFAR-10

The CIFAR-10 data set contains 60,000 color images of size 32-by-32 pixels, belonging to 10 classes (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck).

There are 6000 images per class and the data set is split into a training set with 50,000 images and a test set with 10,000 images. This data set is one of the most widely used data sets for testing new image classification models.

Download and extract the CIFAR-10 data set [7] from https://www.cs.toronto.edu/%7Ekriz/cifar-10-matlab.tar.gz. The data set is about 175 MB. Depending on your internet connection, the download process can take some time. Set downloadFolder to the location of the data.

url = 'https://www.cs.toronto.edu/~kriz/cifar-10-matlab.tar.gz';
downloadFolder = tempdir;
filename = fullfile(downloadFolder,'cifar-10-matlab.tar.gz');

dataFolder = fullfile(downloadFolder,'cifar-10-batches-mat');
if ~exist(dataFolder,'dir')
    fprintf("Downloading CIFAR-10 dataset (175 MB)... ");
    websave(filename,url);
    untar(filename,downloadFolder);
    fprintf("Done.\n")
end

Convert the data to numeric arrays using the helper function loadCIFARData, which is used in the example Train Residual Network for Image Classification.

oldpath = addpath(fullfile(matlabroot,'examples','nnet','main'));
[XTrain,YTrain,XValidation,YValidation] = loadCIFARData(downloadFolder);

For an example showing how to process this data for deep learning, see Train Residual Network for Image Classification.

To restore the path, use the path function.

path(oldpath);

Image classification

MathWorks® Merch

This is a small data set containing 75 images of MathWorks merchandise, belonging to five different classes (cap, cube, playing cards, screwdriver, and torch). You can use this data set to try out transfer learning and image classification quickly.

The images are of size 227-by-227-by-3.

Extract the MathWorks Merch data set.

filename = 'MerchData.zip';

dataFolder = fullfile(tempdir,'MerchData');
if ~exist(dataFolder,'dir')
    unzip(filename,tempdir);
end

Load the data as an image datastore using the imageDatastore function and specify the folder containing the image data.

imds = imageDatastore(dataFolder, ...
    'IncludeSubfolders',true, ...
    'LabelSource','foldernames');

For examples showing how to process this data for deep learning, see Get Started with Transfer Learning and Train Deep Learning Network to Classify New Images.

Image classification

CamVid

The CamVid data set is a collection of images containing street-level views obtained from cars being driven. The data set is useful for training networks that perform semantic segmentation of images and provides pixel-level labels for 32 semantic classes, including car, pedestrian, and road.

The images are of size 720-by-960-by-3.

Download and extract the CamVid data set [8] from http://web4.cs.ucl.ac.uk/staff/g.brostow/MotionSegRecData. The data set is about 573 MB. Depending on your internet connection, the download process can take some time. Set downloadFolder to the location of the data.

downloadFolder = tempdir;
url = "http://web4.cs.ucl.ac.uk/staff/g.brostow/MotionSegRecData";
urlImages = url + "/files/701_StillsRaw_full.zip";
urlLabels = url + "/data/LabeledApproved_full.zip";

dataFolder = fullfile(downloadFolder,'CamVid');
dataFolderImages = fullfile(dataFolder,'images');
dataFolderLabels = fullfile(dataFolder,'labels');

filenameLabels = fullfile(dataFolder,'labels.zip');
filenameImages = fullfile(dataFolder,'images.zip');

if ~exist(filenameLabels,'file') || ~exist(filenameImages,'file')
    mkdir(dataFolder)
    
    fprintf("Downloading CamVid data set images (557 MB)... ");
    websave(filenameImages, urlImages);       
    unzip(filenameImages, dataFolderImages);
    fprintf("Done.\n")
   
    fprintf("Downloading CamVid data set labels (16 MB)... ");
    websave(filenameLabels, urlLabels);
    unzip(filenameLabels, dataFolderLabels);
    fprintf("Done.\n")
end

Load the data as a pixel label datastore using the pixelLabelDatastore function and specify the folder containing the label data, the classes, and the label IDs. To make training easier, group the 32 original classes in the data set into 11 classes. To get the label IDs, use the helper function camvidPixelLabelIDs, which is used in the example Semantic Segmentation Using Deep Learning.

oldpath = addpath(fullfile(matlabroot,'examples','deeplearning_shared','main'));
imds = imageDatastore(dataFolderImages,'IncludeSubfolders',true);

classes = ["Sky" "Building" "Pole" "Road" "Pavement" "Tree" ...
    "SignSymbol" "Fence" "Car" "Pedestrian" "Bicyclist"];

labelIDs = camvidPixelLabelIDs;

pxds = pixelLabelDatastore(dataFolderLabels,classes,labelIDs);

For an example showing how to process this data for deep learning, see Semantic Segmentation Using Deep Learning.

To restore the path, use the path function.

path(oldpath);

Semantic segmentation

Vehicle

The Vehicle data set consists of 295 images containing one or two labeled instances of a vehicle. This small data set is useful for exploring the YOLO-v2 training procedure, but in practice, more labeled images are needed to train a robust detector.

The images are of size 720-by-960-by-3.

Extract the Vehicle data set. Set dataFolder to the location of the data.

filename = 'vehicleDatasetImages.zip';

dataFolder = fullfile(tempdir,'vehicleImages');
if ~exist(dataFolder,'dir')
    unzip(filename,tempdir);
end

Load the data set as a table of file names and bounding boxes from the extracted MAT file and convert the file names to absolute file paths.

data = load('vehicleDatasetGroundTruth.mat');
vehicleDataset = data.vehicleDataset;

vehicleDataset.imageFilename = fullfile(tempdir,vehicleDataset.imageFilename);
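
The conversion to absolute paths works because fullfile accepts a cell array of file names and prepends the folder to every entry. The following is a minimal sketch of that step; the table tblSketch, its file names, and its bounding boxes are made-up stand-ins for the contents of the MAT file.

```matlab
% Hypothetical stand-in for the ground truth table.
imageFilename = {'vehicleImages/image_00001.jpg'; ...
    'vehicleImages/image_00002.jpg'};
vehicle = {[220 136 35 28]; [175 126 61 45]};   % made-up [x y width height] boxes
tblSketch = table(imageFilename,vehicle);

% fullfile prepends the folder to each relative file name in the cell array.
tblSketch.imageFilename = fullfile(tempdir,tblSketch.imageFilename);
```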

Create an image datastore containing the images and a box label datastore containing the bounding boxes using the imageDatastore and boxLabelDatastore functions, respectively. Combine the resulting datastores using the combine function.

filenamesImages = vehicleDataset.imageFilename;
tblBoxes = vehicleDataset(:,'vehicle');

imds = imageDatastore(filenamesImages);
blds = boxLabelDatastore(tblBoxes);

cds = combine(imds,blds);

For an example showing how to process this data for deep learning, see Object Detection Using YOLO v2 Deep Learning.

Object detection

RIT-18

[Image: Aerial photograph of Hamlin Beach State Park with colored pixel label overlay that indicates regions of grass, trees, sandy beach, asphalt, and other classes]

The RIT-18 data set contains image data captured by a drone over Hamlin Beach State Park, in New York state. The data contains labeled training, validation, and test sets, with 18 object class labels including road markings, tree, and building.

Download the RIT-18 data set [9] from https://www.cis.rit.edu/%7Ermk6217/rit18_data.mat. The data set is about 3 GB. Depending on your internet connection, the download process can take some time. Set downloadFolder to the location of the data.

downloadFolder = tempdir;
url = 'http://www.cis.rit.edu/~rmk6217/rit18_data.mat';
filename = fullfile(downloadFolder,'rit18_data.mat');

if ~exist(filename,'file')
    fprintf("Downloading Hamlin Beach data set (3 GB)... ");
    websave(filename,url);
    fprintf("Done.\n")
end

For an example showing how to process this data for deep learning, see Semantic Segmentation of Multispectral Images Using Deep Learning.

Semantic segmentation

BraTS

[Image: Axial slice of human brain with colored pixel label overlay that indicates regions of normal tissue and tumor tissue]

The BraTS data set contains MRI scans of brain tumors, namely gliomas, which are the most common primary brain malignancies.

The data set contains 750 4-D volumes, each representing a stack of 3-D images. Each 4-D volume is of size 240-by-240-by-155-by-4, where the first three dimensions correspond to the height, width, and depth of a 3-D volumetric image. The fourth dimension corresponds to different scan modalities. The data set is divided into 484 training volumes with voxel labels and 266 test volumes.

Create a directory to store the BraTS data set [10].

dataFolder = fullfile(tempdir,'BraTS');

if ~exist(dataFolder,'dir')
    mkdir(dataFolder);
end

Download the BraTS data from Medical Segmentation Decathlon by clicking the "Download Data" link. Download the "Task01_BrainTumour.tar" file. The data set is about 7 GB. Depending on your internet connection, the download process can take some time.

Extract the TAR file into the directory specified by the dataFolder variable. If the extraction is successful, then dataFolder contains a directory named Task01_BrainTumour that has three subdirectories: imagesTr, imagesTs, and labelsTr.

For an example showing how to process this data for deep learning, see 3-D Brain Tumor Segmentation Using Deep Learning.

Semantic segmentation

Camelyon16

[Image: Six patches of normal tissue samples]

The data from the Camelyon16 challenge contains a total of 400 whole-slide images (WSIs) of lymph nodes from two independent sources, separated into 270 training images and 130 test images. The WSIs are stored as TIF files in a stripped format with an 11-level pyramid structure.

The training data set consists of 159 WSIs of normal lymph nodes and 111 WSIs of lymph nodes with tumor and healthy tissue. Usually, the tumor tissue is a small fraction of the healthy tissue. Ground truth coordinates of the lesion boundaries accompany the tumor images.

Create directories to store the Camelyon16 data set [11].

dataFolderTrain = fullfile(tempdir,'Camelyon16','training');
dataFolderNormalTrain = fullfile(dataFolderTrain,'normal');
dataFolderTumorTrain = fullfile(dataFolderTrain,'tumor');
dataFolderAnnotationsTrain = fullfile(dataFolderTrain,'lesion_annotations');

if ~exist(dataFolderTrain,'dir')
    mkdir(dataFolderTrain);
    mkdir(dataFolderNormalTrain);
    mkdir(dataFolderTumorTrain);
    mkdir(dataFolderAnnotationsTrain);
end

Download the Camelyon16 data set from Camelyon17 by clicking the first "CAMELYON16 data set" link. Open the "training" directory, then follow these steps:

  • Download the "lesion_annotations.zip" file. Extract the files to the directory specified by the dataFolderAnnotationsTrain variable.

  • Open the "normal" directory. Download the images to the directory specified by the dataFolderNormalTrain variable.

  • Open the "tumor" directory. Download the images to the directory specified by the dataFolderTumorTrain variable.

The data set is about 2 GB. Depending on your internet connection, the download process can take some time.

For an example showing how to process this data for deep learning, see Deep Learning Classification of Large Multiresolution Images.

Image classification (large images)

Common Objects in Context (COCO)

The COCO 2014 train images data set consists of 82,783 images. The annotations data contains at least five captions corresponding to each image.

Create directories to store the COCO data set.

dataFolder = fullfile(tempdir,"coco");
if ~exist(dataFolder,'dir')
    mkdir(dataFolder);
end

Download and extract the COCO 2014 train images and captions from https://cocodataset.org/#download by clicking the "2014 Train images" and "2014 Train/Val annotations" links, respectively. Save the data in the folder specified by dataFolder.

Extract the captions from the file captions_train2014.json using the jsondecode function.

filename = fullfile(dataFolder,"annotations_trainval2014","annotations", ...
    "captions_train2014.json");
str = fileread(filename);
data = jsondecode(str);

The annotations field of the struct contains the data required for image captioning.
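
As a rough sketch of what working with the annotations field can look like, the following code decodes a small hand-written JSON fragment instead of the real file. The fragment, its captions, and the variable names ending in Sketch are assumptions for illustration; the field names image_id, id, and caption are assumed to match the layout of the captions file.

```matlab
% Hypothetical JSON fragment laid out like the annotations field.
strSketch = ['{"annotations":[' ...
    '{"image_id":1,"id":10,"caption":"A dog on a beach."},' ...
    '{"image_id":1,"id":11,"caption":"A dog runs along the shore."},' ...
    '{"image_id":2,"id":12,"caption":"Two bikes by a wall."}]}'];
dataSketch = jsondecode(strSketch);

% jsondecode returns an array of same-field objects as a struct array.
annotations = dataSketch.annotations;
imageIDs = [annotations.image_id]';
captions = string({annotations.caption})';

% Collect the captions that belong to the first image ID.
captionsFirstImage = captions(imageIDs == imageIDs(1));
```

Grouping captions by image ID in this way is one option for pairing each image with its captions.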

For an example showing how to process this data for deep learning, see Image Captioning Using Attention.

Image captioning

IAPR TC-12

[Image: A wall and gardens of the Alcazar royal palace in Seville, Spain (representative example)]

The IAPR TC-12 Benchmark consists of 20,000 still natural images. The data set includes photos of people, animals, cities, and more. The size of the data file is about 1.8 GB.

Download the IAPR TC-12 data set.

dataDir = fullfile(tempdir,'iaprtc12');
url = 'http://www-i6.informatik.rwth-aachen.de/imageclef/resources/iaprtc12.tgz';

if ~exist(dataDir,'dir')
    fprintf('Downloading IAPR TC-12 data set (1.8 GB)...\n');
    try
        untar(url,dataDir);
    catch 
        % On some Windows machines, the untar command errors for .tgz
        % files. Rename to .tg and try again.
        fileName = fullfile(tempdir,'iaprtc12.tg');
        websave(fileName,url);
        untar(fileName,dataDir);
    end
    fprintf('Done.\n\n');
end

Load the data as an image datastore using the imageDatastore function. Specify the folder containing the image data and the image file extensions.

imageDir = fullfile(dataDir,'images');
exts = {'.jpg','.bmp','.png'};
imds = imageDatastore(imageDir, ...
    'IncludeSubfolders',true, ...
    'FileExtensions',exts);

For an example showing how to process this data for deep learning, see Single Image Super-Resolution Using Deep Learning.

Image-to-image regression

Zurich RAW to RGB

[Image: Pair of RAW and RGB image patches of a street scene in Zurich]

The Zurich RAW to RGB data set contains 48,043 spatially registered pairs of RAW and RGB training image patches of size 448-by-448. The data set contains two separate test sets. One test set consists of 1,204 spatially registered pairs of RAW and RGB image patches of size 448-by-448. The other test set consists of unregistered full-resolution RAW and RGB images. The size of the data set is 22 GB.

Create a directory to store the data set.

imageDir = fullfile(tempdir,'ZurichRAWToRGB');
if ~exist(imageDir,'dir')
    mkdir(imageDir);
end

To download the data set, request access using the Zurich RAW to RGB dataset form. Extract the data into the directory specified by the imageDir variable. When extracted successfully, imageDir contains three directories named full_resolution, test, and train.

For an example showing how to process this data for deep learning, see Develop Raw Camera Processing Pipeline Using Deep Learning.

Image-to-image regression

Time Series and Signal Data Sets

Data | Description | Task

Japanese Vowels

The Japanese Vowels data set [12] [13] contains preprocessed sequences representing utterances of Japanese vowels from different speakers.

XTrain and XTest are cell arrays containing sequences of dimension 12 and varying length. Each entry is a matrix with 12 rows (one row for each feature) and a varying number of columns (one column for each time step). XTest contains 370 sequences. YTrain and YTest are categorical vectors of labels 1 to 9, which correspond to the nine speakers.

Load the Japanese Vowels data set as in-memory cell arrays containing numeric sequences using the japaneseVowelsTrainData and japaneseVowelsTestData functions.

[XTrain,YTrain] = japaneseVowelsTrainData;
[XTest,YTest] = japaneseVowelsTestData;
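
To illustrate the cell array layout described above, the following sketch inspects a small made-up cell array in place of XTrain. The variable XSketch and its random contents are assumptions; the sort by sequence length at the end is a common preprocessing step, shown here only as one possible use.

```matlab
% Hypothetical stand-in: three sequences with 12 features and varying length.
XSketch = {rand(12,20); rand(12,7); rand(12,15)};

% Number of features (rows) and time steps (columns) in each sequence.
numFeatures = cellfun(@(x) size(x,1),XSketch);
sequenceLengths = cellfun(@(x) size(x,2),XSketch);

% Sort the sequences by length, which can reduce padding in mini-batches.
[~,idx] = sort(sequenceLengths);
XSorted = XSketch(idx);
```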

For an example showing how to process this data for deep learning, see Sequence Classification Using Deep Learning.

Sequence-to-label classification

Chickenpox

The Chickenpox data set contains a single time series, with time steps corresponding to months and values corresponding to the number of cases. The output is a cell array, where each element is a single time step.

Load the Chickenpox data as a single numeric sequence using the chickenpox_dataset function. Reshape the data to be a row vector.

data = chickenpox_dataset;
data = [data{:}];
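
The reshaping step can be sketched with a short made-up cell array in place of the real data. The values in dataSketch and the 75% training fraction are assumptions chosen for illustration only.

```matlab
% Hypothetical stand-in for the cell array of monthly case counts.
dataSketch = {10, 12, 9, 15, 11, 14, 8, 13};

% Concatenate the cell contents into a 1-by-N numeric row vector.
dataSketch = [dataSketch{:}];

% Hold out the final time steps for testing (assumed 75/25 split).
numTimeStepsTrain = floor(0.75*numel(dataSketch));
dataTrain = dataSketch(1:numTimeStepsTrain);
dataTest = dataSketch(numTimeStepsTrain+1:end);
```

Splitting by position rather than at random keeps the test portion strictly later in time than the training portion, which suits forecasting tasks.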

For an example showing how to process this data for deep learning, see Time Series Forecasting Using Deep Learning.

Time-series forecasting

Human Activity

The Human Activity data set contains seven time series of sensor data obtained from a smartphone worn on the body. Each sequence has three features and varies in length. The three features correspond to accelerometer readings in three different directions.

Load the Human Activity data set.

dataTrain = load('HumanActivityTrain');
dataTest = load('HumanActivityTest');

XTrain = dataTrain.XTrain;
YTrain = dataTrain.YTrain;
XTest = dataTest.XTest;
YTest = dataTest.YTest;

For an example showing how to process this data for deep learning, see Sequence-to-Sequence Classification Using Deep Learning.

Sequence-to-sequence classification

Turbofan Engine Degradation Simulation

Each time series of the Turbofan Engine Degradation Simulation data set [14] represents a different engine. Each engine starts with unknown degrees of initial wear and manufacturing variation. The engine is operating normally at the start of each time series, and develops a fault at some point during the series. In the training set, the fault grows in magnitude until system failure.

The data contains ZIP-compressed text files with 26 columns of numbers, separated by spaces. Each row is a snapshot of data taken during a single operational cycle, and each column is a different variable. The columns correspond to the following:

  • Column 1 – Unit number

  • Column 2 – Time in cycles

  • Columns 3–5 – Operational settings

  • Columns 6–26 – Sensor measurements 1–21
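
The column layout above can be sketched as simple indexing into a numeric matrix. The matrix dataSketch below is random placeholder data with the same 26-column shape, not values from the data set, and the variable names are chosen for illustration.

```matlab
% Hypothetical stand-in for one file: 5 rows (cycles) by 26 columns.
dataSketch = rand(5,26);
dataSketch(:,1) = 1;          % column 1: unit number
dataSketch(:,2) = (1:5)';     % column 2: time in cycles

% Separate the columns according to the layout described above.
unitNumber = dataSketch(:,1);
timeInCycles = dataSketch(:,2);
operationalSettings = dataSketch(:,3:5);
sensorMeasurements = dataSketch(:,6:26);
```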

Create a directory to store the Turbofan Engine Degradation Simulation data set.

dataFolder = fullfile(tempdir,"turbofan");
if ~exist(dataFolder,'dir')
    mkdir(dataFolder);
end

Download and extract the Turbofan Engine Degradation Simulation Data Set from https://ti.arc.nasa.gov/tech/dash/groups/pcoe/prognostic-data-repository/.

Unzip the data from the file CMAPSSData.zip.

filename = "CMAPSSData.zip";
unzip(filename,dataFolder)

Load the training and test data using the helper functions processTurboFanDataTrain and processTurboFanDataTest, respectively. These functions are used in the example Sequence-to-Sequence Regression Using Deep Learning.

oldpath = addpath(fullfile(matlabroot,'examples','nnet','main'));
filenamePredictors = fullfile(dataFolder,"train_FD001.txt");
[XTrain,YTrain] = processTurboFanDataTrain(filenamePredictors);

filenamePredictors = fullfile(dataFolder,"test_FD001.txt");
filenameResponses = fullfile(dataFolder,"RUL_FD001.txt");
[XTest,YTest] = processTurboFanDataTest(filenamePredictors,filenameResponses);

For an example showing how to process this data for deep learning, see Sequence-to-Sequence Regression Using Deep Learning.

To restore the path, use the path function.

path(oldpath);

Sequence-to-sequence regression, predictive maintenance

PhysioNet 2017 Challenge

The PhysioNet 2017 Challenge data set [16] consists of a set of electrocardiogram (ECG) recordings sampled at 300 Hz and divided by a group of experts into four different classes: Normal (N), AFib (A), Other Rhythm (O), and Noisy Recording (~).

Download and extract the PhysioNet 2017 Challenge data set using the ReadPhysionetData script, which is used in the example Classify ECG Signals Using Long Short-Term Memory Networks.

The data set is about 95 MB. Depending on your internet connection, the download process can take some time.

oldpath = addpath(fullfile(matlabroot,'examples','deeplearning_shared','main'));
ReadPhysionetData
data = load('PhysionetData.mat');
signals = data.Signals;
labels = data.Labels;

For an example showing how to process this data for deep learning, see Classify ECG Signals Using Long Short-Term Memory Networks.

To restore the path, use the path function.

path(oldpath);

Sequence-to-label classification

Tennessee Eastman Process (TEP) simulation

This data set consists of MAT files converted from the Tennessee Eastman Process (TEP) simulation data.

Download the Tennessee Eastman Process (TEP) simulation data set [15] from the MathWorks support files site (see disclaimer). The data set has four components: fault-free training, fault-free testing, faulty training, and faulty testing. Download each file separately.

The data set is about 1.7 GB. Depending on your internet connection, the download process can take some time.

urlSupportFiles = "https://www.mathworks.com/supportfiles/predmaint";

url = urlSupportFiles + "/chemical-process-fault-detection-data/faultytesting.mat";
fprintf("Downloading TEP faulty testing data (1 GB)... ")
websave('faultytesting.mat',url);
fprintf("Done.\n")

url = urlSupportFiles + "/chemical-process-fault-detection-data/faultytraining.mat";
fprintf("Downloading TEP faulty training data (613 MB)... ")
websave('faultytraining.mat',url);
fprintf("Done.\n")

url = urlSupportFiles + "/chemical-process-fault-detection-data/faultfreetesting.mat";
fprintf("Downloading TEP fault-free testing data (69 MB)... ")
websave('faultfreetesting.mat',url);
fprintf("Done.\n")

url = urlSupportFiles + "/chemical-process-fault-detection-data/faultfreetraining.mat";
fprintf("Downloading TEP fault-free training data (36 MB)... ")
websave('faultfreetraining.mat',url);
fprintf("Done.\n")

Load the downloaded files into the MATLAB® workspace.

load('faultfreetesting.mat');
load('faultfreetraining.mat');
load('faultytesting.mat');
load('faultytraining.mat');

For an example showing how to process this data for deep learning, see Chemical Process Fault Detection Using Deep Learning.

Sequence-to-label classification

PhysioNet ECG Segmentation

The PhysioNet ECG Segmentation data set [16] [17] consists of roughly 15 minutes of ECG recordings from a total of 105 patients. To obtain each recording, the examiners placed two electrodes on different locations on a patient's chest, resulting in a two-channel signal. The database provides signal region labels generated by an automated expert system.

Download the PhysioNet ECG Segmentation data set from https://github.com/mathworks/physionet_ECG_segmentation by downloading the ZIP file QT_Database-master.zip. The data set is about 72 MB. Depending on your internet connection, the download process can take some time. Set downloadFolder to the location of the data.

downloadFolder = tempdir;

url = "https://github.com/mathworks/physionet_ECG_segmentation/raw/master/QT_Database-master.zip";
filename = fullfile(downloadFolder,"QT_Database-master.zip");

dataFolder = fullfile(downloadFolder,"QT_Database-master");

if ~exist(dataFolder,"dir")
    fprintf("Downloading Physionet ECG Segmentation data set (72 MB)... ")
    websave(filename,url);
    unzip(filename,downloadFolder);
    fprintf("Done.\n")
end

Unzipping creates the folder QT_Database-master in your temporary directory. This folder contains the text file README.md and the following files:

  • QTData.mat

  • Modified_physionet_data.txt

  • License.txt

QTData.mat contains the PhysioNet ECG Segmentation data. The file Modified_physionet_data.txt provides the source attributions for the data and a description of the operations applied to each raw ECG recording. Load the PhysioNet ECG Segmentation data from the MAT file.

load(fullfile(dataFolder,'QTData.mat'))

For an example showing how to process this data for deep learning, see Waveform Segmentation Using Deep Learning.

Sequence-to-label classification, waveform segmentation

Synthetic pedestrian, car, and bicyclist backscattering

Generate a synthetic pedestrian, car, and bicyclist backscattering data set using the helper functions helperBackScatterSignals and helperDopplerSignatures, which are used in the example Pedestrian and Bicyclist Classification Using Deep Learning.

The helper function helperBackScatterSignals generates a specified number of pedestrian, bicyclist, and car radar returns. For each realization, the return signals have dimensions Nfast-by-Nslow, where Nfast is the number of fast-time samples and Nslow is the number of slow-time samples.

The helper function helperDopplerSignatures computes the short-time Fourier transform (STFT) of a radar return to generate the micro-Doppler signature. To obtain the micro-Doppler signatures, use the helper functions to apply the STFT and a preprocessing method to each signal.

oldpath = addpath(fullfile(matlabroot,'examples','phased','main'));
numPed = 1; % Number of pedestrian realizations
numBic = 1; % Number of bicyclist realizations
numCar = 1; % Number of car realizations
[xPedRec,xBicRec,xCarRec,Tsamp] = helperBackScatterSignals(numPed,numBic,numCar);

[SPed,T,F] = helperDopplerSignatures(xPedRec,Tsamp);
[SBic,~,~] = helperDopplerSignatures(xBicRec,Tsamp);
[SCar,~,~] = helperDopplerSignatures(xCarRec,Tsamp);

For an example showing how to process this data for deep learning, see Pedestrian and Bicyclist Classification Using Deep Learning.

To restore the path, use the path function.

path(oldpath);

Sequence-to-label classification

Generated waveforms

Generate rectangular, linear FM, and phase coded waveforms using the helper function helperGenerateRadarWaveforms, which is used in the example Radar Waveform Classification Using Deep Learning.

The helper function helperGenerateRadarWaveforms generates 3000 signals with a sample rate of 100 MHz for each modulation type using phased.RectangularWaveform for rectangular pulses, phased.LinearFMWaveform for linear FM, and phased.PhaseCodedWaveform for phase-coded pulses with Barker code.

oldpath = addpath(fullfile(matlabroot,'examples','phased','main'));
[wav, modType] = helperGenerateRadarWaveforms;
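
As a rough picture of the three modulation types, the following sketch builds one pulse of each kind with base MATLAB functions only. It does not use the phased waveform objects and is not the implementation of helperGenerateRadarWaveforms. Only the 100 MHz sample rate comes from the description above; the pulse width, sweep bandwidth, Barker code length, and samples per chip are assumptions.

```matlab
% Illustrative pulses for the three modulation types (parameters assumed).
fs = 100e6;                          % sample rate, 100 MHz as stated above
pw = 1e-6;                           % pulse width in seconds (assumed)
bw = 10e6;                           % linear FM sweep bandwidth in Hz (assumed)
t = (0:round(pw*fs)-1)'/fs;          % time samples within one pulse

rectPulse = ones(size(t));           % rectangular pulse, constant envelope
lfmPulse = exp(1j*pi*(bw/pw)*t.^2);  % linear FM, frequency rises linearly

barker13 = [1 1 1 1 1 -1 -1 1 1 -1 1 -1 1]';    % length-13 Barker code
samplesPerChip = 8;                              % assumed
phaseCoded = repelem(barker13,samplesPerChip);   % binary phase-coded pulse

% Autocorrelation of the code: peak of 13, sidelobes of magnitude at most 1.
r = conv(barker13,flipud(barker13));
```

The low autocorrelation sidelobes of the Barker code are the reason it is a common choice for phase-coded pulses.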

For an example showing how to process this data for deep learning, see Radar Waveform Classification Using Deep Learning.

To restore the path, use the path function.

path(oldpath);

Sequence-to-label classification

Video Data Sets

Data | Description | Task

HMDB: a large human motion database

(Representative example)

The HMDB51 data set contains about 2 GB of video data for 7000 clips from 51 classes, such as drink, run, and pushup.

Download and extract the HMDB51 data set from HMDB: a large human motion database. The data set is about 2 GB. Depending on your internet connection, the download process can take some time.

After you extract the RAR files, get the file names and the labels of the videos by using the helper function hmdb51Files, which is used in the example Classify Videos Using Deep Learning. Set dataFolder to the location of the data.

oldpath = addpath(fullfile(matlabroot,'examples','nnet','main'));
dataFolder = fullfile(tempdir,"hmdb51_org");
[files,labels] = hmdb51Files(dataFolder);

For an example showing how to process this data for deep learning, see Classify Videos Using Deep Learning.

To restore the path, use the path function.

path(oldpath);

Video classification

Text Data Sets

Data | Description | Task

Factory Reports

The Factory Reports data set is a table containing approximately 500 reports with various attributes including a plain text description in the variable Description and a categorical label in the variable Category.

Read the Factory Reports data from the file "factoryReports.csv". Extract the text data and the labels from the Description and Category columns, respectively.

filename = "factoryReports.csv";
data = readtable(filename,'TextType','string');

textData = data.Description;
labels = data.Category;

For an example showing how to process this data for deep learning, see Classify Text Data Using Deep Learning.

Text classification, topic modeling

Shakespeare's Sonnets

The file sonnets.txt contains all of Shakespeare's sonnets in a single text file.

Read the Shakespeare's Sonnets data from the file "sonnets.txt".

filename = "sonnets.txt";
textData = fileread(filename);

The sonnets are indented by two whitespace characters and are separated by two newline characters. Remove the indentations using replace and split the text into separate sonnets using split. Then discard the main title, which occupies the first three elements, and the sonnet titles, which appear before each sonnet.

textData = replace(textData,"  ","");
textData = split(textData,[newline newline]);
textData = textData(5:2:end);
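
To see why the indexing starts at element 5 and steps by 2, the following sketch applies the same three operations to a small made-up text. The sample text is hypothetical and only mimics the layout described above: three title blocks, followed by alternating sonnet titles and sonnet bodies.

```matlab
% Hypothetical miniature text with the same layout as sonnets.txt.
nl = string(newline);
blocks = ["THE SONNETS"; "by"; "An Author"; ...
    "  I"; "  First line" + nl + "  second line"; ...
    "  II"; "  Third line" + nl + "  fourth line"];
sample = join(blocks,nl + nl);             % blocks separated by two newlines

sample = replace(sample,"  ","");          % remove the two-space indentation
parts = split(sample,[newline newline]);   % one element per block
sonnets = parts(5:2:end);                  % skip titles, keep sonnet bodies
```

Elements 1 through 3 of parts hold the main title, element 4 holds the first sonnet title, and every second element from 5 onward holds a sonnet body.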

For an example showing how to process this data for deep learning, see Generate Text Using Deep Learning.

Topic modeling, text generation

ArXiv Metadata

The ArXiv API allows you to access the metadata of scientific e-prints submitted to https://arxiv.org including the abstract and subject areas. For more information, see https://arxiv.org/help/api.

Import a set of abstracts and category labels from math papers using the arXiv API.

url = "https://export.arxiv.org/oai2?verb=ListRecords" + ...
    "&set=math" + ...
    "&metadataPrefix=arXiv";
options = weboptions('Timeout',160);
code = webread(url,options);

For an example showing how to parse the returned XML code and import more records, see Multilabel Text Classification Using Deep Learning.

Text classification, topic modeling

Books from Project Gutenberg

You can download many books from Project Gutenberg. For example, download the text from Alice's Adventures in Wonderland by Lewis Carroll from https://www.gutenberg.org/files/11/11-h/11-h.htm using the webread function.

url = "https://www.gutenberg.org/files/11/11-h/11-h.htm";
code = webread(url);

The HTML code contains the relevant text inside <p> (paragraph) elements. Extract the relevant text by parsing the HTML code using the htmlTree function and then finding all the elements with the element name "p".

tree = htmlTree(code);
selector = "p";
subtrees = findElement(tree,selector);

Extract the text data from the HTML subtrees using the extractHTMLText function and remove the empty elements.

textData = extractHTMLText(subtrees);
textData(textData == "") = [];
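
The same parsing steps can be tried offline on a short HTML string, which makes the effect of each call easier to inspect. The HTML snippet below is made up for illustration, and the sketch assumes that Text Analytics Toolbox is installed, since it provides htmlTree, findElement, and extractHTMLText.

```matlab
% Hypothetical HTML snippet (requires Text Analytics Toolbox).
sampleCode = "<html><body><h1>Title</h1>" + ...
    "<p>First paragraph.</p><p></p><p>Second paragraph.</p>" + ...
    "</body></html>";

sampleTree = htmlTree(sampleCode);               % parse the HTML code
sampleSubtrees = findElement(sampleTree,"p");    % select the <p> elements
sampleText = extractHTMLText(sampleSubtrees);    % one string per element
sampleText(sampleText == "") = [];               % drop the empty paragraph
```

The heading text is excluded because only the elements named "p" are selected.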

For an example showing how to process this data for deep learning, see Word-By-Word Text Generation Using Deep Learning.

Topic modeling, text generation

Weekend updates

The file weekendUpdates.xlsx contains example social media status updates containing the hashtags "#weekend" and "#vacation". This data set requires Text Analytics Toolbox™.

Read the data from the file weekendUpdates.xlsx using the readtable function and extract the text data from the variable TextData.

filename = "weekendUpdates.xlsx";
tbl = readtable(filename,'TextType','string');
textData = tbl.TextData;

For an example showing how to process this data, see Analyze Sentiment in Text (Text Analytics Toolbox).

Sentiment analysis

Roman Numerals

The CSV file "romanNumerals.csv" contains the decimal numbers 1–1000 in the first column and the corresponding Roman numerals in the second column.

Load the decimal-Roman numeral pairs from the CSV file "romanNumerals.csv".

filename = fullfile("romanNumerals.csv");

options = detectImportOptions(filename, ...
    'TextType','string', ...
    'ReadVariableNames',false);
options.VariableNames = ["Source" "Target"];
options.VariableTypes = ["string" "string"];

data = readtable(filename,options);
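
To check what these import options produce without the shipped file, the following sketch writes a few decimal-Roman numeral pairs to a temporary CSV file and reads them back with the same settings. The temporary file and its four rows are stand-ins created only for this illustration.

```matlab
% Write a small stand-in CSV file to a temporary location.
demoFile = [tempname '.csv'];
fid = fopen(demoFile,'w');
fprintf(fid,'1,I\n2,II\n3,III\n4,IV\n');
fclose(fid);

% Import both columns as strings, with no header row.
demoOptions = detectImportOptions(demoFile, ...
    'TextType','string', ...
    'ReadVariableNames',false);
demoOptions.VariableNames = ["Source" "Target"];
demoOptions.VariableTypes = ["string" "string"];

demoData = readtable(demoFile,demoOptions);
delete(demoFile);                    % remove the temporary file
```

Importing the decimal numbers as strings keeps both columns as text sequences, which suits character-level sequence-to-sequence models.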

For an example showing how to process this data for deep learning, see Sequence-to-Sequence Translation Using Attention.

Sequence-to-sequence translation

Finance Reports

The Securities and Exchange Commission (SEC) allows you to access financial reports via the Electronic Data Gathering, Analysis, and Retrieval (EDGAR) API. For more information, see https://www.sec.gov/edgar/searchedgar/accessing-edgar-data.htm.

To download this data, use the function financeReports attached to the example Generate Domain Specific Sentiment Lexicon (Text Analytics Toolbox) as a supporting file. To access this function, open the example as a Live Script.

year = 2019;
qtr = 4;
maxLength = 2e6;
textData = financeReports(year,qtr,maxLength);

For an example showing how to process this data, see Generate Domain Specific Sentiment Lexicon (Text Analytics Toolbox).

Sentiment analysis

Audio Data Sets

Data | Description | Task

Speech Commands

The Speech Commands data set [18] consists of approximately 65,000 audio files labeled with 1 of 12 classes including yes, no, on, and off, as well as classes corresponding to unknown commands and background noise.

Download and extract the Speech Commands data set from https://storage.googleapis.com/download.tensorflow.org/data/speech_commands_v0.01.tar.gz. The data set is about 1.4 GB. Depending on your internet connection, the download process can take some time.

Set dataFolder to the location of the data. Use audioDatastore to create a datastore that contains the file names and the corresponding labels.

dataFolder = tempdir;
ads = audioDatastore(dataFolder, ...
    'IncludeSubfolders',true, ...
    'FileExtensions','.wav', ...
    'LabelSource','foldernames');

For an example showing how to process this data for deep learning, see Speech Command Recognition Using Deep Learning.

Audio classification, speech recognition

Mozilla Common Voice

The Mozilla Common Voice data set consists of audio recordings of speech and corresponding text files. The data also includes demographic metadata such as age, gender, and accent.

Download and extract the Mozilla Common Voice data set from https://voice.mozilla.org/. The data set is an open data set, which means that it can grow over time. As of October 2019, the data set is about 28 GB. Depending on your internet connection, the download process can take some time. Set dataFolder to the location of the data. Use audioDatastore to create a datastore that contains the file names and the corresponding labels.

dataFolder = tempdir;
ads = audioDatastore(fullfile(dataFolder,"clips"));

For an example showing how to process this data for deep learning, see Classify Gender Using LSTM Networks.

Audio classification, speech recognition

Free Spoken Digit Dataset

The Free Spoken Digit Dataset, as of January 29, 2019, consists of 2000 recordings of the English digits 0 through 9 obtained from four speakers. Two of the speakers in this version are native speakers of American English, and two are nonnative speakers of English with Belgian French and German accents, respectively. The data is sampled at 8000 Hz.

Download the Free Spoken Digit Dataset (FSDD) recordings from https://github.com/Jakobovski/free-spoken-digit-dataset.

Set dataFolder to the location of the data. Use audioDatastore to create a datastore that contains the file names and the corresponding labels.

dataFolder = fullfile(tempdir,'free-spoken-digit-dataset','recordings');
ads = audioDatastore(dataFolder);
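
The datastore above holds only file names, so the digit labels must be derived separately. The following sketch shows one way to do that, assuming the repository's naming convention of digit, speaker name, and recording index separated by underscores. That convention is not stated above, and the file paths listed here are illustrative stand-ins for the contents of ads.Files.

```matlab
% Illustrative file paths (stand-ins for the entries of ads.Files).
files = ["recordings/0_jackson_0.wav"; ...
    "recordings/7_theo_12.wav"; ...
    "recordings/9_nicolas_3.wav"];

names = regexprep(files,'^.*[\\/]','');   % strip the folder part of each path
digitText = extractBefore(names,"_");     % text before the first underscore
labels = categorical(digitText);          % one categorical label per file
```

You can assign the resulting array to the Labels property of the datastore so that the labels travel with the files.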

For an example showing how to process this data for deep learning, see Spoken Digit Recognition with Wavelet Scattering and Deep Learning.

Audio classification, speech recognition

Berlin Database of Emotional Speech

The Berlin Database of Emotional Speech [19] contains 535 utterances spoken by 10 actors intended to convey one of the following emotions: anger, boredom, disgust, anxiety/fear, happiness, sadness, or neutral. The emotions are text independent.

The file names are codes indicating the speaker ID, text spoken, emotion, and version. The website contains a key for interpreting the code and additional information about the speakers such as gender and age.

Download the Berlin Database of Emotional Speech from http://emodb.bilderbar.info/index-1280.html. The data set is about 40 MB. Depending on your internet connection, the download process can take some time.

Set dataFolder to the location of the data. Use audioDatastore to create a datastore that contains the file names and the corresponding labels.

dataFolder = tempdir;
ads = audioDatastore(fullfile(dataFolder,"wav"));
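
As a sketch of how the file name codes can be decoded, the following code splits one name into its parts. It assumes the layout commonly documented for this database: characters 1 and 2 for the speaker ID, characters 3 to 5 for the text code, character 6 for the emotion (a German initial), and character 7 for the version. Check this layout and the letter-to-emotion mapping against the key on the website before relying on them. The file name is illustrative.

```matlab
% Decode one illustrative file name (layout and mapping are assumptions).
fileName = '03a01Fa.wav';

speakerID = string(fileName(1:2));     % speaker ID
textCode = string(fileName(3:5));      % code of the spoken text
emotionCode = fileName(6);             % single-letter emotion code
versionCode = string(fileName(7));     % version of the recording

% Assumed mapping from German initials to the emotions listed above.
codes = 'WLEAFTN';
emotions = ["anger" "boredom" "disgust" "anxiety/fear" ...
    "happiness" "sadness" "neutral"];
emotion = emotions(codes == emotionCode);
```

Applying the same decoding to every entry in the Files property of the datastore yields labels that you can convert to a categorical array.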

For an example showing how to process this data for deep learning, see Speech Emotion Recognition.

Audio classification, speech recognition

TUT Acoustic scenes 2017

Download and extract the TUT Acoustic scenes 2017 data set [20] from TUT Acoustic scenes 2017, Development dataset and TUT Acoustic scenes 2017, Evaluation dataset.

The data set consists of 10-second audio segments from 15 acoustic scenes including bus, car, and library.

For an example showing how to process this data for deep learning, see Acoustic Scene Recognition Using Late Fusion.

Acoustic scene classification

References

[1] Lake, Brenden M., Ruslan Salakhutdinov, and Joshua B. Tenenbaum. “Human-Level Concept Learning through Probabilistic Program Induction.” Science 350, no. 6266 (December 11, 2015): 1332–38. https://doi.org/10.1126/science.aab3050.

[3] Kat, Tulips, image, https://www.flickr.com/photos/swimparallel/3455026124. Creative Commons License (CC BY).

[4] Rob Bertholf, Sunflowers, image, https://www.flickr.com/photos/robbertholf/20777358950. Creative Commons 2.0 Generic License.

[5] Parvin, Roses, image, https://www.flickr.com/photos/55948751@N00. Creative Commons 2.0 Generic License.

[6] John Haslam, Dandelions, image, https://www.flickr.com/photos/foxypar4/645330051. Creative Commons 2.0 Generic License.

[7] Krizhevsky, Alex. "Learning Multiple Layers of Features from Tiny Images." MSc thesis, University of Toronto, 2009. https://www.cs.toronto.edu/%7Ekriz/learning-features-2009-TR.pdf.

[8] Brostow, Gabriel J., Julien Fauqueur, and Roberto Cipolla. “Semantic Object Classes in Video: A High-Definition Ground Truth Database.” Pattern Recognition Letters 30, no. 2 (January 2009): 88–97. https://doi.org/10.1016/j.patrec.2008.04.005

[9] Kemker, Ronald, Carl Salvaggio, and Christopher Kanan. “High-Resolution Multispectral Dataset for Semantic Segmentation.” ArXiv:1703.01918 [Cs], March 6, 2017. https://arxiv.org/abs/1703.01918

[10] Isensee, Fabian, Philipp Kickingereder, Wolfgang Wick, Martin Bendszus, and Klaus H. Maier-Hein. “Brain Tumor Segmentation and Radiomics Survival Prediction: Contribution to the BRATS 2017 Challenge.” In Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries, edited by Alessandro Crimi, Spyridon Bakas, Hugo Kuijf, Bjoern Menze, and Mauricio Reyes, 10670:287–97. Cham, Switzerland: Springer International Publishing, 2018. https://doi.org/10.1007/978-3-319-75238-9_25

[11] Ehteshami Bejnordi, Babak, Mitko Veta, Paul Johannes van Diest, Bram van Ginneken, Nico Karssemeijer, Geert Litjens, Jeroen A. W. M. van der Laak, et al. “Diagnostic Assessment of Deep Learning Algorithms for Detection of Lymph Node Metastases in Women With Breast Cancer.” JAMA 318, no. 22 (December 12, 2017): 2199. https://doi.org/10.1001/jama.2017.14585

[12] Kudo, Mineichi, Jun Toyama, and Masaru Shimbo. “Multidimensional Curve Classification Using Passing-through Regions.” Pattern Recognition Letters 20, no. 11–13 (November 1999): 1103–11. https://doi.org/10.1016/S0167-8655(99)00077-X

[13] Kudo, Mineichi, Jun Toyama, and Masaru Shimbo. Japanese Vowels Data Set. Distributed by UCI Machine Learning Repository. https://archive.ics.uci.edu/ml/datasets/Japanese+Vowels

[14] Saxena, Abhinav, and Kai Goebel. "Turbofan Engine Degradation Simulation Data Set." NASA Ames Prognostics Data Repository, NASA Ames Research Center, Moffett Field, CA. https://ti.arc.nasa.gov/tech/dash/groups/pcoe/prognostic-data-repository/

[15] Rieth, Cory A., Ben D. Amsel, Randy Tran, and Maia B. Cook. "Additional Tennessee Eastman Process Simulation Data for Anomaly Detection Evaluation." Harvard Dataverse, Version 1, 2017. https://doi.org/10.7910/DVN/6C3JR1.

[16] Goldberger, Ary L., Luis A. N. Amaral, Leon Glass, Jeffery M. Hausdorff, Plamen Ch. Ivanov, Roger G. Mark, Joseph E. Mietus, George B. Moody, Chung-Kang Peng, and H. Eugene Stanley. "PhysioBank, PhysioToolkit, and PhysioNet: Components of a New Research Resource for Complex Physiologic Signals." Circulation 101, No. 23, 2000, pp. e215–e220. https://circ.ahajournals.org/content/101/23/e215.full

[17] Laguna, Pablo, Roger G. Mark, Ary L. Goldberger, and George B. Moody. "A Database for Evaluation of Algorithms for Measurement of QT and Other Waveform Intervals in the ECG." Computers in Cardiology 24, 1997, pp. 673–676.

[18] Warden, P. "Speech Commands: A Public Dataset for Single-Word Speech Recognition." 2017. Available from http://download.tensorflow.org/data/speech_commands_v0.01.tar.gz. Copyright Google 2017. The Speech Commands Dataset is licensed under the Creative Commons Attribution 4.0 license, available here: https://creativecommons.org/licenses/by/4.0/legalcode.

[19] Burkhardt, Felix, Astrid Paeschke, Melissa A. Rolfes, Walter F. Sendlmeier, and Benjamin Weiss. "A Database of German Emotional Speech." Proceedings of Interspeech 2005. Lisbon, Portugal: International Speech Communication Association, 2005.

[20] Mesaros, Annamaria, Toni Heittola, and Tuomas Virtanen. "Acoustic scene classification: an overview of DCASE 2017 challenge entries." In 2018 16th International Workshop on Acoustic Signal Enhancement (IWAENC), pp. 411-415. IEEE, 2018.
