embed

Embed discrete data

Syntax

dlY = embed(dlX,weights)

dlY = embed(dlX,weights,'DataFormat',FMT)

Description

The embed operation converts numeric indices to numeric vectors, where the indices correspond to discrete data. Use embeddings to map discrete data such as categorical values or words to numeric vectors.

Note

This function applies the embed operation to dlarray data. If you want to apply the embed operation within a layerGraph object or Layer array, use a wordEmbeddingLayer (Text Analytics Toolbox) object.

dlY = embed(dlX,weights) returns the embedding vectors in weights corresponding to the numeric indices in the formatted dlarray object dlX.

dlY = embed(dlX,weights,'DataFormat',FMT) also specifies the dimension format FMT when dlX is not a formatted dlarray object. The output dlY is an unformatted dlarray object with the same dimension order as dlX.

Examples

Embed Categorical Data

Embed a mini-batch of categorical features.

Create an array of categorical features containing 5 observations with values "Male" or "Female".

X = categorical(["Male" "Female" "Male" "Female" "Female"])';

Initialize the embedding weights. Specify an embedding dimension of 10 and a vocabulary size equal to the number of categories of the input data. The weights contain one extra column for the out-of-vocabulary embedding vector.

embeddingDimension = 10;
vocabularySize = numel(categories(X));
weights = rand(embeddingDimension,vocabularySize+1);

To embed the categorical data, first convert it to a mini-batch of numeric indices.

X = double(X)

For formatted dlarray input, the embed function expands into a singleton 'C' (channel) dimension. Create a formatted dlarray object containing the data. To specify that the rows correspond to observations and the single column corresponds to the channel, specify the format 'BC' (batch, channel).

dlX = dlarray(X,'BC')

dlX = 
  1(C) x 5(B) dlarray

     2     1     2     1     1

Embed the numeric indices using the embed function. The embed function expands into the 'C' dimension.

dlY = embed(dlX,weights)

dlY = 
  10(C) x 5(B) dlarray

    0.1576    0.8147    0.1576    0.8147    0.8147
    0.9706    0.9058    0.9706    0.9058    0.9058
    0.9572    0.1270    0.9572    0.1270    0.1270
    0.4854    0.9134    0.4854    0.9134    0.9134
    0.8003    0.6324    0.8003    0.6324    0.6324
    0.1419    0.0975    0.1419    0.0975    0.0975
    0.4218    0.2785    0.4218    0.2785    0.2785
    0.9157    0.5469    0.9157    0.5469    0.5469
    0.7922    0.9575    0.7922    0.9575    0.9575
    0.9595    0.9649    0.9595    0.9649    0.9649

In this case, the output is an embeddingDimension-by-N matrix with format 'CB' (channel, batch), where N is the number of observations. Each column contains the embedding vector of one observation.

Embed Text Data

This example uses Text Analytics Toolbox functions (tokenizedDocument, wordEncoding, and doc2sequence).

Embed a mini-batch of text data.

textData = [
    "Items are occasionally getting stuck in the scanner spools."
    "Loud rattling and banging sounds are coming from assembler pistons."];

Create an array of tokenized documents.

documents = tokenizedDocument(textData);

To encode text data as sequences of numeric indices, create a wordEncoding object.

enc = wordEncoding(documents);

Initialize the embedding weights. Specify an embedding dimension of 100 and a vocabulary size equal to the number of words in the word encoding. The weights contain one extra column for the out-of-vocabulary embedding vector.

embeddingDimension = 100;
vocabularySize = enc.NumWords;
weights = rand(embeddingDimension,vocabularySize+1);

Convert the tokenized documents to sequences of word indices using the doc2sequence function. By default, doc2sequence discards out-of-vocabulary tokens in the input data. To keep these tokens so that they map to the last vector of the embedding weights, set the 'UnknownWord' option to 'nan'. By default, doc2sequence also left-pads the input sequences with zeros so that they have the same length.

sequences = doc2sequence(enc,documents,'UnknownWord','nan')

sequences = 2×1 cell array
    {1x11 double}
    {1x11 double}

The output is a cell array, where each element corresponds to an observation. Each element is a row vector with elements representing the individual tokens in the corresponding observation including the padding values.

Convert the cell array to a numeric array by vertically concatenating the rows.

X = cat(1,sequences{:})

X = 2×11

     0     1     2     3     4     5     6     7     8     9    10
    11    12    13    14    15     2    16    17    18    19    10

Convert the numeric indices to dlarray. Because the rows and columns of X correspond to observations and time steps, respectively, specify the format 'BT'.

dlX = dlarray(X,'BT')

dlX = 
  2(B) x 11(T) dlarray

     0     1     2     3     4     5     6     7     8     9    10
    11    12    13    14    15     2    16    17    18    19    10

Embed the numeric indices using the embed function. The embed function maps the padding tokens (tokens with index 0) and any other out-of-vocabulary tokens to the same out-of-vocabulary embedding vector.

dlY = embed(dlX,weights);

In this case, the output is an embeddingDimension-by-N-by-S array with format 'CBT', where N and S are the number of observations and the number of time steps, respectively. The vector dlY(:,n,t) is the embedding vector of time step t of observation n.
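The 'CBT' layout can be checked without the text preprocessing steps. The following is a minimal sketch that copies the index matrix from the output shown above and assumes a vocabulary size of 19, which is the largest index that appears in that matrix. The variable v is introduced only for illustration.

```matlab
% Index matrix copied from the example output (0 is the left padding value).
X = [ 0  1  2  3  4  5  6  7  8  9 10
     11 12 13 14 15  2 16 17 18 19 10];

embeddingDimension = 100;
vocabularySize = 19;   % assumption: largest index that appears in X
weights = rand(embeddingDimension,vocabularySize+1);

% Rows are observations and columns are time steps.
dlX = dlarray(X,'BT');
dlY = embed(dlX,weights);

dims(dlY)   % dimension labels of the output
size(dlY)   % embeddingDimension-by-N-by-S

% Time step 3 of observation 2 holds index 13, so the embedding vector
% at dlY(:,2,3) should match column 13 of the weights.
v = extractdata(dlY(:,2,3));
isequal(v,weights(:,13))
```

Comparing against a column of weights is a quick way to confirm which dimension is which before passing dlY to later operations.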

Input Arguments

`dlX` — Input data
`dlarray` object | numeric array

Input data, specified as a dlarray object with or without dimension labels, or a numeric array. The elements of dlX must be nonnegative integers or NaN.

The function returns the embedding vectors in weights corresponding to the numeric indices in dlX. If any values in dlX are zero, NaN, or greater than the vocabulary size, then the function returns the out-of-vocabulary vector for that element.

When dlX is not a formatted dlarray object, you must specify the dimension label format using the 'DataFormat' option. Also, if dlX is a numeric array, then weights must be a dlarray object.

The embed operation expands into a singleton channel dimension of the input data specified by the 'C' dimension label. If the data has no specified channel dimension, then the function assumes an unspecified singleton channel dimension.
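The out-of-vocabulary rule can be checked with a small hand-built weights matrix. This is a minimal sketch: the sizes and values are made up for illustration, and it assumes that the out-of-vocabulary vector is the last column of weights, as the text data example on this page describes.

```matlab
% Illustrative weights: K = 2 components, V = 2 vocabulary entries.
% Column 3 is assumed to be the out-of-vocabulary vector.
weights = [ 1  2  3
           10 20 30];

% Indices 0, NaN, and 5 (greater than V) are all out of vocabulary.
dlX = dlarray([1 0 NaN 5 2],'CB');

dlY = embed(dlX,weights);

% Columns 2 to 4 of Y should all equal the out-of-vocabulary vector [3; 30].
Y = extractdata(dlY)
```

Using easily recognizable values in weights makes it simple to see which column each index selects.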

`weights` — Embedding weights
`dlarray` object | numeric array

Embedding weights, specified as a dlarray object with or without dimension labels or a numeric array.

The matrix weights specifies the dimension of the embedding, the vocabulary size, and the embedding vectors.

The embedding dimension is the number of components K of the embedding. That is, the embedding maps numeric indices to vectors of length K. The vocabulary size is the number of discrete elements V of the underlying data that the embedding supports. The embedding maps all out-of-vocabulary indices to the same out-of-vocabulary embedding vector.

If weights is a formatted dlarray object, then it must have format 'CU' or 'UC'. The dimensions corresponding to the labels 'C' and 'U' must have size K and V+1, respectively, where K and V represent the embedding dimension and the vocabulary size, respectively. The extra vector corresponds to the out-of-vocabulary embedding vector.

If weights is not a formatted dlarray object, then weights must be a K-by-(V+1) matrix, where K and V represent the embedding dimension and vocabulary size, respectively.
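The accepted layouts of weights can be compared side by side. The sketch below uses made-up sizes (K = 4, V = 6) and hypothetical variable names, and relies on the fact that a formatted dlarray stores its dimensions in a fixed label order, so weights created with format 'UC' are held as 'CU'.

```matlab
K = 4;               % embedding dimension (illustrative)
V = 6;               % vocabulary size (illustrative)
W = rand(K,V+1);     % K-by-(V+1), last column for out-of-vocabulary indices

weightsPlain = W;                  % unformatted K-by-(V+1) matrix
weightsCU = dlarray(W,'CU');       % formatted, channel dimension first
weightsUC = dlarray(W.','UC');     % (V+1)-by-K data labeled 'UC'

dims(weightsUC)                    % shows how dlarray stores the labels

dlX = dlarray([3 1 6],'CB');

% All three forms of the weights should select the same columns of W.
Y1 = extractdata(embed(dlX,weightsPlain));
Y2 = extractdata(embed(dlX,weightsCU));
Y3 = extractdata(embed(dlX,weightsUC));
isequal(Y1,Y2,Y3,W(:,[3 1 6]))
```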

`FMT` — Dimension order of unformatted data
char array | string

Dimension order of unformatted input data, specified as the comma-separated pair consisting of 'DataFormat' and a character array or string FMT that provides a label for each dimension of the data. Each character in FMT must be one of the following:

'S' — Spatial
'C' — Channel
'B' — Batch (for example, samples and observations)
'T' — Time (for example, sequences)
'U' — Unspecified

You can specify multiple dimensions labeled 'S' or 'U'. You can use the labels 'C', 'B', and 'T' at most once.

You must specify 'DataFormat',FMT when the input data dlX is not a formatted dlarray.

Example: 'DataFormat','SSCB'

Data Types: char | string

Output Arguments

`dlY` — Embedding vectors
`dlarray`

Embedding vectors, returned as a dlarray object. The output dlY has the same underlying data type as the input dlX.

The embedding vectors have K elements, where K is the embedding dimension. The sizes of the dimensions of dlY depend on the input data:

- If dlX is a formatted dlarray with a 'C' dimension label, then the embed operation expands into that dimension. That is, the output has the same dimension labels as the input, the 'C' dimension has size K, and the other dimensions have the same size as the corresponding dimensions of the input.
- If dlX is a formatted dlarray without a 'C' dimension label, then the operation assumes a singleton channel dimension. That is, the output has the same dimension labels as the input and also a 'C' dimension, the 'C' dimension has size K, and the other dimensions have the same size as the corresponding dimensions of the input.
- If dlX is not a formatted dlarray object and 'DataFormat' contains a 'C' dimension, then the embed operation expands into that dimension. That is, the output has the same number of dimensions as the input, the dimension corresponding to the 'C' dimension has size K, and the other dimensions have the same size as the corresponding dimensions of the input.
- If dlX is not a formatted dlarray object and 'DataFormat' does not contain a 'C' dimension, then the embed operation inserts a new dimension at the beginning. That is, the output has one more dimension than the input, the first dimension corresponds to the 'C' dimension and has size K, and the other dimensions have the same size as the corresponding dimensions of the input.
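The two unformatted cases can be illustrated with small arrays. This is a sketch under assumed sizes (K = 3, V = 4); the names X1, X2, dlY1, and dlY2 are hypothetical. Because the inputs are plain numeric arrays, the weights are passed as a dlarray object.

```matlab
K = 3;                             % embedding dimension (illustrative)
V = 4;                             % vocabulary size (illustrative)
weights = dlarray(rand(K,V+1));    % unformatted K-by-(V+1) weights

% 'DataFormat' contains 'C': the singleton channel dimension expands to K.
X1 = [1 2 4 3];                    % 1-by-4, read as 'CB'
dlY1 = embed(X1,weights,'DataFormat','CB');
size(dlY1)

% 'DataFormat' has no 'C': a new leading dimension of size K is inserted.
X2 = [1 2 3; 4 1 2];               % 2-by-3, read as 'BT'
dlY2 = embed(X2,weights,'DataFormat','BT');
size(dlY2)

% Both outputs are unformatted, so they carry no dimension labels.
isempty(dims(dlY1))
```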

Extended Capabilities

GPU Arrays
Accelerate code by running on a graphics processing unit (GPU) using Parallel Computing Toolbox™.

Usage notes and limitations:

When at least one of the following input arguments is a gpuArray or a dlarray with underlying data of type gpuArray, this function runs on the GPU.
- dlX
- weights

For more information, see Run MATLAB Functions on a GPU (Parallel Computing Toolbox).
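As a hedged sketch of the GPU usage described above, the following moves the input to the GPU only when a supported device is available and otherwise runs the same code on the CPU. The sizes are illustrative, and the canUseGPU guard is an addition for portability, not something this page prescribes.

```matlab
weights = rand(4,3);                 % K = 4, V = 2 (illustrative sizes)
dlX = dlarray([1 2 2],'CB');

% Move the indices to the GPU only if a supported GPU is available.
if canUseGPU
    dlX = gpuArray(dlX);
end

dlY = embed(dlX,weights);

% Bring the result back to host memory for inspection.
Y = gather(extractdata(dlY));
size(Y)
```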
