Embed discrete data
The embed operation converts numeric indices to numeric vectors, where the indices correspond to discrete data. Use embeddings to map discrete data such as categorical values or words to numeric vectors.
Note
This function applies the embed operation to dlarray data. If you want to apply the embed operation within a layerGraph object or Layer array, use a wordEmbeddingLayer (Text Analytics Toolbox) object.
Embed a mini-batch of categorical features.
Create an array of categorical features containing 5 observations with values "Male" or "Female".
X = categorical(["Male" "Female" "Male" "Female" "Female"])';
Initialize the embedding weights. Specify an embedding dimension of 10 and a vocabulary size equal to the number of categories in the input data. The weights matrix has one additional column for the out-of-vocabulary embedding vector.
embeddingDimension = 10;
vocabularySize = numel(categories(X));
weights = rand(embeddingDimension,vocabularySize+1);
To embed the categorical data, first convert it to a mini-batch of numeric indices.
X = double(X)
X = 5×1
2
1
2
1
1
For formatted dlarray input, the embed function expands into a singleton 'C' (channel) dimension. Create a formatted dlarray object containing the data. To specify that the rows correspond to observations, specify the format 'BC' (batch, channel).
dlX = dlarray(X,'BC')
dlX = 1(C) x 5(B) dlarray
2 1 2 1 1
Embed the numeric indices using the embed function. The embed function expands into the 'C' dimension.
dlY = embed(dlX,weights)
dlY = 10(C) x 5(B) dlarray
0.1576 0.8147 0.1576 0.8147 0.8147
0.9706 0.9058 0.9706 0.9058 0.9058
0.9572 0.1270 0.9572 0.1270 0.1270
0.4854 0.9134 0.4854 0.9134 0.9134
0.8003 0.6324 0.8003 0.6324 0.6324
0.1419 0.0975 0.1419 0.0975 0.0975
0.4218 0.2785 0.4218 0.2785 0.2785
0.9157 0.5469 0.9157 0.5469 0.5469
0.7922 0.9575 0.7922 0.9575 0.9575
0.9595 0.9649 0.9595 0.9649 0.9649
In this case, the output is an embeddingDimension-by-N matrix with format 'CB' (channel, batch), where N is the number of observations. Each column contains the embedding vector of the corresponding observation.
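For in-vocabulary indices, the displayed output suggests that the lookup behaves like selecting columns of the weights matrix: index 2 picks up the second column of weights and index 1 the first. The following is a minimal sketch of that equivalence using plain MATLAB indexing on ordinary numeric arrays. It is an illustration of the mapping only, not the internal implementation of embed, and the variable names idx and Y are chosen here for the example.

```matlab
% Illustrative sketch: in-vocabulary lookup as column selection.
embeddingDimension = 10;
vocabularySize = 2;                      % categories "Female" and "Male"
weights = rand(embeddingDimension,vocabularySize+1);

idx = [2 1 2 1 1];                       % numeric indices of the 5 observations
Y = weights(:,idx);                      % one embedding column per observation

size(Y)                                  % embeddingDimension-by-N
isequal(Y(:,1),weights(:,2))             % observation 1 has index 2
```

Plain indexing like this does not handle zero or NaN indices, which is one reason to prefer the embed function for real data.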
Embed a mini-batch of text data.
textData = [
    "Items are occasionally getting stuck in the scanner spools."
    "Loud rattling and banging sounds are coming from assembler pistons."];
Create an array of tokenized documents.
documents = tokenizedDocument(textData);
To encode text data as sequences of numeric indices, create a wordEncoding object.
enc = wordEncoding(documents);
Initialize the embedding weights. Specify an embedding dimension of 100 and a vocabulary size equal to the number of words in the word encoding. The weights matrix has one additional column for the out-of-vocabulary embedding vector.
embeddingDimension = 100;
vocabularySize = enc.NumWords;
weights = rand(embeddingDimension,vocabularySize+1);
Convert the tokenized documents to sequences of word indices using the doc2sequence function. The doc2sequence function, by default, discards out-of-vocabulary tokens in the input data. To map out-of-vocabulary tokens to the last vector of embedding weights, set the 'UnknownWord' option to 'nan'. The doc2sequence function, by default, left-pads the input sequences with zeros so that they have the same length.
sequences = doc2sequence(enc,documents,'UnknownWord','nan')
sequences=2×1 cell array
{1x11 double}
{1x11 double}
The output is a cell array, where each element corresponds to an observation. Each element is a row vector whose elements represent the individual tokens in the corresponding observation, including the padding values.
Convert the cell array to a numeric array by vertically concatenating the rows.
X = cat(1,sequences{:})
X = 2×11
0 1 2 3 4 5 6 7 8 9 10
11 12 13 14 15 2 16 17 18 19 10
Convert the numeric indices to dlarray. Because the rows and columns of X correspond to observations and time steps, respectively, specify the format 'BT'.
dlX = dlarray(X,'BT')
dlX = 2(B) x 11(T) dlarray
0 1 2 3 4 5 6 7 8 9 10
11 12 13 14 15 2 16 17 18 19 10
Embed the numeric indices using the embed function. The embed function maps the padding tokens (tokens with index 0) and any other out-of-vocabulary tokens to the same out-of-vocabulary embedding vector.
dlY = embed(dlX,weights);
In this case, the output is an embeddingDimension-by-N-by-S array with format 'CBT', where N and S are the number of observations and the number of time steps, respectively. The vector dlY(:,n,t) corresponds to the embedding vector of time step t of observation n.
dlX — Input data
dlarray object | numeric array
Input data, specified as a dlarray object with or without dimension labels, or a numeric array. The elements of dlX must be nonnegative integers or NaN.
The function returns the embedding vectors in weights corresponding to the numeric indices in dlX. If any values in dlX are zero, NaN, or greater than the vocabulary size, then the function returns the out-of-vocabulary vector for that element.
When dlX is not a formatted dlarray object, you must specify the dimension label format using the 'DataFormat' option. Also, if dlX is a numeric array, then weights must be a dlarray object.
The embed operation expands into a singleton channel dimension of the input data specified by the 'C' dimension label. If the data has no specified channel dimension, then the function assumes an unspecified singleton channel dimension.
weights — Embedding weights
dlarray object | numeric array
Embedding weights, specified as a dlarray object with or without dimension labels or a numeric array.
The matrix weights specifies the dimension of the embedding, the vocabulary size, and the embedding vectors.
The embedding dimension is the number of components K of the embedding. That is, the embedding maps numeric indices to vectors of length K. The vocabulary size is the number of discrete elements V in the embedding, that is, the number of discrete elements of the underlying data that the embedding supports. The embedding maps all out-of-vocabulary indices to the same out-of-vocabulary embedding vector.
If weights is a formatted dlarray object, then it must have format 'CU' or 'UC'. The dimensions corresponding to the labels 'C' and 'U' must have size K and V+1, respectively, where K and V represent the embedding dimension and the vocabulary size, respectively. The extra vector corresponds to the out-of-vocabulary embedding vector.
If weights is not a formatted dlarray object, then weights must be a K-by-(V+1) matrix, where K and V represent the embedding dimension and vocabulary size, respectively.
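The out-of-vocabulary rule can be sketched with plain MATLAB indexing. The example below assumes, as the text examples on this page indicate, that the extra out-of-vocabulary vector is the last column of a K-by-(V+1) weights matrix. The names isOOV, idx, and Y are hypothetical, and the code only mirrors the documented mapping; it is not the internal implementation of embed.

```matlab
% Illustrative sketch of the out-of-vocabulary rule using plain indexing.
K = 4;                                   % embedding dimension
V = 3;                                   % vocabulary size
weights = reshape(1:K*(V+1),K,V+1);      % K-by-(V+1); last column assumed to be the OOV vector

idx = [1 0 NaN 7 3];                     % 0, NaN, and 7 (greater than V) are out of vocabulary
isOOV = isnan(idx) | idx < 1 | idx > V;  % flag every out-of-vocabulary index
idx(isOOV) = V + 1;                      % redirect flagged indices to the last column

Y = weights(:,idx);                      % K-by-5 matrix of embedding vectors
```

Columns 2 through 4 of Y are identical because all three out-of-vocabulary indices share the same embedding vector.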
FMT — Dimension order of unformatted data
Dimension order of unformatted input data, specified as the comma-separated pair consisting of 'DataFormat' and a character array or string FMT that provides a label for each dimension of the data. Each character in FMT must be one of the following:
'S' — Spatial
'C' — Channel
'B' — Batch (for example, samples and observations)
'T' — Time (for example, sequences)
'U' — Unspecified
You can specify multiple dimensions labeled 'S' or 'U'. You can use the labels 'C', 'B', and 'T' at most once.
You must specify 'DataFormat',FMT when the input data dlX is not a formatted dlarray.
Example: 'DataFormat','SSCB'
Data Types: char | string
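As a hedged illustration of the 'DataFormat' option, the sketch below embeds a plain numeric array of indices. The sizes K and V and the index values are hypothetical choices for this example. Because the indices are a numeric array, weights is wrapped in a dlarray, as required by the description of dlX. Based on the description of dlY on this page, the result would be expected to have one more leading dimension of size K than the input.

```matlab
% Hypothetical sizes chosen for illustration.
K = 6;                                   % embedding dimension
V = 4;                                   % vocabulary size
weights = dlarray(rand(K,V+1));          % dlarray, because the indices are a numeric array

X = [1 2 3; 4 0 2];                      % 2 observations, 3 time steps; 0 is out of vocabulary
dlY = embed(X,weights,'DataFormat','BT'); % rows are observations, columns are time steps

size(dlY)                                % inspect the size of the embedded output
```

Passing a formatted dlarray instead of a numeric array removes the need for the 'DataFormat' option.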
dlY — Embedding vectors
dlarray
Embedding vectors, returned as a dlarray object. The output dlY has the same underlying data type as the input dlX.
The function returns the embedding vectors in weights corresponding to the numeric indices in dlX. If any values in dlX are zero, NaN, or greater than the vocabulary size, then the function returns the out-of-vocabulary vector for that element.
The embedding vectors have K elements, where K is the embedding dimension. The sizes of the dimensions of dlY depend on the input data:
If dlX is a formatted dlarray with a 'C' dimension label, then the embed operation expands into that dimension. That is, the output has the same dimension labels as the input, the 'C' dimension has size K, and the other dimensions have the same size as the corresponding dimensions of the input.
If dlX is a formatted dlarray without a 'C' dimension, then the operation assumes a singleton channel dimension. That is, the output has the same dimension labels as the input plus a 'C' dimension, the 'C' dimension has size K, and the other dimensions have the same size as the corresponding dimensions of the input.
If dlX is not a formatted dlarray object and 'DataFormat' contains a 'C' dimension, then the embed operation expands into that dimension. That is, the output has the same number of dimensions as the input, the dimension corresponding to the 'C' dimension has size K, and the other dimensions have the same size as the corresponding dimensions of the input.
If dlX is not a formatted dlarray object and 'DataFormat' does not contain a 'C' dimension, then the embed operation inserts a new dimension at the beginning. That is, the output has one more dimension than the input, the first dimension corresponds to the 'C' dimension and has size K, and the other dimensions have the same size as the corresponding dimensions of the input.
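The two formatted-input cases above can be checked with a short sketch that mirrors the examples on this page. The sizes K and V and the index values are hypothetical. If the descriptions above hold, the first output would be expected to have format 'CB' and the second format 'CBT', each with a 'C' dimension of size K.

```matlab
% Hypothetical sizes chosen for illustration.
K = 8;                                   % embedding dimension
V = 5;                                   % vocabulary size
weights = rand(K,V+1);

% Formatted input with a singleton 'C' dimension: 5 observations.
dlA = dlarray([2 1 2 1 1]','BC');
dlYA = embed(dlA,weights);

% Formatted input without a 'C' dimension: 2 observations, 3 time steps.
dlB = dlarray([1 2 3; 4 5 1],'BT');
dlYB = embed(dlB,weights);

% Inspect the dimension labels and sizes of both outputs.
dims(dlYA), size(dlYA)
dims(dlYB), size(dlYB)
```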
Usage notes and limitations:
When at least one of the following input arguments is a gpuArray or a dlarray with underlying data of type gpuArray, this function runs on the GPU.
dlX
weights
For more information, see Run MATLAB Functions on a GPU (Parallel Computing Toolbox).
See also: dlarray | dlfeval | dlgradient | lstm