trainWordEmbedding

Train word embedding

Description

emb = trainWordEmbedding(filename) trains a word embedding using the training data stored in the text file filename. The file must be a UTF-8 encoded text file with one document per line and words separated by whitespace.

emb = trainWordEmbedding(documents) trains a word embedding using documents. The function writes the documents to a temporary file with writeTextDocument, and then trains an embedding using that file.

emb = trainWordEmbedding(___,Name,Value) specifies additional options using one or more name-value pair arguments. For example, 'Dimension',50 specifies the word embedding dimension to be 50.

Examples

Train Word Embedding from File

Train a word embedding of dimension 100 using the example text file exampleSonnetsDocuments.txt. This file contains preprocessed versions of Shakespeare's sonnets, with one sonnet per line and words separated by a space.

filename = "exampleSonnetsDocuments.txt";
emb = trainWordEmbedding(filename)
Training: 100% Loss: 0        Remaining time: 0 hours 0 minutes.
emb = 
  wordEmbedding with properties:

     Dimension: 100
    Vocabulary: [1x502 string]

View the word embedding in a text scatter plot using tsne.

words = emb.Vocabulary;
V = word2vec(emb,words);
XY = tsne(V);
textscatter(XY,words)

Train Word Embedding from Documents

Train a word embedding using the example data sonnetsPreprocessed.txt. This file contains preprocessed versions of Shakespeare's sonnets, with one sonnet per line and words separated by a space. Extract the text from sonnetsPreprocessed.txt, split the text into documents at newline characters, and then tokenize the documents.

filename = "sonnetsPreprocessed.txt";
str = extractFileText(filename);
textData = split(str,newline);
documents = tokenizedDocument(textData);

Train a word embedding using trainWordEmbedding.

emb = trainWordEmbedding(documents)
Training: 100% Loss: 0        Remaining time: 0 hours 0 minutes.
emb = 
  wordEmbedding with properties:

     Dimension: 100
    Vocabulary: [1x401 string]

Visualize the word embedding in a text scatter plot using tsne.

words = emb.Vocabulary;
V = word2vec(emb,words);
XY = tsne(V);
textscatter(XY,words)

Specify Word Embedding Options

Load the example data. The file sonnetsPreprocessed.txt contains preprocessed versions of Shakespeare's sonnets, with one sonnet per line and words separated by a space. Extract the text from sonnetsPreprocessed.txt, split the text into documents at newline characters, and then tokenize the documents.

filename = "sonnetsPreprocessed.txt";
str = extractFileText(filename);
textData = split(str,newline);
documents = tokenizedDocument(textData);

Specify the word embedding dimension to be 50. To reduce the number of words discarded by the model, set 'MinCount' to 3. To train for longer, set the number of epochs to 10.

emb = trainWordEmbedding(documents, ...
    'Dimension',50, ...
    'MinCount',3, ...
    'NumEpochs',10)
Training: 100% Loss: 2.68739  Remaining time: 0 hours 0 minutes.
emb = 
  wordEmbedding with properties:

     Dimension: 50
    Vocabulary: [1x750 string]

View the word embedding in a text scatter plot using tsne.

words = emb.Vocabulary;
V = word2vec(emb, words);
XY = tsne(V);
textscatter(XY,words)

Input Arguments

filename - Name of file

Name of the file, specified as a string scalar or character vector.

Data Types: string | char

documents - Input documents

Input documents, specified as a tokenizedDocument array.

Name-Value Pair Arguments

Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside quotes. You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.

Example: 'Dimension',50 specifies the word embedding dimension to be 50.

Dimension of the word embedding, specified as the comma-separated pair consisting of 'Dimension' and a nonnegative integer.

Example: 300

Size of the context window, specified as the comma-separated pair consisting of 'Window' and a nonnegative integer.

Example: 10

Model, specified as the comma-separated pair consisting of 'Model' and 'skipgram' (skip gram) or 'cbow' (continuous bag-of-words).

Example: 'cbow'

Factor to determine the word discard rate, specified as the comma-separated pair consisting of 'DiscardFactor' and a positive scalar. The function discards a word from the input window with probability 1 - sqrt(t/f) - t/f, where f is the unigram probability of the word and t is DiscardFactor. Usually, DiscardFactor is between 1e-5 and 1e-3.

Example: 0.005
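
For example, with the illustrative values below (not defaults confirmed by this page), the function discards a frequent word most of the time:

t = 1e-4;                        % DiscardFactor
f = 0.01;                        % unigram probability of a frequent word
pDiscard = 1 - sqrt(t/f) - t/f   % discard probability, here 0.89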

Loss function, specified as the comma-separated pair consisting of 'LossFunction' and 'ns' (negative sampling), 'hs' (hierarchical softmax), or 'softmax' (softmax).

Example: 'hs'

Number of negative samples for the negative sampling loss function, specified as the comma-separated pair consisting of 'NumNegativeSamples' and a positive integer. This option is valid only when 'LossFunction' is 'ns'.

Example: 10
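
For example, this sketch (reusing the documents array from the examples above) pairs negative sampling with 10 negative samples:

emb = trainWordEmbedding(documents, ...
    'LossFunction','ns', ...
    'NumNegativeSamples',10);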

Number of epochs for training, specified as the comma-separated pair consisting of 'NumEpochs' and a positive integer.

Example: 10

Minimum count of words to include in the embedding, specified as the comma-separated pair consisting of 'MinCount' and a positive integer. The function discards from the vocabulary any word that appears fewer than MinCount times in the training data.

Example: 10

Inclusive range for subword n-grams, specified as the comma-separated pair consisting of 'NGramRange' and a vector of two nonnegative integers [min max]. If you do not want to use n-grams, then set 'NGramRange' to [0 0].

Example: [5 10]
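
For example, to train without subword n-grams (again reusing the documents array from the examples above):

emb = trainWordEmbedding(documents,'NGramRange',[0 0]);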

Initial learn rate, specified as the comma-separated pair consisting of 'InitialLearnRate' and a positive scalar.

Example: 0.01

Rate for updating the learn rate, specified as the comma-separated pair consisting of 'UpdateRate' and a positive integer. The learn rate decreases linearly to zero in steps of N words, where N is the UpdateRate.

Example: 50
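
As a rough sketch of this schedule (assuming, for illustration only, that the rate decays over the total number of training words):

initialLearnRate = 0.05;               % illustrative value
updateRate = 100;                      % update the learn rate every 100 words
totalWords = 10000;                    % illustrative number of training words
wordsSeen = 0:updateRate:totalWords;
learnRate = initialLearnRate*(1 - wordsSeen/totalWords);
plot(wordsSeen,learnRate)              % linear decay to zero, in steps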

Verbosity level, specified as the comma-separated pair consisting of 'Verbose' and one of the following:

  • 0 – Do not display verbose output.

  • 1 – Display progress information.

Example: 'Verbose',0

Output Arguments

Output word embedding, returned as a wordEmbedding object.

More About

Language Considerations

File input to the trainWordEmbedding function requires words separated by whitespace.

For files containing non-English text, you might need to input a tokenizedDocument array to trainWordEmbedding.

To create a tokenizedDocument array from pretokenized text, use the tokenizedDocument function and set the 'TokenizeMethod' option to 'none'.
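
For example, this sketch assumes each cell contains the tokens of one pretokenized document:

tokens = {["la" "plume" "de" "ma" "tante"], ...
    ["le" "chat" "noir"]};
documents = tokenizedDocument(tokens,'TokenizeMethod','none');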

Tips

The training algorithm uses the number of threads given by the function maxNumCompThreads. To learn how to change the number of threads used by MATLAB®, see maxNumCompThreads.
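
For example, to query the number of threads available to the training algorithm:

n = maxNumCompThreads   % maxNumCompThreads(2) would limit MATLAB to 2 threads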

Introduced in R2017b