trainWordEmbedding

Train word embedding

Description

emb = trainWordEmbedding(filename) trains a word embedding using the training data stored in the text file filename. The file must be a UTF-8 encoded text file with one document per line and words separated by whitespace.

emb = trainWordEmbedding(documents) trains a word embedding using documents. The function writes the documents to a temporary file with writeTextDocument, and then trains an embedding using that file.

emb = trainWordEmbedding(___,Name,Value) specifies additional options using one or more name-value pair arguments. For example, 'Dimension',50 specifies the word embedding dimension to be 50.

Examples

Train Word Embedding from File

Train a word embedding of dimension 100 using the example text file exampleSonnetsDocuments.txt. This file contains preprocessed versions of Shakespeare's sonnets, with one sonnet per line and words separated by a space.

filename = "exampleSonnetsDocuments.txt";
emb = trainWordEmbedding(filename)
Training: 100% Loss: 0        Remaining time: 0 hours 0 minutes.
emb = 
  wordEmbedding with properties:

     Dimension: 100
    Vocabulary: [1x502 string]

View the word embedding in a text scatter plot using tsne.

words = emb.Vocabulary;
V = word2vec(emb,words);
XY = tsne(V);
textscatter(XY,words)

Train Word Embedding from Documents

Train a word embedding using the example data sonnetsPreprocessed.txt. This file contains preprocessed versions of Shakespeare's sonnets, with one sonnet per line and words separated by a space. Extract the text from sonnetsPreprocessed.txt, split the text into documents at newline characters, and then tokenize the documents.

filename = "sonnetsPreprocessed.txt";
str = extractFileText(filename);
textData = split(str,newline);
documents = tokenizedDocument(textData);

Train a word embedding using trainWordEmbedding.

emb = trainWordEmbedding(documents)
Training: 100% Loss: 0        Remaining time: 0 hours 0 minutes.
emb = 
  wordEmbedding with properties:

     Dimension: 100
    Vocabulary: [1x401 string]

Visualize the word embedding in a text scatter plot using tsne.

words = emb.Vocabulary;
V = word2vec(emb,words);
XY = tsne(V);
textscatter(XY,words)

Specify Word Embedding Options

Load the example data. The file sonnetsPreprocessed.txt contains preprocessed versions of Shakespeare's sonnets, with one sonnet per line and words separated by a space. Extract the text from sonnetsPreprocessed.txt, split the text into documents at newline characters, and then tokenize the documents.

filename = "sonnetsPreprocessed.txt";
str = extractFileText(filename);
textData = split(str,newline);
documents = tokenizedDocument(textData);

Specify the word embedding dimension to be 50. To reduce the number of words discarded by the model, set 'MinCount' to 3. To train for longer, set the number of epochs to 10.

emb = trainWordEmbedding(documents, ...
    'Dimension',50, ...
    'MinCount',3, ...
    'NumEpochs',10)
Training: 100% Loss: 2.68739  Remaining time: 0 hours 0 minutes.
emb = 
  wordEmbedding with properties:

     Dimension: 50
    Vocabulary: [1x750 string]

View the word embedding in a text scatter plot using tsne.

words = emb.Vocabulary;
V = word2vec(emb, words);
XY = tsne(V);
textscatter(XY,words)

Input Arguments

filename - Name of file

Name of the file, specified as a string scalar or character vector.

Data Types: string | char

documents - Input documents

Input documents, specified as a tokenizedDocument array.

Name-Value Pair Arguments

Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside quotes. You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.

Example: 'Dimension',50 specifies the word embedding dimension to be 50.

Dimension of the word embedding, specified as the comma-separated pair consisting of 'Dimension' and a nonnegative integer.

Example: 300

Size of the context window, specified as the comma-separated pair consisting of 'Window' and a nonnegative integer.

Example: 10

Model, specified as the comma-separated pair consisting of 'Model' and 'skipgram' (skip gram) or 'cbow' (continuous bag-of-words).

Example: 'cbow'

Factor to determine the word discard rate, specified as the comma-separated pair consisting of 'DiscardFactor' and a positive scalar. The function discards a word from the input window with probability 1 - sqrt(t/f) - t/f, where f is the unigram probability of the word and t is DiscardFactor. Usually, DiscardFactor is between 1e-5 and 1e-3.

Example: 0.005
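
For example, with the illustrative values below (not defaults confirmed by this page), the function discards a frequent word most of the time:

t = 1e-4;                        % DiscardFactor
f = 0.01;                        % unigram probability of a frequent word
pDiscard = 1 - sqrt(t/f) - t/f   % discard probability, here 0.89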

Loss function, specified as the comma-separated pair consisting of 'LossFunction' and 'ns' (negative sampling), 'hs' (hierarchical softmax), or 'softmax' (softmax).

Example: 'hs'

Number of negative samples for the negative sampling loss function, specified as the comma-separated pair consisting of 'NumNegativeSamples' and a positive integer. This option is valid only when 'LossFunction' is 'ns'.

Example: 10
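
For example, this sketch (reusing the documents array from the examples above) pairs negative sampling with 10 negative samples:

emb = trainWordEmbedding(documents, ...
    'LossFunction','ns', ...
    'NumNegativeSamples',10);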

Number of epochs for training, specified as the comma-separated pair consisting of 'NumEpochs' and a positive integer.

Example: 10

Minimum count of words to include in the embedding, specified as the comma-separated pair consisting of 'MinCount' and a positive integer. The function discards from the vocabulary any word that appears fewer than MinCount times in the training data.

Example: 10

Inclusive range for subword n-grams, specified as the comma-separated pair consisting of 'NGramRange' and a vector of two nonnegative integers [min max]. If you do not want to use n-grams, then set 'NGramRange' to [0 0].

Example: [5 10]
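
For example, to train without subword n-grams (again reusing the documents array from the examples above):

emb = trainWordEmbedding(documents,'NGramRange',[0 0]);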

Initial learn rate, specified as the comma-separated pair consisting of 'InitialLearnRate' and a positive scalar.

Example: 0.01

Rate for updating the learn rate, specified as the comma-separated pair consisting of 'UpdateRate' and a positive integer. The learn rate decreases linearly to zero in steps of N words, where N is the UpdateRate.

Example: 50
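
As a rough sketch of this schedule (assuming, for illustration only, that the rate decays over the total number of training words):

initialLearnRate = 0.05;               % illustrative value
updateRate = 100;                      % update the learn rate every 100 words
totalWords = 10000;                    % illustrative number of training words
wordsSeen = 0:updateRate:totalWords;
learnRate = initialLearnRate*(1 - wordsSeen/totalWords);
plot(wordsSeen,learnRate)              % linear decay to zero, in steps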

Verbosity level, specified as the comma-separated pair consisting of 'Verbose' and one of the following:

  • 0 – Do not display verbose output.

  • 1 – Display progress information.

Example: 'Verbose',0

Output Arguments

Output word embedding, returned as a wordEmbedding object.

More About

Language Considerations

File input to the trainWordEmbedding function requires words separated by whitespace.

For files containing non-English text, you might need to input a tokenizedDocument array to trainWordEmbedding.

To create a tokenizedDocument array from pretokenized text, use the tokenizedDocument function and set the 'TokenizeMethod' option to 'none'.
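
For example, this sketch assumes each cell contains the tokens of one pretokenized document:

tokens = {["la" "plume" "de" "ma" "tante"], ...
    ["le" "chat" "noir"]};
documents = tokenizedDocument(tokens,'TokenizeMethod','none');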

Tips

The training algorithm uses the number of threads given by the function maxNumCompThreads. To learn how to change the number of threads used by MATLAB®, see maxNumCompThreads.
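
For example, to query the number of threads available to the training algorithm:

n = maxNumCompThreads   % maxNumCompThreads(2) would limit MATLAB to 2 threads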

Introduced in R2017b