predict

Predict top LDA topics of documents

Syntax

topicIdx = predict(ldaMdl,documents)

topicIdx = predict(ldaMdl,bag)

topicIdx = predict(ldaMdl,counts)

[topicIdx,score] = predict(___)

___ = predict(___,Name,Value)

Description

topicIdx = predict(ldaMdl,documents) returns the LDA topic indices with the largest probabilities for documents based on the LDA model ldaMdl.

topicIdx = predict(ldaMdl,bag) returns the LDA topic indices with the largest probabilities for the documents represented by a bag-of-words or bag-of-n-grams model.

topicIdx = predict(ldaMdl,counts) returns the LDA topic indices with the largest probabilities for the documents represented by a matrix of word counts.

[topicIdx,score] = predict(___) also returns score, a matrix of posterior topic probabilities, using any of the previous syntaxes.

___ = predict(___,Name,Value) specifies additional options using one or more name-value pair arguments.

Examples

Predict Top LDA Topics of Documents

To reproduce the results in this example, set rng to 'default'.

rng('default')

Load the example data. The file sonnetsPreprocessed.txt contains preprocessed versions of Shakespeare's sonnets. The file contains one sonnet per line, with words separated by a space. Extract the text from sonnetsPreprocessed.txt, split the text into documents at newline characters, and then tokenize the documents.

filename = "sonnetsPreprocessed.txt";
str = extractFileText(filename);
textData = split(str,newline);
documents = tokenizedDocument(textData);

Create a bag-of-words model using bagOfWords.

bag = bagOfWords(documents)

bag = 
  bagOfWords with properties:

          Counts: [154x3092 double]
      Vocabulary: [1x3092 string]
        NumWords: 3092
    NumDocuments: 154

Fit an LDA model with 20 topics.

numTopics = 20;
mdl = fitlda(bag,numTopics)

Initial topic assignments sampled in 0.073207 seconds.
=====================================================================================
| Iteration  |  Time per  |  Relative  |  Training  |     Topic     |     Topic     |
|            | iteration  | change in  | perplexity | concentration | concentration |
|            | (seconds)  |   log(L)   |            |               |   iterations  |
=====================================================================================
|          0 |       0.26 |            |  1.159e+03 |         5.000 |             0 |
|          1 |       0.09 | 5.4884e-02 |  8.028e+02 |         5.000 |             0 |
|          2 |       0.11 | 4.7400e-03 |  7.778e+02 |         5.000 |             0 |
|          3 |       0.10 | 3.4597e-03 |  7.602e+02 |         5.000 |             0 |
|          4 |       0.10 | 3.4662e-03 |  7.430e+02 |         5.000 |             0 |
|          5 |       0.11 | 2.9259e-03 |  7.288e+02 |         5.000 |             0 |
|          6 |       0.05 | 6.4180e-05 |  7.291e+02 |         5.000 |             0 |
=====================================================================================

mdl = 
  ldaModel with properties:

                     NumTopics: 20
             WordConcentration: 1
            TopicConcentration: 5
      CorpusTopicProbabilities: [1x20 double]
    DocumentTopicProbabilities: [154x20 double]
        TopicWordProbabilities: [3092x20 double]
                    Vocabulary: [1x3092 string]
                    TopicOrder: 'initial-fit-probability'
                       FitInfo: [1x1 struct]

Predict the top topics for an array of new documents.

newDocuments = tokenizedDocument([
    "what's in a name? a rose by any other name would smell as sweet."
    "if music be the food of love, play on."]);
topicIdx = predict(mdl,newDocuments)

topicIdx = 2×1

    19
     8

Visualize the predicted topics using word clouds.

figure
subplot(1,2,1)
wordcloud(mdl,topicIdx(1));
title("Topic " + topicIdx(1))
subplot(1,2,2)
wordcloud(mdl,topicIdx(2));
title("Topic " + topicIdx(2))

Predict Top LDA Topics of Word Count Matrix

Load the example data. sonnetsCounts.mat contains a matrix of word counts and a corresponding vocabulary of preprocessed versions of Shakespeare's sonnets.

load sonnetsCounts.mat
size(counts)

ans = 1×2

         154        3092

Fit an LDA model with 20 topics. To reproduce the results in this example, set rng to 'default'.

rng('default')
numTopics = 20;
mdl = fitlda(counts,numTopics)

Initial topic assignments sampled in 0.053734 seconds.
=====================================================================================
| Iteration  |  Time per  |  Relative  |  Training  |     Topic     |     Topic     |
|            | iteration  | change in  | perplexity | concentration | concentration |
|            | (seconds)  |   log(L)   |            |               |   iterations  |
=====================================================================================
|          0 |       0.16 |            |  1.159e+03 |         5.000 |             0 |
|          1 |       0.04 | 5.4884e-02 |  8.028e+02 |         5.000 |             0 |
|          2 |       0.04 | 4.7400e-03 |  7.778e+02 |         5.000 |             0 |
|          3 |       0.05 | 3.4597e-03 |  7.602e+02 |         5.000 |             0 |
|          4 |       0.09 | 3.4662e-03 |  7.430e+02 |         5.000 |             0 |
|          5 |       0.04 | 2.9259e-03 |  7.288e+02 |         5.000 |             0 |
|          6 |       0.06 | 6.4180e-05 |  7.291e+02 |         5.000 |             0 |
=====================================================================================

mdl = 
  ldaModel with properties:

                     NumTopics: 20
             WordConcentration: 1
            TopicConcentration: 5
      CorpusTopicProbabilities: [1x20 double]
    DocumentTopicProbabilities: [154x20 double]
        TopicWordProbabilities: [3092x20 double]
                    Vocabulary: [1x3092 string]
                    TopicOrder: 'initial-fit-probability'
                       FitInfo: [1x1 struct]

Predict the top topics for the first 5 documents in counts.

topicIdx = predict(mdl,counts(1:5,:))

topicIdx = 5×1

     3
    15
    19
     3
    14

Calculate Topic Prediction Scores

To reproduce the results in this example, set rng to 'default'.

rng('default')

Load the example data. The file sonnetsPreprocessed.txt contains preprocessed versions of Shakespeare's sonnets, with one sonnet per line. Extract the text, split it into documents at newline characters, and then tokenize the documents.

filename = "sonnetsPreprocessed.txt";
str = extractFileText(filename);
textData = split(str,newline);
documents = tokenizedDocument(textData);

Create a bag-of-words model using bagOfWords.

bag = bagOfWords(documents)

bag = 
  bagOfWords with properties:

          Counts: [154x3092 double]
      Vocabulary: [1x3092 string]
        NumWords: 3092
    NumDocuments: 154

Fit an LDA model with 20 topics. To suppress verbose output, set 'Verbose' to 0.

numTopics = 20;
mdl = fitlda(bag,numTopics,'Verbose',0);

Predict the top topics for a new document. Specify the iteration limit to be 200.

newDocument = tokenizedDocument("what's in a name? a rose by any other name would smell as sweet.");
iterationLimit = 200;
[topicIdx,scores] = predict(mdl,newDocument, ...
    'IterationLimit',iterationLimit)

topicIdx = 19

scores = 1×20

    0.0250    0.0250    0.0250    0.0250    0.1250    0.0250    0.0250    0.0250    0.0250    0.0730    0.0250    0.0250    0.0770    0.0250    0.0250    0.0250    0.0250    0.0250    0.2250    0.1250

View the prediction scores in a bar chart.

figure
bar(scores)
title("LDA Topic Prediction Scores")
xlabel("Topic Index")
ylabel("Score")

Input Arguments

`ldaMdl` — Input LDA model
`ldaModel` object

Input LDA model, specified as an ldaModel object.

`documents` — Input documents
`tokenizedDocument` array | string array of words | cell array of character vectors

Input documents, specified as a tokenizedDocument array, a string array of words, or a cell array of character vectors. If documents is a tokenizedDocument, then it must be a column vector. If documents is a string array or a cell array of character vectors, then it must be a row of the words of a single document.

Tip

To ensure that the function does not discard useful information, you must first preprocess the input documents using the same steps used to preprocess the documents used to train the model.
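
As a minimal sketch of the tip above, the following script routes both the training text and a new document through one preprocessing helper, so that the words passed to predict match the model vocabulary. The helper name preprocessText, its particular steps (lowercasing, erasing punctuation, removing stop words), and the four training sentences are assumptions made for illustration; substitute the steps actually used to build your training corpus. The script requires Text Analytics Toolbox.

```matlab
% Hypothetical training text (placeholder data, not from the sonnets file).
trainText = [
    "Shall I compare thee to a summer's day?"
    "Love is not love which alters when it alteration finds."
    "If music be the food of love, play on."
    "A rose by any other name would smell as sweet."];

% Preprocess the training text and fit a small LDA model.
rng('default')
trainDocuments = preprocessText(trainText);
bag = bagOfWords(trainDocuments);
mdl = fitlda(bag,2,'Verbose',0);

% Preprocess the new document with the SAME helper before predicting.
newDocuments = preprocessText("Sweet music and sweet love.");
topicIdx = predict(mdl,newDocuments);
disp(topicIdx)

% preprocessText is a hypothetical helper, not a toolbox function.
function documents = preprocessText(textData)
    documents = tokenizedDocument(textData);   % split text into tokens
    documents = lower(documents);              % convert to lowercase
    documents = erasePunctuation(documents);   % strip punctuation
    documents = removeStopWords(documents);    % drop common stop words
end
```

Keeping the steps in a single function means that any later change to the preprocessing applies to training and prediction alike.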

`bag` — Input model
`bagOfWords` object | `bagOfNgrams` object

Input bag-of-words or bag-of-n-grams model, specified as a bagOfWords object or a bagOfNgrams object. If bag is a bagOfNgrams object, then the function treats each n-gram as a single word.

`counts` — Frequency counts of words
matrix of nonnegative integers

Frequency counts of words, specified as a matrix of nonnegative integers. If you specify 'DocumentsIn' to be 'rows', then the value counts(i,j) corresponds to the number of times the jth word of the vocabulary appears in the ith document. Otherwise, the value counts(i,j) corresponds to the number of times the ith word of the vocabulary appears in the jth document.

Name-Value Pair Arguments

Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside quotes. You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.

Example: 'IterationLimit',200 specifies the iteration limit to be 200.

`'DocumentsIn'` — Orientation of documents
`'rows'` (default) | `'columns'`

Orientation of documents in the word count matrix, specified as the comma-separated pair consisting of 'DocumentsIn' and one of the following:

'rows' – Input is a matrix of word counts with rows corresponding to documents.
'columns' – Input is a transposed matrix of word counts with columns corresponding to documents.

This option only applies if you specify the input documents as a matrix of word counts.

Note

If you orient your word count matrix so that documents correspond to columns and specify 'DocumentsIn','columns', then you might see a significant reduction in optimization execution time.
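
The two orientations can be compared with a short sketch. The word counts below are synthetic placeholders generated with randi, and the sizes (30 documents, 50 words, 3 topics) are arbitrary assumptions chosen only to keep the example small. The script requires Text Analytics Toolbox.

```matlab
% Synthetic word counts: 30 documents (rows) by 50 vocabulary words (columns).
rng('default')
numTopics = 3;
counts = randi([0 4],30,50);

% Fit a small LDA model on the row-oriented counts.
mdl = fitlda(counts,numTopics,'Verbose',0);

% Default orientation: each row of the input is a document.
idxRows = predict(mdl,counts(1:5,:));

% Transposed orientation: each column of the input is a document,
% so counts(i,j) now refers to word i in document j.
countsT = counts(1:5,:)';
idxCols = predict(mdl,countsT,'DocumentsIn','columns');

% Each call returns one topic index per document.
disp(numel(idxRows))
disp(numel(idxCols))
```

The sketch does not compare idxRows with idxCols element by element, because it makes no assumption that two separate calls return identical indices.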

`'IterationLimit'` — Maximum number of iterations
`100` (default) | positive integer

Maximum number of iterations, specified as the comma-separated pair consisting of 'IterationLimit' and a positive integer.

Example: 'IterationLimit',200

`'LogLikelihoodTolerance'` — Relative tolerance on log-likelihood
`0.0001` (default) | positive scalar

Relative tolerance on log-likelihood, specified as the comma-separated pair consisting of 'LogLikelihoodTolerance' and a positive scalar. The optimization terminates when this tolerance is reached.

Example: 'LogLikelihoodTolerance',0.001
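
Both stopping options can be passed in the same call. The following is a sketch on synthetic counts; the data, the model size, and the chosen values (200 iterations, tolerance 1e-5) are illustrative assumptions rather than recommended settings. The script requires Text Analytics Toolbox.

```matlab
% Synthetic word counts: 30 documents by 50 vocabulary words.
rng('default')
counts = randi([0 4],30,50);
mdl = fitlda(counts,3,'Verbose',0);

% Allow more iterations and use a tighter tolerance than the defaults.
[topicIdx,score] = predict(mdl,counts(1:5,:), ...
    'IterationLimit',200, ...
    'LogLikelihoodTolerance',1e-5);

disp(size(score))     % one row per document, one column per topic
disp(sum(score,2))    % each row of score sums to one
```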

Output Arguments

`topicIdx` — Predicted topic indices
vector of numeric indices

Predicted topic indices, returned as a vector of numeric indices with one element per input document.

`score` — Predicted topic probabilities
matrix

Predicted topic probabilities, returned as a D-by-K matrix, where D is the number of input documents and K is the number of topics in the LDA model. score(i,j) is the probability that topic j appears in document i. Each row of score sums to one.
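
The relationship between the two outputs can be illustrated in base MATLAB. The sketch assumes that each element of topicIdx is the column of the largest value in the corresponding row of score, which is consistent with the scores example above (topic 19 has the largest score, 0.2250). The score matrix here is a hypothetical stand-in, not the output of a fitted model.

```matlab
% Hypothetical score matrix: 2 documents (rows) by 4 topics (columns).
% Each row sums to one.
score = [0.10 0.20 0.60 0.10
         0.70 0.05 0.20 0.05];

% Top topic per document: column index of the largest value in each row.
[topProb,topicIdx] = max(score,[],2);
disp(topicIdx)    % 3 for document 1, 1 for document 2

% Two most probable topics per document, in descending order of score.
[~,order] = sort(score,2,'descend');
top2 = order(:,1:2);
disp(top2)        % [3 2] for document 1, [1 3] for document 2
```

Sorting the rows of score in this way is useful when a document plausibly belongs to more than one topic and the single top index hides a close runner-up.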
