resume

Resume fitting LDA model

Syntax

updatedMdl = resume(ldaMdl,bag)

updatedMdl = resume(ldaMdl,counts)

updatedMdl = resume(___,Name,Value)

Description

updatedMdl = resume(ldaMdl,bag) returns an updated LDA model by training for more iterations on the bag-of-words or bag-of-n-grams model bag. The input bag must be the same model used to fit ldaMdl.

updatedMdl = resume(ldaMdl,counts) returns an updated LDA model by training for more iterations on the documents represented by the matrix of word counts counts. The input counts must be the same matrix used to fit ldaMdl.

updatedMdl = resume(___,Name,Value) specifies additional options using one or more name-value pair arguments.

Examples

Resume Fitting of LDA Model

To reproduce the results in this example, set rng to 'default'.

rng('default')

Load the example data. The file sonnetsPreprocessed.txt contains preprocessed versions of Shakespeare's sonnets. The file contains one sonnet per line, with words separated by a space. Extract the text from sonnetsPreprocessed.txt, split the text into documents at newline characters, and then tokenize the documents.

filename = "sonnetsPreprocessed.txt";
str = extractFileText(filename);
textData = split(str,newline);
documents = tokenizedDocument(textData);

Create a bag-of-words model using bagOfWords.

bag = bagOfWords(documents)

bag = 
  bagOfWords with properties:

          Counts: [154x3092 double]
      Vocabulary: [1x3092 string]
        NumWords: 3092
    NumDocuments: 154

Fit an LDA model with four topics. The resume function does not support the default fitlda solver, so set the solver to zeroth-order collapsed variational Bayes ('cvb0').

numTopics = 4;
mdl = fitlda(bag,numTopics,'Solver','cvb0')

=====================================================================================
| Iteration  |  Time per  |  Relative  |  Training  |     Topic     |     Topic     |
|            | iteration  | change in  | perplexity | concentration | concentration |
|            | (seconds)  |   log(L)   |            |               |   iterations  |
=====================================================================================
|          0 |       0.01 |            |  3.292e+03 |         1.000 |             0 |
|          1 |       0.01 | 1.4970e-01 |  1.147e+03 |         1.000 |             0 |
|          2 |       0.00 | 7.1229e-03 |  1.091e+03 |         1.000 |             0 |
|          3 |       0.00 | 8.1261e-03 |  1.031e+03 |         1.000 |             0 |
|          4 |       0.00 | 8.8626e-03 |  9.703e+02 |         1.000 |             0 |
|          5 |       0.00 | 8.5486e-03 |  9.154e+02 |         1.000 |             0 |
|          6 |       0.00 | 7.4632e-03 |  8.703e+02 |         1.000 |             0 |
|          7 |       0.00 | 6.0480e-03 |  8.356e+02 |         1.000 |             0 |
|          8 |       0.00 | 4.5955e-03 |  8.102e+02 |         1.000 |             0 |
|          9 |       0.00 | 3.4068e-03 |  7.920e+02 |         1.000 |             0 |
|         10 |       0.00 | 2.5353e-03 |  7.788e+02 |         1.000 |             0 |
|         11 |       0.01 | 1.9089e-03 |  7.690e+02 |         1.222 |            10 |
|         12 |       0.00 | 1.2486e-03 |  7.626e+02 |         1.176 |             7 |
|         13 |       0.00 | 1.1243e-03 |  7.570e+02 |         1.125 |             7 |
|         14 |       0.00 | 9.1253e-04 |  7.524e+02 |         1.079 |             7 |
|         15 |       0.00 | 7.5878e-04 |  7.486e+02 |         1.039 |             6 |
|         16 |       0.00 | 6.6181e-04 |  7.454e+02 |         1.004 |             6 |
|         17 |       0.00 | 6.0400e-04 |  7.424e+02 |         0.974 |             6 |
|         18 |       0.00 | 5.6244e-04 |  7.396e+02 |         0.948 |             6 |
|         19 |       0.00 | 5.0548e-04 |  7.372e+02 |         0.926 |             5 |
|         20 |       0.00 | 4.2796e-04 |  7.351e+02 |         0.905 |             5 |
=====================================================================================
| Iteration  |  Time per  |  Relative  |  Training  |     Topic     |     Topic     |
|            | iteration  | change in  | perplexity | concentration | concentration |
|            | (seconds)  |   log(L)   |            |               |   iterations  |
=====================================================================================
|         21 |       0.00 | 3.4941e-04 |  7.334e+02 |         0.887 |             5 |
|         22 |       0.00 | 2.9495e-04 |  7.320e+02 |         0.871 |             5 |
|         23 |       0.00 | 2.6300e-04 |  7.307e+02 |         0.857 |             5 |
|         24 |       0.00 | 2.5200e-04 |  7.295e+02 |         0.844 |             4 |
|         25 |       0.00 | 2.4150e-04 |  7.283e+02 |         0.833 |             4 |
|         26 |       0.00 | 2.0549e-04 |  7.273e+02 |         0.823 |             4 |
|         27 |       0.00 | 1.6441e-04 |  7.266e+02 |         0.813 |             4 |
|         28 |       0.00 | 1.3256e-04 |  7.259e+02 |         0.805 |             4 |
|         29 |       0.00 | 1.1094e-04 |  7.254e+02 |         0.798 |             4 |
|         30 |       0.00 | 9.2849e-05 |  7.249e+02 |         0.791 |             4 |
=====================================================================================

mdl = 
  ldaModel with properties:

                     NumTopics: 4
             WordConcentration: 1
            TopicConcentration: 0.7908
      CorpusTopicProbabilities: [0.2654 0.2531 0.2480 0.2336]
    DocumentTopicProbabilities: [154x4 double]
        TopicWordProbabilities: [3092x4 double]
                    Vocabulary: [1x3092 string]
                    TopicOrder: 'initial-fit-probability'
                       FitInfo: [1x1 struct]

View information about the fit.

mdl.FitInfo

ans = struct with fields:
          TerminationCode: 1
        TerminationStatus: "Relative tolerance on log-likelihood satisfied."
            NumIterations: 30
    NegativeLogLikelihood: 6.3042e+04
               Perplexity: 724.9445
                   Solver: "cvb0"
                  History: [1x1 struct]

Resume fitting the LDA model with a lower log-likelihood tolerance.

tolerance = 1e-5;
updatedMdl = resume(mdl,bag, ...
    'LogLikelihoodTolerance',tolerance)

=====================================================================================
| Iteration  |  Time per  |  Relative  |  Training  |     Topic     |     Topic     |
|            | iteration  | change in  | perplexity | concentration | concentration |
|            | (seconds)  |   log(L)   |            |               |   iterations  |
=====================================================================================
|         30 |       0.00 |            |  7.249e+02 |         0.791 |             0 |
|         31 |       0.00 | 8.0569e-05 |  7.246e+02 |         0.785 |             3 |
|         32 |       0.00 | 7.4692e-05 |  7.242e+02 |         0.779 |             3 |
|         33 |       0.00 | 6.9802e-05 |  7.239e+02 |         0.774 |             3 |
|         34 |       0.00 | 6.1154e-05 |  7.236e+02 |         0.770 |             3 |
|         35 |       0.00 | 5.3163e-05 |  7.233e+02 |         0.766 |             3 |
|         36 |       0.00 | 4.7807e-05 |  7.231e+02 |         0.762 |             3 |
|         37 |       0.00 | 4.1820e-05 |  7.229e+02 |         0.759 |             3 |
|         38 |       0.00 | 3.6237e-05 |  7.227e+02 |         0.756 |             3 |
|         39 |       0.00 | 3.1819e-05 |  7.226e+02 |         0.754 |             2 |
|         40 |       0.00 | 2.7772e-05 |  7.224e+02 |         0.751 |             2 |
|         41 |       0.00 | 2.5238e-05 |  7.223e+02 |         0.749 |             2 |
|         42 |       0.00 | 2.2052e-05 |  7.222e+02 |         0.747 |             2 |
|         43 |       0.00 | 1.8471e-05 |  7.221e+02 |         0.745 |             2 |
|         44 |       0.00 | 1.5638e-05 |  7.221e+02 |         0.744 |             2 |
|         45 |       0.00 | 1.3735e-05 |  7.220e+02 |         0.742 |             2 |
|         46 |       0.00 | 1.2298e-05 |  7.219e+02 |         0.741 |             2 |
|         47 |       0.00 | 1.0905e-05 |  7.219e+02 |         0.739 |             2 |
|         48 |       0.00 | 9.5581e-06 |  7.218e+02 |         0.738 |             2 |
=====================================================================================

updatedMdl = 
  ldaModel with properties:

                     NumTopics: 4
             WordConcentration: 1
            TopicConcentration: 0.7383
      CorpusTopicProbabilities: [0.2679 0.2517 0.2495 0.2309]
    DocumentTopicProbabilities: [154x4 double]
        TopicWordProbabilities: [3092x4 double]
                    Vocabulary: [1x3092 string]
                    TopicOrder: 'initial-fit-probability'
                       FitInfo: [1x1 struct]

View information about the fit.

updatedMdl.FitInfo

ans = struct with fields:
          TerminationCode: 1
        TerminationStatus: "Relative tolerance on log-likelihood satisfied."
            NumIterations: 48
    NegativeLogLikelihood: 6.3001e+04
               Perplexity: 721.8357
                   Solver: "cvb0"
                  History: [1x1 struct]
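
As a rough check on what the extra training achieved, the following sketch redoes the arithmetic on the values printed in the two FitInfo displays above. The numbers are copied by hand from that output, and the variable names are illustrative only.

```matlab
% Values copied from the FitInfo displays in this example.
perplexityBefore = 724.9445;   % mdl.FitInfo.Perplexity
perplexityAfter  = 721.8357;   % updatedMdl.FitInfo.Perplexity
iterationsBefore = 30;         % mdl.FitInfo.NumIterations
iterationsAfter  = 48;         % updatedMdl.FitInfo.NumIterations

% Iterations added by resume and the relative perplexity reduction.
extraIterations = iterationsAfter - iterationsBefore;
relativeDrop = (perplexityBefore - perplexityAfter)/perplexityBefore;

fprintf("Extra iterations: %d\n",extraIterations)
fprintf("Relative perplexity reduction: %.2f%%\n",100*relativeDrop)
```

In this run, the tighter tolerance added 18 iterations for a perplexity reduction of well under one percent.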

Input Arguments

`ldaMdl` — Input LDA model
`ldaModel` object

Input LDA model, specified as an ldaModel object. To resume fitting a model, you must fit ldaMdl with solver 'savb', 'avb', or 'cvb0'.

`bag` — Input model
`bagOfWords` object | `bagOfNgrams` object

Input bag-of-words or bag-of-n-grams model, specified as a bagOfWords object or a bagOfNgrams object. If bag is a bagOfNgrams object, then the function treats each n-gram as a single word.

`counts` — Frequency counts of words
matrix of nonnegative integers

Frequency counts of words, specified as a matrix of nonnegative integers. If you specify 'DocumentsIn' to be 'rows', then the value counts(i,j) corresponds to the number of times the jth word of the vocabulary appears in the ith document. Otherwise, the value counts(i,j) corresponds to the number of times the ith word of the vocabulary appears in the jth document.

Note

The bag or counts argument that you pass to resume must be the same one used to fit ldaMdl.
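
The counts syntax can be sketched as follows. The word count matrix here is a randomly generated toy corpus (an assumption made only so that the snippet is self-contained), and the initial fit is stopped early with 'IterationLimit' so that there is training left to resume.

```matlab
rng('default')                       % for reproducibility

% Hypothetical toy corpus: 40 documents (rows) over a 60-word vocabulary.
counts = randi([0 3],40,60);

numTopics = 3;

% Fit with a solver that resume supports, stopping after a few iterations.
mdl = fitlda(counts,numTopics, ...
    'Solver','cvb0', ...
    'IterationLimit',5, ...
    'Verbose',0);

% Resume with the same matrix that was used to fit mdl.
updatedMdl = resume(mdl,counts,'Verbose',0);
```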

Name-Value Pair Arguments

Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside quotes. You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.

Example: 'LogLikelihoodTolerance',0.001 specifies a log-likelihood tolerance of 0.001.

Solver Options

`'DocumentsIn'` — Orientation of documents
`'rows'` (default) | `'columns'`

Orientation of documents in the word count matrix, specified as the comma-separated pair consisting of 'DocumentsIn' and one of the following:

'rows' – Input is a matrix of word counts with rows corresponding to documents.
'columns' – Input is a transposed matrix of word counts with columns corresponding to documents.

This option only applies if you specify the input documents as a matrix of word counts.

Note

If you orient your word count matrix so that documents correspond to columns and specify 'DocumentsIn','columns', then you might experience a significant reduction in optimization-execution time.
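
A minimal sketch of the column orientation, assuming a randomly generated toy matrix. The same transposed matrix and the same 'DocumentsIn' value are passed to both fitlda and resume.

```matlab
rng('default')                       % for reproducibility

% Hypothetical toy corpus: 40 documents over a 60-word vocabulary.
countsByRow = randi([0 3],40,60);    % counts(i,j): word j in document i
countsByCol = countsByRow';          % counts(i,j): word i in document j

% Fit on the transposed matrix and declare the orientation.
mdl = fitlda(countsByCol,3, ...
    'Solver','cvb0', ...
    'DocumentsIn','columns', ...
    'IterationLimit',5, ...
    'Verbose',0);

% Resume with the same matrix and the same orientation.
updatedMdl = resume(mdl,countsByCol, ...
    'DocumentsIn','columns', ...
    'Verbose',0);
```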

`'FitTopicConcentration'` — Option for fitting topic concentration parameter
`true` | `false`

Option for fitting topic concentration, specified as the comma-separated pair consisting of 'FitTopicConcentration' and either true or false.

The default value is the value used to fit ldaMdl.

Example: 'FitTopicConcentration',true

Data Types: logical

`'FitTopicProbabilities'` — Option for fitting topic probabilities
`true` | `false`

Option for fitting topic probabilities, specified as the comma-separated pair consisting of 'FitTopicProbabilities' and either true or false.

The default value is the value used to fit ldaMdl.

The function fits the Dirichlet prior $\alpha = \alpha_0 \begin{pmatrix} p_1 & p_2 & \cdots & p_K \end{pmatrix}$ on the topic mixtures, where $\alpha_0$ is the topic concentration and $p_1, \dots, p_K$ are the corpus topic probabilities, which sum to 1.

Example: 'FitTopicProbabilities',true

Data Types: logical
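
To make the prior concrete, the sketch below evaluates it for the updatedMdl output shown in the example on this page. The values are copied by hand from the TopicConcentration and CorpusTopicProbabilities properties in that display; the variable names are illustrative.

```matlab
% Values copied from the updatedMdl display in the example.
alpha0 = 0.7383;                       % TopicConcentration
p = [0.2679 0.2517 0.2495 0.2309];     % CorpusTopicProbabilities

% Dirichlet prior parameters, one per topic.
alpha = alpha0*p;

% Because p sums to 1, the entries of alpha sum to alpha0
% (up to the rounding of the displayed values).
disp(alpha)
disp(sum(alpha))
```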

`'LogLikelihoodTolerance'` — Relative tolerance on log-likelihood
`0.0001` (default) | positive scalar

Relative tolerance on log-likelihood, specified as the comma-separated pair consisting of 'LogLikelihoodTolerance' and a positive scalar. The optimization terminates when this tolerance is reached.

Example: 'LogLikelihoodTolerance',0.001

Batch Solver Options

`'IterationLimit'` — Maximum number of iterations
`100` (default) | positive integer

Maximum number of iterations, specified as the comma-separated pair consisting of 'IterationLimit' and a positive integer.

This option supports models fitted with batch solvers only. Of the batch solvers, resume supports 'avb' and 'cvb0'; it does not support 'cgs'.

Example: 'IterationLimit',200
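
A minimal sketch of passing 'IterationLimit' to resume, assuming a randomly generated toy matrix. The initial fit is capped at a few iterations, and the FitInfo fields shown in the example on this page are then used to see where the resumed fit stopped.

```matlab
rng('default')                       % for reproducibility

% Hypothetical toy corpus: 40 documents over a 60-word vocabulary.
counts = randi([0 3],40,60);

% Initial fit, capped at a small number of iterations.
mdl = fitlda(counts,3, ...
    'Solver','cvb0', ...
    'IterationLimit',5, ...
    'Verbose',0);

% Resume with a larger iteration limit.
updatedMdl = resume(mdl,counts, ...
    'IterationLimit',50, ...
    'Verbose',0);

% Inspect how many iterations ran and why the fit stopped.
disp(updatedMdl.FitInfo.NumIterations)
disp(updatedMdl.FitInfo.TerminationStatus)
```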

Stochastic Solver Options

`'DataPassLimit'` — Maximum number of passes through data
1 (default) | positive integer

Maximum number of passes through the data, specified as the comma-separated pair consisting of 'DataPassLimit' and a positive integer.

If you specify 'DataPassLimit' but not 'MiniBatchLimit', then the default value of 'MiniBatchLimit' is ignored. If you specify both 'DataPassLimit' and 'MiniBatchLimit', then resume uses the argument that results in processing the fewest observations.

This option supports models fitted with stochastic solvers only ('savb').

Example: 'DataPassLimit',2

`'MiniBatchLimit'` — Maximum number of mini-batch passes
positive integer

Maximum number of mini-batch passes, specified as the comma-separated pair consisting of 'MiniBatchLimit' and a positive integer.

If you specify 'MiniBatchLimit' but not 'DataPassLimit', then resume ignores the default value of 'DataPassLimit'. If you specify both 'MiniBatchLimit' and 'DataPassLimit', then resume uses the argument that results in processing the fewest observations. The default value is ceil(numDocuments/MiniBatchSize), where numDocuments is the number of input documents.

This option supports models fitted with stochastic solvers only ('savb').

Example: 'MiniBatchLimit',200
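
The default value and the interaction between the two limits can be worked through with plain arithmetic. This sketch assumes that "observations" means documents processed; the second corpus size and both limits are hypothetical values chosen for illustration.

```matlab
% Default MiniBatchLimit for the 154-document corpus in the example.
numDocuments  = 154;
miniBatchSize = 1000;                                      % documented default
defaultMiniBatchLimit = ceil(numDocuments/miniBatchSize);  % 1

% Hypothetical larger corpus with both limits specified.
numDocuments   = 10000;
dataPassLimit  = 2;
miniBatchLimit = 15;

% Documents processed under each limit.
obsFromDataPasses  = dataPassLimit*numDocuments;       % 20000
obsFromMiniBatches = miniBatchLimit*miniBatchSize;     % 15000

% The limit that processes fewer documents is the one that applies.
if obsFromMiniBatches < obsFromDataPasses
    bindingLimit = "MiniBatchLimit";
else
    bindingLimit = "DataPassLimit";
end
fprintf("Binding limit: %s\n",bindingLimit)
```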

`'MiniBatchSize'` — Mini-batch size
1000 (default) | positive integer

Mini-batch size, specified as the comma-separated pair consisting of 'MiniBatchSize' and a positive integer. The function processes MiniBatchSize documents in each iteration.

This option supports models fitted with stochastic solvers only ('savb').

Example: 'MiniBatchSize',512
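
The stochastic solver options can be combined as in the sketch below, which assumes a randomly generated toy matrix and deliberately small mini-batches. The specific sizes and limits are illustrative, not recommendations.

```matlab
rng('default')                       % for reproducibility

% Hypothetical toy corpus: 200 documents over a 60-word vocabulary.
counts = randi([0 3],200,60);

% Initial fit with the stochastic solver: one pass in mini-batches of 50.
mdl = fitlda(counts,3, ...
    'Solver','savb', ...
    'MiniBatchSize',50, ...
    'DataPassLimit',1, ...
    'Verbose',0);

% Resume with the same data, allowing up to two passes.
updatedMdl = resume(mdl,counts, ...
    'MiniBatchSize',50, ...
    'DataPassLimit',2, ...
    'Verbose',0);
```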

Display Options

`'ValidationData'` — Validation data
`[]` (default) | `bagOfWords` object | `bagOfNgrams` object | sparse matrix of word counts

Validation data to monitor optimization convergence, specified as the comma-separated pair consisting of 'ValidationData' and a bagOfWords object, a bagOfNgrams object, or a sparse matrix of word counts. If the validation data is a matrix, then the data must have the same orientation and the same number of words as the input documents.

`'ValidationFrequency'` — Frequency of model validation
positive integer

Frequency of model validation in number of iterations, specified as the comma-separated pair consisting of 'ValidationFrequency' and a positive integer.

The default value depends on the solver used to fit the model. For the stochastic solver, the default value is 10. For the other solvers, the default value is 1.
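
A minimal sketch of monitoring held-out data while resuming, assuming a randomly generated toy matrix split by rows. The validation matrix is stored as sparse, keeps the training orientation (documents in rows), and has the same number of columns (words) as the training matrix.

```matlab
rng('default')                       % for reproducibility

% Hypothetical toy corpus: 60 documents over an 80-word vocabulary.
counts = randi([0 3],60,80);

% Hold out the last 10 documents for validation.
countsTrain      = counts(1:50,:);
countsValidation = sparse(counts(51:60,:));

% Initial fit on the training documents only.
mdl = fitlda(countsTrain,3, ...
    'Solver','cvb0', ...
    'IterationLimit',5, ...
    'Verbose',0);

% Resume on the training documents, validating every 5 iterations.
updatedMdl = resume(mdl,countsTrain, ...
    'ValidationData',countsValidation, ...
    'ValidationFrequency',5, ...
    'Verbose',0);
```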

`'Verbose'` — Verbosity level
1 (default) | 0

Verbosity level, specified as the comma-separated pair consisting of 'Verbose' and one of the following:

0 – Do not display verbose output.
1 – Display progress information.

Example: 'Verbose',0

Output Arguments

`updatedMdl` — Updated LDA model
`ldaModel` object

Updated LDA model, returned as an ldaModel object.
